Wednesday, March 26, 2014

Hive: Lessons learned

The past few days I have been playing with Hive for some data analysis and I wanted to put down what I learned

a) Exporting data from hive to csv

If you are using hue, then it provides a convenient way to export to csv or excel format. But if not then you can use the following preamble before the select statement

INSERT OVERWRITE LOCAL DIRECTORY '/path/out.csv' ROW FORMATTED DELIMITED FIELDS TERMINATED BY ','

b) Hive does not allow "select" statements in the "where" clause. for example

SELECT file_id, file_type
FROM file_metadatas
WHERE file_type IN (select types from allowed_file_types)

WILL NOT WORK in Hive, but will work in MySQL

Instead one can use a join 

SELECT a.file_id, a.file_type
FROM file_metadatas a
JOIN allowed_file_types b
ON b.types = a.file_type

c) Date functions: Computing a date that is ~6 months behind the current date

DATE_SUB(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP( ))),180)

d) hive regex: Regular expressions can be used to extract parts of a text field. Here is an example
 
Extracting the file extension:  regexp_extract(file_name,'\\.[0-9a-zA-Z]+$',0))  extracts the 0th (1st) match of file_name with the input string

e) string manipulation support: Strings can be manipulated in hive. One such example is the "lower" function that converts all characters to lower

1 comment:

  1. Hi Shivani,
    I am currently trying to perform distributed Topic modeling using R and Hadoop.

    I came across your tutorial "Topic Modeling of Large Datasets with R using Amazon EMR" on the net. But no slides were available. I would appreciate if you can share the tutorial slides with me.

    Thank you for all your help.
    Regards,
    Amit Bothra
    Amit.Bothra@outlook.com

    ReplyDelete