The past few days I have been playing with Hive for some data analysis and I wanted to put down what I learned
a) Exporting data from hive to csv
If you are using hue, then it provides a convenient way to export to csv or excel format. But if not then you can use the following preamble before the select statement
INSERT OVERWRITE LOCAL DIRECTORY '/path/out.csv' ROW FORMATTED DELIMITED FIELDS TERMINATED BY ','
a) Exporting data from hive to csv
If you are using hue, then it provides a convenient way to export to csv or excel format. But if not then you can use the following preamble before the select statement
INSERT OVERWRITE LOCAL DIRECTORY '/path/out.csv' ROW FORMATTED DELIMITED FIELDS TERMINATED BY ','
b) Hive does not allow "select" statements in the "where" clause. for example
SELECT file_id, file_type
FROM file_metadatas
WHERE file_type IN (select types from allowed_file_types)
WILL NOT WORK in Hive, but will work in MySQL
Instead one can use a join
SELECT a.file_id, a.file_type
FROM file_metadatas a
JOIN allowed_file_types b
ON b.types = a.file_type
c) Date functions: Computing a date that is ~6 months behind the current date
DATE_SUB(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP( ))),180)
d) hive regex: Regular expressions can be used to extract parts of a text field. Here is an example
Extracting the file extension: regexp_extract(file_name,'\\.[0-9a-zA-Z]+$',0)) extracts the 0th (1st) match of file_name with the input string
e) string manipulation support: Strings can be manipulated in hive. One such example is the "lower" function that converts all characters to lower
Hi Shivani,
ReplyDeleteI am currently trying to perform distributed Topic modeling using R and Hadoop.
I came across your tutorial "Topic Modeling of Large Datasets with R using Amazon EMR" on the net. But no slides were available. I would appreciate if you can share the tutorial slides with me.
Thank you for all your help.
Regards,
Amit Bothra
Amit.Bothra@outlook.com