In continuation of my last post, here are some tips and tricks for HIVE users:
Tricks:
1. Run HIVE script from bash:
1.1 To run a HIVE script from a bash script, use the command "hive":
hive -e '{HIVE COMMANDS}'
1.2 To run HIVE in verbose mode (echoes each executed query to the console, generally used for logging):
hive -v -e '{HIVE COMMANDS}'
1.3 To pass arguments from bash to hive (ARG is a variable set in the bash script; refer to it as ${ARG} inside the HIVE commands):
hive -d ARG="$ARG" -e '{HIVE COMMANDS}'
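A minimal sketch of 1.3, assuming a hypothetical table "sessions" with a "dt" column (neither comes from the post). The single quotes around the query are what matter here: they stop bash from expanding ${MIN_DATE} itself, so the literal text reaches hive, which then substitutes the value passed with -d.

```shell
# Hypothetical value computed earlier in the bash script.
MIN_DATE="2015-01-01"

# Single quotes: bash leaves ${MIN_DATE} untouched, hive substitutes it.
# Table and column names are assumptions for illustration only.
QUERY='SELECT COUNT(*) FROM sessions WHERE dt >= "${MIN_DATE}"'

# Only call hive where the CLI is actually installed.
if command -v hive >/dev/null 2>&1; then
  hive -d MIN_DATE="$MIN_DATE" -e "$QUERY"
fi
```

If the query were wrapped in double quotes instead, bash would expand ${MIN_DATE} before hive ever saw it, which also works but makes -d redundant.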
1.4 To pass results from hive to bash
1.4.1 If you have one or a few values:
VAR=$(hive -e '{HIVE COMMANDS}')
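A small sketch of 1.4.1, under the assumption that the query returns a single value on its first output line. The helper name first_value and the table name are made up for illustration. It discards stderr, where hive typically writes its progress messages, so only the result ends up in the variable.

```shell
# first_value: run any command, discard its stderr, keep the first stdout line.
first_value() {
  "$@" 2>/dev/null | head -n 1
}

# Intended use (table name is hypothetical; -S asks hive for silent mode):
# USER_COUNT=$(first_value hive -S -e 'SELECT COUNT(*) FROM sessions')

# The same helper with a stand-in command, so the sketch runs without hive:
DEMO=$(first_value printf '42\nignored\n')
echo "$DEMO"
```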
1.4.2 If the result is a big table, write it to an output directory and read the files from bash:
hive -d DIR=$DIR -e '
INSERT OVERWRITE LOCAL DIRECTORY "${DIR}"
{REST OF THE COMMAND}'
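Once the directory is written, bash still has to read it. Hive puts one or more files in the directory and, unless a row format is specified, separates fields with Ctrl-A (\001). The sketch below turns that into tab-separated text; the helper name dir_to_tsv and the sample rows are assumptions for illustration.

```shell
# dir_to_tsv: concatenate the files hive wrote into a directory and
# turn the default Ctrl-A (\001) field separator into tabs.
dir_to_tsv() {
  cat "$1"/* | tr '\001' '\t'
}

# Stand-in for a hive output directory, so the sketch runs without hive.
DEMO_DIR=$(mktemp -d)
printf 'US\00142\nIN\00117\n' > "$DEMO_DIR/000000_0"

TSV=$(dir_to_tsv "$DEMO_DIR")
echo "$TSV"
rm -rf "$DEMO_DIR"
```

With a real query, replace the stand-in directory with the $DIR that was passed to hive.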
2. Implement the MINUS operator in HIVE:
Q1 LEFT OUTER JOIN Q2 ON (Q1.x=Q2.y) WHERE Q2.y IS NULL
where Q1 and Q2 stand for queries.
The left outer join returns every row of Q1, even those with no matching row in Q2; for those rows the Q2 columns are NULL. The IS NULL condition then keeps only the Q1 rows that have no match in Q2, which is the same effect as the MINUS operator.
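Here is the pattern written out as a sketch, assuming two hypothetical tables all_users(user_id) and active_users(user_id). DISTINCT is my addition: SQL MINUS returns distinct rows, while the join alone would keep duplicates from Q1. Also note that Q1 rows with a NULL key never match anything and are therefore kept, which may differ from how MINUS treats NULLs.

```shell
# Users that appear in all_users but not in active_users.
# Table and column names are assumptions for illustration only.
MINUS_QUERY='
SELECT DISTINCT q1.user_id
FROM all_users q1
LEFT OUTER JOIN active_users q2 ON (q1.user_id = q2.user_id)
WHERE q2.user_id IS NULL
'

# Only call hive where the CLI is actually installed.
if command -v hive >/dev/null 2>&1; then
  hive -e "$MINUS_QUERY"
fi
```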
Tip:
Use GROUP BY and CASE as much as possible: if you have multiple queries that run on the same table, try converting them into one query using GROUP BY or CASE statements. For example, say that for an app you want to find out, by location, how many users have sessions of less than an hour, one to 3 hours, and more than 3 hours. Use CASE to create the 3 buckets and GROUP BY to split the counts by location.
Hope you find these useful. If you have tips and tricks like these, you can share them in the comments. If you are having problems with HIVE and want some tricks, do ask.
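As a rough sketch of the GROUP BY and CASE tip above, here is the session-length example as a single query. It assumes a hypothetical table sessions(user_id, location, duration_minutes) with one row per session; none of these names come from the post. Each CASE returns the user_id only when the session falls in that bucket, and COUNT(DISTINCT ...) ignores the NULLs from the other buckets.

```shell
# One pass over the table instead of three separate queries.
# Table and column names are assumptions for illustration only.
BUCKET_QUERY='
SELECT location,
       COUNT(DISTINCT CASE WHEN duration_minutes < 60
                           THEN user_id END) AS users_under_1h,
       COUNT(DISTINCT CASE WHEN duration_minutes >= 60 AND duration_minutes <= 180
                           THEN user_id END) AS users_1h_to_3h,
       COUNT(DISTINCT CASE WHEN duration_minutes > 180
                           THEN user_id END) AS users_over_3h
FROM sessions
GROUP BY location
'

# Only call hive where the CLI is actually installed.
if command -v hive >/dev/null 2>&1; then
  hive -e "$BUCKET_QUERY"
fi
```

The result has one row per location with the three bucket counts side by side.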
Thanks for reading :)