Wednesday, 28 March 2018

HIVE Basics


After talking to a lot of aspiring big data developers, I realized basic concepts of big data tools still elude most of the people. HIVE is usually one of the starting points to enter big data domain, so I am starting with some basic concepts of HIVE.

External vs Managed Table:


Concept
External
Managed
Definition
External keyword is used while creating table
No external keyword
Location
Location is provided at the time of table creation
HIVE’s default location is used
Deletion
Only metadata gets deleted when table is dropped
Both metadata and data gets deleted when table is dropped
Need
External table indicates that data doesn’t inherently belong to HIVE. It is either a virtual table created on other tool’s data or other file system or someone is sharing that data with you i.e. you don’t own the right to delete it.
Data entirely belongs to HIVE. You can delete it.
Example
You are using logs to analyze the usage patterns of your cluster, but the logs are also needed to check for errors or load and a lot of other things. So you have to keep them, you can’t delete them after use.
While analyzing usage patterns of your cluster, you create a different table from the data by joining it with the logs about the details of jobs run on your cluster.





 Partitioning vs clustering:


Concept
Partitioning
Clustering(Bucketing)
Definition
It’s a technique to divide the records into different parts based on the value of specific column/s
It’s a technique to divide the records into different parts based on the value of specific column/s
Internal process
Each unique value of the specified column/s leads to creation of new folder. All the records containing same value for the column/s are put into corresponding folder.
Works similar to hash map. A hash function is applied to the value of specified column and the record corresponding to the value is put into the corresponding bucket.
Structure
Records are put into different folders according to the value of specific column
Records are put into different files according the value of column
Number of Partitions
Number of folders is equal to number of unique values for the specified column. If partitioning is done on more than one column, then number of folders is equal to total combinations of the column/s
Number of files is given at creation time.
Multiple columns

Partitions can be done on more than one column, e.g. PARTITIONED BY (state, city). Partitioning is done on state and then every state folder will contain various city folders.
Clustering can be done with partitioning. Clustered files will be put into partitioned folders. Clustering is not possible on more than one column.
Pros
As the data is partitioned according to the actual value of column, it greatly reduces the amount of data to be filtered when the querying on the partitioned column. 
Number of buckets is decided by user, this prevents having endless folders.
Buckets always contain almost equal amount of data.
Cons
If the data is skewed, you may end up having some partitions with a lot of data and some having close to none.
If a column has a lot of unique values you may end up with a lot of folders, which affects both file system and processing
Each bucket contains data from more than one value of the column, so lesser benefit than partitioning in terms of data to be filtered.


Here’s a question:
There is a car company sales data by day and there is an ice cream company’s sales data by day. The data is for one month. The column to partition/bucket is date.
What would you use, partitioning or clustering? In which case? Also give the reason.
I hope this helps. Let me know if I missed something or if you have some doubts.
Also let me know if there is any other concept you want to know about.