After talking to a lot of aspiring big data developers, I realized that the basic concepts of big data tools still elude most people. HIVE is usually one of the starting points for entering the big data domain, so I am starting with some basic concepts of HIVE.
External vs Managed Table:
Concept | External | Managed
Definition | The EXTERNAL keyword is used while creating the table | No EXTERNAL keyword
Location | A location is provided at the time of table creation | HIVE's default warehouse location is used
Deletion | Only the metadata gets deleted when the table is dropped | Both the metadata and the data get deleted when the table is dropped
Need | An external table indicates that the data doesn't inherently belong to HIVE. It is either a virtual table created on another tool's data or another file system, or someone is sharing that data with you, i.e. you don't have the right to delete it. | The data entirely belongs to HIVE; you can delete it.
Example | You are using logs to analyze the usage patterns of your cluster, but the logs are also needed to check for errors, load and many other things, so you have to keep them; you can't delete them after use. | While analyzing the usage patterns of your cluster, you create a new table from the log data by joining it with the details of the jobs run on your cluster.
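To make this concrete, here is a minimal HiveQL sketch of both kinds of table. The table names, columns and the HDFS path are made up for illustration; the point is the EXTERNAL keyword and the LOCATION clause.

-- Managed table: HIVE owns the data, and DROP TABLE removes both metadata and data.
CREATE TABLE usage_summary (
  user_name STRING,
  jobs_run  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: HIVE only stores the metadata; the files stay where they are.
-- DROP TABLE removes the table definition but leaves /data/cluster/logs untouched.
CREATE EXTERNAL TABLE cluster_logs (
  log_time STRING,
  job_id   STRING,
  message  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/cluster/logs';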
Partitioning vs Clustering:
Concept | Partitioning | Clustering (Bucketing)
Definition | A technique to divide the records into different parts based on the value of specific column(s) | A technique to divide the records into different parts based on the value of specific column(s)
Internal process | Each unique value of the specified column(s) leads to the creation of a new folder. All the records containing the same value for the column(s) are put into the corresponding folder. | Works like a hash map. A hash function is applied to the value of the specified column, and the record is put into the corresponding bucket.
Structure | Records are put into different folders according to the value of the specified column(s) | Records are put into different files according to the value of the column
Number of partitions/buckets | The number of folders is equal to the number of unique values of the specified column. If partitioning is done on more than one column, the number of folders is equal to the total number of combinations of values across those columns. | The number of buckets (files) is fixed at table creation time.
Multiple columns | Partitioning can be done on more than one column, e.g. PARTITIONED BY (state, city). Partitioning is done on state first, and every state folder then contains various city folders. | Clustering can be combined with partitioning: the bucket files are placed inside the partition folders. Clustering can also be done on more than one column, in which case the hash is computed over the combined column values.
Pros | As the data is partitioned according to the actual value of the column, it greatly reduces the amount of data to be scanned when querying on the partitioned column. | The number of buckets is decided by the user, which prevents having an endless number of folders. Buckets always contain a roughly equal amount of data.
Cons | If the data is skewed, you may end up with some partitions holding a lot of data and some holding close to none. If a column has a lot of unique values, you may end up with a lot of folders, which hurts both the file system and processing. | Each bucket contains data for more than one value of the column, so there is less benefit than partitioning in terms of the amount of data filtered out.
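Here is a minimal HiveQL sketch of both techniques. The table names, columns, bucket count and the staging table sales_staging are made up for illustration.

-- Partitioned table: one folder is created per (state, city) combination.
CREATE TABLE sales_partitioned (
  order_id STRING,
  amount   DOUBLE
)
PARTITIONED BY (state STRING, city STRING);

-- Bucketed table: records are hashed on user_id into a fixed number of files.
CREATE TABLE sales_bucketed (
  order_id STRING,
  user_id  STRING,
  amount   DOUBLE
)
CLUSTERED BY (user_id) INTO 16 BUCKETS;

-- On older HIVE versions, bucketing has to be enforced explicitly before inserting.
SET hive.enforce.bucketing = true;
INSERT INTO TABLE sales_bucketed
SELECT order_id, user_id, amount FROM sales_staging;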
Here's a question:
A car company has sales data by day, and an ice cream company has sales data by day. Both datasets cover one month, and the column to partition/bucket on is the date.
What would you use, partitioning or clustering? In which case? Also give the reason.
I hope this helps. Let me know if I missed something or if you have any doubts.
Also let me know if there is any other concept you want to know about.