Big Data- Deep Dive: March 2018

After talking to a lot of aspiring big data developers, I realized basic concepts of big data tools still elude most of the people. HIVE is usually one of the starting points to enter big data domain, so I am starting with some basic concepts of HIVE.

External vs Managed Table:

Concept	External	Managed
Definition	External keyword is used while creating table	No external keyword
Location	Location is provided at the time of table creation	HIVE’s default location is used
Deletion	Only metadata gets deleted when table is dropped	Both metadata and data gets deleted when table is dropped
Need	External table indicates that data doesn’t inherently belong to HIVE. It is either a virtual table created on other tool’s data or other file system or someone is sharing that data with you i.e. you don’t own the right to delete it.	Data entirely belongs to HIVE. You can delete it.
Example	You are using logs to analyze the usage patterns of your cluster, but the logs are also needed to check for errors or load and a lot of other things. So you have to keep them, you can’t delete them after use.	While analyzing usage patterns of your cluster, you create a different table from the data by joining it with the logs about the details of jobs run on your cluster.

Partitioning vs clustering:

Concept	Partitioning	Clustering(Bucketing)
Definition	It’s a technique to divide the records into different parts based on the value of specific column/s	It’s a technique to divide the records into different parts based on the value of specific column/s
Internal process	Each unique value of the specified column/s leads to creation of new folder. All the records containing same value for the column/s are put into corresponding folder.	Works similar to hash map. A hash function is applied to the value of specified column and the record corresponding to the value is put into the corresponding bucket.
Structure	Records are put into different folders according to the value of specific column	Records are put into different files according the value of column
Number of Partitions	Number of folders is equal to number of unique values for the specified column. If partitioning is done on more than one column, then number of folders is equal to total combinations of the column/s	Number of files is given at creation time.
Multiple columns	Partitions can be done on more than one column, e.g. PARTITIONED BY (state, city). Partitioning is done on state and then every state folder will contain various city folders.	Clustering can be done with partitioning. Clustered files will be put into partitioned folders. Clustering is not possible on more than one column.
Pros	As the data is partitioned according to the actual value of column, it greatly reduces the amount of data to be filtered when the querying on the partitioned column.	Number of buckets is decided by user, this prevents having endless folders. Buckets always contain almost equal amount of data.
Cons	If the data is skewed, you may end up having some partitions with a lot of data and some having close to none. If a column has a lot of unique values you may end up with a lot of folders, which affects both file system and processing	Each bucket contains data from more than one value of the column, so lesser benefit than partitioning in terms of data to be filtered.

Here’s a question:

There is a car company sales data by day and there is an ice cream company’s sales data by day. The data is for one month. The column to partition/bucket is date.

What would you use, partitioning or clustering? In which case? Also give the reason.

I hope this helps. Let me know if I missed something or if you have some doubts.

Also let me know if there is any other concept you want to know about.

Big Data- Deep Dive

Wednesday 28 March 2018

HIVE Basics

The Definitive Guide

Programming Hive