Wednesday 28 March 2018

HIVE Basics


After talking to a lot of aspiring big data developers, I realized that the basic concepts of big data tools still elude most people. HIVE is usually one of the starting points for entering the big data domain, so I am starting with some basic concepts of HIVE.

External vs Managed Table:


  • Definition: An external table is created with the EXTERNAL keyword; a managed table is created without it.
  • Location: For an external table, the location is provided at table creation time; a managed table uses HIVE's default warehouse location.
  • Deletion: When an external table is dropped, only the metadata gets deleted; when a managed table is dropped, both the metadata and the data get deleted.
  • Need: An external table indicates that the data doesn't inherently belong to HIVE. It is either a virtual table created on another tool's data or another file system, or someone is sharing that data with you, i.e. you don't have the right to delete it. A managed table's data belongs entirely to HIVE, so you can delete it.
  • Example: You are using logs to analyze the usage patterns of your cluster, but the logs are also needed to check for errors, load and a lot of other things. You have to keep them and can't delete them after use, so you map them as an external table. The table you then create by joining those logs with the details of the jobs run on your cluster is yours alone, so it can be a managed table.
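For illustration, here is a minimal sketch of both definitions (the table names, the columns and the /shared/logs/cluster path are hypothetical):

    -- External: the EXTERNAL keyword plus a LOCATION HIVE doesn't own
    CREATE EXTERNAL TABLE cluster_logs (
        log_time STRING,
        node_id  STRING,
        message  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/shared/logs/cluster';

    -- Managed: no EXTERNAL keyword, data lives in HIVE's warehouse directory
    CREATE TABLE usage_summary (
        node_id   STRING,
        job_count INT
    );

    -- DROP TABLE cluster_logs removes only the metadata; the files under
    -- /shared/logs/cluster survive. DROP TABLE usage_summary removes the data too.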





Partitioning vs Clustering (Bucketing):


  • Definition: Both are techniques to divide records into different parts based on the value of specific column(s).
  • Internal process: With partitioning, each unique value of the specified column(s) leads to the creation of a new folder, and all records containing the same value are put into the corresponding folder. Clustering works more like a hash map: a hash function is applied to the value of the specified column and the record is put into the corresponding bucket.
  • Structure: Partitioning puts records into different folders according to the value of the column; clustering puts them into different files.
  • Number of parts: The number of partition folders equals the number of unique values of the specified column (if partitioning is done on more than one column, it equals the number of value combinations). The number of buckets is fixed at table creation time.
  • Multiple columns: Partitioning can be done on more than one column, e.g. PARTITIONED BY (state, city); partitioning is done on state first, and every state folder then contains city folders. Clustering can be combined with partitioning, with the bucketed files placed inside the partition folders; CLUSTERED BY can also list more than one column, in which case the hash is computed over the combined values.
  • Pros: Because the data is partitioned according to the actual value of the column, partitioning greatly reduces the amount of data to be filtered when querying on the partitioned column. With clustering, the number of buckets is decided by the user, which prevents endless folders, and the buckets contain roughly equal amounts of data.
  • Cons: With partitioning, skewed data can leave some partitions holding a lot of data and others close to none, and a column with many unique values produces a lot of folders, which affects both the file system and processing. With clustering, each bucket contains data for more than one value of the column, so the benefit in terms of data to be filtered is smaller than with partitioning.
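To make the DDL concrete, here is a small sketch (the sales tables and their columns are hypothetical):

    -- Partitioned on state and city: one folder per (state, city) combination
    CREATE TABLE sales_by_region (
        order_id BIGINT,
        amount   DOUBLE
    )
    PARTITIONED BY (state STRING, city STRING);

    -- Partitioned on state and bucketed on customer_id: a fixed number of
    -- files inside every state folder
    CREATE TABLE sales_bucketed (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE
    )
    PARTITIONED BY (state STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS;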


Here’s a question:
There is a car company's sales data by day, and there is an ice cream company's sales data by day. The data covers one month, and the column to partition/bucket on is the date.
What would you use, partitioning or clustering? In which case? Also give the reason.
I hope this helps. Let me know if I missed something or if you have some doubts.
Also let me know if there is any other concept you want to know about.   


Friday 27 May 2016

Centralized vs Distributed Control: Which BigData Tool to use?

People often wonder which NoSQL store to use, or more importantly which distributed file system to use with hadoop (yes, you can use distributed file systems other than HDFS with hadoop). I am going to talk about one of the major aspects to base this decision on: whether you want centralized control or distributed control (don't worry, they are just fancy words for master-slave and peer-to-peer architecture respectively).

Centralized Control

This architecture simply means all nodes have a victim node to take all the responsibility and blame if something goes wrong, their master or boss (although it's the other way round for most bosses.. :P ). Actually, as the name suggests, this architecture has one master node to handle special responsibilities, and all other nodes are slave nodes or worker nodes with the same responsibilities. Hadoop, or more accurately HDFS, has this architecture.

Pros of this architecture are:
  1. Searching is easy, as all the metadata is in one place. You know which node has your data with just one query to the master.
  2. All the responsibilities, from handling metadata, detecting and resolving deadlocks and load balancing to detecting nodes (live or dead) and handling a dead node's data, are handled by the master by default; there is no need for leader election, as you already have a king.
  3. All the algorithms can be centralized, so you need not worry about how to make them distributed. For example, it's the master node that starts taking a snapshot of the system, and every node has to report its local state (snapshot) to it; nodes don't need to send the information to every other node. It's equivalent to everybody submitting their assignment to the teacher instead of giving it to all the classmates for peer review.
Cons of this architecture are:
  1. As all responsibilities are handled by one node, it puts a lot of pressure on that node, and it becomes a bottleneck. It's like a hundred people standing in a queue to buy tickets to a movie while there is just one person giving out tickets; think about the length of the queue (and use BookMyShow :D ).
  2. The master node also becomes a single point of failure. If it fails, the whole system goes down with it. Hadoop has now addressed this with the high-availability concept, where you have more than one master node: one is active at a time and the others are on standby.
  3. Centralized algorithms again put pressure on the master node, as it becomes the mediator of each and every decision and communication.

Distributed Control: 

This is a more symmetrical architecture, like having an assembly instead of a king. Every node has the same value and the same responsibilities. A popular example of this architecture is Cassandra, or more accurately CFS (Cassandra File System).

Pros of this architecture are:
  1. The pressure and responsibilities get distributed. Any node can take up one responsibility and any other node can take up another. Yep people, this is BookMyShow: book your own ticket.
  2. No single point of failure. If any node goes down, its responsibilities and data are redistributed among the live nodes.
  3. The system can use distributed algorithms, which reduces the time needed to complete a process.

Cons of this architecture are:
  1. Searching for any file is more difficult, as you might need to ask 2-3 nodes for directions before you actually reach the one that holds it (Distributed Hash Table), or worse, you might have to send the request to a lot of nodes so that the relevant node can answer you back (Query Flooding). Also, you need to know the exact data you need, as regular-expression-like queries will take more time.
  2. To run a centralized algorithm, an additional process of leader election has to be done, or the system will end up with more than one snapshot, or the deadlock-correction system will kill more processes than needed. Also, writing distributed algorithms isn't exactly a piece of cake; for example, there has to be an initiator for deadlock detection, and if there are multiple initiators, one might resolve the deadlock while another is still under the impression that it exists, because it detected the deadlock before the first node fixed it.
All in all, if you think you can write more distributed kinds of algorithms and have more definite or similar searches (you can cache search results, i.e. the nodes holding the data, for similar queries), congrats: you don't need to bear a single point of failure, because you can choose a distributed-control tool.
But if your searches are very diverse and form the major part of the problem, then use a centralized tool; you will probably have to endure some bottleneck issues, but you can always give the master node a better configuration than the rest.

Remember, this is just a starting criterion for choosing a NoSQL store, file system or distributed tool. There are a lot of other factors; for example, there is the CAP theorem for choosing the best NoSQL store. I am going to talk about the CAP theorem in my next post.

I tried to lay out the differences and the pros and cons as simply as I could. Hope it helped. But if you didn't get it, or I have overused technically fancy words, please ask me again; I'll be happy to explain it in simpler terms. FYI, this is absolutely not an advertisement of any sort :P (you can always use PVR Cinemas instead of BookMyShow).

Hope you got some insights into these two architecture styles of distributed systems, and that it brings you nearer to the right decision when choosing the tool that fits your needs. If you want to know more, do tell me. Thanks for reading. Stay tuned for more.....

Thursday 28 April 2016

Big Data : Is it worth it?

In continuation of my series about what Big Data can or can't do for you, here's another thought of mine.

First of all, I don't want to say that Big Data is not worth it. All this post is about is making you ask yourself a question: everyone knows there are risks with big data, so ask yourself whether you are willing to take the risk. IS IT WORTH IT?

This question is both for companies starting with big data or still in the experimentation phase, and for people starting out with big data (I have already written http://bigdatabuff.blogspot.in/2015/10/reasons-why-moving-to-big-data-can-be.html for you).

The reason I am writing this post is that I have seen companies start with big data, experiment with it for some time and then just drop it, because some mistake convinced them that it's not worth it. I hope to give insights into what you can or can't expect big data to do for you.
Let's chat then.
The common mistakes or problems are:

  • Not giving enough time: Time is a huge problem for companies nowadays. Companies start on big data, build R&D teams of experts, pay them heavily and expect the outcome in a month. R&D is not just development; it has research in it for a reason. It takes time, and it can fail at times, more often than it succeeds. Don't expect short-term profits from it. If you can't wait and don't want to "waste" your manpower on it, just outsource it, because building something innovative takes time.

  • Misconception: At times people are under the impression that big data is some sort of magic that makes systems so fast that you can do everything in milliseconds (yep, people still believe in magic :P). It's big data, but it's still computer science; all the logic and limitations of computers still apply to it, people. It's logic, not magic (surprise!!!!). There is a reason that every time anyone talks about making a process fast, they also give hardware specifications. Ever heard of the TeraSort benchmark? http://sortbenchmark.org/
  • Too much experimenting: Well, it's the opposite of problem 1. Just because you are doing R&D doesn't mean you have all the time in the world, people!!! Sometimes the problem statements or assumptions you create in your head are wrong. Just because your hypothetical data didn't run as fast as you thought with big data tools doesn't mean your actual data won't. The millisecond optimizations are to be done with real data. If you go on experimenting with every big data tool on earth based on assumptions alone (not their architecture), then I am sorry, my friend, the list is too long; it'll take a lifetime to complete.

  • Too haphazard an architecture: It's a different variant of problem 1. At times people actually don't have a choice other than to use big data, and then they make the mistake of taking the first thing they get their hands on and creating a system in a haphazard manner. You don't need to look at every big data tool, but at least look at enough of them to make sure you tried most of the feasible alternatives.


Big Disclaimer: I am not here to criticize anyone or to say that nobody is using big data the right way. I just wrote this post to point out common mistakes made by companies starting with big data, so that they can be avoided ( I hope I am helping someone :) )

If you can think of some other way someone or even you messed up something like this (we all make mistakes at times), please share in the comments. I would be glad to hear them.
Also if you don't agree with something in my post or if I seem rude at times (although I don't mean to), do tell me.
Thanks for bearing with me. Stay tuned for more :)

Saturday 6 February 2016

HIVE Tricks and Tips


In continuation of my last post, here are some tips and tricks for HIVE users:

Tricks:

1. Run HIVE script from bash:
     1.1  To run a HIVE script from a bash script, use the "hive" command:
              hive -e '{HIVE COMMANDS}'

     1.2  To run HIVE in verbose mode (prints progress with query, generally used for logging):
               hive -v -e '{HIVE COMMANDS}'

     1.3  To pass arguments from bash to hive (the value is then available as ${ARG} inside the HIVE commands):
               hive -d ARG=$ARG (argument passed from bash) -e '{HIVE COMMANDS}'

     1.4  To pass results from hive to bash (see the combined sketch after this list):
                1.4.1  If you have one or a few values, capture the query output with command substitution:
                               VAR=$(hive -e '{HIVE COMMANDS}')
                1.4.2  If the result is a big table, write it to an output directory instead:
                               hive -d DIR=$DIR -e '
                               INSERT OVERWRITE LOCAL DIRECTORY "${DIR}"
                               {REST OF THE COMMAND}'
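
Putting 1.3 and 1.4 together, here is a minimal bash sketch. The logs table, the dt column and the paths are placeholders, and it follows the same -d / ${VAR} substitution pattern as 1.4.2 above:

        #!/bin/bash
        DT="2016-02-06"

        # 1.3 + 1.4.1: pass a date into the query, capture a single count back
        COUNT=$(hive -d DT="$DT" -e 'SELECT COUNT(*) FROM logs WHERE dt = "${DT}";')
        echo "rows for $DT: $COUNT"

        # 1.4.2: for a big result, write it to a local directory instead
        OUT_DIR="/tmp/logs_${DT}"
        hive -d DT="$DT" -d DIR="$OUT_DIR" -e '
        INSERT OVERWRITE LOCAL DIRECTORY "${DIR}"
        SELECT * FROM logs WHERE dt = "${DT}";'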


2. Implement MINUS operator in HIVE:
        Q1 LEFT OUTER JOIN Q2 ON (Q1.x = Q2.y) WHERE Q2.y IS NULL
        where Q stands for query.
        The left outer join returns all rows of Q1, even those with no match in Q2; the IS NULL condition then keeps only the rows of Q1 that have no match in Q2, which is the same effect as the MINUS operator.
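
        As a concrete sketch, suppose you want every user_id present in all_users but not in churned_users (both tables are hypothetical):

        -- all_users MINUS churned_users, expressed as a left outer join
        SELECT a.user_id
        FROM all_users a
        LEFT OUTER JOIN churned_users c
          ON (a.user_id = c.user_id)
        WHERE c.user_id IS NULL;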

Tip :

Use GROUP BY and CASE as much as possible: If you have multiple queries that run on the same table, try converting them into one query using GROUP BY or CASE statements. For example, say that for an app you want to find out how many of your users' sessions last less than an hour, one to three hours, or more than three hours, based on their location: use CASE to create the three buckets and GROUP BY to split them by location, roughly as in the sketch below.
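
Here is a rough sketch of that example, assuming a hypothetical sessions table with duration_minutes and location columns (it counts sessions per bucket, in a single pass over the table):

        -- one pass over the table instead of three separate queries
        SELECT
          location,
          SUM(CASE WHEN duration_minutes < 60 THEN 1 ELSE 0 END)                AS under_1_hour,
          SUM(CASE WHEN duration_minutes BETWEEN 60 AND 180 THEN 1 ELSE 0 END)  AS one_to_3_hours,
          SUM(CASE WHEN duration_minutes > 180 THEN 1 ELSE 0 END)               AS over_3_hours
        FROM sessions
        GROUP BY location;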

Hope you find these useful. If you have tips and tricks like these, you can share them via comments. If you are having some problems with HIVE and want some tricks, do ask.

Thanks for reading :)

Sunday 6 December 2015

HIVE FACTS

HIVE is a widely used big data tool. It's also widely misunderstood.
Here are some facts about HIVE:
1. HIVE is a data warehouse:
HIVE is a data warehouse, not a database. It gives you the capability to put your data (your big data, usually historical data) into files and then analyze it, slice and dice it, drill down, roll up; basically anything except manipulating it.


2. HIVE works in batch mode:
HIVE queries are similar to SQL, but their processing is very different. HIVE queries get converted into MapReduce jobs, which means any query takes at least the time needed to create and launch a job (unless it's a plain SELECT *).
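You can see this for yourself with EXPLAIN, which prints the stages (MapReduce jobs) a query compiles to; the sales table here is just a placeholder:

        -- shows the map/reduce stages HIVE will launch for the query
        EXPLAIN
        SELECT state, COUNT(*) FROM sales GROUP BY state;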


3. HIVE is not an RDBMS:
HIVE has a query language very similar to SQL, but that doesn't mean it acts as an RDBMS. It lacks many of the important features of an RDBMS, like:
  • Transactions: HIVE doesn't have transactions, so there is no guarantee that your queries will be atomic; they can fail at any point, leaving your system in an inconsistent state.
  • Locks: There are no locks on tables. That means it is possible that two queries manipulate your table (a whole partition or the whole table) simultaneously and hence give wrong results.
WARNING: Never use HIVE for transactional purposes


4. HIVE is not SQL:
  • Amount of I/O: HIVE doesn't really have tables; it works on files. It can't read just a part of your table; it always reads the whole file (unless the data is partitioned). It doesn't matter whether you want to read one record or a hundred, the amount of data read is the same.
  • Processing time: If you read the whole "table", the processing time is almost none; all HIVE does is read the whole file and return it according to the table's schema. But processing time keeps increasing with each additional clause in the query.

5. HIVE uses HDFS to store data:
It means that if you access a large table (there are solutions for that), you have to account for network time as well as I/O time.


There are a lot of features lacking in HIVE, but that doesn't mean that HIVE is useless. It just means its uses are different.
If you have some questions or you have some points where HIVE is different (or lacking), do comment here.

To find out more about HIVE, stay tuned... :)

Thursday 15 October 2015

Why moving to big data can be a disaster for you

You might have heard about advantages of moving ahead in your career with big data.
Sorry to burst the bubble, but let me tell you the problems and the reality of the so-called jobs pouring in for big data developers.


You'll have to work hard:
Big data is an entirely new realm, and a pretty different way of thinking. I am not saying you can't learn it; I am just saying you'll need to work hard, think about things differently, and see the problems in the conventional tools and technologies you've been working with for so long.
But if you can learn, then this is heaven for you. You'll have a whole lot to explore.

No bed of roses:
Big data is not a technology with one direction of focus, full of conventions and rules; it's a relatively new field. You can't be just a user; you need to know it in depth before you make any decision about your cluster or the technology you are going to use. There are no rules here, and it's full of unexpected disasters.


Lots of myths:
Yeah people, myths..
"This tool is easy, it's just your SQL." No buddy, it's not; its query language looks similar to SQL at first glance, but it doesn't work like that.
"Oooh, use this new tool, it will make your jobs faaast." Hey, you didn't check its problems; it's probably inhibiting some of your jobs and will give you wrong data.
FYI, this is all from personal experience.


False big data:
Companies claim they have loads of data, so they need big data, and they hire you. The problem is that even they don't know that they don't need big data. They'll make you work on NoSQL stores when MySQL could solve their problem. They'll make you work with real-time systems when batch jobs work fine for them. Check out http://bigdatabuff.blogspot.com/2015/07/big-data-do-you-need-it.html
Now you must be thinking: how will that affect us, we'll still be working on big data.. right?
Well, it will affect you because you never dealt with actual big data problems. You'll never know the problems, so you'll never know the solutions. You'll go through a whole lot of trouble when you face an actual big data expert or system.



Whole lot of false hopes:
If you go for an interview and you are asked pretty basic and easy questions, don't be happy... run as hard as you can.. because that simply means even they don't know about big data.
Such companies hire you just in the hope of getting (luring is a better word) clients to give them big data work. If they don't get any, they'll ruin your career.

I am not discouraging you from moving to big data; it's just a heads-up. It's the story of what every big data developer goes through at least once ;)
Hope this helps you make the right decision and prevents at least one of the tragedies I have mentioned.

If you want to know about more tragedies, or want to send hate mail or comments (even that is welcome), or want to share your own story, please comment here.. Looking forward to hearing from you.. :)


Sunday 6 September 2015

HDFS Operations: Replica Placement

Welcome again... Hope you are still interested in finding out how various HDFS operations work.
Adding to my last blog post, i.e. the write operation in HDFS, this post is about a very important part of the HDFS write process: replica placement. It occurs while writing a file to HDFS and is very important for the failure recovery of datanodes.

Whenever a file is written to HDFS, it is split into blocks, and each block is replicated and placed on different nodes so as to handle node failures.

FYI, I am assuming that the replication factor (the parameter indicating the number of replicas to be created for each block) is set to its default value, i.e. 3.

Placement of these replicas is a very critical issue:

  • If you place all replicas on the same node, there is no benefit to replication, as all replicas are lost if that node fails.
  • If you place your replicas too far apart, the bandwidth and time needed for one write will be too high, and you generally have lots of blocks to write.
This is a tradeoff (hadoop is full of tradeoffs) that had to be considered when the HDFS architecture was designed.


Distance Measurement:

Distance in hadoop is measured as network distance. Let's take the node you are present on (the client) as the reference. The distance from a node to this starting point is then measured as:
  1. the distance from the same node is 0 (zero), e.g. the Delhi-to-Delhi distance is taken to be zero (just a concept, don't take it literally ;) )
  2. the distance from a node in the same rack is 2, e.g. the Delhi-to-Haryana (or any state in India) distance is taken to be 2
  3. the distance from a node in a different rack but the same data center is 4, e.g. India to Japan (or any country in Asia) is taken to be 4
  4. the distance from a node in a different data center is 6, e.g. the distance from India to any state of the USA (or any other continent) is taken to be 6
A rack is a group of nodes, and a data center is a group of racks.



Replica Placement:

Hadoop's default strategy is:
  1. Place the first replica on the same node as the client, network distance zero. If the client is not a datanode, then any node which is not too busy or too full is chosen at random, network distance 4.
  2. Place the second replica on a node on a different rack (any rack), network distance 4.
  3. Place the third replica on any node on the same rack as the second replica's node, network distance 2 (the source node is now the node from point 2).
  4. Further replicas, if created, are placed on random nodes.

FYI, you can design your own replica placement policy and write your own code and plug it into hadoop. So, go on and play... but be careful about tradeoffs.

Hope you got it... If not, you can ask questions in the comments section... Happy learning :)