Sunday 6 December 2015

HIVE FACTS

HIVE is a widely used big data tool. It's also widely misunderstood.
Here are some facts about HIVE:
1. HIVE is a data warehouse:
 HIVE is a data warehouse, not a database. It lets you put your data (your big data, usually historical data) into files and then analyze it: slice and dice, drill down, roll up; basically anything except manipulating it.
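For example, a typical HIVE workload is a read-only aggregation over historical data. Here is a minimal sketch of such a "slice and dice" query run from Python; it assumes a HiveServer2 instance on localhost:10000, the PyHive client installed, and a hypothetical sales(region, year, amount) table already loaded.

```python
# A minimal sketch of a read-only "roll up" style HIVE query.
# Assumptions: HiveServer2 on localhost:10000, PyHive installed,
# and a hypothetical table `sales(region, year, amount)` already loaded.
from pyhive import hive

conn = hive.connect(host='localhost', port=10000, database='default')
cursor = conn.cursor()

# Aggregate (roll up) historical data; note there is no UPDATE or DELETE here --
# the workload is analytical, not transactional.
cursor.execute("""
    SELECT region, year, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region, year
""")
for region, year, total in cursor.fetchall():
    print(region, year, total)
```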


2. HIVE works in batch mode:
HIVE queries are similar to SQL, but they are processed very differently. HIVE queries are converted into MapReduce jobs, which means a query takes at least the time needed to launch a job (unless it's a plain select *).
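You can see this batch nature with HIVE's EXPLAIN statement: on the classic MapReduce execution engine, an aggregation compiles into a MapReduce stage, while a plain select * is just a file fetch. A rough sketch, using the same hypothetical setup and sales table as above:

```python
# Compare the plans of a plain fetch vs. an aggregation (hypothetical `sales` table).
from pyhive import hive

cursor = hive.connect(host='localhost', port=10000).cursor()

for query in ("SELECT * FROM sales",
              "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    cursor.execute("EXPLAIN " + query)
    plan = "\n".join(row[0] for row in cursor.fetchall())
    # The GROUP BY plan contains a Map Reduce stage, so it pays the
    # job-launch overhead; the plain SELECT * does not.
    print(query, "\n", plan, "\n")
```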


3. HIVE is not an RDBMS:
HIVE has a query language very similar to SQL, but that doesn't mean it behaves like an RDBMS. It lacks many important RDBMS features, such as:
  • Transactions: HIVE doesn't have transactions. There is no guarantee that your queries will be atomic, so they can fail at any point, leaving your system in an inconsistent state.
  • Locks: There are no locks on tables. That means two queries can manipulate the same data (a whole partition or a whole table) simultaneously and hence give wrong results.
WARNING: Never use HIVE for transactional purposes


4. HIVE is not SQL:
  • Amount of I/O: HIVE doesn't really have tables; it only works on files. It can't read part of a "table": it always reads the whole file. Whether you want one record or a hundred, the amount of data read is the same (see the sketch after this list).
  • Processing time: If you read the whole "table", processing time is almost nothing; all HIVE does is read the whole file and return it according to the table's schema. But processing time grows with each additional clause in the query.
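To make the I/O point concrete, here is a hedged sketch (same hypothetical sales table as above) showing that a single-record lookup and a full read both scan the entire file backing the table, because the filter is only applied while scanning:

```python
# Both of these read the same amount of data: the entire file(s) backing `sales`.
from pyhive import hive

cursor = hive.connect(host='localhost', port=10000).cursor()

# A "point lookup" -- still reads the whole file; the WHERE filter is applied
# while scanning, not by jumping to the matching record.
cursor.execute("SELECT * FROM sales WHERE region = 'APAC' LIMIT 1")
print(cursor.fetchall())

# Reading everything -- same I/O as above, just more rows returned.
cursor.execute("SELECT * FROM sales")
print(len(cursor.fetchall()))
```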

5. HIVE uses HDFS to store data:
It means that if you access a large table (there are solutions for that), you have to account for network time along with I/O time.


There are a lot of features lacking in HIVE, but that doesn't mean that HIVE is useless. It just means its uses are different.
If you have some questions or you have some points where HIVE is different (or lacking), do comment here.

To find out more about HIVE, stay tuned... :)

Thursday 15 October 2015

Why moving to big data can be a disaster for you

You might have heard about the advantages of moving ahead in your career with big data.
Sorry to burst the bubble, but let me tell you the problem and the reality of the so-called jobs pouring in for big data developers.


You'll have to work hard:
Big data is an entirely new realm, a pretty different way of thinking. I am not saying you can't learn it; I am just saying you'll need to work hard, think differently, and see the problems in the conventional tools and technologies you've been working with for so long.
But if you can learn it, then this is heaven for you. You'll have a whole lot to explore.

No bed of roses:
Big data is not a technology with one direction of focus, full of conventions and rules; it's a relatively new field. You can't be just a user; you need to know it deeply before you make any decision about your cluster or the technology you are going to use. There are no rules here, and it's full of unexpected disasters.


Lots of myths:
Yeah, people: myths.
"This tool is easy, it's just your SQL." No buddy, it's not; its query language looks like SQL at first glance, but it doesn't work like that.
"Oooh, use this new tool, it will make your jobs faaast." Hey, you didn't check its problems; it's probably hampering some of your jobs and will give you wrong data.
FYI, this is all from personal experience.


False big data:
Companies claim they have loads of data. They need big data, so they hire you. The problem is that even they don't know that they don't need big data. They'll make you work on NoSQL stores when MySQL can solve their problem. They'll make you work with real-time systems when batch jobs work fine for them. Check out http://bigdatabuff.blogspot.com/2015/07/big-data-do-you-need-it.html
Now you must be thinking: how will that affect us? We'll still be working on big data... right?
Well, it will affect you because you never dealt with actual big data problems. You'll never know the problems, so you'll never know the solutions. You'll run into a whole lot of trouble when you face an actual big data expert or system.



Whole lot of false hopes:
If you go for an interview and you are asked pretty basic and easy questions, don't be happy... run as hard as you can, because that simply means even they don't know about big data.
Such companies just hire you in the hope of getting clients (luring is a better word) to give them big data work. If they don't get it, they'll ruin your career.

I am not discouraging you from moving to big data; it's just a heads-up. It's the story of what every big data developer goes through at least once ;)
Hope this helps you make the right decision and prevents at least one of the tragedies I have mentioned.

If you want to know about more tragedies, want to send hate mail or comments (even that is welcome), or want to share your own story, please comment here. Looking forward to hearing from you.. :)


Sunday 6 September 2015

HDFS Operations: Replica Placement

Welcome again... Hope you are still interested in finding out how various HDFS operations work.
Adding to my last blog post, i.e. the write operation in HDFS, this post is about a very important part of the HDFS write process: replica placement. It happens while a file is being written to HDFS and is very important for the failure-recovery process of datanodes.

Whenever a file is written to HDFS it is split into blocks, and each block is replicated and placed on different nodes, so as to handle node failures.

FYI, I am assuming that the replication factor (the parameter that indicates the number of replicas to be created for each block) is set to its default value, i.e. 3.
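As a side note, here is a rough sketch of how you could set and check the replication factor from the command line, wrapped in Python; it assumes the hdfs CLI is on your PATH and pointed at a cluster, and the file path is made up for illustration.

```python
# Check and override the replication factor for a file (hypothetical path).
# Assumes the `hdfs` command-line client is installed and configured for your cluster.
import subprocess

# Upload a local file with replication factor 3 (the default) set explicitly.
subprocess.run(["hdfs", "dfs", "-D", "dfs.replication=3",
                "-put", "data.csv", "/user/demo/data.csv"], check=True)

# -stat %r prints the replication factor of an existing file.
subprocess.run(["hdfs", "dfs", "-stat", "%r", "/user/demo/data.csv"], check=True)

# -setrep changes the replication factor after the fact.
subprocess.run(["hdfs", "dfs", "-setrep", "3", "/user/demo/data.csv"], check=True)
```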

Placement of these replicas is a very critical issue:

  • If you place all replicas on the same node, there is no benefit from replication, as all replicas are lost when that datanode fails.
  • If you place your replicas too far apart, the bandwidth and time needed for one write become too high, and you generally have lots of blocks to write.
This is a trade-off (hadoop is full of trade-offs) that had to be considered when HDFS's architecture was designed.


Distance Measurement:

Distance in hadoop is measured in the form of network distance. Let's take the node you are on (the client) as the reference point. The distance from a node to this starting point is measured as follows (a small sketch of the calculation follows the list):
  1. distance from the same node is 0 (zero), e.g. Delhi to Delhi distance is taken to be zero (just an analogy, don't take it literally ;) )
  2. distance from a node in the same rack is 2, e.g. Delhi to Haryana (or any state in India) distance is taken to be 2
  3. distance from a node in a different rack but the same data center is 4, e.g. India to Japan (or any country in Asia) is taken to be 4
  4. distance from a node in a different data center is 6, e.g. India to any state of the USA (or any other continent) distance is taken to be 6
A rack is a group of nodes, and a data center is a group of racks.
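Here is a minimal sketch of that distance calculation; the (data center, rack, node) naming is made up for illustration, but the distances mirror the list above.

```python
# Network distance between two nodes identified as (data_center, rack, node) tuples.
# Mirrors the rule above: 0 = same node, 2 = same rack, 4 = same data center, 6 = farther.
def network_distance(a, b):
    if a == b:
        return 0      # same node (Delhi -> Delhi)
    if a[:2] == b[:2]:
        return 2      # same rack (Delhi -> Haryana)
    if a[0] == b[0]:
        return 4      # same data center, different rack (India -> Japan)
    return 6          # different data center (India -> USA)

client = ("dc1", "rack1", "node1")
print(network_distance(client, ("dc1", "rack1", "node1")))  # 0
print(network_distance(client, ("dc1", "rack1", "node2")))  # 2
print(network_distance(client, ("dc1", "rack2", "node7")))  # 4
print(network_distance(client, ("dc2", "rack1", "node3")))  # 6
```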



Replica Placement:

Hadoop's default strategy is (a toy simulation follows the list):
  1. Place the first replica on the same node as the client, network distance zero. If the client is not a datanode, then any node that is not too busy or too full is chosen at random, network distance 4.
  2. Place the second replica on a node in a different rack (any rack), network distance 4.
  3. Place the third replica on any other node in the same rack as the second replica's node, network distance 2 (the source node is now the node from point 2).
  4. Further replicas, if created, are placed on random nodes.
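
Below is a toy simulation of that default strategy, just to make the steps concrete. The cluster layout, the random choices, and the assumption that the client is a datanode are all made up for illustration; this is not Hadoop's actual code.

```python
# Toy simulation of HDFS's default replica placement (client assumed to be a datanode).
import random

# Hypothetical cluster: rack name -> nodes on that rack.
cluster = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
    "rack3": ["node7", "node8", "node9"],
}

def place_replicas(client_rack, client_node, replication=3):
    replicas = [(client_rack, client_node)]                 # 1st: on the client itself
    other_rack = random.choice([r for r in cluster if r != client_rack])
    second = (other_rack, random.choice(cluster[other_rack]))
    replicas.append(second)                                 # 2nd: on a different rack
    remaining = [n for n in cluster[other_rack] if n != second[1]]
    replicas.append((other_rack, random.choice(remaining))) # 3rd: same rack as the 2nd
    while len(replicas) < replication:                      # further replicas: random nodes
        rack = random.choice(list(cluster))
        replicas.append((rack, random.choice(cluster[rack])))
    return replicas

print(place_replicas("rack1", "node1"))
```

In real Hadoop the choice also takes into account how busy and how full each candidate node is, as mentioned in point 1.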

FYI, you can design your own replica placement policy, write the code, and plug it into hadoop. So go on and play... but be careful about the trade-offs.

Hope you got it... If not, you can ask questions in the comments section... Happy learning :)