Thursday, 28 April 2016

Big Data: Is it worth it?

In continuation of my series about what Big Data can or can't do for you, here's another thought of mine.

First of all, I don't want to say that Big Data is not worth it. All this post is about is making you ask yourself a question: everyone knows there are risks with big data, so ask yourself whether you are willing to take the risk. IS IT WORTH IT?

This question is both for companies that are starting with big data or are still in the experimentation phase, and for people starting a career in big data (I have already written http://bigdatabuff.blogspot.in/2015/10/reasons-why-moving-to-big-data-can-be.html for you).

The reason I am writing this post is that I have seen companies start with big data, experiment with it for some time and then just drop it, because some mistake convinced them that it's not worth it. I hope to give you some insight into what you can and can't expect big data to do for you.
Let's chat then.
The common mistakes or problems are:

  • Not giving enough time: Time is a huge problem for companies nowadays. Companies start on big data, build R&D teams of experts, pay them heavily and expect the outcome in a month. R&D is not just development; it has research in it for a reason. It takes time, and it can fail; often more times than it succeeds. Don't expect short-term profits from it. If you can't wait and don't want to "waste" your manpower on it, just outsource it, because building something innovative takes time.

  • Misconception: At times people are under the impression that big data is some sort of magic that makes systems so fast that you can do everything in milliseconds (yep, people still believe in magic :P). It's big data, but it's still computer science; all the logic and limitations of computers still apply to it, people. It's logic, not magic (surprise!!!!). There is a reason that every time anyone talks about making a process fast, they also give hardware specifications. Ever heard of the terasort benchmark? http://sortbenchmark.org/
  • Too much experimenting: Well, it's the opposite of problem 1. Just because you are doing R&D doesn't mean you have all the time in the world, people!!! Sometimes the problem statements or assumptions you create in your head are wrong. Just because your hypothetical data didn't run as fast as you thought with big data tools doesn't mean your actual data won't. The millisecond optimizations are to be done with real data. If you go on experimenting with every big data tool on earth based on assumptions alone (instead of their architecture), then I am sorry my friend, the list is too long; it'll take a lifetime to complete.

  • Too haphazard architecture: It's a different variant of problem 1. At times people actually don't have a choice other than to use big data. Then they make the mistake of taking the first thing they get their hands on and creating a system in a haphazard manner. You don't need to look at every big data tool, but at least look at enough of them to make sure you tried most of the feasible alternatives.


Big Disclaimer: I am not here to criticize anyone or to say that nobody is using big data the right way. I just wrote this post to point out common mistakes made by companies starting with big data, so that they can be avoided ( I hope I am helping someone :) )

If you can think of some other way someone, or even you, messed up like this (we all make mistakes at times), please share it in the comments. I would be glad to hear them.
Also if you don't agree with something in my post or if I seem rude at times (although I don't mean to), do tell me.
Thanks for bearing with me. Stay tuned for more :)

Saturday, 6 February 2016

HIVE Tricks and Tips


In continuation of my last post, here are some tips and tricks for HIVE users:

Tricks:

1. Run HIVE script from bash:
     1.1  To run a HIVE script from a bash script, use the "hive" command:
              hive -e '{HIVE COMMANDS}'

     1.2  To run HIVE in verbose mode (prints the query along with progress; generally used for logging):
               hive -v -e '{HIVE COMMANDS}'

     1.3  To pass arguments from bash to HIVE (refer to ARG inside the query as ${ARG}):
               hive -d ARG=$ARG -e '{HIVE COMMANDS}'

     1.4  To pass results from HIVE back to bash:
                1.4.1  If you have one or a few values, capture the output in a variable:
                               VAR=$(hive -e '{HIVE COMMANDS}')
                1.4.2  If the result is a big table, write it to an output directory instead:
                               hive -d DIR=$DIR -e '
                               INSERT OVERWRITE LOCAL DIRECTORY "${DIR}"
                               {REST OF THE COMMAND}'
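
Putting 1.1 to 1.4 together, here is a minimal end-to-end sketch of a bash wrapper. The user_sessions table, its columns and the paths are hypothetical, so swap in your own:

      #!/bin/bash
      # hypothetical example: report on sessions for one country

      COUNTRY="IN"                      # argument we want to pass into HIVE
      OUT_DIR="/tmp/session_report"     # local directory for the big result

      # small result: capture it in a bash variable (-S keeps HIVE's log lines out of the output)
      TOTAL=$(hive -S -e "SELECT COUNT(*) FROM user_sessions WHERE country = '${COUNTRY}'")
      echo "Total sessions for ${COUNTRY}: ${TOTAL}"

      # big result: write it to a local directory instead of capturing it
      hive -d DIR=$OUT_DIR -d CTRY=$COUNTRY -e '
      INSERT OVERWRITE LOCAL DIRECTORY "${DIR}"
      SELECT user_id, session_length FROM user_sessions WHERE country = "${CTRY}"'

Note that the first query is in double quotes, so bash expands ${COUNTRY} before HIVE sees it, while the second is in single quotes, so HIVE itself substitutes ${DIR} and ${CTRY} from the -d definitions.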


2. Implement the MINUS operator in HIVE:
        Q1 LEFT OUTER JOIN Q2 ON (Q1.x = Q2.y) WHERE Q2.y IS NULL
        where Q stands for query.
        The left outer join keeps every row of Q1 even if it has no match in Q2; the IS NULL filter then keeps only the rows with no match in Q2, which is the same effect as the MINUS operator.
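
As a concrete sketch (all_users and paying_users are hypothetical tables), "all users MINUS paying users" on user_id would be:

      SELECT a.user_id
      FROM all_users a
      LEFT OUTER JOIN paying_users p ON (a.user_id = p.user_id)
      WHERE p.user_id IS NULL;

This returns every user_id from all_users that has no match in paying_users, which is exactly what MINUS would give you.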

Tip:

Use GROUP BY and CASE as much as possible: If you have multiple queries that run on the same table, try converting them into one query using GROUP BY or CASE statements. For example, let's say that for an app you want to find out how many users have sessions shorter than an hour, between one and three hours, and longer than three hours, based on their location: use CASE to create the three buckets and GROUP BY to split them by location.
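
Here is a minimal sketch of that example, assuming a hypothetical sessions table with location and duration_minutes columns:

      SELECT location,
             SUM(CASE WHEN duration_minutes < 60 THEN 1 ELSE 0 END)                 AS under_one_hour,
             SUM(CASE WHEN duration_minutes BETWEEN 60 AND 180 THEN 1 ELSE 0 END)   AS one_to_three_hours,
             SUM(CASE WHEN duration_minutes > 180 THEN 1 ELSE 0 END)                AS over_three_hours
      FROM sessions
      GROUP BY location;

One pass over the table instead of three separate queries; if you need distinct users rather than sessions, swap the SUMs for COUNT(DISTINCT ...) with the same CASE conditions.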

Hope you find these useful. If you have tips and tricks like these, you can share them via the comments. If you are having problems with HIVE and want some tricks, do ask.

Thanks for reading :)

Sunday, 6 December 2015

HIVE FACTS

HIVE is a widely used big data tool. It's also widely misunderstood.
Here are some facts about HIVE:
1. HIVE is a data warehouse:
HIVE is a data warehouse, not a database. It gives you the capability to put your data (your big data, usually historical data) into files and then analyze it: slice and dice it, drill down, roll up; basically anything except manipulating it.
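
For instance, here is a minimal sketch of that workflow (the path, table and columns are hypothetical): point an external table at files already sitting in HDFS and start rolling up.

      CREATE EXTERNAL TABLE page_views (
        user_id   STRING,
        country   STRING,
        view_time TIMESTAMP
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/page_views';

      -- roll up: views per country
      SELECT country, COUNT(*) FROM page_views GROUP BY country;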


2. HIVE works in batch mode:
HIVE queries look similar to SQL, but their processing is very different from a SQL database's. HIVE queries get converted into MapReduce jobs. That means running your query takes at least the time it takes to create and launch a job (unless it's a plain SELECT *).
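
You can see the batch nature for yourself with EXPLAIN; for the hypothetical page_views table from the previous point, something like the query below prints the map/reduce stages HIVE plans to run, and launching those jobs is where the latency comes from:

      EXPLAIN SELECT country, COUNT(*) FROM page_views GROUP BY country;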


3. HIVE is not an RDBMS:
HIVE has a query language very similar to SQL, but that doesn't mean it acts as an RDBMS. It lacks many of the important features of an RDBMS, like:
  • Transactions: HIVE doesn't have transactions. That means there is no guarantee that your queries will be atomic, so they can fail at any point, leaving your system in an inconsistent state.
  • No locks: There are no locks on tables. That means two queries can be manipulating your table (a whole partition or the whole table) simultaneously and hence give wrong results.
WARNING: Never use HIVE for transactional purposes


4. HIVE is not SQL:
  • Amount of I/O: HIVE doesn't really have tables; it works on files. It can't read just a part of your table; it'll always read the whole file. Whether you want to read one record or a hundred, the amount of data read is the same.
  • Processing time: If you read the whole "table", the processing time is almost none. All HIVE does is read the whole file and return it according to the table's schema. But the processing time keeps increasing with each clause in the query.

5. HIVE uses HDFS to store data:
That means if you access a large table (there are solutions for that), you have to account for network time along with I/O time.


There are a lot of features lacking in HIVE, but that doesn't mean that HIVE is useless. It just means its uses are different.
If you have some questions or you have some points where HIVE is different (or lacking), do comment here.

To find out more about HIVE, stay tuned... :)

Thursday, 15 October 2015

Why moving to big data can be a disaster for you

You might have heard about the advantages of moving your career ahead with big data.
Sorry to burst the bubble, but let me tell you the problems and the reality of the so-called jobs pouring in for big data developers.


You'll have to work hard:
Big data is an entirely new realm, a pretty different way of thinking. I am not saying you can't learn it; I am just saying you'll need to work hard, think differently, and see the problems in the conventional tools and technologies you've been working with for so long.
But if you can learn, then this is heaven for you. You'll have a whole lot to explore.

No bed of roses:
Big data is not a technology with one direction of focus, full of conventions and rules; it's a relatively new field. You can't be just a user; you need to know it deeply before you make any decision about your cluster or the technology you are going to use. There are no rules here, and it's full of unexpected disasters.


Lots of myths:
Yeah people, myths..
"This tool is easy, it's just your SQL." No buddy, it's not; its query language looks similar to SQL at first glance, but it doesn't work like that.
"OOOh, use this new tool, this will make your jobs faaast." Hey, you didn't check its problems; it's probably hampering some of your jobs and will give you wrong data.
FYI, this is all from personal experience.


False big data:
Companies claim they have loads of data. They need big data, so they hire you. The problem is that they don't even know they don't need big data. They'll make you work on NoSQL stores when MySQL can solve their problem. They'll make you work with real-time systems when batch jobs work fine for them. Check out http://bigdatabuff.blogspot.com/2015/07/big-data-do-you-need-it.html
Now you must be thinking: how will that affect us, we'll still be working on big data.. right?
Well, it will affect you, because you will never have dealt with actual big data problems. You'll never know the problems, so you'll never know the solutions. You'll go through a whole lot of trouble when you face an actual big data expert or system.



Whole lot of false hopes:
If you go for an interview and you are asked pretty basic and easy questions, don't be happy... Run as hard as you can.. because that simply means even they don't know about big data.
Such companies just hire you in the hope of getting clients (luring is a better word) to give them big data work. If they don't get it, they'll ruin your career.

I am not discouraging you from moving to big data; it's just a heads-up. It's the story of what every big data developer goes through at least once ;)
Hope this helps you make the right decision and prevent at least one of the tragedies I have mentioned.

If you want to know about more tragedies or want to send hate mails or comments (Even that is welcome) or want to share your own story; please comment here.. Looking forward to hearing from you.. :)


Sunday, 6 September 2015

HDFS Operations: Replica Placement

Welcome again... Hope you are still interested in finding out how various HDFS operations work.
Adding to my last blog post, i.e. the write operation in HDFS, this post is about a very important part of the HDFS write process: replica placement. It occurs while writing a file to HDFS, and it is very important for the failure recovery of datanodes.

Whenever a file is written to HDFS it is split into blocks (splits), and each block is replicated and placed on different nodes, so as to handle the failure of nodes.

FYI, I am assuming that the replication factor (the parameter that indicates the number of replicas to be created for each block) is set to its default value, i.e. 3.

Placement of these replicas is a very critical issue:

  • If you place all replicas on the same node, there is no benefit from replication, as all replicas are lost when that datanode fails.
  • If you place your replicas too far apart, the bandwidth and time needed for one write will be too high, and you generally have lots of blocks to write.
This is a tradeoff (hadoop is full of tradeoffs) that had to be considered at HDFS architectural design time.


Distance Measurement:

Distance in hadoop is measured as network distance. Let's take the node you are present at (the client) as the reference point. The distance from a node to this starting point is measured as:
  1. distance from the same node is 0 (zero), e.g. Delhi to Delhi distance is taken to be zero (just a concept, don't take it literally ;) )
  2. distance from a node in the same rack is 2, e.g. Delhi to Haryana (or any state in India) distance is taken to be 2
  3. distance from a node in a different rack but the same data center is 4, e.g. India to Japan (or any country in Asia) is taken to be 4
  4. distance from a node in a different data center is 6, e.g. India to any state of the USA (or any country on another continent) distance is taken to be 6
A rack is a group of nodes, and a data center is a group of racks.



Replica Placement:

Hadoop's default strategy is:
  1. Place the first replica on the same node as the client, network distance zero. If the client is not a datanode, then any node which is not too busy or full is chosen at random, network distance 4.
  2. Place the second replica on a node on a different rack (any rack), network distance 4.
  3. Place the third replica on any node on the same rack as the second replica's node, network distance 2 (the source node is now the node from step 2).
  4. Further replicas, if created, are placed on random nodes.
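
If you want to check where the replicas of a file actually ended up, here is a quick sketch with the standard HDFS shell commands (the path /data/sample.txt is hypothetical):

      # list each block of the file along with the datanodes holding its replicas
      hdfs fsck /data/sample.txt -files -blocks -locations

      # change the replication factor of an existing file (-w waits until it's done)
      hdfs dfs -setrep -w 3 /data/sample.txt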

FYI, you can design your own replica placement policy and write your own code and plug it into hadoop. So, go on and play... but be careful about tradeoffs.

Hope you got it... If not, you can ask questions in the comments section... Happy learning :)