Friday 27 May 2016

Centralized vs Distributed Control: Which Big Data Tool to Use?

People often wonder which NoSQL database to use, or more importantly which distributed file system to use with Hadoop (yes, you can use other distributed file systems with Hadoop, not just HDFS). I am going to talk about one of the major aspects to base this decision on: whether you want centralized control or distributed control (don't worry, they are just fancy words for master-slave and peer-to-peer architecture respectively).

Centralized Control

This architecture simply means all nodes have a victim node to take all responsibility and blame if something goes wrong: their master or boss (although it's the other way round for most bosses.. :P ). As the name suggests, this architecture has one master node that handles special responsibilities, while all the other nodes are slave or worker nodes with the same responsibilities. Hadoop, or more accurately HDFS, has this architecture.

Pros of this architecture are:
  1. Searching is easy, as all the metadata is in one place. You know which node has your data with just one query to the master (see the sketch after this list).
  2. All the responsibilities, from handling metadata, detecting and resolving deadlocks, and load balancing to detecting whether nodes are live or dead and re-replicating a dead node's data, are handled by the master by default. There is no need for leader election, as you have a king already.
  3. All the algorithms can be centralized; you need not worry about how to make them distributed. For example, it is the master node that starts taking a snapshot of the system, and every node has to report its local state (snapshot) to it. Nodes don't need to send the information to every other node. It's equivalent to everybody submitting their assignment to the teacher instead of giving it to all the classmates for peer review.
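
To make the first pro concrete, here is a minimal sketch of asking HDFS's master (the NameNode) where a file lives; the path is just a placeholder:

     # Ask the NameNode which DataNodes hold each block of a file.
     # One query to the master returns the full block-to-node mapping.
     # The path /user/demo/events.log is hypothetical.
     hdfs fsck /user/demo/events.log -files -blocks -locations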
Cons of this architecture are:
  1. As all responsibilities are handled by one node, it puts a lot of pressure on that node, and that node becomes a bottleneck. It's like a hundred people standing in a queue to buy tickets to a movie with just one person giving out tickets; think about the length of the queue (and use BookMyShow :D ).
  2. The master node also becomes a single point of failure: if it fails, the whole system goes down with it. Hadoop 2 has addressed this with the high-availability concept, where you have more than one master node; one is active at a time and the others are on standby (this holds for both the HDFS NameNode and the YARN ResourceManager; see the sketch after this list).
  3. Centralized algorithms again put pressure on the master node, as it becomes the mediator of each and every decision and communication.
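
As a hedged illustration of the high-availability setup from con 2, these are the standard HDFS HA admin commands; the service IDs nn1 and nn2 are placeholders from a typical two-NameNode configuration:

     # Check which NameNode is active and which is on standby.
     hdfs haadmin -getServiceState nn1    # prints "active" or "standby"
     hdfs haadmin -getServiceState nn2
     # Manually fail over from nn1 to nn2 (nn2 becomes the active master).
     hdfs haadmin -failover nn1 nn2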

Distributed Control

This is a more symmetrical architecture, like having an assembly instead of a king. Every node has the same value and the same responsibilities. A popular example of this architecture is Cassandra, or more accurately CFS (Cassandra File System).

Pros of this architecture are:
  1. The pressure and responsibilities get distributed: any node can take one responsibility and any other node can take another. Yep people, this is BookMyShow; book your own ticket.
  2. No single point of failure. If any node goes down, its responsibilities and data are redistributed among the live nodes.
  3. The system can use distributed algorithms, which reduces the time to complete a process.

Cons of this architecture are:
  1. Searching for any file is more difficult, as you might need to ask 2-3 nodes for directions before you actually reach the one that holds it (Distributed Hash Table), or worse, you might have to flood the request to a lot of nodes so that the relevant node can answer you back (Query Flooding). You also need to know the exact data you need, as regular-expression-like queries will take more time (see the sketch after this list for how Cassandra handles the lookup).
  2. To run a centralized algorithm, an additional leader-election step has to be done first, or the system will end up with more than one snapshot, or the deadlock-correction system will kill more processes than needed. Also, writing distributed algorithms isn't exactly a piece of cake. For example, there has to be an initiator for deadlock detection; if there are multiple initiators, one might resolve the deadlock while another is still under the impression that there is a deadlock, because it detected it before the first node fixed it.
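
To make the lookup story from con 1 concrete, here is a hedged sketch using Cassandra's nodetool; the keyspace demo, table users and partition key alice are placeholders. Because every Cassandra node learns the token ring through gossip, any node can compute which replicas own a key, so there is no master to ask:

     # Ask any node which replicas own a given partition key.
     # Keyspace "demo", table "users" and key "alice" are hypothetical.
     nodetool getendpoints demo users alice
     # Show the token ring: every node owns a slice of the hash space.
     nodetool status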
All in all, if you think you can write more distributed sorts of algorithms and your searches are more definite or similar (you can cache search results, i.e. the nodes having the data, for similar queries), congrats: you don't need to bear a single point of failure, because you can choose a distributed-control type of tool.
But if your searches are very diverse and form the major part of the problem, then use a centralized kind of tool. You will probably have to endure some bottleneck issues, but you can always give the master node a better configuration than the rest.

Remember, this is just a starting criterion for choosing which NoSQL database, file system, or distributed tool to use. There are a lot of other factors; for example, there is the CAP theorem for choosing the best NoSQL database. I am going to talk about the CAP theorem in my next post.

I tried to lay out the differences, pros, and cons as simply as I could. Hope it helped. But if you didn't get it, or I have overused technically fancy words, please ask me again. I'll be happy to explain it in simpler terms. FYI, this is absolutely not an advertisement of any sort :P (you can always use PVR Cinemas instead of BookMyShow).

Hope you got some insights into these two architectural styles of distributed systems, and that they help you get nearer to the right decision for choosing the right tool according to your need. If you want to know more, do tell me. Thanks for reading. Stay tuned for more.....

Thursday 28 April 2016

Big Data: Is it worth it?

In continuation of my series about what Big Data can and can't do for you, here's another thought of mine.

First of all, I don't want to say that Big Data is not worth it. All this post is about is making you ask yourself a question: everyone knows there are risks with big data, so ask yourself whether you are willing to take the risk. IS IT WORTH IT?

This question is for companies starting with big data or still in the experimentation phase, and for the people starting with big data (I have already written http://bigdatabuff.blogspot.in/2015/10/reasons-why-moving-to-big-data-can-be.html for you).

The reason I am writing this post is that I have seen companies start with big data, experiment with it for some time, and then just drop it because some mistake convinced them that it's not worth it. I hope to give insights into what you can and can't expect big data to do for you.
Let's chat then.
The common mistakes or problems are:

  • Not giving enough time: Time is a huge problem for companies nowadays. Companies start on big data, build R&D teams of experts, pay them heavily, and expect the outcome in a month. R&D is not just development; it has research in it for a reason. It takes time, and it can fail at times, more times than it succeeds. Don't expect short-term profits from it. If you can't wait and don't want to "waste" your manpower on it, just outsource it, because building something innovative takes time.

  • Misconception: At times people are under the impression that big data is some sort of magic that makes systems so fast that you can do everything in milliseconds (yep, people still believe in magic :P). It's big data, but it's still computer science; all the logic and limitations of computers still apply, people. It's logic, not magic (surprise!!!!). There is a reason that every time anyone talks about making a process fast, they also give hardware specifications. Ever heard about the TeraSort benchmark? http://sortbenchmark.org/
  • Too much experimenting: Well, it's the opposite of problem 1. Just because you are doing R&D doesn't mean you have all the time in the world, people!!! Sometimes the problem statements or assumptions you create in your head are wrong. Just because your hypothetical data didn't run as fast as you thought with big data tools doesn't mean your actual data won't. Millisecond-level optimizations are to be done with real data. If you go on experimenting with every big data tool on earth based on assumptions alone (not their architecture), then I am sorry, my friend: the list is too long, and it'll take a lifetime to complete.

  • Too haphazard an architecture: It's a different variant of problem 1. At times people actually don't have a choice other than to use big data. Then at times they make the mistake of taking the first thing they get their hands on and building a system in a haphazard manner. You don't need to look at every big data tool, but at least look at enough of them to make sure you have tried most of the feasible alternatives.


Big Disclaimer: I am not here to criticize anyone or to say that nobody is using big data the right way. I just wrote this post to point out common mistakes made by companies starting with big data, so that they can be avoided (I hope I am helping someone :) ).

If you can think of some other way someone, or even you, messed up something like this (we all make mistakes at times), please share it in the comments. I would be glad to hear about it.
Also, if you don't agree with something in my post, or if I seem rude at times (although I don't mean to be), do tell me.
Thanks for bearing with me. Stay tuned for more :)

Saturday 6 February 2016

HIVE Tricks and Tips


In continuation of my last post, here are some tips and tricks for HIVE users:

Tricks:

1. Run HIVE scripts from bash:
     1.1 To run a HIVE script from a bash script, use the hive command:
              hive -e '{HIVE COMMANDS}'

     1.2 To run HIVE in verbose mode (echoes the query along with progress; generally used for logging):
               hive -v -e '{HIVE COMMANDS}'

     1.3 To pass arguments from bash to HIVE (here $ARG is the bash variable, and ${ARG} is how HIVE sees it):
               hive -d ARG=$ARG -e '{HIVE COMMANDS}'

     1.4 To pass results from HIVE back to bash (a combined sketch follows after this list):
                1.4.1 If you have one or a few values, capture stdout with command substitution:
                               VAR=$(hive -e '{HIVE COMMANDS}')
                1.4.2 If you have a big table as the result, write it to an output directory:
                               hive -d DIR=$DIR -e '
                               INSERT OVERWRITE LOCAL DIRECTORY "${DIR}"
                               {REST OF THE COMMAND}'
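
Putting 1.3 and 1.4.1 together, here is a hedged end-to-end sketch; the table app_sessions, its columns, and the cutoff value are all placeholders:

     #!/bin/bash
     # Pass a bash variable into HIVE and capture a scalar result back.
     # -S (silent mode) keeps HIVE's progress logs out of the captured value.
     CUTOFF=60
     USER_COUNT=$(hive -S -d CUTOFF=$CUTOFF -e '
       SELECT COUNT(DISTINCT user_id)
       FROM app_sessions
       WHERE duration_minutes < ${CUTOFF};')
     echo "Users with sessions under $CUTOFF minutes: $USER_COUNT"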


2. Implement the MINUS operator in HIVE:
        Q1 LEFT OUTER JOIN Q2 ON (Q1.x=Q2.y) WHERE Q2.y IS NULL
        where Q stands for query.
        The left outer join outputs all rows of Q1, even those that have no match in the result of Q2; the IS NULL filter then keeps only the rows of Q1 that have no match in Q2 (the same effect as the MINUS operator). A concrete sketch follows below.
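
As a hedged concrete version, assuming two placeholder tables all_users and churned_users that each have an id column, this returns the ids in all_users MINUS those in churned_users:

     hive -e '
     SELECT a.id
     FROM all_users a
     LEFT OUTER JOIN churned_users c ON (a.id = c.id)
     WHERE c.id IS NULL;'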

Tip:

Use GROUP BY and CASE as much as possible: if you have multiple queries that run on the same table, try converting them into one query using GROUP BY or CASE statements. For example, let's say that for an app you want to find out how many users have sessions of less than an hour, one to three hours, and more than three hours, based on their location: use CASE to create the three buckets and GROUP BY to split them by location, as in the sketch below.
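
A hedged sketch of that tip; the table app_sessions and its columns user_id, location and duration_minutes are placeholders (the CASE is repeated in GROUP BY because older HIVE versions cannot group by a select-list alias):

     hive -e '
     SELECT location,
            CASE WHEN duration_minutes < 60  THEN "under_1h"
                 WHEN duration_minutes <= 180 THEN "1h_to_3h"
                 ELSE "over_3h" END AS bucket,
            COUNT(DISTINCT user_id) AS users
     FROM app_sessions
     GROUP BY location,
              CASE WHEN duration_minutes < 60  THEN "under_1h"
                   WHEN duration_minutes <= 180 THEN "1h_to_3h"
                   ELSE "over_3h" END;'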

Hope you find these useful. If you have some tips and tricks like these, you can share them via the comments. If you are having problems with HIVE and want some tricks, do ask.

Thanks for reading :)