Big Data- Deep Dive: May 2016

People often wonder which NoSQL to use, or more importantly which distributed file system to use with hadoop (Yes you can use other distributed file systems with hadoop not just HDFS). I am going to talk about one of the major aspects to base this decision on, whether you want a centralized control or a distributed control (Don't worry they are just fancy words for master slave and peer to peer architecture respectively).

Centralized Control:

This architecture simply means all nodes have a victim node to take all responsibility and blame if something goes wrong, their master or boss (although its other way round for most bosses.. :P ). Actually as the name suggests this architecture has one master node to handle special responsibilities and all other nodes are slave nodes or worker nodes with same responsibilities. Hadoop or more accurately HDFS has this architecture.

Pros of this architecture are:

Searching is easy as all the metadata is in one place. You know which node has your data with just one query to the master.
All the responsibilities starting from handling metadata, detecting and resolving deadlocks, load balancing to detecting nodes(live or dead) and handling dead node's data, everything is handled by the master by default, no need for leader election as you have a king already.
All the algorithms can be centralized, you need not worry how to make them distributed, for example its the master node who will start taking snapshot of the system and every node has to report its local state (snapshot) to it. They don't need to send the information to every other node. Its equivalent to everybody submitting their assignment to teacher instead of giving it to all the classmates for peer review.

Cons of this architecture are:

As all responsibilities are handled by one node, it puts a lot of pressure on that node. That node becomes a bottleneck. Its like a hundred people are standing in queue to buy ticket to a movie but there is just one person to give out ticket, think about the length of the queue (and use book my show :D )
The master node also becomes a single point of failure. If it fails the whole system goes down with it. Although Yarn has solved this problem now with high availability concept where you have more than one master node, out of those one works at a time and others are on standby.
Centralized algorithms again put pressure on master node and as it has become a mediator of each and every decision and communication.

Distributed Control:

This is a more symmetrical architecture like having an assembly instead of a king. Every node has same value and same responsibilities. A popular example of this architecture is cassandra, or more accurately CFS (cassandra file system).

Pros of this architecture are:

The pressure and responsibilities gets distributed. Any node can take a responsibility and any other can take another. Yep people this is book my show, book your own ticket.
No single point of failure. If any node goes down redistribute its responsibility and data among the live nodes.
The system can use distributed algorithms, this reduces time to complete the process.

Cons of this architecture are:

Searching any file is more difficult as you might need to go on asking for directions from 2-3 nodes before you actually reach the one who holds it (Distributed Hash Table), or worse you might have to send a request to a lots of nodes to so that the query to every node and relevant node can answer you back (Query Flooding). Also you need to know the exact data you need as regular expression like queries will take more time.
To run a centralized algorithm, an additional process of leader election has to be done, or the system will end up with more than one snapshot or deadlock correction system will kill more processes than needed. Also writing distributed algorithms isn't exactly a piece of cake, for example there has to be initiator for deadlock detection, if multiple initiators are there then one might resolve it and other might still under the impression that there is a deadlock because it detected a deadlock before first node fixed it.

All in all if you think you can write more distributed sort of algorithms and have more definite or similar searches(you can put search results i.e. the nodes having the data in cache for similar queries), congrats you don't need to bear single point of failure because you can choose distributed control type tool.
But if your searches are very diverse and form the major part of the problem, then use centralized kind of tool, you probably have to endure some bottleneck issues but you can always have a master node with better configuration than the rest.

Remember its just a starting criteria for what NoSQL or file system or distributed tool to use. There are a lot of other factors for example there is CAP theorem to chose the best NoSQL. I am going to talk about CAP theorem in my next post.

I tried to list down the differences and pros and cons as simple as I could make it. Hope it helped. But if you didn't get it or I have overwritten technically fancy words, please ask me again. I'll be happy to explain it in simpler terms. FYI its absolutely no advertisement of any sort :P (you can always use PVR cinemas instead of book my show )

Hope you got some insights about these two architecture styles of distributed systems and it helps you get nearer to the right decision for choosing the right tool according to your need. If you want to know more do tell me. Thanks for reading. Stay tuned for more.....

Big Data- Deep Dive

Friday, 27 May 2016

Centralized vs Distributed Control: Which BigData Tool to use?

Centralized Control:

Distributed Control:

The Definitive Guide

Programming Hive