Thursday, 30 July 2015

Big Data: Hadoop Distributed Framework

Lets move to various technologies that come under big data.

I am starting with hadoop, one of the most used (and criticized, not sure why) distributed frameworks for big data.
(FYI, As I am starting with the basics I am explaining hadoop 0.x and 1.x,  NOT 2.x or YARN)

Distributed Frameworks are building blocks of distributed systems and define their behavior in terms of various important aspects like architecture, synchronization algorithms, load balancing, fault tolerance , failure recovery algorithms, programming paradigms and a lot more...
 
Hadoop is composed of two components:
  1. HDFS (Hadoop Distributed File System) : Its a distributed file system based on GFS (Google File System).
  2. Map Reduce: Programming paradigm (way of programming). It has got two phases map and reduce, hence the name.

Hadoop as a distributed framework:

Architecture:
Hadoop has master slave architecture i.e. one node (I hope you understand what node is) acts as a master or a manager (an efficient one, not the one everyone's bitching about, just kidding) and all other nodes act as slave or worker.
  1.  HDFS: For HDFS, namenode is master and datanode is slave. Datanode keeps the data, and namenode acts as index for the data.
  2. Map Reduce: For map reduce, master is jobtracker and slave is tasktracker. Tasktracker runs the processes for map phase and reduce phase. Jobtracker acts as the manger and decides where to run which process.


Synchronization Algorithms:
As hadoop is a master slave architecture, obviously synchronization is done by master. Slaves send messages to master after some fixed interval. These messages are called heart beat messages, as they tell master that a slave is alive. These messages also contain other information like process completion (on tasktracker), space remaining (on datanode) etc.


Fault Tolerance and failure recovery:
As these nodes are commodity hardware, they are bound to fail.
  1. Slave's data is replicated (copied) on multiple nodes to make the cluster fault tolerant. Whenever any one of them fails, master just redirects the reads to other node's copies. 
  2. Master's data (index of the files on slaves) is copied on another node. On failure the copied data on the other node has to be copied back on master and has to be restarted by admin. (FYI this problem has been solved in hadoop 2.x)

Programming paradigm:
As hadoop needs to run the programs as distributed processes, it has a different way of programming. It is called map reduce. Everything is seen as a key value pair.
Map is used to create the right key value pair and reduce, as the name suggests aggregates the data. (Again FYI, this map is not the data structure map, its a phase)
e.g. bloggers log has to be examined to find the referring urls and number of references per url. Map phase will create each referring url as key and one(1) as value and reduce phase will add all the occurrences (or 1s) of each key or referring url
Map: <URL1,1>;<URL1,1>;<URL2,1>
Reduce:<URL1,1+1>;<URL2,1>


This is just the basic idea of hadoop. Each process and component will take 2-3 whole posts.

Ask questions if you don't understand something, you can even disagree with me. Looking forward to some response....
Stay tuned for more elaborate explanations and examples...

4 comments:

  1. good one... moving on to the next blog :)

    ReplyDelete
  2. Good work.
    You should look to adding an example of combiner & partitioner.

    ReplyDelete
  3. Good work.
    You should look to adding an example of combiner & partitioner.

    ReplyDelete
  4. Actually I would add one when I specifically write about map reduce... At this stage that will confuse people... Thanks for the suggestion :)

    ReplyDelete