Sunday 6 September 2015

HDFS Operations: Replica Placement

Welcome again... Hope you are still interested in finding out how various HDFS operations work.
Following up on my last blog post, i.e. the write operation in HDFS, this post is about a very important part of the HDFS write process: replica placement. It happens while a file is being written to HDFS, and it is crucial for recovering from datanode failures.

Whenever a file is written in HDFS, it is split into blocks (which the write pipeline streams to datanodes as smaller packets), and each block is replicated and placed on different nodes, so as to handle failure of nodes.
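
To make the numbers concrete, here is a toy calculation (my own illustration, not Hadoop code; the 300 MB file size is just an assumption, and 128 MB is Hadoop 2's default block size):

```java
// A toy calculation, not Hadoop code: how many blocks and stored replicas
// a file produces, assuming Hadoop 2's defaults (128 MB blocks, replication 3).
public class BlockMath {
    public static void main(String[] args) {
        long fileSize = 300L * 1024 * 1024;   // a hypothetical 300 MB file
        long blockSize = 128L * 1024 * 1024;  // dfs.blocksize default in Hadoop 2
        int replication = 3;                  // dfs.replication default

        long blocks = (fileSize + blockSize - 1) / blockSize; // ceiling division
        System.out.println(blocks + " blocks, " + (blocks * replication) + " replicas");
        // prints: 3 blocks, 9 replicas
    }
}
```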

FYI, I am assuming that the replication factor (the parameter that indicates the number of replicas to be created for each block) is set at its default value, i.e. 3.
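
If you want to see where this knob lives, here is a minimal sketch using Hadoop's Java client API (Configuration, FileSystem.get and FileSystem.setReplication are real Hadoop APIs; the file path is made up for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// A minimal sketch: the replication factor is the dfs.replication setting
// (default 3), and it can also be changed per file after the fact.
public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // the default, shown explicitly

        FileSystem fs = FileSystem.get(conf);
        // Ask for only 2 replicas of an existing file (hypothetical path).
        fs.setReplication(new Path("/user/demo/data.txt"), (short) 2);
        fs.close();
    }
}
```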

Placement of these replicas is a critical design decision:

  • If you place all replicas on the same node, there is no benefit to replication: all replicas are lost together when that datanode fails.
  • If you place your replicas too far apart, the bandwidth and time needed for each write become too high, and a file generally has lots of blocks to write.
This is a tradeoff (Hadoop is full of tradeoffs) that had to be settled when HDFS's architecture was designed.


Distance Measurement:

Distance in Hadoop is measured as network distance. Let's take the node you are at (the client) as the reference point. The distance from a node to this reference is then:
  1. distance from the same node is 0 (zero), e.g. Delhi to Delhi distance is taken to be zero (just a concept, don't take it literally ;) )
  2. distance from a node in the same rack is 2, e.g. Delhi to Haryana (or any other state in India) distance is taken to be 2
  3. distance from a node in a different rack but the same data center is 4, e.g. India to Japan (or any other country in Asia) is taken to be 4
  4. distance from a node in a different data center is 6, e.g. India to the USA (or any country on another continent) distance is taken to be 6
A rack is a group of nodes, and a data center is a group of racks; the distance is found by walking up this hierarchy, as the sketch below shows.
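
To see why the numbers come out as 0/2/4/6, note that each node sits on a topology path like /datacenter/rack/node, and the distance is simply the number of hops from each node up to their closest common ancestor, summed. Below is my own toy version of that idea (the paths are made-up examples; Hadoop's real implementation lives in its NetworkTopology class):

```java
// A toy network-distance calculation over paths like "/dc1/rack1/node1".
// Not Hadoop's actual code, just the same counting idea.
public class DistanceSketch {

    // Hops from each node up to the closest common ancestor, summed.
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/dc1/rack1/node1", "/dc1/rack1/node1")); // 0: same node
        System.out.println(distance("/dc1/rack1/node1", "/dc1/rack1/node2")); // 2: same rack
        System.out.println(distance("/dc1/rack1/node1", "/dc1/rack2/node3")); // 4: same data center
        System.out.println(distance("/dc1/rack1/node1", "/dc2/rack9/node7")); // 6: different data center
    }
}
```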



Replica Placement:

Hadoop's default strategy is:
  1. place the first replica on the same node as the client, network distance zero. If the client is not a datanode, a node that is not too busy or too full is chosen at random (the network distance then depends on which node gets picked).
  2. place the second replica on a node in a different rack (any rack), network distance 4 from the first.
  3. place the third replica on a different node in the same rack as the second replica, network distance 2 (the reference node is now the node from step 2).
  4. further replicas, if created, are placed on random nodes, while trying not to crowd too many replicas onto one rack. A simplified sketch of the whole strategy follows this list.
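
Here is that simplified sketch, self-contained and runnable (the real logic is HDFS's BlockPlacementPolicyDefault, which also weighs free space and load; the Node type and helper methods here are made up for illustration, and the sample cluster assumes at least two nodes per rack):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// A simplified sketch of the default placement strategy described above.
public class PlacementSketch {
    static final Random RAND = new Random();

    record Node(String name, String rack) {}

    // Random candidate whose rack does/doesn't match 'rack', excluding 'avoid'.
    static Node pick(List<Node> nodes, String rack, boolean sameRack, Node avoid) {
        List<Node> candidates = new ArrayList<>();
        for (Node n : nodes) {
            boolean rackOk = (rack == null) || (n.rack().equals(rack) == sameRack);
            if (rackOk && !n.equals(avoid)) candidates.add(n);
        }
        return candidates.get(RAND.nextInt(candidates.size()));
    }

    static List<Node> choosePlacement(List<Node> cluster, Node client, int replication) {
        List<Node> targets = new ArrayList<>();
        // 1. First replica: on the client if it is a datanode, else a random node.
        targets.add(cluster.contains(client) ? client : pick(cluster, null, true, null));
        if (replication > 1) {
            // 2. Second replica: any node in a rack different from the first.
            targets.add(pick(cluster, targets.get(0).rack(), false, null));
        }
        if (replication > 2) {
            // 3. Third replica: a different node in the second replica's rack.
            targets.add(pick(cluster, targets.get(1).rack(), true, targets.get(1)));
        }
        while (targets.size() < replication) {
            // 4. Further replicas: random nodes not already used.
            Node n = pick(cluster, null, true, null);
            if (!targets.contains(n)) targets.add(n);
        }
        return targets;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("n1", "rack1"), new Node("n2", "rack1"),
                new Node("n3", "rack2"), new Node("n4", "rack2"),
                new Node("n5", "rack3"), new Node("n6", "rack3"));
        System.out.println(choosePlacement(cluster, cluster.get(0), 3));
    }
}
```

With n1 as the client, a typical run prints something like [Node[name=n1, rack=rack1], Node[name=n5, rack=rack3], Node[name=n6, rack=rack3]]: one replica local, two together on another rack.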

FYI, you can design your own replica placement policy, write your own code, and plug it into Hadoop (at the time of writing, that means extending HDFS's BlockPlacementPolicy class and pointing the dfs.block.replicator.classname property at your implementation). So go on and play... but be careful about the tradeoffs.

Hope you got it... If not, you can ask questions in the comments section... Happy learning :)