Welcome back.
Last time I talked about Hadoop's components, HDFS and MapReduce. This time let's start with the basics of the first component: HDFS.
Hadoop Distributed File System, or HDFS, is, as the name suggests, a distributed file system. It is based on a master-slave architecture and is composed of three components:
- Namenode: The master part of the file system. Only one node in the cluster can serve as the namenode. It not only manages the datanodes but also keeps the index (metadata) of the file system. Just like any other file system, HDFS cannot function if the index is unavailable, which makes the namenode the most important part of the system.
- Datanode: The slave part of the file system. Ideally you can add as many datanodes as you want (though, like any other system, it is never quite ideal in practice). They should all have the same hardware configuration to keep the system balanced. (Note that the namenode should have a superior hardware configuration, though the same will also do.) Datanodes are the means to horizontally scale your system, and the crash of a datanode can be handled easily.
- Secondary Namenode: It keeps a copy of the namenode's file system image along with the edit logs, and periodically merges the image with the edit logs to prevent the logs from growing too large. The merged image can be used to recover in the event of a namenode failure, but the secondary namenode cannot act as the namenode itself. Also, since the image is only updated at intervals, any edits made between the last merge and the failure will be lost; some data loss is therefore almost certain if the namenode goes down.
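To make the master-slave picture a bit more concrete, here is a minimal sketch (my own example, not from the Hadoop docs) of a client using Hadoop's Java FileSystem API. The namenode address hdfs://namenode-host:9000 is a placeholder; in a real cluster it comes from fs.defaultFS in core-site.xml. The point is that the client asks the namenode only for metadata, while the file contents live on the datanodes.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsFiles {
    public static void main(String[] args) throws Exception {
        // The client asks the namenode for metadata (the "index");
        // the actual file bytes are streamed from the datanodes.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);

        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  " + status.getLen()
                    + " bytes, replication=" + status.getReplication());
        }
        fs.close();
    }
}
```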
HDFS makes some basic assumptions:
- It's a write-once, read-many-times system: The first and foremost thing to understand is that HDFS can keep all the data you need it to store, but it can't change your data. To be more precise, once you put a file in HDFS it can't be modified, because that would be a very costly operation (a short sketch after this list illustrates the write-once pattern).
- Datanodes are bound to fail: As HDFS uses commodity hardware, it assumes that datanodes will fail.
- To handle failures, HDFS keeps multiple replicas of the same file on different datanodes. The number of replicas is configurable; it depends on a number of factors, but most simply it is a trade-off between how much space you are willing to spend on replicas and how available you want that file to be. (Please NOTE that the more replicas a file has, the greater the chance that you will find one of them on the same node, or a neighbouring node, where you are working. See the configuration sketch after this list.)
- To detect failures, datanodes send heartbeat signals to the namenode every few seconds (configurable, default is 3). If the namenode doesn't receive these signals for some interval (again configurable), it considers that datanode dead (crashed) and starts taking appropriate measures (I will explain the measures in a later post).
- Small number of large files: It's better to have a small number of large files in HDFS rather than a large number of small files. The namenode acts as the index for all the files in HDFS, and the index for a large number of files takes up more space and fills the namenode's memory rapidly. There are workarounds in HDFS for this situation, but it's still better to avoid it.
- File corruption can be detected, not corrected: HDFS can detect corrupted files but can't correct them.
- To detect corruption: HDFS uses a cyclic redundancy check (CRC) to determine whether a file is correct or corrupted. HDFS keeps a .crc checksum file alongside the original file for this check. This .crc file is really small and doesn't take up much space; its storage overhead is less than 1%.
- To handle corruption: HDFS simply puts the corrupted file on the corrupted-file list so that it doesn't direct any client to it, and then creates one more replica of the file from another uncorrupted, healthy replica.
- Hardware configurations and job requirements are well known: HDFS (in fact the whole Hadoop framework) has a lot of configuration parameters. Every configuration parameter has a different impact on the system as a whole, and every configuration involves a trade-off (a situation in which you must choose between or balance two things that are opposite or cannot be had at the same time, by the dictionary definition). For example:
- There is a configuration parameter to indicate whether files have to be compressed or not. The trade-off is between space and processing: if you compress your files they take up less space but need more processing, as you have to decompress them to read them.
- There is a configuration parameter for setting the heartbeat signal interval. If you set it too low, your network is used too much; if you set it too high, your namenode might be slow to notice that a datanode has crashed and may direct some clients to a datanode that is no longer available.
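To see what write-once means in practice, here is a small, hypothetical sketch using the Java FileSystem API (the file path and namenode URI are placeholders): you create a file, write it, and close it; after that you can read it as many times as you like, but there is no call to modify its bytes in place.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"),
                                       new Configuration());
        Path file = new Path("/data/events.log");

        // Write the file once...
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("first and only version of this record\n");
        }

        // ...after this point the file can be read many times, but there is
        // no API call to edit bytes in the middle of it. To "change" it you
        // delete it and write a new file.
        fs.close();
    }
}
```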
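And to make the replication and heartbeat trade-offs concrete, here is a sketch naming the relevant knobs. dfs.replication and dfs.heartbeat.interval are standard HDFS properties, but in practice they are set cluster-side in hdfs-site.xml; putting them in a client Configuration here is only to show where the numbers live, and the values and path are just illustrative.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Normally set in hdfs-site.xml on the cluster; shown here only to name the knobs.
        conf.setInt("dfs.replication", 3);          // 3 copies of every new block
        conf.setInt("dfs.heartbeat.interval", 3);   // datanode heartbeat, in seconds

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);

        // Replication can also be changed per file after the fact:
        // more replicas cost more disk but make the file more available.
        fs.setReplication(new Path("/data/events.log"), (short) 5);
        fs.close();
    }
}
```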
Stay tuned.. Keep learning... Meanwhile please let me know whether you like it or not... :)