Most Important functions provided by any file system is reading and writing. Obviously whats the point of having a storage house if you can't keep things and take them out.
Lets start with keeping things i.e Writing:
DataFlow:
- Creating a new File:
- Client creates a file, i.e makes an entry for a new file at its end. Like you decide that you need to put something in the storage house.
- Client makes a RPC (Remote Procedure Call) call to namenode to create a new file with no blocks associated with it. Its like calling the manager of the storage house to give you permission to store something.
- The namenode performs checks to make sure the file doesn’t already exist, and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file, i.e. the manager (or whoever does that) checks that you don't already have a space for the same item number, and that you are allowed to have a storage space (you can be a criminal... right????). If all checks are done and ok'ed, your package's entry is put into the file (lets say manager uses a entry file or paper book with receipts). Now manager has no role in your storing of the item, as he has other calls to attend too. He is just responsible for allocating space.
- Writing Data to it:
- As the client writes data, it is first broken down into packets (called blocks) of fixed size as configured (default is 64 MB) and added to data queue. Its like the package you want to put in storage house is collected from you and is divided into fixed size smaller packages and added to a queue for sending it to appropriate storage house unit.
- Now new blocks are allocated by the namenode on appropriate datanodes for the packets and their replicas. (As I told earlier hdfs keeps replicas of files for handling datanode failures.) Its like the manager checks out the space in each storage house unit and allocate the space for putting the packets. And gives the addresses for the units.
- For each packet: (Taking default replication factor 3)
- First the packet is written to the first datanode. Then it is sent to the second datanode and then to third datanode.
- Then third datanode sends acknowledgement to second, which inturn sends it to first and then first one send back the acknowledgement, and write process for that packet finishes.
- Failure Handling: If a datanode fails in writing process
- All the packets waiting to be acknowledged are cancelled to be written again.
- Present packet is left as it is. Later when file system finds out that the block is "under replicated" (having less than configured number of copies), it creates its replicas on its own.
- The failing datanode is reported to the namenode. When the datanode recovers, the partially written block is deleted. Also the datanode is removed from the pipeline for all writes.
- If the replicas created is less than dfs.replication.min, then the packet is written again.
API Version:
- Creating a file:
- The client creates the file by calling create() on DistributedFileSystem
- DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace, with no blocks associated with it
- The namenode performs various checks to make sure the file doesn’t already exist, and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException.
- The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.
- Writing Data to it:
- DFSOutputStream splits it into packets.
- Then DataStreamer asks the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.
- For each packet:
- The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline and then the second datanode stores the packet and forwards it to the third.
- DFSOutputStream maintains an acknowledgement queue. The third datanode sends acknowledgement to second, which in turn sends it to first and then first one send back the acknowledgement, and write process for that packet finishes.
- When the write process is done client calls close() on DistributedFileSystem.
Hope its all clear to you. Again.. If something is not clear then just comment and ask.
One very important topic related to file writing in hdfs is "Replica Placement".
I will explain it in my next blog... Until then, stay tuned.... :)