Let's pick up where we left off last time.
Let's elaborate on how big data solves the following problems:
1. The problem of storage: Suppose you want to write a novel, and you start writing it in a cheap notebook. Eventually the notebook is too small for the whole story. There are two ways to solve the problem:
- Add more pages to the notebook, and when that is no longer possible, buy a new notebook that is bigger, thicker and more expensive. Then copy everything from the original notebook into the new one, and only then resume writing your novel. This is called vertical scaling. It is a pretty silly way, but that is the traditional approach with servers.
- Buy another cheap notebook and just resume your writing. Go on adding as many notebooks as you want, and keep an index. The index tells you which notebook holds which part of the novel, or what the sequence of the notebooks is. This is called horizontal scaling. It is the obvious and efficient way, and it is the one big data systems use.
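To make the second option concrete, here is a minimal sketch of the notebook-plus-index idea in Python. The class names `Notebook` and `Novel` and the tiny capacity are made up for illustration; real big data systems are far more involved, but the shape is roughly this: when one notebook fills up, add another and record in the index where each page went.

```python
class Notebook:
    """One cheap, fixed-capacity storage unit (a hypothetical stand-in for a node)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = []

    def is_full(self):
        return len(self.pages) >= self.capacity


class Novel:
    """Spreads pages over many notebooks and keeps an index of where each page lives."""

    def __init__(self, notebook_capacity):
        self.notebook_capacity = notebook_capacity
        self.notebooks = [Notebook(notebook_capacity)]
        self.index = {}  # page number -> (notebook number, position in that notebook)

    def write_page(self, text):
        current = self.notebooks[-1]
        if current.is_full():
            # Horizontal scaling: add another cheap notebook, copy nothing.
            current = Notebook(self.notebook_capacity)
            self.notebooks.append(current)
        current.pages.append(text)
        page_number = len(self.index)
        self.index[page_number] = (len(self.notebooks) - 1, len(current.pages) - 1)
        return page_number

    def read_page(self, page_number):
        # The index tells us which notebook to open; no other notebook is touched.
        notebook_number, position = self.index[page_number]
        return self.notebooks[notebook_number].pages[position]


novel = Novel(notebook_capacity=3)
for i in range(7):
    novel.write_page(f"page {i}")
```

Seven pages with a capacity of three end up in three notebooks, and reading any page only opens the one notebook the index points to. Nothing already written is ever copied, which is the whole point compared to vertical scaling.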
2. The problem of handling requests: Because big data systems are composed of multiple computers (called nodes), there are more handlers for requests, so a single machine is no longer your bottleneck. The big data synchronizer sends each request to the nearest free node or, in the best case, to the nearest free node that already holds the data the request needs.
Suppose a person grows different kinds of vegetables and has different people working on different vegetables.
- If he has one farm with one entry and one exit, there is a queue of workers waiting to enter the farm. Besides, to reach a particular vegetable section, workers have to walk through all the sections in between. The workers are requests (data access requests, to be more precise), and the vegetable sections are different databases or directories. This is how traditional systems cause long delays.
- If he has multiple smaller farms, say one for each type of vegetable (the best case), there are multiple entries and exits. Each worker can simply enter and start working, with no delay from walking through other sections to reach the desired one.
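The routing rule described above (nearest free node, and ideally one that already has the data) can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `Node` fields, the integer `distance`, and the function name `pick_node` are made up, and a real system would weigh many more factors.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    distance: int                                # hypothetical measure of how far the node is
    datasets: set = field(default_factory=set)   # data this node holds
    busy: bool = False


def pick_node(nodes, dataset):
    """Prefer the nearest free node that holds the data; otherwise the nearest free node."""
    free = [n for n in nodes if not n.busy]
    if not free:
        return None  # every node is busy, so the request has to wait
    with_data = [n for n in free if dataset in n.datasets]
    candidates = with_data or free
    return min(candidates, key=lambda n: n.distance)


nodes = [
    Node("node-a", distance=1, datasets={"carrots"}, busy=True),
    Node("node-b", distance=2, datasets={"tomatoes"}),
    Node("node-c", distance=3, datasets={"carrots"}),
]

chosen = pick_node(nodes, "carrots")
```

Here a request for `"carrots"` skips `node-a` because it is busy and goes to `node-c`, which is free and holds the data, even though `node-b` is nearer. A request for data that no free node holds simply falls back to the nearest free node.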
3. The problem of distributing computation: Big data systems distribute not just the requests but also the data. They distribute not only whole directories or databases but also the single units of computation, such as a file or a table. So even if your file is too large to fit in the memory of one system, it doesn't matter, because it no longer lives on one system; it is split and distributed across different systems, with indexes. If you want to access only a part of your file or table, you don't need to read the whole file and seek to that point. And if you want to do some manipulation, your job works on different parts of the file at the same time, so the work is not only distributed but also parallel.
Say there is a log of all the events happening on Amazon in one day, and you want to find the average sales of each product per hour.
- Traditional systems read the file in chunks (of a size that fits into memory at one time), then do the grouping and aggregation on each chunk sequentially. This results in a lot of I/O delay and processing delay, and hence is really time consuming.
- Big data systems split the log file into fixed-size chunks and give each chunk to a different node. The grouping and aggregation are then done on different nodes in parallel. Each chunk fits into its node's memory, so I/O delays are much smaller, and because the process is not sequential, processing delays are much smaller too.
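Here is a small, hedged sketch of that split-aggregate-merge flow in Python. It is not how any particular big data framework does it: threads from `concurrent.futures` stand in for nodes, the event format `(hour, product, amount)` is invented, and "average sales" is taken to mean the average sale amount per product per hour. One detail worth noticing is that each chunk returns a sum and a count rather than an average, because averages from different chunks cannot be combined directly.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(events, chunk_size):
    """Cut the log into fixed-size chunks, one per (simulated) node."""
    return [events[i:i + chunk_size] for i in range(0, len(events), chunk_size)]


def aggregate_chunk(chunk):
    """Work done on one node: partial [sum, count] per (product, hour)."""
    partial = defaultdict(lambda: [0.0, 0])
    for hour, product, amount in chunk:
        entry = partial[(product, hour)]
        entry[0] += amount
        entry[1] += 1
    return partial


def merge_partials(partials):
    """Combine the per-node sums and counts, then divide to get the averages."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (amount_sum, count) in partial.items():
            totals[key][0] += amount_sum
            totals[key][1] += count
    return {key: s / c for key, (s, c) in totals.items()}


def average_sales(events, chunk_size=2, workers=4):
    chunks = split_into_chunks(events, chunk_size)
    # Threads play the role of nodes here; each one aggregates its own chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(aggregate_chunk, chunks))
    return merge_partials(partials)


# Made-up sample log: (hour, product, amount)
events = [
    (9, "book", 10.0),
    (9, "book", 20.0),
    (9, "pen", 2.0),
    (10, "book", 30.0),
    (10, "pen", 4.0),
    (10, "pen", 6.0),
]
averages = average_sales(events)
```

With this sample log, `averages[("book", 9)]` is `15.0` and `averages[("pen", 10)]` is `5.0`, no matter how the events happen to be split across chunks.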
Hope this time you got the whole concept, not just the gist. If you are not convinced, just comment and ask; I'll be more than happy to help.
Also, please share your valuable suggestions.
Stay tuned for more...