Sunday, 12 July 2015

Why Big Data?

If, after my last post, you've decided that you actually need big data, then let's start the deep dive, level I.

When do you need big data, or when do your traditional systems fail:

As the data size increases, the problem of storing it arises. This is usually solved by adding more disk space. Then comes the problem of computing on such huge data, so you add more RAM. When that stops working, you switch to servers with loads of storage and RAM.

Then starts the problem of handling requests. As the organization grows or the website gets more popular, the number of users increases, and with it the number of requests for data access. But your server, no matter how powerful, is after all a single device, and there's a limit on how much it can handle at once. It then becomes a bottleneck (the most used and boring term in the books). It's as if hundreds of cars are trying to enter a city which is big enough to accommodate the cars and its people (no problem there), but the highway into the city is jam-packed.

Now you need a system that distributes the requests and data. Here comes your distributed RDBMS: you still have the benefits of an RDBMS, but your databases are now spread across different systems. Your requests get distributed to those systems, each of which processes the requests it receives and returns the data you want. It's like creating different entrance routes to different parts of the city: the traffic gets divided between these entrances on the basis of the part of the city a person wants to go to. But this distributes the requests, not the computation.
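To make the "different entrances to the city" idea a bit more concrete, here is a minimal sketch in Python. The database names, server names and the `route_request` helper are all made up for illustration; a real distributed RDBMS does this routing for you in its own way.

```python
# Hypothetical layout: each database lives on exactly one server.
DATABASE_HOSTS = {
    "orders": "db-server-1",
    "customers": "db-server-2",
    "inventory": "db-server-3",
}


def route_request(database_name):
    """Return the server that should handle a request for this database."""
    try:
        return DATABASE_HOSTS[database_name]
    except KeyError:
        raise ValueError(f"unknown database: {database_name}") from None


# Four incoming requests, each asking for data from one database.
requests = ["orders", "customers", "orders", "inventory"]
routed = [route_request(name) for name in requests]
print(routed)
```

Notice that both requests for "orders" end up on the same server. The traffic is spread out, but any heavy work on the "orders" data still happens on one machine.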

Now starts the ultimate problem: distributing your computation or data manipulation. Let's say you have a table with 100 GB of data and you have to do a computation on it, e.g. you want to do a sum along with a group by. While a distributed RDBMS set up as above distributes the databases, it doesn't split your table. As your table is on a single machine, your computation, no matter how big, is also on a single machine. Your system goes madly slow and you, of course, go mad. So none of the above solutions can help you. And here comes the ultimate rescuer, Big data. Say Hellooo people :)



Why Big Data:

Here's how big data will solve the problem. It will break your table into multiple splits, put them on multiple systems of your cluster (nodes of the cluster, to be precise), and the computation will be done on multiple nodes instead of one. Each node holds one part of the data and processes only that part, and the partial results are then combined into the final answer. Now not only your data and your requests, but also your computation is distributed.
Result: Neither you nor your system goes mad, as it's not one system at all, so no single machine is the limitation or bottleneck.
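
If you'd like to see the idea in code, here is a rough sketch of a sum with group by done over splits. It runs on one machine and uses threads to stand in for cluster nodes, so it only illustrates the flow and is not how any particular big data framework is implemented. The function names, the sample table and the choice of three "nodes" are all assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, num_splits):
    """Break the table into roughly equal splits, one per node."""
    return [rows[i::num_splits] for i in range(num_splits)]


def partial_sum(split):
    """Work done by one node: group by key and sum within its own split."""
    totals = defaultdict(int)
    for key, amount in split:
        totals[key] += amount
    return dict(totals)


def merge(partials):
    """Combine the per-node results into the final group-by totals."""
    totals = defaultdict(int)
    for partial in partials:
        for key, subtotal in partial.items():
            totals[key] += subtotal
    return dict(totals)


def distributed_group_sum(rows, num_nodes=3):
    splits = split_rows(rows, num_nodes)
    # Threads stand in for cluster nodes here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(partial_sum, splits))
    return merge(partials)


# A tiny made-up table of (region, amount) rows.
table = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]
result = distributed_group_sum(table)
print(result)
```

Each "node" only ever sees its own split, and the final step just adds up the small partial results. That is the part a single-machine table cannot give you.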

It's OK if you don't fully understand it yet; it's enough to get the gist of it. Stay tuned for more...
