Wednesday 8 July 2015

Big Data: Do You Need It?

Big data is one of the most misunderstood terms nowadays. Every article and book on big data starts with how much data has become available. Facts and figures are given on how dramatically data volume has increased over the years with the internet boom, and all of this is said to have created the need for big data technologies. That's not entirely wrong, but it is misleading, and it gets misinterpreted big time.

It is often assumed that the need for big data arose from the need to store that much data. But actually it wasn't just the storage; the computation was the biggest problem. The need for big data is defined by the amount of data you need to process at a time.

Let's start with an example. There is an e-commerce company with lots of data about everything, from products, their prices, and categories to user profiles and user logs. They use an RDBMS to store their product and user data, and simple log files for user logs.
They have loads of data, probably a few hundred GB, for products and users. Now, do they need big data for this? The answer, surprisingly, is no. The data runs to hundreds of GB, but they don't need to process it all at once. At most they probably need to join 4-5 tables (or fewer if they have an efficient schema). Big data will harm them more than help them; they can simply use a distributed RDBMS (no synchronization needed).
On the other hand, suppose they decide to analyze their user logs to find patterns that inform their marketing strategy. That is a real choice: whether to use big data or not depends on the volume of logs per day and how far back in history they want to go. If they are popular and have GBs of log data per day, and they:
  1. want to analyze data for just one day: just load the relevant part into the RDBMS every hour or so
  2. want to analyze data for a month: an RDBMS probably won't work in their case (it will still work if the logs are small, just 2-3 GB)
  3. want to analyze data for one or more years: it is better to switch to big data tools, since it is too cumbersome to handle computations on such huge data on one system.
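The three cases above come down to simple arithmetic: log volume per day multiplied by the number of days one analysis has to look at. Here is a rough sketch of that estimate. The function names and the 100 GB single-system limit are assumptions made up for illustration, not numbers from any benchmark; the real limit depends on your hardware, schema, and queries.

```python
def working_set_gb(gb_per_day: float, days: int) -> float:
    """Log volume (in GB) that a single analysis has to process at once."""
    return gb_per_day * days


def suggest_tooling(gb_per_day: float, days: int,
                    single_system_limit_gb: float = 100.0) -> str:
    """Suggest a tool class from the size of the working set.

    single_system_limit_gb is a hypothetical cutoff chosen for this sketch;
    tune it to what your own RDBMS setup can comfortably handle.
    """
    if working_set_gb(gb_per_day, days) <= single_system_limit_gb:
        return "rdbms"
    return "big data tools"


# Assumed example: a popular shop producing 5 GB of logs per day.
for label, days in [("one day", 1), ("one month", 30), ("one year", 365)]:
    print(label, working_set_gb(5, days), "GB ->", suggest_tooling(5, days))
```

The point of the sketch is that the decision is driven by the data processed per analysis, not by the total data the company stores.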

Thus the biggest question everyone needs to ask before starting the switch to big data is whether it is actually needed. Answering it right might save a whole lot of trouble.
