A few days ago, I had written a post on The Big Data Problem which attempted to understand why we need big data and what the fuss is all about. You may want to read it here.
Having understood why we need big data, let’s understand how we can go about analyzing the same. What is the way out to do analysis on big data? The solution is Streaming…Hadoop Streaming. James Bond style
To recount a personal experience, I was faced with the following Data Analysis task — Scaling a particular analytics technique on retail data from one store to all stores across the US. Sounds interesting so far, doesn’t it? The catch however, was that the program took 2 days just to compute the results for 1 department.
Imagine, the time it would take if we were restricted to a single machine/Rstudio.
Let’s understand the gravity of the problem and get an idea of a rough time estimate. So for one department the time taken was two days. Let us assume, each store has close to 100 departments and we have a total of 5000 stores. So the total time taken for all of the stores would be roughly 2 X 100 X 5000 = 1 million days = 2740 years. This is just a rough estimate assuming a linear relationship between the number of departments and the time taken.
Obviously, using RStudio stand alone was not a feasible solution. We had at our disposal Hadoop cluster. We wanted to see if we could exploit the number of machines in our cluster to solve this Herculean challenge of the Big Data world.
A fresh college graduate, I was a complete stranger to the world of Hadoop and out of anxiety, I purchased a copy of Hadoop-The definitive guide. I started going through the chapters. It was in no time that I realized this book was meant for someone with an understanding of Java. Unfortunately, I was not one of them. Saddened and almost defeated. What other choice do I have? I came across this wonderful concept — Hadoop streaming.
Wow! This looks like it could solve the problem. I started reading more about it on the internet. There were not many resources on this technique. Fortunately, a colleague of mine had worked on Hadoop Streaming earlier and with his help we were able to accomplish the task successfully.
Now that you have the context, let me try to answer some basic technical questions.
So what exactly is Hadoop streaming?
Hadoop streaming — ‘hadoop’ and ‘streaming’. We use the availability of machines in the cluster to process data in parallel. Let me elaborate. When we do our analysis on Rstudio, we just have one machine (our laptop). Now imagine we have 100s of such machines in our cluster. We can take advantage of this and distribute our input data in such a way that each machine gets some portion of your input data and they are processed in parallel.
Now, you may ask what is streaming here.
The input data is sent to each of the machines in the cluster via stdin() (standard input) and the analyzed output is thrown at stdout() (standard output). Your input data is streamed via stdin and the output gets flushed out to stdout.
What languages can I use for Hadoop streaming?
You can use any scripting language — R, Python, Ruby. Care should be taken to choose your scripting language. The analysis you are planning to do on your data should decide your choice of language. Say, we want to do a forecasting of sales at item level for a retail store. We know that R is the suitable choice to use when it comes to do forecasting as we have well-developed packages in R which is not the case with Ruby. So do put some thought into choosing the language.
I hope we are now clear on the basics of Hadoop Streaming and its benefits.
Hadoop Streaming is not the answer to all your Big Data analysis problems. It can be used only for cases where each machine can independently perform its analysis with a small portion of input data that is fed to it.
Example where Hadoop Streaming can be used
Sales forecasting at item level. Say, we have weekly data for 2 years at item level. And say there are 10,000 items for which we want to do the forecasting. Each item has close to 104 rows (2 years weekly data). So our input data has close to 104 X 10,000 = 1,040,000 ~ 1 million rows.
Assume we have 100 machines in our cluster. What we do next is pretty intuitive. We distribute our data such that each machine receives 1 item at a time, and once it is done processing that item, we send the next item to it. So, in a single go, we will have 100 items processed across the cluster. In, the next go, again more 100 items will be processed and this goes on until all the 10,000 items are processed.
Example where Hadoop Streaming won’t work
Clustering. In clustering we need to find the hidden patterns that are there in the complete dataset. A smaller portion of the dataset can’t be sent to a machine to do the analysis because the purpose will not be served.
I think this is worth mentioning.
In Hadoop Streaming, you are not writing an org.apache.hadoop.mapred.Mapper class! This is just a simple script that reads rows from stdin (columns separated by ‘\t’ or any delimiter) and should write rows to stdout (again, columns separated by ‘\t’ or other delimiter). It’s probably worth mentioning this again but you shouldn’t be thinking in traditional map-reduce Key Value terms, you need to think about columns.
You can write your script in any language you want, but it needs to be available on all machines in the cluster. Any easy way to do this is to take advantage of the Hadoop distributed cache support, and just use add file /path/to/script within hive. The script will then be distributed and can be run as just ./script (assuming it is executable) Enough theoretical stuff. The interesting part is the code framework of mapper and reducer that I will explain in the next post.
I hope you feel more educated on big data after reading this. Please leave a comment if you have questions/insights. I will reply as soon as possible.