Big data has become a sensation these days. Anyone and everyone wants to use this in their discussions. When I was still in my college and preparing for campus placements, I had attended almost all the pre-placement talks that companies gave to its prospective candidates.
American Express was one such company that had talked extensively about big data and hadoop in their presentation. I remember clearly, the blank faces that all of us had. We were not able to follow a single term. Hadoop, clusters, big data, distributed environment and many other terms were just bouncers for me. I did try to google them up later but these terms were still alien to me.
I have worked at WalmartLabs for close to 2 years now and the work has mostly been on the big data side. Walmart is one of the largest companies when it comes to capturing the transactions data that we have on daily basis both at the stores and at the e-commerce. Having worked on the big data technologies a little, I felt I would give it a try to make big data and related questions a little easy to understand.
Following are few of the questions that I will try to explain:
- What is Big Data? And why should I care about it?
- What is the problem with a single laptop?
- What is a cluster?
- The data is distributed across the cluster. What is meant by this?
What is Hadoop and Spark?
Everyone is generating data. Data is the side-product of any work we do. Just like smoke is emitted when things burn, data is created when machines work or interact with each other. These machines used to emit data in the past as well, but we were not advanced enough to collect the data. The technologies weren’t capable of capturing these.
Today, every other thing is emitting data. Be it the signals from your refrigerator, the air-conditioners in your room, the cars on the street, the videos that you like or dislike on YouTube, your transactions at the nearby Super-market, the posts you make on Facebook. These are all emissions of your day-to-day work.
Companies and analysts want to understand you and your behavior using the data you generate. To give you an idea of how much data is generated, let’s look at the ridiculous amount of data some of the big giants produce on daily basis.
Facebook’s daily logs are more than 60 terabytes every day.
- Google’s web index is more than 10 petabytes of information.
- Millions of customers visit Walmart stores on daily basis. Imagine the size of these transactions data.
Do we need this data? Of course!
I strongly believe this.
Every data has a story. With proper analysis and techniques, we can get a lot from data. Now, there will be times when you won’t get anything insightful from it. I consider this also good. Maybe, there is something wrong with the way we are collecting the data and we need to have a look at it.
So far we know that every other thing is puking data. And they are puking in bulk. To get something good out of these pukes, we need to store them somewhere, do our analysis on them and get insights. How do we go about it?
Storage of this data has never been a problem. The disks are relatively cheap. A terabyte of disk costs ~ $35. So storing them on physical disks is not an issue. What is the problem then?
Do you have an idea how fast the data can be read from a disk? It is still in Mbs/sec. Let us assume a speed of 100 Mbs/sec. With this speed, the time taken to read 1 terabyte of data would be ~ 3 hours! Below is the calculation for 3 hours.
1 terabytes = 1000,000 Mb. So the total time ~ 1000,000/100 ~ 10000 seconds ~ 10000/3600 ~ 3 hours
3 hours is too huge a time to read 1 terabyte of data given the amount of data giants like Facebook generates on daily basis. Also, a single machine won’t be able to handle this much data. So what is the solution?
What if we could distribute the data across multiple machines? Yes, that sounds promising. The data that is big enough to handle by one machine can be distributed across multiple machines such that each machine has some portion of the actual data. Now these machines are interconnected with each other and so there is no issue of interactions between these machines. And that is exactly what a cluster is.
A cluster is a combination of many machines connected to each other. Think of it like this. There is a process by which, my computer and your computer can be connected to each other and we can have a cluster with two nodes or machines.
Now, that you have your machines connected with each other in the cluster, you can take advantage of the computation power of each of the machines. And thus, the task which one couldn’t even think of accomplishing by a single machine can now be very easily be done by a cluster of machines.
What is hadoop and Spark? And how are they different?
Hadoop is a framework that supports the storage and processing of large data sets in a distributed computing environment.It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Spark is a powerful analytics engine. It is a fast and general engine for large-scale data processing.
Hadoop provides both the data storage and processing power whereas Spark is meant for doing analysis and processing of big data.
One thing to notice is, Spark is about 10 to 100 times faster than the Hadoop MapReduce framework by making use of in-memory processing compared to persistence storage used by Hadoop.
Spark is a Swiss army knife of analytics world. There are various APIs- Python, R, Scala thorough which one can interact with the Spark framework. For machine learning algorithms, there is MlLib which can be used to perform some common analysis like regression, K-means clustering, and classification.
We will talk more about Spark in future posts. I am exploring it rigorously and plan to write out my understanding of it soon.
I hope you enjoyed reading this post and feel a little more familiar with big data now. Hit the share button if you would like your friends to read this.