I often hear this question from my customers and around the internet, especially in the last year. So what is myth and what is real about Spark and its place in the "Big Data" ecosystem?
To be honest, the question I put in the title is wrong, but it is usually the way it is asked. Hadoop is not a single product, it is an ecosystem. The same is true for Spark. Let's cover them one by one. The main pillars of the Hadoop ecosystem at the moment are:
- HDFS – Hadoop Distributed Filesystem. It is a distributed, block-oriented, non-updatable, highly scalable filesystem capable of running on clusters of commodity hardware. It is a standalone tool and can be used without any other ecosystem components (except for ZooKeeper and the Quorum Journal Manager if you want your HDFS to be highly available, but that is a different story);
- MapReduce Framework – a basic framework for distributed computing on a cluster of commodity hardware. You are not obligated to work only with HDFS – the filesystem is pluggable. You are also not obligated to use YARN with it: you can replace it with Mesos, for instance – the resource manager is also pluggable;
- YARN – the default resource manager for an Apache Hadoop cluster. But again, you can run the cluster without YARN, by starting your MR jobs under Mesos or by just running HBase, which does not require YARN;
- Hive – a SQL-like query engine on top of the MapReduce framework, capable of translating HiveQL queries into a series of MapReduce jobs running on the cluster. But again, you are not obligated to use HDFS as the storage layer, and you are not obligated to use the MapReduce framework, which in this case can be replaced with Tez;
- HBase – a key-value store on top of HDFS that delivers OLTP capabilities to Hadoop. Its only dependencies are HDFS and ZooKeeper. But is HDFS really a dependency of HBase? No: it can also run on top of Tachyon (an in-memory filesystem), MapR-FS, IBM GPFS and others.
And that's all. You might also consider Storm for processing streams of data, but it is completely independent of Hadoop and can run separately. You could also mention Mahout for machine learning on top of MapReduce, but it has been steadily losing community attention over the last year.
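To make the MapReduce model from the list above concrete, here is a minimal single-machine sketch of its three phases in plain Python, using the classic word-count task. All function names here are my own illustration, not a Hadoop API; a real MapReduce job distributes each phase across the cluster, with the shuffle moving data between nodes.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word, like a WordCount mapper."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key. On a cluster this is the step
    that moves data over the network between mapper and reducer nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop is an ecosystem", "Spark is an ecosystem too"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["ecosystem"])  # 2
```

This same shape is what Hive produces under the hood: a HiveQL `GROUP BY` is compiled into map, shuffle and reduce steps like these.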
Now let's cover Spark. Here is what it offers in general:
- Spark Core – an engine for general-purpose distributed data processing. It has no dependencies and can run on any cluster of commodity servers;
- Spark SQL – a SQL query engine on top of Spark, supporting a subset of SQL functionality and HiveQL. It is not yet mature enough to be used in production systems. HiveQL integration requires the Hive Metastore and Hive jar libraries;
- Spark Streaming – a micro-batch data processing engine on top of Spark, supporting data ingestion from a variety of source systems. Its only dependency is the Spark Core engine;
- MLlib – a machine learning library built on top of Spark, supporting a range of data mining algorithms.
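The "micro-batch" idea behind Spark Streaming can be sketched in a few lines of plain Python: the incoming stream is sliced into small batches, and each batch is then processed with ordinary batch logic. Note two simplifications in this sketch: Spark Streaming slices by time interval rather than by record count, and it works on distributed RDDs rather than Python lists; the function name `micro_batches` is my own.

```python
import itertools

def micro_batches(stream, batch_size):
    """Slice a (possibly unbounded) event stream into fixed-size batches,
    analogous to how Spark Streaming slices a stream into small RDDs."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each micro-batch is then handled with ordinary batch logic (here: a sum).
events = [1, 2, 3, 4, 5, 6, 7]
results = [sum(batch) for batch in micro_batches(events, 3)]
print(results)  # [6, 15, 7]
```

This is the essential trade-off of the micro-batch model: you reuse the batch engine unchanged, at the cost of per-batch latency rather than per-event latency.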
And let's also cover one important myth about Spark: "Spark is an in-memory technology". Spark is not an in-memory technology. It is a pipelined execution engine: it will write data to disk during a "shuffle" (if you need to aggregate data based on some field, for instance) and will spill to disk when it does not have enough memory (this behavior can be tuned). So Spark is faster than MapReduce mostly because of its pipelining capabilities, not because of some cool "in-memory" optimizations. Of course, it has in-memory caching that improves performance, but that is not the main reason why Spark is fast.
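The pipelining point can be illustrated with plain Python generators, which behave like Spark's pipelined narrow transformations. A chain of map/filter steps processes one record at a time with no intermediate dataset materialized, while an aggregation by key is a barrier that must see every record first; that barrier is where Spark writes shuffle files and spills. This is an analogy in stdlib Python, not Spark code.

```python
from collections import defaultdict

# Narrow transformations (map, filter) can be pipelined: each record flows
# through the whole chain lazily, with no intermediate collection built.
records = range(10)
doubled = (x * 2 for x in records)          # "map"
big = (x for x in doubled if x > 4)         # "filter"
first = next(big)  # only as many records as needed have been computed
print(first)  # 6

# An aggregation by key is a barrier: every record must be seen and grouped
# before any result can be emitted. In Spark this is the shuffle point,
# where data is written to disk (and spilled if memory runs short).
by_key = defaultdict(int)
for x in range(10):
    by_key[x % 2] += x
print(dict(by_key))  # {0: 20, 1: 25}
```

The speedup over MapReduce comes from fusing the narrow steps into one pass, instead of writing a full intermediate dataset between every pair of jobs.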
Now let’s see the whole picture.
- MapReduce can be replaced with Spark Core. Yes, it can be replaced over time, and this replacement seems reasonable. But Spark is not yet mature enough to fully replace this technology. Moreover, no one will completely give up on MapReduce until all the tools that depend on it support an alternative execution engine. For instance, there is ongoing work to support Pig script execution on top of Spark;
- Hive can be replaced with Spark SQL. Yes, this is again true. But you should understand that Spark SQL is even younger than Spark itself; this technology is less than a year old. At the moment it can only toy with the mature Hive technology; I will revisit this comparison in 1.5–2 years. As you remember, 2–3 years ago Impala was the "Hive killer", but now both technologies live side by side and Impala still hasn't killed Hive. The same thing will happen here;
- Storm can be replaced with Spark Streaming. Yes, it can, but to be fair Storm is not a piece of the Hadoop ecosystem, as it is a completely independent tool. They target slightly different computational models, so I don't think Storm will disappear, but it will continue to live as a niche product;
- Mahout can be replaced with MLlib. To be fair, Mahout is already losing market share, and over the last year it has become obvious that this tool will soon be dropped from the market. Here you can really say that Spark has replaced something from the Hadoop ecosystem.
So, in general, the takeaways of this article are the following:
- Don't be fooled by the Hadoop distribution packaging of the "Big Data" vendors. The stacks they push to the market are not the ultimate truth. Hadoop was initially designed as an extensible framework with lots of replaceable parts. Replacing HDFS with Tachyon, YARN with Mesos, MapReduce with Tez and running Hive on top of it – would that be an "alternative" Hadoop stack or a completely separate solution? And if we keep MR instead of Tez, would it still be "the Hadoop"?
- Spark does not provide you with a complete alternative stack. It allows you to integrate its capabilities into your Hadoop cluster and benefit from them without completely dropping your old solution;
- Spark is not yet mature enough. I think in 3–4 years we will stop calling it the "Hadoop stack" and will call it the "Big Data stack" or something like that, because by then we will have a broad choice of different open-source products that can be combined to work together as a single stack.