I often hear this question from my customers and across the internet, especially in the last year. So what is myth and what is real about Spark and its place in the “Big Data” ecosystem?
To be honest, the question I put in the title is wrong, but that is usually how it is asked. Hadoop is not a single product; it is an ecosystem. The same goes for Spark. Let’s cover them one by one. The main pillars of the Hadoop ecosystem at the moment are:
- HDFS – Hadoop Distributed Filesystem. It is a distributed, block-oriented, non-updatable, highly scalable filesystem capable of running on clusters of commodity hardware. It is a standalone tool and can be used without any other ecosystem components (except for ZooKeeper and the Journal Manager if you want your HDFS to be highly available, but that is a different story);
- MapReduce Framework – a basic framework for distributed computing on a cluster of commodity hardware. You are not obligated to work only with HDFS – the filesystem is pluggable. You are also not obligated to use YARN with it: you can replace it with Mesos, for instance – the resource manager is also pluggable;
- YARN – the default resource manager for an Apache Hadoop cluster. But again, you can run the cluster without YARN, by starting your MR jobs under Mesos or by running HBase, which does not require YARN at all;
- Hive – a SQL-like query engine on top of the MapReduce framework that translates HiveQL queries into a series of MapReduce jobs running on the cluster. But again, you are not obligated to use HDFS as storage, and you are not obligated to use the MapReduce framework, which in this case can be replaced with Tez;
- HBase – a key-value store on top of HDFS that brings OLTP capabilities to Hadoop (a minimal usage sketch follows this list). Its only dependencies are HDFS and ZooKeeper. But is HDFS really a dependency for HBase? No, it can also run on top of Tachyon (an in-memory filesystem), MapR FS, IBM GPFS and others.
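To make the key-value point concrete, here is a minimal sketch of the HBase 1.x client API called from Scala. The table name “users” and the column family “info” are hypothetical, and the snippet assumes a running HBase cluster whose configuration (hbase-site.xml) is on the classpath:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseSketch {
  def main(args: Array[String]): Unit = {
    // Connect using the cluster settings found on the classpath (hbase-site.xml)
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("users"))   // hypothetical table

    // Write a single cell: row key "user-42", column family "info", qualifier "name"
    val put = new Put(Bytes.toBytes("user-42"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"))
    table.put(put)

    // Read the same cell back by row key
    val result = table.get(new Get(Bytes.toBytes("user-42")))
    val name = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")))
    println(s"name = $name")

    table.close()
    connection.close()
  }
}
```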
And that’s all for the main pillars. You might also consider Storm for processing streams of data, but it is completely independent of Hadoop and can run separately. You can also mention Mahout for machine learning on top of MapReduce, but it has been steadily losing community attention over the last year.
Now let’s cover Spark. Here is what it offers in general:
- Spark Core – an engine for general-purpose distributed data processing. It has no dependencies and can run on any cluster of commodity servers (a minimal usage sketch follows this list);
- Spark SQL – a SQL query engine on top of Spark, supporting a subset of SQL and of HiveQL. It is not yet mature enough to be used in production systems. HiveQL integration requires the Hive Metastore and the Hive jar libraries;
- Spark Streaming – a micro-batch data processing engine on top of Spark, supporting data ingestion from a variety of source systems. Its only dependency is the Spark Core engine;
- MLlib – a machine learning library built on top of Spark, supporting a range of data mining algorithms.
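To give a feel for what Spark Core provides, here is a minimal word-count sketch in Scala. The local master URL and the input path /tmp/input.txt are placeholders; the same code runs unchanged whether the data sits in a local file, HDFS or Tachyon:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs everything in this JVM; on a real cluster the master
    // URL would point at YARN, Mesos or a standalone Spark master instead.
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("/tmp/input.txt")   // storage is pluggable: file://, hdfs://, tachyon://, ...
      .flatMap(line => line.split("\\s+"))       // split each line into words
      .map(word => (word, 1))                    // pair every word with a count of 1
      .reduceByKey(_ + _)                        // sum the counts per word

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```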
Let’s also address one important myth about Spark: “Spark is an in-memory technology”. Spark is not an in-memory technology. It is a pipelined execution engine: it writes data to disk during a “shuffle” (if you need to aggregate data by some field, for instance) and it spills to disk when it does not have enough memory (this can be tuned). So Spark is faster than MapReduce mostly because of its pipelining capabilities, not because of some cool “in-memory” optimizations. Of course, it has in-memory caching that improves performance, but that is not the main reason Spark is fast.
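To illustrate where the disk writes come from, here is a hedged sketch with made-up data: the narrow filter/map steps are pipelined within one stage, while reduceByKey forces a shuffle whose intermediate files land on local disk, and caching is an explicit opt-in rather than the default behaviour:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-sketch").setMaster("local[*]"))

    // Made-up (user, amount) records, just for the illustration
    val events = sc.parallelize(Seq(("user-1", 10), ("user-2", 5), ("user-1", 7)))

    // filter/map are narrow transformations: they are fused into a single pass
    // over the data, nothing is written to disk between them.
    val cleaned = events.filter { case (_, amount) => amount > 0 }
                        .map { case (user, amount) => (user, amount.toLong) }

    // reduceByKey is a wide transformation: all values for a key must meet on
    // one node, so Spark writes shuffle files to local disk between the stages.
    val totals = cleaned.reduceByKey(_ + _)

    // Caching is explicit and optional – this is the "in-memory" part of Spark.
    totals.cache()
    totals.collect().foreach(println)

    sc.stop()
  }
}
```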
Now let’s see the whole picture.
- MapReduce can be replaced with Spark Core. Yes, it can be replaced over time, and this replacement seems reasonable. But Spark is not yet mature enough to fully replace this technology. Besides, no one will completely give up on MapReduce until all the tools that depend on it support an alternative execution engine. For instance, there is ongoing work to support Pig script execution on top of Spark;
- Hive can be replaced with Spark SQL. Yes, this is again true. But you should understand that Spark SQL is even younger than Spark itself – the technology is less than a year old. At the moment it can only toy with the mature Hive technology; I will revisit this in 1.5 to 2 years. As you remember, 2 or 3 years ago Impala was the “Hive killer”, yet both technologies are still living together and Impala still hasn’t killed Hive. The same will happen here;
- Storm can be replaced with Spark Streaming. Yes, it can, but to be fair Storm is not a piece of the Hadoop ecosystem – it is a completely independent tool. They target somewhat different computational models, so I don’t think Storm will disappear, but it will continue to live as a niche product;
- Mahout can be replaced with MLlib. To be fair, Mahout is already losing ground, and over the last year it has become obvious that this tool will soon be dropped from the market. Here you can really say that Spark has replaced something from the Hadoop ecosystem.
So, in general, the takeaways of this article are the following:
- Don’t be fooled by the “Big Data” vendors’ Hadoop distribution packaging. The stacks they are pushing to the market are not the ultimate truth. Hadoop was initially designed as an extensible framework with lots of replaceable parts. If you replace HDFS with Tachyon, YARN with Mesos and MapReduce with Tez, and run Hive on top of it, is it an “alternative” Hadoop stack or a completely separate solution? And if we keep MR instead of Tez, is it still “the Hadoop”?
- Spark does not provide you with a complete alternative stack. It allows you to integrate its capabilities into your Hadoop cluster and benefit from them, without completely dropping your old solution;
- Spark is not yet mature enough. I think that in 3-4 years we will stop calling it the “Hadoop stack” and will call it the “Big Data stack” or something like that, because by then the stack will offer a broad choice of different open-source products that can be combined to work together as a single stack.
The fact is, Spark is neither a complete nor an independent ecosystem. Tachyon is not a file system. SparkSQL doesn’t do everything that Hive, Impala and Drill can do. Hive can be targeted to run over Spark. Drill can be integrated tightly with Spark.
People have adopted the nomenclature “Hadoop” to refer to the entire ecosystem, and quite frankly, Spark is really just one (or three) of the components in that ecosystem. Over the next few years some components will ascend and some will disappear, but the ecosystem isn’t going to be turned inside out or completely replaced.
“The fact is, Spark is neither a complete nor an independent ecosystem” – it depends on how you understand the term “complete”. In fact, Hadoop also depends on the operating system – should we consider that a dependency too? Spark can be used in standalone mode, in which it does not depend on anything from the Hadoop ecosystem, reading data from Tachyon or NFS shares. It will depend on ZooKeeper for HA, but ZooKeeper is no longer just a part of Hadoop – it is much more than that now, and its use cases are not limited to Hadoop ones
“Tachyon is not a file system” – please prove it. Maybe you can say that “Tachyon is not a POSIX filesystem”, but you cannot say that it is not a filesystem in general. But again, it is still too young to expect more from it.
“SparkSQL doesn’t do everything that Hive, Impala and Drill can do” – OK, but do you expect complete SQL or HiveQL support from a six-month-old project? I don’t think so. They’ve already done a great job by integrating even subsets of the HiveQL and SQL functionality. Impala, for instance, didn’t support query data spilling for its first two years, and all the queries requiring too much memory simply failed. Be kind and wait, it will mature soon
“Hive can be targeted to run over Spark” – yes, that was called Shark. But then Databricks gave up on the Hive query parser and optimizer, rewrote everything, and it became SparkSQL
“Drill can be integrated tightly with Spark” – this is one of the general points of Spark: to have as many built-in integrations and functions as possible. It can be integrated with HDFS, Tachyon, HBase, Cassandra, MemSQL, etc. – the list is long, and new integrations keep being added
I wouldn’t say that Spark is a part of the Hadoop ecosystem. “Hadoop” was the first project that showed the community it could process huge amounts of data “just like Google” at no cost, and after that it became a synonym for “Big Data”. Now many different vendors offer you a “Hadoop Stack”, which in general is a set of independent open-source or proprietary components integrated together, but still referred to as a “Hadoop Stack”. For instance, in the offering of your company, MapR, you replace HDFS with MapR FS – should you still call it a “Hadoop Stack” after that? And what about the modifications of MapReduce made for better support of MapR FS – is it still Hadoop?
I think that at the moment there is an intention to rethink the approaches to processing “Big Data”, and soon Spark might become the next big thing, with new vendors starting to offer “Spark Stack” solutions, packaging it with HDFS and YARN
You are right on. Thanks for clarifying the myth about Spark being some kind of magical in-memory processing system. Most people miss the point that the real power of Spark comes from pipelining, which you cannot do in Hadoop MapReduce.
Alexey, can you briefly explain what “Spark is a pipelined execution engine” means?
It just means that one of the essentials of Spark’s good performance is its ability to pipeline computations. Unlike MapReduce, which puts data on disk after each step and cannot run a series of sequential mappers without materializing intermediate results, Spark can do this efficiently.
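As a rough illustration (the input path is just a placeholder): a chain of narrow transformations ends up in a single stage of the RDD lineage, so Spark streams each record through all of them in one pass instead of materializing every intermediate result the way a chain of MapReduce jobs would:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline-sketch").setMaster("local[*]"))

    // Three logical "mapper" steps; Spark fuses them into one stage and streams
    // every record through all three functions in a single pass over the input.
    val pipelined = sc.textFile("/tmp/input.txt")
      .map(_.toLowerCase)
      .filter(_.nonEmpty)
      .map(_.length)

    // In MapReduce each of these steps would be a separate job that writes its
    // full output to HDFS before the next one could start.
    println(pipelined.toDebugString)   // the lineage shows no shuffle boundary, i.e. a single stage

    sc.stop()
  }
}
```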