I’d start with a bold statement: Hadoop is rapidly losing momentum. We can see it in the following Google Trends chart:
This graph is surprisingly similar to the diagram of the hype cycle:
It looks like Hadoop is on the downhill slope of the hype graph, heading straight for the trough of disillusionment. We all know that one of the most important pillars has fallen recently: Cloudera has merged with Hortonworks, which means the two largest Hadoop players on the market are now a single company. And despite this merger, Cloudera is far from successful on the stock market:
Essentially, the market is left with one major Hadoop player, Cloudera. And what if I told you that Cloudera has not been about Hadoop for a long time? Here is one interesting thing: the number of times the word “Hadoop” appears on the Cloudera front page (http://cloudera.com) over the years, according to the Internet Archive:
- 2008 – 4 times
- 2009 – 11 times
- 2010 – 29 times
- 2011 – 37 times
- 2012 – 23 times
- 2013 – 9 times
- 2014 – 4 times
- 2015 – 8 times
- 2016 – 6 times
- 2017 – 1 time (footnote)
- 2018 – 1 time (footnote)
- 2019 – 2 times
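If you want to run a similar count yourself, here is a minimal sketch in Python. It is not the exact procedure behind the numbers above: it assumes you have already saved a snapshot of the page as an HTML string, and the `VisibleText` class and `count_word` function are hypothetical names introduced for illustration. It counts case-insensitive substring matches in the visible text, so “Hadoop-based” counts as well.

```python
import re
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style blocks.
        if self._skip_depth == 0:
            self.parts.append(data)


def count_word(html, word="Hadoop"):
    """Return the number of case-insensitive occurrences of `word` in the page text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    return len(re.findall(re.escape(word), text, flags=re.IGNORECASE))


# A made-up page used only to illustrate the call; not real Cloudera markup.
sample = (
    '<html><head><script>var x = "hadoop";</script></head>'
    "<body><h1>Apache Hadoop</h1>"
    "<p>Hadoop training and hadoop support.</p></body></html>"
)
print(count_word(sample))
```

Different counting rules (whole words only, including link targets, including image alt text) would give somewhat different numbers, so treat any such count as a rough indicator rather than an exact measurement.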
Nowadays, Cloudera’s mission, listed in bold on the front page of their website, is: “We deliver an Enterprise Data Cloud for any data, anywhere, from the Edge to AI”. You can clearly see the shift in focus: no more on-premises Hadoop and CDH, no more Big Data. Now they do Enterprise Cloud and AI. The only reference to CDH is on the “Downloads” page, under “Quickstart VMs”.
But is Hadoop really this bad? Not at all! In fact, it is not really Hadoop that is falling right now, it is the “Big Data” hype. But before proceeding with this, let’s take a short detour and look at Apache Spark.
Apache Spark was the last passenger to jump on the departing “Big Data” train:
The start of its rise almost matches the highest point of the hype curve drawn by Hadoop’s popularity. And based on this graph we can clearly see that it has already reached the ceiling of the “Big Data” market. This means there is no more room for horizontal growth, and the only way to move forward is vertical growth. This is why there is no Spark Summit in 2019; instead we have a shiny new Spark + AI Summit. You can see my take on the future of Apache Spark from 2016 here, and make your own judgement of how well my predictions have matched reality.
Big Data is the problem of processing large amounts of data. But the term was hyped so much that it now carries a clearly negative connotation. At the peak of the hype, anything could be labelled “Big Data” to boost sales. However, it is clear that “Big Data” is not a thing by itself, and has no value in itself.
“Big Data” was a problem faced by a select few of the large internet companies in 2000 – 2005. At that point in time, this was a very challenging problem: there was no knowledge of how to approach it, and of course no open source solutions for doing so. Many big internet companies became visionaries for the industry and gifted us what we now call “Big Data”: Google with its GFS, MapReduce and BigTable, Yahoo with its Hadoop, Facebook with its Cassandra and Hive, Twitter with its Storm, LinkedIn with its Kafka. Large internet companies were driving the revolution by inventing new approaches and tools to harness the large amounts of data they had to deal with. And many of them open sourced their software, making it available to the whole world. This was a pivotal moment, as it gave birth to a set of startups with a mission to sell all these solutions to conventional enterprises. Cloudera, Hortonworks, MapR and many others were among them.
The hype around “Big Data” was largely driven by the vast marketing investments of the above-named startups, and by the short-sightedness of the upper tiers of IT personnel in conventional enterprises. The marketing harnessed the association between a “Big Data” technology produced by a large internet company and the success of that company. The marketing materials did not say this directly, but they read like “use Cassandra and become successful like Facebook”, “use Kafka and reach the scale of LinkedIn”, “use Hadoop and become as rich as Google”. Overall, “Big Data” was not about selling technology, it was about selling the success of the large IT giants to conventional companies.
Unsurprisingly, many enterprises bought into this and implemented these technologies in their stacks. Following the implementation, they usually made a bold announcement that they were harnessing the power of “Big Data” and that their enterprise was advanced in this matter. However, the implementation itself was usually more of an experiment: a small, isolated case sitting aside from the main data processing pipelines, which might never reach production and would remain at the PoC or MVP level.
However, many smaller enterprises bought into this message from the large enterprises and their success stories, and also invested their money and effort in “Big Data”. This way, the hype grew like a snowball, with more and more senior people blatantly lying or not telling the complete truth, and marketers using their words (sometimes removing important context) to promote their solutions further.
The end of an Era
So, I’m not saying that some new breakthrough technology has come to replace “Big Data”. And I’m not saying that Hadoop is no longer a viable technology and no longer worth the investment. What I am saying is that the era of “Big Data” is coming to an end, dropping from the heights of the hype down to its bottom. The new trends, AI and ML, have come to replace it, and the cycle of life starts again: a new set of technologies climbing uphill on the hype chart, marketers promoting new software dressed in the sauce of tech giants’ success, and conventional enterprises buying into it, blowing up the next tech bubble.
Is this the end of Hadoop?
Not really. Hadoop is a great piece of technology, but it is essentially a niche solution. Only a select few enterprises really need it. And as a technology it competes with the major cloud providers offering alternative large-scale storage solutions: AWS with its S3, GCP with its Cloud Storage, Microsoft with its Azure Storage. The cloud is little by little eating the on-premises market, and in my opinion the cloud providers with their distributed storage solutions are the main competitors of Hadoop, which does not make its life any easier.
Disclaimer: everything in this article represents my personal and humble opinion, and is not affiliated with any of my employers.