Hadoop
I’d start with a bold statement: Hadoop is rapidly losing momentum. We can see it in the following Google Trends chart:
This graph is surprisingly similar to the diagram of the hype lifecycle:
It looks like Hadoop is on the downhill slope of the hype graph, heading straight for the trough of disillusionment. We all know that one of the most important pillars has fallen recently: Cloudera has bought Hortonworks, which means the two largest Hadoop players on the market are now a single company. And despite this buyout, Cloudera is far from successful on the stock market:
Essentially, the market is left with one major Hadoop player, Cloudera. And what if I told you that Cloudera has not been about Hadoop for a long time? Here is one interesting thing: the number of times the word “Hadoop” appears on the Cloudera front page (http://cloudera.com) over the last few years, according to the Internet Archive:
- 2008 – 4 times
- 2009 – 11 times
- 2010 – 29 times
- 2011 – 37 times
- 2012 – 23 times
- 2013 – 9 times
- 2014 – 4 times
- 2015 – 8 times
- 2016 – 6 times
- 2017 – 1 time (footnote)
- 2018 – 1 time (footnote)
- 2019 – 2 times
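A count like the one above can be reproduced from saved copies of a page. The following is only a minimal sketch, not necessarily how the numbers in the list were obtained: it assumes you already have the HTML of a snapshot as a string, and the names `count_word` and `_TextExtractor` are hypothetical helpers introduced here for illustration. It counts whole-word, case-insensitive matches in the text content, ignoring script and style bodies.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_word(html, word="Hadoop"):
    """Count whole-word, case-insensitive occurrences of `word` in page text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))
```

Whether footnotes, page titles or hidden elements should count is a judgement call, so the totals from a sketch like this may differ slightly from a manual count.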
Nowadays, Cloudera’s mission, listed in bold on the front page of their website, is: “We deliver an Enterprise Data Cloud for any data, anywhere, from the Edge to AI”. You can clearly see the shift in focus – no longer on-premise Hadoop and CDH, no longer Big Data. Now they do Enterprise Cloud and AI. The reference to CDH can only be found on the “Downloads” page, under “Quickstart VMs”.
But is Hadoop really this bad? Not at all! In fact, it is not really Hadoop that is falling right now, it is the “Big Data” hype. But before proceeding with this, let’s take a short detour and look at Apache Spark.
Apache Spark
Apache Spark was the last passenger to jump on the departing “Big Data” train:
The start of its rise almost matches the highest point of the hype curve drawn by Hadoop’s popularity. And based on this graph we can clearly see that it has already reached the cap of the “Big Data” market. This means there is no more space for horizontal growth, and the only way to move forward is vertical growth. This is why there is no more Spark Summit in 2019; instead we have a shiny new Spark + AI Summit. You can see my take on the future of Apache Spark from 2016 here, and make your own judgement of how well my predictions have matched reality.
Big Data
Big Data is a problem of processing large amounts of data. But the term was hyped so much that it now carries a clearly negative connotation. At the peak of the hype, anything could be labelled “Big Data” to boost sales. However, it is clear that “Big Data” is not a thing by itself, and has no value in itself.
“Big Data” is a problem faced by a select few of the large internet companies in 2000 – 2005. At that point in time, this was a very challenging problem – there was no knowledge of how it could be approached, and of course no open source solutions for doing so. Many big internet companies became visionaries for the industry and gifted us what we now call “Big Data”: Google with its GFS, MapReduce and BigTable, Yahoo with its Hadoop, Facebook with its Cassandra and Hive, Twitter with its Storm, LinkedIn with its Kafka. Large internet companies were driving the revolution by inventing new approaches and tools to harness the large amounts of data they had to deal with. And many of them open sourced their software, making it available to the whole world. This was a pivotal moment, as it gave birth to a set of startups with a mission to sell all these solutions to conventional enterprises. Cloudera, Hortonworks, MapR and many others were among them.
The hype around “Big Data” was largely driven by the vast marketing investments of the above-named startups, and by the short-sightedness of the upper tiers of IT personnel in conventional enterprises. The marketing exploited the association between a “Big Data” technology produced by a large internet company and the success of that company. The marketing materials did not say this directly, but they read like “use Cassandra and become successful like Facebook”, “use Kafka and reach the scale of LinkedIn”, “use Hadoop and become as rich as Google”. Overall, “Big Data” was not about selling technology, it was about selling the success of the large IT giants to conventional companies.
Unsurprisingly, many enterprises bought into this and implemented these technologies in their stacks. After such an implementation, they usually made a bold announcement that they were harnessing the power of “Big Data” and that their enterprise was advanced in this matter. However, the implementation itself was usually more of an experiment – set apart from the main data processing pipelines, a small and isolated case that might never reach production and would remain at the PoC or MVP level.
However, many smaller enterprises bought into this message from the large enterprises and their success stories, and also invested their money and effort into “Big Data”. This way, the hype grew like a large snowball, with more and more senior people blatantly lying or not telling the complete truth, and marketers using their words (sometimes removing the important context) to promote their solutions further.
The end of an Era
So, I’m not saying that some new breakthrough technology has come to replace “Big Data”. And I’m not saying that Hadoop is no longer a viable technology and no longer worth the investment. What I am saying is that the era of “Big Data” is coming to its end, dropping from the heights of the hype down to its bottom. The new trends, AI and ML, have come to replace it, and the cycle of life starts again with a new set of technologies climbing uphill on the hype chart, the marketers promoting new software covered in a sauce of tech giants’ success, and the conventional enterprises buying into this, blowing up the next tech bubble.
Is this the end of Hadoop?
Not really. Hadoop is a great piece of technology, but it is essentially a niche solution. Only a select few enterprises really need it. And as a technology it competes with major cloud providers offering alternative large-scale storage solutions: AWS with its S3, GCP with its Cloud Storage, Microsoft with its Azure Storage. Cloud is little by little eating the on-premise market, and the cloud providers with their distributed storage solutions are, in my opinion, the main competitors of Hadoop, which does not make its life easier.
Disclaimer: everything in this article represents my personal and humble opinion, and is not affiliated with any of my employers.
Good evening Alexey, and thank you for this new post
please, keep sharing your thoughts, which in my opinion are very accurate.
Kind Regards
Arturo
You’re welcome, I plan to keep on writing.
Spark built on top of hadoop’s input formats and was much, much faster and so stole the torch. Most affluent Hadoop users upgraded to Spark some time ago. Most of the rest of the market is now doing so.
Agree. This is why Spark market capacity has approximately reached Hadoop market capacity, i.e. Spark is running on top of almost every Hadoop deployment nowadays. And this is why I believe Databricks are moving towards trying to tie it with AI and sell it under the new sauce. I plan to soon write more on Spark.
Tell me, how do you train AI and ML without ‘Big Data’ ? Companies started collecting data from 2000-2005. Now they are using that data to train their ML and AI models.
Also, most of the hoopla around Spark is ignoring the fact that most people run Spark on Hadoop YARN and use Hadoop HDFS for storage.
I’ll say one thing – Databricks has done a real good job marketing and instilling the fact that ‘Spark is faster than Hadoop’. Which is a very dishonest statement. Spark is faster than ‘MapReduce’ which is an application that runs on Hadoop.
Spark is also terribly inefficient (CPU and memory utilization of Spark sucks) because it hogs far more memory and, until recently, was way less scalable, since Spark reduce operators could not spill to disk! But you run in the era of the cloud, so who cares (until your data grows and your AWS / GCP bill grows proportionally)
Yet, you have an army of so called ‘Data Scientists’ with their notebooks who talk about how spark is the next best thing to sliced bread.
There are multiple options for training an AI/ML model. For example, AlphaGo Zero was trained without a training set, just by playing against itself. And AlphaStar was trained on just 800k game replays, which is a matter of gigabytes of data.
The usual ML workflow looks like this: collect data, clean data, prepare train/test sets, train model, validate model, repeat. And the data is not always “Big Data”. For example, banks train their credit scoring models on megabytes or low gigabytes of data – this is what you get in a bank with around a million clients. Many systems still use an RDBMS to prepare the data for learning, and that is ok. Most ML algorithms lose precision when running in distributed mode (and some do not have a distributed mode at all), so they don’t really benefit much from the data being big.
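To make the point concrete, here is a minimal sketch of that workflow on a dataset that fits in memory on any laptop. Everything in it is an assumption made for illustration: the toy records (a single numeric feature and a 0/1 label), the one-feature threshold "model", and the function names `clean`, `split`, `train` and `validate`. It uses only the Python standard library and omits the "repeat" step.

```python
import random
import statistics


def clean(rows):
    """Drop records with a missing feature or label."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into train and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train(rows):
    """Fit a threshold at the midpoint between the two class means."""
    mean_0 = statistics.mean(x for x, y in rows if y == 0)
    mean_1 = statistics.mean(x for x, y in rows if y == 1)
    return (mean_0 + mean_1) / 2


def validate(threshold, rows):
    """Accuracy of predicting label 1 when the feature exceeds the threshold."""
    correct = sum((x > threshold) == bool(y) for x, y in rows)
    return correct / len(rows)


# Collect: hypothetical toy data, two well-separated classes plus two bad rows.
raw = (
    [(0.10 + 0.01 * i, 0) for i in range(20)]
    + [(0.60 + 0.01 * i, 1) for i in range(20)]
    + [(None, 1), (0.5, None)]
)

rows = clean(raw)
train_rows, test_rows = split(rows)
threshold = train(train_rows)
print("threshold:", threshold, "accuracy:", validate(threshold, test_rows))
```

A real credit scoring model would of course use many features and a proper library, but the shape of the pipeline stays the same, and nothing in it requires a distributed system at this data size.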
Yes, Spark runs best on YARN and many people use it like this, but what I’m saying here is that Spark is in the same trouble as Hadoop – the Big Data hype is going away, and it is left with a cold market with negative sentiment towards Big Data and Hadoop.
Databricks has done a great job on marketing Apache Spark, but IMO they forgot that as a company they should have some profits to keep on running, and the Databricks value added offering is, well, mediocre.
About Spark being inefficient – I don’t completely agree, I think Spark has a good balance of functionality vs efficiency. Of course, if you use an MPP RDBMS instead of Spark SQL, Apache Storm instead of Spark Streaming, Keras instead of Spark MLlib – you will get much better value using fewer compute/storage resources. However, the cost of understanding all these separate technologies and operating such a complex deployment (i.e. its OpEx) offsets the price of the resources you’ve saved.
AlphaGo, I would say, is an outlier – I don’t know about banks, but the applications of AI / ML I have been involved in still require enormous amounts of data – self-driving, business analytics and most other forms of analytics, to name a few. Behavioral / ad analytics will not work without training and validating your model over significantly sized datasets. Just because people haven’t figured out distributed training doesn’t make it less useful.
By the way, even if Cloudera doesn’t make money from Hadoop, take a look at how many people use Amazon EMR, Azure HDI and Google Dataproc – they are respectively the most used (or at least figure in the top 5) managed service for each of the vendors, outside of their basic container services.
My opinion is that Hadoop / Big data is actually becoming ubiquitous. The thing is, you can’t make money just selling Big data tech, mostly because open source Hadoop / Spark / Kafka etc. are getting better and people understand you don’t have to fork out a lot of cash to a vendor. Operability was an issue before, but the cloud providers have come in to fill that space.
100%. Big data is not going away – it is just becoming commodity. It is now mostly as easy as managing an RDBMS, once you have the experience. Tools are maturing, learnings are shared, etc.
Business Analytics – do you mean BI? BI mainly uses MPP as a data source, which is a 40-year-old technology, much older than the term “Big Data”.
Self-driving cars – I agree, but how many companies are working on it around the world? 5, maybe 10? It still proves to be a niche solution for a niche problem.
Regarding ads business – I agree that it is about the data. However, one of my customers from the Ads business has been using MPP for the click analysis, and a single-server RDBMS for the reporting.
I agree that cloud providers are making money from hosting Hadoop deployments. But what I’m saying is that the hype is fading away, and Hadoop is taking the place of a niche technology that most companies don’t really need or use now. You see, the hype was boosted by large marketing budgets to drive the sales of “Big Data” vendor licences. The vendors oversold, even to customers who didn’t really need the technology; enterprise users ran their PoCs and MVPs, understood the technology’s limitations, and moved on. Big Data and Hadoop have acquired a negative sentiment, the market started to shrink and the vendors started to withdraw their offerings one by one (Pivotal, DataStax, Hortonworks and many more like this).
But as the negative sentiment disappears over time, we will likely see Hadoop reaching the “Slope of Enlightenment”. The amounts of data businesses operate on really are growing, and over time more and more companies will really need this technology, so Hadoop adoption and its popularity will go back up.
Excellent writing… I believe the “Hadoop is dying” narrative is built by competitors who took a serious hit from Hadoop and, during Hadoop’s “Trough of Disillusionment”, are trying desperately to grab their market share back.
In my opinion the issue with Hadoop is less from the technical perspective and more from the understanding perspective. You can buy a Ferrari but can’t use it to plough fields. Moreover, a common misconception is that Hadoop is Big Data and Big Data is Hadoop. I consider it an ecosystem where better tools and technologies are replacing moderate tools and technologies at a very rapid pace. Hadoop started around 10 years ago, and innovation has been so rapid that we moved from MR to Pig to Hive to Spark in this brief period, serving use cases like batch data processing, data archival, DR, near-realtime/realtime stream processing and so on, all while still working on an open source model. Hadoop tried to be too nice, which in this cruel world backfires most of the time.
If you look deep into Hadoop and Spark, their technologies come from various MPP implementations which old database companies have been building for >40 years. They just packaged and open-sourced them. And their fate is to become an open source equivalent of the Teradata database…
And we are in 2024 using platforms built mainly around Spark.