Yesterday my blog got its 100th subscriber. To commemorate this, I have prepared a post on the major industry trends in the field of “data”. I might have missed something, so feel free to comment and extend the article with your opinion!
Big data is falling down the hype curve
Even though Gartner removed “Big Data” from last year’s hype diagram, it does not mean it suddenly moved from the peak of the “hype” to the plateau of adoption. Here is what the hype cycle looks like:
And here is what the trends look like for Big Data and Hadoop, according to Google Trends:
The “Big Data” curve looks exactly as you would expect for a hyped technology on the rise. Here is my version of what happened, how it happened and why it happened:
- Hadoop was born from Google’s ideas and Yahoo’s technologies to meet the biggest internet companies’ need for distributed compute and storage frameworks. 2003-2008 are the early ages of Hadoop, when almost no one knew what it was, why it existed and how to use it;
- In 2008, a group of enthusiasts formed a company called Cloudera to occupy the market niche of “cloud” and “data” by building a commercial product on top of open source Hadoop. Later they abandoned the “cloud” and focused solely on “data”. In March 2009 they released their first Cloudera Hadoop Distribution. You can see this moment on the trends diagram immediately after the 2009 mark, where the Hadoop trend starts to rise. This was a huge marketing push related to the first commercial distribution;
- From 2009 to 2011, Cloudera was the one trying to heat up the “Hadoop” market, but it was still too small to create a notable buzz around the technology. Still, the first adopters proved the value of the Hadoop platform, and additional players joined the race: MapR and Hortonworks. Early adopters among startups and internet companies started to play with this technology at this time;
- 2012 – 2014 are the years “Big Data” became a buzzword, a “must have” thing. This was caused by the massive marketing push by the companies noted above, plus the companies supporting this industry in general. In 2012 alone, major tech companies spent over $15b buying companies doing data processing and analytics. Some of them were bubbles (like Autonomy), some were not. But the demand for “big data” solutions was growing, and the analyst publications were heating up the market very hard. Early adopters among enterprises started to play with the promising new technology at this time;
- 2014 – 2015 are the years “Big Data” approached the hype peak. Intel invested $760m in Cloudera, giving it a valuation of $4.1b, and Hortonworks went public at a valuation of $1b. Major new data technologies emerged, like Apache Spark, Apache Flink, Apache Kafka and others. IBM invested $300m in Apache Spark technology. This is the peak of the hype. In these years massive adoption of “Big Data” in enterprises started, and the architecture concepts of “Data Lake” / “Data Hub” / “Lambda Architecture” emerged to simplify the integration of modern solutions into conventional enterprise infrastructures:
- 2016 and beyond – this is an interesting time for “Big Data”. Cloudera’s valuation has dropped by 38%. Hortonworks’s valuation has dropped by almost 40%, forcing them to cut the professional services department. Pivotal has abandoned its Hadoop distribution, going to market jointly with Hortonworks. What happened and why? I think the main driver of this decline is the enterprise customers that started adopting the technology in 2014-2015. After a couple of years of playing around with “Big Data”, they have finally understood that Hadoop is only an instrument for solving specific problems; it is not a turnkey solution for overtaking your competitors by leveraging the holy power of “Big Data”. Moreover, you don’t need Hadoop if you don’t really have a problem of huge data volumes in your enterprise, so hundreds of enterprises were hugely disappointed by their useless 2 to 10TB Hadoop clusters: Hadoop technology just doesn’t shine at this scale. All of this has caused a big wave of priority re-evaluation by enterprises, which are shrinking their investments in “Big Data” and focusing on solving specific business problems. The “Big Data” market is cooling down:
The emergence of Data in the Cloud
This is the second major trend of the “data” industry. IBM acquires Cloudant. Databricks, the company behind Apache Spark, offers its product in the cloud only, in collaboration with AWS. The most common use case for Docker containers is running data services inside of them. All the major public cloud companies offer technologies like “managed databases”, or even analytical databases in the cloud. All the major Hadoop vendors have already pushed their “cloud” offerings to the market. The DBaaS industry is getting hotter and hotter, with all the major DBMS vendors offering their solutions in the cloud.
Initially, “cloud” was meant to host applications only (aka 12-factor applications), and the databases had to be managed separately. But time passes, and now many companies are moving to the cloud, hosting their databases there and even running analytics on cloud-hosted distributed processing engines. Amazon Redshift alone is reportedly running more than 100k nodes!
Data is going open
If you have seen my visualization of the open source data community, you understand what I am talking about:
10 years ago the only open source data processing offerings were Postgres and MySQL. Over time, an open source data industry has emerged, and over the last couple of years you can see more and more companies going open source!
Pivotal open sources Greenplum, HAWQ and GemFire (aka Geode). DataTorrent open sources its technology as Apache Apex. Cloudera open sources Kudu. Citus Data open sources CitusDB. Google open sources TensorFlow. Google open sources Dataflow as Apache Beam. There are more examples; just scroll through the visualization to see how the open source data industry has moved from 10 to 100+ projects within the last 10 years.
Artificial Intelligence climbs the hype
“Artificial Intelligence” is starting to climb the hype hill. Many vendors providing solutions in the field of Machine Learning and Deep Learning are now rebranding their offerings as “AI platforms”. For example, you can take a look at http://www.h2o.ai, http://www.wipro.com/holmes and many others.
IBM is aggressively moving Watson to the market. Google open sources its TensorFlow, Microsoft open sources CNTK. Facebook open sources AI hardware design and Torch. Google’s DeepMind has created AlphaGo, an intelligent computer system that plays Go at pro level and has beaten one of the best Go players in the world.
All the major IT companies are in a race for strong AI, and you can observe the consequences of it in the industry. The rise of AI companies supplying enterprise demand for these technologies is yet to come.
Interesting opinion about big data hype.
Btw: in feedly I see that there are 221 readers of your blog there. I read it there myself, so your subscriber statistics can be a little bit wrong 🙂
It’s my personal opinion, so if you have a different view I’m open to discussion
I think feedly works by subscribing to the RSS stream, and there is no way at all to track how many people are reading your RSS. I was talking about direct email subscriptions
Out of curiosity, what do you think somebody should learn/focus on to stay ahead of other big data engineers and command a high salary? I mainly do Hive/Spark/Sqoop in my current job, and I’m afraid that as the technology gets easier to use and moves to higher-level abstractions, the pay for this type of work is going to stagnate or drop off significantly.
I was thinking of maybe just focusing on Scala and knowing the JVM from front to end.
Also, do you think the Cloudera/Spark certs are still worth it (I know you have them)? What about the AWS Solutions Architect cert? I heard it’s highly respected.
You shouldn’t care much about simpler APIs. There is a huge gap between remembering the API and understanding how the technology works from the inside. Take SQL: there are hundreds of thousands of people who know the SQL query language, but how many of them can describe the difference between a hash join and a sort-merge join? Not many. It is similar for big data: many people can tell you what Hive is, but very few really understand what is under the hood, how it works and how you can affect the query execution logic to make your query run faster.
If you are a good specialist in a specific area and this area is popular, focus on getting deeper knowledge of it and becoming a subject matter expert. Certificates mainly validate that you know the API. Becoming a contributor to the technology lets everyone know that you really understand it in depth. So I advise you to focus on system internals and become a contributor to Spark/Hive/Sqoop
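To make the hash join vs. sort-merge join question above a bit more concrete, here is a minimal Python sketch of both strategies for an inner equi-join over in-memory lists of dicts. It is only an illustration under simplifying assumptions (everything fits in memory, one join key), not how any particular database implements them; the function names and the sample rows are made up for this example.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner join: build a hash table on `left`, then probe it with `right`."""
    table = defaultdict(list)
    for row in left:                      # build phase: one pass over left
        table[row[key]].append(row)
    out = []
    for r in right:                       # probe phase: one pass over right
        for l in table.get(r[key], []):
            out.append({**l, **r})
    return out


def sort_merge_join(left, right, key):
    """Inner join: sort both inputs on the key, then merge them in one sweep."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows sharing this key on each side,
            # then emit every pairing of the two runs.
            i_end = i
            while i_end < len(left) and left[i_end][key] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    out.append({**l, **r})
            i, j = i_end, j_end
    return out


# Hypothetical sample data, for illustration only.
users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}, {"uid": 3, "name": "cy"}]
orders = [{"uid": 2, "item": "disk"}, {"uid": 1, "item": "cpu"},
          {"uid": 2, "item": "ram"}, {"uid": 9, "item": "fan"}]

hashed = hash_join(users, orders, "uid")
merged = sort_merge_join(users, orders, "uid")
```

Both functions return the same set of joined rows, only in a possibly different order. The hash join needs memory for a hash table on one input but no ordering, while the sort-merge join pays for sorting (or gets it for free if the inputs are already sorted on the key) and then streams through both sides once. Knowing this kind of trade-off is the "under the hood" understanding I mean.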
Hi! Do you mind if I share this on my blog with a link pointing to this as source?
Feel free to do so
Thanks! Also, what are your views on enterprises that do not have any financial restrictions (well, kind of) and use something like SAP HANA? When the need arises to create something like a data mart, the lure of ‘complicating’ things with Hadoop is not really that great. Although your point regarding Hadoop in the cloud is a great one, especially with SAP buying Altiscale just a few days ago. It seems that is where things are headed, to keep things low commitment and still scalable.
Your article is unbiased and great! But it would be nice to know your viewpoint from this side of things.
That’s exactly what happened in the financial industry. Big firms rushed to implement Hadoop because everyone else was doing it….