I have just read the “Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics” paper and decided to write a short blog post going through some of the key points of its motivation. Let’s start.
> A decade ago, the first generation [data warehouse] systems started to face several challenges. First, they typically coupled compute and storage into an on-premises appliance. This forced enterprises to provision and pay for the peak of user load and data under management, which became very costly as datasets grew.
This is indeed correct. Most EDW vendors preferred selling appliances over software-only licenses. For some, the main driver was custom hardware (Netezza); for others, the potential extra price margin. However, “pay for the peak of user load” holds true for all on-premises software deployments, not only EDW ones. The “pay-as-you-go” approach was first pushed by the cloud vendors to incentivize the cloud transformation of traditional businesses. Obviously, cloud transformation also made an impact on the EDW market:
- Cloud vendors introduced their own EDW solutions, either by purchasing and adapting an existing one (ParAccel into AWS Redshift), developing a totally new one (Microsoft’s Parallel Data Warehouse), or porting an internal solution for external use (Google’s Dremel into BigQuery). All of these were designed for cloud deployment out of the box, allowing clients to take advantage of the same “pay-as-you-go” model.
- Traditional EDW vendors added a cloud deployment option to their solutions. However, cloud integration is not simple: solutions designed for on-premises hardware with directly attached storage do not map well onto a cloud world where storage and compute are independent, each with its own tiers. As a result, these ported solutions fell behind the native offerings of cloud vendors for a long time.
- New cloud-native EDW offerings emerged: Snowflake, Databricks Data Lakehouse, Cloudera. It is not entirely fair to put them in the same line, as each has its own specific background, but they share one common point – they are built on top of elastic-by-design platforms that adjust very well to cloud deployment.
So the push from “pay for the peak” to “pay as you go” in the EDW space was driven by the general cloud transformation trend, not by the fact that “pay for the peak” became very costly for data warehouses. The cloud is ultimately just someone else’s computer, and even there the “pay as you go” approach has limitations: it works well at small scale, but the larger your EDW is, the more likely you will need a custom agreement with the cloud vendor to guarantee you can scale up when needed. And at very large scale you will be asked to provision for the peak anyway.
> Second, not only were datasets growing rapidly, but more and more datasets were completely unstructured, e.g., video, audio, and text documents, which data warehouses could not store and query at all.
I have always been skeptical about the hype around BigData and “unstructured data”. You can read more details in my older post, but let me briefly reiterate how it happened:
- [2003-2004] Google published a set of papers on how it solved data processing at scale.
- [2004-2008] Yahoo and Facebook developed a set of open-source solutions based on these papers, as they faced the same problem of data processing at scale internally.
- [2008-2016] Companies that packaged these open-source solutions and sold them to traditional businesses under the “be like Big Tech” kool-aid emerged en masse. It seemed like everyone tried to sell Hadoop at least once, right? And to drive the sales, marketing specialists invented the mythical “BigData” and “unstructured data”.
Ultimately, these were just placeholders for the lack of actual use cases for Hadoop technology in the traditional enterprise. Data warehouses do not need to query “video, audio, and text documents”. Neither do other systems, because this is raw information not suitable for direct querying. To process this kind of information, you need to develop and deploy data processing pipelines that extract properly structured data from it (image recognition, sentiment analysis, etc.). The extracted data can then be placed into the EDW and subsequently used for analysis. Hadoop is a good framework that simplifies deploying such pipelines at large scale. But this data merely complements the EDW; it does not replace or extend it.
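The extract-then-load pattern above can be sketched in a few lines. This is a minimal illustration, not a real pipeline: a trivial keyword scorer stands in for an actual ML model (sentiment analysis, image recognition), and the `ReviewFact` schema is hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass
class ReviewFact:
    review_id: int
    sentiment: str  # structured output, ready for an EDW fact table

# Trivial stand-in for a real sentiment model.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"bad", "broken", "awful"}

def extract(review_id: int, raw_text: str) -> ReviewFact:
    """Turn raw, unqueryable text into a structured row."""
    words = set(re.findall(r"[a-z]+", raw_text.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return ReviewFact(review_id, label)

rows = [extract(i, text) for i, text in enumerate([
    "Great product, love it",
    "Arrived broken, awful experience",
])]
# `rows` is what gets loaded into the EDW - not the raw text itself.
```

The point of the sketch: the warehouse only ever sees the structured rows; the raw text stays with the pipeline that produced them.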
> To solve these problems, the second generation data analytics platforms started offloading all the raw data into data lakes: low-cost storage systems with a file API that hold data in generic and usually open file formats, such as Apache Parquet and ORC.
This change was not driven by technological reasons. Most EDW vendors price their solutions based on the amount of data you store in them, so it makes sense to offload the “cold” data somewhere else to reduce the licensing cost of your EDW. However, a “data lake” introduced for the sake of cheap storage ultimately added more problems than it solved. It did reduce the EDW license cost, but at the same time it introduced the license cost of the “data lake” solution and complicated the deployment landscape, increasing the need for engineering staff to oversee the “data lake” and its integration with the EDW. So the cost-cutting attempt resulted in cost shifting.
> In today’s architectures, data is first ETLed into lakes, and then again ELTed into warehouses, creating complexity, delays, and new failure modes.
Yes, and this is a problem of misusing the “data lake” to cut the cost of the EDW license and hardware, as stated above. It is not driven by technical or technological issues. Traditional enterprises still ETL/ELT from operational sources directly into the EDW.
> Moreover, enterprise use cases now include advanced analytics such as machine learning, for which neither data lakes nor warehouses are ideal.
I always smiled listening to sales pitches packaging the “machine learning / data science” buzzword into an EDW solution. They all sell “machine learning at scale” and their ultimate proprietary library as the solution to all the data science problems you might face. Unfortunately, machine learning does not work this way. Machine learning is 80% about data processing (which can be handled by any EDW solution) and only 20% about the actual analysis. And for the “at scale” problems there is a magical (statistical) solution known as “sampling”. Only a small minority of problems actually require training a model on a very large dataset, and it will not be a surprise if I tell you that every single EDW vendor has their own solution for this on top of their EDW offering. And of course, there is a whole class of specialized solutions as well.
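The sampling point is easy to demonstrate. A minimal sketch on synthetic data (the numbers are illustrative, not from the paper): a 1% random sample of a million-row dataset recovers the full-data statistic to within a fraction of a percent.

```python
import random
import statistics

random.seed(42)
# Synthetic "full dataset": one million draws from a normal distribution.
population = [random.gauss(mu=100.0, sigma=15.0) for _ in range(1_000_000)]

full_mean = statistics.fmean(population)
# A 1% random sample stands in for "train on a sample, not everything".
sample_mean = statistics.fmean(random.sample(population, 10_000))

relative_error = abs(full_mean - sample_mean) / full_mean
print(f"relative error of the 1% sample: {relative_error:.4%}")
```

For a simple mean the standard error shrinks as 1/sqrt(n), so 10,000 rows already pin the answer down tightly; the same logic is why many “at scale” analyses never need the full dataset.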
> The data in the warehouse is stale compared to that of the data lake, with new data frequently taking days to load.
There is a class of solutions called “Change Data Capture” (CDC) that predates Hadoop, and a large set of vendors provide them today. When applied properly, CDC can reduce data staleness in the EDW to tens of minutes. The acceptable data staleness in the EDW is decided by the stakeholders based on a cost-benefit analysis; it is not limited by a lack of technology.
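The core of the CDC idea can be sketched as replaying a stream of change events against the warehouse table, instead of bulk-reloading it. The event shape (`op`/`pk`/`row`) is hypothetical; real CDC tools read the source database’s change log.

```python
from typing import Any

def apply_cdc_events(target: dict[int, dict[str, Any]],
                     events: list[dict[str, Any]]) -> None:
    """Replay change events against a target table keyed by primary key."""
    for ev in events:
        key = ev["pk"]
        if ev["op"] in ("insert", "update"):
            target[key] = ev["row"]
        elif ev["op"] == "delete":
            target.pop(key, None)

warehouse_table: dict[int, dict[str, Any]] = {}
apply_cdc_events(warehouse_table, [
    {"op": "insert", "pk": 1, "row": {"name": "Alice", "balance": 10}},
    {"op": "insert", "pk": 2, "row": {"name": "Bob", "balance": 20}},
    {"op": "update", "pk": 1, "row": {"name": "Alice", "balance": 15}},
    {"op": "delete", "pk": 2, "row": None},
])
# Only the changed rows move; staleness is bounded by how often events ship.
```

Because only deltas cross the wire, how stale the warehouse is becomes a question of how frequently you ship events, which is a cost decision, not a technology gap.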
> A straw-man solution that has had limited adoption is to eliminate the data lake altogether and store all the data in a warehouse that has built-in separation of compute and storage. We will argue that this has limited viability, as evidenced by lack of adoption, because it still doesn’t support managing video/audio/text data easily or fast direct access from ML and data science workloads.
Hold it right there – “lack of adoption”? This is essentially the modern solution proposed by every cloud vendor, and by Snowflake as well. My viewpoints on “unstructured data” in the EDW and on “data science” are already laid out above.
TL;DR. The paper motivates the introduction of the Lakehouse by the high engineering effort required to maintain the “data lake” + EDW tandem (caused by data lake misuse for cutting EDW license costs), EDW data staleness (solved by CDC long ago), EDW lack of advanced analytics support (every EDW vendor provides it), and unstructured data processing needs (you don’t want unstructured data inside your EDW). But the actual motivation is Databricks’ willingness to position their solution as a competitor in the cloud EDW market and describe its key features, generalising their approach to look more comprehensive than their competitors’. Unfortunately, I don’t see unique challenges it solves, so for now I will treat Lakehouse as another marketing buzzword. It is also worth noting that 5 years ago I had already assumed that Databricks would enter the EDW market, but I greatly underestimated the timeline: the entry actually took close to 5 years, compared to my original 1-year estimate.
Disclaimer: this post represents my own humble opinion and is not affiliated with any organisation.