Everyone on the internet is constantly talking about the bright future of Apache Spark: how cool it is, how innovative it is, how fast it is moving, how big its community is, how big the investments into it are, and so on. But what is really hiding behind the enthusiasm of Spark adepts, and what is the real future of Apache Spark?
In this article I show you the real data and real trends, trying to be as agnostic and unbiased as possible. This article is not affiliated with any vendor.
Cloudera recently announced Kudu, a new storage engine for fast analytics on fast data. This is a very interesting piece of code, and I couldn’t resist analyzing this technology in more depth and going beyond the marketing.
The faster your data warehousing solution runs, the more the business demands that new data show up in its reports quickly. Lately I’ve seen a number of attempts to build a cool thing called “online DWH”: a data warehouse that is almost in sync with its data sources and has its data marts and reports dynamically updated as new data flows in. This is a great and powerful thing, but unfortunately its implementation is not as straightforward as the business wants it to be.
Databricks recently announced the availability of DataFrames in Spark, which gives you a great opportunity to write even simpler code that executes faster, especially if you are a heavy Python/R user. In this article I will go a bit deeper than the publicly available benchmark results to show you how it really works.
Today I will tell you about a startup called “Splice Machine”. They position themselves as “The Only Hadoop RDBMS”, which is quite bold given the boom we now see in the SQL-on-Hadoop field: almost every big vendor has implemented its own “one and only” solution and claims it to be the best. But let’s take a look at its internals to see whether its design really lives up to the marketing slogan they’ve chosen.
MVCC stands for Multi-Version Concurrency Control. It is the basic transaction isolation idea behind many transactional systems, and it allows different processes to see different versions of the truth for the same data. Consider a DBMS: when you run a query that performs an “update” of a number of records in a table, you need to guarantee a specific level of transaction isolation. If you run a “select” in parallel with this “update”, you most likely want the “select” to see the data that was in the table before the “update” started, and not the “dirty” data created by this “update” (which might be rolled back as well as committed).
The solution to this particular problem is MVCC: you store a number of versions for each row of the table that was changed. These versions have to be stored somewhere and maintained somehow. I will discuss a number of approaches to doing this.
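To make the idea concrete, here is a minimal sketch of row versioning in Python. It is not how any particular DBMS implements MVCC; the names MVCCStore and Transaction, the integer commit timestamps, and the buffered writes are all assumptions chosen for illustration. Each key keeps a list of committed versions, and a reader only sees versions committed at or before the moment its transaction started.

```python
class MVCCStore:
    """Toy in-memory store keeping several committed versions per key (hypothetical)."""

    def __init__(self):
        self.last_commit_ts = 0
        # key -> list of (commit_ts, value), oldest version first
        self.versions = {}

    def begin(self):
        # A new transaction sees everything committed so far, and nothing later.
        return Transaction(self, self.last_commit_ts)


class Transaction:
    def __init__(self, store, snapshot_ts):
        self.store = store
        self.snapshot_ts = snapshot_ts
        self.pending = {}  # uncommitted writes, visible only to this transaction

    def read(self, key):
        if key in self.pending:
            return self.pending[key]
        # Walk from newest to oldest and take the first version inside the snapshot.
        for commit_ts, value in reversed(self.store.versions.get(key, [])):
            if commit_ts <= self.snapshot_ts:
                return value
        return None

    def update(self, key, value):
        self.pending[key] = value

    def commit(self):
        # Publish all pending writes as new versions under one commit timestamp.
        self.store.last_commit_ts += 1
        ts = self.store.last_commit_ts
        for key, value in self.pending.items():
            self.store.versions.setdefault(key, []).append((ts, value))
        self.pending = {}

    def rollback(self):
        # Nothing was published, so dropping the buffer is enough.
        self.pending = {}


store = MVCCStore()
setup = store.begin()
setup.update("balance", 100)
setup.commit()

writer = store.begin()
reader = store.begin()
writer.update("balance", 50)
before_commit = reader.read("balance")   # the uncommitted write is invisible
writer.commit()
after_commit = reader.read("balance")    # the reader keeps its original snapshot
fresh = store.begin().read("balance")    # a new transaction sees the new version
print(before_commit, after_commit, fresh)
```

This sketch deliberately leaves out the hard parts: there is no write-write conflict detection, and old versions are never cleaned up, which is exactly the maintenance problem mentioned above.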