Category Archives: DBMS

Apache Spark Future

Everyone on the internet seems to be talking about the bright future of Apache Spark: how cool it is, how innovative it is, how fast it is moving, how big its community is, how large the investments in it are, and so on. But what is really hiding behind the enthusiasm of Spark adepts, and what is the real future of Apache Spark?


In this article I show you real data and real trends, trying to be as agnostic and unbiased as possible. This article is not affiliated with any vendor.


The Story of Online Data Warehouse

The faster your data warehousing solution runs, the more the business demands that new data appear quickly in its reports. Recently I have seen a number of attempts to build a cool thing called “online DWH”: a data warehouse that is almost in sync with its data sources and has its data marts and reports dynamically updated as new data flows into it. This is a great and powerful thing, but unfortunately its implementation is not as straightforward as the business wants it to be.



Spark DataFrames are faster, aren’t they?

Recently Databricks announced the availability of DataFrames in Spark, which gives you a great opportunity to write even simpler code that executes faster, especially if you are a heavy Python/R user. In this article I will go a bit deeper than the publicly available benchmark results to show you how it really works.


Modern Data Architecture Podcast

Great news! I have participated in a podcast recorded by Pivotal and published on our official blog. In this podcast I discuss data architecture in general: how things started, what the main drivers of its evolution were, and what we now have as a “modern data architecture”. Come and listen here:

Pivotal Podcast Modern Data Architecture

A text transcript of this talk is also available at the same URL.

Splice Machine

Today I will tell you about a startup called “Splice Machine”. They position themselves as “The Only Hadoop RDBMS”, which is quite bold given the boom we now see in the field of SQL-on-Hadoop solutions: almost every big vendor has implemented its own “one and only” solution and claims it to be the best. But let’s take a look at its internals to see whether its design really reflects the marketing slogan they’ve chosen.


MVCC in Transactional Systems

MVCC stands for Multi-Version Concurrency Control. It is the basic transaction isolation idea behind many transactional systems, and it allows different processes to see different versions of the truth for the same data. Consider a DBMS: when you run a query that performs an “update” of a number of records in a table, you have to guarantee a specific level of transaction isolation. If you run a “select” in parallel with this “update”, you most likely want the “select” to see the data that was in the table before the “update” started, and not the “dirty” data created by the “update” (which might be rolled back as well as committed).

The solution to this particular problem is MVCC: you store a number of versions for each row of the table that has been changed. These versions have to be stored and maintained somehow. I will discuss a number of approaches to doing so.
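To make the idea concrete, here is a minimal sketch in Python of one possible way to keep row versions. It is a toy illustration and not how any particular DBMS implements MVCC. The names `Store`, `Transaction` and `Version` are hypothetical, and the sketch assumes the simplest visibility rule: a transaction sees its own writes plus the versions written by transactions that had already committed when it began. Write-write conflicts, deletes and cleanup of old versions are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Set


@dataclass
class Version:
    value: str
    created_by: int  # id of the transaction that wrote this version


@dataclass
class Store:
    """Toy multi-version store: each key keeps a list of versions, oldest first."""
    rows: Dict[str, List[Version]] = field(default_factory=dict)
    committed: Set[int] = field(default_factory=set)
    next_txid: int = 1

    def begin(self) -> "Transaction":
        txid = self.next_txid
        self.next_txid += 1
        # The snapshot is the set of transactions committed at start time.
        return Transaction(self, txid, frozenset(self.committed))


@dataclass
class Transaction:
    store: Store
    txid: int
    snapshot: FrozenSet[int]

    def write(self, key: str, value: str) -> None:
        # Never overwrite in place: append a new version instead.
        self.store.rows.setdefault(key, []).append(Version(value, self.txid))

    def read(self, key: str) -> Optional[str]:
        # Walk versions newest first and return the first visible one.
        for version in reversed(self.store.rows.get(key, [])):
            own_write = version.created_by == self.txid
            if own_write or version.created_by in self.snapshot:
                return version.value
        return None

    def commit(self) -> None:
        self.store.committed.add(self.txid)

    def rollback(self) -> None:
        # Drop every version this transaction created.
        for versions in self.store.rows.values():
            versions[:] = [v for v in versions if v.created_by != self.txid]


store = Store()
setup = store.begin()
setup.write("balance", "100")
setup.commit()

updater = store.begin()
updater.write("balance", "50")        # uncommitted "dirty" version

reader = store.begin()
print(reader.read("balance"))         # 100: the dirty version is invisible
updater.commit()
print(reader.read("balance"))         # 100: the snapshot was taken at begin
print(store.begin().read("balance"))  # 50: a new transaction sees the commit
```

The “select” from the example above corresponds to `reader`: it keeps seeing the value from before the “update”, whether the updater commits or rolls back. The price is that old versions pile up in `rows`, which is exactly the storage and maintenance problem mentioned above.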

