Lakehouse


I have just read the “Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics” paper and decided to write a short blog post going through some of the key points of the paper’s motivation. Let’s start.

Continue reading →

Snowflake: The Good, The Bad and The Ugly


Snowflake, or SnowflakeDB, is a cloud SaaS database for analytical workloads and batch data ingestion, typically used for building a data warehouse in the cloud. However, it appears to be so cool and shiny that people are going crazy praising it all over the internet. Seeing that, I could not resist the urge to take a closer look at this technology and poke into some of its pain points. What also puzzled me at first is the lack of SnowflakeDB criticism in blogs and on message boards, which sounds suspicious given the self-proclaimed customer base of more than 1000 enterprises. So, let’s take a closer look at it.

Continue reading →

Hadoop: The end of an Era


Hadoop

I’ll start with a bold statement: Hadoop is rapidly losing momentum. We can see it from the following Google Trends chart:

Continue reading →

Apache Spark Future


Everyone around the internet is constantly talking about the bright future of Apache Spark: how cool it is, how innovative it is, how fast it is moving, how big its community is, how big the investments into it are, etc. But what is really hiding behind this enthusiasm of Spark adepts, and what is the real future of Apache Spark?

Predicting Apache Spark Future

In this article I show you real data and real trends, trying to be as agnostic and unbiased as possible. This article is not affiliated with any vendor.

Continue reading →

Data Industry Trends


Yesterday my blog got its 100th subscriber. To commemorate this, I prepared a post on the major industry trends happening in the field of “data”. I might have missed something, so feel free to comment and extend the article with your opinion!

Big data is falling down the hype curve

Even though Gartner removed “Big Data” from last year’s hype diagram, it does not mean it suddenly moved from the peak of the “hype” to the plateau of adoption. Here is what the hype curve looks like:

Continue reading →

Apache HAWQ: Next Step in MPP


My first blog post has been accepted to the official Pivotal blog! Feel free to comment and share your opinion on the subject:

https://blog.pivotal.io/big-data-pivotal/products/apache-hawq-next-step-in-massively-parallel-processing

Modern Data Architecture Talk


Here is the video of my talk on Modern Data Architecture from Java Day Kiev 2015.

The slides are available here: Modern Data Architecture – JD Kiev v05

Open Source Data Community Visualization


The open source data community has been growing rapidly over the last 10 years. You can feel this in the emergence of projects like Apache Hadoop, Apache Spark and the like. It is growing so fast that there is almost no chance of keeping up with it without constantly monitoring the related events, announcements and other changes. 10 years ago it was enough to know “just Oracle” or “just MySQL” to make a successful career in data. Now things have changed greatly, and if you cannot answer questions like “what is the difference between MapReduce and Spark?” and “when would you prefer to use Flink over Storm?” at your job interview, you are screwed.

Github Data Community Graph Snapshot

Also, what would be the “next big thing” in data?

Continue reading →

Spark Architecture Video


This is the talk I gave at Java Day Kiev 2015. It was a great conference after all.

Spark Memory Management


Starting with Apache Spark version 1.6.0, the memory management model has changed. The old memory management model is implemented by the StaticMemoryManager class, and it is now called “legacy”. “Legacy” mode is disabled by default, which means that running the same code on Spark 1.5.x and 1.6.0 would result in different behavior, so be careful with that. For compatibility, you can enable the “legacy” model by setting the spark.memory.useLegacyMode parameter to true.
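As a minimal sketch of that compatibility switch: the parameter can be passed to spark-submit through its standard `--conf key=value` option. The helper name `spark_submit_args` and the application file `my_job.py` are made up for illustration, and the setting applies to the 1.6.x line discussed here, so check the configuration docs for the Spark version you actually run.

```python
# Hypothetical helper (not part of Spark): builds a spark-submit command line,
# optionally opting back into the pre-1.6 "legacy" memory manager.
def spark_submit_args(app_path, use_legacy_memory=False):
    args = ["spark-submit"]
    if use_legacy_memory:
        # spark.memory.useLegacyMode is false by default in Spark 1.6.0
        args += ["--conf", "spark.memory.useLegacyMode=true"]
    args.append(app_path)
    return args


if __name__ == "__main__":
    # "my_job.py" is a placeholder application path
    print(" ".join(spark_submit_args("my_job.py", use_legacy_memory=True)))
```

Leaving `use_legacy_memory` at its default produces a plain `spark-submit my_job.py`, which on 1.6.0 means the new memory manager is used.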

I described the “legacy” model of memory management in this article about Spark Architecture almost one year ago. I have also written an article on Spark Shuffle implementations that briefly touches on the memory management topic as well.

This article describes the new memory management model used in Apache Spark starting with version 1.6.0, which is implemented as UnifiedMemoryManager.

Continue reading →

Distributed Systems Architecture

brought to you by Alexey Grishchenko
