Tag Archives: architecture

Spark Architecture: Shuffle

This is my second article about Apache Spark architecture and today I will be more specific and tell you about the shuffle, one of the most interesting topics in the overall Spark design. The previous part was mostly about general Spark architecture and its memory management. It can be accessed here. The next one is about Spark memory management and it is available here.

Spark Shuffle Design

What is the shuffle in general? Imagine that you have a list of phone call detail records in a table and you want to calculate amount of calls happened each day. This way you would set the “day” as your key, and for each record (i.e. for each call) you would emit “1” as a value. After this you would sum up values for each key, which would be an answer to your question – total amount of records for each day. But when you store the data across the cluster, how can you sum up the values for the same key stored on different machines? The only way to do so is to make all the values for the same key be on the same machine, after this you would be able to sum them up.

Continue reading

Spark Architecture

Edit from 2015/12/17: Memory model described in this article is deprecated starting Apache Spark 1.6+, the new memory model is based on UnifiedMemoryManager and described in this article

Over the recent time I’ve answered a series of questions related to ApacheSpark architecture on StackOverflow. All of them seem to be caused by the absence of a good general description of the Spark architecture in the internet. Even official guide does not have that many details and of cause it lacks good diagrams. Same for the “Learning Spark” book and the materials of official workshops.

In this article I would try to fix this and provide a single-stop shop guide for Spark architecture in general and some most popular questions on its concepts. This article is not for complete beginners – it will not provide you an insight on the Spark main programming abstractions (RDD and DAG), but requires their knowledge as a prerequisite.

This is the first article in a series. The second one regarding shuffle is available here. The third one about new memory management model is available here.

Continue reading

Hadoop MapReduce Comprehensive Description

Map Reduce is a really popular paradigm in distributed computing at the moment. The first paper describing this principle is the one by Google published in 2004. Nowadays Map Reduce is a term that everyone knows and everyone speaks about, because it was put as one of the foundations to the Hadoop project. For most of the people Map Reduce is an equivalent to “Hadoop” and “Big Data”, which is completely wrong. But there are some people that understand the simplest case with WordCount and maybe even building an inverted index using Map Reduce.

But being simple as a concept, it has a kind of complicated implementation in Hadoop. I’ve tried to find a comprehensive description of it with a good diagram over the internet, but failed. All the diagrams keep repeating “Map – Sort – Combine – Shuffle – Reduce”. Of course, it is good to know that the framework works this way, but what about dozens of parameters that are tunable for the framework? What happens if you reduce the buffer size of the Map output or increase it? These diagrams don’t offer any help for this. This was the reason for me to build my own diagram and my own description based on the latest source code available in the Hadoop repository.

Hadoop MapReduce Comprehensive Diagram

Hadoop MapReduce Comprehensive Diagram

Continue reading

Twitter Architecture Analysis (part 1)

Over the time, the more people get internet connectivity, the more complicated internet services become. Twitter is one of the most complicated distributed systems deployed as for now, and it is really interesting to understand how it works under the hood.

If you pretend to be a distributed systems architect, the common question on your interview would looks like this: “Imagine that you need to build a Twitter from scratch. Define the technologies you use for the backend and perform initial system sizing”. In this article I will give you my understanding of this problem and provide an example of the answer that I’d consider to be a good one, even though it might be far from the real state of things. Be aware that I have no relation to the Twitter company itself and everything stated below is just my thoughts on the topic stated above. Continue reading