As more people get internet connectivity, internet services become more complicated. Twitter is one of the most complex distributed systems deployed today, and it is really interesting to understand how it works under the hood.
If you apply for a distributed systems architect position, a common interview question looks like this: “Imagine that you need to build Twitter from scratch. Define the technologies you would use for the backend and perform initial system sizing.” In this article I will give you my understanding of this problem and provide an example of an answer that I'd consider a good one, even though it might be far from the real state of things. Be aware that I have no relation to the Twitter company itself, and everything stated below is just my thoughts on the topic.
First of all, let's start with the requirements for our system and a simple SLA we will try to meet. According to public information, Twitter currently handles this workload:
- 400m tweets a day on average
- 5k tweets/sec (tweets per second, or “tps”) on a daily average, 7k tps at the daily peak, 12k tps at the peak of major international events, and 150k tps as the maximum ever observed
- 300k read requests per second on average; 50% of them finish within 3.5 seconds, 99% within 5 minutes
- Some of the users have millions of followers
- 150m active users
The main functions we will cover in our design are:
- Read your tweet stream
- Read your “friend” stream, which consists of the tweets of the people you follow
- Hashtag search
- Text search
It is obvious that this workload can be handled only by a distributed system, so let's go through the options. Which distributed system can serve 300k requests per second under a strict SLA? I think it is again obvious that we should start by considering an in-memory system and perform the sizing based on this approach; if it turns out that the amount of memory we'd need in the cluster is too large, we can consider a different approach with tiering and caching.
For storing the tweet itself and its metadata, the most appropriate choice would be some kind of key-value store, where the key is the tweet ID and the value is the tweet data and metadata. Let's perform the sizing of this key-value store. Each tweet on average consists of a tweet ID (8 bytes), tweet text (200 bytes), tweet timestamp (8 bytes), tweet creator ID (8 bytes), and location and other meta-information (100 bytes). In total one tweet would take 324 bytes; let's call it 200 bytes on average (not all tweets are that long or full of links and images). Storing all the tweets for one day then costs 80 GB of data. How many tweets should we store online? It mostly depends on how many tweets a user can scroll back through (100? 500? 1000?), so taking 1000 won't be far from the truth. Next, we don't need to keep tweets for inactive users in memory, so taking only the 150m active users with 1000 tweets each at 200 bytes per tweet gives 30 TB of data, or 90 TB of RAM with 3x replication. With 1U blade servers carrying 256 GB of RAM each, this data would occupy about 360 such servers, or approximately 9 racks of floor space in the datacenter. I think this estimate is roughly 2 times bigger than the real state of things at Twitter, but let's start with it. The choice of a particular key-value store won't affect the design much, as in general any in-memory key-value store will fit. If the design of the k-v store does not allow clusters this big (360 machines), it won't be too complicated to implement client-side sharding, for instance across 6 clusters of 60 machines each, sharded by the hash of the tweet ID.
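The back-of-envelope arithmetic above can be checked with a short script. All inputs are the assumptions stated in the text, not measured figures; the ~360 servers in the text is the computed figure rounded up.

```python
# Back-of-envelope sizing for the tweet key-value store,
# using the assumptions stated in the text.
AVG_TWEET_BYTES = 200
TWEETS_PER_DAY = 400_000_000
ACTIVE_USERS = 150_000_000
TWEETS_KEPT = 1_000            # tweets kept online per active user
REPLICATION = 3
RAM_PER_SERVER_GB = 256

daily_gb = TWEETS_PER_DAY * AVG_TWEET_BYTES / 1e9
raw_tb = ACTIVE_USERS * TWEETS_KEPT * AVG_TWEET_BYTES / 1e12
total_tb = raw_tb * REPLICATION
servers = total_tb * 1_000 / RAM_PER_SERVER_GB   # ~352, rounded up to ~360

print(f"{daily_gb:.0f} GB/day, {raw_tb:.0f} TB raw, "
      f"{total_tb:.0f} TB with replication, ~{servers:.0f} servers")
```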
Going forward, we need to implement the “friend stream” in our system. To implement it we will need somewhere to store the user relation graph. Let's first estimate its size: with user IDs of 8 bytes each, every edge of the graph is a pair of two IDs, or 16 bytes. We won't be far from the truth if we estimate the number of friends per Twitter user as 100, which gives us 15b relations in total. But we won't store this thing as a graph (guess why?); instead, let's store the list of friends for each user as a list of IDs in a key-value store: the key is the user ID, the value is the list of friend IDs. This gives us approximately 15b relations * 8 bytes * 10% overhead * 3x replication ≈ 400 GB, which can easily be stored on a small cluster of 3 machines, each holding a full ~130 GB copy of the graph. Again, here we can use any k-v store.
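A minimal sketch of this adjacency-list layout, with a plain Python dict standing in for the distributed key-value store (the function names here are illustrative, not a real API):

```python
# Adjacency-list layout: key = user ID, value = list of friend
# (followee) IDs. A plain dict stands in for the distributed store.
friend_store = {}

def follow(user_id, friend_id):
    # append friend_id to the user's friend list, creating it on first use
    friend_store.setdefault(user_id, []).append(friend_id)

def friends_of(user_id):
    # a user with no record simply has no friends yet
    return friend_store.get(user_id, [])

follow(42, 7)
follow(42, 9)
print(friends_of(42))    # [7, 9]
```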
Returning to the friend stream: now for each user we can get their list of friends. But what do you think is the most popular activity on Twitter? Of course, reading your friend stream; as you can see from the input numbers, reading is 60 times more popular than tweeting. This means we must be lightning-fast at retrieving data for the friend stream. We should also be fast at posting a tweet, because celebrities would be disappointed to wait several minutes while their tweet is being published (and delivered to millions of their followers). How do we approach this? The solution is to make the posting process work like this: synchronously write the post and its metadata to the post key-value store, return “success” to the customer, and then update the friend streams asynchronously. We need to store the friend stream for each user; let's consider it 1000 records long (are you really going to read more than 1000 tweets?), which means 150m users * 1000 tweet IDs * 8 bytes each = 1.2 TB of data, or 3.6 TB with 3x replication, which can be stored on ~15 machines, another half rack of space.
But what if the user wants to see his own stream of posts? This can easily be done with another object in the key-value store; let's call it the “user stream”. For each user it stores his latest 1000 tweet IDs, which gives us 1000 * 150m * 8 bytes = 1.2 TB of data, which as before can be stored on ~15 machines.
How should the “post” process work in this scenario? How will we make it handle peak loads?
- Write the post to the queue of incoming posts
- Return success to the customer
- Get the post from incoming queue
- Write the post to the posts key-value store
- Write the post ID to the “user stream” key-value store
- Get the friend list for the user who made the post from the user graph key-value store
- For each of the friends write pairs (user ID, tweet ID) to the friend stream update queue
Friend stream updater:
- Get the pair (user ID, tweet ID) from the friend stream update queue
- Insert into friend stream key-value store information about new tweet
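The posting steps above can be sketched as follows. In-process dicts and deques stand in for the distributed stores and queues, and all names are illustrative. Note that for the fan-out step the “friend list” we need is really the poster's follower list: the users whose friend streams should receive the new tweet.

```python
# Sketch of the asynchronous posting pipeline described above.
from collections import deque
from itertools import count

tweet_store = {}            # tweet ID -> (creator ID, text)
user_stream = {}            # user ID  -> own tweet IDs
friend_stream = {}          # user ID  -> tweet IDs of people they follow
followers = {42: [1, 2]}    # poster ID -> follower IDs (example data)
incoming = deque()          # incoming-posts queue
fanout = deque()            # friend stream update queue
tweet_ids = count(1)

def post(user_id, text):
    incoming.append((user_id, text))   # write the post to the queue
    return "success"                   # return success to the customer

def process_incoming():
    user_id, text = incoming.popleft()                    # get post from queue
    tweet_id = next(tweet_ids)
    tweet_store[tweet_id] = (user_id, text)               # posts k-v store
    user_stream.setdefault(user_id, []).append(tweet_id)  # user stream store
    for follower in followers.get(user_id, []):
        fanout.append((follower, tweet_id))               # fan-out pairs

def process_fanout():
    follower, tweet_id = fanout.popleft()
    friend_stream.setdefault(follower, []).append(tweet_id)

post(42, "hello")
process_incoming()
while fanout:
    process_fanout()
print(friend_stream)    # {1: [1], 2: [1]}
```

In the real system the two `process_*` loops would be independent pools of workers draining the distributed queues, which is what absorbs the posting peaks.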
Quite complicated, huh? Let's think about the queues. If the input queue should handle up to 1 hour of data, it must hold 3600 seconds * 5000 tps * 200 bytes per tweet = 3.6 GB. Not much data? Let's make it even bigger: handling a full day of data will cost you 86.4 GB, which can be duplicated over 4 machines. In this case, even if the posting rate spikes to 150k tweets per second, you can still buffer about 48 minutes of data.
As for the second queue, the friend stream update one, it handles much less data: each record is only 16 bytes, which gives a requirement to store 86400 seconds * 5000 tps * 100 friends * 16 bytes = 691.2 GB ≈ 0.7 TB. To store this queue I propose a distributed queue with 3x replication, which would take 9 machines worth of hardware.
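Both queue sizes follow from the same arithmetic, which a few lines of Python make explicit (all inputs are the assumed averages from the text):

```python
# Back-of-envelope sizing for one day of data in both queues.
TPS = 5_000            # average tweets per second
TWEET_BYTES = 200      # average stored tweet size
AVG_FRIENDS = 100      # average fan-out per tweet
PAIR_BYTES = 16        # one (user ID, tweet ID) pair
DAY = 86_400           # seconds in a day

input_queue_gb = DAY * TPS * TWEET_BYTES / 1e9
fanout_queue_gb = DAY * TPS * AVG_FRIENDS * PAIR_BYTES / 1e9

print(f"input queue, one day: {input_queue_gb:.1f} GB")     # 86.4 GB
print(f"fan-out queue, one day: {fanout_queue_gb:.1f} GB")  # 691.2 GB
```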
What about the most popular process, reading? It is simple. To read the user stream:
- Get the list of tweet IDs for the given user ID from the “user stream” key-value store
- Cache this list
- Retrieve the 20 latest tweets from the tweet key-value store
- Read ~40 more tweets into the cache in the background to speed up navigation
Reading the friend stream works exactly the same way, except that in the first step we read the IDs not from the “user stream” store but from the “friend stream” store.
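The read path for either stream can be sketched like this. Plain dicts stand in for the stores, the cache is a dict as well, and the background prefetch is only indicated by a comment; all names are illustrative.

```python
# Sketch of the read path: fetch the ID list, cache it, return the
# latest 20 tweets newest-first.
PAGE_SIZE = 20

tweet_store = {i: f"tweet {i}" for i in range(1, 101)}  # example data
user_stream = {42: list(range(1, 101))}                 # oldest..newest
cache = {}

def read_stream(user_id, stream_store):
    ids = stream_store.get(user_id, [])
    cache[user_id] = ids                    # cache the ID list
    latest = ids[-PAGE_SIZE:][::-1]         # newest first
    # (background prefetch of ~40 more tweets would be kicked off here)
    return [tweet_store[i] for i in latest]

page = read_stream(42, user_stream)
print(page[0])    # prints: tweet 100
```

Reading the friend stream is the same call with the friend stream store passed as `stream_store` instead of the user stream store.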
Now let’s think about the data layout. What we have in terms of data:
- Tweet key-value store – 90TB
- Friend stream – 3.6TB
- User stream – 3.6TB
- Stream input queue – 0.25TB
- Friend stream update queue – 2.1TB
Combining the tweet key-value store with anything else does not look like a good idea. The same goes for the queues: the input queue should be really fast, while the friend stream update queue may be slightly slower. But the friend stream and user stream key-value stores, I think, should be combined on the same cluster: this gives better performance for user stream lookups, the more common case, while still keeping both streams in one place.
The overall processes might be pictured in the following way:
This is enough for a single article. Next time I will write down my thoughts about the search implementation, touch on the disaster recovery problem, and discuss the performance this architecture will provide and how to improve it.