When you are ready to start your “big data” initiative with Hadoop, one of the first questions you will face is cluster sizing. What is the right hardware to choose in terms of price/performance? How much hardware do you need to handle your data and your workload? I will do my best to answer these questions in this article.
Let’s start with the simplest thing: storage. Many articles on the internet suggest sizing your cluster purely based on its storage requirements, which is wrong, but it is a good starting point. So, the cluster you want to build should be planned for X TB of usable capacity, where X is the amount you’ve calculated based on your business needs. Don’t forget to take into account the data growth rate and the data retention period you need. This is not a complex exercise, so I hope you have at least a basic understanding of how much data you want to host on your Hadoop cluster. To host X TB of data with the default replication factor of 3 you would need 3*X TB of raw storage.
You might be tempted to lower the replication factor to save on raw capacity. Well, you can do it, but it is strongly not recommended, and here’s why:
First, Hadoop cluster design best practice assumes the use of JBOD drives, so you don’t have RAID data protection. If replication factor 2 is used on a small cluster, you are almost guaranteed to lose data when 2 HDDs fail in different machines. Imagine a cluster for 1PB of data: it would have 576 x 6TB HDDs to store the data and would span 3 racks. Each 6TB HDD would store approximately 30’000 blocks of 128MB, so the probability that losing 2 HDDs in different racks does not cause data loss is close to 1e-27 percent; in other words, the probability of data loss is 99.999999999999999999999999999%.
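Here is a minimal back-of-the-envelope sketch of that kind of estimate, assuming blocks (and their single extra replica under replication factor 2) are placed uniformly at random across the disks; the exact exponent depends on the placement policy and rack awareness, but the survival probability is vanishingly small either way:

```python
import math

# Illustrative assumptions, matching the example above:
disks = 576              # 6TB HDDs in the cluster
blocks_per_disk = 30000  # ~128MB blocks stored on each 6TB disk

# Under replication factor 2, each block on a failed disk has exactly one
# partner replica on some other disk. If a second disk also fails, a block
# is lost when its partner happens to sit on that second disk.
p_partner_on_failed_disk = 1.0 / (disks - 1)

# Probability that NONE of the blocks on the first failed disk had its
# partner on the second failed disk, i.e. no data is lost:
p_no_data_loss = (1 - p_partner_on_failed_disk) ** blocks_per_disk

print(f"P(no data loss) ~ {p_no_data_loss:.3e}")      # ~2e-23 under these assumptions
print(f"P(data loss)    ~ {1 - p_no_data_loss:.12f}")  # effectively 1.0
```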
Next, the more replicas of data you store, the better your data processing performance will be. First, consider speculative execution, which allows a “speculative” task to run on a different machine and still use local data. Second is read concurrency: data that is concurrently read by many processes can be served from different machines, taking advantage of parallelism with local reads.
So replication factor 3 is the recommended one. In the future, if you grow big enough to face storage efficiency problems just like Facebook, you might consider using Erasure Coding for cold historical data (https://issues.apache.org/jira/browse/HDFS-7285).
But putting just 3*X TB of raw capacity in place is not enough. As you know, Hadoop stores temporary data on local disks while it processes the data, and the amount of this temporary data might be very high. Imagine you store a single table of X TB on your cluster and you want to sort the whole table. How much space do you think you would need? X TB for the mapper outputs and X TB for the reduce-side merge, and this does not even count storing the output of the sort. This is why the rule of thumb is to leave at least X TB of raw storage for temporary data in this case. You might think you won’t need it, for instance because you use HBase, but HBase also requires additional storage when it performs region merges, so you won’t get away from the temporary storage requirement.
Having said this, my estimate of the raw storage required for storing X TB of data would be 4*X TB.
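To make the arithmetic concrete, here is a tiny sketch of this rule of thumb; the function name and the 3x/1x split are just the numbers from the reasoning above:

```python
def raw_storage_tb(usable_tb: float, replication: int = 3, temp_factor: float = 1.0) -> float:
    """Raw capacity needed: replicated data plus headroom for temporary data."""
    return usable_tb * replication + usable_tb * temp_factor

# 500 TB of business data with the default replication factor of 3
# and ~1x extra for temporary data -> 2000 TB of raw storage.
print(raw_storage_tb(500))  # 2000.0
```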
General advice for systems with <2 racks: don’t put data compression into your sizing estimation. For systems with >2 racks or 40+ servers, you might consider compression, but the only way to be close to reality here is to run a PoC: load your data into a small cluster (maybe even a VM), apply the appropriate data model and compression, and measure the compression ratio. Then divide the measured ratio by 1.5x – 2x as a safety margin and put that into the sizing.
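As a small sketch of that safety margin, if a PoC shows a 6x compression ratio, a conservative sizing ratio would look like this (the 1.5x–2x divisor is the margin suggested above):

```python
def sizing_compression_ratio(poc_ratio: float, safety_divisor: float = 2.0) -> float:
    """Discount the PoC-measured compression ratio before using it for sizing."""
    return poc_ratio / safety_divisor

print(sizing_compression_ratio(6.0))       # 3.0 with the conservative 2x divisor
print(sizing_compression_ratio(6.0, 1.5))  # 4.0 with the optimistic 1.5x divisor
```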
Talking with vendors you will hear different numbers, from 2x to 10x and more. I’ve seen a presentation where Oracle claimed to apply columnar compression to the customer’s data and deliver an 11x compression ratio, fitting all the data into a small Exadata cluster. Of course, the main goal of the sales guys is to get into the account; after that they might claim their presentation was not correct, or that you didn’t read the comment in small gray font in the footer of the slide.
In fact, compression completely depends on the data. Some data compresses well while other data won’t compress at all. For example, say you store CDR data in your cluster. Based on my experience it can be compressed at roughly 7x. Now imagine you store huge sequencefiles with JPEG images as binary values and unique integer ids as keys. Which compression ratio will you get with this data? Only about 1x, i.e. the data won’t be compressed at all or will be compressed only very slightly.
Here’s a good article from Facebook where they claim to have 8x compression with the ORCFile format (https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/). But this did not come easily: they ran a complex research project on this subject and even improved the ORCFile internals to get better compression. So if you don’t have as many resources as Facebook, you shouldn’t take 4x+ compression as a given.
One more thing before we go forward. Say you store the data in a well-compressed way, for instance in Parquet files. But what happens to the intermediate data produced by MapReduce? In fact, it would be in a sequencefile format with an option to compress it, and sequencefile compression is not on par with columnar compression. So when you process a huge table (for instance by sorting it), you will need much more temporary space than you might initially assume. Consider a table of Y GB on disk, compressed 5x with Parquet, while sequencefile compression gives 2x, which is usually a fair assumption. Uncompressed, the table is 5*Y GB, which compresses to 2.5*Y GB in sequencefiles; counting both the map output and the reduce-side merge, you would need at least 5*Y GB of temporary space to sort this table. So be careful with putting compression into the sizing, as it might hurt you later from a direction you didn’t expect.
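A small sketch of that arithmetic, using the hypothetical 5x/2x ratios from the example above:

```python
def temp_space_for_sort_gb(stored_gb: float, storage_ratio: float, intermediate_ratio: float) -> float:
    """Temporary space for a full sort: map output plus reduce-side merge,
    both kept in the (weaker) intermediate compression format."""
    uncompressed_gb = stored_gb * storage_ratio              # e.g. Parquet at 5x
    intermediate_gb = uncompressed_gb / intermediate_ratio   # e.g. sequencefile at 2x
    return 2 * intermediate_gb                               # map output + reduce merge

# A 100 GB Parquet table (5x) with 2x intermediate compression
# needs ~500 GB of temporary space to be sorted.
print(temp_space_for_sort_gb(100, storage_ratio=5, intermediate_ratio=2))  # 500.0
```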
After all these exercises you have a fair sizing of your cluster based on storage. But what about CPU, memory and cluster bandwidth? I would start with the last one, IO bandwidth. It is simple to calculate. A typical 2.5” SAS 10k rpm HDD gives you roughly 85 MB/sec sequential scan rate; a typical 3.5” SATA 7.2k rpm HDD gives you ~60 MB/sec. Now go back to the SLAs you have for your system. For instance, you store CDR data, and you know that both a filter query and an aggregation on a low-cardinality column should return the result in Z seconds. Given that such a query would have the whole system to itself, you can make a high-level estimate of its runtime: it would have to scan X TB of data in Z seconds, which implies your system should have a total scan rate of X/Z TB/sec. In case of SATA drives, which is a typical choice for Hadoop, you should have at least (X*1’000’000)/(Z*60) HDDs. This is a fairly simplified picture; here you can find an Excel sheet I’ve created for sizing the Hadoop cluster with more precise calculations.
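Here is a minimal sketch of that formula; the 60 MB/sec figure is the SATA scan rate mentioned above, so treat it as an assumption and substitute your own measured rates:

```python
import math

def disks_for_scan_sla(data_tb: float, sla_seconds: float, scan_mb_per_sec: float = 60.0) -> int:
    """Minimum number of data HDDs needed to scan data_tb within sla_seconds,
    assuming a fully parallel scan at scan_mb_per_sec per disk."""
    total_mb = data_tb * 1_000_000              # TB -> MB (decimal units)
    required_rate = total_mb / sla_seconds      # MB/sec the whole cluster must sustain
    return math.ceil(required_rate / scan_mb_per_sec)

# Example: scan 500 TB in 1 hour on SATA drives -> ~2315 data disks.
print(disks_for_scan_sla(500, 3600))
```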
Now you can imagine your target system both in terms of size and performance, but you still need to know which CPU and RAM to use. For CPU the idea is very simple: at the very minimum you should have 1 CPU core for each HDD, as it would handle the thread processing the data from that HDD. Given the compression used in the system and intermediate data compression, I would recommend having at least twice as many cores as HDDs with data. With a typical 12-HDD server where 10 HDDs are used for data, you would need 20 CPU cores to handle them, or 2 x 6-core CPUs (given hyperthreading). 8+ core CPUs would be a more appropriate option if you plan to use Spark, as it handles more processing in memory and hits the disks less, so CPU might become the limiting resource.
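As a sketch of this rule of thumb (the 2x cores-per-data-disk factor is the recommendation above):

```python
def cores_per_server(data_disks: int, cores_per_disk: int = 2) -> int:
    """Rule of thumb: ~2 cores per data HDD (hyperthreads counted as cores)."""
    return data_disks * cores_per_disk

# A 12-disk server with 10 data disks -> 20 cores,
# e.g. 2 x 6-core CPUs when hyperthreading is counted.
print(cores_per_server(10))  # 20
```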
Regarding the amount of RAM: the more RAM you have, the better. Simply having more RAM on your servers gives you more OS-level cache. If you start tuning performance, it allows you to have more HDFS cache available for your queries. And with Spark, it allows the engine to keep more RDD partitions in memory. The drawback of lots of RAM is more heat and more power consumption, so consult with the hardware vendor about the power and cooling requirements of your servers. At the moment of writing, the best option seems to be 384GB of RAM per server, i.e. all 24 slots filled with 16GB memory sticks. 32GB sticks are more than twice as expensive as 16GB ones, so it is usually not reasonable to use them.
This finishes the slave node sizing calculation. As for the master nodes, depending on the cluster size you might have from 3 to 5-6 of them. All of them have similar requirements: plenty of CPU and RAM, but lower storage requirements. You can put 6 x 900GB 2.5” HDDs in RAID10, which works perfectly fine and gives you enough storage and redundancy. The number of master nodes depends on the cluster size: for a small cluster you might put the Namenode, Zookeeper, Journal Node and YARN Resource Manager together on a single host, while for a bigger cluster you would want the NN to live on its host alone.
In terms of network, it is not feasible anymore to mess with 1GbE. Going with 10GbE will not drastically increase the price, but it leaves you plenty of room for your cluster to grow. In short, network design is not that complex, and many companies like Cisco and Arista have reference designs that you can use. Of course, the best option is a network with no oversubscription, as Hadoop uses the network heavily.
So to finish the article, here’s an example of sizing the slave machines of a 1PB cluster from my Excel sheet:
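For readers without the spreadsheet, here is a rough sketch that chains the rules of thumb from this article for a 1PB example; the server shape (6TB disks, 10 data disks per server) and all the ratios are assumptions you should replace with your own numbers, so treat the output as a starting point rather than a bill of materials:

```python
import math

def size_cluster(usable_tb, disk_tb=6, data_disks_per_server=10,
                 replication=3, temp_factor=1.0):
    """Chain the rules of thumb from the article into a rough slave-node count."""
    raw_tb = usable_tb * replication + usable_tb * temp_factor   # the 4x rule of thumb
    data_disks = math.ceil(raw_tb / disk_tb)
    servers = math.ceil(data_disks / data_disks_per_server)
    cores = data_disks_per_server * 2                             # 2 cores per data disk
    return {"raw_tb": raw_tb, "data_disks": data_disks, "servers": servers,
            "cores_per_server": cores, "ram_gb_per_server": 384}

# 1 PB (1000 TB) of usable data:
print(size_cluster(1000))
# {'raw_tb': 4000.0, 'data_disks': 667, 'servers': 67,
#  'cores_per_server': 20, 'ram_gb_per_server': 384}
```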