Hadoop Cluster Sizing

When you are ready to start your “big data” initiative with Hadoop, one of your first questions will be about cluster sizing. What is the right hardware to choose in terms of price/performance? How much hardware do you need to handle your data and your workload? I will do my best to answer these questions in this article.

Measuring the Elephant

Let’s start with the simplest thing: storage. Many articles on the internet suggest sizing your cluster purely based on its storage requirements, which is wrong, but it is a good starting point. So, the cluster should be planned for X TB of usable capacity, where X is the amount you’ve calculated based on your business needs. Don’t forget to take into account the data growth rate and the data retention period you need. This is not a complex exercise, so I hope you have at least a basic understanding of how much data you want to host on your Hadoop cluster. To host X TB of data with the default replication factor of 3, you need 3*X TB of raw storage.


Why is the default replication factor 3, and can we reduce it?

Well, you can, but it is strongly discouraged, and here’s why:

First, Hadoop cluster design best practice assumes the use of JBOD drives, so you don’t have RAID data protection. If replication factor 2 is used on a small cluster, you are almost guaranteed to lose data when 2 HDDs fail in different machines. Imagine a cluster for 1PB of data: it would have 576 x 6TB HDDs to store the data and would span 3 racks. Each 6TB HDD stores approximately 30’000 blocks of 128MB, so the probability that 2 HDDs failing in different racks causes no data loss is close to 1e-27 percent. In other words, the probability of data loss is 99.999999999999999999999999999%.
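As a rough sanity check of that figure, here is a toy model: assume replication factor 2, ~30’000 blocks per failed drive, and uniformly random placement of each block’s second replica across the other 575 drives. None of these assumptions is exact for a real HDFS cluster (rack awareness changes the placement), so this sketch lands in the same vanishingly-small ballpark as the article’s estimate rather than reproducing it exactly:

```python
# Toy model: probability that two simultaneous drive failures do NOT
# lose data on the 1PB example cluster (replication factor 2).
blocks_per_drive = 30_000   # ~30'000 x 128MB blocks on a 6TB HDD
other_drives = 575          # 576 drives in total, minus the first failed one

# With RF=2, each block on the failed drive has exactly one other replica,
# stored on one of the other 575 drives. Every one of the 30'000 blocks
# must miss the second failed drive for the data to survive.
p_survive = (1 - 1 / other_drives) ** blocks_per_drive
print(f"P(no data loss) ~ {p_survive:.1e}")   # vanishingly small
print(f"P(data loss)    ~ {1 - p_survive}")   # effectively 1
```

Even this simplified model shows that surviving a double-drive failure at replication factor 2 is essentially impossible at this scale.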

Next, the more replicas of data you store, the better your data processing performance. First, consider speculative execution, which allows the “speculative” task to run on a different machine and still use local data. Second is read concurrency: data that is concurrently read by many processes can be read from different machines, taking advantage of parallelism with local reads.


So replication factor 3 is the recommended one. In the future, if you grow big enough to face storage efficiency problems just like Facebook, you might consider using Erasure Coding for cold historical data (https://issues.apache.org/jira/browse/HDFS-7285).
Facebook Erasure Coding Experiments

But provisioning just 3*X TB of raw capacity is not enough. As you know, Hadoop stores temporary data on local disks while it processes the data, and the amount of this temporary data can be very high. Imagine you store a single table of X TB on your cluster and you want to sort the whole table. How much space do you think you need? X TB for mapper outputs and X TB for the reduce-side merge, and this amount does not count storing the output of the sort. This is why the rule of thumb is to leave at least X TB of raw storage for temporary data. You might think you won’t need it, for instance because you use HBase, but HBase also requires additional storage when it performs region merges, so you won’t escape the temporary storage requirement.

Having said this, my estimation of the raw storage required for storing X TB of data would be 4*X TB.
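To make the arithmetic concrete, here is a minimal sketch of the 4*X rule of thumb. The helper names, the 50% yearly growth rate and the 2-year horizon are illustrative assumptions, not figures from the article:

```python
def usable_capacity_tb(current_tb, yearly_growth, years):
    """Usable capacity X after compound data growth over the retention horizon."""
    return current_tb * (1 + yearly_growth) ** years

def raw_storage_tb(x_tb, replication=3, temp_factor=1):
    """Raw storage: replicated data (3*X by default) plus ~X for temporary data."""
    return x_tb * replication + x_tb * temp_factor

x = usable_capacity_tb(100, yearly_growth=0.5, years=2)  # 100 TB today -> 225 TB
print(raw_storage_tb(x))  # 900.0 TB of raw storage for 225 TB of usable data
```

In other words, plan raw capacity against the data size you expect at the end of the retention period, not against what you store today.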

General advice for systems with fewer than 2 racks: don’t put data compression into your sizing estimation. For systems with more than 2 racks or 40+ servers, you might consider compression, but the only way to get close to reality is to run a PoC: load your data into a small cluster (maybe even a VM), apply the appropriate data model and compression, and measure the compression ratio. Divide it by 1.5x–2x and put that into the sizing.

If you don’t agree with this, you can read more here

Talking with vendors, you will hear different numbers, from 2x to 10x and more. I’ve seen a presentation by Oracle where they claimed to apply columnar compression to the customer’s data and deliver an 11x compression ratio to fit all the data into a small Exadata cluster. Of course, the main goal of the sales guys is to get into the account; after that they might claim their presentation was not correct, or that you didn’t read the comment in small gray font in the footer of the slide.

In fact, compression completely depends on the data. Some data compresses well while other data won’t compress at all. For example, say you store CDR data in your cluster. Based on my experience it can be compressed at roughly 7x. Now imagine you store huge sequencefiles with JPEG images as binary values and unique integer ids as keys. What compression will you get with this data? Only 1x, i.e. the data won’t be compressed at all, or only very slightly.

Here’s a good article from Facebook where they claim 8x compression with the ORCfile format (https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/). But this did not come easily: they ran a complex research project on the subject and even improved the ORCfile internals to deliver better compression. So if you don’t have as many resources as Facebook, you shouldn’t treat 4x+ compression as a given.


One more thing before we go forward. Say you store the data in a well-compressed way, for instance in Parquet files. What happens with the intermediate data produced by MapReduce? In fact, it is in sequencefile format with an option to compress it. But sequencefile compression is not on par with columnar compression, so when you process a huge table (for instance, by sorting it), you need much more temporary space than you might initially assume. Consider a table stored in Y GB, where Parquet compression is 5x while sequencefile compression is 2x, which is usually a fair assumption. Then you would need at least 5*Y GB of temporary space to sort this table. So be careful with putting compression into the sizing, as it might hurt you later from a direction you didn’t expect.
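Here is the 5*Y arithmetic spelled out as a small sketch (the function name and the 5x/2x ratios are the illustrative assumptions from the paragraph above, not universal constants):

```python
def sort_temp_space_gb(stored_gb, columnar_ratio=5, seqfile_ratio=2):
    """Temporary space needed to sort a table stored with columnar compression.

    stored_gb: on-disk size of the table in Parquet (Y GB at 5x compression).
    The shuffle re-materializes the data as sequencefile at only 2x,
    once for mapper output and once for the reduce-side merge.
    """
    uncompressed = stored_gb * columnar_ratio    # logical table size
    shuffled = uncompressed / seqfile_ratio      # seqfile-compressed map output
    return shuffled * 2                          # map output + reduce-side merge

print(sort_temp_space_gb(100))  # 500.0 -> 5x the stored table size
```

So a table occupying 100 GB on disk can demand around 500 GB of temporary space during a full sort under these ratios.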

Hadoop MapReduce Comprehensive Diagram

After all these exercises you have a fair sizing of your cluster based on storage. But what about CPU, memory and cluster bandwidth in general? I will start with the last one, IO bandwidth, as it is simple to calculate. A typical 2.5” SAS 10k rpm HDD gives you roughly 85 MB/sec sequential scan rate; a typical 3.5” SATA 7.2k rpm HDD gives you ~60 MB/sec. Now go back to the SLAs you have for your system. For instance, you store CDR data, and you know that both a filter query and an aggregation on a low-cardinality column should return the result in Z seconds. Assuming this query utilizes the whole system alone, you can make a high-level estimation of its runtime from the fact that it must scan X TB of data in Z seconds, which implies your system should have a total scan rate of X/Z TB/sec. In the case of SATA drives, which is a typical choice for Hadoop, you should have at least (X*1’000’000)/(Z*60) HDDs. This is a fairly simplified picture; here you can find an Excel sheet I’ve created for sizing the Hadoop cluster with more precise calculations.
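A minimal sketch of this throughput formula, converting TB to MB (1 TB = 1’000’000 MB here, matching the formula above). The function name and the example SLA of scanning 10 TB in one hour are assumptions for illustration:

```python
import math

def hdds_for_scan(x_tb, z_seconds, mb_per_sec_per_hdd=60):
    """Minimum number of SATA HDDs (~60 MB/sec each) to scan x_tb TB
    in z_seconds seconds: (X * 1'000'000) / (Z * 60), rounded up."""
    required_mb_per_sec = x_tb * 1_000_000 / z_seconds  # total scan rate needed
    return math.ceil(required_mb_per_sec / mb_per_sec_per_hdd)

print(hdds_for_scan(10, 3600))  # scan 10 TB in 1 hour -> 47 SATA drives
```

Note how quickly the drive count grows with a tighter SLA: halving Z doubles the number of spindles you need.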

Now you can imagine your target system both in terms of size and performance, but you still need to choose CPU and RAM. For CPU the idea is very simple: at the very minimum you should have 1 CPU core for each HDD, as it handles the thread processing the data from that HDD. Given the compression used in the system and intermediate data compression, I recommend at least 2x as many cores as HDDs with data. With a typical 12-HDD server where 10 HDDs are used for data, you need 20 CPU cores to handle it, or 2 x 6-core CPUs (given hyperthreading). 8+-core CPUs are a more appropriate option if you plan to use Spark, as it handles more processing in memory and hits the disks less, so the CPU might become the limiting resource.
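The core-count rule above can be sketched in one line (the helper name is mine; 2 cores per data HDD is the rule of thumb from the text, covering the 1-core minimum plus compression overhead):

```python
def min_cpu_cores(data_hdds, cores_per_hdd=2):
    """At least 1 core per data HDD; 2x to absorb (de)compression work."""
    return data_hdds * cores_per_hdd

print(min_cpu_cores(10))  # 20 cores for 10 data HDDs,
                          # e.g. 2 x 6-core CPUs given hyperthreading
```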

Regarding the amount of RAM: the more RAM you have, the better. Simply having more RAM on your servers gives you more OS-level cache. If you start tuning performance, it allows you to have more HDFS cache available for your queries. Next, with Spark it allows the engine to store more RDD partitions in memory. But the drawback of lots of RAM is more heat and more power consumption, so consult your hardware vendor about the power and heating requirements of your servers. At the moment of writing, the best option seems to be 384GB of RAM per server, i.e. all 24 slots filled with 16GB memory sticks. 32GB memory sticks are more than twice as expensive as 16GB ones, so it is usually not reasonable to use them.


So here we finish with the slave node sizing calculation. As for the master nodes, depending on the cluster size you might have from 3 to 5-6 of them. All of them have similar requirements: lots of CPU and RAM, but lower storage requirements. You can put in 6 x 900GB 2.5” HDDs in RAID10, which works perfectly fine and gives you enough storage and redundancy. The number of master nodes depends on the cluster size: for a small cluster you might put the Namenode, Zookeeper, Journal Node and YARN Resource Manager together on a single host, while for a bigger cluster you would want the NN to live on a host alone.

In terms of network, it is no longer sensible to bother with 1GbE. Going with 10GbE will not drastically increase the price but leaves you big room for your cluster to grow. In short, network design is not that complex, and many companies like Cisco and Arista have reference designs you can use. Of course, the best option is a network with no oversubscription, as Hadoop uses the network heavily.

So, to finish the article, here’s an example of sizing the slave machines of a 1PB cluster in my Excel sheet:

1PB Hadoop Cluster Sizing

14 thoughts on “Hadoop Cluster Sizing”

  1. Ladislav Jech

    First of all, thanks a lot for this great article. I am preparing to build an experimental 100TB Hadoop cluster these days, so it is very handy.

    Question 1:
    There is some issue with the Cache Size GB constant value: when I set the target Data Size TB to 48 and configure the values for 1-rack usage, I get a negative value.
    There is the formula =C6-((C7-2)*4+12), but my nodes might be sized in a different way.
    Why are you multiplying the CPU number minus 2 by the constant 4?
    And why are you adding the constant 12? Does it mean the number of disks?
    It is not clear to me.
    Thank you for the explanation. I am building my own Hadoop cluster at my lab as an experiment, but I would like to size it properly from the beginning.

    Hint:
    I have found this formula to calculate required storage and required node number:
    https://www.linkedin.com/pulse/how-calculate-hadoop-cluster-size-saket-jain

    Do you have some comments to this formula?

    Regards,

    Ladislav

    1. 0x0FFF Post author

      Hi. It is pretty simple. “C7-2” means that you reserve 2 cores per node for the OS, YARN NM and HDFS DN. “(C7-2)*4” means that, using the cluster for MapReduce, you give 4GB of RAM to each container, so “(C7-2)*4” is the amount of RAM that YARN operates with. 12 means that you leave 12GB of RAM for the OS (2GB), YARN NM (2GB) and HDFS DN (8GB). In total, subtracting the memory dedicated to YARN, the OS and HDFS from the total RAM size, you get the amount of free RAM that will be used as OS cache. If this number is negative, your cluster might suffer from memory pressure, and I personally would not recommend running such a config for Hadoop. My estimation is that you should have at least 4GB of RAM per CPU core
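      The spreadsheet formula =C6-((C7-2)*4+12) explained above can be sketched as follows (the parameter names are mine, not from the Excel sheet; C6 is node RAM in GB, C7 is CPU cores per node):

```python
def os_cache_gb(ram_gb, cpu_cores,
                gb_per_container=4,   # RAM given to each YARN container
                reserved_cores=2,     # cores kept for OS, YARN NM, HDFS DN
                reserved_gb=12):      # 2GB OS + 2GB YARN NM + 8GB HDFS DN
    """Free RAM left for OS cache: =C6-((C7-2)*4+12) from the Excel sheet."""
    yarn_gb = (cpu_cores - reserved_cores) * gb_per_container
    return ram_gb - (yarn_gb + reserved_gb)

print(os_cache_gb(128, 16))  # 128 - (14*4 + 12) = 60 GB left for OS cache
print(os_cache_gb(64, 16))   # -4: negative means likely memory pressure
```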

      Regarding the article you referred to: the formula is ok, but I don’t like the “intermediate factor” without a description of what it is. You can put this formula into the C26 cell of my Excel if you like, but I simply put S/c*4 = S/c*(3+1) = S/c*(r+1), because 99% of clusters run with a replication factor of 3

  2. Ladislav Jech

    Hi, it is clear now. Thank you very much.

    Do you have any experience with GPU acceleration for Spark processing over Hadoop, and how to integrate it into a Hadoop cluster as a best practice?

    To include GPUs directly in the Hadoop cluster nodes, I am thinking of going with 4U chassis with 24 drive bays, half of the drives for each node. I will probably only be able to fit 4 GPUs inside, powered by 2x E5-2630L v4 10-core CPUs. For RAM I will go with 512GB, maybe later 1TB.

    On the other hand, I think I will just leave one or two PCI slots free and forget about GPUs entirely for now; later, if the time comes, I will go with a 40GbE connection to a dedicated GPU cluster via MPI.

    What do you think about these GPU options from your perspective? I have of course read many articles on this over the internet, and I see that back in 2013 there were multiple scientific projects removed from Hadoop; now we have Aparapi, HeteroSpark, SparkCL, SparkGPU, etc.

    https://github.com/aparapi/aparapi
    https://github.com/kiszk/spark-gpu

    Thanks.

    Ladislav

    1. 0x0FFF Post author

      Unfortunately, I cannot give you advice without knowing your use case: what kind of processing you will do on the cluster, what kind of data you operate on and how much of it, etc. If your use case is deep learning, I’d recommend finding a subject matter expert in this field to advise you on infrastructure. On top of that, you should know that AWS provides instances with GPUs (for example, g2.8xlarge with 4 GPU cards), so you can rent them to validate your cluster design by running a proof of concept

      1. Ladislav Jech

        Yes, AWS is a good place to run POCs.

        Regarding the DL domain I am in touch with Chris Nicholson from Deeplearning4J project to discuss these specific areas.

        Thanks again.

  3. Ladislav Jech

    1. I plan to use HBase for real-time log processing from network devices (1,000 to 10k events per second). Following the Hadoop locality principle, I will install it in HDFS space directly on the Data Node servers. Is that the right assumption?

    2. Is your calculator aware of other components of the Hadoop ecosystem from a CPU and memory resource allocation perspective, or do you simply focus on HDFS purely as storage?

    3. What about Spark engine sizing? Should I add even more CPUs? 🙂

    Thanks again for any hints.

    Some info from my context.
    To be honest, I have been looking into this for a week already and am not sure what hardware to pick. I was looking at old Dell PowerEdge C100 or C200 3-4 node machines and other 2U solutions, but I am not sure about it 🙂

    Since I don’t yet know all the facts required to size the cluster exactly, I finally have the following custom build in place:
    – 1x 4U chassis with 24x 4-6TB drives + space for 2-4 internal 2.5” (SSD) drives for the OS (Gentoo)
    – Super Micro X10DRi-T4+ motherboard (4x 10GBase-T NICs, so possible Linux TCP multipath in the future)
    – 2x Intel Xeon E5-2698 20C (so 80 cores in total with hyperthreading)
    – 768 GB RAM – it is deadly expensive!!!, so probably less initially; I understand the impact already 😀

    I plan to run a 2-data-node setup on this machine, each node with 12 drives for HDFS allocation.

    I was thinking about VMware vSphere (easy to manage, but there is overhead with OS images for each node: master, slave, etc.), but I will probably rather go with Docker on a pure Linux system (CentOS or my favourite Gentoo) to let me dynamically assign resources on the fly to tune performance.

    Your blog gave me really great insight into this problem.

    1. 0x0FFF Post author

      Regarding HBase:
      1. Combining HBase with analytical workloads (Hive, Spark, MapReduce, etc.) is definitely not the best idea; never do this on a production cluster
      2. HBase for log processing? HBase is a key-value store, not a processing engine. Of course, you can save your events to HBase and then extract them, but what is the goal? A typical log processing pipeline uses Flume to consume the logs, then MapReduce to parse them and Hive to analyze them, for example. Do you really need real-time record access to specific log entries?
      3. HBase stores its data in HDFS, so you cannot install it into specific directories; it just utilizes HDFS, and HDFS in turn utilizes the directories configured for it

      I mainly focus on HDFS as it is the only component responsible for storing data in the Hadoop ecosystem. There are two types of sizing: by capacity and by throughput. Here I described sizing by capacity, the simple one, when you just plan to store and process a specific amount of data. Sizing for throughput is much more complex; it should be done on top of the capacity sizing (you need at least as many machines as the capacity sizing estimated, to store your data) and on top of your experience. For simplicity, I’ve put in a “Sizing Multiplier” that allows you to increase the cluster size above the one required by capacity sizing.

      For Spark, it really depends on what you want to achieve with this cluster. Of course, Spark would benefit from more CPUs and more RAM if your tasks are CPU-intensive, for example like machine learning

      Please, do whatever you want, but don’t virtualize Hadoop: it is a very, very bad idea, and Docker is not of big help here. Hadoop is designed to run on bare hardware with JBOD drives, so don’t overcomplicate things. For the money you are putting into the platform you described, you’d be better off buying a bunch of older 1U servers with older CPUs, 6 SATA HDDs (4-6TB each) and 64-96GB RAM. For the same price you would get more processing power and more redundancy.

      And lastly, don’t go with Gentoo. Use one of widely supported distributions – CentOS, Ubuntu or OpenSUSE

      1. Ladislav Jech

        1. Hm, I will keep this in mind; it makes sense to me.
        2. I simplified it too much. You are right, but there are 2 aspects of processing:
        – a first round of analysis happening directly after Flume provides the data
        – a second round once the data is persisted in an SQL-queryable database (it could even be Cassandra), to process log correlations and search for behavioral convergences. This can also happen in the first round in a limited way; I am not sure about the approach here, but that is what the experiment is about. The result is to generate firewall rules and apply them on the routers/firewalls, or block a user identity account, etc., in < 10s after the event is received.

        Of course, the second round is not meant for the < 10s rule at the moment.

        Regarding sizing: I have already spent a few days playing with different configurations and searching for the best approach, so I put some 1U servers up against the “big” server and ended up with the following table (keep in mind I search for the best prices, using ES versions of Xeons, for example, etc.)

        Big server (prices in GBP):
        CPU: 2x Intel Xeon (ES version) E5-2697 v4 20C, 80 threads – 1000
        Motherboard: Super Micro X10DRi-T4+ – 600
        RAM: 512GB – 1600
        Chassis: 24-bay (Netherlands custom build) – 400
        TOTAL – 3600

        Small bare-metal 1U nodes, each with 4 bays:
        2x hexa-core, 96GB RAM – 300 each
        TOTAL (6x nodes) – 1800

        For drives, I can get WD RED 6TB for around 150 GBP each, making a total of 3600, or I will go with 4TB at 100 each, so 2400 in total.

        So the decision has been almost done 🙂

        What remains on my list are possible bottlenecks and issues:
        – 1Gbit network: there are 2 ports, so I will bond them with multipath to help network throughput a little, getting 1.8 Gbit; for these boxes I don’t consider 10G, as it looks like overkill.
        – A custom RAID card might be required to support 6TB drives, but I will first try upgrading the BIOS

        Regarding virtualization
        If you have big servers, I think that could be the way. These days virtualization has very low performance overhead and gives you dynamic resource allocation management. Also, connecting storage at 40Gbit is not a big deal. But in general I 100% agree with what you are saying, and when going with 1U servers I will stay on bare metal.

        Regarding my favorite Gentoo
        I know, it can be troublesome, especially keeping packages up to date, so I will finally go with Ubuntu.

        Overall, thank you very much for this more than valuable discussion. I think I will come back to others of your great posts.

        1. 0x0FFF Post author

          1. ok
          2. To me it looks like a task for a tool like Apache Spark or Apache Flink with a sliding window: analyzing the last X seconds to find specific patterns and react in real time. Is it something like DDoS protection?
          Regarding sizing: it looks more or less fine. Just make sure you’ve chosen the right tool for the job. Do you really need to store all this data? If you operate on a 10s window, you have absolutely no need to store months of traffic, and you can get away with a bunch of 1U servers with lots of RAM and CPU but small and cheap HDDs in RAID, a typical configuration for hosts doing streaming and in-memory analytics. And even if you need to store it for infrequent-access cases, you can just dump it to S3; Spark integrates with S3 pretty well in case you want to analyze this data later
          Be careful with networking: with 24 drives per node you would have around 2000MB/sec of combined IO for a single node, while 2 x 1GbE provides at most 200MB/sec per node, so you can easily hit a network IO bottleneck for non-local data processing
          Virtualization: I’ve heard many stories about virtualization on Hadoop (and even participated in some), but none of them were a success. For toy cases and development clusters it is ok, but not for production ones.

          1. Ladislav Jech

            2. You are right in your assumption, but that is not the complete picture.

            It is related not only to DDoS protection, but also to other attack types and to intrusion detection and prevention in general, such as:
            – a user logging in at the same time from 2 or more geographically separated locations
            – finding consequences (attack signatures)
            – joining attack events across devices
            – trying to suggest the next attack areas/targets based on described patterns; I would like to use Deeplearning4J here, possibly with genetic fuzzy tree systems (these have relatively small storage requirements and are better kept in memory with fast processing power, either CPU, GPU (CUDA or OpenCL) or AMD APU)
            – the attack itself can originally be seen as a sequence of single or multiple steps, but strategies are changing; armies of shadow DDoS attacks are on the way to help hide the real network intrusion point.
            This story is to be analyzed in a detailed way.

            Historical data could potentially be used later for deep learning of new algorithms, but in general I agree with you: some filtering is going to happen, and we won’t store everything. When attacks occurred in the past, there is a chance to find similar signatures in the events.

            S3 Integration! – This is something for me to explore on next stage, thanks!

            Regarding the networking issue as a possible bottleneck:
            If I selected the 24-bay 4U system, I would go straight with 40Gbit QSFP, or put 2x 10Gbit NICs into a multipath configuration as previously mentioned. But I have already abandoned that setup as too expensive.

            The version with 1U servers, each having 4 drives, can perform at ~333-450MB/sec, but the network even with multipath maxes out at 200MB/sec.
            I can extend each of them for 70 GBP with a single-port 10Gbit card; that fixes it while wasting about ~50% of the new network capacity potential, so there is still room to balance.

            I am now looking into 2U server solutions which can serve the same purpose with either 8- or 12-bay chassis. I will update here to discuss.

            General question
            How good is Hadoop at balancing load across a heterogeneous server environment? Imagine I have a mixture of different data nodes: 24TB servers with 2 quad-core CPUs and 96GB RAM, and 36TB servers with octa-core CPUs and 144GB RAM. Is the Hadoop ecosystem capable of automatic intelligent load distribution, or is it in the hands of the administrator, and is it better to use the same configuration for all nodes?

            I am almost there 🙂

            Regards,

            Ladislav

          2. 0x0FFF Post author

            It is much better to have the same configuration for all the nodes. However, you are completely free to configure different nodes in different ways if your computational framework supports it. For example, with HDFS you can define nodes with archival storage, and in YARN you can define node labels and in general configure each node’s capacity separately. But be aware that this is new functionality, and not all external software supports it. For example, even Cloudera is still shipping Apache Hadoop 2.6.0 (https://archive.cloudera.com/cdh5/cdh/5/hadoop/index.html?_ga=1.98045663.1544221019.1461139296), which does not have this functionality

        2. Ladislav Jech

          Hi Alexey,
          I made a decision, and I think I got quite a good deal.
          5x Data Nodes will be running on:
          Intel Xeon Hex Core E5645 2.4GHz
          144GB RAM
          6x 6TB drives
          10Gbit network SFP+

          So 720GB of RAM for 180TB raw capacity. I now fully understand the RAM requirement and how it can affect performance, so it will all depend on the data processing configuration.

          It was great discussion and thanks again for all your suggestions to me.

