Hadoop on Remote Storage

The question of running Hadoop on remote storage is raised again and again by independent developers, enterprise users and vendors, and the community still debates it with completely opposite opinions. I'd like to state my personal view on this complex problem here.

[Image: Hadoop elephant balancing on a shared-storage ball]

In this article I will call remote storage "NAS" for simplicity. I will also take as a given that remote storage is not HDFS, but something completely different – anything from standard storage arrays with LUNs mounted to the servers to various distributed storage systems. I consider all of these systems remote because, unlike HDFS, they don't allow you to run your custom code on the storage nodes. And they are mostly "storage" products, so they use some kind of erasure coding to save space and make the solution more competitive.

If you have been reading my blog for a long time, you might notice that this is the second version of this article. Over the last year I was constantly thinking about this problem, and my position has shifted a bit, mostly based on real-world practice and experience.

[Figure: MapReduce with external data]

  1. Read IO Performance. For most Hadoop clusters the limiting factor in performance is IO: the more IO bandwidth you have, the faster your cluster works. You won't be surprised if I tell you that IO bandwidth mostly depends on the number of disks you have and their type. For example, a single SATA HDD can deliver around 50 MB/sec in sequential scans, a SAS HDD around 90 MB/sec, and an SSD might achieve 300 MB/sec. Given these numbers, calculating the total platform bandwidth is simple math. Comparing DAS with NAS does not make much sense in this context, because both a NAS and a cluster with DAS might have the same number of disks and thus deliver comparable bandwidth. So, assuming infinite network bandwidth with zero latency, the same RAID controllers and the same number and type of drives, DAS and NAS solutions would deliver the same read IO performance.
  1. Write IO Performance. Here things get a bit more complicated, and you should understand how exactly your NAS solution works to be able to compare it with Hadoop on DAS. HDFS stores a number of exact copies of the data, 3 by default. So if you write X GB of data, it in fact occupies 3*X GB of disk space, and of course writing 3 copies of the data is 3 times slower than writing a single copy. How do most NAS systems work? NAS is an old industry, and vendors understood early on that storing many exact copies of the data is very wasteful, so most of them use some kind of erasure coding (like Reed-Solomon). With RS(10,4) you achieve redundancy similar to storing 3 exact copies of the data with only 40% overhead. But everything comes at a cost, and the cost here is performance. To write a single block in HDFS you just write it 3 times. With RS(10,4), to write a single block you have to calculate erasure codes for it, either by reading the other 9 blocks in its group and writing out 4 parity blocks, or by having some kind of caching layer with replication and a background compaction process. In short, writing to an erasure-coded system is always slower than writing to a cluster with replication – it is like comparing RAID10 with RAID5, the same replication-vs-erasure-coding logic.
  1. Read IO Performance (degraded). If you lose a single machine or a single drive in a Hadoop cluster with DAS, your read performance is not affected – you read the same data from a different node that is still alive. But what happens in a NAS with RS(10,4)? Right, to restore a single block with RS(10,4) you may have to read up to 13 blocks, which can make your system up to 13 times slower! Of course, in most cases you encode sequential blocks and then read sequential blocks, so you can restore the missing one more easily. But still, your performance degrades 2x in the best scenario and up to 13x in the worst:

[Figure: degraded read performance, replication vs. Reed-Solomon]
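The arithmetic behind the three points above can be sketched in a few lines. The per-disk throughput figures and the RS(10,4) layout come from the text; the disk counts and everything else are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope model for the read, write and degraded-read points.
# Per-disk sequential throughput (MB/s) as quoted in the text; real numbers
# vary by drive model and workload.
DISK_MBPS = {"sata_hdd": 50, "sas_hdd": 90, "ssd": 300}

def aggregate_scan_mbps(n_disks, disk_type):
    """Ideal cluster-wide scan bandwidth: disks x per-disk throughput."""
    return n_disks * DISK_MBPS[disk_type]

def raw_bytes_per_logical_byte(scheme):
    """Storage overhead: 3x for HDFS replication, 1.4x for RS(10,4)."""
    if scheme == "replication_3x":
        return 3.0
    if scheme == "rs_10_4":
        return (10 + 4) / 10  # 10 data + 4 parity fragments = 40% overhead
    raise ValueError(scheme)

def degraded_read_amplification(scheme):
    """Blocks read to serve one block after losing a copy/fragment."""
    if scheme == "replication_3x":
        return 1   # just read another surviving replica
    if scheme == "rs_10_4":
        return 10  # reconstruct from 10 surviving fragments
    raise ValueError(scheme)

# 20 nodes x 12 SAS HDDs = 240 spindles of ideal scan bandwidth:
print(aggregate_scan_mbps(240, "sas_hdd"))     # 21600 MB/s
print(raw_bytes_per_logical_byte("rs_10_4"))   # 1.4
print(degraded_read_amplification("rs_10_4"))  # 10
```

Note that the 10x read amplification is the per-block minimum for RS(10,4); as the text says, a naive implementation can touch up to 13 blocks.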

And if you think that the degraded case is not very relevant for you, here are the statistics from Facebook's Hadoop cluster:

[Figure: Facebook node failures]

  1. Data Recovery. When you lose a node and replace it, how long does it take to restore the redundancy of your system? With HDFS on DAS you just copy the data of the under-replicated blocks to the new node. With RS(10,4) you have to reconstruct the missing blocks by reading all the other blocks in their groups and performing computations on top of them. Usually this is 5x-10x slower:

[Figure: repair traffic, replication vs. Reed-Solomon]

[Figure: repair throughput, replication vs. Reed-Solomon]
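The repair-traffic gap can be sketched the same way. The block size and the 10-fragments-per-rebuild figure follow from the text's RS(10,4) example; real repair pipelines batch and parallelize this, so treat it as an idealized lower bound:

```python
def repair_traffic_mb(scheme, lost_blocks, block_mb=128):
    """MB read over the network to re-create the lost blocks (idealized)."""
    if scheme == "replication_3x":
        # Copy one surviving replica per lost block.
        return lost_blocks * block_mb
    if scheme == "rs_10_4":
        # Read 10 surviving fragments to rebuild each lost fragment.
        return lost_blocks * 10 * block_mb
    raise ValueError(scheme)

# Losing 1 TB of data stored as 128 MB blocks (8192 blocks):
print(repair_traffic_mb("replication_3x", 8192))  # 1048576 MB, ~1 TB read
print(repair_traffic_mb("rs_10_4", 8192))         # 10485760 MB, ~10 TB read
```

The 10x difference in bytes moved is what shows up as the 5x-10x slower repair in the figures above.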

  1. Network. When you run a Hadoop cluster with DAS, the Hadoop framework itself tries to schedule executors as close to the data as possible, usually preferring local IO. In a cluster with NAS, your IO is always remote, with no exceptions. So the network becomes a big pain point – you should plan it very carefully, with no oversubscription both between the compute nodes and between compute and storage. The network rarely becomes a bottleneck if you have enough 10GbE interfaces, but the switches should be good, and you need many more of them than in a solution with DAS. Here's a slide from Cisco's presentation on this subject:

[Figure: Cisco slide, DAS vs. NAS for Hadoop]
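To see why oversubscription hurts so much here, consider a hypothetical rack of compute nodes whose every read is remote. The node count and per-node read rate below are illustrative assumptions:

```python
def rack_uplink_gbps(nodes_per_rack, per_node_read_mbps):
    """Uplink bandwidth a rack needs when ALL reads cross the network.
    With DAS most reads are local, so the uplink can be oversubscribed;
    with NAS the uplink must carry the full aggregate read rate (1:1)."""
    return nodes_per_rack * per_node_read_mbps * 8 / 1000  # MB/s -> Gbit/s

# 20 nodes each scanning remote storage at 1000 MB/s need 160 Gbit/s of
# rack uplink, i.e. sixteen 10GbE uplinks for a single rack.
print(rack_uplink_gbps(20, 1000))  # 160.0
```

With DAS and good data locality only a fraction of that traffic ever leaves the rack, which is exactly the point of the Cisco comparison.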

  1. Local Storage. Having remote HDFS might look like a good option, but what about the local storage on the "compute" nodes? People usually forget that MapReduce stores intermediate data on local storage, and that Spark puts all of its shuffle data there as well. Hive and Pig are translated into MR, Tez or Spark jobs, so they store their intermediate results on local storage too. Thus even "compute" nodes need enough local storage, and the safest option is to have the same amount of raw local storage as the amount of data you plan to store in your cluster.
  1. Hardware Amount. To have a Hadoop cluster with 40 working CPUs you usually need 20 servers – each server with 2 CPUs. How many servers do you need for a NAS solution? The same 20 servers for computation, plus at least 8 servers for the NAS itself, given the same number of HDDs in both solutions.
  1. Enterprise Features. Usually this is the main selling point of NAS solutions, and here they really are better than HDFS in terms of snapshots, replication, backups and so on. But HDFS was designed to be cheap storage for commodity hardware: it was not meant to run in geo-distributed mode or to have a great DR story, and you usually invest your own developers' resources to implement DR instead of using an out-of-the-box enterprise feature. Still, HDFS is reaching the same targets as enterprise storage little by little: erasure coding, HDFS snapshots, HDFS storage tiering, geo-distributed HDFS.
  1. Scalability. Obviously, in a Hadoop cluster with DAS you have to scale compute and storage at the same time. If your cluster uses NAS, you can scale storage and compute independently. This might be useful for solutions like Spark or Storm, when you don't really need much storage, or for the opposite case: storage solutions with the capability to analyze the data, like storing 10PB of data and having 10 servers to process it from time to time.
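The hardware-amount point above can be made concrete with a small sizing sketch. The 2-CPUs-per-server figure is from the text; the total disk count and disks-per-NAS-node are assumptions chosen to reproduce the text's "20 plus at least 8" example:

```python
def das_servers(total_cpus, cpus_per_server=2):
    """Hadoop on DAS: compute and storage live in the same boxes."""
    return -(-total_cpus // cpus_per_server)  # ceiling division

def nas_solution_servers(total_cpus, total_disks,
                         cpus_per_server=2, disks_per_nas_node=45):
    """NAS: the same compute servers PLUS dedicated storage nodes
    holding the same total number of HDDs."""
    compute = -(-total_cpus // cpus_per_server)
    storage = -(-total_disks // disks_per_nas_node)
    return compute + storage

print(das_servers(40))                # 20 servers
print(nas_solution_servers(40, 360))  # 20 compute + 8 storage = 28 servers
```

Same disks, same CPUs – but the NAS design needs roughly 40% more boxes to rack, power and manage.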

So using Hadoop with remote storage might be feasible, but only if:

  • You have a solution with a big skew between the compute and storage resources needed – like a cluster with 10PB of storage and only 10 compute nodes to satisfy the business users' requirements for a "data analysis" solution, or a cluster with 50 nodes and only 50TB of storage for Spark analytical jobs.
  • You have a development or test cluster. Here you might want to easily provision compute resources as virtualized Hadoop clusters, and use remote storage to easily provide multi-tenant access for a number of dev clusters.
  • You need a cloud solution. When you run Hadoop in the cloud, you always have compute and storage resources separated. This is a good option for fast-growing startups, when you don't know how many resources you will need tomorrow. But it is definitely not the best solution in the long run, once your resource consumption becomes stable and big.

7 thoughts on “Hadoop on Remote Storage”

  1. Mufeed Usman

    Interesting points. Definitely will depend on the nature of adoption and as you’ve already pointed out – use-cases. Definitely a domain worth following to see how things shape up in the future.

    1. 0x0FFF Post author

      I know, but by using a DAS solution from a cloud provider you are breaking the main idea of cloud computing – in fact, you are just renting the server. If we take an Amazon d2.8xlarge, renting it for 1 year would cost more than buying the equivalent server. Clouds are price-efficient because of multitenancy, which is impossible with dedicated servers.


  3. Frans

    For me the main advantage of the cloud is ease of provisioning and on-demand usage, regardless of which storage option you choose. You are always able to add DAS servers or shut them down (and not pay for them) as needed.
    You are right that running, say, a 50-node cluster around the clock for a year would cost about the same as buying it, if you leave out your own hosting costs. But using 50 nodes only at peak times and 30 off-peak does make a difference.
    Of course, the feasibility of this flexibility depends on the amount of data stored on these nodes.

    1. 0x0FFF Post author

      I tend to disagree. When we're speaking about servers with DAS, even cloud providers are pretty strict about "reserving" them – with Amazon, reserving a d2.8xlarge costs you >50% of the usage price, which means that keeping one d2.8xlarge reserved for 2 years costs the same as buying the server.
      The next problem is that it is not that easy to scale an existing cluster – you can add a node to HDFS and add a YARN NM, but this node would start out empty, doing remote reads over the network – not very efficient.
      Cloud solutions work extremely well in the remote-storage case I pointed out here: for instance, you store your data on S3 and allocate instances for Spark on demand – the more data you have, the more instances you allocate. You pay for them only when they are in use, which is pretty good value for the money. It is also easy to scale: first you allocate 4 nodes, later 8, then 12 and so on.
      As for your example with 50 nodes at peak and 30 off-peak – the price of running this setup for 2 years on Amazon is the same as buying 80 servers, and at that scale a HW vendor would give you a good discount, so you might even buy 100+ servers. I don't think the cloud is good for 30+ server use cases; it is good for startups using 3-5 VMs from time to time.

