Working with enterprise customers, I repeatedly hear questions about Hadoop cluster backup. It is a very reasonable question from the customer standpoint: they know that a backup is the best protection against data loss, and it is a crucial concept for any enterprise. But the question should be treated with care, because interpreted the wrong way it can lead to huge investments on the customer side that end up being completely useless. I will try to highlight the main pitfalls and the potential approaches that will let you work out the Hadoop backup strategy that best fulfills your needs.
Let’s start with the basics. The most important things you need to understand about a Hadoop cluster are:
- Hadoop Distributed File System (HDFS) is redundant by default. All data is stored in 3 copies (the dfs.replication parameter in hdfs-site.xml), and with a proper topology configuration no single rack will ever hold all 3 copies of a block, provided, of course, that your cluster spans at least 2 racks (a quick way to check this on your own cluster is shown right after this overview)
- HDFS was designed to work within a single datacenter. This is a limitation “by design”, inherited from the Google FS, and there is no easy way to overcome it. The main problems with stretching a cluster across datacenters are:
- Limited bandwidth between datacenters for HDFS writes. Each time you write data to such a stretched HDFS, one replica has to be placed in the remote datacenter, and this process is synchronous. When you write an incoming stream to HDFS block by block, each block is pipelined through the 3 datanodes that store it, and one of them will always be remote. This means the maximum write throughput of your cluster is capped by the bandwidth of the channel between the two sites
- Limited bandwidth between datacenters for MapReduce data shuffle. When your YARN cluster is distributed across two datacenters, some of your mappers run in one datacenter and some in the other, but every reducer reads the output of all the mappers. If your cluster is split into two equal halves, about 50% of the mappers’ output is shuffled through the channel connecting the datacenters. For MR jobs with an efficient combiner this is acceptable, but for most of them it is a huge overhead, because the mapper output is sometimes even larger than the input
- Given that you have 2 datacenters and a dedicated channel between them, are you ready to hand this whole channel over to Hadoop? I don’t think so. Usually OLTP applications such as billing, ABS and CRM systems come first and Hadoop comes second or even third, so you are lucky if you get a 4-6 Gbit/s channel, which is definitely not enough and will become the bottleneck limiting your cluster’s capabilities
- HDFS is designed to run on commodity hardware with JBOD direct-attached storage. It easily tolerates the failure of any component, especially HDDs, and automatically restores the desired replication factor of your data after a failure. This makes HDFS a really cheap solution for storing your data: it is hard to imagine a proprietary storage solution that can compete with HDFS on price, and even tape storage only gets you to a comparable price point. You should also consider that HDFS servers are used not only for storage but also for data processing, which means you reuse the cluster resources for computation and don’t need a separate compute cluster
To sum up, HDFS is a distributed filesystem that runs within a single datacenter on commodity hardware with JBOD drives directly attached to the servers, delivering one of the best prices for data storage among competing solutions while also enabling distributed data processing.
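If you want to verify the replication factor and the rack placement of your blocks on your own cluster, a minimal check with the standard HDFS tooling looks like this (the path is just an illustration):

```bash
# Default replication factor as seen by the client configuration (3 out of the box)
hdfs getconf -confKey dfs.replication

# Filesystem check that also reports how block replicas are spread across racks
hdfs fsck /data/warehouse -files -blocks -racks
```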
The next thing to mention: you will almost never store uncompressed raw data on your HDFS cluster. For anyone managing a cluster it is obvious that data compression is a crucial tool for today’s enterprises, as it dramatically reduces the cost of storing data. At the moment the 2 most popular storage formats for HDFS are Parquet (driven by Twitter and Cloudera, https://parquet.incubator.apache.org/) and ORCfile (driven by Hortonworks and heavily used by Facebook, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC); both are highly optimized columnar formats for storing structured data. For instance, Facebook’s data warehouse uses the ORCfile format, delivering an 8x compression rate on average (https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/).
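As an illustration of how data typically lands in one of these formats, here is a hedged sketch of converting a raw Hive table into ORC; the database and table names are made up, and the same idea works for Parquet (STORED AS PARQUET):

```bash
# Create a compressed columnar copy of a raw table (illustrative names)
hive -e "
  CREATE TABLE warehouse.events_orc
  STORED AS ORC
  AS SELECT * FROM warehouse.events_raw;
"
```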
Now that we know all this, let’s focus on the main topic: backing up HDFS data. What solutions are available for making it happen? Let’s briefly cover them:
- Proprietary backup solutions provided by storage system vendors. These might be tape-based systems or big NAS boxes with 4TB/6TB HDDs to store your backups. Here are the problems you would face:
- Data volume. When you have a 1PB Hadoop cluster, be ready to procure a 2PB storage solution to manage its backups. If you are not new to this topic, you can already imagine that such a solution will cost more than 2 additional Hadoop clusters of the same size as your main production one
- Data transfer bottleneck. Imagine that you connect the storage solution to your Hadoop cluster interconnect with 4 x 10GbE cables. Your backup speed is then limited to 40 Gbit/s, which is 5GB/sec in theory and about 3.6GB/sec in practice. How long would it take to restore all the data back to your 1PB Hadoop cluster after a disaster? The answer is simple: about 3 days and 5 hours. And this is one of the better scenarios; not every storage system delivers 3.6GB/sec, most give you 1GB/sec at best, which turns the restore exercise into 11 days and 14 hours (see the back-of-the-envelope calculation after this list)
- Data compression and deduplication. When the data is already stored in optimized columnar formats like Parquet or ORCfile, no additional compression can be squeezed out by the storage solution, whatever the marketing says. Deduplication is reasonable and does work, but only if you store more than 1 backup of your system; if you decide to keep only the single latest copy of the data, you won’t benefit from deduplication in any way
- Secondary Hadoop site. Since HDFS is a cheap solution for both storing and processing data, a secondary cluster gives you both a second copy of your data and a second processing system, which can be activated if the primary fails or used for development while production runs on the other site
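To make the restore-time figures above easy to reproduce, here is the back-of-the-envelope calculation behind them, assuming 1 PB = 1,000,000 GB and the effective throughputs of 3.6 GB/s and 1 GB/s mentioned earlier:

```bash
awk 'BEGIN {
  gb = 1000000                                            # 1 PB expressed in GB (decimal units)
  printf "at 3.6 GB/s: %.1f days to restore 1 PB\n", gb / 3.6 / 86400
  printf "at 1.0 GB/s: %.1f days to restore 1 PB\n", gb / 1.0 / 86400
}'
```

This prints roughly 3.2 and 11.6 days, matching the 3 days 5 hours and 11 days 14 hours quoted above.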
As you can see, the first solution is straightforward, so we won’t dwell on it. Instead, I will discuss the possible options for implementing a secondary Hadoop site:
- HDFS Replication. A native implementation of this feature in HDFS is currently in progress (https://issues.apache.org/jira/browse/HDFS-5442) and is not planned to be released before Hadoop 3.0, which will not arrive anytime soon. At the moment, almost the only option available to you is the WANdisco software (http://www.wandisco.com/system/files/documentation/Technical-Brief-Service-Continuity-NonStop_Hadoop-WEB.pdf), which implements data replication between two clusters. But it is closed-source proprietary software, and you have to pay a license fee to use it. For enterprise customers this might be a solution, but in general this kind of software introduces additional dependencies, so be prepared for the Hadoop version you run to lag 1-2 versions behind the latest release, simply because proprietary software does not adapt fast enough to support new releases from day one
- Data Copy. Another popular approach is simply to copy the data between the 2 clusters. The tool usually used for this purpose is DistCp, orchestrated by Oozie. Some vendors like Cloudera provide a proprietary replication toolkit built on top of DistCp (http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-1-x/Cloudera-Backup-Disaster-Recovery/cm5bdr_overview.html), while others like Hortonworks push the open source community to make it available to every Hadoop user (http://falcon.apache.org/, http://www.slideshare.net/hortonworks/discoverhdp22falcon-for-data-governance). In general this is the somewhat preferred approach at the moment, delivering the best flexibility, performance and stability, but it requires a lot of manual configuration, which won’t surprise you if you have been working with Hadoop for some time (see the DistCp example after this list)
- Dual Load. This is the third and last option for the scenario with two Hadoop clusters, primary and secondary. With this approach you load the data into both clusters at the same time, which ensures that both contain the same data. It can be problematic if you have a complex ETL running, because even small issues with the source data can cause big differences in the processing results and make your secondary cluster completely useless. To avoid this, plan for dual loading from the very beginning and use a distributed queue such as Kafka or RabbitMQ to guarantee data delivery to both of your clusters with no data loss. You might also consider Flume for putting the data from the source systems into HDFS (a rough sketch of this pipeline follows the DistCp example below). PayPal, for example, runs this kind of pipeline.
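To give an idea of what the data copy approach looks like in practice, here is a minimal DistCp invocation; the NameNode hostnames and paths are purely illustrative, and in real life the command would be scheduled by Oozie or a similar orchestrator:

```bash
# Copy a result dataset from the primary cluster to the secondary one
hadoop distcp \
  hdfs://nn-primary:8020/warehouse/results \
  hdfs://nn-secondary:8020/warehouse/results
```

DistCp runs as a MapReduce job, so the copy itself is parallel; flags such as -m (number of map tasks) and -p (preserve file attributes) let you tune it further.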
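And here is a rough sketch of the dual-load idea with Kafka as the delivery queue; the hostnames, topic and file names are made up. The source publishes each record exactly once, and each datacenter runs its own identical consumer (for example a Flume agent with a Kafka source and an HDFS sink) that writes the stream to its local HDFS, so both sites end up with the same raw data:

```bash
# Publish the raw feed once, to a Kafka cluster reachable from both datacenters
tail -F /var/log/source/events.log \
  | kafka-console-producer.sh --broker-list kafka01:9092 --topic raw-events
```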
Now that we know all the available approaches, we can finally get to the scenarios for making the data stored in a Hadoop cluster fully redundant in the most efficient way:
- Back up target datamarts and aggregated reports to traditional storage solutions. This is gigabytes of data and can easily be managed by traditional backup tools
- Replicate big result datasets from the primary cluster to the secondary one using data copy. This is terabytes of data, which can easily be copied with the DistCp utility. This approach also lets you be completely sure that the data your reports are built on is exactly the same on both clusters
- Dual load the raw source data into both clusters. This is petabytes of data, which is not easy to copy in full even with DistCp, so it is much easier to dual load it. Of course, another option is to incrementally copy the data to the secondary site with DistCp (a sketch of such an incremental copy follows), but this introduces additional latency and means losing the most recently loaded data if you have to switch to the secondary site
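For completeness, an incremental DistCp copy of the raw data layer would look roughly like the sketch below; the hostnames and paths are again illustrative. The -update flag copies only the files that are missing or changed on the target, and -delete removes files that have disappeared from the source:

```bash
# Incrementally synchronize the raw data layer between the two clusters
hadoop distcp -update -delete \
  hdfs://nn-primary:8020/data/raw \
  hdfs://nn-secondary:8020/data/raw
```

Anything loaded on the primary between two such runs is not yet on the secondary, which is exactly the latency and potential data loss mentioned above.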
Excellent overview.
Good overview. Have there been any new technologies or architectures in this space since this post?
It seems you don’t know what backup is, so your article is not answering the question.
Backup is not only about protecting against the loss of the current version of the data and securing production.
It is about having a history of a given piece of data through time.
With all your copying and replication and so on, how can you give back a piece of data 3 years later if it has changed many times during those 3 years?
This is what backup does: indexing each version of a piece of data and keeping a copy of all these indexed versions.
You are talking about deduplication at one point: it seems you are aware that backup has been using deduplication for nearly 10 years!
You are talking about transfer throughput: the problem is on the HDFS side, not on the backup side. Backup arrays are able to deliver hundreds of streams of more than 100MB/s, which means far over 1 GB/s.
You are talking about 1 PB of data on a single cluster. Backup systems of huge companies generate between 2 and 10 PB per week! So you make me laugh with only 1PB.
In fact you have no solution for backup. HDFS cannot support legal audits over years.
A history of a given piece of data through time is called data versioning, and it is a concept independent from backup. Data versioning is usually offered out of the box by some operational data stores (e.g. HBase versions, the Oracle redo log), and requires more complex approaches in other cases (Slowly Changing Dimensions in a DWH, among others). However, as I said before, data versioning and backup are independent problems.
Regarding deduplication: yes, I’m aware it has been used for many years. Read my words carefully: I’m stating that if you store only a single copy of the data (which you do if you have a huge dataset), deduplication won’t help. It comes into play with the 2nd and 3rd copies, and only if most of your data didn’t change.
You say the throughput problem is on the HDFS side, are you serious? HDFS is fully parallel and can deliver as much throughput as you can consume; a cheap HDFS cluster storing 1PB of data will give you a 28GB/sec scan rate. Hundreds of streams of 100MB/sec is 30-40GB/sec, right? Transferring 1PB of data through that would still take more than 8 hours. Also, 40GB/sec requires at least 32 independent 10GbE links between your HDFS cluster and the backup solution; can you point me to the solutions that support this kind of connectivity, please?
1PB is still quite a lot (and was even more 3 years ago, when this was posted), and less than 1% of enterprises have data volumes like this. I agree that large companies generate 2-10PB of backups a week: that is 300-1400 TB a day, which can be scattered across 10 datacenters and 10 backup solutions, or roughly 140TB/day and ~1.7GB/sec per backup solution. What this article is about is the backup of a single HDFS cluster running in a single datacenter.
Your last statement makes it completely clear that you are mixing up the concepts of backup and data versioning. Again, data versioning is a solution-level concept and is implemented either by your operational data store or by introducing a specific data model that preserves historical versions. Of course, there are also continuous data protection solutions, but they simply do not work for these data volumes and HDFS-like deployments.