Over the time working with enterprise customers, I repeatedly hear the question regarding the Hadoop cluster backup. It is a very reasonable question from the customer standpoint as they know that the backup is the best option to protect themselves from the data loss, and it is a crucial concept for each of the enterprises. But this question should be treated with care because when interpreted in a wrong way it might lead to huge investments from the customer side, that in the end would be completely useless. I will try to highlight the main pitfalls and potential approaches that would allow you to work out the best Hadoop backup approach, which would fulfill your needs.
Let’s start with the basics. The most important things you need to understand about Hadoop cluster are:
- Hadoop Distributed File System (HDFS) is redundant by default. All the data is stored in 3 copies (replication parameter from hdfs-site.xml), and given the proper topology configuration all 3 copies of the data would never be stored on a single rack, of course given the fact that your system is at least 2 racks in size
- HDFS is the solution that was designed to work within a single datacenter. This is a limitation “by design” inherited from the Google FS and there is no easy way to overcome it. The main problems with this approach are:
- Limited bandwidth between datacenters for HDFS data writes. Each time you write the data to HDFS, you have to put one copy to the remote cluster, and this process is synchronous. When you get a stream of input data, you write it to the HDFS block-by-block, and each block is getting pipelined between 3 datanodes storing them, 1 of them would always be remote. This means that maximum write performance of your cluster would be the same as the channel bandwidth between two sites
- Limited bandwidth between datacenters for mapreduce data shuffle. When you have a distributed YARN cluster, some of your mappers would run in one datacenter, some in another. But reducers read the data from all the mappers executed, which means that if you have your cluster split into two equal parts between two datacenters, 50% of the mappers’ output would be shuffled through the channel connecting two datacenters. For some MR jobs this is completely ok if they have an optimal combiner, but for most of them this would be a huge overkill as sometimes mapper output data size is even greater than its input data size
- Given the fact that you have 2 datacenters and a dedicated channel between them, are you ready to give whole this channel to Hadoop? I don’t think so, usually OLTP applications like different billings, ABSs, CRMs are the first priority and Hadoop is the second or even third, so you are lucky if you would get at least 4-6 Gbit channel, which would definitely be not enough and would be the bottleneck of your cluster limiting its capabilities
- HDFS is designed to run on a commodity hardware and JBOD direct attached storage. It can easily tolerate each components failure, especially HDDs, and automatically recover desired replication rate of your data in case of any component failure. This makes the HDFS be really cheap solution for storing your data and hardly you can imagine any proprietary storage solution that can compete with HDFS in price, even tape storage would give you comparable price. And also you should consider that HDFS servers are used not only for the storage, but as well for data processing, which means that you are reusing cluster resources for computations and don’t need to have a separate compute cluster
Finally, HDFS is a distributed filesystem that runs within a single datacenter on commodity hardware with JBOD drives directly attached to the servers delivering one of the best prices for storing the data across the competitive solutions and also enabling distributed data processing.
Next thing to mention, you will almost never store uncompressed raw data on your HDFS cluster. For anyone managing the cluster it is obvious that data compression is a crucial tool for the current enterprises as it allows you to dramatically reduce the costs of storing the data. At the moment 2 most popular data storage formats for HDFS are Parquet (introduced by Twitter, https://parquet.incubator.apache.org/) and ORCfile (introduced by Facebook, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC), both of them are highly optimized row-columnar formats for storing structured data. For instance, Facebook’s data warehouse uses ORCfile format for storing the data, delivering 8x compression rate on average (https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/).
Now that we know it, let’s focus on our main topic with HDFS data backup. What are the available solutions for making this approach real? Let’s briefly cover them:
- Proprietary backup solutions provided by the storage systems vendors. This might be tape-based systems or big NASs with 4TB/6TB HDD drives to store your backups. Here are the problems you would face:
- Data volume. When you have 1PB Hadoop cluster be ready to have 2PB storage solution for managing Hadoop backups. If you are not new to this topic, you can already imagine that this solution would cost you more than 2 additional Hadoop clusters of the same size as your main production
- Data transfer bottleneck. Imagine that you connect your storage solution with 4 x 10GbE cables to your Hadoop cluster interconnect network. This way your backup speed would be limited by 40Gbit per second or 5GB/sec, which in fact would be 3.6GB/sec. How long will it take to restore all the data back to your 1PB Hadoop cluster in case of disaster? The answer is simple – 3 days and 5 hours. And this is one of the best scenarios, not every storage would deliver you 3.6GB/sec transfer rate, most of them would give you 1GB/sec at max which would turn your data restore exercise to 11 days 14 hours
- Data compression and deduplication. When the data is stored in optimized row-columnar formats like Parquet or ORCfile, no additional compression can be introduced by storage solution, you shouldn’t listen to the marketing here. Data deduplication is reasonable and would work, but only if you store more than 1 backup of your system. If you decide to store only 1 latest copy of the data, you wouldn’t benefit from deduplication in any way
- Secondary Hadoop site. As HDFS is a cheap solution for both storing and processing the data, by introducing secondary cluster you would have both secondary copy of your cluster data and secondary processing system which can be activated in case the primary would fail, or can be used for development purposes while the production is running on another site
As you can understand, first solution is simple and we won’t stop on this, now I will discuss possible options for implementing the secondary Hadoop site:
- HDFS Replication. Native implementation of this feature in HDFS is currently in progress (https://issues.apache.org/jira/browse/HDFS-5442) and it is not planned to be released before Hadoop 3.0, which will not be released anytime soon. At the moment almost the only option that is available to you is WANdisco software (http://www.wandisco.com/system/files/documentation/Technical-Brief-Service-Continuity-NonStop_Hadoop-WEB.pdf) that implements data replication between two clusters. But this is closed-source proprietary hardware, and you have to pay license fee for using it. For enterprise customers this might be a solution, but in general implementing this kind of SW would most likely introduce additional dependencies on the software so be prepared that Hadoop version you are using would be 1-2 versions behind the latest released, just because the proprietary SW does not adapt fast enough to support new releases from day one
- Data Copy. Another popular approach is just to copy the data between 2 clusters. The tool used for this purpose is usually DistCp, orchestrated by Oozie. Some vendors like Cloudera provide proprietary toolkit for data replication built on top of distcp (http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v5-1-x/Cloudera-Backup-Disaster-Recovery/cm5bdr_overview.html), some like Hortonworks try to push the open source community forward to make it available for every Hadoop user (http://falcon.apache.org/, http://www.slideshare.net/hortonworks/discoverhdp22falcon-for-data-governance). In general this is somewhat preferred approach at the moment that delivers you the best flexibility, performance and stability, but it requires much manual configuration, which wouldn’t surprise you if you are working with Hadoop for some time
- Dual Load. This is the third and the last option available for you in the scenario of working with two Hadoop clusters: primary and secondary. With this approach you load the data to both clusters at the same time, this way you ensure that both of the clusters would contain the same data. This approach might be problematic in case you have a complex ETL running, so even small issues with the source data might cause big difference in the processing results, which would make your secondary cluster completely useless. To avoid this you have to plan for dual data loading from the very beginning and use some distributed queue solutions like Kafka or RabbitMQ to guarantee data delivery to both of your clusters with no data loss. Also you might consider using Flume for putting the data from the source systems to HDFS. Here’s an example of how it works in Paypal:
Now that we know all the available approaches, we can finally come to the scenarios of making the data stored in Hadoop cluster completely redundant, and making it in a most efficient way:
- Backup target datamarts and aggregated reports to the storage solutions. Gigabytes of data. Can be easily managed by traditional backup solutions
- Replicate big result datasets from primary cluster to secondary one using data copy. Terabytes of data. Can be easily copied with DistCp utility. Also this approach allows you to be completely sure that the data your reports are built on is exactly the same on both clusters
- Dual load the raw source data to both of the clusters. Petabytes of data. Not that easy to completely copy it even with DistCp, so it is much easier to dual load it. Of course, another approach would be to incrementally copy the data to the secondary site with DistCp, but this approach would introduce additional latency and will cause data loss in case of switching to the secondary site