Hadoop is becoming a crucial tool in the big data strategy of any enterprise. The more important it becomes, the more companies propose solutions built on its open-source power. Today I will talk about virtualized Hadoop, one of the modern branches of the current market.
Let’s start with virtualization itself. Its story began in the 1960s, but broader adoption of virtualization in enterprises started approximately 10 years ago. What is the purpose of virtualization in enterprises?
- Reduce IT costs
  - CapEx: reduce the number of HW servers while maintaining the same level of performance
  - OpEx: less HW – less HW maintenance effort, plus easier management of VMs (snapshotting, backups, etc.)
  - Other savings: less HW – less floor space in the DC, less power consumption, etc.
- Increase efficiency
  - Automate tasks: simplify and speed up maintenance operations
  - Faster deployment: it is many times faster to add a VM than to add one more HW server
  - Redundancy: simpler failover scenarios in case of server failure
It was a truly groundbreaking technology, and all the points described above prove it. This is why many businesses follow the “Virtualize Everything” concept so strictly that they really do virtualize everything.
Following the “Virtualize Everything” intention, companies start to virtualize Hadoop without going deeper into its architecture – simply because that is what they have been doing for the last 10 years. But what does virtualization mean in the context of Hadoop? First, let’s revisit the virtualization reasoning above in a Hadoop context:
- Reduce IT costs
  - CapEx: reduce the number of HW servers while maintaining the same level of performance. OK, but how can this be achieved with Hadoop? Hadoop is an elastically scaling distributed filesystem and computation framework. In terms of storage: if you don’t need 100 servers for a Hadoop cluster, you can buy 90, and all the storage on them will be 100% utilized. Packing many Hadoop VMs into a single server won’t decrease the storage footprint. In terms of CPU and RAM: Hadoop has YARN to manage and balance memory and CPU usage across the cluster. Having many Hadoop VMs on a single HW host won’t give you any benefit in resource usage – you still operate with the same HW. In terms of general performance: in the best scenario, with servers using DAS storage and 1 VM per host, your performance will degrade by approximately 20% compared to a non-virtualized solution. Many datasheets claim something like “5%”, but that is marketing – you will see 5% only for in-memory computations, while IO (both disks and network) introduces a greater performance penalty. This means that instead of reducing the number of HW servers needed to maintain the same performance, you are increasing it by ~20%;
  - OpEx: less HW – less HW maintenance effort, plus easier management of VMs (snapshotting, backups, etc.). OK, but according to the points above, we would actually have more HW, which means greater HW maintenance effort. Next, about easier VM management. Snapshotting: there is simply no case for snapshotting HDFS VMs, because you won’t get a consistent snapshot unless you stop the whole HDFS cluster – and if you don’t, the snapshot isn’t worth a cent, because you won’t be able to use the data stored in it. Also, imagine the storage you would need to keep VM snapshots of HDFS nodes – it is huge, at least 20% greater than the storage available to HDFS itself. Are you ready to buy a SAN for 1PB of data? I don’t think so. The usual way to back up Hadoop is to use HDFS Snapshots and copy the data to another Hadoop cluster – if you need backups, you would prefer a second Hadoop instance in a secondary datacenter over proprietary enterprise storage. And one more nasty thing: you will pay a subscription for the virtualization software, which can be avoided in a bare-metal installation. Note that the price of the virtualization SW would be approximately the same as the price of a Hadoop services subscription delivered by any major Hadoop vendor;
  - Other savings: less HW – less floor space in the DC, less power consumption, etc. As with the points above, you will actually have more HW, and thus more floor space in the DC, greater power consumption, etc.
- Increase efficiency
  - Automate tasks: simplify and speed up maintenance operations. OK, but how will this work with Hadoop? Hadoop is not a single VM, it is a cluster solution, and you cannot do much at the VM level alone – you would also need to go to the Hadoop cluster manager (Ambari, Cloudera Manager, Pivotal Command Center, etc.) and perform the maintenance with that tool;
  - Faster deployment: it is many times faster to add a VM than to add one more HW server. But again, if you have spare HW for your Hadoop cluster, you just add it instead of keeping it on a shelf – it can easily be added to the cluster with any of the existing Hadoop cluster managers. And if you add one more VM to a virtual cluster, you still cannot avoid going to the Hadoop cluster manager and initializing this VM so that it acts as part of the cluster. The only benefit here is a small time saving on installing the OS on the new HW host – it is easier to clone a VM. But that is only about 30 minutes per server, and over long-term cluster usage you won’t see the difference. Plus, if you add a new HW server to a virtualized Hadoop cluster, you first have to install the virtualization SW on it before cloning the VM onto it, so in general each HW server added to a bare-metal or a virtualized cluster takes approximately the same time to move to production.
  - Redundancy: simpler failover scenarios in case of server failure. As I described above, to utilize this functionality you need shared storage. Imagine an enterprise SAN solution for 1PB of data fast enough to work under the Hadoop filesystem, then calculate its price, and you will see the whole picture. In general, nothing stops you from using it, but it would be extremely expensive. HDFS and Hadoop are redundant by design, and virtualization won’t help you here. Disaster recovery is similar: it is usually implemented as an incremental copy of the data from the primary cluster to the secondary one, virtualized or not – the scenario won’t change and you won’t benefit from virtualization.
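To make the CapEx point above concrete, here is a back-of-the-envelope calculation. The ~20% penalty is the estimate from this article (and the 5% figure is the common datasheet claim), not benchmark results:

```python
# Back-of-the-envelope sizing: how many virtualized hosts are needed to
# match the throughput of a bare-metal Hadoop cluster, given that each
# virtualized host delivers only (1 - penalty) of its bare-metal performance.
import math

def hosts_needed(bare_metal_hosts: int, perf_penalty: float) -> int:
    """Servers required to match bare-metal throughput under a given penalty."""
    effective_per_host = 1.0 - perf_penalty
    return math.ceil(bare_metal_hosts / effective_per_host)

print(hosts_needed(100, 0.20))  # 125: ~25% MORE hardware, not less
print(hosts_needed(100, 0.05))  # 106: even the optimistic 5% claim adds servers
```

So even under the vendors' optimistic numbers, virtualization moves the CapEx needle in the wrong direction for a throughput-bound cluster.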
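The snapshot-and-copy disaster recovery flow mentioned above can be sketched as follows. The cluster addresses, paths, and snapshot names are hypothetical; the commands themselves (`hdfs dfs -createSnapshot` and `hadoop distcp -update -diff`) are the standard HDFS snapshot and incremental DistCp operations:

```python
# Sketch of the usual Hadoop DR flow: take an HDFS snapshot on the primary
# cluster, then incrementally copy the delta to the secondary cluster with
# DistCp. The functions only build the command lines; on a real cluster you
# would run them via subprocess.run(...). All paths/names are hypothetical.

def snapshot_cmd(path: str, name: str) -> list:
    # The directory must first be made snapshottable by an admin:
    #   hdfs dfsadmin -allowSnapshot <path>
    return ["hdfs", "dfs", "-createSnapshot", path, name]

def incremental_copy_cmd(prev: str, curr: str, src: str, dst: str) -> list:
    # 'distcp -update -diff <s1> <s2>' copies only what changed between two
    # snapshots of the source directory, instead of re-copying everything.
    return ["hadoop", "distcp", "-update", "-diff", prev, curr, src, dst]

print(" ".join(snapshot_cmd("/data/warehouse", "s_20160102")))
print(" ".join(incremental_copy_cmd(
    "s_20160101", "s_20160102",
    "hdfs://primary-nn:8020/data/warehouse",
    "hdfs://secondary-nn:8020/data/warehouse",
)))
```

Note that this flow is identical whether the clusters are virtualized or not, which is exactly why virtualization adds nothing to the DR story.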
But if it is that bad, why do we have so many startups offering virtualized Hadoop? Because there are some use cases where you can really benefit from a virtualized Hadoop solution:
- Cloud Hadoop. Use case: proof of concept. The idea is that you can obtain a running Hadoop cluster in the cloud in a matter of minutes. Load a small amount of test data there (hundreds of megabytes or a couple of gigabytes, no more), run some sample workload to see how it works and whether it fits your use case, then easily destroy the instance;
- Virtualized Hadoop. Use case: development environment. You can have a small Hadoop cluster running in 3-5 VMs, one cluster per developer. Give each of them approximately 100-300GB of disk space, set the replication factor to 1, and use them with small subsets of data for development purposes. On a single HW server with 2 x 10-core CPUs, 512GB of RAM, and 12 x 4TB HDDs in RAID5 you can host up to 4 Hadoop clusters, giving your developers a good sandbox for experiments.
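The sandbox arithmetic above can be checked with a quick calculation. The host specs are the ones from this article; the per-VM RAM share and usable RAID5 capacity are rough assumptions of mine:

```python
# Rough check of how many small developer clusters fit on the single host
# described above (2 x 10-core CPUs, 512 GB RAM, 12 x 4 TB HDDs in RAID5).
# RAM-per-VM and RAID5 usable capacity are assumptions, not measured values.

HOST_RAM_GB = 512
RAID5_USABLE_TB = (12 - 1) * 4         # one disk's worth lost to parity: 44 TB

VMS_PER_CLUSTER = 4                    # within the 3-5 VM range per dev cluster
RAM_PER_VM_GB = 32                     # assumed share per VM
DISK_PER_CLUSTER_GB = 300              # per-developer data, replication factor 1

clusters_by_ram = HOST_RAM_GB // (VMS_PER_CLUSTER * RAM_PER_VM_GB)
clusters_by_disk = (RAID5_USABLE_TB * 1024) // DISK_PER_CLUSTER_GB
print(min(clusters_by_ram, clusters_by_disk))  # 4 - RAM is the binding limit
```

Under these assumptions RAM, not disk, is what caps the host at about 4 clusters, which is why replication factor 1 and small data subsets are enough for this use case.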
If anyone tells you about other use cases for virtualized Hadoop, be skeptical and involve someone with solid Hadoop expertise. All such cases need careful review, because most likely someone wants to make more money out of your company by proposing this solution.
In the next article I will try to cover an even nastier thing that sometimes appears on the horizon – running Hadoop on top of shared storage, whether an enterprise NAS or a cloud storage solution.