The first blog post of mine is accepted to official Pivotal blog! Feel free to comment and share your opinion on the subject:
This is the talk I made on Java Day Kiev 2015. It was a great conference after all
The question regarding running Hadoop on a remote storage rises again and again by many independent developers, enterprise users and vendors. And there are still many discussions in community, with completely opposite opinions. I’d like to state here my personal view on this complex problem.
Today I had a great talk at the Hadoop User Group Ireland meetup in Dublin, and it was an adapted and refactored version of the article on the same subject, MPP vs Hadoop. Here are the slides:
Feel free to comment and share your opinion on this subject
Finally I have translated my talk from Highload++ 2015 conference in Moscow into English, so now you can enjoy the fresh information about the Apache HAWQ internals!
If you’d like to download the slides, you can find them here: HAWQ Architecture HL++ 2015 Moscow
When you are completely ready to start your “big data” initiative with Hadoop, one of your first questions would be related to the cluster sizing. What is the right hardware to choose in terms of price/performance? How much hardware you need to handle your data and your workload? I will do my best to answer these questions in my article.
Over the time working with enterprise customers, I repeatedly hear the question regarding the Hadoop cluster backup. It is a very reasonable question from the customer standpoint as they know that the backup is the best option to protect themselves from the data loss, and it is a crucial concept for each of the enterprises. But this question should be treated with care because when interpreted in a wrong way it might lead to huge investments from the customer side, that in the end would be completely useless. I will try to highlight the main pitfalls and potential approaches that would allow you to work out the best Hadoop backup approach, which would fulfill your needs.
Edit from 2015/12/17: Memory model described in this article is deprecated starting Apache Spark 1.6+, the new memory model is based on UnifiedMemoryManager and described in this article
Over the recent time I’ve answered a series of questions related to ApacheSpark architecture on StackOverflow. All of them seem to be caused by the absence of a good general description of the Spark architecture in the internet. Even official guide does not have that many details and of cause it lacks good diagrams. Same for the “Learning Spark” book and the materials of official workshops.
In this article I would try to fix this and provide a single-stop shop guide for Spark architecture in general and some most popular questions on its concepts. This article is not for complete beginners – it will not provide you an insight on the Spark main programming abstractions (RDD and DAG), but requires their knowledge as a prerequisite.
Hadoop is becoming crucial tool in big data strategy for any enterprise. The more important it becomes, the more companies start to propose their solutions utilizing its open-source power. Today I will talk about virtualized Hadoop, one of the modern branches of the current market.
Let’s start from the virtualization itself. It begins its story in 60s, but the broader adoption of the virtualization in enterprises started approximately 10 years ago. What is the purpose for virtualization in enterprises? Continue reading
Spark is rapidly getting popular among the people working with large amounts of data. And it is not a big surprise as it offers up to 100x faster data processing compared to Hadoop MapReduce, works in memory, offers interactive shell and is quite simple to use in general. But in my opinion the main advantage of Spark is its great integration with Hadoop – you don’t need to invent the bycicle to make the use of Spark if you already have a Hadoop cluster. With Spark you can read data from HDFS and submit jobs under YARN resource manager so that they would share resources with MapReduce jobs running in parallel (which might as well be Hive queries or Pig scrips, for instance). All of these makes Spark a great tool that should be considered by any company having some big data strategy.
It is a known fact that Spark is still in early days, even though its getting popular. And mainly it means the lack of well-formed user guide and examples. Of course, there are some on official website, but they don’t cover well the integration with HDFS. I will try to fill this gap by providing examples of interacting with HDFS data using Spark Python interface also known as PySpark. I’m currently using Spark 1.2.0 (the latest one available) on top of Hadoop 2.2.0 and Hive 0.12.0 (which comes with PivotalHD distribution 2.1, also the latest).