I often hear this question from my customers and over the internet, especially in the last year. So what is the myth and what is real about the Spark and its place in the “Big Data” ecosystem?
To be honest, the question I put as the title is wrong, but it is usually the way it is asked. Hadoop is not a single product, it is an ecosystem. Same for Spark. Let’s cover them one by one. The main pillars of the Hadoop ecosystem at the moment are: Continue reading
Hadoop is becoming crucial tool in big data strategy for any enterprise. The more important it becomes, the more companies start to propose their solutions utilizing its open-source power. Today I will talk about virtualized Hadoop, one of the modern branches of the current market.
Let’s start from the virtualization itself. It begins its story in 60s, but the broader adoption of the virtualization in enterprises started approximately 10 years ago. What is the purpose for virtualization in enterprises? Continue reading
Spark is rapidly getting popular among the people working with large amounts of data. And it is not a big surprise as it offers up to 100x faster data processing compared to Hadoop MapReduce, works in memory, offers interactive shell and is quite simple to use in general. But in my opinion the main advantage of Spark is its great integration with Hadoop – you don’t need to invent the bycicle to make the use of Spark if you already have a Hadoop cluster. With Spark you can read data from HDFS and submit jobs under YARN resource manager so that they would share resources with MapReduce jobs running in parallel (which might as well be Hive queries or Pig scrips, for instance). All of these makes Spark a great tool that should be considered by any company having some big data strategy.
It is a known fact that Spark is still in early days, even though its getting popular. And mainly it means the lack of well-formed user guide and examples. Of course, there are some on official website, but they don’t cover well the integration with HDFS. I will try to fill this gap by providing examples of interacting with HDFS data using Spark Python interface also known as PySpark. I’m currently using Spark 1.2.0 (the latest one available) on top of Hadoop 2.2.0 and Hive 0.12.0 (which comes with PivotalHD distribution 2.1, also the latest).
Map Reduce is a really popular paradigm in distributed computing at the moment. The first paper describing this principle is the one by Google published in 2004. Nowadays Map Reduce is a term that everyone knows and everyone speaks about, because it was put as one of the foundations to the Hadoop project. For most of the people Map Reduce is an equivalent to “Hadoop” and “Big Data”, which is completely wrong. But there are some people that understand the simplest case with WordCount and maybe even building an inverted index using Map Reduce.
But being simple as a concept, it has a kind of complicated implementation in Hadoop. I’ve tried to find a comprehensive description of it with a good diagram over the internet, but failed. All the diagrams keep repeating “Map – Sort – Combine – Shuffle – Reduce”. Of course, it is good to know that the framework works this way, but what about dozens of parameters that are tunable for the framework? What happens if you reduce the buffer size of the Map output or increase it? These diagrams don’t offer any help for this. This was the reason for me to build my own diagram and my own description based on the latest source code available in the Hadoop repository.
Hadoop MapReduce Comprehensive Diagram
Imagine that you have a file “data.csv” that lies on Hadoop and you need to split it into a number of smaller files with the different data to process them separately. To do it with Pig or Hive you should specify the file schema to describe it as a table, which might be not the thing you need (for instance, if different rows have different schema). Here’s an example of how it can be done with a MapReduce job utilizing MultipleOutputs. Continue reading
Today I will tell you about the startup called “Splice Machine”. They position themselves as “The Only Hadoop RDBMS”, which is quite bold given the boom we see now in SQL-on-Hadoop solutions field, almost each of the big vendors implemented their own “one and only” solution and claim it to be the best. But let’s take a look at its internals to say whether its design really reflects the marketing slogan they’ve chosen.