I often hear this question from my customers and over the internet, especially in the last year. So what is the myth and what is real about the Spark and its place in the “Big Data” ecosystem?
To be honest, the question I put as the title is wrong, but it is usually the way it is asked. Hadoop is not a single product, it is an ecosystem. Same for Spark. Let’s cover them one by one. The main pillars of the Hadoop ecosystem at the moment are: Continue reading →
Spark is rapidly getting popular among the people working with large amounts of data. And it is not a big surprise as it offers up to 100x faster data processing compared to Hadoop MapReduce, works in memory, offers interactive shell and is quite simple to use in general. But in my opinion the main advantage of Spark is its great integration with Hadoop – you don’t need to invent the bycicle to make the use of Spark if you already have a Hadoop cluster. With Spark you can read data from HDFS and submit jobs under YARN resource manager so that they would share resources with MapReduce jobs running in parallel (which might as well be Hive queries or Pig scrips, for instance). All of these makes Spark a great tool that should be considered by any company having some big data strategy. It is a known fact that Spark is still in early days, even though its getting popular. And mainly it means the lack of well-formed user guide and examples. Of course, there are some on official website, but they don’t cover well the integration with HDFS. I will try to fill this gap by providing examples of interacting with HDFS data using Spark Python interface also known as PySpark. I’m currently using Spark 1.2.0 (the latest one available) on top of Hadoop 2.2.0 and Hive 0.12.0 (which comes with PivotalHD distribution 2.1, also the latest). Continue reading →
MVCC stands for Multi-Version Concurrency Control. It is the basic transaction isolation idea that stands behind many transactional systems and allows different processes see different version of truth for the same data. Considering DBMS system, when you are running the query that performs “update” of a specific number of records in the table, you should guarantee specific transaction isolation: if you run “select” in parallel with this “update”, you most likely want this “select” to see the data that was in the table before the “update” has started and not the “dirty” data that was created by this “update” (that might be rollbacked as well as committed).
The solution to handle this particular problem is MVCC – you need to store a number of versions for each row of the table that got changed. This data should be stored somehow and somehow maintained. I will discuss a number of approaches to make it.