Hadoop is known to be an ideal engine for processing unstructured data. But wait, what do we really mean by “unstructured data”? Can anything be considered “data” if it has no structure at all? Let’s start with a brief look at the history.
Lately I’ve heard many discussions on this topic. It is also a very popular question asked by customers with little experience in the field of “big data”. In fact, I dislike this buzzword for its ambiguity, but it is what customers usually come to us with, so I have to use it.
If we look 5 years back, that was a time when Hadoop was not an option for most companies, especially for enterprises that demand stable and mature platforms. Back then the choice was very simple: when your analytical database grew beyond 5–7 terabytes in size, you initiated an MPP migration project and moved to one of the proven enterprise MPP solutions. No one had heard of “unstructured” data – if you had to analyze logs, you just parsed them with Perl/Python/Java/C++ and loaded them into your analytical DBMS. And no one had heard of high-velocity data – you simply used a traditional OLTP RDBMS for frequent updates and batched the changes for insertion into the analytical DWH.
In my time working with enterprise customers, I have repeatedly heard questions about Hadoop cluster backup. It is a very reasonable question from the customer’s standpoint: they know that backups are the best protection against data loss, a crucial concept for any enterprise. But this question should be treated with care, because interpreted the wrong way it can lead to huge investments on the customer’s side that end up being completely useless. I will try to highlight the main pitfalls and potential approaches that will let you work out the Hadoop backup strategy that best fulfills your needs.
Edit from 2015/12/17: The memory model described in this article is deprecated starting with Apache Spark 1.6; the new memory model is based on UnifiedMemoryManager and is described in this article
Recently I’ve answered a series of questions related to Apache Spark architecture on StackOverflow. All of them seem to be caused by the absence of a good general description of the Spark architecture on the internet. Even the official guide does not have many details, and of course it lacks good diagrams. The same goes for the “Learning Spark” book and the materials of the official workshops.
In this article I will try to fix this and provide a one-stop guide to the Spark architecture in general, along with answers to the most popular questions about its concepts. This article is not for complete beginners – it will not give you an introduction to Spark’s main programming abstractions (RDD and DAG), but requires knowledge of them as a prerequisite.
This is the first article in a series. The second one, about shuffle, is available here. The third one, about the new memory management model, is available here.
The world is biased. You can find examples of it everywhere around you. I really like this story about a doctor:
I felt sick and went to the doctor. The doctor prescribed specific pills that would help me get better. And that would be completely fine, had I not noticed that this doctor had a pen, a notepad and a calendar branded by the very pills he prescribed me. I never took those pills.
This is a true story, and it happens all the time in my home country. The problem is that this kind of thing happens everywhere, including the IT sector.
At the moment there is a huge buzz in the media about the Apache Spark framework, and little by little it is becoming the next big thing in the field of “Big Data”. The simplest way to see this is to look at the Google Trends diagram:
I often hear this question from my customers and around the internet, especially over the last year. So what is myth and what is real about Spark and its place in the “Big Data” ecosystem?
To be honest, the question I put in the title is wrong, but it is usually the way it is asked. Hadoop is not a single product; it is an ecosystem. The same is true for Spark. Let’s cover them one by one. The main pillars of the Hadoop ecosystem at the moment are: Continue reading
Hadoop is becoming a crucial tool in the big data strategy of any enterprise. The more important it becomes, the more companies start to propose solutions utilizing its open-source power. Today I will talk about virtualized Hadoop, one of the modern branches of the current market.
Let’s start with virtualization itself. Its story begins in the 1960s, but the broader adoption of virtualization in enterprises started approximately 10 years ago. What is the purpose of virtualization in the enterprise? Continue reading
Spark is rapidly gaining popularity among people working with large amounts of data. And this is no big surprise, as it offers up to 100x faster data processing compared to Hadoop MapReduce, works in memory, offers an interactive shell and is quite simple to use in general. But in my opinion the main advantage of Spark is its great integration with Hadoop – you don’t need to reinvent the wheel to make use of Spark if you already have a Hadoop cluster. With Spark you can read data from HDFS and submit jobs under the YARN resource manager so that they share resources with MapReduce jobs running in parallel (which might as well be Hive queries or Pig scripts, for instance). All of this makes Spark a great tool that should be considered by any company with a big data strategy.
It is a known fact that Spark is still in its early days, even though it is getting popular. Mainly, this means a lack of well-formed user guides and examples. Of course, there are some on the official website, but they don’t cover integration with HDFS well. I will try to fill this gap by providing examples of interacting with HDFS data through Spark’s Python interface, also known as PySpark. I’m currently using Spark 1.2.0 (the latest available) on top of Hadoop 2.2.0 and Hive 0.12.0 (which come with PivotalHD distribution 2.1, also the latest).
MapReduce is a really popular paradigm in distributed computing at the moment. The first paper describing the principle was published by Google in 2004. Nowadays MapReduce is a term that everyone knows and everyone talks about, because it was laid down as one of the foundations of the Hadoop project. For most people MapReduce is equivalent to “Hadoop” and “Big Data”, which is completely wrong. But there are some people who understand the simplest case, WordCount, and maybe even building an inverted index with MapReduce.
But while it is simple as a concept, its implementation in Hadoop is rather complicated. I’ve tried to find a comprehensive description of it with a good diagram on the internet, but failed. All the diagrams keep repeating “Map – Sort – Combine – Shuffle – Reduce”. Of course, it is good to know that the framework works this way, but what about the dozens of tunable framework parameters? What happens if you reduce or increase the buffer size of the map output? Those diagrams offer no help here. This was my reason to build my own diagram and my own description, based on the latest source code available in the Hadoop repository.
Hadoop MapReduce Comprehensive Diagram