Yesterday my blog has got the 100th subscriber. To commemorate this, I prepared the post on the major industry trends happening in the field of “data”. I might miss something, so feel free to comment and extend the article with your opinion!
Big data is falling down the hype curve
Even though Gartner has removed “Big Data” from the last year’s hype diagram, it does not mean it suddenly moved from the peak of the “hype” to the plateau of adoption. Here is how the hype cycle look like:
The question regarding running Hadoop on a remote storage rises again and again by many independent developers, enterprise users and vendors. And there are still many discussions in community, with completely opposite opinions. I’d like to state here my personal view on this complex problem.
The faster your data warehousing solution runs, the higher would be the business demand related to the speed of new data availability in their reports. Over the last time I’ve seen a number of attempts to build up a cool thing called “online DWH” – a data warehouse that is almost in sync with data sources and has its data marts and reports dynamically updated as new data flows into it. This is a very great and powerful thing, but unfortunately its implementation is not as straightforward as the business wants it to be.
Hadoop is known to be an ideal engine for processing unstructured data. But wait, what do you really mean by “unstructured data”? Can anything be considered as a “data” if it does not have a structure? Let’s start by taking a look at the historical brief.
Over the latest time I’ve heard many discussions on this topic. Also this is a very popular question asked by the customers with not much experience in the field of “big data”. In fact, I dislike this buzzword for ambiguity, but this is what the customers are usually coming to us with, so I got to use it.
If we take a look 5 years back, that was the time when Hadoop was not an option for most of the companies, especially for the enterprises that ask for stable and mature platforms. At that very moment the choice was very simple: when your analytical database grow beyond 5-7 terabytes in size you just initiate an MPP migration project and move to one of the proven enterprise MPP solutions. No one heard about the “unstructured” data – if you got to analyze logs just parse them with Perl/Python/Java/C++ and load into you analytical DBMS. And no one heard about high velocity data – simply use traditional OLTP RDBMS for frequent updates and chunk them for insertion into the analytical DWH.
Over the time working with enterprise customers, I repeatedly hear the question regarding the Hadoop cluster backup. It is a very reasonable question from the customer standpoint as they know that the backup is the best option to protect themselves from the data loss, and it is a crucial concept for each of the enterprises. But this question should be treated with care because when interpreted in a wrong way it might lead to huge investments from the customer side, that in the end would be completely useless. I will try to highlight the main pitfalls and potential approaches that would allow you to work out the best Hadoop backup approach, which would fulfill your needs.
The world is biased. You can find many examples of it everywhere around you. I really like the story about the doctor:
I felt sick and went to the doctor. The doctor prescribed me specific pills that would help me get better. And it’s completely fine, unless I mentioned that this doctor has a pen, notepad and calendar branded by the same pills he prescribed me to take. I’ve never taken this pills.
This is a true story happening everywhere in my home country. The problem is this kind of things happens everywhere, including the IT sector.