Snowflake, or SnowflakeDB, is a cloud SaaS database for analytical workloads and batch data ingestion, typically used for building a data warehouse in the cloud. However, it appears to be so cool and shiny that people are going mad praising it all over the internet. Seeing that, I could not resist the urge to take a closer look at this technology and poke into some of its pain points. What also puzzled me at first was the lack of SnowflakeDB criticism in blogs and on message boards, which looks suspicious given the self-proclaimed customer base of more than 1000 enterprises. So, let’s take a closer look at it.
Yesterday my blog got its 100th subscriber. To commemorate this, I prepared a post on the major industry trends in the field of “data”. I might have missed something, so feel free to comment and extend the article with your opinion!
Big data is falling down the hype curve
Even though Gartner removed “Big Data” from last year’s hype diagram, it does not mean it suddenly moved from the peak of the “hype” to the plateau of adoption. Here is what the hype cycle looks like:
The question of running Hadoop on remote storage is raised again and again by independent developers, enterprise users and vendors. There are still many discussions in the community, with completely opposite opinions. I’d like to state my personal view on this complex problem here.
The faster your data warehousing solution runs, the higher the business demand for new data to appear quickly in reports. Recently I’ve seen a number of attempts to build a cool thing called “online DWH” – a data warehouse that is almost in sync with its data sources and has its data marts and reports dynamically updated as new data flows into it. This is a great and powerful thing, but unfortunately its implementation is not as straightforward as the business wants it to be.
Great news! I took part in a podcast recorded by Pivotal and published on our official blog. In this podcast I discuss data architecture in general – how things started, what the main driver of its evolution was, and what we now have as a “modern data architecture”. Come and listen here: http://blog.pivotal.io/pivotal-perspectives/features/discussing-modern-data-architecture
A text transcript of this talk is also available at the same URL.
Hadoop is known to be an ideal engine for processing unstructured data. But wait, what do you really mean by “unstructured data”? Can anything be considered “data” if it does not have a structure? Let’s start with a brief look at the history.
Lately I’ve heard many discussions on this topic. It is also a very popular question among customers with little experience in the field of “big data”. In fact, I dislike this buzzword for its ambiguity, but this is what customers usually come to us with, so I have to use it.
If we look 5 years back, that was a time when Hadoop was not an option for most companies, especially for enterprises that ask for stable and mature platforms. At that moment the choice was very simple: when your analytical database grew beyond 5-7 terabytes in size, you just initiated an MPP migration project and moved to one of the proven enterprise MPP solutions. No one had heard about “unstructured” data – if you had to analyze logs, you just parsed them with Perl/Python/Java/C++ and loaded them into your analytical DBMS. And no one had heard about high-velocity data – you simply used a traditional OLTP RDBMS for frequent updates and chunked them for insertion into the analytical DWH.
While working with enterprise customers, I repeatedly hear the question of Hadoop cluster backup. It is a very reasonable question from the customer’s standpoint: they know that backup is the best way to protect themselves from data loss, and it is a crucial concept for every enterprise. But this question should be treated with care, because a wrong interpretation might lead to huge investments on the customer side that would turn out to be completely useless in the end. I will try to highlight the main pitfalls and the potential approaches that would let you work out the Hadoop backup strategy that fulfills your needs.
The world is biased. You can find many examples of it everywhere around you. I really like the story about the doctor:
I felt sick and went to the doctor. The doctor prescribed me specific pills that would help me get better. That would have been completely fine, had I not noticed that this doctor had a pen, notepad and calendar branded with the same pills he prescribed me. I never took those pills.
This is a true story that happens all over my home country. The problem is that this kind of thing happens everywhere, including the IT sector.