Snowflake: The Good, The Bad and The Ugly

Snowflake, or SnowflakeDB, is a cloud SaaS database for analytical workloads and batch data ingestion, typically used for building a data warehouse in the cloud. However, it appears to be so cool and shiny that people are going mad praising it all around the internet. Seeing that, I could not resist the urge to take a closer look at this technology and poke into some of its pain points. What also struck me at first is the lack of SnowflakeDB criticism in blogs and message boards, which sounds suspicious given the self-proclaimed customer base of more than 1,000 enterprises. So, let’s take a closer look at it.

Diagram 1. Snowflake architecture, based on their SIGMOD paper.

I will not write a full review of the Snowflake design here, as it would take a lot of pages and the Snowflake team has already done it pretty well. Instead, I will focus on the main principles of their design and what differentiates it from the other DWH solutions available on the market:

  • Amazon S3 and Azure Storage are used as the storage backend. This approach allows SnowflakeDB to piggyback on cloud providers both in terms of capacity provisioning and IO throughput guarantees for their storage. This is pretty convenient: cloud providers tend to invest a lot of money in making their storage bullet-proof, as this is one of the main value propositions of a cloud platform, and it would be foolish not to use it to your own advantage when developing a new cloud-native solution.
  • Storage is decoupled from compute. Traditional MPP solutions use the local HDDs of the cluster nodes to store the data, effectively creating a non-divisible scaling unit of a single cluster machine. Also, due to the MPP design, all the machines in a cluster must have the same hardware specs to avoid resource skew between cluster nodes, so the coupling of compute resources with storage is pretty tight. I have seen a bunch of examples where the vendor put tremendous effort into decoupling storage from compute: take Exadata, or the multiple storage vendors trying to put their enterprise storage appliances under more traditional MPP solutions like Vertica and Greenplum. It has never really achieved any success, for a simple reason – to build distributed storage elastic enough to handle MPP workloads, you need to put a tremendous amount of time and money into its development, deployment and support. This is what all the cloud providers did to introduce their cloud storage offerings, and this is what no enterprise can afford to repeat in their on-premises deployments. So, Snowflake has decoupled storage from compute, introducing the concept of stateless worker pools (Virtual Warehouses) that talk to the cloud storage to read and write the data.
  • Stateless compute nodes. Query execution is handled by the processes running in a Virtual Warehouse. A Virtual Warehouse is essentially a set of stateless worker nodes that you can provision and scale on demand. You also have the opportunity to run multiple Virtual Warehouses over the same data stored in your cloud storage, which is a big advantage. Saying they are stateless is not completely true, as each of the nodes in a Virtual Warehouse utilizes local disks for caching the data read from cloud storage. Block affinity to the Virtual Warehouse nodes is used to take advantage of this caching; however, “stealing” is possible, so the affinity is not strict.
Diagram 2. Snowflake micro-partitions, illustration from the official documentation.
  • Micro-partitions. The data is stored in the cloud storage in reasonably sized blocks: 16 MB based on the SIGMOD paper, 50 MB to 500 MB of uncompressed data based on the official documentation. This block is called a micro-partition. Each block represents a set of rows from the table and is stored in a compressed columnar format. Each micro-partition has a set of statistics for each of the fields: min-max values, number of distinct values, a bloom filter of the fields that exist in semi-structured data. For semi-structured data, some of the fields are pulled out of the blob (based on heuristics) and have the same field-level statistics collected on them. These fields are also subject to heuristic-based type deduction, which allows extra statistics to be collected (for example, converting a JSON field to an integer and collecting min-max statistics for this integer). Statistics are collected for each block independently at block creation, which is handled by the Snowflake engine transparently to the users. (A small SQL sketch of querying such fields follows this list.)
  • Centralised metadata management. SnowflakeDB has a component called Cloud Services. Effectively, Snowflake stores all the metadata for its customers in this layer, in a secret-sauce key-value store. Metadata is stored in a centralised manner (except for Snowflake VPS customers), which means block-level statistics and other metadata are stored in a single key-value store for a large set of customers. Storing metadata this way allows Snowflake to do interesting tricks like cloning tables or whole databases just by cloning their metadata. The cloned metadata keeps referencing the same micro-partitions, and after the metadata fork the two objects can be changed independently of each other. (The second sketch after this list shows cloning together with time travel.)
Diagram 3. Snowflake online upgrades, based on their SIGMOD paper.
  • Online upgrades. Software upgrades are transparent to you and are handled centrally by the Snowflake team. This is achieved by having stateless compute and centralised cloud storage. Snowflake handles upgrades by doing a dual run of the old and new software versions, allowing your client applications to transition from the old to the new deployment transparently.
  • MVCC on a block level. Data blocks in Snowflake are immutable, i.e. micro-partitions are immutable. Each update of the data stored in a micro-partition effectively causes a new micro-partition with the updated data to be created, and metadata about this event to be written to the centralised metadata storage. This way, MVCC is implemented on a block level. Imagine you read table T at timestamp X. The query engine asks the metadata store to return all the blocks that existed in table T at timestamp X, and then reads those blocks. If at a later moment Y an update of the table data has happened, the metadata is updated to reflect this change, so that a query at timestamp Z will read a different set of blocks than the query at timestamp X. Old blocks can be retained for 90+ days, if you have enough money to afford it, as you will be billed for the cloud storage used by the old blocks. Expired blocks are automatically cleaned up by the Cloud Services auxiliary processes, so you don’t have to worry about it.
  • No partitions. No indexes. No tunables. This is one of the main mottos of Snowflake – it is a solution that is ready to use out of the box, and you don’t need to spend time on optimising and tuning it (according to their marketing).
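
To make the micro-partition statistics point more concrete, here is a minimal SQL sketch of querying a semi-structured field with explicit casts. The table and field names are made up for illustration; filters like this are exactly what the per-block statistics on extracted sub-columns are meant to speed up.

    -- Hypothetical table with a VARIANT column holding raw JSON events
    CREATE TABLE raw_events (payload VARIANT);

    -- Path access into the semi-structured blob plus an explicit cast;
    -- Snowflake can pull such sub-columns out and keep per-block statistics on them
    SELECT payload:customer.id::INTEGER  AS customer_id,
           payload:amount::NUMBER(10, 2) AS amount
    FROM   raw_events
    WHERE  payload:amount::NUMBER(10, 2) > 100;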
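
And a minimal sketch of the metadata-only cloning and block-level MVCC (time travel) described above. The table names are hypothetical, and the available retention window depends on your edition and settings.

    -- Zero-copy clone: only metadata is copied, and both tables initially
    -- reference the same micro-partitions
    CREATE TABLE orders_dev CLONE orders;

    -- Block-level MVCC in action: read the set of micro-partitions that
    -- existed one hour ago
    SELECT COUNT(*) FROM orders AT(OFFSET => -3600);

    -- Or pin the read to an explicit timestamp
    SELECT COUNT(*) FROM orders AT(TIMESTAMP => '2019-07-01 12:00:00'::TIMESTAMP_TZ);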

Of course, this solution has numerous advantages. I won’t describe them here – you can just read the official marketing materials; no one will speak about their advantages better than their own marketers. I will focus more on the shortcomings of Snowflake, starting with the least technical ones:

  • You don’t own the deployment infrastructure. This means you fully depend on Snowflake support to react fast in case of an emergency. You can’t perform any data recovery or mitigation procedures on your side unless you have a separate on-demand deployment of a different DWH solution, which would obviously be too expensive to maintain.
  • You don’t own your data. Metadata (including encryption keys) is decoupled from the data and is stored by Snowflake separately. This means all the S3 or Azure Storage blobs you have are completely meaningless if the Cloud Services layer is affected by an outage.
  • Shared Cloud Services. The Cloud Services layer is shared across multiple customers (except VPS ones), which means any security incident related to Cloud Services might result in the exposure of data for multiple customers at the same time. Another problem with shared Cloud Services is that an outage of the Cloud Services layer has a huge blast radius: it will affect multiple customers, and getting support in case of such an outage will be very problematic.
  • Rapid customer base growth. Surprisingly, this is a problem, and I’ll explain why. Fast growth of the customer base means a rapidly increasing load on the support and engineering departments of the company. For a complex enough software solution like Snowflake, it usually takes at least 6 months for a newly hired engineer to reach productivity, and in those first 6 months the new hire will consume loads of their experienced peers’ time. So there are natural limits on how fast the engineering and support teams can grow. Decreasing support quality and the introduction of first-line support have been a direct consequence of this, and have been reported as a drawback by multiple customers.
  • Proprietary code. It means there is a clear cap on the level of expertise for all the engineers outside of the Snowflake company. This angle can also be used by Snowflake for selling professional services engagements, but it is clearly a pain for the customers, especially for the ones used to open source solutions like Apache Hadoop and Apache Spark.
  • Small ecosystem. Snowflake is still a young technology compared to its competitors, and so a lot of functionality is still missing: no geospatial support, limited UI functionality, immature ETL tool integration.
  • Only cloud. AWS and Azure are the only deployment options. I have nothing to add to this.

And some more technical issues directly related to their design:

  • No partitions. Yes, you can’t partition your fact table – it will be a single table with all the N years of events. To work around this, you can sort the table data by a field like transaction date. This allows Snowflake to use the min-max statistics of micro-partitions to prune the ones that do not contain the relevant dates for queries that filter on date. These are called clustered tables. Snowflake will gladly maintain the data clustered for you transparently, but of course for a fee in the compute and storage resources required to achieve this.
  • No indexes. So you don’t have an index, but still want an efficient query that needs to select a single customer from a data mart? And you have already used clustering on the date? Well, here is a materialized view for you: just define a copy of the base table clustered by a different key, and there you have your fast customer-retrieving query. Of course, this comes at the cost of the compute and storage resources required to maintain a full copy of the data and to shadow each DML operation on the base table. (A SQL sketch of both workarounds follows this list.)
  • Only bulk data load. Remember that modifying even a single field in a single row causes the whole block of data to be copied? This effectively means the only optimal operational mode for Snowflake is bulk inserts, and the larger your batch, the more efficient the data processing and the final data layout will be. As an example, having an operational data store with a Customer table and doing CDC that delivers updates to 0.1% of the Customer table entries every 5 minutes to your DWH is not practical. You would have to do a full table scan for each update, and on each update the Snowflake engine may create a copy of every micro-partition if you are unlucky enough (if the IDs of the customers with changed records are distributed evenly and happen to hit each micro-partition). (The bulk-load path via a stage and COPY INTO is sketched after this list.)
  • Limited UDFs. Being a SaaS solution, Snowflake puts a strong emphasis on security. This means only interpreted, sandboxable languages are possible in UDFs; currently SQL and JavaScript are supported (a UDF sketch follows this list). You can say goodbye to your Python scripts, as Python does not have a native sandbox solution. Also, Snowflake is not mature enough to offer something like PL/SQL or at least PL/pgSQL.
  • No data constraints. No constraints are enforced except NOT NULL. No “CHECK” constraints for you to ensure data quality at the load time.
  • Limited workload management. If you want higher priority for some of the traffic, isolate it in a separate virtual warehouse (a warehouse-definition sketch follows this list). It works, but not for all use cases, and greater separation incurs greater costs. Fun anecdote: Snowflake allows queries to spill to S3 and effectively never die due to OOM. So your data analysts, with their queries containing Cartesian products, will be happy to utilize terabytes of cloud storage resources. (Update: automatic scaling is available for Snowflake Enterprise Edition and above, which helps reduce the cost of running multiple virtual warehouses by scaling them automatically based on the observed load.)
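
To illustrate the clustering and materialized view workarounds above, here is a rough SQL sketch; the table and column names are hypothetical, and materialized views require Enterprise Edition.

    -- Keep the fact table clustered by date so min-max statistics can prune
    -- micro-partitions for date-filtered queries
    ALTER TABLE sales CLUSTER BY (transaction_date);

    -- A materialized view clustered by a different key serves the
    -- "single customer" lookup, at the cost of maintaining a second copy
    CREATE MATERIALIZED VIEW sales_by_customer
      CLUSTER BY (customer_id)
      AS SELECT customer_id, transaction_date, amount FROM sales;

    SELECT * FROM sales_by_customer WHERE customer_id = 42;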
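
The bulk-load path the engine is optimised for looks roughly like this; the stage name, bucket and file format are made up, and authentication (credentials or a storage integration) is omitted.

    -- A named external stage pointing at the files to ingest
    CREATE OR REPLACE STAGE daily_stage
      URL = 's3://my-bucket/daily/';

    -- One large batch per load: the bigger the batch, the better the
    -- resulting micro-partition layout
    COPY INTO sales
      FROM @daily_stage
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);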
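
For completeness, a sketch of what the currently supported UDFs look like; the function itself is a made-up example. Note that inside a JavaScript UDF the arguments are referenced in upper case.

    CREATE OR REPLACE FUNCTION normalize_phone(p VARCHAR)
      RETURNS VARCHAR
      LANGUAGE JAVASCRIPT
      AS 'return P.replace(/[^0-9]/g, "");';

    SELECT normalize_phone('+1 (555) 123-4567');  -- '15551234567'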
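
And a sketch of workload isolation via separate virtual warehouses, with the auto-suspend/auto-resume and multi-cluster settings that come up in the comments below; the warehouse names and sizes are illustrative, and multi-cluster warehouses require Enterprise Edition.

    -- A big warehouse for ETL that suspends quickly when idle
    CREATE WAREHOUSE etl_wh WITH
      WAREHOUSE_SIZE = 'XLARGE'
      AUTO_SUSPEND   = 60        -- seconds of inactivity before suspending
      AUTO_RESUME    = TRUE;

    -- A smaller multi-cluster warehouse for BI users that scales out under concurrency
    CREATE WAREHOUSE bi_wh WITH
      WAREHOUSE_SIZE    = 'MEDIUM'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 4
      AUTO_SUSPEND      = 300
      AUTO_RESUME       = TRUE;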

In summary, I’d say Snowflake is a really promising technology, and it is truly a cloud-native DWH solution. I really like all the engineering work the company is doing, and I look forward to seeing more news of their successful customer base expansion. However, I remain skeptical whenever any technology pretends to be a silver-bullet solution.

References:

  • Dageville et al., “The Snowflake Elastic Data Warehouse”, SIGMOD 2016.
  • Snowflake official documentation: https://docs.snowflake.com/

Disclaimer: everything in this article represents my personal and humble opinion, and is not affiliated with any of my employers.

34 thoughts on “Snowflake: The Good, The Bad and The Ugly”

  1. Uli Bethke

    Couple of points on the bulk upload.

    – No full table scan is needed. Micro-partition pruning can be used with MIN/MAX stats, as you describe elsewhere
    – You can further minimise the “seek time” and micro-partition overlap for updates by using cluster keys
    – A full re-write of a micro-partition takes between 1-2 s
    – Using small and frequent micro-batches makes micro-partitions much smaller and effectively reduces the micro-partition re-write time by a factor of 10. The trade-off obviously is smaller micro-partitions and less efficient compression. A rewrite of a micro-partition then takes between 100-200 ms

    1. 0x0FFF Post author

      > No full table scan is needed. Micro-partition pruning can be used with MIN/MAX stats as you describe elsewhere
      It depends on the table design. In my specific example of a Customer table with mutable rows, you will end up with a full scan if the amount of changed records is significant enough. With 0.1% of the table records changed in one batch, you will end up with a full scan if the changed records are distributed uniformly across the customers and each micro-partition contains information on more than 1,000 customers. However, I agree that you can use different design patterns to work around this problem, for example by using SCD2 with only a start timestamp. This will speed up the load, but if the table is clustered by customer ID it will later cause the table to be re-clustered, again bloating the storage.

      > You can further minimize the “seek time” and micro-partition overlap for updates by using cluster keys
      Agree. But while speeding up the operations, clustering will eventually cause table storage bloat even if the amount of changes is moderately low. Greenplum has record-level MVCC, and yet I have seen multiple customers bloat their tables up to 10x the original size. Imagine what will happen if a change to a single record causes the creation of a new micro-partition, i.e. a new version of thousands of records. This will amplify 10x size growth to 1000x and more.

      > Using small and frequent micro-batches makes micro-partitions much smaller and effectively reduces the micro-partition re-write time by a factor of 10. The trade-off obviously is smaller micro-partitions and less efficient compression. A rewrite of a micro-partition then takes between 100-200 ms
      I’d say it is not about micro-batches, but more about proper data model design to avoid mutable data altogether.

  2. Uli Bethke

    As you rightly say workload management is done via virtual warehouses. I see various advantages over traditional logical workload managers. In particular I can dynamically autoscale each virtual warehouse at runtime rather than statically pre-allocate resources at design time. I would also argue that less separation incurs greater costs (rather than the other way around) as the pre-allocated resources may sit idle and can’t be dynamically re-allocated.

    Apart from this a very good write up.

    1. 0x0FFF Post author

      > As you rightly say workload management is done via virtual warehouses. I see various advantages over traditional logical workload managers. In particular I can dynamically autoscale each virtual warehouse at runtime rather than statically pre-allocate resources at design time.
      I agree that it makes sense. My point is that essentially nothing stops Snowflake from having both. However, having it the way it is now is advantageous to Snowflake: first, you don’t need to spend development cycles implementing complex workload management, and second, the current implementation encourages users in need of better workload isolation to spend more on Snowflake (upgrade to Enterprise Edition, spin up more VWs).

      > I would also argue that less separation incurs greater costs (rather than the other way around) as the pre-allocated resources may sit idle and can’t be dynamically re-allocated.
      That’s not what I was talking about. If you have X resources in a cluster and a heterogeneous workload, splitting this resource pool into two of size X/2 and spreading the workload between them will likely not work. Having a heterogeneous workload on a single resource pool allows you to amortize the cost of a resource spike in one workload type by letting it use the spare resources not currently utilized by another workload. When you decide to separate workloads into different resource pools, you need to provision each separate resource pool to handle the peak load of that specific workload. In practice, splitting X in two will likely lead to two clusters of X*3/4 size each.

      1. Paul Horan

        >>When you decide to separate workloads to different resource pools, you need to provision each separate resource pool to handle peak load of that specific workload. In practice, splitting X into two will likely lead to two clusters of X*3/4 size.<<

        But remember that Snowflake virtual warehouses have auto-suspend and auto-resume, and horizontal scaling is automatic! So while the "definition" of those new resource pools might total to more than the original X value, the actual consumption totals could actually be LESS than running X for the entire period – because you can design the new clusters to better fit the workload profiles of the users that will be in them. Put dashboard users in SMALL or MEDIUMS with MCW turned on. Put data scientists in L or XL clusters with auto-suspend set to 5 minutes. Run large ETL jobs in 3 or 4XLs to maximize threading.

        Single-cluster technologies have to be scaled for the largest consumer (resulting in wasted cycles during idle periods), or for an average consumption level (resulting in bad SLAs and frustrated users – and also wasted cycles during idle periods).
        Snowflake doesn't have that problem.

        1. 0x0FFF Post author

          You’re right, I missed the announcement of the automatic scaling feature and didn’t know it is now possible. I’ll add a note to the text.

  3. Martin Alvarez

    Well, RESUME and SUSPEND do not cover typical EDW workloads. In many cases these Data Warehouses need to run 24X7. It may work for some workloads running in their own Virtual Warehouses (Clusters), but not for all.

    Also, before you happily abuse the SUSPEND/RESUME functionality, keep in mind that every time you “SUSPEND” the VWH you are in fact releasing the nodes back to the node pool. This will make you lose the “LOCAL CACHE” of the VWH. When you “RESUME” the VWH you are actually “REBUILDING” it with fresh nodes, so they have an empty local cache. How will this affect you? QUERY SLA PREDICTABILITY. The queries that run through the empty nodes will have to gather the data from the micro-partitions, and this will eventually “warm” the local cache, but as a result the first queries take 4 to 10 times longer than they would once the local cache is built.

    1. Uli Bethke

      Yes, you release the nodes back to the pool. However, if these nodes are not assigned to another customer in the meantime the cache will still exist when resumed.

      1. Martin Alvarez

        With “per second” billing and high demand, it is quite unlikely that you will recover the same nodes unless you resume the Data Warehouse within 5 minutes or less. Then what is the point of suspending it?

        1. Stephen Pace

          @Martin: Two points. 1) Imagine hundreds or thousands of machines in the free pool and that Snowflake can determine which of the machines to deploy back to customers. It wouldn’t be hard to deploy “older” machines first out of that pool. 2) Imagine an ELT process that runs for 1 minute every 5 minutes. Shutting down a warehouse immediately and billing by the second is the difference between billing you 100% of the cluster vs 20% of the cluster. Those seconds add up.

  4. DforData

    Thank you for this thoughtful piece. I guess there needs to be more such constructive criticism to help the system improve. As you’ve rightly pointed out, all the noise about Snowflake at this point is mostly one-sided.

    On a related note, would it be possible for you to see if you could think of other systems that have handled the issues listed here differently, for better or worse? Something like this – “Greenplum has record-level MVCC, and yet I have seen multiple customers bloating their table up to 10x the original size” – for the other issues as well. What do you think?

    Also, maybe because Snowflake is new, we can only find show-and-tells about migration from other DBs (like Greenplum and PostGres) to Snowflake, but not the other way around.

  5. Ken Raetz

    Alexey, thank you for this honest and thorough write-up of Snowflake. Anytime a vendor says that “there is no bad or ugly”, I’m immediately suspicious. I’ve heard this very line with Snowflake. Yet I am intrigued by all of the press it is getting lately. So, we’re starting to look at it now too. Your write-up was definitely helpful to better understand what we might be getting into. Thanks again!

  6. ihafidh

    Can you comment on using operational type of data (e.g. searches and look-ups for specific values) on Snowflake? I know that MPP systems are not suited for operational data. However, with all the hype surrounding Snowflake there are some trying to put operational data on MPP systems because they heard that Snowflake is fast.

    1. 0x0FFF Post author

      Regardless of the design, MPP systems are bad at this type of workload. There were multiple attempts to make a hybrid system, but any hybrid is a tradeoff – to make the system a bit better for operational load you make it significantly worse for the analytical workload. The systems that serve operational data use different principles in the design. One example is an index. You need an index for fast lookup of the row. However, adding the index to your table introduces a burden of maintaining it while inserting the data, which is especially problematic for analytical systems that ingest data in large batches.

      In the end, hybrid systems exist and they are good at what they do, but they lose to specialized systems in both areas, e.g. a separate system for operational load and a separate system for analytical load will always beat a hybrid system in performance. However, a hybrid system might result in a better TCO, but this should be assessed on a case-by-case basis.

  7. Nish

    Great article as always. One thing I am intrigued by is Snowflake avoiding data rebalancing when scaling up. In the typical Hadoop world, if you add a new node, the data will be re-organized; here, when auto-scaling adds a new node or a new cluster, how does the data shuffling happen?

    Somewhere in the documentation I read that Snowflake does an in-memory copy of the data to avoid rebalancing… can anyone explain it?

    1. Stephen Pace

      Nish, I think you’ve missed that compute and storage are completely separate in Snowflake. When you add more nodes to a warehouse cluster, there is no data there to redistribute. Think of it like this: when you start a warehouse cluster, it is “empty”. When you run a query, micro-partitions get cached on the nodes in the cluster from the cloud object store. The query result is then returned to the user. If you flex up from a Medium (4 node) warehouse to a Large (8 node) warehouse, those additional 4 nodes are immediately available to do work for the next query that comes along (again, initially empty, but their cache will start to hydrate as queries come along).

      1. Nish

        Ok, got it. Thanks a lot, Steve. And given that network speed is now very close to disk IO, this becomes more efficient than the data reshuffling in an MPP system. Cool.

    2. 0x0FFF Post author

      Snowflake uses S3 as the persistent data store. It is up to Amazon to maintain S3 and rebalance the data when they add new machines to their S3 cluster. It all happens transparently to you and is handled by the cloud provider, which is also responsible for load-based data rebalancing.
      As for Snowflake, it just caches the data for faster access; the caching is not there to avoid rebalancing, it is there to make your data access faster.

  8. Kolos Kantor

    Thank you for the insightful and honest review Alexey!
    One question though, I do not understand your statement regarding Python. You said: “You can say goodbye to the Python scripts as Python does not have a native sandbox solution.”
    Either I misunderstand what you mean by that, or this article is slightly out of date. Have a look here:
    https://docs.snowflake.com/en/user-guide/python-connector-example.html
    You can use python with Snowflake, quite happily as far as I can tell. We managed to load a whole bunch of data with scripts, handling login credentials and the load itself including staging and content copy/insertion.
    Could you please clarify?

    1. 0x0FFF Post author

      What I meant here is user-defined functions written in Python, i.e. functions that execute within the database on your raw data. You can’t have them in Python, as Python does not have a secure sandbox, which means you could escape the sandbox and access data you should not access, or get hold of Snowflake binaries you should not have.

  9. William Summer

    There are a lot of purely techy chats about Snowflake. I would not argue with anyone about the design, the Snowflake architecture, or the hardware.
    But for me as a developer, Snowflake is an extremely vanilla version of DB software. Yes, it stores data; yes, “select *” works; but nothing beyond that.
    If you want to develop anything, it is a pain in the b4tt: limited functions, limited use of variables, no dynamic SQL, limitations everywhere. Its metadata – information about objects in the database – is very slow. Queries with wildcard searches are very slow. If you have to write anything sophisticated (matching, de-duplicating, etc.), it will take 10 times longer on Snowflake than on, for example, SQL Server, and even then it is questionable whether you could write it without part of your script running in Java.
    I am pretty sure that many engineers would not agree with me. “This bridge is perfect”, maybe… but why do I have to drive a pedal car backwards to cross it?

  10. Kenneth C Wilson

    Not a fan. They started using it at my workplace, and it has a pay structure designed around when you query data. This seems to have limited our company’s IT setup, and everything is being handled via Duo authentication with a web-based response.

    This has led to some extreme difficulties with getting data ETL loads for end users automated at all. It also has no integration capability with SSIS – literally the only platform I have ever worked with where I could not do that. We are stuck with an ODBC connector which must use that web-based authentication, which means this can’t be run from an SSIS server using a Service Account at all.

    They have some Power BI configurations set up, but those don’t work either as far as regularly updating your report goes.

    Another issue I have seen is data typing, which has led to varchar fields with exceptionally large lengths. So big that it has required using CAST to set a proper length for the data fields you are pulling in a query. Not sure if this is normal for Snowflake, but it definitely is a problem for ETL work.

    I don’t see any advantage to using Snowflake over other platforms like Azure, AWS or Google BigQuery.

