Snowflake: The Good, The Bad and The Ugly

Snowflake, or SnowflakeDB, is a cloud SaaS database for analytical workloads and batch data ingestion, typically used for building a data warehouse in the cloud. However, it appears to be so cool and shiny that people praise it madly all around the internet. Seeing that, I could not resist the urge to take a closer look at this technology and poke into some of its pain points. What also struck me at first was the lack of SnowflakeDB criticism in blogs and message boards, which sounds suspicious given the self-proclaimed customer base of more than 1000 enterprises. So, let’s take a closer look at it.

Diagram 1. Snowflake architecture, based on their SIGMOD paper.

I will not write a full review of the Snowflake design here, as that would take many pages, and the Snowflake team has already done it pretty well. Instead, I will focus on the main principles of their design and how it differs from the other DWH solutions available on the market:

  • Amazon S3 and Azure Storage are used as the storage backend. This approach allows SnowflakeDB to piggyback on the cloud providers both for capacity provisioning and for the IO throughput guarantees of their storage. This is pretty convenient: cloud providers tend to invest a lot of money in making their storage bullet-proof, as this is one of the main value propositions of a cloud platform, and it would be foolish not to use it to your advantage when developing a new cloud-native solution.
  • Storage is decoupled from compute. Traditional MPP solutions use the local HDDs of the cluster nodes to store the data, effectively making a single cluster machine the indivisible scaling unit. Also, due to the MPP design, all the machines in a cluster must have the same hardware specs to avoid resource skew between cluster nodes, so the coupling of compute resources with storage is pretty tight. I have seen a number of examples where a vendor put tremendous effort into decoupling storage from compute: take Exadata, or the multiple storage vendors trying to put their enterprise storage appliances under more traditional MPP solutions like Vertica and Greenplum. It has never really succeeded, for a simple reason: to build distributed storage elastic enough to handle an MPP workload, you need to put a tremendous amount of time and money into its development, deployment and support. This is what all the cloud providers did to introduce their cloud storage offerings, and this is what no enterprise can afford to repeat in their on-premise deployments. So, Snowflake has decoupled storage from compute, introducing the concept of stateless worker pools (Virtual Warehouses) that talk to the cloud storage to read and write the data.
  • Stateless compute nodes. Query execution is handled by processes running in a Virtual Warehouse. A Virtual Warehouse is essentially a set of stateless worker nodes that you can provision and scale on demand. You can also run multiple Virtual Warehouses over the same data stored in your cloud storage, which is a big advantage. Saying they are stateless is not completely true, as each node in a Virtual Warehouse uses its local disks to cache the data read from cloud storage. Block affinity to the Virtual Warehouse nodes is used to take advantage of this caching; however, work “stealing” is possible, so the affinity is not strict.
Diagram 2. Snowflake micro-partitions, illustration from the official documentation.
  • Micro-partitions. The data is stored in the cloud storage in reasonably sized blocks: 16MB according to the SIGMOD paper, 50MB to 500MB of uncompressed data according to the official documentation. Such a block is called a micro-partition. Each block represents a set of rows from the table and is stored in a compressed columnar format. Each micro-partition has a set of statistics for each of the fields: min-max values, number of distinct values, and a bloom filter of the fields that exist in semi-structured data. For semi-structured data, some of the fields are pulled out of the blob (based on heuristics) and have the same field-level statistics collected on them. Such fields are also subject to heuristic type deduction, which allows extra statistics to be collected (for example, converting a JSON field to an integer and collecting min-max statistics for that integer). Statistics are collected for each block independently at block creation time, which is handled by the Snowflake engine transparently to the users (a query sketch over semi-structured data follows this list).
  • Centralised metadata management. SnowflakeDB has a component called Cloud Services. Effectively, Snowflake stores all the metadata for its customers in this layer, in a secret-sauce key-value store. Metadata is stored in a centralised manner (except for Snowflake VPS customers), which means block-level statistics and other metadata are stored in a single key-value store for a large set of customers. Storing metadata this way allows Snowflake to do interesting tricks like cloning tables or whole databases just by cloning their metadata: the clone keeps referencing the same micro-partitions, and after the metadata fork the two objects can be changed independently of each other (a cloning sketch follows this list).
Diagram 3. Snowflake online upgrades, based on their SIGMOD paper.
  • Online upgrades. Software upgrades are transparent to you and are handled centrally by the Snowflake team. This is made possible by the stateless compute and centralised cloud storage. Snowflake handles it by doing a dual run of the old and new software versions, allowing your client applications to transition from the old to the new deployment transparently.
  • MVCC on a block level. Data blocks in Snowflake are immutable, i.e. micro-partitions are immutable. Each update of the data stored in a micro-partition effectively causes a new micro-partition with the updated data to be created, and metadata about this event to be written to the centralised metadata storage. This way, MVCC is implemented at the block level. Imagine you read table T at timestamp X. The query engine asks the metadata store to return all the blocks that existed in table T at timestamp X, and then reads those blocks. If at a later moment Y the table data is updated, the metadata is updated to reflect this change, so a query at timestamp Z will read a different set of blocks than the query at timestamp X. Old blocks can be kept for 90+ days, if you can afford it, as you will be billed for the cloud storage used by the old blocks. Expired blocks are automatically cleaned up by the Cloud Services auxiliary processes, so you don’t have to worry about it. This is also what time travel queries are built on (a sketch follows this list).
  • No partitions. No indexes. No tunables. This is one of the main mottos of Snowflake – it is a solution that is ready to use out of the box, and you don’t need to spend time on optimising and tuning it (according to their marketing).
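
To make the micro-partition statistics concrete, here is a minimal sketch of the kind of query they help with, assuming a hypothetical raw_events table with a VARIANT column named payload; the path-and-cast syntax is regular Snowflake SQL.

    -- Extract typed values from a semi-structured column; the statistics derived
    -- for the extracted fields let the engine prune micro-partitions on the date filter.
    SELECT payload:customer_id::NUMBER AS customer_id,
           payload:event_type::STRING  AS event_type
    FROM   raw_events
    WHERE  payload:event_date::DATE = '2019-07-01';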
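
The metadata-only cloning mentioned above is a single DDL statement; the object names below are made up for the example, and both objects initially reference the same micro-partitions.

    -- Zero-copy clone: only metadata is duplicated, not the data blocks.
    CREATE TABLE orders_dev CLONE orders;
    CREATE DATABASE analytics_dev CLONE analytics;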
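
And a rough sketch of the time travel queries enabled by block-level MVCC, using a hypothetical orders table; the old blocks must still be within the retention period you pay for.

    -- Read the table as it was one hour ago.
    SELECT * FROM orders AT (OFFSET => -3600);

    -- Read the table as of an explicit timestamp.
    SELECT * FROM orders AT (TIMESTAMP => '2019-07-01 12:00:00'::TIMESTAMP_LTZ);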

Of course, this solution has numerous advantages. I won’t describe them here; you can just read the official marketing materials – no one will speak about their advantages better than their own marketers. I will focus more on the shortcomings of Snowflake, starting with the least technical ones:

  • You don’t own the deployment infrastructure. This means you fully depend on Snowflake support reacting fast in case of an emergency. You can’t perform any data recovery or mitigation procedures on your side unless you have a separate on-demand deployment of a different DWH solution, which would obviously be too expensive to maintain.
  • You don’t own your data. Metadata (including encryption keys) is decoupled from the data and is stored by Snowflake separately. This means all the S3 or Azure Storage blobs you have are completely meaningless if the Cloud Services layer is affected by an outage.
  • Shared Cloud Services. The Cloud Services layer is shared across multiple customers (except VPS ones), which means any security incident in Cloud Services might expose the data of many customers at the same time. Another problem with shared Cloud Services is that an outage of this layer has a huge blast radius: it will affect multiple customers, and getting support during such an outage will be very problematic.
  • Rapid customer base growth. Surprisingly, this is a problem, and I’ll explain why. Fast growth of the customer base means a rapidly increasing load on the support and engineering departments of the company. For a software solution as complex as Snowflake, it usually takes at least 6 months for a newly hired engineer to become productive, and during those first 6 months the new hire will consume lots of their experienced peers’ time. So there are natural limits on how fast the engineering and support teams can grow. The decreasing quality of support and the introduction of first-line support have been a direct consequence of this, and have been reported as a drawback by multiple customers.
  • Proprietary code. This means there is a clear cap on the level of expertise for all engineers outside of the Snowflake company. This angle can also be used by Snowflake to sell professional services engagements, but it is clearly a pain for the customers, especially for the ones used to open-source solutions like Apache Hadoop and Apache Spark.
  • Small ecosystem. Snowflake is still a young technology compared to its competitors, so a lot of functionality is still missing: no geospatial support, limited UI functionality, immature ETL tool integrations.
  • Only cloud. AWS and Azure are the only deployment options. I have nothing to add to this.

And some more technical issues directly related to their design:

  • No partitions. Yes, you can’t partition your fact table; it will be a single table with all N years of events. To work around this, you can sort the table data by a field like transaction date. This allows Snowflake to use the min-max statistics of micro-partitions to prune the ones that do not contain the relevant dates for queries that filter on date. This feature is called clustered tables (a sketch follows this list). Snowflake will gladly maintain the data clustered for you transparently, but of course for a fee: the compute and storage resources required to achieve this.
  • No indexes. So you don’t have an index, but you still want an efficient query that selects a single customer from a data mart? And you have already used clustering on the date? Well, here is a materialized view for you: just define a copy of the base table clustered by a different key, and you get your fast customer-retrieving query (a sketch follows this list). Of course, this comes at the cost of the compute and storage resources required to maintain a full copy of the data and to shadow each DML operation on the base table.
  • Only bulk data load. Remember that modifying even a single field in a single row causes the whole block of data to be copied? This effectively means the only efficient operational mode for Snowflake is bulk inserts, and the greater the size of your batch, the more efficient the data processing and the final data layout will be. As an example, having an operational data store with a Customer table and doing CDC that delivers updates for 0.1% of the Customer table entries to your DWH every 5 minutes is not practical. You would have to do a full table scan for each update batch, and the Snowflake engine would create a copy of each micro-partition if you’re unlucky enough that the IDs of the customers with changed records are distributed evenly and happen to hit every micro-partition (a sketch follows this list).
  • Limited UDFs. Being a SaaS solution, Snowflake puts a big emphasis on security. This means only interpreted languages are possible in UDFs, which is why only SQL and JavaScript are currently supported (a sketch follows this list). You can say goodbye to your Python scripts, as Python does not have a native sandbox solution. Also, Snowflake is not mature enough to offer something like PL/SQL or at least PL/pgSQL.
  • No data constraints. No constraints are enforced except NOT NULL. No CHECK constraints to ensure data quality at load time.
  • Limited workload management. If you want higher priority for some of the traffic, isolate it in a separate virtual warehouse (a sketch follows this list). It works, but not for all use cases, and greater separation incurs greater costs. Fun anecdote: Snowflake allows queries to spill to S3 and effectively never die of OOM, so your data analysts with their queries containing Cartesian products will happily consume terabytes of cloud storage. (Update: automatic scaling is available for Snowflake Enterprise Edition and above, which helps reduce the cost of running multiple virtual warehouses by scaling them automatically based on the observed load.)
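
A minimal sketch of the clustered-table workaround from the first bullet above; the events table and event_date column are hypothetical.

    -- Declare a clustering key so min-max pruning works for date filters;
    -- the background re-clustering consumes billable compute and storage.
    ALTER TABLE events CLUSTER BY (event_date);

    -- Check how well the table is currently clustered on that key.
    SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date)');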
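
The materialized-view substitute for an index could look roughly like this (an Enterprise Edition feature; table and column names are invented for the example).

    -- A shadow copy of the base table clustered by customer instead of date,
    -- so point lookups on customer_id can prune most micro-partitions.
    CREATE MATERIALIZED VIEW orders_by_customer
      CLUSTER BY (customer_id)
      AS SELECT customer_id, order_id, order_date, amount FROM orders;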
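
To illustrate the bulk-load point: the happy path is a large COPY from a stage, while trickle CDC has to go through MERGE and pays the micro-partition rewrite cost. Stage, table and column names are hypothetical.

    -- Bulk load: large batches from a stage give the best data layout.
    COPY INTO customer
      FROM @customer_stage
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    -- Trickle CDC: every micro-partition holding an updated row is rewritten.
    MERGE INTO customer c
    USING customer_updates u ON c.customer_id = u.customer_id
    WHEN MATCHED THEN UPDATE SET c.email = u.email
    WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (u.customer_id, u.email);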
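
For the UDF limitation, a sketch of a JavaScript UDF, the only non-SQL option at the moment; the function name and logic are invented for the example, and Snowflake exposes arguments to the JavaScript body under their upper-cased names.

    CREATE OR REPLACE FUNCTION extract_host(url STRING)
      RETURNS STRING
      LANGUAGE JAVASCRIPT
      AS
      $$
        // The "url" argument is visible here as the upper-cased variable URL.
        var m = URL.match(/^https?:\/\/([^\/]+)/);
        return m ? m[1] : null;
      $$;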
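
Finally, workload isolation itself is just warehouse DDL. A sketch with made-up names; the multi-cluster settings on the last two lines require Enterprise Edition or above.

    -- A dedicated warehouse for ETL, suspended automatically when idle.
    CREATE WAREHOUSE etl_wh WITH
      WAREHOUSE_SIZE    = 'LARGE'
      AUTO_SUSPEND      = 300   -- seconds of inactivity before suspending
      AUTO_RESUME       = TRUE
      MIN_CLUSTER_COUNT = 1     -- scale-out range for automatic multi-cluster scaling
      MAX_CLUSTER_COUNT = 3;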

As a summary, I’d say Snowflake is a really promising technology, and it truly is a cloud-native DWH solution. I really like the engineering work the company is doing, and I look forward to seeing more news on their successful customer base expansion. However, I remain skeptical whenever a technology pretends to be a silver-bullet solution.

References:

  • B. Dageville et al., “The Snowflake Elastic Data Warehouse”, Proceedings of SIGMOD 2016.
  • Snowflake official documentation.

Disclaimer: everything in this article represents my personal and humble opinion, and does not represent the position of any of my employers.

8 thoughts on “Snowflake: The Good, The Bad and The Ugly”

  1. Uli Bethke

    Couple of points on the bulk upload.

    – No full table scan is needed. Micro-partition pruning can be used with MIN/MAX stats as you describe elsewhere
    – You can further minimise the “seek time” and micro-partition overlap for updates by using cluster keys
    – A full re-write of a micro-partition takes between 1 and 2 seconds
    – Using small and frequent micro batches makes the micro-partitions much smaller and effectively reduces the micro-partition re-write cost by a factor of 10. The trade-off obviously is smaller micro-partitions and less efficient compression. A rewrite of a micro-partition then takes between 100 and 200 ms

    1. 0x0FFF Post author

      > No full table scan is needed. Micro-partition pruning can be used with MIN/MAX stats as you describe elsewhere
      It depends on the table design. In my specific example of a Customer table with mutable rows, you will end up with a full scan if the number of changed records is significant enough. With 0.1% of the table records changed in one batch, you will end up with a full scan if the changed records are distributed uniformly across customers and each micro-partition contains information on more than 1000 customers. However, I agree that you can use different design patterns to work around this problem, for example by using SCD2 with only a start timestamp. This will speed up the load, but if the table is clustered by customer ID it will later cause the table to be re-clustered, again bloating the storage.

      > You can further minimize the “seek time” and micro-partition overlap for updates by using cluster keys
      Agree. But while speeding up the operations, clustering will eventually cause table storage bloat even if the amount of change is moderately low. Greenplum has record-level MVCC, and yet I have seen multiple customers bloat their tables up to 10x the original size. Imagine what happens when a change to a single record causes the creation of a new micro-partition, i.e. a new version of thousands of records. This will amplify the 10x size growth to 1000x and more.

      > Using small and frequent micro batches makes the micro-partitions much smaller and effectively reduces the micro-partition re-write cost by a factor of 10. The trade-off obviously is smaller micro-partitions and less efficient compression. A rewrite of a micro-partition then takes between 100 and 200 ms
      I’d say it is not about micro batches, but more about proper data model design to avoid mutable data altogether.

  2. Uli Bethke

    As you rightly say, workload management is done via virtual warehouses. I see various advantages over traditional logical workload managers. In particular, I can dynamically autoscale each virtual warehouse at runtime rather than statically pre-allocate resources at design time. I would also argue that less separation incurs greater costs (rather than the other way around), as the pre-allocated resources may sit idle and can’t be dynamically re-allocated.

    Apart from this a very good write up.

    1. 0x0FFF Post author

      > As you rightly say workload management is done via virtual warehouses. I see various advantages over traditional logical workload managers. In particular I can dynamically autoscale each virtual warehouse at runtime rather than statically pre-allocate resources at design time.
      I agree that it makes sense. My point is that essentially nothing stops Snowflake from having both. However, having it the way it is now is advantageous to Snowflake: first, you don’t need to spend development cycles implementing complex workload management, and second, the current implementation encourages users who need better workload isolation to spend more on Snowflake (upgrade to Enterprise Edition, spin up more virtual warehouses).

      > I would also argue that less separation incurs greater costs (rather than the other way around) as the pre-allocated resources may sit idle and can’t be dynamically re-allocated.
      That’s not what I was talking about. If you have X resources in a cluster and a heterogeneous workload, splitting this resource pool into two pools of size X/2 and spreading the workload between them will likely not work. Having a heterogeneous workload on a single resource pool allows you to amortize the cost of a resource spike in one workload type by letting it use the spare resources not currently utilized by another workload. When you decide to separate workloads to different resource pools, you need to provision each separate resource pool to handle peak load of that specific workload. In practice, splitting X into two will likely lead to two clusters of X*3/4 size.

      1. Paul Horan

        >>When you decide to separate workloads to different resource pools, you need to provision each separate resource pool to handle peak load of that specific workload. In practice, splitting X into two will likely lead to two clusters of X*3/4 size.<<

        But remember that Snowflake virtual warehouses have auto-suspend and auto-resume, and horizontal scaling is automatic! So while the "definition" of those new resource pools might total more than the original X value, the actual consumption could be LESS than running X for the entire period – because you can design the new clusters to better fit the workload profiles of the users that will be in them. Put dashboard users in SMALL or MEDIUM warehouses with MCW turned on. Put data scientists in L or XL clusters with auto-suspend set to 5 minutes. Run large ETL jobs in 3XL or 4XL warehouses to maximize threading.

        Single-cluster technologies have to be scaled for the largest consumer (resulting in wasted cycles during idle periods), or for an average consumption level (resulting in bad SLAs and frustrated users – and also wasted cycles during idle periods).
        Snowflake doesn't have that problem.

        1. 0x0FFF Post author

          You’re right, I missed the announcement of the automatic scaling feature and didn’t know it was possible now. I’ll add a note to the text.

