"We saved 30% in our Snowflake warehouse costs by analysing the impact of different warehouse configurations, building a custom multi-threaded python application which submitted a simulated workload to Snowflake."

Introduction

During the pandemic, HelloFresh experienced incredible growth that most businesses can only dream of. It went from a scale-up with a startup mentality to a household name, seemingly within a couple of months.

However, despite our best efforts, one thing was evident within HelloFresh's community of data professionals: our legacy data warehousing solution, based on Cloudera CDH, was struggling to keep up.


In 2022, Snowflake was introduced into the overall stack at HelloFresh to create a scalable, high-quality, descriptive analytics data warehouse solution.

HelloFresh uses a data mesh architecture to allow domain teams and local markets to innovate at the speed of demand, while enabling best practices and tooling standardisation through central functions. These central-function teams are known as Global teams, whereas the local engineering teams sit fully integrated into a region's business unit.

Challenge

Cloud providers charge you for what you use, and this is also the approach taken by Snowflake. However, what happens if the engineering team responsible for spending money isn't the same one that is responsible for writing the cheque?

This is a challenge we have at HelloFresh. Our local engineering teams are focused almost entirely on extracting value from the data and growing the business, which leaves significant room for improving the efficiency of what we already have.

This is exactly the problem I was given in my second week of joining the Global Data Solutions team. The brief was simple: "There is a medium-sized Snowflake warehouse which is costing us significantly more than any other warehouse. We think it's something to do with Grafana. Please investigate and report back."

Figure 1: The mean daily Snowflake credits used by the top eight most expensive Snowflake virtual warehouses at HelloFresh before October 2022

My first thought was: What's a Snowflake warehouse? At that stage I had never worked with Snowflake, and had only heard it casually mentioned on long-term roadmaps along with the several other technologies which make up the sea of tech you are supposed to be clued up on as a data engineer. I have a hard enough time not forgetting SQL. Therefore, cue me nervously googling the Snowflake documentation on my second monitor while in a meeting with my manager.

"Ah yes, this is like virtual compute on the platform right? "

I said, hoping the second monitor's reflection on my glasses didn't betray me.

"Yes exactly"

Phew! But I really needed to do my research. So over the coming weeks and months I had several discussions with our Senior Staff Data Engineer Hariprasad Natarajan and with Frans van Dortmont, our Solutions Architect and point of contact at Snowflake, as well as weekly discussions with Kenneth Holzer, Senior Director of Data Solutions and my manager at the time. I was doing my best to understand how Snowflake worked and, more importantly, how best to make Snowflake work for us.

Context

How Snowflake virtual warehouse costs work

A Snowflake virtual warehouse is a cluster of compute resources on which we can execute SQL, Python, Scala, Java, and JavaScript code. For our purposes we will be focusing on SQL. You can submit SQL queries via a worksheet in Snowflake's web-based user interface, Snowsight, or via an API request using your favourite connection method [1].
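For illustration, a minimal Python sketch of the second option might look like the following; the connection parameters and the query are placeholders rather than our actual configuration:

```python
# A minimal sketch of submitting SQL to a Snowflake virtual warehouse from Python.
# All connection parameters below are placeholders, not HelloFresh's real setup.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder account identifier
    user="my_user",            # placeholder user
    password="my_password",    # in practice, use a secrets manager or key-pair auth
    warehouse="MY_WAREHOUSE",  # the virtual warehouse that will execute the query
    database="MY_DB",
    schema="MY_SCHEMA",
)

try:
    cur = conn.cursor()
    cur.execute("SELECT CURRENT_TIMESTAMP()")  # any SQL statement
    print(cur.fetchone())
finally:
    conn.close()
```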

Figure 2: Image showing different taxi sizes [2] used in an analogy for explaining Snowflake's pricing model for virtual warehouses

We can think of using a Snowflake warehouse like taking a taxi. A taxi is something we can book on demand; it charges us based on how long we use it and only has a limited number of seats. A bigger taxi has more space but also costs more to hire.

Snowflake works in a similar manner. When querying a warehouse, we are charged credits for every second the warehouse spends executing our queries. We can also only have roughly eight concurrent queries running at once on a single-cluster Snowflake warehouse (assuming the results are not being read from cache).

Figure 3: Screenshot from Snowflake website showing virtual warehouse cost per second of usage [3]

Snowflake charges the customer a flat rate of Snowflake credits based on the size of the warehouse used. There are even larger warehouse sizes [3], but for our use case we won't be investigating them, although our methodology will still apply. Getting a larger warehouse is known as scaling up.

We can also have multiple clusters assigned to the same warehouse. This is known as scaling out (getting two taxis instead of one), in which case the credits Snowflake charges are multiplied by the number of clusters running at any given time.
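As a rough back-of-the-envelope illustration, assuming the per-hour credit rates from Snowflake's warehouse overview [3] (1 credit per hour for an x-small, doubling with each size up):

```python
# Indicative per-hour credit rates from Snowflake's warehouse overview [3];
# actual billing is per second, with a minimum charge each time a cluster resumes.
CREDITS_PER_HOUR = {"XSMALL": 1, "SMALL": 2, "MEDIUM": 4, "LARGE": 8, "XLARGE": 16}

def credits_used(size: str, hours: float, active_clusters: int = 1) -> float:
    """Credits consumed by a warehouse of `size` running `active_clusters`
    clusters for `hours` hours (ignoring per-second rounding)."""
    return CREDITS_PER_HOUR[size] * hours * active_clusters

# e.g. a triple-clustered medium running flat out for a day vs a small:
print(credits_used("MEDIUM", 24, 3))  # 288 credits
print(credits_used("SMALL", 24, 3))   # 144 credits
```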

By default, your warehouse won't start charging you for a second cluster until you have maxed out your first one [4]. How aggressively you scale out is determined by the scaling policy set when you configure your warehouse.

What we are trying to figure out is "Which Snowflake warehouse size is best for my workload?"

Before we answer this question however, we must know the following:

  1. What is the purpose of the workload I want to run?
  2. How big is my workload?

Snowflake provides us with data on how we are using the service via its monitoring tables [5]. From these we can query for the insights needed to answer the two questions above.
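As an example, a sketch of the kind of query we ran against the query history is shown below; the warehouse name is a placeholder and the exact columns we aggregated may have differed:

```python
# A sketch of the kind of query run against Snowflake's monitoring data [5];
# the warehouse name is a placeholder.
WORKLOAD_OVERVIEW_SQL = """
SELECT
    user_name,
    DATE_TRUNC('hour', start_time)         AS hour,
    COUNT(*)                               AS query_count,
    SUM(total_elapsed_time) / 1000 / 3600  AS total_elapsed_hours,  -- ms -> hours
    SUM(bytes_scanned)                     AS bytes_scanned
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'US_OPS_REPORTING'   -- placeholder name
  AND start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY total_elapsed_hours DESC
"""
# This would be submitted through the same connector shown earlier.
```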

Analysis

Purpose of Workload

Upon further investigation, I found that the overwhelming majority of costs for our US Ops Reporting warehouse were coming from a single service account.

Figure 4: Graph showing the total execution time by different service accounts on HelloFresh's US Ops Reporting Snowflake virtual warehouse during December 2022. Please note that the hours of the day on this chart are in Central European Time (CET), which is six hours ahead of the EST time zone; 15:00 CET would be 09:00 in New York

This service account was tied to operational dashboards in Grafana that needed frequent refreshes at all hours of the day, as seen in Figure 4. The data being queried comes from a real-time streaming application which itself was frequently refreshing. These dashboards are then shown on screens in our distribution centres.

Therefore any change we make in warehouse size would affect how quickly our operational dashboards refresh on a screen, and hence how quickly a human being could see information on that screen, process it, and make operational decisions accordingly. This gives us some leeway on relative query performance, given that the execution time of the queries in question was well under a second.

Size of Workload

Now that we understand the purpose of our queries, we need to understand how big they are. We can do this by performing a query size mix analysis, in which we analyse how many small, medium, and large queries we are submitting to our virtual warehouse.

This is important since Snowflake gives hardware-based guidelines on the size of queries we should be submitting to different-sized warehouses.

Figure 5: Table showing the recommended Snowflake virtual warehouse size for different query sizes based on bytes scanned [6]. For reference, other metrics such as query complexity and the ordering of data also play a role in determining the right warehouse size; however, the most important metric to review first is bytes scanned.

Based on this recommendation, we have the following query size bands:

Figure 6: Table showing the classification of query sizes based on bytes scanned

We can now analyse how many queries we have running in each query size band for the Snowflake virtual warehouse we are investigating.
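A sketch of such a bucketing query is shown below; only the small band's threshold (less than 1 GB) is quoted in the text, so the other thresholds here are placeholders standing in for the bands in Figure 6:

```python
# A sketch of the query size mix analysis. The byte thresholds below are
# placeholders for the bands in Figure 6; only the < 1 GB "small" band is
# quoted in the text. The warehouse name is also a placeholder.
QUERY_SIZE_MIX_SQL = """
SELECT
    CASE
        WHEN bytes_scanned < 1 * POWER(1024, 3)   THEN 'small'   -- < 1 GB, per the text
        WHEN bytes_scanned < 100 * POWER(1024, 3) THEN 'medium'  -- placeholder threshold
        ELSE 'large'                                             -- placeholder threshold
    END       AS query_size_band,
    COUNT(*)  AS query_count
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'US_OPS_REPORTING'
  AND start_time BETWEEN '2022-12-01' AND '2023-01-01'
GROUP BY 1
ORDER BY query_count DESC
"""
```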

Figure 7: Graph showing the number of small, medium, and large queries (as classified in Figure 6) submitted to the US Ops Reporting Snowflake virtual warehouse at HelloFresh during December 2022

As we can see from this analysis, the vast majority of queries being run on this warehouse are actually quite small (less than 1 GB). This would indicate that the most cost-efficient warehouse size might be a small or even an x-small.

Testing

In order to mimic a real-world scenario, we must submit representative workloads concurrently to different-sized virtual warehouses, making sure to limit the number of concurrent queries being actively worked on by each warehouse to eight.

This is the number of queries our warehouse would be able to process at any given time before it starts to queue our queries. Failure to do this would cause a bottleneck in our test that we cannot control.

Representative workload

In order to get a representative workload, we took a one-hour slice of queries (9–10 am EST) that our Grafana service account had submitted to the Snowflake warehouse in question, extracted via the monitoring database provided by Snowflake.

We then submitted these queries to new and isolated testing warehouses of sizes x-small, small, and medium created specifically for this application. These mimicked the same cluster count and scaling policy as our original US Ops Reporting warehouse.
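For illustration, the kind of DDL used to spin up such a testing warehouse might look like the following; the name and exact values are placeholders mirroring, rather than reproducing, the production configuration:

```python
# A sketch of creating an isolated testing warehouse that mirrors the production
# cluster count and scaling policy; the name and values below are placeholders.
CREATE_TEST_WAREHOUSE_SQL = """
CREATE WAREHOUSE IF NOT EXISTS TEST_OPS_REPORTING_SMALL
    WAREHOUSE_SIZE    = 'SMALL'
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 3          -- match the production multi-cluster setup
    SCALING_POLICY    = 'STANDARD' -- match the production scaling policy
    AUTO_SUSPEND      = 60         -- seconds of inactivity before suspending
    AUTO_RESUME       = TRUE
"""
```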

Managing concurrency

Queries don't run in isolation in the real world. If we tried to run our queries one by one for our test, those queries would have access to extra compute resources that they would not have in the real world. Therefore we must submit the queries in the same order that they ran originally WITH other queries that were running at the same time.

To do this, we split the hour's worth of queries into sixty "one-minute bins", a bin being a group of queries that were all submitted to our warehouse for processing within the same one-minute window; e.g. all queries submitted between 9:01 and 9:02 would form one bin.

We then submitted the queries in a given bin in the same order that they originally ran, based on query start time, and analysed the total time to completion for each one-minute bin to get a picture of how our queries perform on different warehouse sizes in Snowflake.
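A minimal sketch of the binning step is shown below, assuming the query slice has been loaded as a list of (start time, query text) tuples; this data structure is an assumption for illustration rather than the application's actual one:

```python
# A sketch of splitting the one-hour query slice into one-minute bins.
# `workload` is assumed to be a list of (start_time, query_text) tuples pulled
# from the query history and already restricted to the 9-10am window.
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple

def bin_by_minute(workload: List[Tuple[datetime, str]]) -> Dict[int, List[str]]:
    """Group queries into one-minute bins, preserving original start order."""
    bins: Dict[int, List[str]] = defaultdict(list)
    for start_time, query_text in sorted(workload, key=lambda item: item[0]):
        bins[start_time.minute].append(query_text)
    return bins
```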

We must also limit the number of concurrent requests to eight in order to match the concurrency limit of Snowflake's virtual warehouses. Once this limit of eight is reached, we must wait for one of the "in progress" queries to return a result before submitting the next query.

To achieve this in our application, we used Python's ThreadPoolExecutor and a semaphore [7], a semaphore being a counter which represents our concurrency limit. Every time we pick up a task for work we decrement the semaphore; when the counter reaches zero, no more tasks can be picked up. When we finish a task, we increment the semaphore and the application can start a new task.

Figure 8: Diagram showing how a semaphore was used in our data application used in testing for the best Snowflake virtual warehouse

In our application, the task in question is the submission of a query from a one-minute bin of real-world workload to our testing Snowflake warehouse.
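A simplified sketch of this mechanism, in the spirit of [7], is shown below. The connection handling, error handling, and timing of the real application are omitted, and the names are illustrative:

```python
# A simplified sketch of concurrency-limited query submission using a
# ThreadPoolExecutor and a semaphore, in the spirit of [7].
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

MAX_CONCURRENT_QUERIES = 8  # matches the warehouse's rough concurrency limit

def run_bin(connection, queries: List[str]) -> None:
    """Submit one minute-bin of queries, never exceeding the concurrency limit."""
    # A single connection is shared across threads here for brevity; opening one
    # connection per thread is the more conservative option.
    semaphore = threading.Semaphore(MAX_CONCURRENT_QUERIES)

    def submit(query_text: str) -> None:
        try:
            cursor = connection.cursor()
            cursor.execute(query_text)
            cursor.fetchall()        # wait for the full result, as the dashboard would
        finally:
            semaphore.release()      # free a slot so the next query can be submitted

    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_QUERIES) as executor:
        for query_text in queries:
            semaphore.acquire()      # block until one of the eight slots is free
            executor.submit(submit, query_text)
    # leaving the `with` block waits for all in-flight queries to finish
```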

Caveats

In the real world there will not be eight concurrent requests in our warehouse at all times. Sometimes there may be six, four, or even a significant gap of a few seconds before the next query is submitted. It is impossible to recreate these natural fluctuations in workload exactly; however, during the analysis phase I found that our US Ops Reporting warehouse was running at maximum compute utilisation most of the time.

Furthermore, we added a feature in our application where we grouped queries submitted at around the same time together in what we called "minute bins". For example, minute bin number two would be all the queries submitted to the warehouse we were investigating between 9:01am and 9:02am.

In the real world, query submissions would be distributed across the hour, whereas in our test we submitted a query as soon as our warehouse was under the concurrency limit, in order to simplify the software and the test. This meant we would sometimes finish our test workload artificially quickly compared to the time the original workload took, thereby deflating costs in our test, i.e. finishing a one-minute bin of work before the minute was over.

To combat this, the application would sleep for the remainder of that minute before starting the next minute bin of work, ensuring we did not go under one minute of work (due to Snowflake warehouse shutdown policies) for any one-minute bin of test workload.
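In the runner, this padding amounts to a few lines, sketched here as a continuation of the run_bin sketch above:

```python
# A sketch of padding each minute bin to a full minute, so that a bin which
# completes early does not artificially deflate the measured cost.
import time

def run_all_bins(connection, bins) -> None:
    for minute in sorted(bins):
        started = time.monotonic()
        run_bin(connection, bins[minute])   # from the sketch above
        elapsed = time.monotonic() - started
        if elapsed < 60:
            time.sleep(60 - elapsed)        # wait out the rest of the minute
```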

Another thing to consider is that we selected queries that were submitted within a one-hour window, which is not the same as queries that were started and completed within one hour. Therefore we expect even the medium-sized warehouse in our testing to take slightly longer than one hour to return all of our results.

One final caveat is that the underlying data comes from a real-time streaming application and hence was constantly changing during testing. This means that although we are submitting the same queries to different-sized warehouses, the data used by those queries will have slight variations.

Hence a workload that took one hour to complete on a medium warehouse may not take exactly as long if run again at a different time of day. This variation is very small, however, and will be ignored during analysis. If we were to do this experiment again, we would create a clone of the underlying data, which would remove these fluctuations.

Results

Standard Test

Figure 9: Graph showing the time taken to complete each minute bin of work on x-small, small, and medium Snowflake virtual warehouses.
Figure 10: Table showing the total time taken to complete the one-hour workload on x-small, small, and medium Snowflake warehouses. The Warehouse Performance column compares the Total Hours column against the medium Snowflake warehouse. The Percentage Savings column compares the Credits Used column against the medium Snowflake warehouse. The Savings/Performance Ratio column divides the Percentage Savings column by the Warehouse Performance column.

Based on this testing of our one-hour slice of workload, we observed a 36% saving in warehouse cost by going from a medium warehouse to a small one. This came at a performance cost of 9.34% (performance being measured as the time taken by the warehouse to complete the hour slice of work).

This is fantastic news; however, Snowflake had an ace up its sleeve: a feature, experimental at the time, called Query Acceleration Service.

Query Acceleration Service

When you submit a query to a Snowflake warehouse, there is an optimizer working behind the scenes which figures out the best way to execute your query. It then breaks down this work into steps. During this process it is able to figure out the most expensive steps in your query, and hence the specific steps which would benefit from more resources than what your current warehouse has available to it.

Query Acceleration Service enables you to offload those expensive steps onto a separate, shared cluster of resources [8], meaning the query can be processed more quickly as it is not bottlenecked by the warehouse it was submitted to.

For example, let's say you are running a query on a small virtual warehouse, but have configured Query Acceleration Service so that any steps that are too taxing on a small are given the horsepower of a medium warehouse.

Now let's say you submit a query and the optimizer breaks it into five steps. The first three steps work fine on a small virtual warehouse; however, the fourth and fifth steps require more horsepower. Query Acceleration Service will move the compute for the fourth and fifth steps onto the pool of resources it manages, process those steps, charge you accordingly (i.e. the cost of medium compute for those two steps plus a little extra), and finally give you back the result.

Figure 11: Diagram showing how Query Acceleration Service uses a pool of compute resources to accelerate compute for Grafana queries running on a small virtual warehouse [9].

This can be very beneficial since we can sometimes get outlier queries. Enabling this feature allows us to scale our compute without having to spin up a larger warehouse or a new cluster. Overall, this has the effect of keeping costs down, as most of the time we can be running a smaller warehouse.
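For reference, enabling the feature on a warehouse is a small configuration change along these lines; the warehouse name is a placeholder, and the scale factor of 2 is our assumption for what "medium-level" compute means for a small warehouse:

```python
# A sketch of enabling Query Acceleration Service on a warehouse [8].
# The warehouse name is a placeholder; the scale factor caps how much extra
# compute the service may lease, and the value 2 is an assumption standing in
# for "medium-level" horsepower on a small warehouse.
ENABLE_QAS_SQL = """
ALTER WAREHOUSE TEST_OPS_REPORTING_SMALL SET
    ENABLE_QUERY_ACCELERATION = TRUE
    QUERY_ACCELERATION_MAX_SCALE_FACTOR = 2
"""
```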

Let's see what effect this has on our testing workload for our x-small and small virtual warehouses when we give them the horsepower of a medium through Query Acceleration Service.

Query Acceleration Service Results

Figure 12: Graph showing the time taken to complete each minute bin of work on x-small, small, and medium Snowflake virtual warehouses with Query Acceleration Service enabled.

When comparing with Figure 9, we can see the red and blue lines are now much closer to the orange line throughout the test. This means that the x-small and small warehouses are very close in performance to the medium virtual warehouse when Query Acceleration Service is enabled and allowed to scale to a medium-warehouse level of compute whenever Snowflake judges it necessary.

Figure 13: Table showing the total time taken to complete the one-hour workload on x-small, small, and medium Snowflake virtual warehouses with Query Acceleration Service enabled to scale up to a medium virtual warehouse level of compute when needed. The Warehouse Performance column compares the Total Hours column against the medium warehouse. The Percentage Savings column compares the Credits Used column against the medium warehouse. The Savings/Performance Ratio column divides the Percentage Savings column by the Warehouse Performance column.

We now also observe that the warehouse performance difference between the baseline (medium warehouse) and the small is only 3.29%, compared to the 9.34% it was in our first test (see Figures 10 and 13). This also gives us better value for money when switching to a small, since the savings/performance ratio is now 12.44 compared to the previous 3.89.

As a result of this testing, on January 8th 2023 we switched from a triple-clustered medium warehouse to a triple-clustered small warehouse. We also enabled Query Acceleration Service, allowing us to scale to a medium-sized virtual warehouse level of compute when needed for outlier queries.

The graph below shows the difference in cost observed between the month before and the month after our change.

Figure 14: Graph showing the daily Snowflake Credit usage by the US Ops Reporting Snowflake virtual warehouse at HelloFresh over time. The warehouse size was changed to small on January the 8th 2023.

The result of us switching the US Ops Reporting virtual warehouse from a size medium to a small was a 30% reduction in credit usage when comparing one month before and after this change.

We must also track the impact of the switch in virtual warehouse size on query performance.

Figure 15: Graph showing the median query execution time on US Ops Reporting Snowflake virtual warehouse at HelloFresh one week before and after the warehouse size was changed to a small.

The graph above shows that the median execution time increased by 0.18s when comparing queries one week before and after the change in warehouse size from a medium to a small. Admittedly, this is a much bigger performance hit than our testing anticipated.

Perhaps our workload was not representative enough, since we used only a one-hour slice and are now seeing the results for a full week. There could also be new, more taxing analytical queries added to our Grafana dashboards at any time between when we did the testing and when we changed the configuration of our virtual warehouse.

However, when we consider that the purpose of this warehouse is to serve an analytical dashboard displayed on a TV, an extra 0.18s wait is not a bad return on investment. For reference, the average blink of an eye takes 0.33s.

Conclusion

By building a data application to send concurrent testing workloads to different warehouse sizes, we were able to analyse the best warehouse size for our use case. Furthermore, by enabling Query Acceleration Service we were able to drastically reduce the performance hit on the average query while scaling down the warehouse size.

This led to a 30% reduction in warehouse credit usage for the US Ops Reporting warehouse in 2023. The change in warehouse size resulted in only a 0.18s increase in median query execution time, which is an excellent return on investment.

Finally, we can repeat this test for all current and future Snowflake warehouses at HelloFresh, optimising every warehouse and well and truly answering our million-dollar question: "Which Snowflake warehouse size is best for my workload?"

References

[1] Drivers | Snowflake Documentation [Internet]. docs.snowflake.com. [cited 2023 Aug 21]. Available from: https://docs.snowflake.com/en/developer-guide/drivers

[2] Taxicab Vehicles New Logo — TLC [Internet]. www.nyc.gov. [cited 2023 Aug 21]. Available from: https://www.nyc.gov/site/tlc/businesses/taxicab-logo-vehicles.page

[3] Overview of Warehouses | Snowflake Documentation [Internet]. docs.snowflake.com. Available from: https://docs.snowflake.com/en/user-guide/warehouses-overview

[4] Understanding Compute Cost | Snowflake Documentation [Internet]. docs.snowflake.com. [cited 2023 Aug 21]. Available from: https://docs.snowflake.com/en/user-guide/cost-understanding-compute#label-virtual-warehouse-credit-usage

[5] Monitoring Activity | Snowflake Documentation [Internet]. docs.snowflake.com. [cited 2023 Aug 21]. Available from: https://docs.snowflake.com/en/user-guide/ui-snowsight-activity

[6] van Dortmont F. Virtual Warehouse Management. Snowflake Professional Services; 2022 Nov p. 25.

[7] Brownlee J. How to Limit the Number of Pending Tasks in the ThreadPoolExecutor [Internet]. superfastpython.com. 2021 [cited 2023 Aug 21]. Available from: https://superfastpython.com/threadpoolexecutor-limit-pending-tasks/

[8] Using the Query Acceleration Service | Snowflake Documentation [Internet]. docs.snowflake.com. [cited 2023 Aug 21]. Available from: https://docs.snowflake.com/en/user-guide/query-acceleration-service

[9] Ryan J. What is Snowflake Query Acceleration Service? [cited 2023 Aug 21]. Available from: https://www.analytics.today/blog/what-is-snowflake-query-acceleration-service