December 5, 2024

How to reduce your BigQuery costs?

Lucas Leroux
Manager Data Science and Analytics
Pierre-Henri Cumenge
Cofounder and CTO @ Theodo Data & AI
Coralie Betbeder-Déchelette
Data Lead @ Joko
Boris Dorado
Head of Data & AI @ Klee
Emmanuel Martin-Chave
VP Data @ BlaBlaCar
Christelle Marfaing
Chief Data Officer @ May
Nolwenn Belliard
Head of Analytics @ Sumup
Yoann Benoit
Head of Data & Product @ Hymaïa
Lucas de Vries
Head of Data Freelance
Claire Gouze
Head of Data Freelance
Christophe Blefari (aka Blef)
Staff Data Engineer - Freelance
Yanni Iyeze
Data Analyst Senior @ Exotec
Edouard Flouriot
Director of Data Analytics @ Sorare
4 practical examples within the Modern Data Network from BlaBlaCar, Nickel, Toucan Toco and Libeo.
partager l'article
No items found.

Reducing the bill for your data infrastructure is a priority for Data teams. The data warehouse is the heart of this infrastructure, and often the most important lever to reduce its costs.

BigQuery rates, like those of most data warehouses, consist of two main elements: usage (cost of executing requests) and storage (costs of data storage). For example, processing 1 Tb of data costs $[5, 6]; storing it costs $20 a month. However, storage is a fixed monthly cost, while execution grows according to use (duh logical?), making it the most important component of its bill.

The Modern Data Network gathered 4 inspirational testimonials to help you reduce your BigQuery costs. BlaBlaCar, Nickel, Toucan Toco and Libeo share how to precisely track these costs and optimize them.

BlaBlaCar: control the operational costs of data

Controlling operational costs within a Data department is not an easy task. This is particularly true for the costs associated with BigQuery, because our users are not always aware of the cost of their actions. For example, executing a query processing 1 TB of data in BigQuery costs $6.25. When you have dozens or more users of the platform, the problem can quickly become uncontrollable.

In July 2022, we noticed a significant increase in our costs related to BigQuery — around 10% month-over-month. We were exceeding our monthly Google Cloud Platform (GCP) budget by 25%, and BigQuery accounted for 80% of these costs. We identified the main cause: many of our pipelines and ETL flows were based on legacy or obsolete code, and did not adapt well to our evolving needs. For example, only some of our pipelines were incremental. We were therefore faced with poor management of our code base resulting in a lack of scalability.

To solve this problem, we sought to obtain better visibility on cost factors. We realized that the existing tools to monitor our costs lacked granularity, which made it difficult to identify possible cost reductions. As a result, we launched a project called Big Savings, whose goal is ultimately to reduce the volume of data we process. This will allow us to sustainably reduce costs in the long term.

In order to obtain this enhanced visibility, we focused on the labeling (keys/values in the metadata) of our ETLs and the queries used for our Tableau dashboards. We established conventions to define categories and subcategories in particular, the objective being to allocate costs into specific categories and monitoring our data consumption more effectively.

For ETLs, we updated our workflows in Airflow to automatically incorporate these labels. For Tableau, where labelling was more complex due to the lack of built-in functionality, we used custom SQL queries to include comments at the beginning of the queries, indicating the associated category and subcategory. We then retrieve this information in the metadata of the BigQuery logs. In addition, we created dedicated service accounts for each team, thus simplifying the understanding and monitoring of our costs.

Thanks to this complete labeling, we created a cost tracking dashboard, giving data consumers clear visibility of the volumes of data processed, and alerting them when we exceeded our budget. The new observability will allow us to optimize costly processes and react quickly and efficiently to any changes resulting in an increase in the volume of data processed.

For example, the first results revealed to us that some users made very expensive requests (> $30). One of our first actions was to inform and educate them about BigQuery’s best practices. We also identified some very expensive pipelines, which we made incremental to reduce their costs.

In addition to labelling, we undertook other cost-saving initiatives. We studied the use of BigQuery Reservations and Slots, but the analysis showed that we would not benefit from a cost reduction. Furthermore, this change involved a much more complex configuration than we currently have. We also considered changing BigQuery’s storage pricing model, which could have saved us 30% on storage costs. However, due to our existing contract with Google, this option was not available to us.

We reduced costs by cleaning our GCS buckets and changing the storage class of objects. We programmatically identified the largest folders in the GCS buckets and removed backups that were no longer needed. We thus saved substantial amounts related to storing unused data. By migrating files to cheaper storage classes depending on their frequency of access, we managed to reduce costs by more than 50%.

In conclusion, it is essential to have a clear vision and a thorough understanding of the costs associated with data processing in order to optimize them effectively. This requires a solid governance strategy and the use of appropriate tools. The final objective is to better control spending and to predict future budgets with more confidence.

Nickel: organize a FinOps tournament to reduce dbt costs by 30%.

At Nickel, we have more than 1200 dbt models created and maintained by 26 data analysts in 11 business teams. This organization has obvious advantages for data analysts but also certain risks, especially in terms of costs.

As the number of Pull Requests grows, the amount of non-optimized or obsolete models increases, and represents significant costs.

As an analytics engineer, Remi Benoist therefore organized a FinOps tournament for data analysts, in order to optimize their organization’s models.

Tournament rules of engagement

  • All model optimizations go through Pull Requests.
  • The optimization carried out in GB/month is calculated using the BQ and dbt logs.
  • The data analyst who pushes the Pull Request wins the optimized GB/month.
Model volume indicator in the CI/CD

Each data analyst can optimize any model, provided that the model owner validates the Pull Request.

The top 3 of the tournament win prizes: a meal offered for the 3rd and 2nd, a Google Nest for the 1st as well as their portrait in the form of a poster in our offices. The tournament lasts 2 months and the awards ceremony will take place on the occasion of a bi-annual event bringing together all data professionals at Nickel.

Process and results

During the tournament, Remi sent weekly emails to give optimization advice to data analysts: partitioning, clustering, incremental models, cron optimization, etc.

Following up with data analysts proved constructive. Giving them ideas, challenges, or even helping them to optimize models made it possible to deliver a lot of work. A dashboard identifying the cost of our models allowed us to prioritize our actions.

In the end, more than 46 models were optimized in 2 months, representing an optimization of 288 Tb/month, equivalent to 30% of dbt consumption. This tournament also made data analysts aware of FinOps practices and the different techniques to create optimized models.

Today, the costs of the models are constantly monitored via a dedicated dashboard and the CI/CD of our Pull Requests.

Toucan Toco: tracking the data warehouse costs.

At Toucan, we only focused for the moment on setting up a dashboard to follow the evolution of the costs of our BigQuery data warehouse.

To do this, we have gathered several data sources concerning:

  • Storage: list of all the tables and datasets that make up our data warehouse
  • Usage: information about requests made by users or service accounts
  • Billing: billing data by service and product

This allows us to track our usage and associated costs on a regular basis. We have also added several other reports to get details, including one that allows us to list the most expensive requests. We can then look at the content of each request and try to optimize them.

For example, we realized that some tools were scanning the entire data warehouse when configuring or modifying the login information. This kind of query is very expensive especially when you have a data warehouse that contains a large number of tables and columns. A simple optimization consisted in limiting access to a smaller number of datasets.

We also try to check the implementation best practices, including those shared by Google Cloud:

  • Avoid as much as possible in queries to select all the columns of each table via `SELECT *`.
  • Create partitions on tables and use these partitions to filter.
  • Use caching to its maximum and therefore avoid `WHERE date = NOW()`

We also set up an alert system when:

  • A service account suddenly requests more data from one day to the next.
  • A user exceeds his request quota for the day (value that is arbitrarily defined, but we cannot block the user on BigQuery)

Libeo: reduce costs by 40% with simple monitoring.

At the beginning of 2023, Libeo’s Data team set themselves for the first time as a goal of reducing the costs of their stack.

There are 5 people in the Data team: 1 manager, 2 Data Analyst, 1 Data Engineer and 1 Data Scientist. The stack is quite standard for a scale-up:

  • Ingestion : Fivetran
  • Data warehouse : BigQuery
  • Modelization : dbt (core) with Airflow as a scheduler
  • BI : Metabase (open source)

Unsurprisingly, the lever with the most impact — excluding ingestion as we had a contractual commitment — is to reduce the costs of our BigQuery requests, and in particular the costs related to our 700+ dbt models.

Data Stack Costs by Service Category and Month

The first initiative was to set up simple monitoring on Metabase to precisely track the cost of dbt models and know where to start to maximize our impact.

We built a dashboard to follow every week:

  • The average volume (gbyte) processed at each run for each of our dbt models
  • The total volume processed by model and its weight among all our dbt models
Most expensive dbt models by run week

With these 3 pieces of information, we classified our models from the most expensive to the cheapest. And not surprisingly here too, a minority of models accounted for a significant part of the costs. So, we started with them!

The second initiative was to refactor the modeling of these models in order to optimize or remove them if their use did not justify the cost. For example, we had a model aggregating several tracking events to avoid multiple joins for data analysts and users from our self-service tools. However, this model was mostly used to filter on a single event without clustering that would have limited the volume of data processed… This model was therefore deleted.

For the remaining models, the third initiative was to reduce the number of runs of overly greedy models to a maximum of one per day.

In our development process, each analyst can test and launch all our dbt models several times a day. Our CI/CD also launches a lot of dbt models as soon as a modification is made in the codebase. This system works well if the models are small and fast to run, but for the most substantial ones it is expensive (in addition to slowing down the pipeline). We have therefore chosen to launch them only once a day: they are now excluded by default from the runs of analysts’ local environments and CI/CD.

A new standard for the team

We had a lot of room for improvement, and these first initiatives reduced our BigQuery costs by 40%. Above all, they include this monitoring in our team standards. It is now watched weekly and the dbt models are optimized regularly.

Going further

Mastering your BigQuery bill is a complex subject. We hope that these techniques can help your team save money and avoid surprises at the end of the month. Who does not want to announce to his Finance team and other teams a reduction in his infra bill?

Feel free to share your approaches and feedback — especially if you use Snowflake ;)

PS: Mixpanel shared a great article on the same subject.

Partageons ensemble notre savoir

Rejoignez notre communauté d’experts en data et boostez votre réseau, vos compétences et votre impact dans le domaine
nous rejoindre

Nos prochains événements

TAG
This is some text inside of a div block.
This is some text inside of a div block.
TAG
This is some text inside of a div block.
This is some text inside of a div block.
TAG
This is some text inside of a div block.
This is some text inside of a div block.