Orchestrating the Training of Online ML Models with Airflow at X3M

Build ML Model Training Pipelines with Airflow: Strategies for Efficiency and Scalability

Mauricio Anderson

Apache Airflow

April 17, 2024 · ~6 min read

Introduction

At X3M, our firsthand understanding of the AdTech industry enables us to develop software and provide a comprehensive range of services to transform the advertisement ecosystem through fairness and transparency. We empower our clients to unlock their app's full monetization potential and navigate the dynamic world of digital advertising, ensuring they receive higher CPMs and better fill rates from their apps without any hidden fees.

Our main platform for achieving this goal is the "Mediator". This platform serves as an intermediary component between our customers' apps, which are monetized via advertisement, and the demand, composed of those companies searching for ad spaces to place their ads.

Put simply, every ad a user sees in an app occupies an ad space that had to be sold beforehand. The app delegates the task of finding a buyer to our mediator platform: it sends an ad request to the mediator, which analyzes the contextual information and finds a buyer for that space.

Business Value

OK, but how do we enhance this process to maximize the benefits our clients obtain from our technology? This is where our ML-based optimization techniques come into play.

The process of selling an ad space can be (over)simplified to the problem of maximizing the return of a bidding process with sequential fixed prices. In other words, each time the app has an opportunity to show an ad, we offer that space to different potential buyers one after another until one of them decides to buy it at a given price.
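The sequential process described above can be made concrete with a small sketch. It assumes, purely for illustration, that each buyer accepts a given price independently with a known probability; the article does not state how acceptance is modeled, and the function names and numbers here are hypothetical.

```python
from itertools import permutations


def expected_return(offers):
    """Expected revenue of offering an ad space to buyers in the given order.

    `offers` is a sequence of (price, accept_probability) pairs. Each buyer is
    assumed to accept independently; the process stops at the first acceptance.
    """
    total = 0.0
    p_still_unsold = 1.0  # probability that every earlier buyer declined
    for price, p_accept in offers:
        total += p_still_unsold * p_accept * price
        p_still_unsold *= 1.0 - p_accept
    return total


def best_order(offers):
    """Brute-force search for the ordering with the highest expected return."""
    return max(permutations(offers), key=expected_return)


# Hypothetical numbers: a 5.0 offer accepted 50% of the time and a 10.0 offer
# accepted 20% of the time.
offers = [(5.0, 0.5), (10.0, 0.2)]
ranked = best_order(offers)
```

With these made-up numbers, offering the higher price first yields a higher expected return than the reverse order. Brute force is only viable for a handful of candidates, which is one reason a time-sensitive system would need something faster.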

This process is time-sensitive, so we need to quickly search the space of possible "buyer-price" tuples for the combination that maximizes the expected return. The algorithms behind this search rely on a set of machine learning models that are trained incrementally every few hours. With this approach, known in the literature as Online Machine Learning, the models are constantly updated with the most recent information and capture changes in market behavior. Airflow is the core tool for orchestrating the pipeline that achieves this.

The core steps

At this point, it's time to build the pipeline to train these models. Each of them follows a three-step pattern:

Learn → Promote → Vacuum

In the Learn step, we take the last trained version of a model as well as a batch of recent data. Then, we make the model learn incrementally from the new data and store the new model version in our model repository.
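As a rough illustration of the Learn step, the sketch below updates a copy of the latest model version with a new batch and stores the result as a new version. The article does not say which model type or repository is used, so the tiny SGD-based linear regressor, the in-memory dictionary, and all names here are stand-ins.

```python
import copy


class OnlineLinearModel:
    """Tiny linear regressor updated by stochastic gradient descent."""

    def __init__(self, n_features, learning_rate=0.05):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def learn_batch(self, batch):
        """Update the parameters in place, one (features, target) pair at a time."""
        for x, y in batch:
            error = self.predict(x) - y
            self.bias -= self.learning_rate * error
            self.weights = [
                w - self.learning_rate * error * xi
                for w, xi in zip(self.weights, x)
            ]


def mean_squared_error(model, batch):
    return sum((model.predict(x) - y) ** 2 for x, y in batch) / len(batch)


def learn_step(repository, batch):
    """Train a copy of the latest version on `batch` and store it as a new version."""
    latest_version = max(repository)
    new_model = copy.deepcopy(repository[latest_version])
    new_model.learn_batch(batch)
    repository[latest_version + 1] = new_model
    return latest_version + 1


# In-memory stand-in for a model repository: version number -> model.
repository = {0: OnlineLinearModel(n_features=1)}
recent_batch = [([1.0], 2.0), ([2.0], 4.0), ([3.0], 6.0)]  # made-up data, y = 2x
new_version = learn_step(repository, recent_batch)
```

Training on a copy leaves the previous version untouched, so the next step can still compare the two.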

After this step is completed, we continue with the Promote task, in which we take the model trained in the previous step and compare its metrics against those of the version currently running in production. If the new model's metrics beat those of the production model, the latter is marked as archived and the newly trained model is promoted to production. If the new model does not achieve better results, it is not promoted, and the model running in production stays the same.
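The promotion rule can be sketched as a small function over a model registry. The registry layout, the stage names, and the choice of a single "loss" metric where lower is better are assumptions made for this example, not details from our system.

```python
def promote_step(registry, candidate, metric="loss"):
    """Promote `candidate` if it beats the production model on `metric`.

    `registry` maps a version id to {"stage": str, "metrics": dict}. Lower
    metric values are assumed to be better. Returns True if promoted.
    """
    production = next(
        (v for v, entry in registry.items() if entry["stage"] == "production"),
        None,
    )
    if production is not None:
        current = registry[production]["metrics"][metric]
        if registry[candidate]["metrics"][metric] >= current:
            return False  # the production model stays as it is
        registry[production]["stage"] = "archived"
    registry[candidate]["stage"] = "production"
    return True


# Hypothetical registry contents.
registry = {
    "v1": {"stage": "production", "metrics": {"loss": 0.30}},
    "v2": {"stage": "candidate", "metrics": {"loss": 0.25}},
    "v3": {"stage": "candidate", "metrics": {"loss": 0.40}},
}
promoted_v2 = promote_step(registry, "v2")  # better than v1
promoted_v3 = promote_step(registry, "v3")  # worse than v2
```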

Finally, we have a step called Vacuum that simply runs a cleaning process to remove temporary files and other assets.

Having arrived at this point, we have seen a basic scheme for a pipeline to run a model's training process. Now, let's dive deeper into a few optimizations we have implemented and the mechanisms we employ to make the pipeline suitable for our use case.

Optimizations

The following optimizations stem from the need to keep our system profitable. To achieve this, we have to address two aspects at once: keeping training costs low and model performance high, on the assumption that the better the models are, the higher the reward we get from the selling process.

Catching up

The first improvement consists of introducing a branch operator after the learn task to reduce the time it takes a newly added model to catch up, a situation that arises quite often. As mentioned, the incremental approach lets us keep models updated every few hours once they have caught up. While a new model is still catching up, however, running the promote task is "wasted" time, because the model won't be used in production until it is up to date, i.e., until it has gone through all the DAG runs up to the current time.

Given that the promote task takes up half of the total DAG run time, using a branch operator to skip it in catch-up runs makes those runs about twice as fast. The date from which the promote step starts running is set in the DAG configuration and read by the branch operator, as explained below.
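The branching decision itself can be as small as a date comparison. The sketch below is a hypothetical version of it: the function name and the task ids "promote" and "vacuum" are assumptions, and the dates are only examples.

```python
from datetime import date, datetime


def choose_next_task(logical_date, start_promote):
    """Return the id of the task to run after `learn` for this DAG run."""
    if logical_date.date() >= start_promote:
        return "promote"
    return "vacuum"  # catch-up run: skip the comparison against production


start_promote = date(2023, 7, 24)
catch_up_choice = choose_next_task(datetime(2023, 7, 10, 6), start_promote)
regular_choice = choose_next_task(datetime(2023, 7, 24, 0), start_promote)
```

In Airflow, a function like this could serve as the `python_callable` of a `BranchPythonOperator`, which follows the task id it returns. Since promote is then skipped in catch-up runs, the vacuum task would likely need a trigger rule such as `none_failed_min_one_success` so that it still runs.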

Fig 1. Model training pipeline graph

Managing external resources

The second optimization consists of using Airflow pools to keep the use of external resources under control. The learn and promote tasks require data to train or evaluate models. This data comes from our data lake, which is hosted in Databricks. The platform provides SQL Warehouses, which allow us to query our data using plain SQL. To run queries, a SQL Warehouse uses a cluster composed of multiple nodes: one driver and one or more workers. The more concurrent queries we run, the more workers we need (and the higher the cost we incur). In numerical terms, each worker can handle up to 10 concurrent queries. With that in mind, we've set up a dedicated pool for the tasks that run queries, which caps the number of concurrent queries and keeps the cluster at a size that still lets everything run smoothly.
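The sizing arithmetic behind the pool can be sketched as follows. Only the figure of 10 concurrent queries per worker comes from the text above; the helper names and the example values (35 queries, a cap of 2 workers) are hypothetical.

```python
import math

QUERIES_PER_WORKER = 10  # figure quoted above


def workers_needed(concurrent_queries):
    """Workers required to serve the given number of concurrent queries."""
    return math.ceil(concurrent_queries / QUERIES_PER_WORKER)


def pool_slots(max_workers):
    """Pool size that keeps the cluster at or below `max_workers` workers."""
    return max_workers * QUERIES_PER_WORKER


unbounded = workers_needed(35)                        # no pool: 35 queries at once
capped_slots = pool_slots(2)                          # pool sized for 2 workers
capped = workers_needed(min(35, capped_slots))        # pool limits concurrency
```

In Airflow, the pool would be defined with the chosen number of slots, and each query task would opt in through the `pool` argument of its operator. Tasks beyond the slot limit wait in the queue instead of forcing the cluster to scale up.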

Additionally, each time a run is launched, the cluster starts if it is not already up. Starting, and eventually stopping, a cluster takes time during which the system is not doing anything useful for us. Coordinating the runs so that they are triggered at the same time therefore reduces the amount of wasted time.
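To see why aligned runs help, the sketch below counts how many times a cluster would have to cold-start for a given set of runs. It assumes a simple model in which the cluster stops after a fixed idle period; the timeout and the run times are made up for illustration.

```python
def count_cluster_starts(runs, idle_timeout):
    """Count cold starts for `runs`, a list of (start, end) times in minutes.

    The cluster is assumed to stop once it has been idle for `idle_timeout`
    minutes and to start again when the next run arrives.
    """
    starts = 0
    busy_until = None
    for start, end in sorted(runs):
        if busy_until is None or start > busy_until + idle_timeout:
            starts += 1  # cluster was down, so this run triggers a start
        busy_until = end if busy_until is None else max(busy_until, end)
    return starts


aligned = [(0, 30), (0, 30), (0, 30)]        # three DAG runs triggered together
staggered = [(0, 30), (60, 90), (120, 150)]  # the same runs spread over time
aligned_starts = count_cluster_starts(aligned, idle_timeout=10)
staggered_starts = count_cluster_starts(staggered, idle_timeout=10)
```

Under these assumptions the aligned runs share a single start, while each staggered run pays the startup cost again.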

In the next image, it's possible to observe how these two points work together to maximize the efficiency we get from the process.

Fig 2. Number of queries sent to the data lake and periods with the cluster running.

Splitting the models

So far, we have discussed the basic pipeline for training a model and a set of optimizations to keep costs low. However, there is an additional challenge: we must avoid training models on mixed data from different clients. Moreover, our models perform best when specialized on further subsegments of an individual client's data, for example with a separate model per operating system (Android/iOS), among other segmentation variables. Thus, we need multiple models, each trained on a specific subset of data.

To achieve this in a structured and efficient way, we dynamically create DAGs from a configuration source. In its current form, this is a JSON file in which models are added following an expected structure, as shown in the example below. Each entry in the file holds the information required to build the pipeline that trains the specified model. The main elements are "dag_params" for pipeline configs, "model" for model-related configs, and "segment" to define the subset of data used by that particular model.

Using this strategy, we managed to offer a mechanism to define models without requiring a complete understanding of the code behind the training process.

[
  {
    "dag_params": {
      "start_date": "2023-07-10",
      "start_promote": "2023-07-24",
      "schedule_interval": 6,
      …
    },
    "model": {
      "model_type": "…",
      …
    },
    "segment": {
      "platform": "android",
      …
    },
    "id": "…"
  }
]
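A minimal sketch of how such a file could drive DAG creation is shown below. It is not our actual implementation: the placeholder entry, the `train_` naming scheme, the helper names, and the reading of "schedule_interval" as a number of hours are all assumptions, and the `schedule` keyword assumes Airflow 2.4 or later.

```python
import json
from datetime import datetime, timedelta

# Placeholder entry following the structure above; every value is illustrative.
CONFIG = json.loads("""
[
  {
    "dag_params": {
      "start_date": "2023-07-10",
      "start_promote": "2023-07-24",
      "schedule_interval": 6
    },
    "model": {"model_type": "placeholder"},
    "segment": {"platform": "android"},
    "id": "placeholder_android"
  }
]
""")


def dag_arguments(entry):
    """Translate one config entry into keyword arguments for a DAG."""
    params = entry["dag_params"]
    return {
        "dag_id": f"train_{entry['id']}",  # hypothetical naming scheme
        "start_date": datetime.strptime(params["start_date"], "%Y-%m-%d"),
        # Assumption: schedule_interval is a number of hours.
        "schedule": timedelta(hours=params["schedule_interval"]),
        "catchup": True,  # run every interval from start_date up to now
    }


def register_dags(entries, build_dag, namespace):
    """Create one DAG per entry and expose it under its dag_id."""
    for entry in entries:
        arguments = dag_arguments(entry)
        namespace[arguments["dag_id"]] = build_dag(entry, arguments)
    return namespace
```

In a DAG file, `build_dag` would construct an Airflow `DAG` from these arguments and attach the learn, promote, and vacuum tasks, and `namespace` would be `globals()`, since Airflow picks up DAG objects defined at module level. Keeping the translation in a plain function also makes it easy to test without a running Airflow instance.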

Conclusion

In this post, we've seen how we designed the pipeline to train our ML models, which are the key components of our ad sales optimization techniques, all while keeping costs low and performance levels high.

All this began with a basic pipeline that we then tuned, keeping in mind that profitability is a must. First, we focused on reducing costs by using Airflow pools and by strategically adding a branch operator that accelerates the catch-up training of newly added models. Then, we split model creation based on segmentation criteria. Finally, we used dynamically generated DAGs so that new models can be added without knowledge of the machinery behind the scenes.

We hope you enjoyed reading about our project and how we squeezed the most out of Airflow to build a profitable and scalable platform.
