Hello everyone! In this guide, I'll walk through Databricks Asset Bundles for Machine Learning and how they help you build an end-to-end machine learning lifecycle with continuous integration and continuous delivery (CI/CD).
Databricks Asset Bundles (DABs) are a new tool for packaging and managing complex data, analytics, and machine learning projects on the Databricks platform. A bundle describes a project's resources in straightforward YAML, which makes CI/CD practical during development: testing, deployment, and configuration management can be automated, reducing errors and spreading software engineering best practices across the organization through templated projects.
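To give a feel for that YAML, here is a minimal sketch of a bundle's root file; the bundle name and workspace hosts are placeholders, not values from this project.

```yaml
# databricks.yml - minimal sketch (name and hosts are placeholders)
bundle:
  name: my_sample_bundle

include:
  - resources/*.yml   # job and ML resource definitions live in separate files

targets:
  dev:
    default: true
    mode: development
    workspace:
      host: https://adb-1111111111111111.1.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-2222222222222222.2.azuredatabricks.net
```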
Who can benefit from it?
- Data Scientists.
- Machine Learning Engineers.
- ML Solution Architects.
Prerequisites
- Databricks CLI version 0.205.2 or higher is required.
- Databricks Workspace Access.
- A service principal to access Databricks resources.
- Azure DevOps access.
Throughout this guide I will be using Microsoft Azure, Azure Databricks, and Azure DevOps. If you are on another cloud provider, please refer to the DAB documentation for the equivalent steps.
A note to readers: at the time of writing, DABs and their features are still in public preview. Please consult your solution architect before productionizing ML operations using MLOps Stacks (DAB).
A bundle includes the following parts:
- Required cloud infrastructure and workspace configurations.
- Source files, such as notebooks and Python files, that include the business logic.
Let's dive right in, folks! The first step is to authenticate the Databricks CLI against your Databricks workspace.
Architecture:
The diagram below outlines the architecture I aim to implement.
Setting up MLOps Stacks:
databricks auth login --host <workspace-url>
You will be redirected to the browser for authentication. After authenticating successfully, initialize the Databricks Asset Bundle from the MLOps Stacks template using the init command.
databricks bundle init mlops-stacks
Since we are setting this up end to end, select "CICD_and_Project" when the prompt appears.
Set your project name, then select the cloud provider and CI/CD platform; the bundle is initialized with resources specific to the provider you choose.
Continue through the prompts, providing the staging workspace URL, the production workspace URL, and the branch details.
As you can see, I'm using the Model Registry (rather than Unity Catalog) to register trained models, and the Feature Store for feature lookups.
Within minutes, your base template is ready: the DAB handles everything in the background and creates the project in your current working directory.
Now, let's examine the code structure (I've removed code that isn't relevant to this architecture, such as components that are still in public preview).
Code Structure:
Realtime Notebooks:
I've included notebooks for model serving and real-time inference.
model_serving ← Model serving code that exposes the registered model for real-time inference.
realtime_inference ← Inference code that runs predictions on real-time data.
The ML code inside the project centers around a taxi data use case: predicting the fare amount. I've wired the notebooks above into Databricks Workflows, which required changes to the corresponding YAML files; you'll find all the workflow-related YAML files under the "resources" folder of the project, "my_mlops_stacks".
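As a rough sketch of what such a workflow definition under resources can look like (the job name, cluster settings, and notebook paths below are illustrative, not the template's exact contents):

```yaml
# resources/model-serving-workflow-resource.yml (illustrative file name)
resources:
  jobs:
    model_serving_job:
      name: ${bundle.target}-my_mlops_stacks-model-serving-job
      job_clusters:
        - job_cluster_key: serving_cluster
          new_cluster:
            spark_version: 13.3.x-cpu-ml-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 1
      tasks:
        - task_key: model_serving
          job_cluster_key: serving_cluster
          notebook_task:
            notebook_path: ../deployment/model_serving.py
        - task_key: realtime_inference
          depends_on:
            - task_key: model_serving
          job_cluster_key: serving_cluster
          notebook_task:
            notebook_path: ../deployment/realtime_inference.py
```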
feature_engineering ← Feature computation code (Python modules) that implements the feature transforms. The output of these transforms is persisted as the Feature Store tables pickup_features and dropoff_features.
Note: I am using the offline feature store (the Workspace Feature Store); I will cover its limitations in a bit.
training ← Contains the notebook that trains and registers the model with Feature Store support.
A model registered with the offline feature store comes with its own advantages and limitations. In particular, it cannot be served directly in real time; the expectation is that an online feature store such as Cosmos DB or DynamoDB backs the feature lookups to achieve low latency in real-time scenarios.
To work around this, I log and register the trained model using the default method (pyfunc) instead of logging it through the Feature Store.
validation ← Optional model validation step before deploying a model.
batch_inference ← Code that runs as part of the scheduled workflow. By default, the template scores with score_batch, which assumes the model was logged through the Feature Store. Since we're taking a different approach, we can instead generate a batch scoring notebook from the Databricks UI: click "Use model for inference" -> "Batch processing", fill in the required details, and click "Use model for batch inference"; a batch inference notebook is generated automatically.
databricks.yml ← The root bundle file for the ML project, loaded by the Databricks CLI. It defines the bundle name, the workspace URLs, and the resource configuration files to include.
I've also added configuration so the workflows use an existing cluster ID that depends on the target environment.
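One way to wire that up is with a bundle variable that each target overrides; the cluster IDs below are placeholders.

```yaml
# databricks.yml (excerpt; cluster IDs are placeholders)
variables:
  existing_cluster_id:
    description: Interactive cluster the workflows should run on

targets:
  dev:
    variables:
      existing_cluster_id: 0101-000000-devclstr
  staging:
    variables:
      existing_cluster_id: 0202-000000-stgclstr

# inside a workflow resource, a task then references it:
#   tasks:
#     - task_key: model_training
#       existing_cluster_id: ${var.existing_cluster_id}
```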
Connecting to Repos:
To set up the pipelines, I commit all these files to Azure Repos on my branch named "main_mlops_stack" and configure the CI and CD pipelines there. The bundle includes an Azure DevOps pipeline configuration file (YAML) named "deploy-cicd.yml"; unfortunately, in my case it didn't work as expected, so I set up the pipelines manually using the other two YAML files.
If you're familiar with setting up pipelines, this should be the simplest task of all. Just click on "New Pipeline" and select the files from the respective repositories.
In addition, I've modified the "my_mlops_stacks-bundle-cicd.yml" file so that the bundle is also deployed to my development workspace whenever changes are merged into the "dev_mlops_stack" branch.
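In rough terms, the change looks like the sketch below: add the branch to the trigger and a step that deploys the dev target. This is a hedged sketch, not the generated file, and the variable names under env are illustrative; use whatever your variable group actually defines. It also assumes the Databricks CLI is installed on the build agent.

```yaml
# my_mlops_stacks-bundle-cicd.yml (excerpt; illustrative sketch)
trigger:
  branches:
    include:
      - dev_mlops_stack
      - main_mlops_stack
      - release_mlops_stack

steps:
  - script: |
      databricks bundle validate -t dev
      databricks bundle deploy -t dev
    displayName: Validate and deploy bundle to the dev workspace
    workingDirectory: $(Build.Repository.LocalPath)/my_mlops_stacks
    env:
      ARM_TENANT_ID: $(DEV_AZURE_SP_TENANT_ID)        # illustrative variable names
      ARM_CLIENT_ID: $(DEV_AZURE_SP_APPLICATION_ID)
      ARM_CLIENT_SECRET: $(DEV_AZURE_SP_CLIENT_SECRET)
```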
devops-pipelines:
- my_mlops_stacks-tests-ci.yml — Runs unit tests and integration tests against the staging environment.
- my_mlops_stacks-bundle-cicd.yml — Validates and deploys the bundle to the target workspace.
The pipeline setup is now complete. Next, configure the variables in a Variable Group (the Library module in Azure DevOps) as follows, providing the keys and secret values.
The snippet above covers only the dev workspace; if you are setting up other environments, add the corresponding variables there as well.
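For reference, the pipeline YAML consumes that variable group roughly like this; the group name below is illustrative and should match whatever you created under Library.

```yaml
# excerpt from a pipeline YAML (group name is illustrative)
variables:
  - group: my_mlops_stacks-dev-secrets

# secret values are not exposed to scripts automatically; they are mapped
# explicitly as env entries on the deploy steps, e.g.
#   ARM_CLIENT_SECRET: $(DEV_AZURE_SP_CLIENT_SECRET)
```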
Branch policies for CI:
I've established branch policies for the "main_mlops_stack" branch and connected the above pipelines to the build validation.
Whenever a pull request is submitted against the "main_mlops_stack" branch, the CI Test and CD validate jobs (excluding deploy jobs) will execute automatically.
Note: The Deploy jobs in the MLOps_Stacks — CD Pipeline will run only after the code is merged into the target branch.
The snippet above depicts the deploy pipeline jobs that executed after the code was merged into the "main_mlops_stack" branch. Similarly, the code can be merged into the "release_mlops_stack" branch to deploy it to the production workspace.
Workflows:
The workflows deployed as part of the bundle are sourced from the YAML files inside the resources folder.
The workflows can be scheduled, triggered through the REST API, or run on demand with the databricks bundle run command.
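For example, a schedule is just a few lines in the workflow's resource YAML; the job key and cron expression below are illustrative.

```yaml
# excerpt from a workflow resource file under resources/ (illustrative)
resources:
  jobs:
    batch_inference_job:
      schedule:
        quartz_cron_expression: "0 0 11 * * ?"   # every day at 11:00
        timezone_id: UTC
```

The same job can be run on demand with databricks bundle run -t dev batch_inference_job, again assuming that job key.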
As you follow along with these steps, you'll build a minimal yet powerful MLOps setup: modular, CI/CD-enabled, and customizable to your requirements. Happy coding!
Feel free to applaud and comment.
Thanks to Databricks for coming up with such an amazing tool!