
Track DVC Pipeline Runs with MLflow

Written by Yannick Wolff

DVC + MLflow = ❤️

At Sicara, we love exploring and integrating ML tools to make our workflows as convenient as possible. One of our favorite tools for ML projects is DVC, and we have already written two articles about how to combine it with other tools: Streamlit and Makefile.

Since good things always come in threes, I’m going to show you how DVC can be combined with another great tool called MLflow, in order to have a super easy way to run ML experiments on your project and explore the ones launched by all team members.

This article is meant to be useful whether you know DVC and MLflow or not. If you already know these tools, don’t hesitate to skim over the first paragraphs.


Let’s track everything with MLflow

MLflow is an open-source tool built by Databricks to manage the lifecycle of an ML project. It provides several components, including:

  • a tracking API, available as a Python package, which lets you record useful data and metadata each time you run an experiment: input data, parameters, models, results, etc.
  • a standardized way of defining, storing and serving models
  • a model registry to store, version and manage these models
  • a dashboard to visualise tracked data and models

MLflow Models and the Model Registry are very useful on ML projects. However, in this article I will focus only on the tracking API and the dashboard, which already bring a lot of value on their own, especially when combined with DVC.

The idea behind experiment tracking is simple: log every task you’re running and every interesting piece of information about it. Let’s say you’re writing a script generating some data and computing the accuracy of a model on this data. Then, you’re going to log:

  • the model parameters, with mlflow.log_params(model_parameters)
  • the generated data, with mlflow.log_artifact("data.csv")
  • the resulting metric, with mlflow.log_metric("accuracy", accuracy)
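
To make this concrete, here is a minimal sketch of such a script. The toy data, the threshold "model" and the run_experiment name are my own assumptions for illustration; only the three mlflow.log_* calls come from the list above. MLflow is imported inside the function so that the module can still be loaded where the package is not installed.

```python
import csv
import random
import tempfile
from pathlib import Path


def run_experiment(threshold: float = 0.5, n_samples: int = 200, seed: int = 0) -> float:
    """Generate toy data, score a threshold classifier on it and log everything."""
    import mlflow  # third-party package: pip install mlflow

    # Toy data (assumption): a random score and a noisy binary label derived from it.
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        score = rng.random()
        label = int(score + rng.gauss(0.0, 0.2) > 0.5)
        rows.append((score, label))

    # Toy "model" (assumption): predict 1 when the score exceeds the threshold.
    correct = sum(int(score > threshold) == label for score, label in rows)
    accuracy = correct / n_samples

    model_parameters = {"threshold": threshold, "n_samples": n_samples, "seed": seed}

    with mlflow.start_run(), tempfile.TemporaryDirectory() as tmp_dir:
        mlflow.log_params(model_parameters)

        data_path = Path(tmp_dir) / "data.csv"
        with data_path.open("w", newline="") as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow(["score", "label"])
            writer.writerows(rows)
        # The file is copied into the run's artifacts, so it can live in a temp dir.
        mlflow.log_artifact(str(data_path))

        mlflow.log_metric("accuracy", accuracy)

    return accuracy
```

Calling run_experiment several times with different thresholds should give one run per call, each with its own parameters, data file and accuracy.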

But what’s the point of logging all these things?

Your run and all its metadata will appear on the MLflow dashboard, among all the other runs of the team:

This dashboard can be launched with the mlflow server command

This way, next time someone wonders which value was used for that threshold in order to get this awesome accuracy, you will be able to recover it.

The MLflow dashboard also makes it possible to compare several runs:

This feature lets you compare two or more runs side by side

MLflow is very useful to get a quick overview of the latest experiments run on a project. But as a project develops, the number of different tasks run by data scientists increases, as does the complexity of these tasks. It then becomes hard to see on the dashboard how all the runs are related to each other. That’s where DVC comes in!

From runs to pipelines with DVC

DVC, for Data Version Control, is to data as Git is to code: its main functionality is to version the data of your project. For this purpose, it stores your datasets, models, or any heavy files in a remote storage and lets you track in Git only small metadata files pointing to them. DVC also provides a Git-like command-line interface: dvc status, dvc add, dvc push, dvc pull, dvc checkout, etc.

But DVC does not stop there and offers plenty of awesome features, including:

  • the ability to write, run and reproduce Data Pipelines
  • a cache system that skips the execution of a task if it has already been run by you or any team member
  • parameters / metrics tracking and visualisation

You may have a feeling of déjà vu... Like MLflow, DVC lets you track parameters and metrics while running experiments, thanks to a set of commands called DVC experiments. Does that mean that DVC and MLflow are competing tools? Isn’t it redundant to use them together? That’s what we’re going to see. But before that, I want to focus a bit more on another feature I just mentioned: Data Pipelines.

DVC’s documentation describes them as “series of data processing stages [where] connections between stages are formed by the output of one turning into the dependency of another”. This might seem somewhat abstract, so let’s take the example of a classic model training pipeline:

This kind of visualisation can be generated with the dvc dag command

You can see in this example a preprocessing stage which runs 3 times for train, val and test datasets, a training stage, and finally several stages to evaluate the model.

Such a pipeline can be created by writing a dvc.yaml file that follows the dedicated format. The idea is to declare, for each stage, its code and data dependencies as well as its outputs; DVC then works out on its own how the stages depend on each other. Once created, a pipeline can be launched with the dvc repro command.

In the case of the example above, the dvc.yaml will look like this:

You can see that the first preprocessing stage is a bit special, as it’s built with a foreach element, introduced with DVC 2.0.

You may also notice that I’ve carefully written all the dependencies for each stage: the code of the related script and of any imported Python module, the input data and the models. It’s important to do so if you want DVC’s caching to work properly. The way the pipeline is written in my example implies that:

  • if I modify the preprocessing (in scripts/preprocess_data.py), and run dvc repro, the whole pipeline will be re-run
  • if I modify the model (in src/model.py), the pipeline will be launched starting from the training
  • if I modify the test set, only its preprocessing and the model evaluation will be launched
  • and so on...
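
To illustrate how declared dependencies and outputs drive this behavior, here is a small pure-Python sketch. It is not DVC’s implementation: the STAGES dictionary only loosely mirrors the deps/outs layout of a dvc.yaml file, every path except scripts/preprocess_data.py and src/model.py is hypothetical, and the several evaluation stages are collapsed into a single evaluate stage for brevity. It also makes the pessimistic assumption that a re-run stage always changes its outputs, whereas DVC compares content hashes.

```python
from typing import Dict, List, Set

# Hypothetical stage definitions. The comprehension plays the role of the
# foreach element: one preprocessing stage per dataset split.
STAGES: Dict[str, Dict[str, List[str]]] = {
    **{
        f"preprocess@{split}": {
            "deps": ["scripts/preprocess_data.py", f"data/raw/{split}.csv"],
            "outs": [f"data/preprocessed/{split}.csv"],
        }
        for split in ("train", "val", "test")
    },
    "train": {
        "deps": [
            "scripts/train.py",
            "src/model.py",
            "data/preprocessed/train.csv",
            "data/preprocessed/val.csv",
        ],
        "outs": ["models/model.pkl"],
    },
    "evaluate": {
        "deps": [
            "scripts/evaluate.py",
            "models/model.pkl",
            "data/preprocessed/test.csv",
        ],
        "outs": ["metrics/test_metrics.json"],
    },
}


def stages_to_rerun(changed_path: str) -> Set[str]:
    """Return the stages invalidated, directly or transitively, by a changed file."""
    stale: Set[str] = set()
    changed = {changed_path}
    progress = True
    while progress:
        progress = False
        for name, stage in STAGES.items():
            if name not in stale and changed.intersection(stage["deps"]):
                stale.add(name)
                # Assume the outputs of a re-run stage change as well.
                changed.update(stage["outs"])
                progress = True
    return stale


for path in ("scripts/preprocess_data.py", "src/model.py", "data/raw/test.csv"):
    print(path, "->", sorted(stages_to_rerun(path)))
```

Under these assumptions, the three cases listed above fall out of the declarations alone: a change to the preprocessing script invalidates every stage, a change to src/model.py invalidates training and evaluation, and a change to the raw test set invalidates only its preprocessing and the evaluation.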

In addition to this very handy caching behavior, DVC pipelines add a lot of structure to your project. They help you understand how all the tasks relate to each other in a complex ML project. That is exactly what is missing when working with MLflow, which handles all tasks independently (or almost independently, as we’re going to see later with the nested runs feature).

DVC + MLflow?

If DVC is so convenient to version data and manipulate pipelines, and if it also provides experiment tracking features, why do we need MLflow?

The answer is simple: MLflow’s experiment tracking component is far more developed than the one natively offered by DVC. In particular, the DVC experiments commands do not come with any dashboard to visualise experiments; they have to be used directly in the terminal.

I specified “natively offered by DVC” because the team behind DVC recently launched a new product called DVC Studio, an MLflow-like dashboard designed to work with DVC experiments. However, DVC Studio is a young tool (created in 2021, versus 2018 for MLflow), so it has some limitations compared to MLflow:

  • Some features are missing. In particular, there is no way to create nested runs, an MLflow feature that I’m going to describe below.
  • You can only use DVC Studio if your code repository is hosted on GitHub, GitLab or Bitbucket. This might not be a problem for you, but I often work for customers who are used to hosting their repositories on AWS CodeCommit or Azure DevOps.
  • You need to pay for the DVC Studio Teams plan if you want 5+ collaborators or on-premises deployment.

That is why I ended up using DVC and MLflow on my projects:

  • DVC for data versioning and pipelines
  • MLflow for experiment tracking, model registry and serving

Now, I’m going to explain how I organize a project so that the two tools coexist smoothly and I get the benefits of both.

DVC + MLflow = ❤️

I decided to follow two simple conventions:

  1. Launching a DVC pipeline should correspond to an MLflow run
  2. Each stage in the DVC pipeline should correspond to an MLflow child run, nested in the pipeline run

MLflow lets you start a run nested inside another one, so that runs appear as a tree structure on the dashboard:

You can expand each parent run to show all its children

This organisation makes it easy to distinguish the different iterations on the pipeline, each one appearing as a separate run. It also makes it possible to focus on one specific stage of an iteration, showing only the parameters and metrics related to that stage. This is particularly useful on complex projects, when your pipeline contains a dozen stages and only one of them matters for the current experiment.

How to implement that in practice?

Firstly, we want to start a new MLflow run each time we launch the DVC pipeline.

One way to do that is to create a utility script, let’s call it start_pipeline.py, which starts this parent run:

Several points need to be clarified in the snippet above:

  • We use the Typer library, a very handy CLI builder, to easily get a command-line parameter for the run name.
  • Before starting the run, we need to set the current experiment. In MLflow terms, an experiment is a kind of subproject we’re working in. Each experiment has its own page in the dashboard, showing only its runs.
  • Then, the run is started with the given run name.
  • Once the run is started, its run_id is printed: as I will explain below, this is necessary to be able to start runs as children of this parent run.
  • Finally, in order to keep track of what has been launched (remember the MLflow philosophy: track every valuable piece of information!), we log the dvc.yaml file as an artifact of the run. It will be accessible from MLflow’s dashboard.
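
The original snippet is not reproduced in this text, so here is a minimal sketch of what start_pipeline.py could look like, based only on the points above. The experiment name is a hypothetical placeholder, and both third-party imports are done lazily inside the functions purely as a convenience of this sketch.

```python
from pathlib import Path

EXPERIMENT_NAME = "my_experiment"  # hypothetical experiment name
PIPELINE_FILE = Path("dvc.yaml")


def start_pipeline(run_name: str) -> None:
    """Start the parent MLflow run for one launch of the DVC pipeline."""
    import mlflow  # third-party package: pip install mlflow

    # Select the experiment (the "subproject") in which the run is recorded.
    mlflow.set_experiment(EXPERIMENT_NAME)

    with mlflow.start_run(run_name=run_name) as run:
        # Print the run_id on stdout so that the caller can capture it.
        print(run.info.run_id)
        # Keep a copy of the pipeline definition with the run.
        mlflow.log_artifact(str(PIPELINE_FILE))


def main() -> None:
    """Command-line entry point: Typer turns run_name into a CLI argument."""
    import typer  # third-party package: pip install typer

    typer.run(start_pipeline)
```

In a real script, main() would be called under the usual if __name__ == "__main__": guard, so that python start_pipeline.py name_of_your_run starts the run and prints its id.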

In order to call start_pipeline.py each time we launch the pipeline, let’s create a Makefile command which performs both. A basic shell script would also work, but I like gathering all the project’s useful commands in a Makefile:

This command can be launched with: make run_pipeline RUN_NAME=name_of_your_run

The run_id printed by start_pipeline.py is captured and saved to an environment variable named MLFLOW_RUN_ID, which is exported so that the Python subprocesses launched at each stage of the pipeline can access it. As explained in the documentation, the effect of this environment variable is that subsequent calls to mlflow.start_run will not start a new run but reactivate the run with the given run_id.
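
The Makefile itself is not reproduced in this text. As an illustration of the same capture-and-export logic, here is a Python equivalent rather than the Makefile command; the run_pipeline function and its two parameters are hypothetical names of mine.

```python
import os
import subprocess
import sys
from typing import List


def run_pipeline(start_command: List[str], pipeline_command: List[str]) -> str:
    """Run the start script, then the pipeline with MLFLOW_RUN_ID exported."""
    # Step 1: run the start script and capture the run_id it prints on stdout.
    started = subprocess.run(
        start_command, capture_output=True, text=True, check=True
    )
    run_id = started.stdout.strip()

    # Step 2: run the pipeline with MLFLOW_RUN_ID added to its environment.
    # Child processes inherit the environment, so every stage sees the variable.
    env = {**os.environ, "MLFLOW_RUN_ID": run_id}
    subprocess.run(pipeline_command, env=env, check=True)
    return run_id


if __name__ == "__main__" and len(sys.argv) == 2:
    # Example: python run_pipeline.py name_of_your_run
    run_pipeline([sys.executable, "start_pipeline.py", sys.argv[1]], ["dvc", "repro"])
```

The important detail is that the variable is set in the environment of the process that launches dvc repro: each stage command is a child of that process and inherits it.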

This way of propagating the run started in start_pipeline.py to the pipeline stages can seem a bit hacky. But since the stages execute in separate subprocesses, the only other way to share information between them would be to write and read a shared file, which I would have found even more cumbersome.

Finally, in order to turn every stage into an MLflow nested run without writing the same lines of code every time, the solution I suggest is to write a Python decorator to apply to each of our scripts:

This decorator performs several tasks:

  • As in start_pipeline.py, it sets the current experiment
  • Then, it reactivates the run started in start_pipeline.py, as explained above
  • Finally, it starts a new run, with the nested=True parameter to create a child run, and using the name of the decorated function as a run name.
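
Since the original snippet is not reproduced in this text, here is a minimal sketch of a decorator that follows the three steps above. The names mlflow_run, EXPERIMENT_NAME and train_model are hypothetical, the logged parameter is only there to show where stage code goes, and MLflow is imported lazily as a convenience of this sketch. It relies on MLFLOW_RUN_ID being set in the environment; without it, the first start_run call would create a new parent run instead of resuming one.

```python
from functools import wraps
from typing import Any, Callable

EXPERIMENT_NAME = "my_experiment"  # hypothetical; must match the parent run's experiment


def mlflow_run(wrapped_function: Callable[..., Any]) -> Callable[..., Any]:
    """Run the decorated function as a child run of the pipeline's parent run."""

    @wraps(wrapped_function)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        import mlflow  # third-party package: pip install mlflow

        mlflow.set_experiment(EXPERIMENT_NAME)
        # With MLFLOW_RUN_ID set, this call resumes the parent run
        # instead of creating a new one.
        with mlflow.start_run():
            # The child run is named after the decorated function.
            with mlflow.start_run(run_name=wrapped_function.__name__, nested=True):
                return wrapped_function(*args, **kwargs)

    return wrapper


@mlflow_run
def train_model(learning_rate: float = 0.01) -> float:
    """Hypothetical stage: anything logged here lands in the child run."""
    import mlflow

    mlflow.log_param("learning_rate", learning_rate)
    return learning_rate
```

A stage script would then simply call train_model() from its entry point, and the command of the corresponding DVC stage would run that script.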

Now, we can decorate our functions to turn them into MLflow runs that will be used as DVC stages:

And that’s how you get a wonderful dashboard representing the different iterations tried on your pipeline and its stages:

That’s the same image as the one above, I put it again because I’m particularly proud of this dashboard 😎

Conclusion

I hope that I convinced you that “DVC + MLflow = ❤️”, or more precisely that... “DVC + MLflow + Typer + Makefile + some custom code = ❤️”.

This is of course only one approach among many possible solutions. There are numerous alternatives to MLflow and DVC Studio for experiment tracking: Neptune, Sacred, Weights & Biases, Comet, Guild AI, ClearML, Valohai, and I could go on... To meet your specific needs, it’s also possible to build your own custom dashboard with tools like Streamlit, as explained in this article from my co-worker.
