In this article, I show how to automate useful actions that should occur before a DVC pipeline execution (dvc pull input files, check that the current git workspace is not dirty) or after the execution (git commit the results, dvc push the produced data) with a Makefile.
Disclaimer: this article assumes you are familiar with DVC (Data Version Control) and DVC Pipelines.
At Sicara, we love DVC (Data Version Control) and use it in most of our machine learning projects. Personally, I think DVC is a great tool because 1) it has a limited scope (tracking the data) and does it very well, and 2) it is very easy to integrate with other tools.
If you are interested in DVC integrated with other tools, you may want to read the article I wrote about DVC + Streamlit (another very cool tech for ML!).
Lean methodology is part of Sicara’s DNA. In practice, we have a tech guild dedicated to ML tooling that meets on a weekly basis to do Yokoten. Yokoten literally means “horizontal deployment” in Japanese. It consists of sharing a good practice learned from a project with other projects.
This article is a Yokoten I presented to our Guild on September 21. I share what we learned on my project about improving the way DVC pipelines are executed.
A Pipeline to Compute Model Metrics
In my project, we have many DVC Pipelines that do different kinds of things. Some of them are executed just a few times (sometimes even once), for instance when we explore new ideas (iteration on the model, a new way to train our model, etc.). On the other hand, some of them are re-executed on a regular basis, e.g., training pipelines and evaluation pipelines.
Let’s take an example: we have a metrics pipeline that looks like this:
Whenever the code (model logic, evaluation scripts) or the data (model weights, test data) is modified, we need to compute model metrics to ensure the model performs as expected, i.e., that its performance is better than before and that edge cases are covered.
To do so, we simply launch:
dvc repro --force metrics/dvc.yaml
Note the --force option: metrics are critical, hence we prefer to force the execution of all stages even if it takes more computation time.
What Would We Like to Automate?
Several actions are required before launching a pipeline with dvc repro:
- check the current git workspace is not dirty: we do not want to execute the pipeline if there are uncommitted changes;
- pull the input data: dvc repro will automatically restore data from intermediary stages from the local cache, but it will not pull input data from remote storage. This may cause the pipeline execution to fail or to run on data that is not up-to-date;
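As a rough illustration, the two checks above could be grouped in a small shell function. This is only a sketch: the function name preflight is hypothetical, and the input paths are passed as arguments rather than hard-coded.

```shell
# Pre-flight checks to run before `dvc repro` (sketch, hypothetical helper name).
# Usage: preflight <dvc-tracked input path>...
preflight() {
    # `git diff --quiet HEAD` exits non-zero when tracked files differ from HEAD
    if ! git diff --quiet HEAD; then
        echo "Workspace is dirty: commit or stash your changes first" >&2
        return 1
    fi
    # Pull each input recursively (-R), overwriting local files without prompting (-f)
    for target in "$@"; do
        dvc pull -Rf "$target" || return 1
    done
}
```

For the metrics pipeline, this would be called as preflight model dataset/testset, and the pipeline would only be launched if the function returns 0.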
Other actions should be done after the pipeline execution:
- save the results in a commit: this commit is special as it should normally contain only changes for the dvc.lock file. Automating the git commit standardizes the commit name so that it is easier to identify later on;
- make the results available to everyone in the team: a common mistake we used to make is to forget to dvc push the data. As a consequence, a dvc pull run afterward fails and you have to ask the data scientist who launched the pipeline to manually dvc push the data - if they still have it!
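A possible shape for these two post-execution actions is sketched below. The helper name publish_results is hypothetical, and the warning about files other than dvc.lock is an addition of this sketch, not something our project necessarily does.

```shell
# Commit and share the results of a pipeline run (sketch, hypothetical helper name).
# Usage: publish_results <results directory> <commit message>
publish_results() {
    dir="$1"
    message="$2"
    git add "$dir" || return 1
    # The commit should normally contain only dvc.lock changes: warn otherwise
    unexpected=$(git diff --cached --name-only | grep -v 'dvc\.lock$' || true)
    if [ -n "$unexpected" ]; then
        echo "Warning: staged files other than dvc.lock:" >&2
        echo "$unexpected" >&2
    fi
    git commit -m "$message" || return 1
    # Upload the produced data so that teammates can `dvc pull` it
    dvc push -R "$dir"
}
```

Because the commit message is a parameter of a shared helper, everyone in the team ends up with the same commit name, for example publish_results metrics "[DVC] Update model metrics".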
Sometimes, the pipeline execution takes some time - about two hours for our metrics pipeline. Thus, for convenience, it’d be great to have the following:
- send a message to the team in case of success/failure: we want to be informed of what is going on without looking at pipeline logs all the time;
- make the pipeline execution asynchronous: in our case, we run metrics on a remote GPU instance, so we’d like to launch the execution in “detached” mode so that we can exit the instance (ssh) right after launching the pipeline.
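The success/failure reporting boils down to checking the exit code of the pipeline command. Here is a minimal sketch of that idea; notify is a hypothetical placeholder that only prints, where a real setup would post to the team channel, and run_and_notify is likewise an illustrative name.

```shell
# Placeholder notifier (hypothetical): prints instead of messaging the team
notify() {
    echo "[notify] $*"
}

# Run a command, then report success or failure under the given name.
# Usage: run_and_notify <name> <command> [args...]
run_and_notify() {
    name="$1"
    shift
    if "$@"; then
        notify "$name done!"
    else
        status=$?   # exit code of the command that just failed
        notify "$name failed (exit code $status)"
        return "$status"
    fi
}

# Detached usage on a remote instance (not executed here; run_metrics.sh is
# a hypothetical script that calls run_and_notify):
#   nohup ./run_metrics.sh &
#   tail -f nohup.out   # follow the logs, then log out whenever you like
```

The nohup prefix keeps the process alive after the SSH session ends, which is what makes the "detached" mode possible.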
Let’s Do Automation!
To automate the aforementioned actions, we wrote a Makefile like this:
# send_success(...) and send_failure(...) stand for curl commands (pseudocode)
_compute_metrics:
	# Check workspace is not dirty
	git diff --quiet HEAD
	# Pull the input data
	dvc pull -Rf model
	dvc pull -Rf dataset/testset
	# Compute metrics
	dvc repro --force metrics/dvc.yaml
	# Commit the metrics
	git add metrics
	git commit -m "[DVC] Update model metrics"
	# Push the metrics
	dvc push -R metrics
	# Notify metrics computation is done !
	send_success("Metrics done !")

compute_metrics:
	(nohup make _compute_metrics || make send_failure("Metrics failed :(")) &
- git diff --quiet HEAD makes the execution fail before the pipeline runs if tracked files have uncommitted changes;
- the send_success() and send_failure() functions are just curl commands that send a message to the team Slack channel (see a tutorial here);
- nohup launches the pipeline execution in the background. You can grab pipeline logs with tail -f nohup.out.
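For reference, such notification helpers could look like the sketch below, assuming a Slack incoming webhook. The SLACK_WEBHOOK_URL variable and the slack_payload and send_slack helpers are hypothetical names, and the escaping only covers single-line messages.

```shell
# Build the JSON body expected by a Slack incoming webhook: {"text": "..."}
slack_payload() {
    # Escape backslashes and double quotes so the message stays valid JSON
    # (assumption: the message contains no newlines)
    escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    printf '{"text": "%s"}' "$escaped"
}

# Post a message to the channel behind SLACK_WEBHOOK_URL (hypothetical variable)
send_slack() {
    curl --silent --show-error --fail \
        -X POST \
        -H 'Content-Type: application/json' \
        --data "$(slack_payload "$1")" \
        "$SLACK_WEBHOOK_URL"
}

send_success() { send_slack "$1"; }
send_failure() { send_slack "$1"; }
```

With the webhook URL exported in the environment, send_success "Metrics done !" posts the message; --fail makes curl return a non-zero exit code if Slack rejects the request.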
Then, it becomes very easy to compute the metrics:
- SSH to a remote (GPU) instance
- Run make compute_metrics and exit the remote instance
- Wait for the message on the Slack channel!
These three simple steps automate the actions required before and after dvc repro. They ensure that dvc push is not forgotten and that the sequence of actions is exactly the same for each pipeline execution by any member of the team (e.g., the commit name).
DVC proposes 3 git hooks that you can install by running dvc install.
The Makefile I proposed somehow covers similar needs:
- the dvc pull commands just before dvc repro ensure the data is up-to-date;
- git commit immediately follows dvc repro, hence the dvc status becomes pointless;
- the dvc push command is just before the
I think both approaches (git hooks or Makefile) may be relevant depending on your use case. In my project, git hooks take a bit too long to execute because we have a lot of data and many pipelines tracked by DVC, which makes every commit painful. One advantage of the Makefile approach is that dvc pull/push are run only when necessary, i.e., before/after dvc repro.
I hope this article was useful and gave you food for thought! Do not hesitate to leave comments: I am convinced there is a lot to improve!
If you want to know more, don't hesitate to contact us!