October 26, 2023 • 4 min read

How to speed up your Python & Poetry GitlabCI with Docker

Written by Erwan Benkara
As your project grows, the process of installing Python packages with the correct versions can become quite slow. When this happens in your Continuous Integration (CI) pipeline, the feedback loop can seem endless. This guide aims to help you reduce your feedback loop time with GitlabCI and Docker. Poetry is now a widely adopted packaging and dependency management tool for Python. With its robust dependency resolver, it helps developers install and maintain their dependencies effortlessly.

Here are the key steps and lessons I discovered to streamline GitlabCI during a recent project. It all started with my struggle in managing dependencies and then transitioning all my jobs to Docker to expedite the entire pipeline.

Dependencies, GitlabCI Executors, and Docker

First, let's take a moment to consider what dependencies are. Dependencies are essentially relationships between different software components; one piece of software relies on the functionalities provided by another to function correctly. In Python, it's as simple as the airflow module needing the Flask module. However, some Python packages may require system-specific packages. For instance, the PostgreSQL client package, psycopg2, needs the system package libpq-dev to function. Since it's system-specific, it can vary from your local machine to your CI machine and even to your production machine.

Speaking of data transformation and BigQuery data pipelines with Airflow, we have some great articles you can read to learn more!

[Image: Overview of the different levels of dependency]

In Gitlab, you have several types of executors to run your CI scripts. It's important to note that the default executor in GitlabCI is the Shell Executor, where your scripts run directly on the machine, specifically in a dedicated per-job shell. So, if you come across a dependency package with a system requirement, you can install it manually. However, this may not be the best approach.

Firstly, it can lead to conflicts with other software. Secondly, your primary concern should be isolation and reproducibility, which is precisely what Docker is designed for. GitlabCI offers other executors, including the Docker Executor. Each script runs in a dedicated Docker container, allowing you to control precisely which dependencies (both system-wide and Python-wide) you want to install in your image.

The isolation and reproducibility come at a minimal cost. Here's what you need:

  • A Dockerfile that defines your image and installs, for example, libpq-dev if needed
  • A Docker registry accessible to your CI runner for both pushing and pulling images
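For example, a Dockerfile installing a system dependency like libpq-dev might start like this (the base image tag and package list are illustrative, not prescriptive):

```dockerfile
# Base image is a placeholder: pick the Python version your project uses
FROM python:3.11-slim

# Install the system packages your Python dependencies need,
# e.g. libpq-dev (and a compiler) for psycopg2
RUN apt-get update \
    && apt-get install -y --no-install-recommends libpq-dev gcc \
    && rm -rf /var/lib/apt/lists/*
```

Your CI jobs can then reference the pushed image with the image: keyword in .gitlab-ci.yml.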

Dockerizing Your Python & Poetry Environment: The Simple Approach

With these steps, you can safely replicate your environment from your local machines to the production environment. Now, let's focus on the Python and Poetry specific aspects.

At first glance, Dockerizing Python and Poetry may appear straightforward. Even though there's no official Poetry image on the Docker Hub, it only takes a few lines to build an image.

Basic image with Poetry & Python installed
  1. You start from a base Python image. The Python version you set in your tag must match the version you specify in your pyproject.toml file.
  2. Then you install Poetry with the official installer.
  3. As good practice, you can set the POETRY_HOME environment variable to control where Poetry will be installed.
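Put together, the three steps above could look like this minimal sketch (the Python version and install path are placeholders; match them to your project):

```dockerfile
# 1. Base Python image: the tag must match the version in pyproject.toml
FROM python:3.11-slim

# 3. Control where Poetry is installed and make it available on the PATH
ENV POETRY_HOME=/opt/poetry
ENV PATH="$POETRY_HOME/bin:$PATH"

# 2. Install Poetry with the official installer
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && curl -sSL https://install.python-poetry.org | python3 - \
    && rm -rf /var/lib/apt/lists/*
```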

And that’s it! You can now build, tag, and push your image to your registry and use it from GitlabCI. Just don’t forget to install your dependencies with Poetry.

Install your Python dependencies when running your GitlabCI job
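A job along these lines would do it (the job name, image path, and test command are placeholders):

```yaml
tests:
  image: registry.example.com/my-team/python-poetry:latest
  script:
    - poetry install    # install dependencies before running the tests
    - poetry run pytest
```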

Yes, it works. But if, like me, you use this image to run your tests in GitlabCI, you may notice it is terribly slow. Much slower than with our good old Shell Executor. In my case, we went from 2 minutes to more than 5 minutes to run our 4 test jobs. If you dig a bit, you’ll find that poetry install is the culprit. Let’s speed that up.

Optimizing Python Dependency Management with Poetry

We won’t exactly speed up poetry install itself; that would make little sense. The key point is that you don’t have to install your dependencies on every run.

By nature, you don’t add or modify your dependencies on every commit. Your pyproject.toml only changes now and then. Thus, installing your dependencies directly inside your Docker image is a good idea.

Install your Python dependencies directly inside the image

The best practice is to use the poetry.lock file. It lists the exact versions of dependencies, as it is very well explained in the Poetry documentation. It can speed up the installation as the version resolution is already done.
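A sketch of such an image, assuming your pyproject.toml and poetry.lock sit at the repository root (paths and versions are illustrative):

```dockerfile
FROM python:3.11-slim

ENV POETRY_HOME=/opt/poetry
ENV PATH="$POETRY_HOME/bin:$PATH"

RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && curl -sSL https://install.python-poetry.org | python3 - \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
# Copy only the dependency files: this layer is rebuilt
# only when pyproject.toml or poetry.lock changes
COPY pyproject.toml poetry.lock ./
# --no-root installs the dependencies without packaging the project itself
RUN poetry install --no-root
```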

Either way, you can now use a pre-built image that contains all your dependencies in your GitlabCI. You can remove the poetry install line from your .gitlab-ci.yml. Yet, you now have to solve one final issue.

What happens if I want to add another dependency?

You can’t use the current image you have in your registry. You have to build it again. It can be painful if you have to do it every time you change your pyproject.toml.

The best way is to build it only when you need it, thanks to an additional job in your CI.

Let’s call this job Deps. This job should be triggered whenever there is a change to either:

  • your pyproject.toml / poetry.lock or
  • your Dockerfile

You can do this using rules and changes in GitlabCI.

Build a new Docker image in GitlabCI only when your dependencies have changed
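Sketched as a GitlabCI job using Docker-in-Docker and GitLab’s predefined registry variables (the job name, stage, and image path are illustrative):

```yaml
deps:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  rules:
    # Rebuild the image only when a dependency-defining file changes
    - changes:
        - pyproject.toml
        - poetry.lock
        - Dockerfile
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE/python-poetry:latest" .
    - docker push "$CI_REGISTRY_IMAGE/python-poetry:latest"
```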

In simple cases, you can set the image tag to something fixed, like latest. If your team is large, you may face issues such as teammates overwriting your image. In that case, you can compute an image tag by hashing pyproject.toml, poetry.lock, and your Dockerfile with the native Linux md5sum command. Here is an example. Note that you can potentially exclude poetry.lock.
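A sketch of that hashing step. In CI you would run it at the repository root where the three files already exist; here placeholder files are created first just to make the snippet self-contained:

```shell
# Placeholder files so the snippet runs anywhere; in CI these already exist
cd "$(mktemp -d)"
printf 'demo' > pyproject.toml
printf 'demo' > poetry.lock
printf 'demo' > Dockerfile

# Concatenate the three files and hash them into a short, stable image tag:
# the tag only changes when one of these files changes
DEPS_TAG=$(cat pyproject.toml poetry.lock Dockerfile | md5sum | cut -d ' ' -f 1)
echo "$DEPS_TAG"
```

In the Deps job you would then build and push something like $CI_REGISTRY_IMAGE/python-poetry:$DEPS_TAG, and your test jobs would compute the same tag to pull the matching image.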

To sum up, you would end up with a workflow looking like this.

[Image: GitlabCI final workflow]

TL;DR

  • Dependencies can be both system-wide and Python-wide
  • Docker ensures isolation and reproducibility - dependencies are the same from one image to another
  • Install your dependencies - with Poetry - in an image you can reuse in GitlabCI to save time
  • Optimize your CI by installing dependencies only when you detect a change

