May 4, 2023 • 3 min read

How to create BigQuery data pipelines with Airflow and Docker

Written by Wiem Trigui

Do you want to manage your BigQuery data pipelines automatically? Airflow is a great option to schedule, execute, and monitor your workflows, since it is very easy to configure and set up.

If you are new to Airflow, this Sicara article presents an excellent introduction to Airflow and its components. Otherwise, let's start our tutorial on how to run BigQuery in Airflow DAGs locally using Docker!

Prerequisites

Why is Docker with Airflow awesome?

To launch Airflow on your machine, we recommend using Docker. One of the major advantages of Docker is application portability. Docker encapsulates everything your application requires to run in containers. Thus, you can guarantee that your application will run in any UNIX environment without having to worry about differences in software versions or configurations. Moreover, Docker containers provide a level of isolation that makes it easier to manage multiple Airflow instances. This makes switching between multiple Airflow projects with different requirements very easy. Otherwise, you would need to reset Airflow every time you switch projects, which is painful.

How to run Airflow with Docker Compose?

To set up Airflow with Docker Compose, you can rely on this article. It presents a Docker Compose file that defines a multi-container Airflow deployment: a PostgreSQL database, a scheduler, a webserver, and a utility container for initialization.

Tutorial requirements

Before following the tutorial, please ensure that you have the following requirements fulfilled:

  • Docker Desktop installed on your machine: you can download it from the official Docker website.
  • A Python working folder containing an empty folder named dags, plus the docker-compose.yml file and the .env file given in this article.
Initial project structure

NB:

  • Newer versions of the YAML parser reject duplicate map keys (here, the two <<: merge keys in the same mapping) with the error “Map keys must be unique”. To fix that, merge common and depends-on on the same line: <<: [ *common, *depends-on ].
  • For this tutorial, you should also upgrade the Airflow image version to 2.5.1.

After these modifications, the new docker-compose.yml file should look like the following:
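The full file is given in the article linked above and in the GitHub repository referenced at the end of this tutorial; the excerpt below is only a sketch of the parts touched by the two fixes, and the anchor and service names are assumptions based on that setup (your file may differ slightly):

# docker-compose.yml (excerpt) — postgres and airflow-init services omitted for brevity
x-common: &common
  image: apache/airflow:2.5.1            # upgraded image version
  user: "${AIRFLOW_UID}:0"
  env_file:
    - .env
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs

x-depends-on: &depends-on
  depends_on:
    postgres:
      condition: service_healthy
    airflow-init:
      condition: service_completed_successfully

services:
  scheduler:
    <<: [ *common, *depends-on ]         # both anchors merged on a single line
    command: scheduler

  webserver:
    <<: [ *common, *depends-on ]
    command: webserver
    ports:
      - "8080:8080"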

Create your GCP project

The first step is to create a Google Cloud Platform project where the BigQuery datasets and tables will be created (at the time of writing this article, Google offers a 90-day, $300 free trial to every new user). For that, follow these steps:

  1. Go to the project management page in the Google Cloud console.
  2. Click “Create Project”.
  3. Enter the project name and take note of the project ID: it will be important later. In our case, the project ID associated with our project is “airflow-project-test-385219”.
  4. Click “Create” to create your GCP project.
Create a new GCP project
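If you prefer the command line and already have the gcloud CLI installed (its installation is covered in the next section), the same project can be created with a single command instead of the console steps above. The project ID below is the one used in this tutorial, so replace it with your own:

gcloud projects create airflow-project-test-385219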

Authenticate Google client libraries

First, install the Google Cloud SDK according to the official documentation, which provides detailed instructions for various operating systems.
Thereafter, you can authenticate the Google client libraries so that they can interact with Google APIs. To do this, run the following command:

gcloud auth application-default login

A browser window will open, and you need to choose the Google account to authenticate.

A file ~/.config/gcloud/application_default_credentials.json will then be created on your machine.

Now that the project creation and the authentication are done, we can create our first BigQuery DAG.

Prepare your DAG

For this tutorial, we are using a simple DAG with a few BigQuery operators to demonstrate how to run some queries. You'll need to put the DAG and the query files in the dags directory.

Project structure
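For reference, the working folder now looks roughly like this (the DAG file name bigquery_dag.py is illustrative; the dags folder, the SQL file, docker-compose.yml and .env come from the previous steps):

.
├── dags/
│   ├── bigquery_dag.py        # the DAG described below
│   └── create_bq_table.sql    # the table creation query
├── docker-compose.yml
└── .env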

The aim of this code is to create an Airflow DAG with two tasks (a minimal sketch follows the list below):

  1. Create a dataset test_dataset in the EU location using the BigQueryCreateEmptyDatasetOperator.
  2. Create a table test_table in test_dataset. For that, we use the BigQueryInsertJobOperator, which reads an SQL file create_bq_table.sql containing the table creation query.
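Here is a minimal sketch of such a DAG, assuming the file layout shown above (the DAG id, start date and the way of loading the SQL file through Jinja templating are illustrative choices, not the article's exact code):

# dags/bigquery_dag.py — minimal sketch of the two-task DAG described above
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryCreateEmptyDatasetOperator,
    BigQueryInsertJobOperator,
)

with DAG(
    dag_id="bigquery_dag",
    start_date=datetime(2023, 1, 1),
    schedule=None,  # trigger manually from the Airflow UI
    catchup=False,
) as dag:
    # Task 1: create the test_dataset dataset in the EU location
    create_dataset = BigQueryCreateEmptyDatasetOperator(
        task_id="create_dataset",
        dataset_id="test_dataset",
        location="EU",
    )

    # Task 2: run the table creation query stored in create_bq_table.sql
    # (resolved by Jinja templating relative to the folder containing this DAG file)
    create_table = BigQueryInsertJobOperator(
        task_id="create_table",
        configuration={
            "query": {
                "query": "{% include 'create_bq_table.sql' %}",
                "useLegacySql": False,
            }
        },
        location="EU",
    )

    create_dataset >> create_table

The create_bq_table.sql file only needs to contain the table creation query; the columns below are purely illustrative:

-- dags/create_bq_table.sql — illustrative schema
CREATE TABLE IF NOT EXISTS test_dataset.test_table (
    id INT64,
    name STRING
);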

Setup Google Cloud connection in Airflow

First, you need to mount the Google Cloud credentials into the containers. In the docker-compose.yml file, create a volume that maps the local credentials file ~/.config/gcloud/application_default_credentials.json on your machine to /home/airflow/.config/gcloud/application_default_credentials.json in Docker.

Then, you need to declare the Google Cloud environment variables (a sketch of both changes is shown after the list below):

  1. Define the connection id GOOGLE_CLOUD_DEFAULT and the connection type google-cloud-platform.
  2. Set the value of GOOGLE_APPLICATION_CREDENTIALS to the credentials file path inside Docker.
  3. Set the value of GOOGLE_CLOUD_PROJECT to the ID of your Google Cloud project.
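Here is a sketch of both changes, assuming the environment variables are declared in the .env file loaded by the containers (you could equally put them under an environment key in docker-compose.yml). The connection is declared through Airflow's AIRFLOW_CONN_<conn_id> environment variable convention, and the project ID is the one used in this tutorial:

# docker-compose.yml (excerpt) — add the credentials file to the shared volumes
x-common: &common
  image: apache/airflow:2.5.1
  env_file:
    - .env
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ~/.config/gcloud/application_default_credentials.json:/home/airflow/.config/gcloud/application_default_credentials.json

# .env — Google Cloud settings
AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT=google-cloud-platform://
GOOGLE_APPLICATION_CREDENTIALS=/home/airflow/.config/gcloud/application_default_credentials.json
GOOGLE_CLOUD_PROJECT=airflow-project-test-385219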

Test it out

Now you can launch your DAG!

  1. Run docker compose up -d to start Airflow in detached mode.
  2. Open http://0.0.0.0:8080 in your web browser (Username: airflow, Password: airflow).
  3. In DAGs/Actions, click “Trigger DAG” and check that the dataset and table creation tasks succeed.

In the Google Cloud console, in the BigQuery service, verify that the dataset and the table have been created in your project.

Finally, run docker compose stop to stop the containers or docker compose down to shut down the containers and remove them.

You can find the code for this tutorial in this GitHub repository.

What’s Next

There is much more you can do with Airflow to manage BigQuery tables, run queries and validate the data. You can find different BigQuery operators in this Airflow documentation.

To learn more about Airflow and its use cases, I recommend this Sicara article about dbt transformations, which explains how to use dbt and Airflow to master your data transformations.

This article was written by Wiem Trigui