Do you want to manage your BigQuery data pipelines automatically? Airflow is a great option to schedule, execute, and monitor your workflows, since it is very easy to configure and set up.
If you are new to Airflow, this Sicara article presents an excellent introduction to Airflow and its components. Otherwise, we can start our tutorial on how to run BigQuery in Airflow DAGs locally using Docker!
Why is Docker with Airflow awesome?
To launch Airflow on your machine, we recommend using Docker. One of the major advantages of Docker is application portability: Docker encapsulates everything your application requires to run in containers, so you can guarantee that your application will run in any UNIX environment without having to worry about differences in software versions or configurations. Moreover, Docker containers provide a level of isolation that makes it easier to manage multiple Airflow instances, which makes switching between Airflow projects with different requirements very easy. Otherwise, you would need to reset Airflow every time you switch projects, which is painful.
How to run Airflow with Docker Compose?
To set up Airflow with Docker Compose, you can rely on this article. It presents a Docker Compose file that defines a multi-container Airflow deployment, with a PostgreSQL database, a scheduler, a web server, and a utility container for initialization.
Before following the tutorial, please ensure that you fulfil the following requirements:
- Docker Desktop installed on your machine: you can visit the official Docker website and download it from there.
- A Python working folder containing an empty folder named `dags`, plus the `docker-compose.yml` file and the `.env` file given in this article.
- Newer versions of YAML cause the error “Map keys must be unique”. To fix that, put `*depends-on` on the same line: `<<: [ *common, *depends-on ]`.
- For this tutorial, you should also upgrade the Airflow image version to 2.5.1.
After these modifications, the new `docker-compose.yml` file should look like the following:
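As the full file is long, here is a sketch of only the modified parts, assuming the `common` and `depends-on` anchor names used in the referenced compose file (your service list will be larger):

```yaml
x-airflow-common:
  &common
  image: apache/airflow:2.5.1  # upgraded image version
  # ... environment, volumes, etc. unchanged ...

x-depends-on:
  &depends-on
  depends_on:
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    # ...
  scheduler:
    # Merge both anchors on one line to avoid the
    # "Map keys must be unique" YAML error:
    <<: [ *common, *depends-on ]
    command: scheduler
```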
Create your GCP project
The first step is to create a Google Cloud Platform project where the BigQuery datasets and tables will be created (at the time of writing this article, Google is offering a 90-day $300 free trial for every new user). For that, follow these steps:
- Go to the project management page in the Google Cloud console.
- Click “Create Project”.
- Enter the “project name” and note down the “project ID” — this is important for later. In our case, the project ID associated with our project is “airflow-project-test-385219”.
- Click “Create” to create your GCP project.
Authenticate Google client libraries
First, install the Google Cloud SDK according to the official documentation, which provides detailed instructions for various operating systems.
Thereafter, you can authenticate Google client libraries to interact with Google APIs. To do this, you can run this command:
gcloud auth application-default login
Your browser will open, and you need to choose the Google account to authenticate. The file `~/.config/gcloud/application_default_credentials.json` will then be created on your machine.
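As a quick sanity check, you can confirm from Python that the credentials file landed where the client libraries expect it. This is a hypothetical stdlib-only helper, not part of the Google Cloud SDK:

```python
import json
from pathlib import Path

# Default location where `gcloud auth application-default login`
# stores Application Default Credentials on Linux/macOS.
ADC_PATH = Path.home() / ".config" / "gcloud" / "application_default_credentials.json"

def adc_exists() -> bool:
    """Return True if the ADC file exists and contains valid JSON."""
    if not ADC_PATH.is_file():
        return False
    try:
        json.loads(ADC_PATH.read_text())
    except json.JSONDecodeError:
        return False
    return True

print(f"Looking for credentials at: {ADC_PATH}")
if adc_exists():
    print("Found a valid ADC file.")
else:
    print("No ADC file found - run 'gcloud auth application-default login'.")
```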
Now that the project creation and the authentication are done, we can create our first BigQuery DAG.
Prepare your DAG
For this tutorial, we are using a simple DAG with a few BigQuery operators to demonstrate how to run some queries. You'll need to put the DAG and the query files in the `dags` folder.
The aim of this code is to create an Airflow DAG with two tasks:
- Create a dataset `test_dataset` in the `EU` location using the `BigQueryCreateEmptyDatasetOperator`.
- Create a table in `test_dataset`. For that, we use the `BigQueryInsertJobOperator`, which calls an SQL file `create_bq_table.sql` containing the table creation query.
Setup Google Cloud connection in Airflow
First, you need to mount a volume for the Google Cloud credentials. In the `docker-compose.yml` file, create a volume mapping the local Google Cloud credentials file `~/.config/gcloud/application_default_credentials.json` on your machine to the file `/home/airflow/.config/gcloud/application_default_credentials.json` in Docker.
Then, you need to declare the Google Cloud environment variables:
- Define the connection id `GOOGLE_CLOUD_DEFAULT` and the connection type `google-cloud-platform`.
- Set the value of `GOOGLE_APPLICATION_CREDENTIALS` to the credentials file path in Docker.
- Set the value of `GOOGLE_CLOUD_PROJECT` to the ID of your Google Cloud project.
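Putting the volume mount and the environment variables together, the relevant fragment of `docker-compose.yml` could look like the following sketch. The `common` anchor name and the project ID are taken from earlier in the article; Airflow reads a connection from any environment variable named `AIRFLOW_CONN_<CONN_ID>`:

```yaml
x-airflow-common:
  &common
  image: apache/airflow:2.5.1
  environment:
    # Connection id GOOGLE_CLOUD_DEFAULT with connection type google-cloud-platform
    AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://'
    # Credentials file path inside the container
    GOOGLE_APPLICATION_CREDENTIALS: /home/airflow/.config/gcloud/application_default_credentials.json
    # Replace with your own project ID
    GOOGLE_CLOUD_PROJECT: airflow-project-test-385219
  volumes:
    - ./dags:/opt/airflow/dags
    # Mount the local ADC file into the container
    - ~/.config/gcloud/application_default_credentials.json:/home/airflow/.config/gcloud/application_default_credentials.json
```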
Test it out
Now you can launch your DAG!
- Run `docker compose up -d` to run Airflow in detached mode.
- Open http://0.0.0.0:8080 in your web browser (Username: airflow, Password: airflow)
- In DAGs/Actions, click “Trigger DAG” and ensure that the dataset and the table creations are successful.
- In the Google Cloud console, in the BigQuery service, verify that the dataset and the table have been created in your project.
- Run `docker compose stop` to stop the containers, or `docker compose down` to shut down the containers and remove them.
You can find the code for this tutorial in this GitHub repository.
There is much more you can do with Airflow to manage BigQuery tables, run queries and validate the data. You can find different BigQuery operators in this Airflow documentation.
To learn more about Airflow and its use cases, I recommend this Sicara article about dbt transformation, which explains how to use dbt and Airflow to master your data transformations.