DBT is one of the most widely used SQL data pipeline tools, while Kubernetes is the standard for managing containerized applications. Why not use both of them?
DBT lets you define all the steps of your data pipelines in a flexible and effective way. Soon, you will wonder where you can deploy your DBT application and run it to process your data. Here come Terraform and Kubernetes.
In this article, I’ll guide you through a project that schedules a series of DBT jobs in a Kubernetes cluster hosted in GCP. You will first deploy the infrastructure with Terraform. Then, you will define your DBT jobs. Eventually, you will deploy your DBT pipeline and make it run!
Overview of the DBT project
Before we dive into the explanations, you can clone the repository:

```shell
git clone https://github.com/ebenkara15/dbt-kube-demo
```
The project is pretty simple. We want to process some fake data from a list of customers and shops. The point is to be able to trigger DBT jobs from Kubernetes against BigQuery. The diagram below makes the explanation more visual. We will come back to the details later on.
As you can guess, we will need a few tools to manage and deploy our resources. The objective is to have as little interaction with the console as possible.
The only prerequisite is a GCP project set up with a valid Billing Account. Once this is done, you can install the GCP CLI, named `gcloud`. First, authenticate and set up the CLI:
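The commands themselves are not shown in the original; a minimal sequence looks like this (the second command stores the Application Default Credentials that DBT will rely on later):

```shell
# Initialize the CLI and select your GCP project
gcloud init

# Store Application Default Credentials for client libraries
gcloud auth application-default login
```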
We will use several tools to run our project.
- terraform: deploy and maintain the infrastructure (install | docs)
- kubectl: interact with our cluster (install | docs)
- docker: build our application (install | docs)
- helm: package and deploy our application (install | docs)
You are now ready to start!
Set up your infrastructure for Kubernetes
Refer to the `terraform` folder in the repository to follow this part.
If you are familiar with Terraform, you must have heard of Terraform state. It is a file that lists all the infrastructure resources and their configuration. A good practice is to store it remotely - in a bucket for example. Yet, there is no bucket for our state for now since the project is new and empty. The project bootstrap folder consists in activating the required GCP APIs and creating a bucket that will hold the state for future resources.
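Once the bootstrap has run, the main configuration can point its state at that bucket with a backend block roughly like this (the bucket name is an assumption, use the one created by the bootstrap):

```hcl
terraform {
  backend "gcs" {
    bucket = "kube-dbt-demo-tf-state"  # hypothetical bucket name
    prefix = "terraform/state"
  }
}
```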
Go to `terraform/bootstrap`, modify the project name in `_settings.tf` and run:
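From the `terraform/bootstrap` folder, the usual Terraform workflow applies:

```shell
# Download providers and initialize the working directory
terraform init

# Enable the required GCP APIs and create the state bucket
terraform apply
```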
What do we need?
Architecture can be vast and complex sometimes. But we will keep it simple.
We need at least two obvious things - a Kubernetes cluster and a BigQuery dataset. But first, we have to configure some network resources. Basically, we create a private network in `terraform/network.tf` and host the cluster in it with `terraform/gke.tf`. The cluster is set to use Autopilot, which makes the configuration much easier.
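The repo's actual configuration may differ; a minimal Autopilot cluster attached to a private network looks roughly like this (the network resource names are assumptions):

```hcl
resource "google_container_cluster" "dbt_cluster" {
  name     = "dbt-cluster"
  location = "europe-west1"

  # Autopilot mode: GCP manages the nodes for us
  enable_autopilot = true

  network    = google_compute_network.private.id
  subnetwork = google_compute_subnetwork.private.id
}
```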
The BigQuery part, on the other hand, is straightforward: we just create a dataset, which is similar to a database in conventional DBMS.
We also create an Artifact Registry that will hold our Docker image.
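Both resources are short to declare; a sketch (the dataset ID matches the DBT profile shown later, the repository ID matches the image path used below, everything else is an assumption):

```hcl
# Dataset used by DBT
resource "google_bigquery_dataset" "dbt_data" {
  dataset_id = "dbt_data"
  location   = "europe-west1"
}

# Docker repository for our DBT image
resource "google_artifact_registry_repository" "demo_repo" {
  repository_id = "demo-repo"
  location      = "europe-west1"
  format        = "DOCKER"
}
```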
Think about permissions
Currently, you have a cluster that can host your DBT application and a dataset to hold your data and run your queries. So, what is next? Actually, if we deployed our application and ran it now, we would get a log saying `403 Forbidden`. Indeed, the application lacks permissions, but we will jump into this topic later.
You can run `terraform init` and `terraform apply` in the `terraform` folder to create the infrastructure: it will take around 8 minutes to complete. Just keep in mind that we still have to discuss the content of `iam.tf` later on.
Congratulations! You just built the infrastructure we need for the project!
If you would like to learn more about Terraform with a super example, you can read this article from a colleague of mine.
Define your DBT pipeline
Up to this point, the infrastructure is ready. Let’s define our DBT pipeline.
Below is an overview of the small pipeline we want to create.
“I don’t have any data to process”
No worries: you can either run the `data_import.sh` script in the `data` folder, or manually import the two `.ndjson` files from that same folder into the bucket we just created. Make sure you create a folder named `data` in the bucket if you choose to import them manually.
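Assuming the bucket name created by Terraform, the manual import boils down to something like:

```shell
# Copy the dummy datasets into a "data" folder of the bucket
# (the bucket name is an assumption - use the one created by Terraform)
gsutil cp data/*.ndjson gs://<your-bucket>/data/
```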
These dummy data represent shops and customers. The pipeline we are about to build aims at analyzing purchases made by customers.
How does DBT work?
DBT at its core is a tool that lets you define everything you need for your SQL pipelines. It basically connects to your favorite database and runs your SQL queries.
There is already a small DBT project defined in the repo, in the `src/kube_bq_jobs` folder. You will essentially find two main folders: `macros` and `models`. Macros are similar to functions in programming languages: they are reusable pieces of code. Models let you define transformations for your data. For instance, the following macro imports some data from a GCS bucket into a BigQuery dataset.
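The repo's macro is not reproduced here; a minimal sketch of such an import, using BigQuery's `LOAD DATA` statement (the macro name, bucket URI and file layout are assumptions), could look like this:

```sql
-- macros/load_gcs_data.sql (hypothetical sketch)
{% macro load_gcs_data(table_name) %}
LOAD DATA OVERWRITE `{{ target.project }}.{{ target.dataset }}.{{ table_name }}`
FROM FILES (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://<your-bucket>/data/{{ table_name }}.ndjson']
);
{% endmacro %}
```

Such a macro can be invoked outside of a model with `dbt run-operation load_gcs_data --args '{table_name: customers}'`.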
“It doesn’t look like SQL”
Indeed, DBT uses Jinja as its templating system. It allows you to use predefined variables and functions, if-else statements and many other things. In the previous macro, we wrote Jinja expressions such as `{{ target.project }}`, but DBT replaces them at compile time with the actual values from the active profile (here, the GCP project ID).
The next step is to process and transform this data with one of our models. Here is an example counting the number of purchases grouped by shop:
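The repo's model is not reproduced here; a minimal sketch, assuming models named `customers` and `shops` with a shared `shop_id` column, could be:

```sql
-- models/purchases_by_shop.sql (hypothetical sketch)
select
    s.shop_name,
    count(*) as nb_purchases
from {{ ref('customers') }} as c
join {{ ref('shops') }} as s
    on c.shop_id = s.shop_id
group by s.shop_name
```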
You can notice the use of `ref()`, which simplifies the reference to other tables: you just have to put the name of one of the existing models! Handy, no?
How do we connect to BigQuery?
Good point. DBT uses profiles to store your configurations. These profiles are usually defined in `~/.dbt/profiles.yml`. A profile defines the different elements you want to use in your project. You then refer to it in the `dbt_project.yml` file - you will find it in the `src/kube_bq_jobs` folder. Below is the profile we use in the project:
```yaml
dataset: dbt_data       # BigQuery dataset ID
location: europe-west1  # Your GCP location
method: oauth           # Authentication method
project: kube-dbt-demo  # GCP project ID
type: bigquery          # SQL backend used
```
GCP offers several methods to connect clients to its services, and BigQuery is no exception. Actually, when you authenticated with `gcloud`, you entered your credentials. These credentials are stored and then used as Application Default Credentials when accessing GCP services. That's why we've set `oauth` as the `method`. But we still have permission aspects to review in the Kubernetes section.
If you want to see another example of interaction with BigQuery, you can read this article about Airflow ETL.
Dockerize your app
The DBT application is now ready to be dockerized! There is already a Dockerfile defined in the repo. It is based on the official DBT image. We just copy the project into it and that’s it.
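The exact Dockerfile is in the repo; a sketch along those lines (the image tag and paths are assumptions) would be:

```dockerfile
# Hypothetical sketch based on the official dbt-bigquery image
FROM ghcr.io/dbt-labs/dbt-bigquery:1.7.latest

WORKDIR /usr/app
COPY src/kube_bq_jobs /usr/app

ENTRYPOINT ["dbt"]
CMD ["run"]
```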
Firstly, you need to run `gcloud auth configure-docker europe-west1-docker.pkg.dev`: it will handle the Docker authentication when pulling/pushing images from/to GCP Artifact Registry.
You are good to build and push the image. Go to the root of the repo:

```shell
docker build -t europe-west1-docker.pkg.dev/<project-id>/demo-repo/dbt-kube-demo:latest .
docker push europe-west1-docker.pkg.dev/<project-id>/demo-repo/dbt-kube-demo:latest
```
Run your pipeline (if you want)
If you have Python and Poetry installed, you can run the project now to make sure everything works fine:

```shell
poetry run dbt run
```
Deploy your application over Kubernetes
Let’s recap. The infrastructure is deployed and the application can be run from a local machine. The last step is to deploy the application to the cluster we created when dealing with Terraform.
What is Kubernetes?
Kubernetes, also known as K8s, is an open-source system for automating the deployment, scaling, and management of containerized applications.
Kubernetes does a lot of things: we will focus on the basics. Natively, Kubernetes handles the lifecycle of your containerized applications. Everything in Kubernetes is denoted as resources. The fundamental and the smallest resource of K8s is the Pod. A Pod can be seen as a “logical host”: everything that is run inside a Pod shares the same context (IP address, physical resources, etc.). But the Pod is not the only resource you can define. Here is a small list of Kubernetes resources we will use:
- ServiceAccount: like in GCP, it defines an identity for workloads
- ConfigMap: a set of configurations that can be used by Pods
- CronJob: a resource that runs a given job on a schedule
As a cluster can grow, you can organize your resources by namespace. However, you need to be aware that some resources are defined cluster-wide and others namespace-wide. The above resources are all defined at a namespace level.
What about Helm?
Helm is a package manager for Kubernetes. Kubernetes lets you define individual resources, but it does not manage your application as a whole. For instance, our app contains 3 different Kubernetes resources. To deploy the application with `kubectl` alone, you would have to run 3 different commands, one per resource. You would rather want your app to behave like a package. With Helm, you run a single command and it deploys all your resources. Helm also allows you to roll back to a previous revision of your application.
Furthermore, Helm comes with a templating system. You create a YAML file that holds your values and you can reference these values in your resources definition. Typically, you put in this file the values that are susceptible to change:
- image name/tag
- specific resource name
- Pod configuration
It can be very useful in case your application has to be deployed in several environments (dev, pre-prod and prod, for instance). You just have to specify the right values file when running your Helm command.
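As an illustration (the file content below is an assumption, not the repo's actual values file), a values file and the way a template references it look like this:

```yaml
# values.yaml (hypothetical sketch)
image:
  repository: europe-west1-docker.pkg.dev/<project-id>/demo-repo/dbt-kube-demo
  tag: latest

# In a template, these values are referenced with Helm's templating syntax:
#   image: {{ .Values.image.repository }}:{{ .Values.image.tag }}
```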
The resources listed earlier are the ones we will define to run our DBT application. The `ConfigMap` will hold the profile for DBT. Since there is no sensitive value in it, we can safely use this resource. The `CronJob` will pull our image and run it. Finally, the `ServiceAccount` is attached to the `CronJob` and will be used to authenticate against BigQuery.
All Kubernetes resources are defined using a YAML file. For example, the `ConfigMap` definition looks like this:
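The repo's exact template is not reproduced here; a sketch consistent with a Helm-managed ConfigMap holding the DBT profile would be:

```yaml
# Hypothetical sketch of the ConfigMap template
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-config     # name handled by Helm
  namespace: {{ .Release.Namespace }}  # namespace handled by Helm
data:
  dbt-profile: |
    # ... the profiles.yml content shown earlier ...
```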
Firstly, you must indicate the `kind` of resource you are creating. Then, you give your resource a name and assign it to a namespace: these 2 values are handled by Helm. Eventually, in the `data` block, you declare what you want to store. Here, we only store the DBT profile.
Now, let’s have a look at the `CronJob` definition (excerpt, ellipses stand for values defined in the repo):

```yaml
schedule: "0 * * * *"  # Run every hour
jobTemplate:
  spec:
    template:
      spec:
        serviceAccountName: ...  # the K8s ServiceAccount defined below
        containers:
          - name: dbt
            image: ...           # our image in the Artifact Registry
            args: ["run"]
            volumeMounts:
              - name: dbt-profile
                mountPath: /usr/app/profiles.yml  # Retrieve profiles.yml
                subPath: profiles.yml
        volumes:
          - name: dbt-profile  # Mount the ConfigMap
            configMap:
              name: ...
              items:
                - key: dbt-profile
                  path: profiles.yml
        restartPolicy: OnFailure  # Restart the job if it fails
```
It may seem unreadable, but it is quite simple. You first define the `schedule` for the job. It is arbitrarily set to run every hour, but you can change it. Then you have to specify the container you want to run and the service account you want to attach. For the container, you must specify an image and the command to run (here, `dbt run`). The `volumeMounts` part loads the DBT profile inside the Pod.
Where is the
ServiceAccount? Right after!
But, wait! Why do we have to define another service account? Good point.
Kubernetes is a system that helps you manage your cluster. Thus, it provides a way for your Pods to authenticate inside the cluster: the `ServiceAccount` resource. It has nothing to do with the GCP service account.
However, it would be very handy if we could bind the K8s
ServiceAccount to a GCP service account. With this pattern, we could grant the GCP service account the permissions we need, link it to the K8s service account and finally attach the latter service account to our Pods.
So we have to bind our service accounts. Actually, the binding is done on both the GCP IAM side and the K8s side. In K8s, we refer to the GCP service account with an annotation in the `ServiceAccount` definition:
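The annotation in question is the Workload Identity one; a sketch (both service account names are assumptions):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dbt-sa  # hypothetical K8s service account name
  annotations:
    # Reference to the GCP service account
    iam.gke.io/gcp-service-account: dbt-sa@<project-id>.iam.gserviceaccount.com
```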
On the GCP side, when you apply the `iam.tf` file, the following steps occur:
- `data` blocks: retrieve all the permissions we need
- `google_project_iam_custom_role` block: creates a custom role and gives it the correct permissions
- `google_service_account` block: creates a blank service account
- `google_project_iam_member` block: makes the service account endorse the custom role
- `google_service_account_iam_binding` block: binds the GCP service account and the K8s service account
The last block is the most interesting. It gives the GCP service account a special role, `roles/iam.workloadIdentityUser`. This indicates to GCP that the service account will be bound to a K8s service account. The final step is to indicate which K8s service account will be bound to it. The `members` argument specifies which K8s service account you want to bind, but it must respect a specific syntax:

serviceAccount:<project-id>.svc.id.goog[<namespace>/<k8s-service-account>]

For us, it is something like this:

members = [
  "serviceAccount:kube-dbt-demo.svc.id.goog[dbt/<k8s-service-account>]",
]
Deploy your application
Here we are! It’s time to deploy your application.
You first need to connect to your cluster. Since you have `gcloud` installed, you need to run these commands:

```shell
gcloud components install gke-gcloud-auth-plugin
gcloud container clusters get-credentials dbt-cluster --region <region> --project <project-id>
```
Eventually, you will run the ultimate Helm command to deploy your application. In a terminal, go to the `kubernetes/helm/dbt-bigquery` folder and run:

```shell
helm upgrade --install --atomic dbt -f values.yaml -n dbt . --create-namespace
```
Once you receive the confirmation that your application is deployed, you can jump into the GCP Console. Go to GKE, select your cluster and click on the Workloads tab. You should see a CronJob named `dbt-dbt-bigquery`. If it doesn’t run automatically, you can click on it, select “Run now” and follow the logs. Ultimately, go to BigQuery and check that your tables are correctly created!
Are you looking for Data Engineering experts? Feel free to contact us!