8 min read

DBT & Kubernetes: the perfect duo for your data pipelines

Rédigé par Erwan Benkara

Erwan Benkara

DBT is one of the most widely used SQL data pipeline tools while Kubernetes is a standard for containerized applications management. Why not using both of them?


DBT allows you to define all the steps for your data pipelines in a flexibly and effectively. Soon, you will wonder where you can deploy your DBT application and run it to process your data. Here come Terraform and Kubernetes.

In this article, I’ll guide you through a project that schedules a series of DBT jobs in a Kubernetes cluster hosted in GCP. You will first deploy the infrastructure with Terraform. Then, you will define your DBT jobs. Eventually, you will deploy your DBT pipeline and make it run!

Overview of the DBT project

Before we dive into the explanations, you can clone the repository.

git clone https://github.com/ebenkara15/dbt-kube-demo

The project is pretty simple. We want to process some fake data from a list of customers and shops. The point is to be able to trigger the DBT jobs from Kubernetes to BigQuery. The diagram below makes the explanation more visual. We will come back to the details later on.

Simplified architecture diagram of the project
Simplified architecture diagram of the DBT project

As you can guess, we will need a few tools to manage and deploy our resources. The objective is to have as little as possible interaction with the console.

Prerequisite

The only prerequisite is to have a GCP project setup with a valid Billing Account. Once this is done, you can install the GCP CLI named gcloud. Firstly, to authenticate and setup the CLI:

gcloud init

Tools

We will use several tools to run our project.

  • terraform: deploy and maintain the infrastructure (install | docs)
  • kubectl: interact with our cluster (install | docs)
  • docker: build our application (install | docs)
  • helm: package and deploy our application (installdocs)

You are now ready to start!

Set up your infrastructure for Kubernetes

Refer to the terraform folder in the repository to follow this part.

Project bootstrap

If you are familiar with Terraform, you must have heard of Terraform state. It is a file that lists all the infrastructure resources and their configuration. A good practice is to store it remotely - in a bucket for example. Yet, there is no bucket for our state for now since the project is new and empty. The project bootstrap folder consists in activating the required GCP APIs and creating a bucket that will hold the state for future resources.

Go to terraform/bootstrap, modify the project name in _settings.tf and run terraform apply.

What do we need?

Architecture can be vast and complex sometimes. But we will keep it simple.

We need at least two obvious things - a Kubernetes cluster and a BigQuery dataset. But at first we have to configure some network resources. Basically, we create a private network in terraform/network.tf and host the cluster into it with terraform/gke.tf. The cluster is set to use Autopilot, it makes the configuration much easier.

However, the BigQuery part is straightforward, we just create a dataset that is similar to a database in conventional DBMS.

We also create an Artifact Registry that will hold our Docker image.

Think about permissions

Currently, you have a cluster that can host your DBT application and a dataset to hold your data and run your queries. So, what is next? Actually, if we deploy our application and run it now, we would get a log saying 403 Forbidden. Indeed, the application is lacking permissions but we will jump into this topic later.

You can run terraform init and terraform apply in the terraform folder to create the infrastructure: it will take around 8 minutes to complete. Just keep in mind we have to discuss the content of iam.tf.

Congratulations! You just built the infrastructure we need for the project !

If you would like to learn more about Terraform with a super example, you can read this article from a colleague of mine.

Define your DBT pipeline

Up to this point, the infrastructure is ready. Let’s define our DBT pipeline.
Below is an overview of the small pipeline we want to create.

DBT pipeline overview
Overview of the DBT pipeline

“I don’t have any data to process”

No worries, you can either run data_import.sh script in the data folder or import manually the two .ndjson files from that same folder in the bucket we just created. Make sure you create a folder named data in the bucket if you chose to import it manually.

These dummy data represent shops and customers. The pipeline we are about to build aims at analyzing purchases made by customers.

How does DBT work?

DBT at its core is a tool that lets you define everything you need for your SQL pipelines. It basically connects to your favorite database and runs your SQL queries.

There is already a small DBT project defined in the repo in src/kube_bq_jobs folder. You find essentially two main folders: macros and models. Macros are similar to functions in programming languages: these are reusable pieces of code. Models let you define transformation for your data. For instance the following macro imports some data from a GCS Bucket into BigQuery Dataset.

DBT macro that import data from GCS Bucket to BigQuery Dataset

“It doesn’t look like SQL”

Indeed, DBT uses Jinja as its templating system. It allows you to use predefined variables and functions, if-else statements and many other things. In the previous macro, we wrote but DBT will replace it by `kube-dbt-demo`.`dbt_data`.

The next step is to process and transform these data with one of our models. Here is an example of the number of purchases grouped by shop :

A simple group by operation to get the number of purchases by shop

You can notice the use of that simplifies the reference to other tables: you just have to put the name of one of the existing models! Handy, no ?

How do we connect to BigQuery?

Good point. DBT uses profiles to store your configurations. These profiles are usually defined in .dbt/profiles.yml. One profile defines the different elements you want to use in your project. You then refer to it in the dbt_project.yml - find it in src/kube_bq_jobs folder. You can find below the profile we use in the project:

kube_bq_jobs:
outputs:
dev:
dataset: dbt_data # BigQuery dataset ID
fixed_retries: 1
location: europe-west1 # Your GCP location
method: oauth # Authentication method
priority: interactive
project: kube-dbt-demo # GCP project ID
threads: 1
timeout_seconds: 300
type: bigquery # SQL Backend used
target: dev

GCP offers several methods to connect clients to their services, BigQuery is no exception. Actually, when you ran gcloud init, you entered your credentials. These credentials are stored and then used as Application Default Credentials when accessing GCP services. That’s why we’ve set oauth as a method. But we still have permission aspects to review in the Kubernetes section.

If you want to see another example of interaction with BigQuery, you can read this article about Airflow ETL.

Dockerize your app

The DBT application is now ready to be dockerized! There is already a Dockerfile defined in the repo. It is based on the official DBT image. We just copy the project into it and that’s it.

Firstly, you need to run gcloud auth configure-docker: it will handle the Docker authentication when pulling/pushing image from/to GCP artifact registry.

You are good for building and pushing the image. Go to the root of the repo :

docker build -t europe-west1-docker.pkg.dev/<project-id>/demo-repo/dbt-kube-demo:latest .
docker push europe-west1-docker.pkg.dev/<project-id>/demo-repo/dbt-kube-demo:latest

Run your pipeline (if you want)

If you have Python and Poetry installed, you can run the project now to make sure everything works fine :

poetry install
poetry run dbt run

Deploy your application over Kubernetes

Let’s recap. The infrastructure is deployed and the application can be run from a local machine. The last step is to deploy the application to the cluster we created when dealing with Terraform.

What is Kubernetes?

Kubernetes, also known as K8s, is an open-source system for automating the deployment, scaling, and management of containerized applications.

Kubernetes does a lot of things: we will focus on the basics. Natively, Kubernetes handles the lifecycle of your containerized applications. Everything in Kubernetes is denoted as resources. The fundamental and the smallest resource of K8s is the Pod. A Pod can be seen as a “logical host”: everything that is run inside a Pod shares the same context (IP address, physical resources, etc.). But the Pod is not the only resource you can define. Here is a small list of Kubernetes resources we will use:

  • ServiceAccount: like GCP, it defines an identity for workload
  • ConfigMap: a set of configurations that can be used by Pods
  • CronJob: a Pod scheduler that runs a given job

As a cluster can grow, you can organize your resources by namespace. However, you need to be aware that some resources are defined cluster-wide and others namespace-wide. The above resources are all defined at a namespace level.

What about Helm?

Helm is a package manager for Kubernetes. Kubernetes lets you define some resources but it does not manage your deployment. For instance, our app contains 3 different Kubernetes resources. If you want to deploy the application, you have to run 3 different kubectl commands to deploy each. You rather want your app to behave like a package. With Helm, you run a single command and it deploys all your resources. Helm also allows you to roll back to a previous revision of your application.

Furthermore, Helm comes with a templating system. You create a YAML file that holds your values and you can reference these values in your resources definition. Typically, you put in this file the values that are susceptible to change:

  • image name/tag
  • specific resource name
  • Pod configuration

It can be very useful in case your application has to be deployed in several environments (dev, pre-prod and prod, for instance). You just have to specify the right values file when running your Helm command.

Defining resources

The previous list of resources is those we will define to run our DBT application. The ConfigMap will hold the profile for DBT. Since there is no sensitive value, we can safely use this resource. The CronJob will pull our image and run it. Finally, the ServiceAccount is attached to the CronJob and will be used to authenticate against BigQuery.

All Kubernetes resources are defined using a YAML file. For example, the ConfigMap definition looks like this :

kind: ConfigMap
apiVersion: v1
metadata:
name:
namespace:
data:
dbt-profile: |
kube_bq_jobs:
outputs:
dev:
dataset: dbt_data
fixed_retries: 1
location: europe-west1
method: oauth
priority: interactive
project: kube-dbt-demo
threads: 1
timeout_seconds: 300
type: bigquery
target: dev

Firstly, you must indicate the kind of resource you are creating. Then, you give your resource a name and you assign it to a namespace: these 2 values are handled by Helm. Eventually, into the data block, you declare what you want to store. Here, we only define dbt-profile.

Now, let’s have a look at the CronJob resource.

apiVersion: batch/v1
kind: CronJob
metadata:
name:
spec:
schedule: # "0 * * * *" Run every hour
jobTemplate:
spec:
template:
spec:
serviceAccountName:
containers:
- name:
image: ":"
imagePullPolicy:
command:
- dbt
- run
volumeMounts:
- name: dbt-profile
mountPath: /usr/app/profiles.yml # Retrieve profiles.yml
subPath: profiles.yml
volumes:
- name: dbt-profile # Mount the ConfigMap
configMap:
name:
items:
- key: dbt-profile
path: profiles.yml
restartPolicy: OnFailure # Restart the job if it fails

It may seem unreadable but it is quite simple. You first define the schedule for the job. It is arbitrarily defined to be run every hour, but you can change it. Then you have to specify the container you want to run and the service account you want to attach. For the container, you must specify an image and a list of commands (here dbt run). The volume and volumeMounts parts load the DBT profile inside the pod.

Where is the ServiceAccount? Right after!

Permissions focus

But, wait! Why do we have to define another service account? Good point.

Kubernetes is a system that helps you manage your cluster. Thus, it provides a way to your Pods to authenticate inside the cluster: the ServiceAccount resource. It has nothing to see with the GCP service account.

However, it would be very handy if we could bind the K8s ServiceAccount to a GCP service account. With this pattern, we could grant the GCP service account the permissions we need, link it to the K8s service account and finally attach the latter service account to our Pods.

Service Account binding
Service Account binding between Kubernetes and GCP

So we have to bind our service accounts. Actually, the binding is done in both GCP IAM and K8s. In K8s, we will refer to the GCP service account with an annotation into the ServiceAccount definition :

apiVersion: v1
kind: ServiceAccount
metadata:
name:
namespace:
labels:

annotations:
# Reference to the GCP service account
iam.gke.io/gcp-service-account: dbt-sa@<project-id>.iam.gserviceaccount.com

On the GCP side, when you apply the iam.tf file, the following steps occur :

  • data blocks: retrieves all the permissions we need
  • google_project_iam_custom_role block: creates a custom role and gives it the correct permissions
  • google_service_account block: creates a blank service account
  • google_project_iam_member block: makes the service account endorse the custom role
  • google_service_account_iam_binding block: binds the GCP service account and the K8s service account

The last block is the most interesting. It gives the GCP service account a special role which is roles/iam.workloadIdentityUser. This indicates to GCP that the service account will be bound to a K8s service account. The final step is to indicate which K8s service account will be bound to it. The members argument specifies which K8s service account you want to bind but it should respect a specific syntax :

"serviceAccount:<project-id>.svc.id.goog.[<k8s-namespace>/<k8s-serviceaccount>]"

For us, it is something like this :

members = [
"serviceAccount:${local.project}.svc.id.goog[dbt/dbt-sa]"
]

Deploy your application

Here we are! It’s time to deploy your application.

You first need to connect to your cluster. Since you have kubectl and gcloud installed, you need to run these commands :

gcloud components install gke-gcloud-auth-plugin
gcloud container clusters get-credentials dbt-cluster --region <region> --project <project-id>

Eventually, you will run the ultimate Helm command to deploy your application. In a terminal, go to kubernetes/helm/dbt-bigquery folder and run :

helm upgrade --install --atomic dbt -f values.yaml -n dbt . --create-namespace

Once you receive the confirmation that your application is deployed, you can jump into the GCP Console. Go to GKE, select your cluster and click on the Workloads tab. You should see a CronJob named dbt-dbt-bigquery. If it doesn’t run automatically, you can click on it and select “Run now” and follow the logs. Ultimately, go to BigQuery and see if your tables are correctly created !

You are looking for Data Engineers experts? Feel free to contact us !

Cet article a été écrit par

Erwan Benkara

Erwan Benkara