Creating an ELT pipeline to ingest data is a crucial part of any data engineering project. Yet, it can be a cumbersome task depending on the complexity of the data source. To address this, Airbyte is a new open-source tool that is rapidly gaining popularity by claiming to offer a simple way to implement end-to-end ELT flows in minutes. In this article, we will see what Airbyte is, and how to use it to set up a connection from scratch between Notion and BigQuery.
What is Airbyte and how to install it?
Airbyte was first launched in 2020 with the aim of offering open-source connectors from data sources to data storage solutions. The idea was to replace proprietary tools and enable people to develop and share solutions for less popular data sources not considered by proprietary suppliers. In less than two years, Airbyte managed to raise $181.2 million for their solution, and at the time of writing this article, they already have more than 250 connectors in their catalog (including those in beta or alpha).
But how does it work? Essentially, an ELT flow with Airbyte is characterized by three components:
- a source connector, specific to the data source
- a destination connector, specific to the chosen storage solution
- a connection, linking the two connectors and triggering synchronizations of the data
An important point to keep in mind is that the two connectors are completely independent of one another: any source connector can be used with any destination connector. This has two main advantages. First, a data source only needs one connector, which can then be used in multiple connections regardless of the destination, letting you take full advantage of connectors that have already been developed. Second, since any pair of source and destination connectors can work together, it is extremely easy to change the data storage solution: just replace the destination connector with a new one, without touching the configuration of the source connector.
How to install Airbyte
Now that we begin to see the power of Airbyte, let’s see how to actually use it. There are two different ways to do so:
- with Airbyte Cloud, a paid solution hosted by Airbyte that lets you use all the connectors and features without having to install the tool yourself
- with Airbyte Open Source, where you can deploy the tool on the infrastructure of your choice without paying anything
Although some connectors and a few features are only available with Airbyte Cloud, we are going to use the open source solution, as it is free and very easy to deploy on our local machine.
Make sure you have Docker running on your computer. Then, open a terminal and run the three following lines:
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up
And… that’s it! Once the containers are running, open a browser at the indicated URL (by default http://localhost:8000). After creating an account with your email address, you should see the welcome page of Airbyte, ready to set up your first connection.
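As a quick sanity check, you can also query the instance’s health endpoint from the terminal; a minimal sketch, assuming the default local deployment on port 8000:

```shell
# Query Airbyte's health endpoint (default local deployment assumed).
# A healthy instance responds with a small JSON payload such as {"available":true}.
curl -s http://localhost:8000/api/v1/health
```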
Set up an Airbyte connection between Notion and BigQuery
Now that we have an instance of Airbyte up and running, we are going to see how to set up a connection to get data from Notion to BigQuery. This will be done in three main steps, following the three main components of Airbyte I presented earlier:
- Configure the source connector for Notion
- Configure the destination connector for BigQuery
- Finalize the set-up of the connection
Step 1: Notion source connector
Set up of Notion
First, create a Notion account if you don’t already have one, then create an integration with read access by going to this page (it should take you less than 2 minutes). Make sure to save the Internal Integration Token you are given; you will need it later.
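Before moving on, you can check that the token is valid by calling Notion’s API directly; a minimal sketch, assuming the token is stored in a NOTION_TOKEN environment variable:

```shell
# Verify the integration token by fetching the bot user it belongs to.
# NOTION_TOKEN is assumed to hold the Internal Integration Token saved above.
curl -s https://api.notion.com/v1/users/me \
  -H "Authorization: Bearer $NOTION_TOKEN" \
  -H "Notion-Version: 2022-06-28"
# A valid token returns a JSON object describing the integration's bot user;
# an invalid one returns an "unauthorized" error.
```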
Then, on any page in your Notion workspace, add the integration you just created to the connections, so that Airbyte can access this page later on. To do so, click on the three dots in the top-right corner of the page, and look at the bottom of the pop-up window.
After this small configuration of Notion, we are ready to actually configure the connector itself.
Configuration of the source connector
Back on Airbyte, clicking on Set up your first connection brings us to a list of connectors to choose from. After selecting the Notion one, the screen splits into two panels: on the left, the configuration fields of the connector itself; on the right, a Setup Guide giving more information about the connector and its usage.
This connector is particularly easy to set up: enter a name and a start date of your choosing, and paste the token of the integration created earlier into the Access Token field. Then click on the Set up source button. Airbyte will test your configuration, for example to check that the token you entered is valid, before leading you to the next step: the destination connector!
Step 2: BigQuery destination connector
Set up of BigQuery
First, we need to create a GCP project that we will use to store the data coming from Notion. Once you have a project, create a service account on this page, and give it the BigQuery administrator role so that Airbyte can write the data. Finally, create and save a JSON key for this service account, which we will use later on during the connector configuration.
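If you prefer the command line, the same setup can be sketched with the gcloud CLI; the project ID my-airbyte-project and the service-account name airbyte-sa below are placeholders to replace with your own values:

```shell
# Create a service account for Airbyte (names are placeholders).
gcloud iam service-accounts create airbyte-sa \
  --project=my-airbyte-project \
  --display-name="Airbyte destination connector"

# Grant it the BigQuery Admin role so Airbyte can create datasets and write data.
gcloud projects add-iam-policy-binding my-airbyte-project \
  --member="serviceAccount:airbyte-sa@my-airbyte-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.admin"

# Create and download the JSON key used in the connector configuration.
gcloud iam service-accounts keys create airbyte-sa-key.json \
  --iam-account=airbyte-sa@my-airbyte-project.iam.gserviceaccount.com
```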
Configuration of the destination connector
Just as for the source connector, Airbyte presents us with a list of connectors to pick from. Selecting the BigQuery connector takes us to a page with two panels similar to the previous one, this time for BigQuery.
Once again, we fill in the different fields with information about our BigQuery project, and use the key we got earlier from the service account. Once everything is complete, Airbyte tests the connector just as it did for the source, before displaying the final step: configuring the connection itself.
Step 3: Setup of the connection
Now that our two connectors are ready, the last screen of configuration shows us different options regarding the connection. Let’s go through the different sections of this page to see what is possible.
Apart from the obvious title section, the first one concerns the replication frequency. This setting lets us choose how frequently the data will be synced from the source to the destination. There is no need to rely on an external orchestrator to trigger the pipeline: Airbyte lets us do so natively.
Streams and Synchronization modes
The next section is about streams. In Airbyte, streams are defined by the source connector, and they are used to separate the data into different channels, each one configurable independently. For an API, for example, each stream typically corresponds to a different route. In the case of Notion, we have four different streams that we can activate or deactivate depending on whether we want the related data or not. Streams can also be used to store the data in different locations in the destination.
For each stream, we can also select a different synchronization mode. Synchronization modes define what data the source connector will get at each sync, and how this data will be handled by the destination connector. There are four of them:
- Full refresh | Overwrite: at each synchronization, the source connector gets all the available data. The destination connector then overwrites any existing data with the new one.
- Full refresh | Append: similar to the previous one, but the data is appended to existing data instead of overwriting it.
- Incremental | Append: this mode uses a Cursor field to save in a state what was already processed in a previous sync. The source connector then uses this state to get only new data, which is appended to the existing data in the destination.
- Incremental | Deduped history: in addition to the cursor field, the destination connector uses a primary key to update records that already exist in the database. This mode is a bit more complex than the others, so make sure to check Airbyte’s documentation if you want to know more.
In our case, let’s use Full refresh | Overwrite for all streams for simplicity.
Normalization and transformations
A last section of the configuration allows us to apply transformations to the data once it has been loaded into the destination. A first transformation, called normalization, is implemented by Airbyte and can be used as is. You can also add custom transformations by linking a Git repository. However, only dbt transformations are accepted, and additional limitations can apply depending on the destination connector used.
For now, we just leave the normalization on the Normalized tabular data option. We can then click on Set up connection to finalize the configuration… and we are done! Our first connection with Airbyte is already up and running.
After the first sync is finished, you can go to your BigQuery project, and find out that data is already there for each stream you selected.
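You can also inspect the loaded data from the command line with the bq CLI; a sketch where the project, dataset, and table names (my-airbyte-project, airbyte_notion, pages) are assumptions to adapt to your own setup:

```shell
# Count the rows Airbyte loaded into one of the stream tables.
# Project, dataset, and table names are placeholders.
bq query --project_id=my-airbyte-project --use_legacy_sql=false \
  'SELECT COUNT(*) AS row_count FROM `my-airbyte-project.airbyte_notion.pages`'
```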
Now that our connection is running, Airbyte lets us monitor the different synchronizations by accessing the logs. We can also trigger a new sync whenever we want, or even change the configuration of the connection itself.
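Syncs can also be triggered outside the UI through Airbyte’s HTTP API, which is handy for scripting; a sketch assuming the default local deployment, with a placeholder connection ID:

```shell
# Trigger a manual sync for a given connection. The connection ID below is a
# placeholder; the real one is visible in the connection's URL in the Airbyte UI.
curl -s -X POST http://localhost:8000/api/v1/connections/sync \
  -H "Content-Type: application/json" \
  -d '{"connectionId": "00000000-0000-0000-0000-000000000000"}'
```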
On the positive side, Airbyte is a great tool for setting up a fully functional ELT pipeline rapidly. As we saw together, the simplicity of its interface and the numerous connectors already developed by the community allow connecting sources and destinations without writing any custom code. Furthermore, Airbyte offers many other functionalities, like custom transformations, to fit different needs. Developing a new custom connector in Python or Java for a new source or destination is also very easy: Airbyte provides all the necessary documentation, as well as templates to start from.
However, you should keep in mind that Airbyte is still quite young and evolving rapidly. Many of the connectors available today are still in alpha or beta. Even the functionalities of Airbyte itself might not be as complete as those of other platforms. But its rapid evolution means new features come out every month, so now that you know how to use it, go on and explore more of its capabilities in real-life projects!
At Sicara, we already used Airbyte in multiple projects to help clients with their data pipelines. So if you ever want to meet experts in Airbyte, or more broadly in data engineering and data science, feel free to contact us!