Data pipelines are crucial to businesses for collecting, processing and analyzing data efficiently. However, building a data pipeline from scratch can be time-consuming, resource-intensive, and often requires technical expertise. That's where Dataiku, a leading no-code data platform, comes in. With its user-friendly interface, Dataiku empowers users of all technical levels to build data pipelines with ease.
Having used Dataiku daily for more than five months, I am convinced that it is an exceptional tool for quickly building exploratory pipelines. Its preview feature for transformation steps gives quick feedback, which makes it easier for the developer to keep a mental model of the data.
Below, I detail how I connected Dataiku to an S3 bucket and performed some computations on the Iris dataset’s sepal and petal lengths without writing any code.
Dataiku is a very intuitive no-code data platform.
When it comes to data platforms, there are two main capabilities needed to start working:
1. Data storage:
In Dataiku, storage combines Connections and Datasets. Connections are how you connect to your data hosting systems. Datasets are the data tables themselves, so they also contain the schemas, and each one is always attached to a specific Connection. Once the Connection is set up, for reading or writing, creating a Dataset in it is straightforward.
2. Data processing:
In Dataiku, this is called a Recipe. A Recipe is a fully customizable transformation engine that takes a dataset as input and transforms it into something “consumable”. You can see the recipe catalog as your toolbox for performing any no-code transformation you want.
First, to begin with the pipeline, let’s set up the data source
For this tutorial, I used the Dataiku free edition, available on the Dataiku website, installed on my local machine. I won’t detail the installation here. I also used my AWS account; the few S3 requests made during this tutorial cost me nothing.
If you are not interested in connecting to AWS S3, you can just upload the CSV directly and skip this section.
In my AWS account, I first created a bucket and uploaded the famous Iris dataset from Kaggle, in its JSON format. If you intend to follow this tutorial, I suggest you do the same.
Then create a Group and a new policy to allow reading and writing on your bucket with “Create inline policy”:
Add List, Read, and Write to the policy, to allow Dataiku to use the bucket and its content:
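For reference, the three permissions above can be written out as an IAM policy document. The sketch below builds one in Python and prints it as JSON. The bucket name `my-iris-bucket` is a placeholder, and the three actions shown are a minimal assumption, not necessarily the exact set the console adds when you tick List, Read, and Write. Dataiku may need further actions depending on how you use the connection.

```python
import json

# Hypothetical bucket name; replace it with your own.
BUCKET_NAME = "my-iris-bucket"


def build_bucket_policy(bucket_name: str) -> dict:
    """Return an IAM policy document granting list, read, and write on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # List: this action applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Read and write: these actions apply to the objects in the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_bucket_policy(BUCKET_NAME), indent=2))
```

Scoping the resources to a single bucket keeps the access key from reaching anything else in the account.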
After that, create a user in the newly created group:
For this user, go to the Security credentials section and create a new Access key. Copy the Access key and Secret, as you will use them later.
Then, launch your local Dataiku instance, and go to the Administration panel to create a connection to a bucket in your AWS environment by clicking on New connection and selecting Amazon S3.
Set up the connection, including the Access key, the Secret, and the bucket name.
Also, enable SSE encryption in the connection setup:
Then, import the data from the bucket in Dataiku
Once the connection is created, it is very simple to create a dataset parsing the JSON from the bucket.
Click on create dataset from AWS S3:
Then choose the connection created above, select the file you want to read, name the dataset iris, and click Create. Explicitly specify the file to read so that Dataiku will not try to read any other data you might store in this bucket later on:
After creating the dataset, you will be redirected to the dataset view. One thing Dataiku does well is reading multiple file formats with very little overhead for the developer.
The lengths and widths will probably have a string storage type, even though Dataiku infers their meaning as Decimal. In the settings of the Dataset, cast them to floats directly in the schema and save the settings:
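For comparison, here is roughly what that schema change amounts to if done by hand. This is a minimal Python sketch, not what Dataiku runs internally; the sample record and the field names (`sepalLength` and so on) are assumptions about the JSON file's layout.

```python
import json

# One made-up record shaped like the JSON file; the field names are assumptions.
RAW = """[
  {"sepalLength": "5.1", "sepalWidth": "3.5",
   "petalLength": "1.4", "petalWidth": "0.2", "species": "setosa"}
]"""

NUMERIC_COLUMNS = ("sepalLength", "sepalWidth", "petalLength", "petalWidth")


def cast_numeric(record, columns=NUMERIC_COLUMNS):
    """Return a copy of the record with the given columns converted to float."""
    return {
        key: float(value) if key in columns else value
        for key, value in record.items()
    }


rows = [cast_numeric(record) for record in json.loads(RAW)]
print(rows)
```

The species column is left untouched because it is not listed in `NUMERIC_COLUMNS`.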
Now, just enjoy playing with data thanks to Dataiku transformations
First, the goal is to compute the average of the lengths and widths
Go back to the flow view and select your dataset. Then choose a Group recipe. You can choose CSV storage for your data on the S3 connection for simplicity.
In the Group section of the recipe configuration, uncheck the count of occurrences within groups because you won’t need it. Select the average computation for all the metrics of each species.
And in the Output section, you can see the name of the newly created columns (you can even rename them). Then, click on run to build the dataset:
If you are following along with this tutorial: congratulations, you have just built your first Dataiku dataset!
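If you are curious what the Group recipe computes, the sketch below reproduces the per-species averages in plain Python. The sample rows are made up, only two of the four metrics are shown for brevity, and the `_avg` suffix on the output columns is an assumption about naming, since the recipe lets you rename them anyway.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample rows; the real dataset has more rows and four metrics.
rows = [
    {"species": "setosa", "sepal_length": 5.0, "petal_length": 1.25},
    {"species": "setosa", "sepal_length": 5.5, "petal_length": 1.75},
    {"species": "virginica", "sepal_length": 6.5, "petal_length": 5.5},
]
METRICS = ("sepal_length", "petal_length")


def average_by_species(rows, metrics=METRICS):
    """Group rows by species and average each metric within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["species"]].append(row)
    return {
        species: {f"{m}_avg": mean(r[m] for r in members) for m in metrics}
        for species, members in groups.items()
    }


averages = average_by_species(rows)
print(averages)
```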
Then, we want to have the average for each specimen of iris from my original dataset
Create a Join recipe to merge the means with the original dataset, using the species as key:
In the Join section of the recipe, make sure the join key is the species and that the left dataset is the initial one, then run the build:
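In code terms, this step attaches the species averages to every original row. The sketch below assumes a left join, with made-up rows and averages and hypothetical column names.

```python
# Made-up original rows (left side of the join).
rows = [
    {"species": "setosa", "sepal_length": 5.0},
    {"species": "setosa", "sepal_length": 5.5},
    {"species": "virginica", "sepal_length": 6.5},
]

# Made-up per-species averages (right side of the join), keyed by species.
means = {
    "setosa": {"sepal_length_avg": 5.25},
    "virginica": {"sepal_length_avg": 6.5},
}


def left_join(rows, means):
    """Keep every original row and add its species averages when they exist."""
    return [{**row, **means.get(row["species"], {})} for row in rows]


joined = left_join(rows, means)
print(joined)
```

Keeping the initial dataset on the left means no specimen is dropped, even if a species were missing from the averages.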
And now, let’s compute the deviation from the average for each specimen:
From the new dataset, create a Prepare recipe and add a first deviation calculation: the preview is instantly displayed in blue in the dataset. Most importantly, the preview is available for each intermediate step within the recipe, so you can review the state of the data after every change as you make it. Once you have many steps, this becomes your best ally for debugging!
Then add the final transformations. Round the deviation, drop the mean columns, and move the species column to the first position. Then click on run:
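For comparison, the Prepare steps above can be sketched in a few lines of Python. The rows, the column names, and the rounding to two decimals are assumptions for illustration, and only one metric is shown.

```python
# Made-up joined rows: each specimen already carries its species average.
joined = [
    {"sepal_length": 5.0, "species": "setosa", "sepal_length_avg": 5.25},
    {"sepal_length": 5.5, "species": "setosa", "sepal_length_avg": 5.25},
]
METRICS = ("sepal_length",)


def prepare(row, metrics=METRICS, digits=2):
    """Compute rounded deviations, drop the averages, and put species first."""
    out = {"species": row["species"]}  # dicts keep insertion order
    for m in metrics:
        out[m] = row[m]
        # Deviation from the species average, rounded to `digits` decimals.
        out[f"{m}_deviation"] = round(row[m] - row[f"{m}_avg"], digits)
    return out  # the *_avg columns are never copied, so they are dropped


prepared = [prepare(row) for row in joined]
print(prepared)
```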
You can finally browse your preprocessed iris dataset in Dataiku. You could also get the CSV directly from the bucket!
Looks like the job is done!
Time to wrap up
Without a single line of code, we were able to connect an S3 bucket to Dataiku and perform a few transformations, with a preview at each step. It took me around one hour to build this tutorial setup, which compares well with building a Python environment from scratch, complete with an S3 connection, logic, and tests. In addition, as a Python developer, it was a relief, for once, not to worry about my environment, the libraries I would have to use, or my linter punishing me for a missing docstring: transformation steps are quite self-documenting!
However, the pipeline I presented was only a rudimentary exploration. In actual practice, software craftsmanship involves code reviews and testing as fundamental components. I have not yet found an effective way to review no-code pipelines, and I have had lingering concerns about unintentionally damaging my pipeline at any moment. Consequently, I have come to the conclusion that Dataiku can be extremely valuable for data scientists who wish to explore datasets, or for business teams who have limited coding abilities but wish to build mock-ups and prototypes with straightforward automation features.