Data Engineering

October 18, 2023 • 4 min read

Data quality: how to have reliable data with Sifflet?

Written by Lucie Martin

We all want reliable data on which to base our analyses. Often, a lack of knowledge, time, or money leads to neglecting data quality.

However, carelessness here can be costly. Gartner's data quality market study, released in 2021, put the average annual loss linked to poor data quality at $12.9 million per organization. These losses can come from delayed deliveries or from customers who lose confidence and leave for competitors.

Poor data quality can also affect a company's reputation.

I had the chance to improve the data quality in a retail company recently. I ensured that the data was accurate, consistent, complete, and up-to-date to help the company make informed decisions. With this article, I would like to share how we monitor data quality using a data observability platform, Sifflet.

How can you monitor data quality with Sifflet? What are the advantages and disadvantages of this tool?

Why monitor data quality?

Data quality is the state of health of the data at a given point in time. The following criteria can be used to monitor data: completeness, uniqueness, validity, consistency, accuracy, and integrity.

For each criterion, here is an example of the question you should ask yourself:

Completeness: do we have all the data?
Uniqueness: are there two values for the same ID?
Validity: are the fields in the correct format?
Consistency: do the various sources agree with each other?
Accuracy: is the data a good representation of the element it describes?
Integrity: have the data been altered?
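
To make the first three criteria more concrete, here is a minimal Python sketch of what such checks can look like. The sample rows and field names are hypothetical and only serve as an illustration; this is not how Sifflet implements its checks.

```python
import re

# Hypothetical sample rows; the field names are illustrative only.
rows = [
    {"id": 1, "email": "a@example.com", "order_date": "2023-10-01"},
    {"id": 2, "email": None, "order_date": "2023-10-02"},
    {"id": 2, "email": "c@example.com", "order_date": "02/10/2023"},
]

def completeness(rows, field):
    """Share of rows where `field` is filled in (not None)."""
    filled = sum(1 for r in rows if r.get(field) is not None)
    return filled / len(rows)

def duplicate_ids(rows, key="id"):
    """IDs that appear more than once (uniqueness check)."""
    seen, dupes = set(), set()
    for r in rows:
        if r[key] in seen:
            dupes.add(r[key])
        seen.add(r[key])
    return dupes

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def invalid_dates(rows, field):
    """Rows whose `field` is not in YYYY-MM-DD format (validity check)."""
    return [r for r in rows if not ISO_DATE.match(r[field] or "")]

print(completeness(rows, "email"))        # share of rows with an email
print(duplicate_ids(rows))                # IDs used by more than one row
print(len(invalid_dates(rows, "order_date")))  # rows with a badly formatted date
```

On a real project, checks like these run against the warehouse tables rather than in-memory rows, which is exactly the kind of work an observability platform automates.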

Good data quality is essential for the company I worked for, as it sells its data and wants its customers to be satisfied with it. It also wants to deliver its product on time without damaging its reputation.

When dealing with a large volume of data, automating quality control is necessary. The Sifflet platform is an ideal solution for monitoring large amounts of data.

What is Sifflet?

Sifflet is a full-stack data observability platform that enables companies to trust their data when delivering actionable insights at scale. It is important to understand that data observability is not the same as data monitoring. Sifflet has three main features, which are described later in this article.

What is the difference between data observability and data monitoring?

Data monitoring detects a predefined set of failure modes. Data observability goes further: it also helps you understand the source of an incident. In short, data monitoring falls within the scope of data observability.

What are the possible applications of Sifflet?

monitor data and detect when a quality criterion is not met, to ensure data quality
investigate the impact of a data issue by reviewing the data lineage
navigate through the data easily and understand it better with the data catalog, a metadata inventory

What is a data lineage? It is a process that allows you to create a map showing the origin and stages followed by a piece of data.

Lineage example from Sifflet documentation
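
One way to picture a lineage is as a directed graph in which each asset points to the assets built from it. The sketch below, in Python, walks such a graph to list everything affected by an issue on one table. The table names are hypothetical, and this is only an illustration of the idea, not Sifflet's implementation.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets built directly from it.
downstream = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["marts.sales"],
    "staging.customers": ["marts.sales"],
    "marts.sales": ["dashboard.revenue"],
}

def impacted_assets(start, graph):
    """Breadth-first walk returning every asset downstream of `start`."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

print(impacted_assets("raw.orders", downstream))
# ['staging.orders', 'marts.sales', 'dashboard.revenue']
```

Reading the same graph in the opposite direction (from a dashboard back to its raw tables) is what helps you find the origin of an incident.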

Use case for data quality on a retail project

In this section, I present the project I worked on: its simplified architecture, the problems we faced, and how we set up a data quality control system.

Project context

Data is collected from various sources and placed in a data warehouse, BigQuery. At this point the data is raw, as it has not been transformed. Once loaded, the data is transformed using DBT (Data Build Tool). Finally, data analysts use this transformed data to create dashboards for users.

Data pipeline before Sifflet implementation

What is DBT? It is an open-source tool, written in Python, for transforming data inside the warehouse: transformations are defined as SQL models organized into pipelines, and data quality is checked through DBT tests. The following article describes DBT in greater detail.

Problem and solution

The pipeline shown above does not include any data quality checks. This can have embarrassing consequences, such as presenting incomplete or incorrect dashboards. That's why the observability platform Sifflet was chosen.

Sifflet is connected in parallel to the data processing pipeline to check at each stage whether the data quality is correct or not.

Data pipeline after Sifflet implementation
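
The idea of checking quality at each stage, alongside the pipeline rather than inside it, can be sketched in a few lines of Python. The check functions, stage names, and sample rows below are hypothetical; in the project, Sifflet ran the checks against the BigQuery tables.

```python
def check_not_empty(rows):
    """The stage produced at least one row."""
    return len(rows) > 0

def check_no_null_ids(rows):
    """Every row has an ID."""
    return all(r.get("id") is not None for r in rows)

def run_checks(stage, rows, checks):
    """Run every check on a stage's output and return the names of failed checks."""
    return [f"{stage}:{c.__name__}" for c in checks if not c(rows)]

# Hypothetical raw data and a simplified transformation step.
raw = [{"id": 1, "amount": "10"}, {"id": None, "amount": "5"}]
transformed = [
    {"id": r["id"], "amount": float(r["amount"])}
    for r in raw
    if r["id"] is not None
]

checks = [check_not_empty, check_no_null_ids]
failures = run_checks("raw", raw, checks)
failures += run_checks("transformed", transformed, checks)
print(failures)  # ['raw:check_no_null_ids']
```

The checks only read each stage's output, so the pipeline itself is unchanged; a failure tells you at which stage the problem first appeared.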

How to monitor data quality with Sifflet?

Creating monitoring rules on Sifflet is quick and easy. I'll describe the steps here.

First, you need to connect Sifflet to the data source. For more details on how to add a new source, I suggest you consult the Sifflet documentation, which goes through each step in turn.

Add a data source connection from Sifflet documentation

Then, in the "Monitors" tab, you can create a new monitoring rule on the desired table. Don't forget to schedule the rule's execution and activate notifications.

Add a monitor rule from the project I worked on
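
In Sifflet, all of this is done through the interface. To show what a rule boils down to (a target table, a condition, a schedule, and a notification channel), here is a minimal Python sketch of a freshness rule. The `FreshnessMonitor` class and its fields are hypothetical and are not part of any Sifflet API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable

@dataclass
class FreshnessMonitor:
    """Hypothetical rule: the table must have been updated within `max_age`."""
    table: str
    max_age: timedelta
    run_every: timedelta            # schedule, e.g. once a day
    notify: Callable[[str], None]   # notification channel

    def run(self, last_updated: datetime, now: datetime) -> bool:
        """Return True if the table is fresh; otherwise notify and return False."""
        age = now - last_updated
        if age > self.max_age:
            self.notify(f"{self.table} is stale: last update {age} ago")
            return False
        return True

alerts = []  # stands in for a real channel such as email or chat
monitor = FreshnessMonitor(
    table="marts.sales",
    max_age=timedelta(hours=24),
    run_every=timedelta(days=1),
    notify=alerts.append,
)

now = datetime(2023, 10, 18, 8, 0)
monitor.run(last_updated=datetime(2023, 10, 16, 8, 0), now=now)
print(alerts)  # one alert: the table was last updated two days ago
```

A rule without a schedule never runs, and a rule without notifications fails silently, which is why both settings matter when you create a monitor.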

If an incident occurs, you can consult it in the "Incidents" tab and assign it to yourself. The data lineage, which shows the data life cycle, helps you trace the problem back to its origin so that you can solve it.

“Incidents” page that summarizes all incidents that have been created - picture from Sifflet documentation

Some tips on how to use Sifflet:

use tags for your rules or tables to help you manage, organize, and filter them
use domains to split your data assets into subsets to provide a view of specific business areas to a specific team. For example, the "marketing" domain gathers all marketing-related data, to which only the marketing team has access

Feedback on the implementation of data quality with Sifflet

After this experience, would I advise you to use Sifflet? Yes, I would! Why?

Here are the benefits I've seen from using the platform:

learning the platform is easy because you don't need any coding skills
Sifflet can connect to a wide variety of data sources, such as ETL/ELT or BI tools
several employees can use the platform at the same time
as users, we receive notifications when an incident is detected
you can monitor your data as code
you can better understand your data with data catalogs and data lineage
the company takes your feedback into account
the team is responsive and open to suggestions
new releases are shipped regularly

I had regular meetings with the Sifflet team. These meetings let us discuss our day-to-day use of the platform and our needs, and let Sifflet present the platform's new features to us.

The disadvantages I have noticed are:

I encountered some bugs on Sifflet, although all of them were resolved quickly, within the day
Sifflet doesn't send notifications on Google Chat, the messaging system used by my team. It might have been easier for me if Google Chat had been integrated into Sifflet
Git has not yet been integrated into Sifflet but is planned for the near future. In the meantime, be careful if you delete a rule because you won't be able to recover it

Conclusion

As described above, data quality deserves daily attention and should not be disregarded. In this article, I've introduced you to Sifflet, a new tool on the market that lets you set up checks on the quality of your data. It has many advantages, such as its ease of use and the wide variety of data sources it supports.

If you use DBT to transform your data, and you don’t want to use Sifflet, I advise you to combine DBT with Elementary to control data quality.
Are you looking for Data Quality Experts? Don't hesitate to contact us!
