If you are a Python developer or a data engineer, you have likely heard of Pydantic. This article is for you if you are unsure of its purpose, want to learn how to use it, or are looking for tips on applying it.
Pydantic, the data engineer tool
Pydantic is a Python library for defining data models with specific types and formats, providing a declarative way to structure data and facilitating validation and transformation. It ensures data integrity and shines in scenarios involving data normalization and validation. Data engineers favor it for its serialization capabilities, its integration with frameworks like FastAPI and BigQuery, and its automatic documentation.
🧐 Other tools, such as JSON Schema, YAML, and Protobuf, share common features with Pydantic. They are also capable of defining data structures, validating inputs, and enforcing schema specifications. In this article, I present Pydantic as I used it on my project, where it had a real impact on my delivery time and my daily development comfort.
🚀 Why not incorporate Pydantic into your data project? It saves implementation time, accelerates project familiarization, and, on our project, cut data errors by roughly 30%. Boost your project efficiency with its seamless integration and faster onboarding.
A simple data project example
The main issue is data quality
In my ongoing data project, achieving data quality and consistent transformations is our mission. The orchestrated ETL manages data from diverse sources and strives to maintain data integrity and quality throughout transformations.
Let’s take a Python task consisting of three steps:
1 - Load data from a distant database locally
2 - Clean and transform this data
3 - Load the treated data back into the database
Not all data loaded from the database shares the same schema: records come from different sources, such as call centers or websites. The transformation, however, is the same for all of them. The challenge is to apply a common transformation without losing information because of differing data patterns.
Python native solution
The problem was modeling and validating objects, which is complex in plain Python. This snippet of code illustrates the difficulty of using a Python dict as a data type and the risk of introducing bugs during ingestion, validation, and transformation.
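For illustration, here is a minimal dict-based merge of the kind described; the field names and merge rules are hypothetical:

```python
# Dict-based contacts: every access must guard against missing keys and
# None values, and the same checks get repeated in every function.

def merge(existing: dict, incoming: dict) -> dict:
    merged = dict(existing)
    for key, value in incoming.items():
        # Validation logic is duplicated wherever dicts are merged:
        if value is None:
            continue  # silently keep the old value -- or should we overwrite?
        if key == "email" and not isinstance(value, str):
            raise TypeError("email must be a string")
        merged[key] = value
    return merged

merged = merge(
    {"id": 1, "email": "old@example.com", "phone": None},
    {"email": None, "phone": "0600000000"},
)
# merged keeps the old email and gains the phone number
```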
This snippet illustrates how hard it is to ensure data quality within the merge method. Type and quality checks must be repeated everywhere, and the manipulated data remains unreliable: None values can be ingested and used within transformations. It is also hard to maintain clean code and to build common patterns for distinct data schemas.
Pydantic our game-changer
Initially, my team and I devised a unified schema and mapped each source meticulously. Yet maintaining consistent data patterns across transformation and validation was a challenge, leading us to lean heavily on unit tests.
During the project, as business needs evolved, our data model evolved too. Having to modify our data model twice led us to Pydantic, a game-changer! Pydantic’s class definitions enabled precise data schemas, with custom types for each field. It simplifies and secures not only transformations, but also validation, ingestion, and data exposition.
Perfect ETL to Streamline Contact Data
In the project, there’s a database housing client data along with their contact records. The orchestrated pipeline uses Python tasks like the one described before. The technologies used for the data pipeline are Airflow, BigQuery, and dbt. You will find best practices for creating data pipelines with Airflow and Docker.
The aim is to maintain a single, updated contact record for each client across multiple interactions. To manage our incoming data, our team built a unified structure using Pydantic for a contact data model.
To solve our data quality and code base issues, we migrated our implementation to Pydantic. The Pydantic PivotContact class gives incoming contacts a uniform shape regardless of their source. It defines required and optional types for every field, and even specific types to which data validation can be attached.
The TableResource class, which extends Pydantic's native BaseModel class, is a base class with customized methods common to all our data classes. In this code snippet, the PivotContact class has a bigquery_schema method which builds the BigQuery schema from the model.
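Below is a hypothetical reconstruction of what these classes could look like. The class and method names come from the project, but the implementation is an assumption; the real bigquery_schema likely returns google.cloud.bigquery.SchemaField objects, while plain dicts keep this sketch dependency-free:

```python
import typing
from datetime import date
from typing import Optional

from pydantic import BaseModel

# Illustrative mapping from Python types to BigQuery column types.
_BQ_TYPES = {str: "STRING", int: "INTEGER", float: "FLOAT", date: "DATE"}

def _unwrap(annotation):
    # Optional[X] is Union[X, None]; keep the non-None member.
    if typing.get_origin(annotation) is typing.Union:
        args = [a for a in typing.get_args(annotation) if a is not type(None)]
        return args[0]
    return annotation

class TableResource(BaseModel):
    """Base class holding methods shared by all our data classes."""

    @classmethod
    def bigquery_schema(cls) -> list[dict]:
        # Derive one BigQuery column per Pydantic field: optional
        # fields become NULLABLE, the rest REQUIRED.
        return [
            {
                "name": name,
                "type": _BQ_TYPES.get(_unwrap(field.annotation), "STRING"),
                "mode": "REQUIRED" if field.is_required() else "NULLABLE",
            }
            for name, field in cls.model_fields.items()
        ]

class PivotContact(TableResource):
    contact_id: str
    email: Optional[str] = None
    creation_date: date
```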
Multi sources data ingestion
In all data platforms, the first step is to ingest data. Since the sources' data schemas differ while validation and transformation are common to all data, a common data schema is required. Every source is mapped to this common schema. Let's dive deeper into the practical implementation.
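As an illustration, assuming a simplified PivotContact and two hypothetical source formats, the per-source mapping could look like this:

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel

# Simplified PivotContact; the real model has more fields.
class PivotContact(BaseModel):
    contact_id: str
    email: Optional[str] = None
    creation_date: date

# Hypothetical per-source mappers: each source has its own field names,
# but all converge on the same validated PivotContact.
def from_call_center(row: dict) -> PivotContact:
    return PivotContact(
        contact_id=str(row["customer_ref"]),
        email=row.get("mail"),
        creation_date=row["created"],
    )

def from_website(row: dict) -> PivotContact:
    return PivotContact(
        contact_id=row["id"],
        email=row.get("email_address"),
        creation_date=row["signup_date"],
    )

contact = from_call_center(
    {"customer_ref": 42, "mail": "jane@example.com", "created": "2023-05-01"}
)
```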
This snippet highlights how Pydantic ensures data conformity with the standardized contact object after mapping through the PivotContact data model. This interface is the single entry point, so all downstream objects are Pydantic objects with determined, validated values. All subsequent operations are performed on these quality-assured objects.
How to validate data with Pydantic?
The project data model requires data validation, for example on the contact address: this field has to respect a schema, and every contact must contain a unique id and a valid creationDate. Pydantic's BaseModel lets us build an Address class for the addresses field and a custom validator for the date type in contact data. This validator is a BeforeValidator applied to fields of the Pydantic model. It validates and, if needed, transforms the incoming value according to the rules chosen for the field: the value is passed through the validator beforehand and therefore always respects the field's constraints.
Now, let's delve into the significance of robust data validation.
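A minimal sketch of such validation, assuming a DD/MM/YYYY date format from one of the sources; the exact formats and Address fields are assumptions:

```python
from datetime import date, datetime
from typing import Annotated

from pydantic import BaseModel, BeforeValidator

def _parse_date(value):
    # Normalize DD/MM/YYYY strings to a date before Pydantic's own
    # validation runs; other inputs pass through unchanged.
    if isinstance(value, str) and "/" in value:
        return datetime.strptime(value, "%d/%m/%Y").date()
    return value

FlexibleDate = Annotated[date, BeforeValidator(_parse_date)]

class Address(BaseModel):
    street: str
    zip_code: str
    city: str

class PivotContact(BaseModel):
    contact_id: str
    creation_date: FlexibleDate
    addresses: list[Address] = []
```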
Here, Pydantic empowers data accuracy and consistency by enforcing validation protocols, ensuring the integrity of critical fields like addresses and dates.
Mutualize transformations with Pydantic
In the data project, loading information to BigQuery involves transformation and serialization, critical steps for efficient integration. Pydantic plays a pivotal role by serializing structured data seamlessly, ensuring compatibility and proper formatting for BigQuery storage.
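A simplified sketch of this serialization step; the BigQuery client call itself (for example bigquery.Client.insert_rows_json) is omitted:

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel

class PivotContact(BaseModel):
    contact_id: str
    email: Optional[str] = None
    creation_date: date

contacts = [
    PivotContact(
        contact_id="1", email="jane@example.com", creation_date=date(2023, 5, 1)
    ),
]

# mode="json" serializes dates to ISO strings, producing plain dicts
# ready to be passed to the BigQuery client as JSON rows.
rows = [c.model_dump(mode="json") for c in contacts]
```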
This simplified code snippet demonstrates how Pydantic facilitates the serialization process, enabling structured data to be formatted and prepared for smooth integration into BigQuery, maintaining consistency and adherence to required specifications.
My tips 🌈
Our data transformation involves processes like updating contact records and contact deduplication. Pydantic, in conjunction with tools like dbt and Kubernetes, orchestrates these transformations efficiently!
Applicability and Wider Usage
The scenarios we’ve outlined aren’t unique; they echo common challenges in various data projects. Pydantic’s versatility extends far beyond our specific use case for example:
- Its seamless integration with frameworks like FastAPI makes it a preferred choice for API creation and data exposure.
- It is also used with relational databases like Postgres
Pydantic's data validation and construction methods have been a game-changer for our project, elevating data clarity and eradicating the risks associated with invalid information propagation. Implementing Pydantic slashed data error rates by 30%, showcasing its tangible impact.
An upcoming article will offer a deeper comparison of Pydantic with other tools like JSON Schema, YAML, and Protobuf.
🚀 Ready to revolutionize your data projects with Pydantic, as we did on this one? Discover Sicara’s expertise in data engineering.