Data Engineering

January 16, 2024 • 4 min read

Boost Your Data Projects with Pydantic: Validation and Efficiency

Written by Chloé Adnet

If you are a Python developer or a data engineer, you have likely heard of Pydantic. This article is for you if you are unsure of its purpose, want to learn how to use it, or are looking for tips on applying it.

Pydantic, the data engineer's tool

Pydantic is a Python library that defines data models with explicit types and formats. It provides a declarative way to structure data and makes validation and transformation easier. It helps ensure data integrity and is used wherever data must be normalized and validated. Data engineers favor it for its serialization features, its fit with frameworks like FastAPI and warehouses like BigQuery, and its automatic documentation capabilities.

🧐 Other tools, such as JSON Schema, YAML, and Protobuf, share common features with Pydantic: they can also define data structures, validate inputs, and enforce schema specifications. In this article, I present Pydantic as I used it on my project, where it improved both my delivery time and my day-to-day development comfort.

🚀 Why not bring Pydantic into your data project? On ours, it saved implementation time, made onboarding onto the project faster, and cut data errors by 30%.

A simple data project example

The main issue is data quality

In my current data project, our mission is to achieve data quality and consistent transformations. The orchestrated ETL handles data from diverse sources and tries to preserve data integrity and quality throughout the transformations.

Let’s take a Python task consisting of 3 steps:

1 - Load data from a remote database locally
2 - Clean this data and transform it
3 - Load the processed data into the database

Not all data loaded from the database has the same schema: it comes from different sources, such as call centers or websites. The transformation, however, is the same. The problem is to apply a common transformation without losing information because of differing data patterns.

Python native solution

The problem was that modeling and validating objects was complex in plain Python. This snippet illustrates how difficult it is to use a Python dict as a data type, and the risk of introducing related bugs in ingestion, validation and transformation.

	def get_merged_contact(left: dict, right: dict) -> dict:
	    # current_date and merge_contact_addresses are helpers defined elsewhere in the project
	    merge_contact = dict()

	    left_date = left.get("creationDate") if left.get("creationDate") else None
	    right_date = right.get("creationDate") if right.get("creationDate") else None

	    if left_date is None:
	        if right_date is None:
	            merge_contact["creationDate"] = current_date()
	        else:
	            merge_contact["creationDate"] = right_date
	    elif right_date is None:
	        merge_contact["creationDate"] = left_date
	    elif right_date > left_date:
	        # both dates are set: keep the earliest one
	        merge_contact["creationDate"] = left_date
	    else:
	        merge_contact["creationDate"] = right_date

	    # use specific field merge method for other fields
	    merge_contact["addresses"] = merge_contact_addresses(left, right)
	    return merge_contact

Naive Python dict merge

This snippet illustrates how hard it is to ensure data quality within a merge method. Type and quality checks must be repeated, and the data being manipulated is not reliable: None values are ingested and used within transformations. It is also hard to maintain clean code and to build common patterns for distinct data schemas.

Pydantic, our game-changer

Initially, my team and I devised a unified schema and meticulously mapped each source to it. Yet maintaining consistent data patterns across transformation and validation was a challenge, which led us to lean heavily on unit tests for validation.

As business needs evolved during the project, so did our data model. Having to modify the data model twice led us to Pydantic, a game-changer! Pydantic’s class definitions enabled precise data schemas, with custom types for each field. It simplifies and secures not only transformations, but also validation, ingestion and data exposure.

Perfect ETL to Streamline Contact Data

Introduction

In the project, there’s a database housing client data along with their contact records. The orchestrated pipeline uses Python tasks like the one described above. The technologies used for the data pipeline are Airflow, BigQuery and dbt. You will find best practices for creating data pipelines with Airflow and Docker.

The aim is to maintain a single, updated contact record for each client across multiple interactions. To manage our incoming data, our team built a unified structure using Pydantic for a contact data model.

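	from typing import List, Optional

	from pydantic import BaseModel

	# build_bq_schema and replace_refs are project helpers (not shown);
	# ParsedDate and Address are defined in the validation snippet further down


	class TableResource(BaseModel):
	    # common methods shared by all our data classes
	    @classmethod
	    def bigquery_schema(cls) -> list:
	        return build_bq_schema(replace_refs(cls.model_json_schema()))


	class PivotContact(TableResource):
	    id_contact: str
	    source: Optional[str] = None
	    name: Optional[str] = None
	    birth_date: Optional[ParsedDate] = None
	    addresses: Optional[List[Optional[Address]]] = None
	    emails: Optional[List[Optional[str]]] = None
	    phone_numbers: Optional[List[Optional[str]]] = None
	    active: Optional[bool] = None
	    creation_date: Optional[ParsedDate] = None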

PivotContact Pydantic class

To solve our data quality and code base issues, we migrated our implementation to Pydantic. The PivotContact class gives incoming contacts from distinct sources a uniform shape. It declares a type, often optional, for every field, and uses specific types where data validation can be added.

The TableResource class, which extends Pydantic's native BaseModel class, holds the custom methods common to all our data classes. In this snippet, PivotContact inherits the bigquery_schema method, which builds a BigQuery schema from the model.
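To make the idea concrete, here is a minimal sketch of how a BigQuery schema could be derived from a model's JSON schema. It is not our project code: build_bq_schema_sketch, TableResourceSketch, MiniContact and the JSON_TO_BQ mapping are hypothetical names, and the sketch only handles flat scalar fields (no nested models, lists or $ref resolution).

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel

# Assumed mapping from JSON Schema scalar types to BigQuery column types.
JSON_TO_BQ = {
    "string": "STRING",
    "integer": "INTEGER",
    "number": "FLOAT",
    "boolean": "BOOLEAN",
}


def build_bq_schema_sketch(json_schema: dict) -> list:
    """Hypothetical stand-in for build_bq_schema: flat scalar fields only."""
    required = set(json_schema.get("required", []))
    fields = []
    for name, prop in json_schema["properties"].items():
        # Optional[X] is rendered as anyOf [X, null]; keep the non-null branch.
        variants = prop.get("anyOf", [prop])
        spec = next(v for v in variants if v.get("type") != "null")
        bq_type = "DATE" if spec.get("format") == "date" else JSON_TO_BQ[spec["type"]]
        mode = "REQUIRED" if name in required else "NULLABLE"
        fields.append({"name": name, "type": bq_type, "mode": mode})
    return fields


class TableResourceSketch(BaseModel):
    @classmethod
    def bigquery_schema(cls) -> list:
        return build_bq_schema_sketch(cls.model_json_schema())


class MiniContact(TableResourceSketch):
    id_contact: str
    active: Optional[bool] = None
    creation_date: Optional[date] = None


schema = MiniContact.bigquery_schema()
for field in schema:
    print(field)
```

Because the schema is computed from the class itself, adding a field to the model is enough to change the table definition, which is the main reason to put such a method on a shared base class.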

Multi-source data ingestion

On any data platform, the first step is to ingest data. Because source schemas differ while validation and transformation are common to all data, a common data schema is required, and every source is mapped to it. Let's dive deeper into the practical implementation:

	common_contact = get_mapped_contact(contact, source_name="call_center")
	pydantic_common_contact = PivotContact(**common_contact)

PivotContact instantiation

This snippet shows how Pydantic ensures that the contact object, once mapped, conforms to the standard PivotContact data model. This interface is the only entry point: from then on, all objects in use are Pydantic objects. Objects obtained after mapping and validation have known, validated values, and all subsequent operations are done on these quality objects.
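The mapping step itself is not shown in this article, so here is one way it could look. This is a sketch under assumptions: FIELD_MAPS, get_mapped_contact_sketch, MiniPivotContact and the source field names are all hypothetical, chosen only to show two sources converging on one model and a bad record being rejected at the boundary.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError

# Hypothetical per-source mapping: source field name -> common field name.
FIELD_MAPS = {
    "call_center": {"contactId": "id_contact", "fullName": "name", "mails": "emails"},
    "website": {"id": "id_contact", "username": "name", "email_list": "emails"},
}


class MiniPivotContact(BaseModel):
    id_contact: str
    source: Optional[str] = None
    name: Optional[str] = None
    emails: Optional[List[str]] = None


def get_mapped_contact_sketch(contact: dict, source_name: str) -> dict:
    """Rename the fields of one source record to the common schema."""
    mapping = FIELD_MAPS[source_name]
    mapped = {common: contact[raw] for raw, common in mapping.items() if raw in contact}
    mapped["source"] = source_name
    return mapped


raw = {"contactId": "c-1", "fullName": "Ada", "mails": ["ada@example.com"]}
contact = MiniPivotContact(**get_mapped_contact_sketch(raw, "call_center"))

# A record without the required id never becomes a model instance.
try:
    MiniPivotContact(**get_mapped_contact_sketch({"username": "Bob"}, "website"))
    rejected = False
except ValidationError as exc:
    rejected = True
    error_fields = [err["loc"][0] for err in exc.errors()]
```

Catching ValidationError at ingestion is a design choice: invalid records can be logged or routed aside there, instead of surfacing as None-related bugs deep inside a transformation.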

How to validate data with Pydantic?

The project data model requires data validation, for example on contact addresses, which have to respect a schema. Every contact must have a unique id and a valid creation date. Pydantic's base model lets us build an Address class for the addresses field and a custom validator for date types in contact data. This validator is a BeforeValidator.

A BeforeValidator applies to a field of a Pydantic model. It runs before Pydantic's own validation of the field, so it can check the raw value and transform it according to the rules chosen for that field. The value then goes through the field's standard validation and therefore respects its constraints.

Now, let's look at robust data validation in practice:

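	from datetime import date, datetime
	from typing import Annotated, Optional

	from pydantic import BaseModel, BeforeValidator

	# validate_datetime is a project helper (not shown);
	# it is assumed to return a datetime, a date or None


	def validate_date(d: Optional[str]) -> Optional[date]:
	    """Parse and validate a date from string `d`"""
	    datetime_parsed = validate_datetime(d)
	    # check datetime first: datetime is a subclass of date
	    if isinstance(datetime_parsed, datetime):
	        return datetime_parsed.date()
	    if isinstance(datetime_parsed, date):
	        return datetime_parsed
	    return None


	ParsedDate = Annotated[Optional[date], BeforeValidator(validate_date)]


	class Address(BaseModel):
	    street_number: Optional[str] = None
	    address_line: Optional[str] = None
	    city_code: Optional[int] = None
	    country_code: Optional[str] = None
	    main_address: Optional[bool] = None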

Pydantic data validation

Here, Pydantic empowers data accuracy and consistency by enforcing validation protocols, ensuring the integrity of critical fields like addresses and dates.
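To see a BeforeValidator in action without our project helpers, here is a small self-contained sketch. parse_date_sketch, ParsedDateSketch and Event are hypothetical names, and the parsing rules (ISO strings only, empty string treated as missing) are assumptions made for the example, not the rules of our validate_datetime helper.

```python
from datetime import date, datetime
from typing import Annotated, Any, Optional

from pydantic import BaseModel, BeforeValidator


def parse_date_sketch(value: Any) -> Optional[date]:
    """Hypothetical parser: accepts None, date, datetime or ISO strings."""
    if value is None or value == "":
        return None
    if isinstance(value, datetime):  # check datetime first: it subclasses date
        return value.date()
    if isinstance(value, date):
        return value
    # Malformed strings raise ValueError, which Pydantic reports as a validation error.
    return datetime.fromisoformat(value).date()


ParsedDateSketch = Annotated[Optional[date], BeforeValidator(parse_date_sketch)]


class Event(BaseModel):
    creation_date: ParsedDateSketch = None


from_string = Event(creation_date="2024-01-16T09:30:00")
from_empty = Event(creation_date="")
print(from_string.creation_date, from_empty.creation_date)
```

Because the validator is attached to the type alias rather than to a model, every field annotated with it gets the same parsing, which is what lets several models share one date policy.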

Share transformations with Pydantic

In the data project, loading information to BigQuery involves transformation and serialization, critical steps for efficient integration. Pydantic plays a pivotal role by serializing structured data seamlessly, ensuring compatibility and proper formatting for BigQuery storage.

	from datetime import date


	def merge(left: PivotContact, right: PivotContact) -> PivotContact:
	    # Merge 2 contacts and return updated contact
	    # simple field merge: keep the id of the most recently created contact
	    # (assumption: a missing creation_date is treated as the oldest)
	    left_date = left.creation_date or date.min
	    right_date = right.creation_date or date.min
	    id_contact = left.id_contact if left_date > right_date else right.id_contact

	    # use specific field merge method for other fields
	    addresses = merge_contact_addresses(left.addresses, right.addresses)
	    # id_contact is required, so the model is built once all fields are known
	    return PivotContact(id_contact=id_contact, addresses=addresses)


	def merge_and_serialize(left: PivotContact, right: PivotContact) -> dict:
	    # Simplified serialization function leveraging Pydantic
	    merged_contact = merge(left, right)
	    return merged_contact.model_dump()

Pivot contact transformation

This simplified code snippet demonstrates how Pydantic facilitates the serialization process, enabling structured data to be formatted and prepared for smooth integration into BigQuery, maintaining consistency and adherence to required specifications.
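One detail worth illustrating is the shape of the serialized row. The sketch below uses hypothetical models (AddressSketch, ContactSketch) and assumes rows are sent to BigQuery as JSON, in which case date objects need to become strings; whether you need that depends on how your loader works.

```python
from datetime import date
from typing import List, Optional

from pydantic import BaseModel


class AddressSketch(BaseModel):
    city_code: Optional[int] = None
    main_address: Optional[bool] = None


class ContactSketch(BaseModel):
    id_contact: str
    name: Optional[str] = None
    creation_date: Optional[date] = None
    addresses: Optional[List[AddressSketch]] = None


contact = ContactSketch(
    id_contact="c-1",
    creation_date=date(2024, 1, 16),
    addresses=[AddressSketch(city_code=75001, main_address=True)],
)

# Default mode keeps Python objects such as date.
python_row = contact.model_dump()
# JSON mode converts them to JSON-compatible values (ISO date strings).
json_row = contact.model_dump(mode="json")
# exclude_none drops unset optional fields from the row.
compact_row = contact.model_dump(mode="json", exclude_none=True)
print(json_row)
```

Nested models are serialized recursively, so the addresses list comes out as a list of plain dicts in all three cases.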

My tips 🌈

Our data transformation involves processes like updating contact records and deduplicating contacts. Used alongside tools like dbt and Kubernetes, Pydantic keeps these transformations efficient!

Applicability and Wider Usage

The scenarios we’ve outlined aren’t unique; they echo common challenges in many data projects. Pydantic’s versatility extends far beyond our specific use case, for example:

Its seamless integration with frameworks like FastAPI makes it a preferred choice for API creation and data exposure.
It is also used alongside relational databases like Postgres.

Conclusion

Pydantic's data validation and construction methods have been a game-changer for our project, elevating data clarity and eradicating the risks associated with invalid information propagation. Implementing Pydantic slashed data error rates by 30%, showcasing its tangible impact.

An upcoming article will compare Pydantic in more depth with other tools like JSON Schema, YAML, and Protobuf.

🚀 Ready to revolutionize your data projects with Pydantic, as we did on this one? Contact us!
