Data Engineering

April 8, 2022 • 4 min read

Azure Durable Functions for ETL Orchestration

Written by Mathieu Soul

Azure Durable Functions - Fan-Out/Fan-In

TL;DR - I build a serverless ETL process using orchestrated Azure Durable Functions in Python, implement a Fan-Out/Fan-In parallelisation pattern, and dive into monitoring and logging best practices.

Extract, Transform, and Load

Let’s give ourselves an ETL use case: I have a set of 100 restaurants that each upload their latest sales every hour as a CSV file (0.5 GB on average):

timestamp	item_ID	item_name	qty
2021-01-01 19:21:16	1234	Fries	2
2021-01-01 19:23:45	4321	Large Drink	1
2021-01-01 19:24:16	5678	Hamburger	3
2021-01-01 19:24:32	8765	Nuggets	1

Example of an uploaded CSV file

Specifically, these files are stored in a Blob Storage with the following file structure:

	root/
	- 2021/ # Year
	  - 1/ # Month
	    - 1/ # Day
	      - 0/ # Hour
	        - restaurant_1.csv
	        - restaurant_2.csv
	        ...

Blob Storage file structure
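As a minimal sketch of how a run could locate the files for a given hour, the helper below builds the folder prefix from a timestamp. The function names and the `root` default are illustrative assumptions, not part of the original pipeline; the only thing taken from the listing is the year/month/day/hour layout without zero-padding.

```python
from datetime import datetime


def hourly_prefix(ts: datetime, root: str = "root") -> str:
    """Return the folder prefix holding the files uploaded during ts's hour."""
    # Components are not zero-padded, matching the listing (2021/1/1/0).
    return f"{root}/{ts.year}/{ts.month}/{ts.day}/{ts.hour}/"


def restaurant_blob_path(ts: datetime, restaurant_id: int, root: str = "root") -> str:
    """Return the full path of one restaurant's CSV file for that hour."""
    return f"{hourly_prefix(ts, root)}restaurant_{restaurant_id}.csv"


print(hourly_prefix(datetime(2021, 1, 1, 0)))            # root/2021/1/1/0/
print(restaurant_blob_path(datetime(2021, 1, 1, 0), 1))  # root/2021/1/1/0/restaurant_1.csv
```

Deriving the prefix from the processed hour (rather than from "now") also makes it easy to re-run the pipeline for a past hour.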

To support the main office’s analytics, our job is to extract the uploaded data, transform it (applying some simple filtering), and load it into the company’s data lake each hour. We also have to post the best-selling restaurant of the past hour to an API.

Choosing your service

Several services in Azure come to mind when building an ETL process:

Data Factory - a low-code, managed ETL service
Databricks - an Apache Spark-based big data analytics service
Functions - serverless code execution which can be triggered by HTTP, CRON...

Let’s focus on exploring the Azure Functions option, as it’s by far the cheapest one around!

Azure Durable Functions

Azure Functions are great for simple standalone tasks that require a limited amount of resources (10 min timeout, and 1.5 GB of RAM per instance in the Consumption Tier). In our case, with roughly 50 GB of new data each hour (100 files of 0.5 GB), a single traditional Azure Function falls a bit short...

This is where Azure Durable Functions comes in: this extension allows us to design stateful workflows while remaining in the Azure Functions service (and in Python!). By orchestrating several instances, we can process files in parallel!

Simply put, Azure Durable Functions introduces 4 new Azure Function types:

Orchestrator functions - These implement the workflow.
Activity functions - These do the work (computation, I/O...).
Entity functions - These create state, even in a serverless paradigm.
Client functions - These trigger orchestrations, and can themselves be triggered by traditional bindings (HTTP, CRON...)

Designing the Workflow

Here are a few key concepts for designing an orchestration:

Idempotency - As a general rule, ETLs should be idempotent (more details below)
Limitations - Activity Functions have the same limits as regular Azure Functions (timeout, memory): the data volume handled by each activity must stay within them
Workflow/Work separation - Computation should never be done in Orchestrator Functions
At least once execution - Orchestrators only guarantee that each Activity Function is executed at least once

A few more details on the first point: for many reasons (error, change in source data...) it’s very practical to be able to re-trigger the pipeline without having to deal with side effects. This concept is known as idempotency:

“An operation is idempotent when calling it multiple times does not have a different result than calling it once.”

A common Data Engineering application of this concept is using overwrite instead of append. If a pipeline with an append operation runs twice, you may create duplicates!

Another important point: at least once execution means that any Activity Function has a risk of being called twice! In addition to the pipeline, it’s a good practice to make sure that each individual Activity Function is idempotent as well.
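To make the overwrite-versus-append point concrete, here is a small sketch in which a plain dictionary stands in for the data lake. The `store` dictionary, the function names and the sample rows are all illustrative assumptions; a real load step would write to storage instead.

```python
from typing import Dict, List

# A dict standing in for the data lake: path -> rows (purely illustrative).
store: Dict[str, List[str]] = {}


def load_append(path: str, rows: List[str]) -> None:
    """Non-idempotent load: every call adds the rows again."""
    store.setdefault(path, []).extend(rows)


def load_overwrite(path: str, rows: List[str]) -> None:
    """Idempotent load: every call leaves the same final state."""
    store[path] = list(rows)


rows = ["Fries,2", "Hamburger,3"]

for _ in range(2):  # simulate the same activity being executed twice
    load_append("appended.csv", rows)
    load_overwrite("overwritten.csv", rows)

print(len(store["appended.csv"]))     # 4 (each row is duplicated)
print(len(store["overwritten.csv"]))  # 2
```

After two executions, the appended target holds duplicates while the overwritten one is unchanged, which is the property we want from every Activity Function.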

Keeping this in mind, let’s list out the basic tasks (activities) we need:

List files - List uploaded files to process.
Process file - Extract, transform, load, and compute the restaurant’s total.
Update ranking - Post new ranking to an API.
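The transform part of the Process file task could look like the sketch below. It makes several assumptions that the article does not specify: the file is comma-separated, the "simple filtering" is dropping rows with a non-positive quantity, and the restaurant's total is the sum of the `qty` column. The `transform` name is hypothetical, and the extract and load steps are left out.

```python
import csv
import io
from typing import Dict, List, Tuple

# The first four rows come from the sample file above; the last row is a
# made-up invalid record added only to exercise the filter.
SAMPLE_CSV = """timestamp,item_ID,item_name,qty
2021-01-01 19:21:16,1234,Fries,2
2021-01-01 19:23:45,4321,Large Drink,1
2021-01-01 19:24:16,5678,Hamburger,3
2021-01-01 19:24:32,8765,Nuggets,1
2021-01-01 19:25:00,0000,Invalid,0
"""


def transform(raw_csv: str) -> Tuple[List[Dict[str, str]], int]:
    """Drop rows with a non-positive quantity and sum the remaining quantities."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    kept_rows = [row for row in reader if int(row["qty"]) > 0]
    total_qty = sum(int(row["qty"]) for row in kept_rows)
    return kept_rows, total_qty


rows, total = transform(SAMPLE_CSV)
print(len(rows), total)  # 4 7
```

Because the function depends only on its input text, calling it twice on the same file gives the same result, which fits the at-least-once execution constraint.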

The Durable Functions documentation provides a sample parallelisation pattern called Fan-Out/Fan-In. Let’s draw out the workflow!

Coding an Orchestration

Let’s implement the previous workflow in Python code!

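	import azure.durable_functions as df


	def orchestrator_function(context: df.DurableOrchestrationContext):
	    # Call the list_files activity function
	    files_to_process = yield context.call_activity("list_files")

	    # Fan-out: schedule one process_files activity per file
	    process_files_activities = [
	        context.call_activity("process_files", filepath)
	        for filepath in files_to_process
	    ]
	    # Fan-in: wait until every scheduled activity has completed
	    total_sales_list = yield context.task_all(process_files_activities)

	    # Call update_ranking with the de-duplicated results. They are converted
	    # back to a list because activity inputs must be JSON-serializable; this
	    # assumes each result is hashable (e.g. a string or a number).
	    yield context.call_activity("update_ranking", list(set(total_sales_list)))

	    return "Success!"


	# Register the generator as the function's entry point
	main = df.Orchestrator.create(orchestrator_function)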

Azure Durable Orchestrator code

Although the code is rather transparent, there are a couple of subtleties:

The yield keyword acts as a checkpoint for Azure Durable Functions: the orchestrator awaits completion of the called activities at these checkpoints.
For our ETL to be idempotent, the loading of the data must overwrite and not append.
Passing the set of total sales keeps our ranking correct if a Process Files Activity Function is executed twice (see the previous section).
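To build intuition for the role of `yield`, the toy driver below runs a generator-based workflow in plain Python. This is only an illustration and not how the Durable Functions runtime is implemented: the real runtime persists the orchestration history and replays the orchestrator, while this sketch runs everything in-process and serially. The `run` driver, the `(name, argument)` request tuples and the fake activity results are all made up for the example.

```python
from typing import Any, Callable, Dict


def toy_orchestrator():
    # Each yield hands a request to the driver and pauses until a result is sent back.
    files = yield ("list_files", None)
    # Yielding a list loosely mimics task_all: one request per file, one result each.
    totals = yield [("process_files", path) for path in files]
    yield ("update_ranking", totals)
    return "Success!"


def run(orchestrator: Callable, activities: Dict[str, Callable[[Any], Any]]):
    """Drive the generator, executing requested activities at every yield."""
    gen = orchestrator()
    checkpoints = []
    try:
        request = next(gen)  # run until the first yield
        while True:
            if isinstance(request, list):  # a batch of requests
                result = [activities[name](arg) for name, arg in request]
            else:
                name, arg = request
                result = activities[name](arg)
            checkpoints.append(request)
            request = gen.send(result)  # resume the orchestrator with the result
    except StopIteration as stop:
        return stop.value, checkpoints


fake_totals = {"restaurant_1.csv": 7, "restaurant_2.csv": 12}  # made-up values
activities = {
    "list_files": lambda _: list(fake_totals),
    "process_files": lambda path: fake_totals[path],
    "update_ranking": lambda totals: None,
}

status, checkpoints = run(toy_orchestrator, activities)
print(status, len(checkpoints))  # Success! 3
```

The orchestrator itself performs no computation: it only describes which activities to call and in what order, which is the workflow/work separation described earlier.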

Azure Durable Functions can be coded in the same language as your functional code, and versioned alongside it in a single Git repository!

Monitoring Durable Functions

Azure Functions is one of the less managed services, so it doesn’t come with a built-in monitoring interface.

In addition, Orchestrator Functions will be “invoked” several times during their execution, each time with a different ID. Tracking them throughout their invocations, as well as the Activity Functions they launch, isn’t an easy task!

Invocations of an Orchestrator Function

Activity Function Monitoring

By default, each Azure Function execution generates a set of logs that are perfect for monitoring individual Activity Functions. Here are a few examples of what can be done!

Activity Function Failures

We can query the exceptions table to count exceptions per activity.

	// Kusto Query
	exceptions
	| where operation_Name != "orchestrator"
	| summarize exceptions=dcount(operation_Id) by operation_Name
	| render barchart;

Kusto query for Activity Function Failures

Activity Function Duration

To stay clear of the timeout, it’s also useful to monitor activity duration (average and max) through the requests table:

	// Kusto Query
	requests
	| where operation_Name != "orchestrator"
	| extend duration=duration/1000/60 // Convert to minutes
	| summarize mean_duration=avg(duration), max_duration=max(duration) by name
	| render barchart;

Kusto query for Activity Function Duration

Shared Dashboards

With Azure Shared Dashboards, you can create live graphs built on queries of various Azure resources. In our case, we can pin the graphs created by our previous Kusto queries to our dashboard (a tutorial for this can be found here).

Orchestrator Function Monitoring

The Durable Functions Monitor VS Code extension directly queries the queues and storage systems used under the hood by Azure Durable Functions, avoiding the problem of tracking the multiple Orchestrator Function invocations (see Task Hubs for more details).

Orchestration-level visualisations such as the Gantt chart can provide real insight into the state of an orchestration, and help with debugging:

Gantt chart showing the orchestration’s activity/entity runs

Custom Logging in Azure Functions

In order to have use-case specific monitoring, we can log additional custom structured data to the default log database, query it with Kusto, and add the generated graphs and tables to our Shared Dashboard!

Structured data can be added to the custom dimensions field of Azure Function traces through the OpenCensus log handlers. OpenCensus is an open-source library for collecting distributed traces, metrics and logging telemetry; it provides Azure Monitor exporters through which you can send these custom logs to Application Insights.

	import logging

	from opencensus.ext.azure.log_exporter import AzureLogHandler
	from src.constants import APP_INSIGHTS_CONNECTION_STRING

	logger = logging.getLogger(__name__)
	logger.addHandler(
	    AzureLogHandler(connection_string=APP_INSIGHTS_CONNECTION_STRING)
	)


	def log_restaurant_statistics(
	    restaurant_id: str,
	    processed_hour: str,
	    total_sales: float,
	    number_of_processed_rows: int,
	    number_of_filtered_rows: int,
	):
	    logger.info(
	        "Logging restaurant statistics.",
	        extra={
	            "custom_dimensions": {
	                "restaurant_id": restaurant_id,
	                "processed_hour": processed_hour,
	                "total_sales": total_sales,
	                "number_of_processed_rows": number_of_processed_rows,
	                "number_of_filtered_rows": number_of_filtered_rows
	            }
	        }
	    )

Custom logging in the Process Files Activity Function
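To see what the `extra` argument does without an Application Insights connection, the sketch below uses only the standard `logging` module. The `CaptureHandler` class, the logger name and the sample values are illustrative assumptions; the point is that the keys of `extra` become attributes on the log record, which is where an exporter such as `AzureLogHandler` can pick up `custom_dimensions`.

```python
import logging


class CaptureHandler(logging.Handler):
    """Keeps emitted records in memory so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


demo_logger = logging.getLogger("custom_dimensions_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output out of the root logger
capture = CaptureHandler()
demo_logger.addHandler(capture)

demo_logger.info(
    "Logging restaurant statistics.",
    extra={"custom_dimensions": {"restaurant_id": "restaurant_1", "total_sales": 7}},
)

record = capture.records[0]
print(record.getMessage())       # Logging restaurant statistics.
print(record.custom_dimensions)  # {'restaurant_id': 'restaurant_1', 'total_sales': 7}
```

Keeping the message text constant and putting the variable data in the dictionary makes the logs easy to filter on the message and then expand by key.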

We can then extract this information from custom dimensions with a Kusto query to plot a given restaurant’s total sales over time!

	// Kusto query
	traces
	| where message == "Logging restaurant statistics."
	| extend restaurant_id = tostring(customDimensions['restaurant_id'])
	| extend total_sales = todouble(customDimensions['total_sales'])
	| extend processed_hour = tostring(customDimensions['processed_hour'])
	| extend date = todatetime(processed_hour)
	// Keep only the latest log per restaurant and hour (handles re-runs)
	| summarize arg_max(timestamp, *) by restaurant_id, date
	| project restaurant_id, total_sales, date
	| render timechart;

Querying custom dimensions

Conclusion

Through our examples, we’ve covered the following:

Azure Durable Functions core concepts
Best practices for workflow design
Code example of a Fan-Out/Fan-In pattern
Monitoring Activity Functions with Shared Dashboards
Monitoring Orchestrator Functions with Durable Functions Monitor
Custom logging to take ETL monitoring further

If you have questions or would like some more details, don’t hesitate to contact us!

This article was written by

Mathieu Soul