TL;DR - I build a serverless ETL process using orchestrated Azure Durable Functions in Python, implement a Fan-Out/Fan-In parallelisation pattern, and dive into monitoring and logging best practices.
Extract, Transform, and Load
Let’s give ourselves an ETL use case: I have a set of 100 restaurants, each of which uploads its latest sales every hour as a CSV file (0.5 GB on average).
Specifically, these files are stored in a Blob Storage with the following file structure:
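The exact layout isn’t critical; a hypothetical structure could look like this, with one folder per restaurant and one timestamped CSV per hourly upload:

```
sales-container/
├── resto_001/
│   ├── 2021-06-01T09-00.csv
│   └── 2021-06-01T10-00.csv
└── resto_002/
    └── 2021-06-01T10-00.csv
```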
To help the main office’s analytics, our job is to extract the uploaded data, transform it (applying some simple filtering), and load it into the company’s Data Lake each hour. We also have to post the best-selling restaurant of the past hour to an API.
Choosing your service
Several services in Azure come to mind when building an ETL process:
- Data Factory - a low-code, managed ETL service
- Databricks - an Apache Spark-based big data analytics service
- Functions - serverless code execution which can be triggered by HTTP, CRON...
Let’s focus on exploring the Azure Functions option, as it’s by far the cheapest one around!
Azure Durable Functions
Azure Functions are great for simple standalone tasks that require a limited amount of resources (a 10-minute timeout and 1.5 GB of RAM per instance on the Consumption plan). In our case, a single traditional Azure Function falls a bit short...
This is where Azure Durable Functions comes in: this extension allows us to design stateful workflows while remaining within the Azure Functions service (and in Python!). By orchestrating several instances, we can process files in parallel!
Simply put, Azure Durable Functions introduces 4 new Azure Function types:
- Orchestrator functions - These implement the workflow.
- Activity functions - These do the work (computation, I/O...).
- Entity functions - These manage state, even in a serverless paradigm.
- Client functions - These trigger orchestrations, and can themselves be triggered by traditional bindings (HTTP, CRON...)
Designing the Workflow
Here are a few key concepts for designing an orchestration:
- Idempotency - As a general rule, ETLs should be idempotent (more details below)
- Limitations - Activity Functions have the same limitations as regular Azure Functions: the data volume of your use case must stay within these limits
- Workflow/Work separation - Computation should never be done in Orchestrator Functions
- At-least-once execution - Orchestrators only guarantee that each Activity Function is executed at least once
A few more details on the first point: for many reasons (an error, a change in the source data...), it’s very practical to be able to re-trigger the pipeline without having to deal with side effects. This property is known as idempotency:
“An operation is idempotent when calling it multiple times does not have a different result than calling it once.”
A common Data Engineering application of this concept is using overwrite instead of append. If a pipeline with an append operation runs twice, you may create duplicates!
Another important point: at-least-once execution means that any Activity Function risks being called twice! Beyond the pipeline as a whole, it’s good practice to make sure that each individual Activity Function is idempotent as well.
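To make the overwrite-versus-append difference concrete, here is a minimal sketch, with plain files standing in for the Data Lake (the function names are illustrative):

```python
from pathlib import Path


def load_overwrite(rows, out_path):
    """Idempotent load: re-running replaces the file with identical content."""
    Path(out_path).write_text("\n".join(rows) + "\n")


def load_append(rows, out_path):
    """Non-idempotent load: every re-run duplicates the rows."""
    with open(out_path, "a") as f:
        f.write("\n".join(rows) + "\n")
```

Running `load_overwrite` twice leaves a single copy of the rows; running `load_append` twice leaves two.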
Keeping this in mind, let’s list out the basic tasks (activities) we need:
- List files - List uploaded files to process.
- Process file - Extract, transform, load, and compute the restaurant’s total.
- Update ranking - Post new ranking to an API.
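As an illustration, the core of the Process file activity might look like the following sketch, assuming a hypothetical `item,amount` CSV schema (the post doesn’t detail the real filtering rules):

```python
import csv
import io


def process_file(csv_text):
    """Illustrative "Process file" activity core: filter rows, return the total.

    Assumes a hypothetical `item,amount` schema; the real transform
    would also write the filtered rows out to the Data Lake.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0.0
    for row in reader:
        amount = float(row["amount"])
        if amount > 0:  # simple filtering: drop refunds / corrections
            total += amount
    return total
```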
The Durable Functions documentation provides a sample parallelisation pattern called Fan-Out/Fan-In. Let’s draw out the workflow!
Coding an Orchestration
Let’s implement the previous workflow in Python code!
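Here is a minimal sketch of the Orchestrator Function (activity names are illustrative; in a real Function App you would register this generator with `df.Orchestrator.create` from the `azure-functions-durable` package, which injects the `context` object):

```python
def orchestrator_function(context):
    """Fan-Out/Fan-In orchestration of the hourly ETL.

    `context` is the DurableOrchestrationContext provided by the
    Durable Functions runtime.
    """
    # 1. List the CSV files uploaded during the past hour
    files = yield context.call_activity("ListFiles", None)

    # 2. Fan-out: schedule one "ProcessFile" activity per file...
    tasks = [context.call_activity("ProcessFile", f) for f in files]

    # ...and fan-in: this checkpoint resumes once ALL activities are done,
    # returning the set of per-restaurant totals
    totals = yield context.task_all(tasks)

    # 3. Post the new ranking, built from the full set of totals
    yield context.call_activity("UpdateRanking", totals)
```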
Although the code is rather straightforward, there are a couple of subtleties:
- The yield keyword acts as a checkpoint for Azure Durable Functions: the orchestrator awaits completion of the called activities at these checkpoints.
- For our ETL to be idempotent, the loading of the data must overwrite and not append.
- Considering the full set of total sales allows our ranking to remain correct even if a Process File Activity Function is executed twice (see the previous section).
Azure Durable Functions can be coded in the same language as your functional code, and synchronised in a single repository with Git!
Monitoring Durable Functions
Azure Functions being one of the least managed Azure services, it doesn’t come with a built-in monitoring interface.
In addition, Orchestrator Functions will be “invoked” several times during their execution, each time with a different ID. Tracking them throughout their invocations, as well as the Activity Functions they launch, isn’t an easy task!
Activity Function Monitoring
By default, each Azure Function execution generates a set of logs that are perfect for monitoring individual Activity Functions. Here are a few examples of what can be done!
Activity Function Failures
We can query the exceptions table to count exceptions per activity.
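For example, on the Application Insights `exceptions` table (the time window is up to you):

```kusto
exceptions
| where timestamp > ago(24h)
| summarize failures = count() by operation_Name
| order by failures desc
```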
Activity Function Duration
To stay clear of the timeout, it’s also interesting to monitor activity duration (average and max) through the requests table:
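For example (`duration` is reported in milliseconds):

```kusto
requests
| where timestamp > ago(24h)
| summarize avg_duration_ms = avg(duration), max_duration_ms = max(duration) by name
| order by max_duration_ms desc
```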
With Azure Shared Dashboards, you can create live graphs built on queries of various Azure resources. In our case, we can pin the graphs created by our previous Kusto queries to our dashboard (a tutorial for this can be found here).
Orchestrator Function Monitoring
The Durable Functions Monitor VS Code extension directly queries the queues and storage used under the hood by Azure Durable Functions, sidestepping the problem of tracking the multiple Orchestrator Function invocations (see Task Hubs for more details).
Orchestration-level visualisations such as the Gantt chart can provide real insight into the state of an orchestration, and help with debugging:
Custom Logging in Azure Functions
In order to have use-case specific monitoring, we can log additional custom structured data to the default log database, query it with Kusto, and add the generated graphs and tables to our Shared Dashboard!
Structured data can be added to the custom dimensions field of Azure Function traces via the OpenCensus log handlers. OpenCensus, an open-source library specialised in collecting distributed tracing, metrics, and logging telemetry, provides dedicated Azure Monitor exporters through which you can send these custom logs to Application Insights.
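A sketch of the logging pattern (the handler class below is a stdlib stand-in so the example runs anywhere; in the Function App you would attach `AzureLogHandler` from `opencensus-ext-azure` instead, and the message and field names are illustrative):

```python
import logging


class CapturingHandler(logging.Handler):
    """Stand-in for OpenCensus's AzureLogHandler, which would instead ship
    the record (custom_dimensions included) to Application Insights."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger("etl")
logger.setLevel(logging.INFO)
handler = CapturingHandler()
logger.addHandler(handler)

# Structured data goes in the `custom_dimensions` key of `extra`;
# AzureLogHandler maps it to the customDimensions column of the traces table.
logger.info(
    "hourly_total",
    extra={"custom_dimensions": {"restaurant": "resto_1", "total_sales": 1234.5}},
)

print(handler.records[0].custom_dimensions)
```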
We can then extract this information from custom dimensions with a Kusto query to plot a given restaurant’s total sales over time!
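For example, assuming a hypothetical `hourly_total` custom log carrying `restaurant` and `total_sales` custom dimensions:

```kusto
traces
| where message == "hourly_total"
| extend restaurant = tostring(customDimensions.restaurant),
         total_sales = todouble(customDimensions.total_sales)
| where restaurant == "resto_1"
| project timestamp, total_sales
| render timechart
```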
Through these examples, we’ve covered the following:
- Azure Durable Functions core concepts
- Best practices for workflow design
- Code example of a Fan-Out/Fan-In pattern
- Monitoring Activity Functions with Shared Dashboards
- Monitoring Orchestrator Functions with Durable Functions Monitor
- Custom logging to take ETL monitoring further
If you have questions or would like some more details, don’t hesitate to contact us!