July 4, 2023 • 6 min read

Toward Data Mesh: why your data team is about to explode

Written by Vincent Doba

The time of the isolated data team is coming to an end. Unable to deliver reports on time. Creating reports full of errors. Maintaining data pipelines that constantly fail. The isolated data team’s days are numbered. A new paradigm emerges, opening data teams to product teams, promising to solve issues encountered by isolated data teams. And, maybe, leading to the end of the data engineer role.

In this article, I will not cover the use of data for AI. In most data teams I have been in, AI was always a faraway dream. My day-to-day job was to transform data to provide reports to management.

Isolated data team

What do I call an isolated data team?

Almost every time I worked as a data engineer, I worked in a data team. A data team is a group of data engineers and data analysts. This team retrieves data from all parts of the company and produces reports for all parts of the company.

A Data Team retrieves data and produces reports

To do so, data engineers create a data pipeline. It consists of jobs that retrieve data from company applications, restructure it, and save it into a database. Next, data analysts connect to this database using BI tools. With those tools, they produce reports or metrics for managers from other parts of the company.
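
To make the idea of such a pipeline concrete, here is a minimal sketch in Python. It assumes a hypothetical `orders` table in an application database and a hypothetical `revenue_by_customer` reporting table, and uses in-memory SQLite in place of real systems. None of these names come from an actual project; they are only there for illustration.

```python
from __future__ import annotations

import sqlite3


def extract(source: sqlite3.Connection) -> list[tuple]:
    """Retrieve raw rows from the (hypothetical) application database."""
    return source.execute(
        "SELECT id, customer, amount_cents FROM orders"
    ).fetchall()


def transform(rows: list[tuple]) -> list[tuple]:
    """Restructure raw orders into total revenue per customer."""
    totals: dict[str, int] = {}
    for _order_id, customer, amount_cents in rows:
        totals[customer] = totals.get(customer, 0) + amount_cents
    # Convert cents to currency units for reporting.
    return [(customer, cents / 100) for customer, cents in sorted(totals.items())]


def load(warehouse: sqlite3.Connection, rows: list[tuple]) -> None:
    """Save the restructured rows into the reporting database."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS revenue_by_customer "
        "(customer TEXT PRIMARY KEY, revenue REAL)"
    )
    warehouse.executemany(
        "INSERT OR REPLACE INTO revenue_by_customer VALUES (?, ?)", rows
    )
    warehouse.commit()


def run_pipeline(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> None:
    """One pipeline run: retrieve, restructure, save."""
    load(warehouse, transform(extract(source)))


def make_demo_source() -> sqlite3.Connection:
    """Build a tiny stand-in for an application database (demo data only)."""
    app_db = sqlite3.connect(":memory:")
    app_db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER)"
    )
    app_db.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "acme", 1250), (2, "globex", 900), (3, "acme", 750)],
    )
    app_db.commit()
    return app_db
```

Calling `run_pipeline(make_demo_source(), sqlite3.connect(":memory:"))` fills the reporting table that a BI tool would then query. A real pipeline would add scheduling, incremental loads, and error handling, which this sketch leaves out.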

What are the shortcomings of an isolated data team?

Isolated data teams take too much time to build new reports

In an isolated data team, you have to pass through several steps to build a new report. You can see them in the diagram below. Members of the product team do the steps in blue. Members of the data team do the steps in green.

Traditional report creation flow: several steps involving both the data team and the product team

You have seven steps to get your report, and each step can take one or two weeks. So to get a report, you have to wait one to two months at a minimum. Worse, each handoff from one team to another adds a lot of lead time. For instance, the product team may have other priorities than providing new data for reports, which can delay the creation of new reports by months. For example, in one of my missions, I waited nine months to get missing data for a report. In the end, we canceled the new report because we could not get this data.
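
The arithmetic behind this estimate can be sketched as follows. This assumes the seven steps run strictly one after another, and the `handoff_delay_weeks` parameter is a hypothetical way to model time lost waiting on another team.

```python
from __future__ import annotations


def lead_time_weeks(
    steps: int,
    min_weeks_per_step: int,
    max_weeks_per_step: int,
    handoff_delay_weeks: int = 0,
) -> tuple[int, int]:
    """Return (best case, worst case) lead time in weeks for sequential steps.

    handoff_delay_weeks is extra waiting time added to both cases, for
    instance when another team has to prioritize the request first.
    """
    best = steps * min_weeks_per_step + handoff_delay_weeks
    worst = steps * max_weeks_per_step + handoff_delay_weeks
    return best, worst


# Seven steps of one to two weeks each, with no waiting between teams.
no_wait = lead_time_weeks(steps=7, min_weeks_per_step=1, max_weeks_per_step=2)

# The same flow with a hypothetical four-week wait on another team.
with_wait = lead_time_weeks(7, 1, 2, handoff_delay_weeks=4)
```

With no waiting at all, the range is 7 to 14 weeks, so the best case already sits between one and two months. Any delay at a handoff pushes both ends of the range up by the same amount.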

To avoid that, you can tell all product teams to pour all the company’s data into one giant data lake. Or you can give your data engineers access to all internal databases so they can copy the data into your data lake themselves. Then, when a manager wants a new report, it follows this creation flow:

Simplified report creation flow, without product teams

By doing so, data engineers don’t need to ask for new data. Everything is already in the data lake and they can pick any data they want. However, it creates two new issues: breaking changes, and data misunderstandings.

Isolated data teams manipulate data that they don’t understand

So now data engineers don’t have to ask product teams for data. You’ve saved between two weeks and six months in your report creation flow. But what data are pushed to your data lake? Usually, dumps of applications’ internal databases: for instance, SQL dumps or CSV files loaded every day onto a distributed filesystem.

But those dumps don’t contain any documentation or metadata except the loading date. So when data engineers try to use them, they have to ask product teams what the data mean. And we’re back to the traditional report creation flow, only slightly improved: instead of asking product teams to generate data, data engineers ask them what those data mean. This creates endless email chains of questions and answers stretching over several weeks.

To avoid sending emails with questions, data engineers end up trying to guess what the data mean. This creates a lot of discrepancies between the real business meaning of the data and what the data engineers understood of it. A report built on such data ends up wrong; for instance, some business rules may not be applied. And when the managers look at the report, they realize that the numbers displayed are not the ones they see in the business application, which leads to distrust in the data team.

Reports built by isolated data teams often break due to data changes

Moreover, product teams don’t feel responsible for the data sent to or retrieved by the data team. Sometimes, they don’t even know that the data team is reading from their main database. So, when they change their data schema, they don’t tell the data team about it.

As a result, from time to time, the data pipeline and reports break. And again, it creates distrust in reports built by the data team. You don’t trust a report that is randomly out of date and that, when it breaks, takes one or two weeks to come back up.
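
One way to notice such silent schema changes before a report breaks is to compare the schema the pipeline expects with the schema it actually receives. The sketch below is only an illustration: the function name, the column names, and the rule that added columns are harmless are all assumptions, not a description of any particular tool.

```python
from __future__ import annotations


def find_breaking_changes(expected: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Compare two {column: type} schemas and list changes that would break a job.

    Removed or renamed columns and type changes are reported. Columns that
    were only added are ignored, on the assumption that existing jobs do not
    read them.
    """
    problems = []
    for column, expected_type in expected.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != expected_type:
            problems.append(
                f"type changed for {column}: {expected_type} -> {actual[column]}"
            )
    return problems


# Schema the reporting job was written against (hypothetical).
expected_schema = {"id": "INTEGER", "customer": "TEXT", "amount_cents": "INTEGER"}

# Schema after the product team renamed a column and changed a type (hypothetical).
actual_schema = {"id": "TEXT", "customer": "TEXT", "amount": "REAL"}

problems = find_breaking_changes(expected_schema, actual_schema)
```

Running this check at the start of each pipeline run turns a silently wrong report into an explicit failure with a readable message. It does not remove the root cause, which is that nobody told the data team about the change.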

Congratulations! With your isolated data team, you built a data swamp. You have a pile of data. Some of it changes randomly. Some of it nobody understands.

Data Mesh: infuse data into all your product teams

Data teams have to integrate into product teams

By creating isolated data teams, we cut data processing and report creation off from product teams. A detached team has to build reports for product teams, leading to delays, loss of information, and miscommunication. We have to integrate the data team into product teams instead of isolating it.

Two new roles: Analytics Engineers and Data Platform Engineers

Data engineering combines two roles: building and maintaining a data platform, and using this data platform to process data to fulfill reporting needs. In the article “Data Organization: why are there so many Roles?”, Furcy Pin calls those two roles the DataSecOps and the Analytics Engineer. Another name for DataSecOps is Data Platform Engineer.

As you can build a data platform in relative isolation, your data platform engineers can stay in your data team. However, to process data, you have to work with the product teams, so your analytics engineers will have to move toward them. Your data team will split into data platform engineers, who maintain the data platform, and analytics engineers, who join, or at least get closer to, product teams. You can end up with the following organization:

Product teams do data engineering, Data team manages the data platform

You have a data team that only does infrastructure work, with all the data processing done by analytics engineers and data analysts integrated into product teams.

A first step toward data mesh

Having analytics engineers in product teams helps implement two of the Data Mesh principles: domain-oriented decentralized ownership and data as a product. If the analytics engineers join product teams, their data pipelines are now owned by product teams. And as analytics engineers and data analysts create reports within the product teams, reports and their underlying data are now a product of product teams, not something outsourced to a data team.
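
As a rough illustration of what treating data as a product can mean in practice, a product team could publish a small descriptor alongside its dataset, stating who owns it and what each column means. The `DataProduct` and `Column` classes below are hypothetical and are not part of any Data Mesh specification or tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One column of a published dataset, with its business meaning."""
    name: str
    dtype: str
    description: str = ""


@dataclass(frozen=True)
class DataProduct:
    """A dataset published by a product team, with an explicit owner."""
    name: str
    owner_team: str
    columns: tuple[Column, ...]

    def undocumented_columns(self) -> list[str]:
        """Names of columns that are missing a business description."""
        return [c.name for c in self.columns if not c.description.strip()]


# Hypothetical dataset owned by a hypothetical checkout team.
orders = DataProduct(
    name="orders",
    owner_team="checkout-team",
    columns=(
        Column("id", "INTEGER", "Unique order identifier"),
        Column("amount_cents", "INTEGER", "Order total in cents, taxes included"),
        Column("status", "TEXT"),  # description still missing
    ),
)
```

Here `orders.undocumented_columns()` points at the `status` column, so the owning team knows what is left to document before other teams have to guess its meaning.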

Advantages of moving the data team into product teams

This changes everything. Reports are now faster to produce, as data analysts sit next to data producers. Data engineers don’t have to guess business rules, as they now know how the data is produced. Breaking changes in data disappear, as everyone in the team is aware of how the data is used.

I’m in a small company: how can I integrate my data team with product teams?

For small companies, it makes sense to keep a data team. You can’t have one analytics engineer in each of your product teams. In small companies, product teams are next to the data team, so you have less miscommunication than in bigger companies. However, each product team should have a referent in this data team: a data engineer or a data analyst who works closely with this product team and knows its business needs. This is what Snaptravel did for their data team.

By doing so, you do not dilute ownership and knowledge of the data from a specific product. The referent owns and knows the data from a product team. But, if needed, you can call on all data team members to fulfill a new business need, led by the referent.

The next step: the disappearance of the data engineer

There will always be data engineering. But will there always be specialized professionals doing it? Once you have split your data team between analytics engineers going to product teams and data platform engineers doing platform maintenance, they can merge into the more classic roles of software engineer and DevOps.

With better tooling, the data engineer role may merge with the software engineer and DevOps roles

The disappearance of analytics engineers

Previously, developing data pipelines required its own area of expertise. You had to know how to load data, how to distribute your jobs, how to manage historical data, how to orchestrate your jobs, etc. Today, thanks to tools like dbt, Delta Lake, Snowflake, Airflow, Fivetran, Airbyte, and many others, you can have a performant data pipeline without mastering all of that. And in the future, you will have no-code data transformation tools. With this, do data analysts still need analytics engineers to prepare data? If they need new data, data analysts will ask a software engineer from a product team to make it available. Then, the data will be automatically loaded using Airbyte or Fivetran and made available for analytics.

The disappearance of data platform engineers

Previously, if you wanted a data platform, you needed to manage a complex Hadoop cluster. Not so long ago, you had to integrate a lot of different tools to build a data platform. But today, complete data platforms are blooming. For instance, Databricks, starting from Spark, has developed its own data storage format, Delta Lake, and offers its own BI tool with Redash, an orchestration tool with Delta Live Tables, and a data governance tool with Unity Catalog, all available on the same platform. Does it still make sense to have a data platform engineer? You can set up a complete data platform in a few minutes, and DevOps can automate its deployment without specific data knowledge.

However, data engineers are not going to disappear just yet. The tools and platforms are still maturing, or do not exist yet. But one thing is sure: the data engineer role will change a lot in the near future.

Conclusion

It is now easy to set up a data platform and to start manipulating data. Isolated data teams don’t make sense anymore. Now they create more problems than they solve. It is time to give product teams ownership of their data. And the sooner the better.

Planning to bring your product and data teams closer together, but not sure how to do it? Don’t hesitate to contact us!

 
