Data Mesh Architecture
The data mesh architecture is all the rage nowadays, and for good reason. The data mesh brings to the data lakehouse what microservices brought to monolithic applications: decoupling.
Allow me to explain what I mean. In the past, when thinking about how to split up the work on a large application, management often focused on the technology layers, which led to UI teams, server-side logic teams, and database teams. When teams are segregated along these lines, even simple changes can take a long time and require budgetary approval.
Eventually, people came up with the idea of splitting teams according to business capabilities (i.e. microservices). A single cross-functional team would be in charge of the user experience, the database, project management, etc. Rather than having a central team of database administrators, you might have a database administrator on each team.
While this allows teams to iterate a lot faster, it comes at the cost of data integrity. To elaborate, different microservice databases might store the same underlying attribute (e.g. customer information) using subtly different semantics. This makes it difficult for business analysts to find the single source of truth. Thus, people invented the data warehouse. With a data warehouse, you as the user don't need to question the accuracy of the data or know which source system the information came from; you can just go ahead and query the table. Then, when data scientists started using unstructured data to train machine learning models, we created the data lakehouse architecture.
Unfortunately, the data lakehouse architecture recreates a lot of the same issues observed in the monolithic architecture, something I can attest to from personal experience. In my previous role, I was the sole data engineer on a team of analysts. Unfortunately, at the time of this writing, cross-functional teams in the big data space are still the exception rather than the rule. While interacting with the users of the data lakehouse, I observed that they typically lack the expertise to properly architect the gold layer (i.e. data marts) for their own consumption. You may be telling yourself that the data engineers responsible for maintaining the data lakehouse should create the tables for them. In practice, however, this doesn't work either because, more often than not, they lack the required business context. Not only does this kind of team topology introduce problems at the higher layers, but at the lower layers as well. For example, we were forced to constantly pester another team to make changes to the data downstream, and even then, it often took months.
This is where the data mesh architecture enters the picture. In a data mesh architecture, every cross-functional team is in charge of the entire data lifecycle. Every business domain, composed of one or multiple microservices, would have its own OLAP database and distributed file storage system. This could take the form of Databricks DBFS + Delta, S3 + Redshift, or Azure Blob Storage + Apache Iceberg. Just like a microservice exposes everything via HTTP REST APIs, a data mesh domain would expose everything via a well-defined interface that could be consumed by other teams. In other words, we might publicize a URL that could be used to connect to the tables in the OLAP database or to access the files in the underlying distributed file storage system, depending on the use case. Analogous to the documentation that accompanies a traditional REST API, every data domain in the data mesh would be discoverable via the data catalog.
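To make the idea of a published, discoverable interface concrete, here is a minimal sketch in Python. It is purely illustrative: the `DataProduct` and `DataCatalog` names, fields, and endpoint URLs are assumptions for this example, not the API of any real catalog product. The point is that a domain team publishes its connection details (OLAP endpoint and storage URI) to a catalog, and consuming teams discover them there instead of asking the owning team directly.

```python
from dataclasses import dataclass, field

# Hypothetical descriptor for the "interface" a domain team publishes.
# Field names and URLs are illustrative assumptions.
@dataclass(frozen=True)
class DataProduct:
    domain: str           # owning business domain, e.g. "payments"
    name: str             # product name, e.g. "settled_transactions"
    olap_endpoint: str    # URL for SQL access to the domain's OLAP tables
    storage_uri: str      # URI for direct file access (e.g. s3://, abfss://)
    schema: dict = field(default_factory=dict)  # column name -> type

# An in-memory stand-in for the data catalog that makes products discoverable.
class DataCatalog:
    def __init__(self):
        self._products = {}

    def publish(self, product: DataProduct) -> None:
        self._products[(product.domain, product.name)] = product

    def discover(self, domain: str) -> list:
        return [p for (d, _), p in self._products.items() if d == domain]

# The owning team publishes its product once...
catalog = DataCatalog()
catalog.publish(DataProduct(
    domain="payments",
    name="settled_transactions",
    olap_endpoint="jdbc:redshift://payments.example.com:5439/analytics",
    storage_uri="s3://payments-domain/settled_transactions/",
    schema={"txn_id": "string", "amount": "decimal(18,2)"},
))

# ...and a consuming team discovers it and picks the access path it needs.
products = catalog.discover("payments")
print(products[0].storage_uri)
```

A consumer with SQL needs would connect to `olap_endpoint`, while a machine learning team might read files straight from `storage_uri`; either way, the contract lives in the catalog, not in a Slack thread with the owning team.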