Falling to the left

AWS Conformed to Enriched confusion

I had a discussion while working on my comparative analysis of data state architectures. Confusion arose over the AWS modern data architecture, specifically over how the conformed and enriched layers are supposed to relate to each other.

One side argued that the conformed layer is a prerequisite: data must pass through it before entering the enriched state. The other argued that enriched is an optional state that data reaches upon fulfilling certain criteria, and that nothing requires data to pass through conformed in order to reach enriched.

During this discussion, I kept wondering how these two interpretations could arise from the same architecture and even the same document. I therefore decided to take a closer look at the AWS modern data architecture and try to see how this had come to be.

Having studied their architecture, I have found that the illustrations in Behram and Abhishek's white paper are likely to cause unnecessary confusion. They depict both the more distributed and the centralised versions of the architecture.

Distributed AWS modern data architecture

Centralised AWS modern data architecture

In the distributed architecture illustration, each domain contains a conformed and an enriched data state, stacked one on top of the other. This gives the impression of hierarchy, or sequence, because the illustration presents processes consecutively from the bottom up. Based on the full paper, which is not directly available on the prominent page itself, I believe this to be a miscommunication.

The idea of every distributed domain having a conformed state originates in the concept of Data Mesh, where each domain is in charge of its own data model, its data products, and their distribution amongst other domains. The AWS modern data architecture claims to be a blend of Enterprise data warehouse, Medallion, and Data Mesh. The question of dependency between the conformed and enriched states is where this blend becomes confusing. The architecture uses the same terms for both the distributed, more Data Mesh-leaning version and the centralised, more Enterprise data warehouse-leaning version. They are trying to draw parallels in both directions.

Using two illustrations that try to show how similar the two modes of architecture are obscures the vital differences in responsibility and data flow.

Let's compare the distributed illustration with the illustration for the more centralised version of the same architecture. Side by side, the relationship between the domains and the conformed and enriched states becomes apparent: the centralised illustration shows that the conformed state is an independent data state under central management.

The primary illustrations only show the flow of metadata and not the flow of data itself, thus further obscuring the process. For the centralised version of the architecture, Behram and Abhishek provide an example of how this architecture could work with the AWS services labelled. Crucially, we also get arrows showing what I believe to be the flow of data.

These illustrations show data lines originating from the standardised layer and moving to the enriched layer, bypassing the conformed state. In contrast, some data passes from the standardised to the conformed state before being consumed by the domains.
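
To make this reading concrete, here is a minimal Python sketch of the data lines as I read them from the centralised example. The edge list, the node name domain_consumption, and the helper functions all_paths and is_prerequisite are my own illustration, not something taken from the white paper. The sketch only checks whether every route into a state has to pass through another state.

```python
# Data lines as read from the centralised example illustration.
# This edge list is an interpretation of the arrows, not an AWS specification.
FLOWS = {
    "standardised": ["conformed", "enriched"],
    "conformed": ["domain_consumption"],  # hypothetical node name
    "enriched": [],
    "domain_consumption": [],
}


def all_paths(flows, start, goal, path=None):
    """Return every cycle-free path from start to goal as a list of states."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    found = []
    for nxt in flows.get(start, []):
        if nxt not in path:  # never revisit a state within one path
            found.extend(all_paths(flows, nxt, goal, path))
    return found


def is_prerequisite(flows, state, start, goal):
    """True only if goal is reachable and every path to it contains state."""
    paths = all_paths(flows, start, goal)
    return bool(paths) and all(state in p for p in paths)


if __name__ == "__main__":
    print(all_paths(FLOWS, "standardised", "enriched"))
    print(is_prerequisite(FLOWS, "conformed", "standardised", "enriched"))
```

Under these assumed edges, the direct line from standardised to enriched means conformed is not a prerequisite for enriched, while it remains one for the data that domains consume from the conformed state.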

The equivalent illustration for the distributed architecture shows data entering the domain and being exchanged between the domains. At the same time, there is no explicit transfer of data between conformed and enriched. From this, I assume they treat the domain as a whole unit, per the data mesh doctrine.

Having read the paper and made this comparison, I believe a more appropriate interpretation of the distributed architecture is that every domain contains both a conformed and an enriched data state, but the data need not be represented in the conformed state before being transferred to the enriched state.

This interpretation would also adhere more closely to the understanding gained from analysing the architectures that the AWS modern data architecture professes to consist of. Generally, the conformed data state appears reserved for managed and recognisable business entities. However, not all data consists of these recognisable business entities; therefore, it makes little sense to mandate that all data pass through the conformed state before reaching the enriched state.

While the discussion is essentially an academic exercise in theoretical interpretation, one could fall on the other side of the fence if the underlying assumptions and the weighting of the arguments differ.

The apparent confusion illustrates the issue of blending different architectural terminology to appeal to all interested parties. I believe Behram and Abhishek could have created a more robust and more precise architecture if, rather than trying to mix and match terms and concepts, they had focused on building a solid foundation for their idea, one capable of standing on its own without relying on an understanding of fundamentally incompatible architectures. As it stands, they risk confusion whenever people with a different understanding of these underlying architectures attempt to apply that knowledge to the AWS modern data architecture.