Comparative analysis of data state architectures

This article delves into the prevailing data state architectures for cloud-based data platforms. I examine their structures, the distinct purposes of each state, the defining characteristics of these states, and the interrelationships between different architectures.

Data states are the labels that are attached to a given data object at some point throughout its lifecycle. These labels provide information on what can be expected from the data, both in terms of characteristics and use.

Later articles will dive deeper into what a data state is.

Introduction to the architectures

The current dominant data state architectures that I investigate are the Medallion architecture developed by Databricks [2], the data lake architecture that preceded the Medallion architecture for modern data-lake-based data platforms [3], and the AWS modern data architecture [1] that AWS recommends for data platforms built on its services.

Most newer architectures compare themselves against the historical default position of the enterprise data warehouse design, which was popularised alongside the Kimball approach to data modelling [4]. In addition, Reis and Housley, in their book on data engineering fundamentals, describe alternative approaches to designing data states [6].

We will briefly describe the different data architectures before we dive into their descriptions of the states themselves.

No Data Mesh here?

While I sometimes hear Data Mesh described as a data architecture, based on the definitions of data states I have provided, Data Mesh does not qualify as a data state architecture but rather as an organisational architecture. Data Mesh domain teams would employ a data state architecture themselves to maintain control over their data states; therefore, Data Mesh cannot itself be a data state architecture.

Data lake architecture

The data lake architecture is based on the idea that all data is stored in a data lake, commonly based on some form of blob storage, such as S3 or a storage bucket. All data is stored in one location, which forces the need to identify the qualities of the data to distinguish data sets from each other. The purpose is to prevent the creation of a data swamp, where the data is undifferentiated and thus unusable. This differentiation was achieved by creating different data states, or layers, as they are called.

The naming and definitions of the layers varied widely until the introduction of the Medallion architecture (see below). One set of descriptions that adhered to what seems to have been a form of consensus is:

  • Raw
  • Standardised
  • Cleaned
  • Application
  • Sandbox

This version considers the Application and Sandbox layers optional [5].
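To make the layering concrete, below is a minimal sketch of how these layers commonly map to prefixes in blob storage. This is an illustration rather than a prescribed convention; the bucket, source, and dataset names are hypothetical.

```python
# Hypothetical zone layout for a data lake on blob storage; the zone names
# follow the list above [5], everything else is illustrative.
ZONES = ["raw", "standardised", "cleaned", "application", "sandbox"]

def zone_path(bucket: str, zone: str, source: str, dataset: str) -> str:
    """Build the storage prefix for a dataset in a given layer of the lake."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"s3://{bucket}/{zone}/{source}/{dataset}/"

# The same dataset addressed in two different states:
print(zone_path("corp-data-lake", "raw", "crm", "customers"))
print(zone_path("corp-data-lake", "cleaned", "crm", "customers"))
```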

Medallion Architecture

The Medallion architecture originated at Databricks [2] to deal with the variation in the naming of states on the data lake. The aim was to create a simple, intuitive, flexible architecture that introduces a measure of order without unduly imposing friction on data lake processes. In the Medallion architecture, data is separated into three "layers": Bronze, Silver, and Gold.

Bronze is an immutable layer of raw data that forms the basis for all later data objects. Silver is a more refined data state, whereas Gold is the most rarefied state, where the final value extraction should happen.
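As a rough illustration of how data might move through the three layers, here is a minimal sketch using pandas. A Databricks implementation would typically use Spark and Delta tables instead; the columns and values are hypothetical.

```python
import pandas as pd

# Bronze: immutable raw records, exactly as ingested.
bronze = pd.DataFrame([
    {"order_id": "A1", "amount": "100.0", "country": "no"},
    {"order_id": "A2", "amount": "250.5", "country": "SE"},
])

# Silver: a more refined state with types enforced and values standardised.
silver = bronze.assign(
    amount=bronze["amount"].astype(float),
    country=bronze["country"].str.upper(),
)

# Gold: the rarefied state where the final value extraction happens.
gold = silver.groupby("country", as_index=False)["amount"].sum()
print(gold)
```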

The Medallion architecture has seen great success and is employed far outside of Databricks systems, even outside of data lake platforms these days. I have seen attempts at reclassifying existing enterprise warehouses as Medallion architecture. The seeming simplicity of this architecture makes it a tempting approach.

AWS Modern Data Architecture

AWS, one of the three dominant public cloud providers, released a whitepaper in 2024 by Behram and Abhishek [1], intended as an alternative to the traditional enterprise data warehouse of the Kimball era. Their architecture aims to support the fast pace of data development and unite the benefits of the EDW with the more modern data lake approach. Due to AWS's dominance in the cloud computing space, this architecture has gained a broad reach. While relatively new, it is worth including due to its reach and potential future impact.

The features of the Modern Data Architecture were formulated as:

  • Scalable, performant, and cost-effective
  • Purpose-built data services
  • Support for open data formats
  • Decoupled storage and compute
  • Seamless data movement
  • Support for diverse communication mechanisms
  • Secure and governed

To achieve this, Behram & Abhishek describe a data state architecture consisting of:

  • Staging
    • Raw
    • Standardised
  • Conformed
  • Enriched

Staging is a superset of the states Raw and Standardised; here we also see the influence of the Kimball architecture.

Kimball Enterprise data warehouse

As the dominant data warehousing architecture for a long time, the Kimball enterprise data warehouse is the baseline against which the later architectures compare themselves.

While Kimball has a rich and intricate data modelling system, the recommended overall architecture is straightforward [4]. It has a staging area for operational sources, a back room for data transformations, and a front room where the final data model is presented and accessed by BI applications.

Reis and Housley on Data state architecture

Reis and Housley's book on the fundamentals of data engineering [6] has quickly become a recurring fixture in discussing how data engineering should be performed in the cloud era. While they do not present an alternate architecture, their idea of the Data Engineering lifecycle nevertheless influences how you think of building up your data state architecture.

In the data engineering lifecycle, we follow the data as it moves through generation, ingestion, transformation, and serving. The labels that Reis and Housley use focus on the actions one performs at different stages rather than on the state of data at rest [6]. For this reason, their influence shows in the commonality of tasks performed on the data in different states.

Data states in depth

Having introduced the primary actors in this discussion, we turn to the different states these architectures describe and how they relate.

Raw Data

Of all the data states, this state has the most consensus. All the primary architectures have a concept of a raw data state: the state that the data takes as soon as it is ingested into the data platform [6]. The only architecture that doesn't refer to it as raw is the Medallion architecture, in which the raw state is called the Bronze layer.

The generally agreed-upon characteristic of the raw data state is that the data is "in its natural state as it was ingested from its source." - Gopalan (2022) [3]

Data is stored following the sources it is ingested from. Additional transmission artefacts, such as loaded_at columns, are permitted, but no other changes to the data are allowed. The purpose of raw is to serve as a historical archive of data, providing data lineage, auditability, and the capability to reprocess without needing to reread data from the source.

The Medallion architecture states that raw data should never be overwritten and requires robust version handling. This also applies to the general data lake architecture, which focuses on the ability to travel through time by moving forward and backward through the raw data.
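A minimal sketch of what this can look like in practice, assuming a batch ingestion in Python with pandas: the payload is stored unchanged, only a transmission artefact is appended, and persistence is append-only so nothing is ever overwritten. All names and paths are hypothetical.

```python
from pathlib import Path
import pandas as pd

def ingest_raw(records: list[dict]) -> pd.DataFrame:
    """Keep the payload exactly as received; append only a transmission artefact."""
    df = pd.DataFrame(records)                    # data as-is from the source
    df["loaded_at"] = pd.Timestamp.now(tz="UTC")  # permitted artefact column
    return df

batch = ingest_raw([{"customer_id": 1, "name": " Ada "}])

# Append-only persistence: every batch becomes a new file, never an overwrite.
target = Path("raw/crm/customers")
target.mkdir(parents=True, exist_ok=True)
batch.to_parquet(target / "batch_0001.parquet", index=False)
```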

All the architectures focus on the raw nature of this state and advocate some form of limitation on access to the raw data. AWS, while not explicitly stating that access should be limited, states that the data in raw is: "mainly used to land the data as-is, and is used for audit, exploration, and reproducibility purposes." - Behram & Abhishek (2024) [1]

This indicates that access is limited, especially considering that they describe explicit access in later states.

Databricks, with the Bronze layer, is the only architecture that does not indicate limited access to the raw data. However, this is not surprising, given the generally higher skill requirements to utilise a data-lake-based architecture and the famously free approach to data flows taken by the Medallion architecture.

Enterprise data warehouses refer to a source system directly providing the raw data state. Unlike current cloud-based data platforms, these platforms treated storage as a limited commodity and, as such, avoided ingesting unnecessary data where possible. Storing all raw data, enabled by an abundance of available storage, is a cloud phenomenon.

When investigating these descriptions, we see a consensus that raw data is the bedrock of the data platform. It serves as the foundational basis for disaster recovery in later parts of the data lifecycle, and for this reason, raw data is treated as a precious resource. The value of the raw data lies not in its current state but in its role as a security failsafe and in the potential it holds. Raw data is therefore treated with additional care to maintain integrity, by securing it against loss and limiting access.

Standardised Data

The Standardised state is described as an independent data state by AWS [1] and in the data lake architecture by Gopalan [3]. We also find the Standardised state in the Kimball enterprise warehouse as the staging tables.

The Standardised data state represents the source data, which may be ingested in many ways and in different data/object formats, in a coherent, standardised format. At this stage, we see the introduction of simple transformations, such as adherence to naming standards and structures.

Data is still organised according to its sources but is now available for data engineers and analysts to start working with, creating the data flows the business requires. In the enterprise data warehouse architecture, the staging tables were the first representation of the sources in the warehouse, making them the source for all future processing.
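A minimal sketch of the kind of simple transformations this state introduces, again with pandas; the source columns and the naming standard are hypothetical.

```python
import pandas as pd

# Source-aligned input with source-specific naming (hypothetical columns).
raw = pd.DataFrame([{"CustID": "7", "SignupDate": "2024-01-31"}])

# Simple transformations only: naming standards and consistent formats,
# while the data stays organised according to its source.
standardised = raw.rename(columns={
    "CustID": "customer_id",
    "SignupDate": "signup_date",
})
standardised["customer_id"] = standardised["customer_id"].astype(int)
standardised["signup_date"] = pd.to_datetime(standardised["signup_date"])
print(standardised.dtypes)
```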

While the Medallion architecture does not have the typical standardised data state, the description of the Silver state strongly overlaps with both the standardised and conformed data states. We see in Silver the introduction of naming standards and common formats, as well as the Silver state serving as the basis for data engineers, analysts, and scientists to perform self-service data exploration.

The standardised data state acts as an interface between the raw data and the developers tasked with refining this data. At this state, it has become available in a standard format and with certain conventions imposed on it, thus abstracting away the natural complexity that comes from ingesting data from different sources and maintaining a full historical record at all times. This abstraction allows for a more streamlined process while aiming to maintain as much of the original structure of the source as is feasible.

Coherence with the original source enables communication with the people in charge of data generation [6]. A transparent integration step increases confidence in the data itself. While the raw data forms the bedrock of the data platform itself, the standardised data state is the bedrock for the data processes created on the platform.

Conformed

Of all the data states, the conformed state causes the most confusion. Behram and Abhishek refer to the conformed state as containing recognised business entities. These entities are clean, well-understood, and shareable [1]. This description harmonises very well with Kimball's description of a conformed model [4].

They further describe two models of a conformed state, depending on whether one chooses a distributed or centralised model. In a distributed model, Behram and Abhishek explain that every domain has its version of the conformed state [1]. This is consistent with the ideas from Data Mesh [8], where operationally independent domains are responsible for distributing the data products in their domain. If we choose the language of the Behram & Abhishek paper, this will take the form of a conformed state representation in every domain.

The alternative approach they describe is a centralised conformed state managed by a central team, where the domains can pull the data they need from the central conformed state [1].

Behram and Abhishek further explain that the conformed state should not be used for operational BI processes or applications [1]. This indicates that the conformed state does not align with all data flows. They argue, however, that the conformed layer can be used for self-service data exploration [1]. Behram and Abhishek do not clarify what they mean by operational BI, given that it is associated with automated systems and dashboards [1].

As mentioned, the Medallion architecture does not have the concept of a conformed state. However, the Silver layer also contains clearly described, shareable business entities that can form the basis of later models.

The data lake architecture presented by Gopalan [3] does not have a state that is named Conformed; however, by reading the descriptions of the states, we find that the Enriched state that they describe bears a strong resemblance to what AWS calls the conformed state: "This zone contains the transformed version of the Raw data. Adhere to a structure that is relevant to your business scenario. At this point, the value density is low to medium; however, there is a certain level of guarantee for the data to adhere to a schema or a structure." - Gopalan [3]

When comparing the descriptions of the conformed state above to the Kimball enterprise warehouse's baseline architecture, we find that the conformed data states are functionally contained in the "back room" [4]. This is the section of the enterprise data warehouse dedicated to data transformation and active data modelling. Here, we find the establishment of the recognisable business entities described by all these other architectures. In the back room, we go from source objects to authoritative business entities [4].

Unlike the other architectures, however, Kimball does not see these as the basis for future transformations; instead, they are the end product of the data warehouse. The purpose of the data warehouse is to create the data model and publish it to BI applications in the "front room."

This also sets it apart from later iterations, where multiple groups increasingly manage data within an organisation. Rather than creating a single model that provides data for reporting and analysis purposes, we see organisational units being dependent on sharing data to achieve their goals. Rather than being the end goal, the standard data product now becomes the common ground for collaboration.

Kimball addresses this change to an extent in the alternative Kimball architecture, where, rather than having a single model, one creates a base standard model that later branches out into independent marts [4].

Comparing the different descriptions of the conformed data state, or its equivalents, the defined business entity emerges as the common denominator for all these states. The product, customer, or sale data objects are cleaned, tagged, maintained, and authoritative. They are connected across the organisation as the enterprise data model. This model forms the single source of truth from which the rest of the organisation constructs its data products.
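To illustrate, here is a minimal sketch of assembling a conformed business entity from two standardised, source-aligned inputs; the sources, columns, and the trivial merge are hypothetical stand-ins for what is often considerable combination logic.

```python
import pandas as pd

# Two standardised, source-aligned inputs (hypothetical sources and columns).
crm = pd.DataFrame([{"customer_id": 7, "name": "Ada Lovelace"}])
billing = pd.DataFrame([{"customer_id": 7, "segment": "enterprise"}])

# The conformed customer: one clean, well-understood, shareable entity that
# downstream domains can build their data products on.
conformed_customer = crm.merge(billing, on="customer_id", how="left")
print(conformed_customer)
```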

None of the architectures examined so far discusses the state of data in the act of transformation. Reis and Housley [6], whose data engineering lifecycle is built around the actions taken at the different steps, focus on the concerns of data as it is processed and the need for appropriate storage and persistence. All the architectures so far have concentrated on moving data from a source-aligned state to a structured, conformed state, focused on business entities and a common source of truth.

However, that is only feasible in small organisations or with straightforward data models. In my experience, organisations rarely have a single authoritative source for their business entities. The truth regularly has to be a combination of sources, joined with considerable logic. In these cases, data must be managed in a pre-conformed state to allow for the needed transformations. This state is not covered by the description of any of the architectures, which gives the sense that the expectation is to transfer data directly without intermediary persistence.

The conformed data state serves as the single source of truth and plays a crucial role in sharing knowledge across the organisation.

Rather than creating anything itself, it focuses on collecting and sharing the data produced by other domains. By collecting shareable objects from domains, the conformed state also acts as a decoupling layer, allowing business entities to change upstream sources without updating downstream consumers, providing additional flexibility for future adjustments.

Enriched

The AWS modern data architecture refers to the Enriched state as the logic layer for data engineers to combine and create new data products based on data from the Standardised and conformed layers. These enriched layers are generally domain-focused [1].

The enriched layer serves as the repository for the domain's final data products, including reports, BI Dashboards, or tables, which later applications consume. This role underscores the enriched layer's importance in the data architecture, as it is the endpoint for data transformation and the source for data consumption.

The whitepaper also discusses the enriched state by referring to the Medallion Gold layer [1]. The Gold layer is the final Medallion layer. This is where we find the data sets that form the basis for reporting and analysis [2]. It comprises the platform's presentation layer. Databricks states that this is the layer where they see Kimball stars and Inmon marts [2].

When reading the documentation on the Medallion architecture's Silver and Gold layers, you can sense that these are relatively compressed states with few persistence points [2]. This makes sense, given that the Medallion architecture originates from Databricks. Databricks is built around Spark, a memory-based processing engine that can manage long chains of processing without persisting objects, thus allowing for more compressed data states, each covering a greater number of characteristics. Architectures that are more warehouse-focused or utilise several persistence points benefit more from a more spread-out architecture than the three states of the Medallion architecture.

In their description of the cloud data lake architecture, Gopalan uses the term "Enriched" for a state that maps more closely to the characteristics others call conformed. As Enriched is taken, Gopalan uses the term "Curated" for the last state they describe in their architecture [3].

The Curated state contains the data with the highest value density [3]. The data in this zone is generated by applying transformations to the data from the earlier states. Curated data acts as a presentation layer and a source for Power BI dashboards. Aggregation, filtering, and correlation are performed in this layer for the dashboards [3].
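A minimal sketch of the kind of aggregation that produces this high-value-density state; the conformed input and the revenue metric are hypothetical.

```python
import pandas as pd

# Conformed input (hypothetical) to be shaped for direct dashboard consumption.
orders = pd.DataFrame([
    {"customer_id": 7, "month": "2024-01", "amount": 120.0},
    {"customer_id": 7, "month": "2024-02", "amount": 80.0},
])

# Aggregate into the presentation-ready shape a dashboard would read directly.
monthly_revenue = (
    orders.groupby("month", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total_revenue"})
)
print(monthly_revenue)
```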

The final architecture in our comparison is the Kimball enterprise data warehouse architecture. The term used for data at the end of the data warehouse is the "front room". According to Kimball [4], this is where data is served to the BI systems used for reporting, or to other operational systems that consume the data. In a trend that has become obvious in writing this article, Kimball does not address data states or characteristics beyond the immediate control of the data warehouse.

All the analysed architectures treat this as the final and most rarefied state of data. The enriched state, therefore, becomes the presentation layer for these architectures. Amongst all the architectures, the enriched state is similar in composition; only the name differs.

Behram & Abhishek explicitly address this state as the domain logic and presentation layers [1]. As mentioned in the section on conformed data, there is a need for data to be persisted and prepared before the curated state. The AWS whitepaper makes explicit what seems implicit in the other architectures: the enriched data state pulls double duty. Data in this state is supposed to be the best version of the data available while also being the domain logic layer [1]. This implies an internal differentiation within the enriched data state, or you end up with data objects that do not have a state.

Sandboxes and worksheet areas

Several of the investigated architectures refer to areas dedicated to data scientists or analysts for performing their experiments and tests [5]. These areas are generally described as having little oversight. Often, they get names such as sandboxes, worksheets, or playgrounds. The idea is to provide data consumers with a safe space to experiment, free from the rigours of formal data states and data modelling [3], [5], [6]. It is a place where ideas can be tested and trialled before presumably being transferred to the more formal structures of the data engineers.

While the idea is commendable, these sandboxes can become breeding grounds for unmanaged business-critical processes. I am still waiting to see or hear about a case successfully implemented without the risk of hidden business-critical processes in sandboxes.

A two-tiered system where experimentation is done in a sandbox and later transferred will always need a prioritisation bottleneck. The need to capitalise on a successful experiment will always triumph over any supposed guardrails.

For this reason, I strongly advise against using sandboxes as a system or tool to perform data experiments for data analysts, engineers, or scientists. The potential risks and inefficiencies outweigh the perceived benefits.

Summary

Having analysed these architectures, we find the broader aspects of data state management consistent across all of them. Data starts relatively unchanged from the original sources. There is the expectation that this initial state of data is protected and serves as the foundation for the platform as a whole. A consistent core concept is the assumption that one should be able to rebuild later sections of the data from this initial data.

All architectures refer to a defined state for recognisable business entities. These states are expected to be clearly described and documented. They generally represent the foundation of other data objects. The level of refinement expected in this state varies from architecture to architecture.

Another commonality is a presentation layer at the end of the architecture. This state is expected to contain the most refined data, drawn from a selection of the other states. The data stored in this state is expected to be the driving source for all reporting needs and further analysis.

This raises the question of what happens to the state of the data as BI or analytical use cases consume it. That is another commonality: none of these prominent architectures addresses the data in the analytical or BI processes, leaving a discontinuity in any attempt at a comprehensive architecture. If one takes these architectures at face value, all BI and analytical processes store all their data in the final presentation layer.

Having investigated the common aspects, we find that the differences between the architectures stem from the technologies that formed the basis of the original designs.

Architectures that originate from technology stacks that process longer chains of changes without persistence, or where intermediary states are ephemeral, have fewer discrete states. Examples of these are the data lake and Medallion architectures.

Architectures that originate in more centralised structures, like the enterprise data warehouse, have a stronger focus on the conformed data state and the need for standardised processes. An interesting note is the AWS modern data architecture: created as an amalgamation of architectures, it suffers from attempting to straddle multiple designs and bring coherence without significantly altering the original ideas. See this (add link) post, where I address one example of this attempt to combine various architectures.

Conclusion

While the terminology of data architectures often presents them as generic solutions, all the analysed data state architectures are strongly influenced by their original technical infrastructures and make assumptions about the available features. If you want to adopt one of these architectures, choose the one created for your selected technology stack. By doing so, you are likely to find it easier to map your workflows onto the data states the architecture provides.

Even ostensibly generalist architectures, such as the AWS modern data architecture, try to assimilate existing architectures, creating the contradictions discussed above.

An attempt at a truly general architecture should be able to operate independently of technology and move between technology stacks without needing to change the state of any data object. None of these architectures truly achieves this, nor do they explicitly make this claim. For this reason, I recommend not trying to force an architecture outside of its technological origins.

References

[1] Behram & Abhishek (2024) AWS Whitepaper: Modern data architecture rationales on AWS. URL: https://docs.aws.amazon.com/pdfs/whitepapers/latest/modern-data-architecture-rationales-on-aws/modern-data-architecture-rationales-on-aws.pdf#modern-data-architecture-layers-deep-dive

[2] Databricks (n/a) What is a medallion architecture? URL: https://www.databricks.com/glossary/medallion-architecture

[3] Gopalan (2022) The Cloud Data Lake, O'Reilly, Sebastopol

[4] Kimball, R., Ross, M. (2013) The Data Warehouse Toolkit, Third Edition. Wiley

[5] Mitrus P (n/a) Data Lake Architecture: How to Create a Well-Designed Data Lake URL: https://lingarogroup.com/blog/data-lake-architecture

[6] Reis, J., Housley, M. (2022) Fundamentals of Data Engineering - Plan and build robust data systems. O'Reilly, Sebastopol

[7] Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10). doi:10.18637/jss.v059.i10

[8] Dehghani, Z. (2022) Data Mesh. O'Reilly, Sebastopol