The Case for the Data Development Environment
Introduction
The data development environment is one of the concepts people find hardest to understand, and it often leads to confusion and misunderstandings. This page aims to explain the concept and justify why it is necessary.
Environments
The term 'environment' is a cornerstone of the world of development, often used alongside 'platform'. Every developer has their own understanding of what an environment is, and it can take various forms and serve different purposes:
- It is the collection of packages that you need to work on your application (.venv)
- It is a collection of code in version control in the branch named "dev/prod" (git-flow)
- It is the Cloud provider account named "dev/prod-X."
- It is the database with the prefix/suffix dev/prod/
- It is the network path that only end users, or only developers, can access
- It is the weather, every Briton's favourite topic
An environment is not a single thing, and environments can also be nested. As such, the term 'environment' needs to be defined in every context in which it is discussed.
Environment definition
In this context, an environment is more than just a collection of resources. It is a strategic combination of AWS resources, third-party SaaS services, access and security rules, and volatility expectations, designed to meet the diverse needs of an organization.
The hierarchy of environments (Dev->test->prod)
Before delving into the types of environments, it's crucial to understand the hierarchy of environments. This understanding is fundamental in all discussions about environments, whether explicit or implicit. We start from the development stage and progress through one or more 'environments' until we reach the final production stage.
What changes from one to the next? In general, as we go up the hierarchy, we increase the expected stability and reduce privileges and volatility. In the lower environments we accept more instability and can act more freely. Higher up, we expect solutions to be more stable and changes to be controlled, and access and privileges are granted on a need-only basis.
The hierarchy also has a direction: as we move up, we get closer to operations and to the end users, the people who intend to use or consume what we are creating.
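The rule described above, that stability rises while privileges shrink as we move up, can be sketched as a small consistency check. This is a minimal illustration only: the names `Level`, `EnvironmentPolicy`, `HIERARCHY`, and `is_consistent`, the access labels, and the numeric stability scale are all assumptions made for the example, not part of any real platform.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Position in the hierarchy; a higher value is closer to the end users."""
    DEV = 1
    TEST = 2
    PROD = 3


@dataclass(frozen=True)
class EnvironmentPolicy:
    level: Level
    direct_changes_allowed: bool  # may people change resources by hand?
    access: str                   # illustrative label: "broad" or "need-only"
    expected_stability: int       # illustrative scale: 1 = volatile, 3 = stable


# Hypothetical policies for a Dev -> Test -> Prod hierarchy.
HIERARCHY = [
    EnvironmentPolicy(Level.DEV, True, "broad", 1),
    EnvironmentPolicy(Level.TEST, False, "need-only", 2),
    EnvironmentPolicy(Level.PROD, False, "need-only", 3),
]


def is_consistent(hierarchy):
    """Return True if stability never drops and privileges never widen going up."""
    ordered = sorted(hierarchy, key=lambda policy: policy.level)
    for lower, higher in zip(ordered, ordered[1:]):
        if higher.expected_stability < lower.expected_stability:
            return False
        if higher.direct_changes_allowed and not lower.direct_changes_allowed:
            return False
    return True
```

Calling `is_consistent(HIERARCHY)` returns `True` for the policies above. A hierarchy in which production allowed manual changes while a lower level did not would fail the check.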
Users as differentiator
In general, the expectation is that end-users or consumers of the product delivered interact with the production version of the deliverable based on the assumption that they are entitled to the most stable and best-controlled version of what is offered.
It is from this assumption that we come to the differentiation of platforms and the role of environments. We should consider every set of end users as being entitled to their own set of environments in order to best serve and work with them.
From this, one gathers that entire platforms can exist within another platform's production environment. This occurs when the platform itself is the product delivered to a set of end users. These end users in turn need their own environment hierarchy, so that they can keep their development work separate and present a stable, secure environment to their own end users.
Infrastructure, Application, and Data Platforms
This is a preamble to the idea of a hierarchy of platforms, particularly the hierarchy of the infrastructure platform and the data platform. I highlight these two in particular because they are often conflated.
The infrastructure platform is the set of IaaS, PaaS, and SaaS that make up the infrastructure that manages computing, networking, and security. It ensures that anything built on this system is secure and controlled.
Application platforms
What differentiates the data platform from an application platform or development platform is its focus on the data held on the platform rather than on the services and applications running on it. Application platforms, like infrastructure platforms, focus on the deployed systems rather than on the data processed and the interactions with it. Because of this shared focus, application and infrastructure platforms are often combined: applications are developed in the same environment that engineers use to test and develop the infrastructure.
While not ideal, this is acceptable as applications and infrastructure can be isolated from any sensitive or vital information. Development can be conducted using synthetic or fake data without impacting the viability of the results. Some might consider it beneficial to be able to work with synthetic data.
Data platforms
In contrast, the Data Platform is a platform for managing and developing data and data products used by business analysts and non-technical staff to interpret the results. As such, the Data platform has a different set of end users compared to the infrastructure platform. The Data Platform is further concerned with the development and processing of data. It uses SaaS and PaaS (AWS) to perform its tasks, but the object of interest is the data itself, not the services that make up the platform.
As such, the data platform, while a collection of services, is to be considered more like an application on the infrastructure rather than a part of it. The Data engineers are to be considered a set of end users consuming an application/platform rather than the people developing the platform itself.
They are consuming the Data platform to provide analysts with a "platform" of well-designed data products from which to draw insights.
The need for a separate Data development environment
Data developers and engineers need access to real data to perform their work. Data is part of the required infrastructure for the data engineer, just as networks, compute, and APIs are part of the required infrastructure for application developers. Application development is only possible with the ability to call endpoints or commission compute resources. Likewise, data engineers can only create data flows and products when they have accurate, high-quality data.
However, the data in question is often considered business-critical or highly sensitive and needs to be protected to avoid data leakage.
At the same time, the data's end users and the developer community at large are pressuring data developers to adopt application development best practices in the sphere of data development. Some of the first things data developers are expected to adopt are the separation of development and production and the creation of an environment hierarchy. In naive implementations, Data engineers are expected to operate within the same environment as infrastructure and application development.
The environments used for developing infrastructure and applications do not always have the controls and security needed to hold actual data. Where they do not, access to actual data in development is prohibited, and the synthetic data that application developers use is expected to be adequate for data development as well.
However, this ignores the central role that data plays in developing data products and resources created within the data platform and expects data developers to work semi-blind.
Synthetic data is valuable
Synthetic data has substantial benefits when it is created purposefully and dedicated to the data platform. However, creating purpose-built synthetic data is a considerable undertaking. It must reproduce the real data's generation characteristics, behaviour, edge cases, and the insights that can be drawn from it. This is exceptionally hard to do and provides limited value compared to using real data with good masking and access controls. As such, while it would be the optimal solution, it is generally best treated as a nice-to-have extension.
Providing a development area for data engineers
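To make "real data with good masking" concrete, here is a minimal sketch of one possible masking approach: deterministic pseudonymisation with a keyed hash from the Python standard library. The function names, the field names, and the choice of HMAC-SHA256 are assumptions made for this example. A real platform would choose its masking technique per field and manage the key in a secrets store.

```python
import hashlib
import hmac


def pseudonymise(value: str, key: bytes) -> str:
    """Replace a value with a keyed hash so equal inputs give equal tokens."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability; an illustrative choice


def mask_record(record: dict, sensitive_fields: set, key: bytes) -> dict:
    """Return a copy of the record with the sensitive fields pseudonymised."""
    masked = {}
    for field, value in record.items():
        if field in sensitive_fields and value is not None:
            masked[field] = pseudonymise(str(value), key)
        else:
            masked[field] = value
    return masked


# Hypothetical record and key, for illustration only.
example_key = b"replace-with-a-managed-secret"
customer = {"customer_id": "C-1001", "email": "ada@example.com", "order_total": 42.5}
masked_customer = mask_record(customer, {"customer_id", "email"}, example_key)
```

Because the same input always yields the same token under a given key, joins between masked tables still line up, while the original identifiers are not exposed to someone who does not hold the key.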
We aim to adopt the best practice of having an environment hierarchy for creating data products and an insight foundation. We also need to use real data to enable quality work from the engineers. To use real data, every environment used by the data engineers must be secured so that real data can be stored there and accessed by those data engineers who have a business case for working with it.
This further strengthens the view that data engineers are consumers of the platform who nevertheless need their own set of environments to work in. This is because the business end users of the data platform expect stability and even tighter control over the data they use to make decisions.
In this case, control does not necessarily mean that the developers cannot see the data itself but that the data used for production is created from known and controlled processes, that developers cannot change the data directly without reason, and that the processing is available and run on time.
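This notion of control can be sketched as a simple permission rule. The sketch is illustrative only: the environment names `data-dev` and `data-prod`, the principal kinds, and the `allowed` function are assumptions made for the example and do not describe any particular access-control system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    name: str
    kind: str  # hypothetical categories: "engineer" or "pipeline"


def allowed(principal: Principal, action: str, environment: str) -> bool:
    """Return True if the action is permitted under the illustrative rules."""
    if environment == "data-dev":
        # Engineers and pipelines may both read and change data while developing.
        return action in {"read", "write"}
    if environment == "data-prod":
        if action == "read":
            # Control does not mean the data is hidden from developers.
            return True
        # Production data is written only by deployed, controlled pipelines.
        return action == "write" and principal.kind == "pipeline"
    return False


engineer = Principal("ada", "engineer")
nightly_load = Principal("nightly-load", "pipeline")
```

Under these rules `allowed(engineer, "write", "data-prod")` is `False`, while `allowed(nightly_load, "write", "data-prod")` and `allowed(engineer, "read", "data-prod")` are both `True`.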
The result of not having a dedicated development environment
Data platforms and central data warehouses have existed much longer than the requirement for an environment hierarchy. If you ask the people who maintain these data warehouses where they work, the common answer is what we would today call production. Only one copy of the data warehouse exists, and the people responsible make changes and fixes directly in it, trying as far as possible to avoid introducing errors or breaking things.
The same happens today when the development environment contains data of insufficient quality or quantity. Engineers start developing in production, either directly by accessing production data, or indirectly by making guesses in the lower environments, promoting them to production, and only then learning whether their changes were correct. The indirect route lengthens the feedback loop and turns production into a de facto development environment: any deployment is as likely to break something in production as a change made there directly.
This can also be used as a diagnostic. When development work starts to move into the production environment, directly or by proxy, it indicates that the development environment does not meet the data engineers' needs. The shortfall can lie in data quantity, data quality, or access to feedback on changes. One risk in every environment hierarchy, and one that can quickly cause a data development area to be abandoned, is drift between the development and production environments, which effectively renders the development area obsolete. The development area must therefore be built with a mechanism that keeps production and development in sync.
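One simple form such a sync mechanism could take is a regular drift report that compares what exists in production with what exists in development. The sketch below is a minimal illustration under stated assumptions: each environment's schema is represented as a plain dictionary mapping table names to column definitions, and the function name `schema_drift` and the example tables are hypothetical. How the schemas are collected from the real environments is left out.

```python
def schema_drift(prod: dict, dev: dict) -> dict:
    """Compare two {table: {column: type}} mappings and report differences."""
    prod_tables = set(prod)
    dev_tables = set(dev)
    shared = prod_tables & dev_tables
    return {
        "missing_in_dev": sorted(prod_tables - dev_tables),
        "only_in_dev": sorted(dev_tables - prod_tables),
        "definition_differs": sorted(t for t in shared if prod[t] != dev[t]),
    }


# Hypothetical schemas, for illustration only.
prod_schema = {
    "orders": {"id": "int", "amount": "decimal"},
    "customers": {"id": "int"},
}
dev_schema = {
    "orders": {"id": "int", "amount": "float"},
    "scratch": {"x": "int"},
}
report = schema_drift(prod_schema, dev_schema)
```

For these inputs the report lists `customers` as missing in development, `scratch` as existing only in development, and `orders` as having a differing definition. A report with all three lists empty would suggest that the two environments are structurally in sync, although it says nothing about the quality or quantity of the data itself.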
Production data in development
One point I want to address directly is a common misunderstanding that arises when discussing making real data available to data engineers in their development area: the notion that I aim to move actual data into the general development area. To be abundantly clear, that is not the intention. The aim is to create a dedicated environment that can hold actual data and that data engineers can use to develop data products and pipelines. These are then deployed to another environment, the data production environment, where business users and analysts access the data and extract insight from it.
Another way to think of this is to imagine taking the current production environment and subdividing it into two layers. One is volatile: changes occur daily, and engineers build and experiment. The other is more stable and contains the approved versions of products and flows, to be presented to the next set of people in the value chain that runs from CPUs and hard drives to business decisions.
At no point in this value chain should security or maintainability be reduced as one moves along the chain from one link to the next. At the same time, each link needs to be given the tools and components required to provide value to the next link.