Data Engineering Fundamentals (review)
Bibliographic information
Reis, J. and Housley, M. (2022) Fundamentals of Data Engineering: Plan and Build Robust Data Systems, O’Reilly, Sebastopol. ISBN-13: 978-1098108304
Introduction
This book is written by Reis and Housley and presented as an introductory text for aspiring data engineers and other developers who find themselves in a data engineering role with little to no prior knowledge. It aims to provide fundamental insights into all aspects of the daily life of data engineers and thus a foundational knowledge base.
The authors further state their wish to provide a text that is decoupled from any one particular technology and thus offers a more nuanced approach. They also aim to provide a body of knowledge that can last in an industry that changes at lightning speed.
Reader’s Intent
Reis and Housley are self-declared data scientists turned data engineers: wherever they went there was a lack of people to fill that role, and they found themselves taking it on. As someone who has been through the very same journey as the authors, and who had heard about this book through the jungle telegraph prior to its release, I was very much looking forward to reading it to see what they had taken from this journey and what they saw as fundamental knowledge.
Discussion
This book, like so many within this genre of discipline exploration, can generally be split into three thematic sections. The first one aims to provide justification for its existence by placing the role of data engineering into a historical context and comparing it to other engineering disciplines. Reis and Housley draw on a wide selection of definitions of engineering practices, both from within software and outside of it. An interesting observation, which comes as a consequence of having read David Farley’s book on modern software engineering, is the comparison of data engineering with production engineering practices rather than design engineering.
Having placed the data engineer in his/her context and definition, the authors carry on to what very clearly is the central idea of the book: the Data Engineering Lifecycle. This is the conceptualization of all aspects of a data engineer’s daily work, as well as the undercurrents underpinning the central workflows. I found their compartmentalization of the flow of data into the stages that the data passes through, and how it relates to the work of data engineers, to be highly valuable. It provides a clear picture of the length and breadth of the sphere of responsibility that you need to be aware of and deal with. This might in fact be one of the first texts to systematize the domains of data engineering so comprehensively.
They step far beyond simple ETL/ELT processing and include both the generation of data and the impact that has on all data processes. The lifecycle also doesn’t end with the presentation layer of data but extends all the way into communication and maintenance of your data flows, and the responsibility for choices made in the lifecycle.
Reis and Housley also include in their lifecycle the undercurrents that might not be top of mind for many data engineers, but that form the basis of many of the practices and realities that make up the daily life of a data engineer. This is, however, also the place where we see something that will become a theme later in the book: their choice of terms and, for lack of a better word, "buzzwords".
While I personally endorse many of the ideas expressed in the Data Mesh philosophy, the inclusion of Data Mesh as an undercurrent in the data engineering lifecycle strikes me as odd and might prove to be the first thing that won’t stand the test of time. Data Mesh is a highly popular concept at the time of the book’s release and is likely to go through many of the same reinterpretations and spinoffs that most, if not all, named methodologies suffer.
This choice of terms that are by many considered buzzwords continues, which brings us to the next section of the book: the in-depth explanation of the data engineering lifecycle. The majority of the book’s pages are dedicated to a step-by-step walk-through of the individual stages of the data engineering lifecycle, with a whole chapter dedicated to each stage. This is also where, in my opinion, the authors made the biggest mistake.
While the idea of stages and processes like generation, ingestion, storage, and presentation makes sense conceptually, considerations in one stage have consequences for other stages. The mistake the authors make is to try to force reality into these neat boxes, isolating them from each other. The consequence is a book filled with topics, subjects, concepts, and ideas, without ever providing sufficient detail to understand why, or enough context to get the big picture. Taking a subject such as streamed data or data warehouse technology, introducing only the aspect that relates to data generation, and then starting from scratch when discussing transformation or ingestion breaks the reading flow, and left me more frustrated than enlightened. I found the self-referencing (i.e. "see page x for more information") especially frustrating.
I believe the authors would have been left with a better book, and been able to better convey their ideas, had this section been flipped on its head: rather than a chapter per stage in the lifecycle, a chapter or two per major technology such as streamed data, data warehouses, batch processing, cloud computing, etc., with the data engineering lifecycle superimposed on top. That is to say, explain how the particular concept changes, impacts, or otherwise interacts with the lifecycle. This would provide the opportunity to follow a single idea throughout the lifecycle, without having to context switch from one to the other or flip back and forth through the book to keep the narrative flowing. The disjointed nature of the book also causes there to be a lot of fluff in the form of reintroducing ideas and rounding off half-told stories as they stretch into another stage of the lifecycle. This could also have been avoided by reorganizing the book.
Final Comments
As it stands, I am left disappointed. The idea of the book showed great promise, and an opportunity to fill a niche in the literature. In my opinion, the book fell short of fulfilling this lofty promise. That is, however, not to say that the book is without merits. The idea of the Data Engineering Lifecycle is sound, and definitely something worth reading about. I would, however, not recommend this as your first foray into the world of data engineering literature.
The book functions excellently as an introduction to words and concepts that you will hear in your life as a data engineer, but do not expect to have a full understanding of what they mean simply by reading this book. I can quite easily see myself recommending this book as part of a curriculum where I would be able to provide supporting literature to fill in the gaps. In that case, I would probably also create reading paths that link the relevant pages for any given concept together, rather than having students read chapter by chapter.
All in all, it is not the book I hoped for, but the area it tries to fill is so starved for literature that I would recommend it, with the caveats above.
And should the authors read this: if you ever make a second edition, consider flipping the chapter structure. A final request from a reader who likes to follow up on reading: it would be awesome if you could provide a webpage with the additional resources as links for those who purchased the book, so that readers do not have to manually type in shortened links.