Buy vs. Build: Navigating Architectural Decisions
So, today's post touches on a very large topic, one that cannot be covered in a single post and that I will come back to on several occasions. However, I wanted to start somewhere, and now is as good a time as any.
Recently I have been thinking about this quite a lot. It is a topic at my day job, it has been with me throughout my career, and most recently I was listening to a Changelog and Friends episode (“It dependencies”, The Changelog: Software Development Podcast, 2023.11.17). As a result, I thought I would tentatively throw my hat into the ring on the matter.
The great debate
Buy vs. build appears to be a divide that cannot be bridged: either you are a build person or you are a buy person. When the answer to so many questions in technology is “it depends”, it is interesting to watch how vehemently that answer is contested here. And just to get it out of the way, I am an “it depends” person, leaning slightly towards the buy end of the middle, for reasons I will come back to.
Observing the public discourse on the matter, particularly within the data world, I find that the question at stake always seems to be whether one should build a given tool or technology in-house or buy it from a vendor. It is generally couched as a discussion of principle or design.
I have the impression that the question often devolves into an argument between traditional software developers and BI/analytics people. Beneath the facade of the topic at hand lies the question of value: what brings the organisation value?
Justin, the Changelog host, quotes Akmar Nazari as saying that you should only build the thing that only your team can do and that truly differentiates your business; everything else you buy in. That is the mentality I commonly observe in the hardcore buy camp. The question then becomes: who determines what brings business differentiation? Here, unfortunately, I see a lot of myopia creep into the discussion.
Every business unit will see itself as a business differentiator, and the larger the business and the more internally divided and insulated its areas are, the more the myopia sets in. Software developers will always see the tooling and applications as what gives a distinct differentiation. BI and data people will always see the data as the most important aspect. Security will see themselves as the shield that keeps the business safe, without which nobody would be here. So who is to determine what counts as business differentiation in Nazari's sense? I don’t know, but I do know that it cannot be the people themselves.
Alternative metrics
So are there other metrics that we can use to determine when to buy vs. build?
Risk
Risk might be one candidate. What are we comfortable losing, failing, or stopping? Many on the buy side of the discussion, myself included, see software built in-house as a liability and a risk, especially if creating such software isn’t the main focus of the company's or unit's work. Anything you build in-house is a commitment to maintain and support it for as long as that system is in use. That means that at any point in time, you need to retain staff skilled in both the operation of the system and the technology used to build it. A system left unattended for any length of time is likely to fall into obscurity and neglect, making it an ever-increasing liability in your stack. By that metric, we probably should not build any tooling at all and should just buy EVERYTHING**.
Unfortunately, that argument neglects the risk posed by bought services. Vendors are a liability and a risk just as much as the tooling you build yourself. Yes, you can probably assume that you now share the risk of technical failure with several other customers, which gives the vendor an incentive to fix technical issues. However, you are not protected should the vendor disappear or make radical changes to their system that render it incompatible with yours. Such changes are out of your control when you buy. It remains your responsibility as an organisation to retain enough skilled staff to adjust downstream dependencies to match the changes vendors make.
Suppose you have decided to buy a system or tool. Now you can just plug and play, right? Unfortunately, the reality is rarely that pretty. Vendors have an incentive to make the first steps, such as setting up connections or access, easy. This initial setup rarely meets anything but the most basic use cases. Most tooling needs tweaking and adjustment to fit as seamlessly as possible into your stack, and in the long run any remaining sharp edges or integration issues become pain points, either for users or for maintainers.
The more tools you need to integrate, the more complicated the web of interactions and dependencies becomes, and the more it demands of you as an organisation to support. No vendor can support every combination of tools and styles of workflow, so companies are often left to figure out and support their system themselves, which leads them to create “glue-tooling”*. In this regard, a bought tool stack is not much different from a built one. While the individual components might not be built in-house, the system as a whole should be treated as if it were, and it therefore requires a commitment to retaining staff with the skills to support and maintain it.
Risk, then, is a difficult metric for deciding when to buy or build: even entirely bought stacks*** share much of the risk profile of built systems, albeit with a somewhat lower headcount on the staff side than completely self-built systems. Is there another metric? I have mentioned the need for skills several times. Might skills be the determining factor for when to buy vs. build?
Current skill
In my personal opinion, the skill of the people working on your system is the most important characteristic to improve. Using it as a metric for buy vs. build, however, is challenging. The buy argument here would focus on buying tooling that matches the current skills of the workforce that will maintain the system, while the build argument would focus on having the skills to create the system you need. The problem is that the argument is moot: in either case, you are unlikely to consider making any changes if this were your guiding metric. The system you have the skills to maintain is the system you already have. If you have staff with the requisite skills to build the system you need, it is highly likely that they have already built it.
However, suppose for the sake of argument that a change is needed and continuing as you are is not an option. The argument remains moot: you do what you already do. If you built the system, you build its replacement. If you bought the system to be replaced, you buy again, as you likely do not have the people to build it. Severe challenges arise if the external pressure comes from economic changes or changes in staff composition. In that case, there is only one way to go: from build to buy. Integrating and maintaining a primarily bought or open-source stack requires far fewer people than maintaining an entirely in-house built stack. It is impossible to go from a bought stack to a built one without large changes in staff composition, and I would only recommend that transition to the most dedicated of companies, those that can absorb the substantial loss of productivity and increased cost that come with such a transformation.
What about user skills? The argument here is much the same as for maintainer skills. The system you already have is the one your users are most skilled in. They will have learned the quirks of any vendor system, and any in-house system will likewise have been learned or adjusted. Again, this argues against any change.
This is a different discussion that we will come back to another time, but I occasionally hear the argument that one changes systems to enable new possibilities. If it is possibilities you wish to unlock, remember that any such potential must be matched by the skill of your people. If there is a gap between the potential benefits and what your people can support, those benefits are unlikely to be realised.
Cost
Business differentiation, risk, and skill are all suboptimal metrics for deciding when to buy vs. build. What about cost? The wish to reduce cost is one of the most frequently touted arguments for change, driven in part by the promise that the cloud will be cheaper than the on-premises systems of yesteryear. So might cost be the metric to use? Possibly. In my experience, the challenge with using cost as a metric for when to buy vs. build is that it is hard to get solid numbers to measure with. Why? Unfortunately, it is not as simple as comparing the money going out under one option with the money going out under another. Most companies operate with separate budgets for things like hardware, people, licences, training, development, and maintenance, and it can be surprisingly hard to get a good grip on exactly where the money goes. To make matters worse, the cost of a solution that has not yet been implemented is hard, if not impossible, to know in advance. While salespeople will happily present licence costs and estimated time savings, these are all hypothetical. As such, cost can only ever be a trailing indicator: only once you have made the change and completed the transition will you have the numbers you need to determine whether there is a true cost benefit.
People as a leading metric
Without predictive power, cost is out as well, assuming we want to make fact-based decisions. To my knowledge, there is little else that is or could be used as a metric that is not covered in some way by these four. So what is one to do? It depends. Why do you need to consider changing? So far, the best predictive tool for making the decision is the skill of your maintenance staff and users. Whatever you choose, if you do not have the people to support it, it will not work.
As such, if you already have people who build systems, are comfortable with the risk profile of maintaining systems indefinitely, and are happy with the flexibility and independence that come with a homegrown system, keep doing that. Your organisation is likely already set up to support this pattern. If you seek to change the makeup of the organisation, or want to outsource parts of the technology because they are not deemed “business differentiating”, you are in a luxury position: the people at hand are likely already used to several of the processes needed to make that transition. It is, however, important to remember that such a change is drastic for the people involved, and sufficient time should be dedicated to preparing for it.
If you do not have the people to build and maintain a system in-house, you should not embark on it. As stated earlier, it takes an exceptionally dedicated organisation to transition into a software-building company if it is not one already, and that transition should be made before starting on such an undertaking.
A slightly less radical change is moving from one vendor-based system to another. Here again, the people maintaining the system should be the focus. Before you embark on a vendor transition, make certain that your maintainers have the skills to support the new vendor's system. Are they able to configure and set it up? Can they create and maintain the necessary “glue”?
Incidentally, this is where I see several challenges with the current modernisation trend in the data space. Vendors' incentive to make the first steps easy can give a false impression of the work, support, and skill needed to fully support a tool. But I will save more on this for later.
The common denominator in these assessments is people's skills. Either you have the skills to support your desired architecture or you don’t. If you do not have the skilled resources available but still must or want to change your setup, your organisation must focus on upskilling. And it is here that we may find the one metric that provides hard data for judging buy vs. build. What are you willing to upskill for? What architecture are you willing to train or recruit for? Are you willing to pay the time and money it takes to train your in-house staff? Are you willing and able to pay what it takes to attract people with the skills you need? If the answer to these questions is yes, do that. Try to recruit, and see whether you can attract the talent. Start training programs, and see whether your staff take to the training.
Both the training and the recruiting processes give you an indication of whether you will succeed in your change. Perhaps you cannot attract people with the skills you need, or you don’t want to recruit them; perhaps your staff resist the training or are unable to acquire the skills, or you don’t want to train them. Each of these four scenarios gives you a relatively cheap and leading indication of the likely success of your architectural change.
It is worth mentioning that cheap is a relative term in this case. Training and recruitment are by no means cheap, whether in monetary terms or in reduced productivity (during training) and unrealised potential (in recruitment). But these costs are negligible compared to the cost of a failed architectural change. That includes all the costs of training and/or recruitment, in addition to potential harm to morale, employee retention, and reputation, the need to maintain multiple systems, the degradation of services that are “abandoned”, and the increased cognitive load of greater overall complexity.
Closing thoughts
To round out this lengthy discussion, let me summarise the main points. The best approach is to continue along the architectural patterns you already know. If you are a software-building organisation, you can and probably should continue to build systems; if you are an organisation based on vendors or consultancies, you are most likely to succeed with changes that continue those patterns. For cases where you decide to change your approach, for one reason or another, we have looked at business differentiation, risk, cost, and skills as possible metrics for deciding whether to buy or build a system. Business differentiation and risk are poor metrics for this decision, as both the definition of and appetite for either depend heavily on your position in the organisation and are therefore open to interpretation. Cost, so often used as a metric, is unsuitable, as it will only ever be a trailing metric if we consider the cost of the change as a whole, and to decide between buying and building, we need leading data.
We have seen that wherever you are on the buy vs. build spectrum, there will always be a distinct set of skill requirements. By default, you have the skills to maintain your current design, so any change in design will require a change in skills. We posit that by attempting to acquire the needed skills, you can get an indication of how likely your change is to succeed. That effort will need to be part of any actual change anyway, and it also gives you more realistic numbers for parts of the cost of the change.
Of the metrics considered here, the ability and willingness to obtain the skills needed to maintain the change you seek is the best indicator of whether a system change is achievable.
This has been a lengthy discussion, and there are several other aspects to it, but I wanted to start off the discourse. I would also like to point out that this is based on logical argument and personal experience. For it to have any real-world validity, it needs to be tested and retested. And while I believe the premise is sound, there may be aspects that change this.
Endnotes
*: Scripts and processes intended to help vendor tools communicate with each other, or to improve the user or developer experience.
**: That includes data analysis and pipelines, as, unless you are a statistics office or research institute, that isn’t your core business.
***: The exception being entirely end-to-end solutions from a single vendor.