Engineering discipline is not for you, now. It’s for the person who has to fix your code a year after you leave, when they should be celebrating their partner’s birthday. What you are about to read is a true story. It’s neither an exaggeration nor several stories stitched together. It’s all real.

A promising start: automating a business process

A certain business process had been automated three years before. A small team of engineers had formed to meet some legal compliance requirements, written the first version, and then left it in the hands of a product team that included one of the initial engineers.

Settling into long term maintenance

It was plain sailing for a few years. The code expanded to automate more aspects of the compliance problem, and a third-party system was integrated to handle some issues that were not first-class concerns of the business logic. The team turned over naturally, with some members leaving and others taking over.

The first crack: fewer available maintainers

The first crack was that the third-party system didn’t catch on with other parts of the business (even though it was widely applicable). It became a unique snowflake, poorly understood by anyone who wasn’t in the original team. As a result, natural attrition left fewer and fewer people with good contextual knowledge.

Team attrition? Thank goodness for contractors

Attrition was worrying, accelerated by some interpersonal issues and the standard flow of enquiries from hiring competitors.

The apparent saving grace against attrition was the contractors who had been present for the maintenance of this code. They appeared happy, and the relationship between contractor and client was strong, so they weren’t going anywhere.

When the last internal person left, the contractors became more isolated from the rest of the business. Three internal product owners engaged with the space and then moved on, but these contractors soldiered on.

However, they had less oversight and lacked awareness of the broader goals of the company. The longer this continued, the more unique and less understandable the code became.

Funding difficulties mean no more contractors

Unfortunately, the business hit funding difficulties when the VC darling of the local tech community pulled out of a funding deal. The details are irrelevant here, but suffice it to say that the show this investor puts on for the local tech community doesn’t square with the way they handled this.

All contractors (and a third of the permanent workforce) were out within a month of the deal failing, leaving no-one who knew the code.

Four engineers replace two entire teams

Four engineers found themselves in the space vacated by two entire teams, looking after various systems including this code. Over months they pieced together a picture of how it was supposed to work, learning (amongst other things) that the process had been broken for months before they took over. Fixes went in, and increased monitoring was added.

Attrition continues

Unfortunately, attrition had not taken a break during the crisis. Ten months later, one departure, one pending resignation, an extended period of leave for a third member, and a hiring freeze had left the team poorly positioned for the ongoing flood of production issues.

The final two: Carlos and Fred

The two available members (let’s call them Carlos and Fred) soldiered on, trying to make sure that the old code would be safe with fewer and fewer people available to maintain it (if you’re counting, the two would soon be reduced to one after Fred completed his notice period).

Crisis point: Friday at 3pm

One Friday, this code didn’t produce what was expected. The process needed to be complete by Saturday at 5pm, or lawyers had to get involved. It was Friday at 3pm and Carlos and Fred had to jump on it asap.

Fred’s family matters

Fred was expecting to celebrate his partner’s 30th birthday the next day. He left at 5pm, grateful for Carlos’ willingness to handle the problem for what he assumed would be only a few more hours. Fred enjoyed a pleasant evening of birthday events with the family.

Carlos works all night

The next morning, Fred woke to see that Carlos had been working all night on the problem and couldn’t continue. The happy path was fixed, but the lawyers would still need to get involved for the exceptions. Fred believed that with a few hours’ effort he had a fighting chance of tracking down the exceptions, but again, it was his partner’s birthday.

Messy code is worse in a crisis

With Fred distracted by party preparations, it was not a good time to deal with: logging that was unfamiliar and slow to search; builds that took 45 minutes to get anything into production; flaky tests that forced those 45 minutes to start again; data investigation tools that took hours to fetch the necessary data (and, it was later realised, would have produced the wrong data); and some poor abstractions from years of maintenance without strong awareness of the big picture.

What should have been a simple matter for Fred (read some logs, run some code based on the results) ended up going nowhere. Eventually the truth had to be faced: they’d pick this up on Monday, with or without the lawyers.

The moral about discipline: it’s not for you

The moral concerns the engineering decisions you make frequently, or what I’m calling engineering discipline. Is your code understandable and resilient? Are errors easily understood and recoverable? Are builds fast and stable? Are feedback loops fast enough to learn and make changes? Is your system unusual in a way that will reduce the number of people who can contribute usefully?

When you make these decisions, consider that the person for whom they matter will not be you in your current context, with your current teammates.

Your discipline will affect someone else

Discipline will matter a year after you and all your teammates are gone, when someone has worked through the night or is severely distracted by important, once-in-a-lifetime family events. It will matter when lawyers need to get involved if things go wrong, and it will matter when the company has less money to spend on tech than it does now.

Caveats

There are a few caveats here:

First, companies sometimes decide to grow today at the expense of sustainability, because there will be no point in sustainability if the business does not exist tomorrow. Product leaders or management may even oppose some of what I’m calling good discipline for this reason. I feel fortunate that in my current role, people are particularly receptive to ideas on how to avoid disasters, but it was not always like this.

Less charitably, middle management may be ignorant of the consequences, and their incentives are often out of line with the needs of long-running systems. They may even be consciously aware that the next rung of the ladder will come before they would have to deal with the consequences of making unsustainable decisions.

Conclusion: be disciplined, teach “upwards”, establish a culture

As engineers, it often falls to us to educate people on the benefits of engineering discipline and sustainability. Much of this discipline will both make you more sustainable and allow you to move faster (there are many books on this topic). Unfortunately, people who have not seen good discipline in action may struggle to understand.

Lastly, no single person’s decisions matter on their own, because that person will not be around to see the results. What matters is that there is a culture that can push everyone in a healthy direction.