War story: 'just five times as long'

I’m reflecting on some events in my career, trying to process some memories and learn the lessons. This story is one of those.

The Story

A batch process ran every Friday for users who reached a certain point in our workflow. Business users made contact one week because although this process ran regularly (apparently) without problems, this time it was for a few thousand users instead of the usual hundreds. They just wanted to know that there was some backup if things went wrong.

None of the programmers who had written this code were still in the company. In theory our team had the most context, but that boiled down to a trivial bug fix. “I don’t really know how this is supposed to work, but line 16 is obviously wrong”, I thought, as I approved the fix. I wondered how that bug got in there unnoticed. “Hasn’t this process been running every week for a year now?”

In any case, the process ran weekly, and now it just had to run at a slightly different scale. The worry seemed almost laughable. Hundreds of users? Thousands? 5x of normal? We’d really lost perspective on what “a lot” means for software deployed in the cloud.

“It’ll just take five times as long. How great is it that computers never get tired and will never make a mistake on item number 1998 out of 2000?”, I thought. “Nothing likely to go wrong”, I thought.

Nonetheless, I simulated the process with an hour or two to go before the real deal, in our staging environment. Things went instantly wrong. There’s a limited number of characters allowed in URLs and the code put user IDs into URL parameters, creating an absurdly long URL. Sort of ok for a few hundred, broken for a few thousand. A short fix later, and we were using a POST request instead of long URLs. Now for the real thing.

It still didn’t work. The server rejected the form on security grounds, but we didn’t figure that out immediately. Given the emergency (this particular process wasn’t going to happen again for several weeks, but it needed to happen Right Now) we reverted the first fix, and then hardcoded the list on the server.

It still didn’t work. The process started and then stopped. Certain users had problems, a few in a hundred, and after a while things just stopped.

The admin users took a risk: “I just pressed the button again, it seems to be going now!”. I had no idea what was going wrong, and would have experienced severe distress had I heard this plan in advance. “What if the process is creating broken data? And you want to do it again?”, I thought, after it was too late.

The process was already running, so we let it continue. It broke again, the admin users pressed the button again and so on. Friday afternoon turned into late Friday evening, with about half the users unprocessed. Everyone decided to come back on Saturday.

At some point on Saturday I worked out that the process ended up in another service, and the first service gave no useful feedback on errors in the second.

I looked at the hardware where the other service was running. There was an obvious memory leak, a red line trended upwards and then stopped when it couldn’t go higher. Time to go look at the logs to see what was happening.

The logs weren’t there. Faint cusswords were heard as I contemplated the sort of programmers who could set up a system without logs, until I realised that the logs were being generated. This service was just hosted in a different region from all of our other systems, and our central logging service didn’t work with the different region.

I found a substantially less user-friendly way to view the logs, and there wasn’t much. I decided to redeploy the service. There hadn’t been a build in a long time, but there were some failed timed builds, and these timed builds did go wrong sometimes, I knew. I re-ran the timed build, and it failed, again.

I now knew that there was no way to get the system to behave in a deterministic way, nor even to add some new logging code. I asked someone who knew more about our infrastructure. This snowflake system in a different region had caused all kinds of problems, but that feedback was quite distant from the decision makers, and too easily handwaved away. Everyone knew it was different, but it had to be hosted in that region For Legal Reasons.

Even in this case, everyone would lose interest soon. The last button press had left less than a hundred users un-processed. One more button press! It was only a few people going wrong in a hundred, and anyway, we didn’t normally run it for this many people.

Not so bad

From one perspective, it looks like things weren’t so bad: a business process worked for 5x the usual number of customers, with no preparation. Without thinking too hard about it, it seems like the worst of it was that some people lost a significant part of their weekend. But we weren’t going to do that again, so thanks everyone, let’s all go get a drink and forget about it.

For my own - and for readers - benefit, I ignore the above instinct. I’ll indulge my nervousness about things that are not predictable and continue.

Why so unpredictable?

We were responsible for code that we didn’t know at all, which left us at the mercy of our assumptions.

Unreliable feedback

Some of those assumptions were based on unreliable feedback:

Feedback from business users only: Business users don’t have the same expectations of code as programmers do, and their feedback can’t be treated the same. From a business perspective, getting the expected result after a few more button presses is fine. Unreliability could have been entirely normal for this process in the past. If a programmer was reporting on things, I would expect to hear some suspicion about unexpected stops and starts. If something unexpected is happening there, what else is going wrong?

Unhandled edge cases: I mentioned a bug that we fixed once before we got to the big run. That bug had caused the process to stop under certain circumstances. Some data would have to be massaged by the users, and then they had to restart the process. In other words, everyone who observed the process was conditioned to accept that is was a stop and start affair. Interruptions for other reasons were not being examined with the suspicion they deserved.

Assumptions about infrastructure

We (reasonably) expected repeatable builds with easy deployment, and centralised logging. That we didn’t hear about those being missing when this system was handed over to us is concerning. There were some communication difficulties at the time, so it’s possible that we actually did, but couldn’t take it all in.

Assumption of reliability at (slightly) higher scale

5x is a trivial increase in a world that lives with the Reddit Hug of Death. It’s a small factor, it was predicted, and unlike normal Internet behaviour, this was not simultaneous demand. We believed that the process ran predictably once a week, this was just like running the same process five times in a row.

It’s not like servers need time to recover after doing something. That’s why we automate instead of asking humans to do repetitive things. So why not?

I’ll go further and say that anticipating problems at 5x the load in ordinary cases would be premature optimisation and should be discouraged.

In the case of memory leaks, we do get an exception: the machine loses memory until it needs to be restarted. It actually does look fairly similar to a human who needs to take a break and come back fresh.

The kind of job that was being done per user in this case was actually something that could lead to memory leaks. Some pre-emptive awareness of that risk and explicitly documenting it would be best.

Automated monitoring might have helped here as well. It is probable that the monitoring would be more difficult because of the unique deployment of this system.

We got Lucky

This could have ended very much worse.

The process was on the critical path: if it didn’t happen, the business couldn’t do anything for these users, and wouldn’t make any money from them. Moreover, it had to happen far enough in advance of the next step, or that would be delayed too. Delays were pretty bad as this was a time of peak distrust from our users, and cash flow was extremely tight. Losing this money from those customers could have sunk the business.

There was also a risk of bad data being created. Money could have changed hands in subsequent steps of the workflow without us being aware of it, or we could have left suppliers unpaid after we thought we had transferred money for them. To make this consequence more human: people could have travelled to different continents on the assurance of funds from the business, and found themselves without money to feed themselves.

None of this actually did happen, but consider what was between us and this: code that we didn’t understand, difficulty getting more information about what it was doing (logging problems), and inability to fix any problems we did spot (build problems). We were on thin ice.

Long-term acceptance of luck is bad

Forgetting the things that could go wrong, there are still problems with the things that did go wrong.

First, there is some fairly intense scrutiny on later stages of this workflow, so perhaps the most drastic errors would have been caught. However, scrutiny on unreliable processes has a cost, and it tends to get into a vicious cycle:

something is unreliable so you build a process to review the results
you start relying on the review and become less fazed by unreliability
more things go wrong
you build an even bigger review
go to 2

This costs more time and energy. Either you risk burnout and people leaving (not a good start that people were already giving up a weekend), or it become’s someone’s full time job to watch those processes. It will not be a fulfilling job, because the best they can do is Make It Not Go Wrong.

How to avoid all this

From my reflections, I see a list of “insteads” that could help avoid this sort of situation:

Instead of tolerating code you don’t understand, work to understand it, or if there’s no time, be aware of it as a risk
Instead of accepting feedback from the business at face value, be aware of technical things that they might not notice
Instead of tolerating edge cases that require human workarounds, be suspicious of them and try to make them explicit in the code and UI
Instead of considering all code equal, be aware of unusual dependencies, e.g. things that could cause memory leaks and document or monitor them
Instead of treating all errors from dependencies equally, make sure that logging and error handling strategy is appropriate for different cases
Instead of accepting unusual deployment and infrastructure, make very sure you understand the things that become more difficult, and the cost of that difficulty. Communicate those risks

There’s an element here of learning, which comes from experience that can be gained quickly (by being around other experienced people who are prepared to share responsibility, perhaps also by reading some books and blog posts). It can also be learned more slowly, by never having any responsibility, or never looking after anything critical.

There’s also an element of power. A developer who is not well known to the business, or an external contractor will have difficulty communicating what they know: “I know the button works this week, but we don’t have logs”. It’s too easy for that to be drowned out by the person who wants them to move on to the next feature.

A good tech culture helps here: other programmers in the business can hear about your problems and add their voices in raising the alarm. In the case of the team that built this code originally, they were isolated from the rest of the tech team in this business. They were always up against it.

This post was originally one bullet point in another post. Longest bullet point ever! Also, conclusions are hard.