Incidents, tickets, and standardized learning
One of the struggles of learning from incidents in a business environment (i.e. “at work”) is often the time required to really dive into an incident and learn from it. Incidents are a vast source of interesting discoveries and ways to learn more about your systems and the humans that keep them running. But learning is really hard to time box. And in the usual work setting you kind of go by the defaults of your calendaring software - which is usually either a 30 minute or 1 hour meeting - to allot time for learning from an incident in a group setting. And given how much talk there always is about how expensive meetings are and how many meetings could be emails (or worse, a Slack thread), having a 1 hour meeting about an incident already feels a bit much sometimes. I’ve had many conversations about this in the past, especially when I used to facilitate lots of learning reviews for incidents myself as well as teach facilitation to my coworkers. And usually someone wanted to know how we know a 1 hour meeting is enough to get through an incident. The honest answer is that it’s never enough. But 1 hour is the somewhat arbitrary amount of time you are usually able to schedule amongst a large-ish group of people without too many conflicts. And yet it often already feels hard to justify why it’s needed, especially for a meeting whose outcome might not be as easily quantifiable as lots of people would like. The expectation is always that you have something to show for it after that hour: a bunch of remediation tickets to close, some reasons why exactly something happened, and steps and actions to make sure it will never happen again. Ideally one reason, one cause for why it happened, which can be easily prevented in the future.
I’ve thought a lot over the years about why that is. What is so alluring and comforting about single-cause incidents for a business? And conversely, what is so hard about accepting the fact that we will never be able to prevent another incident from happening? We’ll never fully “solve” an incident and we’ll never be able to describe and map a system to the extent where we can know the full impact of every future change.
The base unit of work at a company is a ticket
Where I’ve mostly landed when thinking about this is the realization that the base unit of work in a corporate context is a ticket (or an issue, or whatever your tracking software calls it). The way you know what needs to be done and that you have done a certain amount of work is that there exists a ticket in a tracker. Many conversations (and “process optimizations”) in a usual work setting sooner or later include a discussion about the fact that everything being worked on needs to be tracked in a ticket to make sure it’s all visible and surfaced. And once the work is done (or abandoned) the ticket needs to be closed to communicate its status (and sometimes status needs to be communicated on open tickets as well). These updates are then rolled up into higher level tickets and summaries for upper management to communicate what got done and what teams are working on. There is much more that could be said about this, but the point I want to get across is that the core measurement of work - planned, to be done, in progress, done, or even abandoned - is a ticket that can be closed.
Incident investigations are about learning
And now the contrast here is that reviewing and investigating incidents is about learning. Learning what was previously unknown and thus contributed to a surprise that manifested as an outage, reduced availability, data loss, or any other unwanted event that we commonly call an incident. And learning - as probably almost everyone has witnessed in some form before - is far from linear and sequential. Sometimes it’s very quick, but usually it takes its sweet time. This is especially true of learning through research, where there isn’t anyone who already knows the answer and can say whether a hypothesis or an understanding is true or false. It’s full of dead ends, red herrings, misunderstandings, re-discoveries, reformulations, conversations, disagreeing opinions, and probably late nights and long weekends.
Trying to wrap learning into a time boxed setting like work, where it can be reflected by a ticket that can be closed, is surely a challenge. The irony here is that a form of this already exists in arguably the main arena of learning: education. Every school and college setting knows the setup of students having to learn (and ideally understand) a topic in a given time and then pass a test to be able to mark it as done (or as learned). Essentially a trade off that attempts to summarize learning into a checkbox style format where the options are more or less either pass or fail. And many, many discussions have been had and continue to be had about this suboptimal setup, which incentivizes students to learn to pass the test rather than to understand the topic. It gives rise to many frustrations where someone who is able to recite the words or equations (potentially without having understood the meaning behind them) is given the same or a better grade than someone who took more time to dig into what something means and studied additional material around the topic to improve their understanding - but didn’t do as good of a job demonstrating that in a test setting. And it leads to a considerable number of people - not least in the software engineering world - actively despising education and its standardized testing.
But the truth of the matter is that learning can have more than one goal. There is no “true” and “false” learning. One way of learning has the goal of passing a test and the other is more focused on establishing and deepening one’s understanding of a topic. Both are valid and they satisfy different requirements. But you should know which one you choose and what you will get out of it. You can’t not learn, but you can definitely learn different things depending on how you approach it.
A question of focus
Tying this back to corporate incident investigations, we are presented with a very similar choice. Do we want to review an incident to understand the complex interplay of contributing factors that allowed it to manifest? Or do we want to be able to just close out the ticket already and move on? Both are valid in their own way, because there will honestly never be enough time to investigate every incident in full. There is other stuff that needs to get done as well, and you’ll never get the 6 or 12 or 25 people who were closest to the incident and know the most about what happened into the same room to share all their knowledge and experience and untangle the full incident (which is already a trade off, because ideally a facilitator would interview them individually to make sure there is no barrier to sharing).
However, the huge downside and difference from the education setting is that there is no actual “passing the test” for work incidents. Just by doing the due diligence to be able to say we can close the ticket, we don’t actually gain or pass anything. We just miss out on learning. So while it’s appealing to make it look (and feel) like an incident review was “completed” by following a linear, causal accident model to get to a single cause that lets us close the ticket, we just cheat ourselves out of valuable learning and insights.
Sidenote: MTTR
Viewing incident reviews as learning, and incidents as a source of and opportunity for that learning, also makes it a lot clearer why some common “measures” of incidents are not as useful as one might think. Let’s take the very popular metric MTTR (Mean Time To Recovery) for example, which is generally intended to denote how long it takes on average to recover from an incident. On the face of it, it makes sense. We want to be available as much as we can and work on making sure incidents - when they happen - are as short as possible. However, viewing incident handling (which comes before the review but has some overlapping properties) through the lens of learning lets us structure it into roughly the following phases:
- Understand what’s wrong (e.g. too many 500 errors on the website)
- Understand how this situation manifested (e.g. a combination of more traffic, a ramped up feature flag on a specific code path, and an upgraded app server version)
- Understand what to change to mitigate (e.g. ramp down the feature flag for now)
Of course, incident handling is more concerned with finding the contributing factor that is easiest to change - one that is still necessary but only jointly sufficient to give rise to the current incident - and changing it so the incident goes from active back to inactive. So it’s less about getting a full (as much as possible) view of the incident - that’s what the review is for. Nevertheless, in this view of incident handling it’s really three phases of understanding that we go through, so what MTTR really measures is MTTU (Mean Time To Understanding). And thus we are back in the same situation: we are basically trying to force learning into an arbitrarily time boxed, measurable, and summarized metric (similar to issues closed). Which, again, is something you can totally do. It just might not be very useful and might not serve you in the way you’d like it to. That makes it even more important to understand the limits and usefulness of a metric like that, so you don’t overly rely on it for the wrong reasons. Plus there are plenty more reasons why these measures are generally not all that useful.
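To make the “summarized metric” point a bit more concrete, here is a minimal Python sketch of how MTTR is typically rolled up: average the recovery duration of each incident over some period. The numbers are entirely made up for illustration, but they show how a single long-running incident can pull the mean far away from what most incidents actually looked like - exactly the kind of nuance a single roll-up number (much like a count of closed tickets) throws away.

```python
from statistics import mean, median

# Hypothetical recovery times in minutes for one quarter's incidents.
# These numbers are invented purely for illustration.
recovery_minutes = [12, 8, 25, 15, 9, 480, 11, 19]

# The usual MTTR roll-up: one number summarizing the whole quarter.
mttr = mean(recovery_minutes)
print(f"MTTR:   {mttr:.1f} minutes")                       # 72.4 - dominated by the single 8-hour incident
print(f"Median: {median(recovery_minutes):.1f} minutes")   # 13.5 - closer to what "most" incidents looked like
```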