29 Nov 2021
For decades now, the airline industry has been steadily improving its safety and reliability. For this reason, flying is considered one of the safest ways of travelling. This safety is largely due to a well-defined incident management process. The process is present at every stage, from the moment the pilot detects an anomaly through decision making, the outcome and the follow-up steps.
It takes 20 years to build a reputation and five minutes to ruin it. If you think about that, you'll do things differently
Benefits the whole organisation
When your company makes a mistake, the news appears everywhere, but what your company did afterwards is easily ignored. Having an incident management process will prevent most incidents from happening again.
But your product is not the only important thing; the people who build it are too.
Creating an operating model that has learnt from previous mistakes will prevent people from repeating them. It will allow them to perform their job better and with more confidence, and to focus their energy on innovating instead of avoiding errors.
As mentioned before, the incident management process is involved at every stage. To make things easier, let me use a timeline format.
As soon as we detect an anomaly
The airline industry uses the DECIDE model, where each letter indicates a step to correct the current course and avoid a disaster.
- Detect what went wrong.
- Estimate how much time there is to act.
- Choose a decision that will bring a desirable outcome.
- Identify what actions this decision involves.
- Do, as in, execute the decision.
- Evaluate the outcome of the decision and whether further action is needed.
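The DECIDE steps above can be sketched as a tiny runbook loop. This is a hypothetical illustration: the step names come from the model, but the anomaly, context fields and actions are invented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class DecideStep:
    name: str
    run: Callable[[Dict], Dict]  # takes and returns the incident context

def run_decide(context: Dict, steps: List[DecideStep]) -> Dict:
    """Run each DECIDE step in order, threading the incident context through
    and keeping a log of the steps taken."""
    for step in steps:
        context = step.run(context)
        context.setdefault("log", []).append(step.name)
    return context

# Illustrative steps for a made-up disk-space anomaly.
steps = [
    DecideStep("Detect",   lambda c: {**c, "problem": "disk almost full"}),
    DecideStep("Estimate", lambda c: {**c, "minutes_to_act": 30}),
    DecideStep("Choose",   lambda c: {**c, "decision": "rotate and compress logs"}),
    DecideStep("Identify", lambda c: {**c, "actions": ["logrotate", "verify free space"]}),
    DecideStep("Do",       lambda c: {**c, "executed": True}),
    DecideStep("Evaluate", lambda c: {**c, "resolved": True}),
]

result = run_decide({}, steps)
print(result["log"])  # the steps taken, in order
```

The point of the sketch is that every step leaves a trace in the context, which later becomes the raw material for the post-mortem timeline.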
A way to make this process efficient is by having an action plan or a disaster recovery plan. The name is not important. What matters is that the steps are documented, so that anyone on the team knows what to do to detect what went wrong and which variables to weigh when choosing a decision.
If the outcome is the desired one, the DECIDE process is closed, and the post-mortem starts.
We prevented the disaster. Now what?
Something happened, and we need to understand why and how we can prevent it from happening again. We are going to obtain these learnings from the post-mortem.
First, we need to set a timeline of events. For example:
- At 16:07, we detected an abnormal number of users logging in to the platform even though no communication was sent to them.
- At 16:21, a team detected that there might have been a breach of the users' database.
- At 16:23, the platform was disabled and put in maintenance mode to prevent any further logins until the event was investigated.
- At 16:35, all passwords were cleared, the platform was re-enabled, and a breach notification was sent to the users.
This timeline tells part of the story, from the moment users were affected to how we prevented the incident from escalating. We now need to establish what led us there. For this, we should prepend these events to the timeline. For example:
- At 12:45, a team deployed to production a feature that enabled users to perform some action (link to the pull request of the code). This pull request included an unsecured API that allowed anyone to connect to the database and run manual queries.
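One way to keep such a timeline is as structured data, so later findings can be prepended and the events re-sorted. This is a sketch of that idea; the types and descriptions are illustrative, not a prescribed tool.

```python
from dataclasses import dataclass
from datetime import time

@dataclass(frozen=True)
class TimelineEvent:
    at: time
    description: str

# The events as they were recorded during the incident.
timeline = [
    TimelineEvent(time(16, 7),  "Abnormal number of user logins detected"),
    TimelineEvent(time(16, 21), "Possible breach of the users' database detected"),
    TimelineEvent(time(16, 23), "Platform disabled and put in maintenance mode"),
    TimelineEvent(time(16, 35), "Passwords cleared, platform re-enabled, users notified"),
]

# A later finding from the post-mortem can simply be added and re-sorted:
timeline.append(TimelineEvent(time(12, 45),
                "Feature with an unsecured API deployed to production"))
timeline.sort(key=lambda e: e.at)

print(timeline[0].at)  # 12:45:00
```

Keeping events as data rather than prose makes it cheap to extend the timeline backwards as the investigation uncovers earlier causes.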
We have established what caused the breach, but we still don't know why it happened or how to prevent it.
Was the code review not thorough enough? Was the team rushed into shipping without the time required to do it well? Was the unsecured API part of the planned tasks?
Identifying the why allows us to create a way to prevent this. Let's imagine the API was intended to be there, just not unsecured. Then we can say that an automated test asserting that no API is reachable without credentials would have prevented this incident. We will add this new layer to what we are going to call the Swiss cheese model.
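The automated check described above could look something like this sketch. The endpoint names and the status-fetching function are hypothetical; in a real suite the fetcher would issue unauthenticated HTTP requests against a staging server.

```python
from typing import Callable, Iterable, List

def endpoints_missing_auth(
    endpoints: Iterable[str],
    fetch_status: Callable[[str], int],
) -> List[str]:
    """Return every endpoint that does NOT answer 401/403 to an
    unauthenticated request. An empty result means the layer holds."""
    return [ep for ep in endpoints if fetch_status(ep) not in (401, 403)]

# Stand-in for an unauthenticated HTTP client (illustrative data only).
fake_statuses = {"/api/users": 200, "/api/orders": 401}

offenders = endpoints_missing_auth(fake_statuses, fake_statuses.get)
print(offenders)  # endpoints anyone could reach without credentials
```

Run as part of CI, a check like this turns the lesson from the post-mortem into a permanent, automatic defence layer rather than a note in a document.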
As you can see in the graph, this model consists of several layers, each of which should catch an event or prevent it from escalating into a disaster. The holes are variables that we cannot contain.
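The Swiss cheese idea can be sketched as code, assuming each layer is a predicate that either catches the event or lets it slip through a hole. The layer names and event fields here are invented for illustration.

```python
from typing import Callable, Dict, List

# Each defence layer returns True if it catches the event,
# False if the event slips through a "hole" in that layer.
Layer = Callable[[Dict], bool]

def reaches_disaster(event: Dict, layers: List[Layer]) -> bool:
    """An event only becomes a disaster if every layer misses it."""
    return not any(layer(event) for layer in layers)

code_review = lambda e: e.get("reviewed_thoroughly", False)
auth_tests  = lambda e: not e.get("unsecured_api", False)  # the new layer
monitoring  = lambda e: e.get("alert_fired", False)

# The breach scenario: review missed it and the API was unsecured,
# but monitoring fired an alert, so the event was caught.
event = {"reviewed_thoroughly": False, "unsecured_api": True, "alert_fired": True}
print(reaches_disaster(event, [code_review, auth_tests, monitoring]))  # False
```

Every new layer added after a post-mortem lowers the odds that the holes in all layers line up at once.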
Not done yet
We have now established what happened, why it happened and how to prevent it, but a few steps remain.
We need to evaluate the impact. Was there a loss or a profit? How many users were affected? Is there any industry- or market-related process we should start? Assessing these variables will allow us to understand how critical the event was, or could have been. Everyone will then understand why following up on these incidents improves processes and why such methods are in place, even when they sometimes feel bureaucratic.
Another benefit is that it will allow management to see whether speed is being weighted too heavily against quality, and how to restore the balance.
But why do I need a manager for this? Because it's tough to make sure that all teams follow this process correctly.
Having an incident manager or an incident management team will make sure everyone understands the process and follows it as required. And if any legal process follows the incident, a clear understanding of what happened could prevent a company from going bankrupt.