Being reminded of a resolution target while you’re trying to fix a problem adds stress for the engineers doing the fixing, and that additional stress degrades their performance. Probably not what the incident commander and managers wanted, is it?


Worse Decision Making

Sometimes managers rush their decision making or apply half-baked fixes that engineers advised against in order to hit a resolution target or incident SLA. Or my favorite example of insanity: “this incident has been running more than an hour, which is the resolution time for a P2 issue, so let’s downgrade it to a P3.” WAT?!

If you have long or unpredictable resolution times, you can use learning reviews (aka ‘post mortems’) and continuous improvement to improve understanding and reduce complexity of your system.


The ability to fix a problem in a predictable amount of time is governed by the relative complexity of that system and situation. If you have an Obvious or Complicated situation, perhaps you have a runbook or an expert that can resolve the incident predictably. If neither of those statements is true, then the only thing you can be confident of is that your incident resolution times won’t have a nice predictable normal distribution.
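One rough way to check whether your own resolution times look "nice and normal" is to compare the mean with the median and a high percentile. The sketch below is only an illustration: the function name `summarize_resolution_times` and the sample numbers are made up for this example, not taken from any real incident data. When the mean sits far above the median, a few long incidents dominate the average, and a single target time describes almost none of them.

```python
import statistics


def summarize_resolution_times(minutes):
    """Summarize incident resolution times (in minutes).

    Hypothetical helper for illustration. Assumes at least two
    positive durations.
    """
    ordered = sorted(minutes)
    mean = statistics.mean(ordered)
    median = statistics.median(ordered)
    # quantiles(n=10) returns 9 cut points; the last is the 90th percentile.
    p90 = statistics.quantiles(ordered, n=10, method="inclusive")[-1]
    return {
        "mean": mean,
        "median": median,
        "p90": p90,
        # A ratio well above 1 suggests a long right tail,
        # which a normal distribution would not have.
        "mean_to_median": mean / median,
    }


# Made-up sample: most incidents are short, two run for hours.
sample_minutes = [12, 15, 18, 20, 22, 25, 30, 45, 240, 600]
summary = summarize_resolution_times(sample_minutes)

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

This is a crude signal, not a statistical test. It is meant for a learning review after the fact, not as something to wave at responders during an incident.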


There’s no point in duplicating business and customer impact of an incident with a target resolution time.  If incident responders don’t understand or don’t care about the impact of an incident to their customers or their business, that organization is likely suffering from a lack of management and leadership. Driving understanding of what is important to customers and the business is a primary responsibility of managers and leaders. Further, if leaders can’t describe a scale of incident severity with sufficient precision that everyone can use it, how will they set the resolution targets? My guess is ‘arbitrarily’.  And why would responders care?  They won’t.

So was Kaimar talking about Obvious or Complicated systems and J. Paul Reed about Complex systems?  Maybe?  Either way, I don’t see how a target resolution time is relevant for incident responders.  Responders are going to use the knowledge and tools they already have on hand to resolve the incident as opposed to building response capability during the incident.




So something like this would be a bad indicator to determine the priority of the incident (as it competes with others) without having to (re-)negotiate the priorities with relevant stakeholders to determine business impact? Genuinely curious jtbc :)

On Friday, J. Paul Reed sparked an interesting thread when he said that an organization that includes anything resembling a “target resolution time” in its incident management process doesn’t trust its engineers to do their jobs:

If your engineers already understand an incident’s impact to the business, a resolution target is duplicative, at best. During critical incidents, engineers are (or should be) the ones gathering the data, using that context to determine severity according to a standardized severity scale, and explaining that context and appropriate options to managers.


My take is that J. Paul Reed is right in practice. This thread triggered flashbacks to all of the negative effects listed in the thread and more.


As you stabilize the system, you should see the time to resolve incidents improve.  However, incident resolution times are a lagging indicator of team and organizational performance, not constructive guidance during incident response.

However, resolution times are more valuable as a measure of how much understanding and control you have over a (sub-)system.  When the team building and operating a system doesn’t have understanding and control of that system, resolution time is a better measure of management, leadership, organizational culture, and system architecture than of the engineers on the incident call. (J. Paul Reed and Kaimar Karu might agree.)