The Evolution of Incident Management

Have you ever thought about the history of incident management?

If you’re an SRE, you may be so caught up in the day-to-day work of reliability management and incident response that you never take the time to step back and reflect on your changing role and your responsibilities. And that’s a shame because SREs didn’t invent incident management concepts and strategies on their own.

Rather, the way SREs approach incident response, structure incident management teams and prioritize incidents owes much to incident management strategies developed in the offline world decades ago. To fully understand what it means to be an SRE today, you must appreciate this deep history of incident management and response.

So let’s take a look at that history and trace the origins of modern incident response concepts.

Historical issues in incident management

Companies have always had incidents, of course. Fires, floods, infrastructure breakdowns and similar crises have been happening for millennia.

For most of history, however, humans had no efficient, systematic way to deal with such incidents. Response efforts were ad hoc, and when they succeeded, it was due in no small part to sheer luck.

Particular challenges included:

  • Lack of effective and consistent communication between stakeholders.
  • Variable organizational structures that made it difficult to identify leaders, coordinate response efforts and delegate tasks.
  • Inconsistent response strategies.
  • Differing approaches to assessing the priority of incidents.

Historically, organizations may have been able to manage incidents well enough if incidents required a response from a single small group. But the more stakeholders involved, the more difficult it was to respond quickly and effectively.

Putting out the fires: the birth of the ICS

Things began to change for the better when stakeholders started looking for more effective ways to put out fires, literally.

In the 1960s, California fire chiefs realized they were struggling to respond effectively to the wildfires that erupted each summer. Each year brought worse fires than the previous one, with more land scorched and more buildings lost. The Laguna fire of 1970 brought things to a head and became the catalyst for a new approach to incident response in the fire service.

After assessing what was wrong, fire chiefs determined that the problem was not a lack of equipment or personnel. It was poor coordination between the various firefighting agencies responding to the fires. Lacking a clear chain of command and a systematic approach to firefighting, agencies struggled to deploy their resources quickly and effectively.

To solve the problem, California fire chiefs developed what became known as the Incident Command System, or ICS. ICS defined a hierarchy for incident response with an Incident Commander at the top. It also defined several categories of incident response processes, including operations, planning, logistics, and finance. And it established a consistent set of terms that stakeholders could use to describe their actions during incident response, making it easier to communicate clearly.
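For SREs who think more easily in code, here is a minimal sketch of how that hierarchy might be modeled. It is an illustration only, not part of ICS itself: the `Incident` class, its methods and the example names are hypothetical, and the rule that unassigned sections stay with the commander is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Section(Enum):
    """The four process categories named in the text."""
    OPERATIONS = "operations"
    PLANNING = "planning"
    LOGISTICS = "logistics"
    FINANCE = "finance"


@dataclass
class Incident:
    """Hypothetical record of one incident and who is responsible for what."""
    title: str
    commander: str  # a single Incident Commander sits at the top
    section_leads: Dict[Section, str] = field(default_factory=dict)

    def assign(self, section: Section, person: str) -> None:
        """Delegate a section to a named lead."""
        self.section_leads[section] = person

    def owner_of(self, section: Section) -> str:
        """Return who is accountable for a section.

        Assumption of this sketch: a section without a lead stays
        with the commander, so ownership is never ambiguous.
        """
        return self.section_leads.get(section, self.commander)


# Example usage with made-up names.
incident = Incident(title="Checkout API outage", commander="alice")
incident.assign(Section.OPERATIONS, "bob")
```

In this sketch, `incident.owner_of(Section.OPERATIONS)` returns the delegated lead, while `incident.owner_of(Section.PLANNING)` falls back to the commander, which mirrors the clear chain of command that ICS was designed to provide.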

Although ICS was originally designed to fight fires, it has become the de facto standard for organizing incident response strategies of all types.

From ICS to NIMS

The incident response story doesn’t end with ICS. A new chapter began in the early 2000s when the US federal government developed an even more comprehensive approach to incident management called the National Incident Management System, or NIMS.

NIMS was born following the terrorist attacks of September 11, 2001, which highlighted the importance of effective communication not only between different agencies of the same type (such as firefighters), but also between entirely separate organizations. To achieve this, NIMS built on the principles of ICS.

In addition to adopting most of the incident command principles and practices of ICS, NIMS added standards for coordinating resource dispatch. It also embraced the concept of an emergency operations center, which in some ways resembles a network operations center in the digital world.

In some ways, NIMS looks like a compliance framework (although, to be clear, it is not one). It includes fourteen management principles, similar to compliance controls, that organizations should implement to manage incidents using a NIMS approach.

Incident management today

Obviously, putting out wildfires and responding to terrorist attacks are quite different from dealing with a data center outage or a buggy application deployment. ICS and NIMS were not specifically designed for site reliability engineering or IT teams.

Still, the influence of ICS and NIMS on how SREs think is pretty clear. Terminology such as “incident commander” derives from these frameworks. The same goes for concepts such as shared responsibility for incident response processes and the importance of involving all stakeholders, not just technical teams, in incident response.

ICS and NIMS may not be familiar acronyms to most SREs. But they should be, because they are the historical sources of the incident management philosophies that underpin SRE work, and they offer valuable lessons for any SRE on the job today.
