It's imperative to offer flexible communication channels throughout the incident response process so teams can stay in touch by their preferred method. Jira Service Management integrates multiple communication channels to minimize downtime, such as an embeddable status widget, a dedicated status page, email, chat tools, social media, and SMS.
In this tutorial, we’ll show you how to use incident templates to communicate effectively during outages. The templates are adaptable to many types of service interruption.
An incident management solution like Jira Service Management will help in each step of the response process, from organizing your on-call schedule and alerting to unifying teams for better collaboration to running incident postmortems.
To gain a shared understanding of priorities, roles, and processes, any team that’s starting or revisiting their major incident management process should start by getting clear on the answers to questions like:
Major incident management (often known here at Atlassian simply as incident management) is the process used by DevOps and IT Operations teams to respond to an unplanned event or service interruption and restore the service to its operational state.
The definition of emergency-level varies across organizations. At Atlassian, we have three severity levels and the top two (SEV 1 and SEV 2) are both considered major incidents.
Throughout this process, the incident manager keeps a close eye on how things are going. Are particular team members overtasked? Does someone need a break? Do we need to bring in a fresh set of eyes? More delegation happens as needed.
By the time an incident reaches our teams, it’s already got a SEV 1, 2, or 3 attached. We consider SEV levels 1 and 2 to be major incidents, while a SEV 3 indicates a lower-impact incident.
Once we’ve confirmed that the incident is real, communication with our customers and employees becomes top priority. As we say in our handbook:
Root cause analyst or problem manager: The person responsible for going beyond the incident’s resolution to identify the root cause and any changes that need to be made to avoid the issue in the future.
Unfortunately, when it comes to incident resolution, there’s no one-size-fits-all approach, which is why at this stage of the process we take the time to:
When responding to an incident, communication templates are invaluable. Get the templates our teams use, plus more examples for common incidents.
Roles and responsibilities will vary based on your organization’s culture, team size, on-call schedules, and more. Some common major incident roles include:
Customer support lead: The person in charge of making sure incoming tickets, phone calls, and tweets about the incident get a timely, appropriate response.
The page alert we send out at Atlassian includes information on the severity and priority of the incident, as well as a summary, making it clear—at a glance—whether this is the top priority or can wait if another incident is in progress.
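As a rough illustration of the alert contents described above (severity, priority, and a summary), here is a minimal Python sketch. The `PageAlert` class, its field names, the `"P1"`-style priority labels, and the rule that a lower SEV number takes precedence are all assumptions made for this example; they are not Atlassian's or Jira Service Management's actual alert format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageAlert:
    """Hypothetical page alert carrying the three fields the text mentions."""
    severity: int   # 1, 2, or 3 (SEV level)
    priority: str   # assumed label such as "P1"; the naming is illustrative
    summary: str    # one-line description of what is broken

    def headline(self) -> str:
        # Put severity and priority first so they can be read at a glance.
        return f"[SEV {self.severity}] [{self.priority}] {self.summary}"

    def takes_precedence_over(self, other: "PageAlert") -> bool:
        # Assumption: a lower SEV number is more urgent than a higher one.
        return self.severity < other.severity


new_alert = PageAlert(severity=1, priority="P1", summary="Login is down for all customers")
in_progress = PageAlert(severity=3, priority="P3", summary="Slow report exports")
print(new_alert.headline())
```

With these assumptions, `new_alert.takes_precedence_over(in_progress)` is true, which is the "top priority or can wait" decision the alert is meant to support.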
Incident management processes vary from company to company, but the key to success for any team is clearly defining and communicating severity levels, priorities, roles, and processes up front — before a major incident arises.
Sometimes major incidents require a single incident manager and a small team. Other times, a situation may call for multiple tech leads or even multiple incident managers. The original incident manager is tasked with figuring out when that’s the case and bringing on the appropriate people.
At Atlassian, our incident management process includes detection, raising a new incident, opening comms, assessing, sending initial comms, escalation, delegation, sending follow-up comms, review, and resolution.
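The ordered steps listed above can be sketched as a simple sequence. This is only an illustrative model: the `Stage` enum and the `next_stage` helper are hypothetical names, and a real incident may skip or repeat steps (for example, escalation only happens when the on-call team cannot resolve the issue).

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Incident process steps, in the order the text lists them."""
    DETECTION = "detection"
    RAISE_INCIDENT = "raising a new incident"
    OPEN_COMMS = "opening comms"
    ASSESS = "assessing"
    INITIAL_COMMS = "sending initial comms"
    ESCALATION = "escalation"
    DELEGATION = "delegation"
    FOLLOW_UP_COMMS = "sending follow-up comms"
    REVIEW = "review"
    RESOLUTION = "resolution"


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the step that follows `current`, or None after the last step."""
    stages = list(Stage)  # Enum members iterate in definition order
    idx = stages.index(current)
    return stages[idx + 1] if idx + 1 < len(stages) else None


stage: Optional[Stage] = Stage.DETECTION
while stage is not None:
    print(stage.value)
    stage = next_stage(stage)
```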
As the incident continues to progress, another round of communication outside the tech team will help keep customers and employees calm, trusting, and in the loop. This is easy when collaborators can manage alerts across different communication platforms to stay on top of incident response.
The incident lifecycle (also sometimes known as the incident management process) is the path we take to identify, resolve, understand, and avoid repeating incidents.
The incident manager has been alerted and the communication channels are open. Next step: assessing the incident itself.
First, an incident is detected by our technology, customer reports, or personnel. Whoever detects the incident (be it a technician who notices the issue or a customer service rep who gets a call from a frustrated client) is responsible for logging the incident in our system and identifying a severity level.
We have a strategic incident communication plan and provide regular status updates that follow a simple format. We also send an email to a set list of stakeholders that includes our engineering leadership, major incident managers, and other key internal staff. As previously mentioned, all of these communication methods are customizable within Jira Service Management and can be tailored to any organization's incident response plan.
In Jira Service Management, responders can group related tickets and add collaborators to the issue to coordinate alerts. Responders can also automatically record all actions with a rich incident timeline and access automation and knowledge base articles to rapidly investigate and remediate incidents.
Sometimes, an incident is resolved quickly by the on-call team. But in cases where that doesn’t happen, the next step is to escalate the issue to another expert or team of experts better suited to resolve this specific incident.
If a customer-facing service is down for all Atlassian customers, that’s a SEV 1 incident. If the same service is down for a subset of customers, that’s SEV 2. Both fall under the heading of major incident and require an immediate response from our incident management teams.
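The severity rule above can be written as a small decision function. This is a minimal sketch, not Atlassian's actual tooling: the names `Severity`, `classify_outage`, and `is_major` are hypothetical, and mapping every other case to SEV 3 is an assumption, since the text only says that SEV 3 indicates a lower-impact incident.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


def classify_outage(customer_facing: bool, affected: int, total: int) -> Severity:
    """Map an outage's reach to a severity level.

    affected: number of customers for whom the service is down
    total:    total number of customers of that service
    """
    if total <= 0 or not 0 <= affected <= total:
        raise ValueError("expected 0 <= affected <= total and total > 0")
    if customer_facing and affected == total:
        return Severity.SEV1  # down for all customers
    if customer_facing and affected > 0:
        return Severity.SEV2  # down for a subset of customers
    return Severity.SEV3      # assumption: everything else is lower impact


def is_major(severity: Severity) -> bool:
    # SEV 1 and SEV 2 are both treated as major incidents.
    return severity in (Severity.SEV1, Severity.SEV2)


print(classify_outage(True, 250, 1000).name)
```

In practice the person logging the incident makes this call by judgment; the function only illustrates how the two stated rules separate SEV 1 from SEV 2.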
Once the incident manager gets an alert, their first order of business is to communicate that the incident fix is in progress. They change the status of the incident to fixing and set up the team’s communication channels.
Use postmortem templates with Jira Service Management to easily create and export postmortem reports, along with associated incident timelines, to Confluence so responders can continue to collaborate with cross-functional teams to track follow-up actions and avoid similar incidents in the future.
Our incident lifecycle ends when the incident is resolved, but that isn’t the end of our process at Atlassian. We also want to do everything in our power to ensure an incident doesn’t repeat, which is why the next step is a blameless postmortem, designed to identify the cause of an incident and help us mitigate our risk in the future.
“The goal of initial internal communication is to focus the incident response on one place and reduce confusion. The goal of external communication is to tell customers that you know something’s broken and you’re looking into it as a matter of urgency.”
Once the issue has been escalated to someone new, the incident manager delegates a role to them. At Atlassian, these roles are pre-set, so team members can quickly understand what’s expected of them.
Once we’ve answered those questions, we can confidently move forward with diagnostics and proposed fixes or change the SEV level and priority level of an incident as needed.
Communications manager: A communications pro (often from the PR or customer support teams) responsible for communicating with internal and external customers impacted by the incident.
Tech lead: A senior-level tech pro tasked with figuring out what’s broken and why, determining the best course of action, and running the tech team.