If additional follow-up work requires more time after an incident is resolved and closed (such as a detailed root cause analysis or a corrective action), a new issue may need to be created and linked to the incident issue. It is important to add as much information as possible as soon as an incident is resolved, while the information is fresh; this includes a high-level summary and a timeline where applicable.

Due to the overhead involved and the risk of detracting from impact mitigation efforts, this communication option should be used sparingly and only when a very clear and distinct need is present.

Information is an asset to everyone impacted by an incident. Properly managing the flow of information is critical to minimizing surprise and setting expectations. We aim to keep interested stakeholders apprised of developments in a timely fashion so they can plan appropriately.

The goal of incident management is to organize chaos into swift incident resolution. To that end, incident management provides:

If, during an S1 or S2 incident, it is determined that a synchronous conversation with one or more customers would be beneficial, a new Zoom meeting should be used for that conversation. Typically there are two situations which lead to this action:

Type /incident declare in the #production channel in GitLab’s Slack and follow the prompts to open an incident issue. It is always better to err on the side of choosing a higher severity and declaring an incident for a production issue, even if you aren’t sure. Reporting high severity bugs via this process is the preferred path so that we can make sure we engage the appropriate engineering teams as needed.
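As a quick illustration, the declaration starts from a single message in the #production channel; everything after this command is driven by the interactive prompts:

```
/incident declare
```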

If a second incident Zoom is desired, choose which incident will move to the new Zoom and create a new Zoom meeting. Be sure to edit the channel topic of the incident Slack channel to indicate the correct Zoom link.

The EOC Coordinator is focused on improving SRE on-call quality of life and setting up processes to keep on-call engineers across the entire company operating at a high level of confidence.

By default, the EOC is the owner of the incident. The incident owner can delegate ownership to another engineer or escalate ownership to the IM at any time. There is only ever one owner of an incident, and only the owner of the incident can declare an incident resolved. At any time the incident owner can engage the next role in the hierarchy for support. The incident issue should be assigned to the current owner.

If, during an incident, the EOC, Incident Manager, or CMOC need to be engaged, page the person on-call using one of the following. This triggers a PagerDuty incident and pages the appropriate person based on the Impacted Service that you select.

Please note that when an incident is upgraded in severity (for example from S3 to S1), PagerDuty does not automatically page the Incident Manager or Communications Manager and they must be paged manually.

Gases include compressed, liquefied, dissolved, refrigerated liquefied, aerosols, and other gases. They are defined by the hazardous materials classification (Class 2) as “substances that have a vapor pressure of 300 kPa or greater at 50°C or are completely gaseous at 20°C at standard atmospheric pressure.” Gases are considered dangerous because they pose an imminent threat as a potential asphyxiant and because they are often extremely flammable.


The current EOC can be contacted via the @sre-oncall handle in Slack, but please only use this handle in the following scenarios.

Infrastructure Leadership is on the escalation path for both the Engineer On Call (EOC) and the Incident Manager (IM). This is not a substitute or replacement for the active Incident Manager (unless the current IM is unavailable).

The IM won’t be engaged on these tasks unless they are paged, which is why the default is to page them for all Sev1 and Sev2 incidents. In other situations, page the Incident Manager to engage them.

In both cases of Degraded or Outage, once an event has lasted more than 5 minutes, the Engineer On Call and the Incident Manager should engage the CMOC to help with external communications. All incidents with a total duration of more than 5 minutes should be publicly communicated as quickly as possible (including “blip” incidents), and within 1 hour of the incident occurring.


Oxidizers are substances that can produce oxygen. They fall within the hazardous materials classification (Class 5) because under the right circumstances they can contribute to the combustion of other hazardous substances, though they are not always combustible themselves. Oxidizers can be defined as “substances that can cause or contribute to combustion, typically by producing oxygen as a result of a redox chemical reaction.” Organic peroxides are considered dangerous goods because they are thermally unstable and can exude heat while undergoing exothermic auto-catalytic decomposition. These materials can also undergo explosive decomposition, burn rapidly, be sensitive to friction, or react dangerously with other substances.

GitLab uses the Incident Management feature of the GitLab application. Incidents are reported and closed when they are resolved. A resolved incident means the degradation has ended and is not likely to recur.


This is a first revision of the definition of Service Disruption (Outage), Partial Service Disruption, and Degraded Performance per the terms on Status.io. Data is based on the graphs from the Key Service Metrics Dashboard.

If a separate unrelated incident occurs during the maintenance procedure, the engineers involved in the scheduled maintenance should vacate the Situation Room Zoom in favour of the active incident.


In some cases, we may choose not to post to status.io; the following are examples where we may skip a post/tweet. In some cases, this helps protect the security of self-managed instances until we have released the security update.

The GitLab support team staffs an on-call rotation via the Incident Management - CMOC service in PagerDuty. They have a section in the support handbook for getting new CMOC people up to speed.

If an incident may be security related, engage the Security Engineer on-call by using /security in Slack. More detail can be found in Engaging the Security Engineer On-Call.

Flammable solids are defined as “materials which, under conditions encountered in transport, are combustible or may cause or contribute to fire through friction; self-reactive substances which are liable to undergo a strongly exothermic reaction; or solid desensitized explosives.”

Hazardous materials are broken down into 8 main classes, plus a 9th miscellaneous class covering all other materials that don’t fall under the first 8.

For serious incidents that require coordinated communications across multiple channels, the Incident Manager will rely on the CMOC for the duration of the incident.

These definitions imply several on-call rotations for the different roles. Note that not all incidents include engagement from Incident Managers or Communication Managers.

Explosives meet the hazardous materials classification (Class 1) because they have the ability to produce hazardous amounts of heat, sound, smoke, gas or light. They are also capable, through a chemical reaction, of producing gases at speeds, temperatures, and pressures that can cause disastrous damage.

As the name implies, the miscellaneous hazardous materials classification (Class 9) covers substances that present an imminent threat not covered within the definitions of the other 8 classes. Class 9 miscellaneous dangerous goods present a wide variety of potentially hazardous threats to human health and safety, infrastructure, and/or their means of transport. They are defined as, but not limited to, “environmentally hazardous substances, substances that are transported at elevated temperatures, miscellaneous articles and substances, genetically modified organisms and micro-organisms, and magnetized materials and aviation regulated substances.”

Corrosives are substances that degrade or disintegrate other materials upon contact through a chemical reaction if leakage or damage occurs to the surrounding materials. They are capable of destroying materials, including living tissue. The Department of Transportation considers a substance with a pH of less than 2 or greater than 12.5 to be corrosive.

Coordination and communication should take place in the Situation Room Zoom so that it is quick and easy to include other engineers if there is a problem with the maintenance.

The EOC will respond as soon as they can to the usage of the @sre-oncall handle in Slack, but depending on circumstances, may not be immediately available. If it is an emergency and you need an immediate response, please see the Reporting an Incident section.

When an incident is created that is a duplicate of an existing incident, it is up to the EOC to mark it as a duplicate. In the case where we mark an incident as a duplicate, we should issue the following slash command and remove all labels on the incident issue:
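A minimal sketch of marking a duplicate, assuming the standard GitLab /duplicate quick action and a hypothetical issue number for the original incident:

```
/duplicate #1234
```

The /duplicate quick action closes the issue and records it as a duplicate of the referenced issue.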

In some scenarios it may be necessary for most or all participants of an incident (including the EOC, other developers, etc.) to work directly with a customer. In this case, the customer interaction Zoom shall be used, NOT the main GitLab Incident Zoom. This allows for the conversation (as well as text chat) while still supporting the ability for primary responders to quickly resume internal communications in the main Incident Zoom. Since the main Incident Zoom may be used for multiple incidents, this also prevents the risk of confidential data leakage and avoids the inefficiency of having to frequently announce that there are customers in the main Incident Zoom each time the call membership changes.

Occasionally we encounter multiple incidents at the same time. Sometimes a single Incident Manager can cover multiple incidents. This isn’t always possible, especially if there are two simultaneous high-severity incidents with significant activity.


We manage incident communication using status.io, which updates status.gitlab.com. Incidents in status.io have state and status and are updated by the incident owner.

If an alert silence is created for an active incident, the incident should be resolved with the ~"alertmanager-silence" label and the appropriate root cause label if it is known. There should also be a linked ~infradev issue for the long-term solution, or an investigation issue created using the related issue links on the incident template.

For general information about how shifts are scheduled and common scenarios about what to do when you have PTO or need coverage, see the Incident Manager onboarding documentation.

The EOC Coordinator will work closely with the Ops Team on core on-call and incident management concerns, and engage other teams across the organization as needed.

If you are a GitLab team member and would like to report a possible incident related to GitLab.com and have the EOC paged in to respond, choose one of the reporting methods below. Regardless of the method chosen, please stay online until the EOC has had a chance to come online and engage with you regarding the incident. Thanks for your help!

In the event of a GitLab.com outage, a mirror of the runbooks repository is available at https://ops.gitlab.net/gitlab-com/runbooks.


Near misses are like a vaccine. They help the company better defend against more serious errors in the future, without harming anyone or anything in the process.

Corrective Action issues in the Reliability project should be created using the Corrective Action issue template to ensure consistency in format, labels, and application/monitoring of service level objectives for completion.

Further support is available from the Scalability and Delivery Groups if required. Scalability leadership can be reached via PagerDuty Scalability Escalation (further details available on their team page). Delivery leadership can be reached via PagerDuty. See the Release Management Escalation steps on the Delivery group page.

It is not always very clear which service label to apply, especially when causes span service boundaries that we define in Infrastructure. When unsure, it’s best to choose a label that corresponds to the primary cause of the incident, even if other services are involved.

In the case of a high severity bug that is in an ongoing or upcoming deployment, please follow the steps to Block a Deployment.

A direct customer interaction call for an incident is initiated by the current Incident Manager by taking these steps:

In order to help with attribution, we also label each incident with scoped labels for the Infrastructure Service (Service::) and Group (group::), among others.
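For instance, these scoped labels can be applied on the incident issue with the /label quick action; the specific label values below are illustrative only, not a prescribed set:

```
/label ~"Service::API" ~"group::pipeline execution"
```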

The following labels are used to track the incident lifecycle from active incident to completed incident review.

To make your role clear, edit your Zoom name to start with your role when you join the Zoom meeting, for example “IM - John Doe”. To edit your name during a Zoom call, click on the three dots by your name in your video tile and choose the “Rename” option. Edits made during a Zoom call only last for the length of the call, so your name will automatically revert to your profile name/title for the next call.

Corrective Actions (CAs) are work items that we create as a result of an incident. Only issues arising out of an incident should receive the label ~"corrective action". They are designed to prevent the same kind of incident or to improve time to mitigation, and as such are part of the Incident Management cycle. Corrective Actions must be related to the incident issue to help with downstream analysis.

The current Root Cause labels are listed below. In order to support trend awareness these labels are meant to be high-level, not too numerous, and as consistent as possible over time.

A Partial Service Disruption is when only part of the GitLab.com services or infrastructure is experiencing an incident. Examples of partial service disruptions are instances where GitLab.com is operating normally except there are:

Clear delineation of responsibilities is important during an incident. Quick resolution requires focus and a clear hierarchy for delegation of tasks. Preventing overlaps and ensuring a proper order of operations is vital to mitigation.

Radioactive materials are defined by the hazardous materials classification (Class 7) as “any material containing radionuclides where both the activity concentration and the total activity exceeds certain pre-defined values.” While undergoing radioactive decay, radioactive materials can emit harmful ionizing radiation.

Toxic materials fall under the hazardous materials classification (Class 6) because of their ability to cause serious injury or death if swallowed, inhaled, or contact is made with skin. Infectious substances are also classified as dangerous goods for containing pathogens, which include bacteria, viruses, parasites, and/or other agents which can cause disease in humans or animals when contact is made. Dangerous goods regulations define pathogens as “microorganisms, such as bacteria, viruses, rickettsiae, parasites, and fungi, or other agents which can cause disease in humans or animals.”

If a related incident occurs during the maintenance procedure, the EM should act as the Incident Manager for the duration of the incident.

Flammable solids fit within the hazardous materials classification (Class 4) because they are highly combustible and are capable of posing serious hazards due to their volatility, combustibility, and potential for causing or propagating severe conflagrations; they can even cause fire through friction.

Some of this may feel counter to GitLab Values; this is not designed or intended to diminish our values but to acknowledge and reinforce our need to mitigate customer impact as quickly as possible.

Flammable liquids or combustible liquids are volatile, and can often give off a flammable vapor. They are defined by the hazardous materials classification (Class 3) as “liquids, mixtures of liquids or liquids containing solids in solution or suspension which give off a flammable vapor, and have a flash point at temperatures not more than 60.5°C or 141°F.” Flammable liquids are capable of posing serious threats because of their volatility, potential of causing severe conflagrations and combustibility.

When paged, the Incident Managers have the following responsibilities during a Sev1 or Sev2 incident and should be engaged on these tasks immediately when an incident is declared:

Issues that have the ~"corrective action" label will automatically have the ~"infradev" label applied. This is done so that these issues follow the same process we have for development and are resolved within specific time-frames. For more details see the infradev process.

Incidents use the Timeline Events feature; the timeline can be viewed by selecting the “Timeline” tab on the incident. By default, all label events are added to the timeline, including ~"Incident::Mitigated" and ~"Incident::Resolved". At a minimum, the timeline should include the start and end times of user impact. You may also want to highlight notes in the discussion; this is done by selecting the clock icon on the note, which will automatically add it to the timeline. For adding timeline items quickly, use the quick action, for example:
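A minimal sketch, assuming the standard GitLab /timeline quick action syntax (the entry text and timestamp below are illustrative):

```
/timeline Database failover completed | 2023-09-18 09:30
```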

We want to be able to report on a scope of incidents which have met a level of impact which necessitated customer communications. An underlying assumption is that any material impact will always be communicated in some form. Incidents are to be labeled indicating communications even if the impact is later determined to be lesser, or when the communication is done by mistake.

The Engineer On Call is responsible for the mitigation of impact and resolution to the incident that was declared. The EOC should reach out to the Incident Manager for support if help is needed or others are needed to aid in the incident investigation.

A page will be escalated to the Incident Manager (IM) if it is not answered by the Engineer on Call (EOC). This escalation will happen for all alerts that go through PagerDuty, which includes lower severity alerts. It’s possible that this can happen when there is a large number of pages and the EOC is unable to focus on acknowledging pages. When this occurs, the IM should reach out in Slack in the #incident-management channel to see if the EOC needs assistance.

Incidents are anomalous conditions that result in—or may lead to—service degradation or outages. These events require human intervention to avert disruptions or restore service to operational status. Incidents are always given immediate attention.

In the case of high severity bugs, we prefer that an incident issue is still created via Reporting an Incident. This will give us an incident issue on which to track the events and response.

Labeling incidents with a Root Cause is done for the categorization of incidents when deploy pipelines are blocked. For this reason, a label with the prefix ~RootCause is required whenever an incident has the ~"release-blocker" label. The person assigned to the incident is responsible for adding the appropriate Root Cause label.

If a ~severity::3 or ~severity::4 incident occurs multiple times and requires weekend work, the multiple incidents should be combined into a single severity::2 incident. If assistance is needed to determine severity, EOCs and Incident Managers are encouraged to contact Reliability Leadership via PagerDuty.

In the United States, the Aviation Safety Reporting System has been collecting reports of close calls since 1976. Due to near miss observations and other technological improvements, the rate of fatal accidents has dropped about 65 percent. source

As well as opening a GitLab incident issue, a dedicated incident Slack channel will be opened. The “woodhouse” bot will post links to all of these resources in the main #incident-management channel. Please note that unless you’re an SRE, you won’t be able to post in #incident-management directly. Please join the dedicated Slack channel, created and linked as a result of the incident declaration, to discuss the incident with the on-call engineer.

The CMOC is responsible for ensuring this label is set for all incidents involving use of the Status Page or where other direct notification to a set of customers is completed (such as via Zendesk).

A near miss, “near hit”, or “close call” is an unplanned event that has the potential to cause, but does not actually result in, an incident.

The Incident Manager should exercise caution and their best judgement; in general, we prefer to use internal notes instead of marking an entire issue confidential if possible. A couple of lines of nondescript log data may not represent a data security concern, but a larger set of log, query, or other data must have more restrictive access. If assistance is required, follow the Infrastructure Liaison Escalation process.

EOCs are responsible for responding to alerts even on the weekends. Time should not be spent mitigating the incident unless it is a ~severity::1 or ~severity::2. Mitigation for ~severity::3 and ~severity::4 incidents can occur during normal business hours, Monday-Friday. If you have any questions on this please reach out to an Infrastructure Engineering Manager.

During a verified Severity 1 Incident the IM will page the Infrastructure Liaison. This is not a substitute or replacement for the active Incident Manager.

The EOC and the Incident Manager On Call, at the time of the incident, are the default assignees for an incident issue. They are the assignees for the entire workflow of the incident issue.


If the EOC does not respond because they are unavailable, you should escalate the incident using the PagerDuty application, which will alert Infrastructure Engineering leadership.

Incident severity should be assigned at the beginning of an incident to ensure proper response across the organization. Incident severity should be determined based on the information that is available at the time. Severities can and should be adjusted as more information becomes available. The severity level reflects the maximum impact the incident had and should remain at that level even after the incident is mitigated or resolved.

After learning of the history and current state of the incident the Engineering Communications Lead will initiate and manage the customer interaction through these actions:


The Incident Manager is the DRI for all of the items listed above, but it is expected that the IM will do it with the support of the EOC or others who are involved with the incident. If an incident runs beyond a scheduled shift, the Incident Manager is responsible for handing over to the incoming IM.

For Sev3 and Sev4 incidents, the EOC is also responsible for Incident Manager Responsibilities, second to mitigating and resolving the incident.

In order to effectively track specific metrics and have a single pane of glass for incidents and their reviews, specific labels are used. The below workflow diagram describes the path an incident takes from open to closed. All S1 incidents require a review, other incidents can also be reviewed as described here.

Runbooks are available for engineers on call. The project README contains links to checklists for each of the above roles.

The following services should primarily be used for application code changes or feature flag changes, not changes to configuration or code maintained by the Infrastructure department:

30 minutes before the maintenance window starts, the Engineering Manager who is responsible for the change should notify the SRE on-call, the Release Managers, and the CMOC to inform them that the maintenance is about to begin.

In our last blog, What Defines a Hazardous Material, we began to discuss the different definitions of hazardous waste and materials based on the regulating agency that is defining them. Part of understanding these materials is determining which class they fall under; this will be your guide to understanding how to adequately handle them.

As a means to ensure a healthy Incident Manager rotation with sufficient staffing and no significant burden on any single individual, we staff this role with Team Members from across Engineering.

When an incident starts, the incident automation sends a message in the #incident-management channel containing a link to a per-incident Slack channel for text based communication, the incident issue for permanent records, and the Situation Room Zoom link for incident team members to join for synchronous verbal and screen-sharing communication.