When paged, the Incident Managers have the following responsibilities during a Sev1 or Sev2 incident and should be engaged on these tasks immediately when an incident is declared:

For general information about how shifts are scheduled, and for common scenarios such as what to do when you have PTO or need coverage, see the Incident Manager onboarding documentation.

Please note that when an incident is upgraded in severity (for example, from S3 to S1), PagerDuty does not automatically page the Incident Manager or Communications Manager; they must be paged manually.

The current EOC can be contacted via the @sre-oncall handle in Slack, but please only use this handle in the following scenarios.

After learning of the history and current state of the incident, the Engineering Communications Lead will initiate and manage the customer interaction through these actions:

If a ~severity::3 or ~severity::4 incident occurs multiple times and requires weekend work, the multiple incidents should be combined into a single ~severity::2 incident. If assistance is needed to determine severity, EOCs and Incident Managers are encouraged to contact Reliability Leadership via PagerDuty.

Information is an asset to everyone impacted by an incident. Properly managing the flow of information is critical to minimizing surprise and setting expectations. We aim to keep interested stakeholders apprised of developments in a timely fashion so they can plan appropriately.

Some of this may feel counter to GitLab Values; this is not designed or intended to diminish our values but to acknowledge and reinforce our need to mitigate customer impact as quickly as possible.

EOCs are responsible for responding to alerts even on weekends. However, time should not be spent mitigating the incident unless it is a ~severity::1 or ~severity::2. Mitigation for ~severity::3 and ~severity::4 incidents can occur during normal business hours, Monday through Friday. If you have any questions about this, please reach out to an Infrastructure Engineering Manager.

The EOC Coordinator will work closely with the Ops Team on core on-call and incident management concerns, and engage other teams across the organization as needed.

Labeling incidents with a Root Cause is done for the categorization of incidents when deploy pipelines are blocked. For this reason, a label with the prefix ~RootCause is required whenever an incident has the ~"release-blocker" label. The person assigned to the incident is responsible for adding the appropriate Root Cause label.
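As a rough sketch, the Root Cause label can be added with a standard GitLab quick action in a comment on the incident issue. The specific label value below is only a hypothetical example; use whichever Root Cause label fits the incident:

```
/label ~"RootCause::Config-Change"
```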

The following services should primarily be used for application code changes or feature flag changes, not changes to configuration or code maintained by the Infrastructure department:

When an incident starts, the incident automation sends a message in the #incident-management channel containing a link to a per-incident Slack channel for text based communication, the incident issue for permanent records, and the Situation Room Zoom link for incident team members to join for synchronous verbal and screen-sharing communication.

In the United States, the Aviation Safety Reporting System has been collecting reports of close calls since 1976. Due to near miss observations and other technological improvements, the rate of fatal accidents has dropped about 65 percent (source).

The goal of incident management is to organize chaos into swift incident resolution. To that end, incident management provides:

To make your role clear, edit your Zoom name to start with your role when you join the Zoom meeting, for example “IM - John Doe”. To edit your name during a Zoom call, click on the three dots by your name in your video tile and choose the “Rename” option. Edits made during a Zoom call only last for the length of the call, so your name will automatically revert to your profile name/title for the next call.

If, during an incident, the EOC, Incident Manager, or CMOC needs to be engaged, page the person on-call using one of the following. This triggers a PagerDuty incident and pages the appropriate person based on the Impacted Service that you select.

The Engineer On Call is responsible for the mitigation of impact and resolution to the incident that was declared. The EOC should reach out to the Incident Manager for support if help is needed or others are needed to aid in the incident investigation.

As a means to ensure a healthy Incident Manager rotation with sufficient staffing and no significant burden on any single individual, we staff this role with Team Members from across Engineering.

For serious incidents that require coordinated communications across multiple channels, the Incident Manager will rely on the CMOC for the duration of the incident.

In some cases, we may choose not to post to status.io; the following are examples where we may skip a post/tweet. In some cases, this helps protect the security of self-managed instances until we have released the security update.

In the case of high severity bugs, we prefer that an incident issue is still created via Reporting an Incident. This will give us an incident issue on which to track the events and response.

When an incident is created that is a duplicate of an existing incident, it is up to the EOC to mark it as a duplicate. In the case where we mark an incident as a duplicate, we should issue the following slash command and remove all labels from the incident issue:
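A minimal sketch of what this could look like, assuming GitLab's standard quick actions (`/duplicate` to close and link the issue as a duplicate, and `/unlabel` with no arguments to remove all labels); the issue reference below is hypothetical:

```
/duplicate #12345
/unlabel
```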

The IM won’t be engaged on these tasks unless they are paged, which is why the default is to page them for all Sev1 and Sev2 incidents. In other situations, page the Incident Manager to engage them.

In some scenarios it may be necessary for most or all participants of an incident (including the EOC, other developers, etc.) to work directly with a customer. In this case, the customer interaction Zoom shall be used, NOT the main GitLab Incident Zoom. This allows for the conversation (as well as text chat) while still supporting the ability for primary responders to quickly resume internal communications in the main Incident Zoom. Since the main Incident Zoom may be used for multiple incidents, this also avoids the risk of confidential data leakage and the inefficiency of having to frequently announce that there are customers in the main Incident Zoom each time the call membership changes.

If a separate unrelated incident occurs during the maintenance procedure, the engineers involved in the scheduled maintenance should vacate the Situation Room Zoom in favour of the active incident.

GitLab uses the Incident Management feature of the GitLab application. Incidents are reported when they occur and closed when they are resolved. A resolved incident means the degradation has ended and is not likely to re-occur.

In order to effectively track specific metrics and have a single pane of glass for incidents and their reviews, specific labels are used. The below workflow diagram describes the path an incident takes from open to closed. All S1 incidents require a review, other incidents can also be reviewed as described here.

The following labels are used to track the incident lifecycle from active incident to completed incident review.

The GitLab Support team staffs an on-call rotation that is paged via the Incident Management - CMOC service in PagerDuty. They have a section in the Support handbook for getting new CMOC team members up to speed.

Due to the overhead involved and the risk of detracting from impact mitigation efforts, this communication option should be used sparingly and only when a very clear and distinct need is present.

30 minutes before the maintenance window starts, the Engineering Manager who is responsible for the change should notify the SRE on-call, the Release Managers, and the CMOC to inform them that the maintenance is about to begin.

This is a first revision of the definition of Service Disruption (Outage), Partial Service Disruption, and Degraded Performance per the terms on Status.io. Data is based on the graphs from the Key Service Metrics Dashboard.

If you are a GitLab team member and would like to report a possible incident related to GitLab.com and have the EOC paged in to respond, choose one of the reporting methods below. Regardless of the method chosen, please stay online until the EOC has had a chance to come online and engage with you regarding the incident. Thanks for your help!

Coordination and communication should take place in the Situation Room Zoom so that it is quick and easy to include other engineers if there is a problem with the maintenance.

Incidents use the Timeline Events feature; the timeline can be viewed by selecting the “Timeline” tab on the incident. By default, all label events are added to the Timeline, including ~"Incident::Mitigated" and ~"Incident::Resolved". At a minimum, the timeline should include the start and end times of user impact. You may also want to highlight notes in the discussion; this is done by selecting the clock icon on the note, which will automatically add it to the timeline. For adding timeline items quickly, use the quick action, for example:
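A sketch of the `/timeline` quick action, assuming the standard `/timeline <comment> | <date> <time>` syntax; the comment text and timestamp below are illustrative only:

```
/timeline User-facing error rates returned to normal | 2023-05-04 09:30
```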

The EOC and the Incident Manager On Call, at the time of the incident, are the default assignees for an incident issue. They are the assignees for the entire workflow of the incident issue.

Issues that have the ~"corrective action" label will automatically have the ~"infradev" label applied. This is done so that these issues follow the same process we have for development and are resolved in specific time-frames. For more details see the infradev process.

In the event of a GitLab.com outage, a mirror of the runbooks repository is available at https://ops.gitlab.net/gitlab-com/runbooks.

By default, the EOC is the owner of the incident. The incident owner can delegate ownership to another engineer or escalate ownership to the IM at any time. There is only ever one owner of an incident, and only the owner of the incident can declare an incident resolved. At any time the incident owner can engage the next role in the hierarchy for support. The incident issue should be assigned to the current owner.

If an incident may be security related, engage the Security Engineer on-call by using /security in Slack. More detail can be found in Engaging the Security Engineer On-Call.

The Incident Manager should exercise caution and their best judgement; in general we prefer to use internal notes instead of marking an entire issue confidential if possible. A couple of lines of non-descript log data may not represent a data security concern, but a larger set of log, query, or other data must have more restrictive access. If assistance is required, follow the Infrastructure Liaison Escalation process.

If a related incident occurs during the maintenance procedure, the EM should act as the Incident Manager for the duration of the incident.

Incidents are anomalous conditions that result in—or may lead to—service degradation or outages. These events require human intervention to avert disruptions or restore service to operational status. Incidents are always given immediate attention.

The EOC will respond as soon as they can to the usage of the @sre-oncall handle in Slack, but depending on circumstances, may not be immediately available. If it is an emergency and you need an immediate response, please see the Reporting an Incident section.

Near misses are like a vaccine. They help the company better defend against more serious errors in the future, without harming anyone or anything in the process.

Runbooks are available for engineers on call. The project README contains links to checklists for each of the above roles.

To help with attribution, we also label each incident with the Infrastructure Service (Service::) and Group (group::) scoped labels, among others.

A page will be escalated to the Incident Manager (IM) if it is not answered by the Engineer on Call (EOC). This escalation will happen for all alerts that go through PagerDuty, which includes lower severity alerts. It’s possible that this can happen when there is a large number of pages and the EOC is unable to focus on acknowledging pages. When this occurs, the IM should reach out in Slack in the #incident-management channel to see if the EOC needs assistance.

In both cases of Degraded or Outage, once an event has lasted more than 5 minutes, the Engineer On Call and the Incident Manager should engage the CMOC to help with external communications. All incidents with a total duration of more than 5 minutes should be publicly communicated as quickly as possible (including “blip” incidents), and within 1 hour of the incident occurring.

Corrective Action issues in the Reliability project should be created using the Corrective Action issue template to ensure consistency in format, labels, and application/monitoring of service level objectives for completion.

These definitions imply several on-call rotations for the different roles. Note that not all incidents include engagement from Incident Managers or Communication Managers.

If the EOC does not respond because they are unavailable, you should escalate the incident using the PagerDuty application, which will alert Infrastructure Engineering leadership.

The EOC Coordinator is focused on improving SRE on-call quality of life and setting up processes to keep on-call engineers across the entire company operating at a high level of confidence.

Corrective Actions (CAs) are work items that we create as a result of an incident. Only issues arising out of an incident should receive the ~"corrective action" label. They are designed to prevent the same kind of incident or improve the time to mitigation, and as such are part of the Incident Management cycle. Corrective Actions must be related to the incident issue to help with downstream analysis.
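As an illustrative sketch, labeling a corrective action and relating it to its incident can be done with standard quick actions in a comment on the corrective action issue; the incident reference below is hypothetical:

```
/label ~"corrective action"
/relate gitlab-com/gl-infra/production#12345
```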

We manage incident communication using status.io, which updates status.gitlab.com. Incidents in status.io have state and status and are updated by the incident owner.

Clear delineation of responsibilities is important during an incident. Quick resolution requires focus and a clear hierarchy for delegation of tasks. Preventing overlaps and ensuring a proper order of operations is vital to mitigation.

A near miss, “near hit”, or “close call” is an unplanned event that has the potential to cause, but does not actually result in, an incident.

The Incident Manager is the DRI for all of the items listed above, but it is expected that the IM will do it with the support of the EOC or others who are involved with the incident. If an incident runs beyond a scheduled shift, the Incident Manager is responsible for handing over to the incoming IM.

A direct customer interaction call for an incident is initiated by the current Incident Manager by taking these steps:

Further support is available from the Scalability and Delivery Groups if required. Scalability leadership can be reached via PagerDuty Scalability Escalation (further details available on their team page). Delivery leadership can be reached via PagerDuty. See the Release Management Escalation steps on the Delivery group page.

It is not always very clear which service label to apply, especially when causes span service boundaries that we define in Infrastructure. When unsure, it’s best to choose a label that corresponds to the primary cause of the incident, even if other services are involved.

If, during an S1 or S2 incident, it is determined that it would be beneficial to have a synchronous conversation with one or more customers, a new Zoom meeting should be used for that conversation. Typically there are two situations which would lead to this action:

During a verified Severity 1 Incident the IM will page the Infrastructure Liaison. This is not a substitute or replacement for the active Incident Manager.

If a second incident Zoom is desired, choose which incident will move to the new Zoom and create a new meeting in Zoom. Be sure to edit the channel topic of the incident Slack channel to indicate the correct Zoom link.

If there is additional follow-up work that requires more time after an incident is resolved and closed (like a detailed root cause analysis or a corrective action), a new issue may need to be created and linked to the incident issue. It is important to add as much information as possible as soon as an incident is resolved, while the information is fresh; this includes a high-level summary and a timeline where applicable.

Occasionally we encounter multiple incidents at the same time. Sometimes a single Incident Manager can cover multiple incidents. This isn’t always possible, especially if there are two simultaneous high-severity incidents with significant activity.

In the case of a high severity bug that is in an ongoing or upcoming deployment, please follow the steps to Block a Deployment.

The CMOC is responsible for ensuring this label is set for all incidents involving use of the Status Page or where other direct notification to a set of customers is completed (such as via Zendesk).

Type /incident declare in the #production channel in GitLab’s Slack and follow the prompts to open an incident issue. It is always better to err on the side of choosing a higher severity and declaring an incident for a production issue, even if you aren’t sure. Reporting high severity bugs via this process is the preferred path so that we can make sure we engage the appropriate engineering teams as needed.

The Infrastructure Leadership is on the escalation path for both Engineer On Call (EOC) and Incident Manager (IM). This is not a substitute or replacement for the active Incident Manager (unless the current IM is unavailable).

We want to be able to report on the set of incidents which have met a level of impact that necessitated customer communications. An underlying assumption is that any material impact will always be communicated in some form. Incidents are to be labeled as having communications even if the impact is later determined to be lesser, or the communication was sent by mistake.

If an alert silence is created for an active incident, the incident should be resolved with the ~"alertmanager-silence" label and the appropriate root cause label if it is known. There should also be a linked ~infradev issue for the long term solution or an investigation issue created using the related issue links on the incident template.

Incident severity should be assigned at the beginning of an incident to ensure proper response across the organization. Incident severity should be determined based on the information that is available at the time. Severities can and should be adjusted as more information becomes available. The severity level reflects the maximum impact the incident had and should remain at that level even after the incident has been mitigated or resolved.

As well as opening a GitLab incident issue, a dedicated incident Slack channel will be opened. The “woodhouse” bot will post links to all of these resources in the main #incident-management channel. Please note that unless you’re an SRE, you won’t be able to post in #incident-management directly. Please join the dedicated Slack channel, created and linked as a result of the incident declaration, to discuss the incident with the on-call engineer.

For Sev3 and Sev4 incidents, the EOC is also responsible for Incident Manager Responsibilities, second to mitigating and resolving the incident.

The current Root Cause labels are listed below. In order to support trend awareness these labels are meant to be high-level, not too numerous, and as consistent as possible over time.

A Partial Service Disruption is when only part of the GitLab.com services or infrastructure is experiencing an incident. Examples of partial service disruptions are instances where GitLab.com is operating normally except there are: