GitHub Status

Mar 24, 2024

No incidents reported today.

Mar 23, 2024

No incidents reported.

Mar 22, 2024

No incidents reported.

Mar 21, 2024

No incidents reported.

Mar 20, 2024

No incidents reported.

Mar 19, 2024

No incidents reported.

Mar 18, 2024

No incidents reported.

Mar 17, 2024

No incidents reported.

Mar 16, 2024

No incidents reported.

Mar 15, 2024

Incident with Actions and Pages

Resolved - This incident has the same root cause as the "Incident with Codespaces and API Requests" listed below. Please see that incident for the summary.
Mar 15, 20:28 UTC

Update - Actions is operating normally.
Mar 15, 20:27 UTC

Update - Pages is experiencing degraded performance. We are continuing to investigate.
Mar 15, 20:09 UTC

Investigating - We are investigating reports of degraded performance for Actions
Mar 15, 20:07 UTC

Incident with Codespaces and API Requests

Resolved - On March 15, 2024, between 19:42 UTC and 20:24 UTC several services were degraded due to a regression in calling the permissions system.

New GitHub Codespaces could not be created, and Codespaces sessions that required minting a new auth token were also affected.

Actions saw delays and infrastructure failures due to the upstream dependency on fetching tokens for the repository for runs to successfully execute.

GitHub Pages was affected by the impact on Actions: 1,266 page builds failed, and at the worst point 33% of page builds were failing. As a result, page edits were not reflected on the impacted sites.

We deployed an application update that included a newer version of our database query builder. The new version uses a newer MySQL syntax for upsert queries that is not supported by the database proxy service we use for some of our production-environment database clusters. This incompatibility impacted the permissions cluster specifically, causing requests that attempted such queries to fail.
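The summary does not say which upsert syntax was involved. As one illustration of how a query builder upgrade can change the SQL it emits, MySQL 8.0.19 added a row-alias form of INSERT ... ON DUPLICATE KEY UPDATE alongside the older VALUES() function form. The sketch below builds both statements; the table and column names are hypothetical, and this may not be the syntax that triggered the incident.

```python
# Illustrative only: table and column names are hypothetical, and the
# incident summary does not state which upsert syntax was involved.


def upsert_legacy(table: str, columns: list[str]) -> str:
    """Build an upsert using the older VALUES() function form."""
    cols = ", ".join(columns)
    params = ", ".join(["%s"] * len(columns))
    updates = ", ".join(f"{c} = VALUES({c})" for c in columns)
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({params}) "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )


def upsert_row_alias(table: str, columns: list[str], alias: str = "new") -> str:
    """Build an upsert using the row-alias form added in MySQL 8.0.19."""
    cols = ", ".join(columns)
    params = ", ".join(["%s"] * len(columns))
    updates = ", ".join(f"{c} = {alias}.{c}" for c in columns)
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({params}) AS {alias} "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )


if __name__ == "__main__":
    print(upsert_legacy("permissions", ["subject_id", "role"]))
    print(upsert_row_alias("permissions", ["subject_id", "role"]))
```

A component between the application and the database that parses SQL, such as a proxy, has to recognize whichever form the query builder emits, which is why exercising the same proxy path in development and CI can surface a mismatch before production.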

We responded by rolling back the deployment, restoring the previous query use, and thus mitigated the incident.

We have identified and corrected a misconfiguration of the permissions cluster in our development and CI environments. Queries in those environments now go through the proxy service, so that future syntax additions the proxy does not support are caught before they cause issues in production.

Mar 15, 20:24 UTC

Update - Codespaces is operating normally.
Mar 15, 20:21 UTC

Update - API Requests is operating normally.
Mar 15, 20:20 UTC

Update - We rolled back the most recent deployment and are seeing improvements across all services, and will continue to monitor for additional impact.
Mar 15, 20:17 UTC

Update - API Requests is experiencing degraded performance. We are continuing to investigate.
Mar 15, 20:11 UTC

Update - Codespaces is experiencing degraded availability. We are continuing to investigate.
Mar 15, 20:03 UTC

Update - API Requests is experiencing degraded availability. We are continuing to investigate.
Mar 15, 20:03 UTC

Update - API Requests is experiencing degraded performance. We are continuing to investigate.
Mar 15, 20:00 UTC

Investigating - We are investigating reports of degraded performance for Codespaces
Mar 15, 19:55 UTC

Mar 14, 2024

No incidents reported.

Mar 13, 2024

Incident with Pull Requests

Resolved - From March 12, 2024 23:39 UTC to March 13, 2024 1:58 UTC, some Pull Requests updates were delayed and did not reflect the latest code that had been pushed. On average, 20% of Pull Requests page loads were out of sync and up to 30% of Pull Requests were impacted at peak. An internal component of our job queueing system was incorrectly handling invalid messages, resulting in stalled processing.

We mitigated the incident by shipping a fix to handle the edge case gracefully and allow processing to continue.
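A minimal sketch of the general technique, assuming a simple in-memory queue of JSON messages: a consumer that sets invalid messages aside rather than letting one bad message stall everything behind it. The message shape, the `pull_request_id` field, and the dead-letter list are hypothetical and are not taken from GitHub's job queueing system.

```python
import json
from collections import deque


def drain(queue: deque, handle, dead_letters: list) -> int:
    """Process every message; set invalid ones aside instead of stalling."""
    processed = 0
    while queue:
        raw = queue.popleft()
        try:
            message = json.loads(raw)
            # Hypothetical validation rule for this sketch.
            if not isinstance(message, dict) or "pull_request_id" not in message:
                raise ValueError("missing pull_request_id")
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            # Record the bad message for later inspection and keep going.
            dead_letters.append((raw, str(exc)))
            continue
        handle(message)
        processed += 1
    return processed


if __name__ == "__main__":
    pending = deque(['{"pull_request_id": 1}', "not json", '{"pull_request_id": 3}'])
    rejected = []
    count = drain(pending, print, rejected)
    print(count, "processed;", len(rejected), "set aside")
```

Keeping rejected messages in a separate list preserves them for debugging while valid work continues to flow.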

Once the fix was deployed at 1:47 UTC, our systems fully caught up with pending background jobs at 1:58 UTC.

We’re working to improve resiliency to invalid messages in our system to prevent future delays for these pull request updates. We are also reviewing our monitoring and observability to identify and remediate these types of failure cases faster.

Mar 13, 01:58 UTC

Update - Pull Requests is operating normally.
Mar 13, 01:58 UTC

Update - We believe we've found a mitigation and are currently monitoring systems for recovery.
Mar 13, 01:53 UTC

Update - We're continuing to investigate delays in PR updates. Next update in 30 minutes.
Mar 13, 01:18 UTC

Update - We're continuing to investigate an elevated number of pull requests that are out of sync on page load.
Mar 13, 00:47 UTC

Update - We're continuing to investigate an elevated number of pull requests that are out of sync on page load.
Mar 13, 00:12 UTC

Update - We're seeing an elevated number of pull requests that are out of sync on page load.
Mar 12, 23:39 UTC

Investigating - We are investigating reports of degraded performance for Pull Requests
Mar 12, 23:39 UTC

Mar 12, 2024

Incident with API Requests, Git Operations, Webhooks and Copilot

Resolved - On March 11, 2024 starting at 22:45 UTC and ending on March 12, 2024 00:48 UTC various GitHub services were degraded and returned intermittent errors for users. During this incident, the following customer impacts occurred: API error rates as high as 1%, Copilot error rates as high as 17%, and Secret Scanning and 2FA using GitHub Mobile error rates as high as 100% followed by a drop in error rates to 30% starting at 22:55 UTC. This elevated error rate was due to a degradation of our centralized authentication service upon which many other services depend.

The issue was caused by a deployment of network-related configuration that was inadvertently applied to the incorrect environment. This error was detected within 4 minutes and a rollback was initiated. While error rates began dropping quickly at 22:55 UTC, the rollback failed in one of our data centers, leading to a longer recovery time. At this point, many failed requests succeeded upon retrying. The rollback failure was due to an unrelated issue that had occurred earlier in the day, in which the datastore for our configuration service was polluted in a way that required manual intervention. The bad data in the configuration service caused the rollback in this one data center to fail. A manual removal of the incorrect data allowed the full rollback to complete at 00:48 UTC, thereby restoring full access to services. We understand how the corrupt data was deployed and continue to investigate why the specific data caused the subsequent deployments to fail.
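Because many failed requests succeeded on retry during the partial recovery, a client-side retry with exponential backoff is one way callers can ride out intermittent errors of this kind. The sketch below is a generic illustration, not GitHub guidance: the function name, attempt count, and delays are assumptions, and `ConnectionError` stands in for whatever transient error a real client would see.

```python
import random
import time


def call_with_retries(request, attempts: int = 4, base_delay: float = 0.5,
                      sleep=time.sleep):
    """Call request(); on a transient failure, back off and try again."""
    for attempt in range(attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            # Jitter spreads retries out so clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


if __name__ == "__main__":
    outcomes = iter([ConnectionError("intermittent"), "ok"])

    def flaky():
        result = next(outcomes)
        if isinstance(result, Exception):
            raise result
        return result

    print(call_with_retries(flaky, base_delay=0.01))
```

The `sleep` parameter is injectable so the backoff can be skipped in tests.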

We are working on various measures to ensure safety of this kind of configuration change, faster detection of the problem via better monitoring of the related subsystems, and improvements to the robustness of our underlying configuration system including prevention and automatic cleanup of polluted records such that we can automatically recover from this kind of data issue in the future.

Mar 12, 01:00 UTC

Update - We believe we've resolved the root cause and are waiting for services to recover
Mar 12, 01:00 UTC

Update - API Requests is operating normally.
Mar 12, 00:56 UTC

Update - Git Operations is operating normally.
Mar 12, 00:55 UTC

Update - Webhooks is operating normally.
Mar 12, 00:54 UTC

Update - Copilot is operating normally.
Mar 12, 00:54 UTC

Update - We're continuing to investigate issues with our authentication service, impacting multiple services
Mar 12, 00:14 UTC

Update - Webhooks is experiencing degraded performance. We are continuing to investigate.
Mar 11, 23:55 UTC

Update - Webhooks is operating normally.
Mar 11, 23:31 UTC

Update - Copilot is experiencing degraded performance. We are continuing to investigate.
Mar 11, 23:21 UTC

Update - Git Operations is experiencing degraded performance. We are continuing to investigate.
Mar 11, 23:20 UTC

Update - Webhooks is experiencing degraded performance. We are continuing to investigate.
Mar 11, 23:09 UTC

Investigating - We are investigating reports of degraded availability for API Requests, Git Operations and Webhooks
Mar 11, 23:01 UTC

Mar 11, 2024

Incident with Actions

Resolved - On March 11, 2024, between 18:44 UTC and 19:10 UTC, GitHub Actions performance was degraded and some users experienced errors when trying to queue workflows. Approximately 3.7% of runs queued during this time were unable to start.

The issue was partially caused by a deployment of an internal system Actions relies on to process workflow run events. The pausing of the queue processing during this deployment for about 3 minutes caused a spike in queued workflow runs. When this queue began to be processed, the high number of queued workflows overwhelmed a secret-initialization component of the workflow invocation system. The errors generated by this overwhelmed system ultimately delayed workflow invocation. Through our alerting system, we received initial indications of an issue at approximately 18:44 UTC. However, we did not initially see impact on our run start delays and run queuing availability metrics until approximately 18:52 UTC. As the large queue of workflow run events burned down, we saw recovery in our key customer impact measures by 19:11 UTC, but waited to declare the incident resolved at 19:22 UTC while verifying there was no further customer impact.

We are working on various measures to reduce spikes in queue build up during deployments of our queueing system, and have scaled up the workers which handle secret generation and storage during the workflow invocation process.
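The failure mode described above, where a backlog released all at once overwhelms a downstream component, is commonly addressed by capping how many items are in flight while the backlog drains. The following is a minimal sketch of that idea using a fixed-size thread pool; `drain_backlog`, `init_secrets`, and the limit of 4 are hypothetical and do not describe how Actions is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def drain_backlog(runs, init_secrets, max_in_flight: int = 4):
    """Run init_secrets for each queued run, at most max_in_flight at a time."""
    # The pool size is the cap: extra runs wait for a free worker instead of
    # all hitting the downstream component simultaneously.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # pool.map returns results in the same order as the input runs.
        return list(pool.map(init_secrets, runs))


if __name__ == "__main__":
    backlog = [f"run-{i}" for i in range(10)]
    print(drain_backlog(backlog, lambda run: f"secrets-for-{run}"))
```

A cap like this trades a longer drain time for a bounded load on the dependency, which is the opposite trade-off from scaling up the workers that serve it.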

Mar 11, 19:22 UTC

Update - Actions experienced a period of decreased workflow run throughput, and we are seeing recovery now. We are in the process of investigating the cause.
Mar 11, 19:21 UTC

Investigating - We are investigating reports of degraded performance for Actions
Mar 11, 19:02 UTC

Incident with Copilot

Resolved - On March 11, 2024, between 06:30 UTC and 11:45 UTC the Copilot Chat service was degraded and customers may have encountered errors or timed out requests for chat interactions. On average, the error rate was 10% and peaked at 45% of requests to the service for short periods of time.

This was due to a gap in handling an edge case for messages returned from the underlying language models. We mitigated the incident by applying a fix to the handling of the streaming response.

We are working to update monitoring to reduce time to detection and increase resiliency to message format changes.
Mar 11, 10:20 UTC

Update - We are deploying mitigations for the failures we have been observing in some chat requests for Copilot. We will continue to monitor and update.
Mar 11, 10:02 UTC

Update - We are seeing an elevated failure rate for chat requests for Copilot. We are investigating and will continue to keep users updated on progress towards mitigation.
Mar 11, 09:03 UTC

Investigating - We are investigating reports of degraded performance for Copilot
Mar 11, 08:14 UTC

Mar 10, 2024

No incidents reported.
