Resolved -
Between 22:45 UTC on March 11, 2024 and 00:48 UTC on March 12, 2024, various GitHub services were degraded and returned intermittent errors for users. Customer impact during the incident included API error rates as high as 1%, Copilot error rates as high as 17%, and error rates as high as 100% for Secret Scanning and 2FA using GitHub Mobile, followed by a drop in error rates to 30% starting at 22:55 UTC. The elevated error rates were due to a degradation of our centralized authentication service, upon which many other services depend.
The issue was caused by a deployment of network-related configuration that was inadvertently applied to the incorrect environment. The error was detected within 4 minutes and a rollback was initiated. While error rates began dropping quickly at 22:55 UTC, the rollback failed in one of our data centers, leading to a longer recovery time. At this point, many failed requests succeeded upon retrying. The rollback failure was due to an unrelated issue earlier in the day, in which the datastore for our configuration service was polluted in a way that required manual intervention. The bad data in the configuration service caused the rollback in this one data center to fail. Manually removing the incorrect data allowed the full rollback to complete at 00:48 UTC, restoring full access to services. We understand how the corrupt data was deployed and continue to investigate why the specific data caused the subsequent deployments to fail.
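Because many failed requests succeeded on retry during the partial recovery, a client-side retry with exponential backoff and jitter is one way to ride out this kind of intermittent failure. The following is a minimal sketch, not a description of any GitHub client: the function name `call_with_retries`, its parameters, and the choice of `ConnectionError` as the retryable error are all assumptions made for illustration.

```python
import random
import time


def call_with_retries(request, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Call request() until it succeeds or max_attempts is reached.

    Hypothetical helper: `request` is any zero-argument callable that
    raises ConnectionError on a transient failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Exponential backoff, capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random fraction of the capped delay so
            # that many clients do not retry in lockstep.
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters exist only so that the delay schedule can be checked without real waiting. Randomizing the delay matters here because synchronized retries from many clients can add load to a service that is still recovering.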
We are working on several measures: ensuring the safety of this kind of configuration change, detecting problems faster through better monitoring of the related subsystems, and improving the robustness of our underlying configuration system, including prevention and automatic cleanup of polluted records, so that we can automatically recover from this kind of data issue in the future.
Mar 12, 01:00 UTC
Update -
We believe we've resolved the root cause and are waiting for services to recover.
Mar 12, 01:00 UTC
Update -
API Requests is operating normally.
Mar 12, 00:56 UTC
Update -
Git Operations is operating normally.
Mar 12, 00:55 UTC
Update -
Webhooks is operating normally.
Mar 12, 00:54 UTC
Update -
Copilot is operating normally.
Mar 12, 00:54 UTC
Update -
We're continuing to investigate issues with our authentication service, which are impacting multiple services.
Mar 12, 00:14 UTC
Update -
Webhooks is experiencing degraded performance. We are continuing to investigate.
Mar 11, 23:55 UTC
Update -
Webhooks is operating normally.
Mar 11, 23:31 UTC
Update -
Copilot is experiencing degraded performance. We are continuing to investigate.
Mar 11, 23:21 UTC
Update -
Git Operations is experiencing degraded performance. We are continuing to investigate.
Mar 11, 23:20 UTC
Update -
Webhooks is experiencing degraded performance. We are continuing to investigate.
Mar 11, 23:09 UTC
Investigating -
We are investigating reports of degraded availability for API Requests, Git Operations and Webhooks.
Mar 11, 23:01 UTC