On November 10, 2023, a significant incident affected our navigation features. The disruption was caused by a bug in a maintenance script we were running, which led to the deletion of navigation data for our users.

Although no customer data or other settings were deleted, the loss of navigation settings rendered apps functionally unusable. This was a widespread issue affecting all Stacker apps. Although the cause and fix were identified quickly, it took nearly two hours to recover the deleted records and restore functionality to everyone’s apps.

This never should have happened. We know the applications you run on the Stacker platform are often mission-critical, and any unexpected downtime causes severe disruption to your businesses. While we have systems and processes in place to avoid circumstances such as this, they were clearly insufficient.

This document outlines the events that caused this incident, the protections we had in place to prevent issues like this, what failed, what worked and why, and the changes we're making based on what we've learned.

The maintenance script to speed up app recovery

The root cause of the incident was a script run in our production environment. This script was developed to reduce the time it takes us to restore apps that have been inadvertently broken during building.

Although we don’t have rollback functionality for app configuration, in exceptional circumstances we are sometimes able to help out when a user loses some or all of their app configuration due to inadvertent changes. Because of an excessive number of unused records in some of our metadata tables, this process can take unnecessarily long to complete, and the aim of this script was to clear out those unused records and reduce the time taken.

We run operations such as these in a script, rather than executing commands directly against the database, as it allows us to test and audit the script ahead of time, and avoids the potential for human error while running the individual commands.
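To make the shape of this concrete, here is a minimal sketch of such a cleanup script, assuming a Python/SQLAlchemy codebase with an explicit dry-run mode. The model, table name, “unused” criterion and connection URL are all illustrative rather than our actual schema or tooling.

```python
# A minimal sketch of a scripted cleanup with an explicit dry-run mode.
# The model, table name, "unused" criterion and connection URL are
# illustrative, not our actual schema or tooling.
import argparse

from sqlalchemy import create_engine, delete, func, select
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column


class Base(DeclarativeBase):
    pass


class NavigationRecord(Base):
    __tablename__ = "navigation_records"  # illustrative table name
    id: Mapped[int] = mapped_column(primary_key=True)
    app_id: Mapped[int | None]  # rows with no app are treated as "unused" here


def cleanup_unused_records(session: Session, dry_run: bool) -> int:
    stale_ids = select(NavigationRecord.id).where(NavigationRecord.app_id.is_(None))
    count = session.scalar(select(func.count()).select_from(stale_ids.subquery()))

    if dry_run:
        print(f"[dry-run] would delete {count} navigation records")
        return 0

    session.execute(delete(NavigationRecord).where(NavigationRecord.id.in_(stale_ids)))
    return count


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--apply", action="store_true", help="actually delete rows")
    args = parser.parse_args()

    engine = create_engine("postgresql://localhost/metadata")  # placeholder URL
    with Session(engine) as session, session.begin():
        cleanup_unused_records(session, dry_run=not args.apply)
```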

Although our development team reviewed and tested this script before deploying it, the script we ran contained a bug: it triggered unexpected behaviour in one of our data management libraries (SQLAlchemy), which only manifested when the script was run in production. This caused all navigation records to be unintentionally deleted.
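The exact library behaviour isn’t important for this write-up, but as a purely hypothetical illustration of how a targeted bulk delete can silently become an unfiltered one: SQLAlchemy statements are generative, so methods like .where() return a new statement rather than modifying the existing one, and discarding that return value drops the filter entirely. (This reuses the illustrative NavigationRecord model from the sketch above; it is not necessarily the bug we hit.)

```python
# Hypothetical illustration only, reusing the illustrative NavigationRecord
# model from the sketch above; this is not necessarily the bug we hit.
from sqlalchemy import delete


def build_cleanup_statement(only_unused: bool):
    stmt = delete(NavigationRecord)
    if only_unused:
        # BUG: .where() returns a NEW statement; the filtered version is
        # discarded and `stmt` remains unfiltered.
        stmt.where(NavigationRecord.app_id.is_(None))
        # Fix: stmt = stmt.where(NavigationRecord.app_id.is_(None))
    return stmt  # without the fix, executing this deletes every navigation record
```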

At 15:16 UTC we started receiving widespread reports of a total loss of app functionality. We quickly identified the data loss as the cause and began work to remedy the problem.

Restoring the missing records

We have a backup system in place for exactly this purpose. We have recently switched to a point-in-time recovery system, which gives us the ability to restore data to any point in the past, down to the minute.

We keep our customers’ data separate from their app configuration (the “metadata”), so we knew there wasn’t any data loss. However, in this case it didn’t make sense to restore all the metadata to before the script was run: many people had continued to make changes to other parts of their app configuration, and those changes would all be lost if we did a blanket restore.

Instead, we began spinning up a copy of the metadata database, allowing us to copy over just the lost records rather than overwriting the whole database.
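As an illustrative sketch of what that step involves, assuming an AWS RDS-style point-in-time restore driven from Python via boto3 (we haven’t named our provider here, and all identifiers and timestamps below are placeholders):

```python
# Illustrative sketch only: the provider, instance identifiers and the restore
# timestamp are placeholders, not our actual infrastructure.
from datetime import datetime, timezone

import boto3

rds = boto3.client("rds")

# Restore to a point safely before the script began deleting records.
restore_time = datetime(2023, 11, 10, 15, 0, tzinfo=timezone.utc)  # placeholder

rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="metadata-db",          # placeholder
    TargetDBInstanceIdentifier="metadata-db-restored", # placeholder
    RestoreTime=restore_time,
)

# Block until the new copy is available before copying records out of it.
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="metadata-db-restored"
)
```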

This process of spinning up a copy from our point-in-time recovery system took longer than expected, more than 30 minutes. There is some variability in how long our cloud provider takes to provision new databases, but this was far outside of our previous experience. We are still working to understand whether this is due to the growth in our database storage, changes in our cloud provider, or a short-term slow-down in provisioning.

Additionally, due to human error during the recovery process, the first point-in-time recovery we spun up actually overlapped the period during which data was being deleted, so we had to wait additional time for a second copy, restored to a point safely before the deletion began, to spin up.
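One simple safeguard against this class of mistake is to validate the requested restore timestamp against the window in which the faulty script ran before asking for the copy. A minimal sketch, with placeholder timestamps:

```python
# Minimal sketch of a guard against choosing a restore point inside (or after)
# the window in which the faulty script was running. Timestamps are placeholders.
from datetime import datetime, timezone

deletion_started_at = datetime(2023, 11, 10, 15, 5, tzinfo=timezone.utc)      # placeholder
requested_restore_time = datetime(2023, 11, 10, 15, 10, tzinfo=timezone.utc)  # placeholder

if requested_restore_time >= deletion_started_at:
    raise ValueError(
        "Restore point is not safely before the deletion window; "
        "choose an earlier timestamp."
    )
```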

By 17:10 UTC we had a relevant backup live, and started the process to restore the data from one database to the other. By 17:35, this process had completed, and having verified that the service had been restored to everyone’s applications, we closed the incident and began monitoring for any knock-on disruption.
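In outline, that restore step amounts to re-inserting only the rows that exist in the point-in-time copy but are missing from production, leaving everything changed since the incident untouched. A rough sketch, assuming SQLAlchemy against two PostgreSQL databases, with placeholder table name and connection URLs:

```python
# Rough sketch only: table name and connection URLs are placeholders, and a
# production restore would also need batching and verification, which this omits.
from sqlalchemy import MetaData, Table, create_engine, insert, select

prod = create_engine("postgresql://prod-host/metadata")       # placeholder URL
backup = create_engine("postgresql://restore-host/metadata")  # placeholder URL

# Reflect the navigation table's structure from the restored copy.
nav = Table("navigation_records", MetaData(), autoload_with=backup)

with backup.connect() as src, prod.begin() as dst:
    # IDs still present in production (anything created or kept since the incident).
    existing_ids = set(dst.execute(select(nav.c.id)).scalars())

    # Re-insert only the rows that were deleted, leaving newer changes untouched.
    for row in src.execute(select(nav)).mappings():
        if row["id"] not in existing_ids:
            dst.execute(insert(nav).values(**dict(row)))
```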

Lessons and Remediation

Following this incident, we ran a post-mortem exercise across our whole engineering team. Together we identified root causes, remediation work, and changes to our processes to avoid issues like this happening in future.