Resolved
We resolved the API token authentication issue at 23:01 CEST by rolling back the change that introduced the failure.
Between approximately 22:56 and 23:01 CEST, a subset of API requests authenticated with tokens may have returned 500 errors when they required a fresh authentication lookup. Requests using already-cached token authentication continued to work.
The issue was caused by a sequencing problem between the database migration and the code deployment. The change introduced a new table for token authentication data and was designed to be backwards compatible: if no token data existed in the new table yet, authentication would fall back to the existing data source.
This fallback behavior worked as expected in development and staging. However, our staging environment did not reproduce the production migration timing. In production, the migration ran significantly longer because of the larger dataset, and it ran inside a single transaction. Until that transaction committed, the newly created table was not yet visible to the application. As a result, fresh token authentication lookups could fail on the query against the new table, before the intended fallback path was reached.
In other words, we accounted for missing data in the new table, but not for the table itself being temporarily unavailable during a long-running transactional migration.
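To illustrate the difference between the two cases, here is a minimal sketch and not our production code. It uses an in-memory SQLite database as a stand-in, and the table names `api_tokens_v2` and `legacy_tokens`, as well as the function names, are hypothetical. The fallback only runs when the query against the new table succeeds and returns no row; if the table is not visible at all, the query itself raises an error first.

```python
import sqlite3


def lookup_token(conn, token):
    """Look up a token in the new table; fall back to the legacy table if no row is found."""
    row = conn.execute(
        "SELECT user_id FROM api_tokens_v2 WHERE token = ?", (token,)
    ).fetchone()
    if row is None:
        # Fallback path: only reached when the query above succeeded but found nothing.
        row = conn.execute(
            "SELECT user_id FROM legacy_tokens WHERE token = ?", (token,)
        ).fetchone()
    return row[0] if row else None


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE legacy_tokens (token TEXT PRIMARY KEY, user_id INTEGER)")
    conn.execute("INSERT INTO legacy_tokens VALUES ('abc', 7)")

    # Case 1: the new table is not visible yet, so the first query raises
    # before the fallback can run.
    try:
        lookup_token(conn, "abc")
        missing_table_result = "ok"
    except sqlite3.OperationalError:
        missing_table_result = "error"

    # Case 2: the new table exists but is empty, so the fallback works.
    conn.execute("CREATE TABLE api_tokens_v2 (token TEXT PRIMARY KEY, user_id INTEGER)")
    empty_table_result = lookup_token(conn, "abc")

    return missing_table_result, empty_table_result
```

In this sketch, the first case corresponds to what happened during the incident window and the second to the situation the fallback was designed for.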
This was our mistake. We tested the change and specifically designed it to avoid breaking existing authentication data, but our deployment safeguards did not adequately cover this failure mode.
Given the recent increase in production incidents, we are treating this as a high-priority reliability issue. We are pausing this rollout path and tightening our deployment process before reintroducing the change. In particular, we will separate schema changes from longer-running migration work, add explicit preflight checks for required database objects before enabling dependent code paths, and apply stricter rollout gates to authentication-related changes.
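As a rough sketch of what a preflight check for required database objects could look like, the example below again uses SQLite as a stand-in. The table name `api_tokens_v2` and the function names are hypothetical, and the real check would depend on the database and deployment tooling in use. The idea is that the dependent code path stays disabled until every required table is visible to the application.

```python
import sqlite3

# Hypothetical list of tables the new authentication path depends on.
REQUIRED_TABLES = ("api_tokens_v2",)


def missing_tables(conn, required=REQUIRED_TABLES):
    """Return the required tables that are not visible on this connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    existing = {name for (name,) in rows}
    return [table for table in required if table not in existing]


def new_path_enabled(conn):
    """Enable the dependent code path only when all required tables exist."""
    return not missing_tables(conn)
```

A deployment step could call `new_path_enabled` before switching traffic to the new lookup, and keep the existing data source in use whenever it returns `False`.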
All affected services are currently operating normally.
Investigating
We are investigating an increase in 500 errors.