Lettermint

Write-up
Service Disruption
Partial outage

Summary

On February 5, 2026, Lettermint experienced a service disruption lasting approximately 26 minutes. During this period, email sending via the API and SMTP was unavailable. The incident was caused by our cloud infrastructure provider accidentally triggering a simultaneous restart of all compute nodes in the EU-WEST-PAR region, rather than performing the intended rolling update.

No data was lost. All queued messages were delivered after services recovered.

Timeline (CET)

  • 12:20 - Infrastructure provider begins restarting compute nodes across the region

  • 12:20 - Lettermint services become unavailable as workloads are rescheduled

  • 12:22 - Database recovers and begins accepting connections

  • 12:26 - Message queue cluster re-forms, processing partially resumes

  • 12:44 - Caching layer fully restored, API requests begin succeeding

  • 12:46 - All services confirmed healthy

  • 12:47 - Full service restored

Impact

  • API: Requests returned HTTP 500 errors between 12:20 and ~12:46 CET.

  • SMTP submissions: SMTP connections were refused or timed out during the same window.

  • Dashboard: The Lettermint dashboard was intermittently unreachable.

  • Webhooks: Outbound webhook deliveries were delayed but retried successfully after recovery.

  • No data loss: All messages accepted before the incident were queued durably and delivered after recovery. No email data or customer configuration was affected.
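Transient failures like the HTTP 500 responses above are usually best absorbed on the client side with retries and exponential backoff, so a short disruption delays sends rather than dropping them. A minimal sketch in Python (the `send` callable and error type are illustrative, not part of Lettermint's SDK):

```python
import time


def backoff_schedule(attempts, base_delay=1.0):
    """Delays between retries: 1s, 2s, 4s, ... (exponential)."""
    return [base_delay * 2 ** i for i in range(attempts - 1)]


class TransientError(Exception):
    """Raised by the send callable for retryable failures (HTTP 5xx, timeouts)."""


def send_with_retry(send, message, attempts=5, base_delay=1.0):
    """Call send(message); on TransientError, back off and retry.

    Re-raises the last error once all attempts are exhausted, so
    permanent failures still surface to the caller.
    """
    delays = backoff_schedule(attempts, base_delay)
    for attempt in range(attempts):
        try:
            return send(message)
        except TransientError:
            if attempt == attempts - 1:
                raise
            time.sleep(delays[attempt])
```

Non-retryable responses (4xx client errors) should be raised immediately rather than wrapped in `TransientError`, since retrying them cannot succeed.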

Root cause

A software issue in our upstream provider's Managed Kubernetes Service (MKS) control plane triggered a simultaneous restart of all compute nodes in the EU-WEST-PAR region, rather than the intended rolling update. This affected the cluster that runs Lettermint's infrastructure. They have published details in their incident report.

The simultaneous restart meant all services (database, cache, message queue, and application servers) had to cold-start at the same time, rather than gracefully migrating between nodes. The caching layer was the last component to fully recover, which is why API errors persisted for several minutes after other services were back online.
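One common mitigation for cold-start ordering problems of this kind is to gate an application's readiness on its dependencies, so load balancers only route traffic once the cache, queue, and database all answer health checks. A minimal sketch of that pattern, with hypothetical dependency names and check callables (not Lettermint's actual startup code):

```python
import time


def wait_for_dependencies(checks, timeout=120.0, interval=2.0):
    """Poll each named health check until all pass or `timeout` elapses.

    `checks` maps a dependency name (e.g. "cache", "queue") to a
    zero-argument callable that returns True once that dependency is
    ready. Raises TimeoutError listing whatever is still not ready.
    """
    deadline = time.monotonic() + timeout
    pending = dict(checks)
    while pending:
        # Keep only the dependencies whose check still fails.
        pending = {name: check for name, check in pending.items() if not check()}
        if not pending:
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError(f"dependencies not ready: {sorted(pending)}")
        time.sleep(interval)
    return True
```

Wired into a readiness endpoint, this keeps an app server out of rotation until its slowest dependency (here, the caching layer) has recovered, instead of serving 500s in the meantime.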

What we're doing

Our platform runs with redundant replicas across three availability zones, with automated failover and graceful degradation. That design handled recovery well (full service was restored within 26 minutes of a complete cluster loss, without data loss or manual intervention). But it did not prevent the disruption itself, and we own that experience regardless of root cause.

We are in contact with our upstream provider regarding safeguards to prevent simultaneous node restarts in the future.

Closing

Reliability is the foundation of a transactional email service, and we take this disruption seriously. We're sorry your emails were delayed.

If you have any questions, please reach out to help@lettermint.co.