Resolved
No further regressions have been observed and all services remain healthy.
The incident has been resolved.
We are still expecting a follow-up from our cloud provider, and we are taking measures to prevent issues like this in the long term.
Monitoring
Our cloud provider confirmed the roll-out was stopped. We have also not observed any further node replacements in the last 30 minutes.
All services are healthy again and we will keep monitoring the situation.
Identified
We have received an update on the cause of this incident, which is in fact not related to a node hardware failure as previously thought. Our cloud provider accidentally triggered a rolling replacement of all cluster nodes in our EU-West region. This process is ongoing and may cause intermittent latency spikes or brief errors as workloads migrate and fail over between nodes. Impact should be minimal.
Our team is in contact with our cloud provider to receive updates on the resolution. We will provide updates as the situation develops.
Monitoring
We are monitoring the situation following an infrastructure node failure in one of our availability zones. Database read queries were degraded for approximately 72 seconds between 09:21 and 09:23 UTC, impacting authentication during email submission and other endpoints.
Write operations and cache services were unaffected throughout.
All services have fully recovered.
Investigating
We are investigating possible issues with access to Lettermint services.
Updates will be shared as soon as we know more.