Incident Report - February 4, 2026

Vishnu Narayanan

Chatwoot Cloud experienced an incident on February 4 that lasted approximately 12 minutes, from 12:48 PM to 1:00 PM UTC. During this time, all Chatwoot Cloud users were unable to access the platform. No data was lost during this incident.

Our sincerest apologies for the disruption. Reliability is the top priority for us at Chatwoot. We have identified the risks and have taken steps to mitigate such events in the future.

Timeline

February 4, 2026

  • 12:43 PM: Database instability began, connections started failing
  • 12:48 PM: Service disruption began, team started investigating
  • 12:58 PM: Root cause identified as storage exhaustion; storage capacity increase initiated
  • 1:00 PM: Storage scaling completed, service fully restored

All times are in Coordinated Universal Time (UTC)

What happened

We were preparing a PostgreSQL version upgrade using AWS RDS blue-green deployments. The deployment failed, but it remained in a pending state and was not cleaned up.
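
Left-over blue-green deployments can be inspected and removed through the RDS API. The sketch below uses boto3 with a placeholder deployment identifier; it illustrates the kind of cleanup that was missed, not the exact procedure we ran.

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # List blue-green deployments and their current status.
    for deployment in rds.describe_blue_green_deployments()["BlueGreenDeployments"]:
        print(deployment["BlueGreenDeploymentIdentifier"], deployment["Status"])

    # Remove a failed deployment so it no longer holds resources on the primary.
    # DeleteTarget=True also deletes the green environment created for the upgrade.
    rds.delete_blue_green_deployment(
        BlueGreenDeploymentIdentifier="bgd-example0123456789",  # placeholder
        DeleteTarget=True,
    )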

RDS blue-green deployments rely on logical replication. When the failed deployment was left behind, it retained a replication slot on the primary database. That replication slot prevented PostgreSQL from recycling write-ahead log (WAL) files.
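
Orphaned slots show up as inactive entries in pg_replication_slots. A minimal sketch, assuming a Python environment with psycopg2 and placeholder connection details:

    import psycopg2

    conn = psycopg2.connect(host="db.example.internal", dbname="postgres",
                            user="monitor", password="secret")  # placeholders
    with conn, conn.cursor() as cur:
        # Inactive slots still pin WAL on the primary through their restart_lsn.
        cur.execute("SELECT slot_name, slot_type FROM pg_replication_slots WHERE NOT active")
        for slot_name, slot_type in cur.fetchall():
            print(f"orphaned slot: {slot_name} ({slot_type})")
            # Dropping the slot releases the WAL it retains:
            # cur.execute("SELECT pg_drop_replication_slot(%s)", (slot_name,))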

As a result, WAL files continued accumulating over three days. We had roughly 1 TB of actual data, but an additional ~1 TB of WAL built up, pushing us to our 2 TB storage autoscaling limit.
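
The amount of WAL a slot is holding back can be estimated by comparing the current write position with the slot's restart_lsn. A rough sketch, with the same placeholder connection details as above:

    import psycopg2

    conn = psycopg2.connect(host="db.example.internal", dbname="postgres",
                            user="monitor", password="secret")  # placeholders
    with conn, conn.cursor() as cur:
        # How much WAL each replication slot is retaining, largest first.
        cur.execute("""
            SELECT slot_name,
                   pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
            FROM pg_replication_slots
            WHERE restart_lsn IS NOT NULL
            ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC
        """)
        for slot_name, retained in cur.fetchall():
            print(f"{slot_name}: ~{retained} of WAL retained")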

Once storage was fully exhausted, the database stopped accepting connections, which caused the service disruption. After identifying the root cause, we immediately increased the storage capacity and restored service.
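
For completeness, an RDS storage increase of this kind can be applied outside the maintenance window. A rough sketch with boto3; the instance identifier and sizes are placeholders, not our production values:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    rds.modify_db_instance(
        DBInstanceIdentifier="chatwoot-primary-example",  # placeholder
        AllocatedStorage=3000,      # GiB: new baseline above what data plus WAL consumed
        MaxAllocatedStorage=4000,   # GiB: raised storage-autoscaling ceiling
        ApplyImmediately=True,      # apply now rather than in the maintenance window
    )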

Follow-Up Actions and Preventive Measures

To prevent similar incidents, we are implementing the following changes:

  • Proactive storage monitoring: We are adding alerts at multiple storage utilization thresholds (60%, 75%, 90%) to catch capacity issues before they become critical (a rough sketch of such an alarm follows this list).
  • Replication slot monitoring: We are implementing monitoring for database replication slots to detect orphaned slots that could cause WAL accumulation.
  • Database maintenance runbooks: We are creating detailed runbooks for database upgrade procedures with mandatory cleanup steps when deployments fail.
  • Infrastructure capacity review: We are reviewing storage limits and autoscaling configurations across all production systems to ensure adequate headroom.
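
As one example of the storage monitoring mentioned above, per-threshold CloudWatch alarms on the RDS FreeStorageSpace metric (reported in bytes) could look roughly like this. The instance name, SNS topic, and volume size are placeholders:

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    TOTAL_BYTES = 2 * 1024**4  # assume a 2 TiB volume for illustration

    for pct in (60, 75, 90):
        cloudwatch.put_metric_alarm(
            AlarmName=f"rds-storage-used-{pct}pct",
            Namespace="AWS/RDS",
            MetricName="FreeStorageSpace",
            Dimensions=[{"Name": "DBInstanceIdentifier",
                         "Value": "chatwoot-primary-example"}],  # placeholder
            Statistic="Average",
            Period=300,
            EvaluationPeriods=3,
            # Alarm when free space drops below the chosen utilization threshold.
            Threshold=TOTAL_BYTES * (100 - pct) / 100,
            ComparisonOperator="LessThanThreshold",
            AlarmActions=["arn:aws:sns:us-east-1:123456789012:db-alerts"],  # placeholder
        )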