Between April 28 and April 30, the Composio platform experienced repeated disruptions to core platform APIs. All customers saw elevated error rates, timeouts, and increased latencies during these windows; the cumulative platform API degradation totaled about 53 minutes.
During the same period, webhook triggers for Slack, Outlook, Notion, and HubSpot were unavailable for roughly 36 hours, which directly affected approximately 700 customers with active triggers on those integrations.
We sincerely apologize. Both of these disruptions affected real workflows for our customers. Below is a detailed account of what happened, what we've done to address it, and what we're doing to prevent similar incidents in the future.
What happened
On the morning of April 28, our monitoring detected intermittent API degradation. Our initial hypothesis was that higher webhook volume from a new set of triggers on Slack was creating excessive database load, and we deployed a caching fix to reduce per-event query volume. This did not resolve the issue, so we decided to pursue infrastructure-level fixes.
After performing database storage tier upgrades and restoring Slack triggers, we observed that database load continued to climb. Through further investigation, we identified the actual root cause.
A database table used by our trigger processing pipeline to queue messages for processing had grown unboundedly. The background cleaner job responsible for pruning processed messages from this table had been silently failing since approximately April 6: its queries timed out against the very table it was meant to keep bounded, which allowed further growth and in turn made the timeouts worse, a self-reinforcing cycle.
The accumulated table size eventually placed exceptional load on the shared database, degrading query performance across the platform and causing the cascading API disruptions our customers experienced. The full analysis is in the "Root cause" section below.
To protect core platform APIs, we disabled webhook trigger ingestion, starting with Slack triggers, the most voluminous source, and subsequently disabling all remaining triggers. From there, we began work to remediate the issue and to migrate the trigger processing pipeline to its own isolated database, to compartmentalize future failures.
During remediation on April 29, we attempted a one-off job to clear the accumulated data from the table. This operation caused a second API disruption that lasted approximately 45 minutes.
We also want to acknowledge that our communication during this incident fell short of what our customers needed and deserved. Throughout the 36-hour trigger outage window, we did not provide timely status updates or ETAs, and many of you had to reach out proactively to learn what was happening. That is not acceptable, and we take full responsibility. We are overhauling our incident communication process; the specific changes are below in "What we're doing next."
The migration to a dedicated trigger database was completed on April 30. All webhook triggers, including Slack, Outlook, Notion, and HubSpot, were brought back online and confirmed healthy.
Timeline (all times Pacific)
April 28, 7:20 AM. Intermittent API degradation detected. Auto-recovered within ~5 minutes.
April 28, 11:15 AM. Second brief API degradation from continued webhook traffic pressure.
April 28, 11:25 AM. Caching fix deployed to reduce database load from webhook verification queries. The fix did not resolve the issue. Team decided to pursue infrastructure-level fixes.
April 28, 12:48 PM. All Slack webhook triggers disabled, as they were the most voluminous source of load. Done as a precautionary measure to prevent degradation of the overall platform.
April 29, 12:46 AM. Older Slack triggers restored after database storage tier upgrades, with monitoring in place before re-enabling the newer triggers.
April 29, ~7:00 AM. Root cause identified: unbounded growth in the trigger processing table due to the cleaner job failing since early April.
April 29, 7:56 AM. API disruption caused by a one-off cleanup job run to clear accumulated data from the trigger processing table, which resulted in database lock contention. Resolved within ~45 minutes.
April 29, 9:00 AM. All webhook triggers disabled again. Team began to plan a migration to an isolated database.
April 29, ~10:30 AM. Migration of trigger processing pipeline to a dedicated isolated database started.
April 30, 12:30 AM. Trigger processing pipeline migration to dedicated database completed. Webhook triggers for Slack (previous versions), Outlook, Notion, and HubSpot restored. After a few hours of observation, the newer Slack webhook triggers were re-enabled.
Root cause
Our webhook trigger processing pipeline uses a database table to track encrypted messages due for processing. A background cleaner job is responsible for pruning processed messages from this table to keep it bounded and performant.
This cleaner job had been failing since approximately April 6 due to query timeouts caused by the table's own unbounded growth: a self-reinforcing cycle in which the growing table caused the very job meant to keep it bounded to time out, allowing further growth.
With the cleaner silently failing, the table grew unboundedly over time. Our infrastructure was designed and tested to handle a larger webhook volume than what we currently serve, but that capacity planning assumes a healthy maintenance layer underneath. The accumulated table size eventually placed exceptional load on the database, degrading query performance across the platform and causing the cascading API disruptions our customers experienced.
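We aren't publishing the internal cleaner code, but to make the failure mode concrete, here is a minimal Python sketch of a pruning job that deletes in small, fixed-size batches, so no single statement grows with the table and hits a query timeout. The schema, retention window, and connection details below are illustrative placeholders, not our actual implementation.

```python
# Minimal sketch of batched pruning (placeholder schema, not production code).
# Deleting in fixed-size batches keeps each statement's work roughly constant,
# so the job cannot time out simply because the table has grown.
import time

import psycopg2  # assumed Postgres driver; any DB-API driver works the same way

BATCH_SIZE = 5_000
RETENTION = "7 days"


def prune_processed_messages(conn) -> int:
    """Delete processed rows in batches until none remain; return rows removed."""
    total = 0
    while True:
        with conn.cursor() as cur:
            cur.execute(
                """
                DELETE FROM trigger_messages
                WHERE id IN (
                    SELECT id
                    FROM trigger_messages
                    WHERE status = 'processed'
                      AND processed_at < now() - %s::interval
                    LIMIT %s
                )
                """,
                (RETENTION, BATCH_SIZE),
            )
            deleted = cur.rowcount
        conn.commit()
        total += deleted
        if deleted < BATCH_SIZE:
            return total
        time.sleep(0.1)  # brief pause so pruning never starves foreground queries


if __name__ == "__main__":
    connection = psycopg2.connect("dbname=triggers")  # placeholder DSN
    print(prune_processed_messages(connection), "rows pruned")
```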
Impact
API degradation affected all customers on the platform during the intermittent disruption windows, resulting in high error rates, timeouts, and elevated latencies across platform APIs.
Webhook trigger outage directly affected approximately 700 customers with active triggers on the impacted integrations. Webhook events that arrived while triggers were disabled were not queued by the platform and are unfortunately not recoverable.
What we've already done
The following changes are completed and live in production.
Isolated trigger database. The webhook trigger processing pipeline now runs on its own dedicated database with an independent connection pool. Webhook traffic spikes are fully decoupled from core platform APIs. A surge in inbound events can no longer affect tool execution, auth flows, or any other platform service.
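For a concrete picture of what "isolated" means here, the sketch below shows the general pattern of separate engines and connection pools against separate databases. The URLs, pool sizes, and use of SQLAlchemy are illustrative assumptions, not our actual configuration.

```python
# Illustrative only: two independent engines/pools so webhook ingestion and
# core platform APIs can never compete for the same connections or database.
from sqlalchemy import create_engine

core_api_engine = create_engine(
    "postgresql+psycopg2://app@core-db/platform",    # placeholder URL
    pool_size=20,
    max_overflow=10,
)

trigger_engine = create_engine(
    "postgresql+psycopg2://app@trigger-db/triggers",  # placeholder URL
    pool_size=50,      # sized for bursty webhook ingestion
    max_overflow=0,    # a surge saturates this pool, not the platform's
)
```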
Dedicated monitoring dashboard. A new real-time dashboard tracks webhook trigger health, ingestion rates, processing throughput, and database metrics. This gives our on-call team immediate visibility into trigger-specific issues.
Audited every maintenance job and added observability. We did a deep audit of every maintenance job across the platform, mapping which had failure alerting in place and which didn't. We then added monitoring and observability everywhere it was missing, including cleaner jobs, pruning tasks, and health checks. Every one of these should have had alerts from the start. The audit itself is the real process change. If a job is important enough to exist, it is important enough to be monitored.
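As an example of the pattern we rolled out, simplified and with stand-in alerting hooks rather than our real integrations, every maintenance job now runs through a wrapper that pages on failure and emits a success heartbeat, so a job that silently stops running is noticed within one schedule interval instead of weeks later.

```python
# Simplified sketch: alert() and heartbeat() are stand-ins for real integrations.
import functools
import logging
import time

log = logging.getLogger("maintenance")


def alert(message: str) -> None:
    """Stub: in production this pages on-call (PagerDuty, Slack, etc.)."""
    log.error("ALERT: %s", message)


def heartbeat(name: str, duration: float) -> None:
    """Stub: in production this pings a dead-man's-switch monitor."""
    log.info("heartbeat: %s completed in %.1fs", name, duration)


def monitored_job(name: str):
    """Wrap a maintenance job so failures page and successes leave a heartbeat."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            started = time.monotonic()
            try:
                result = fn(*args, **kwargs)
            except Exception:
                log.exception("maintenance job %s failed", name)
                alert(f"maintenance job {name} failed")
                raise
            heartbeat(name, time.monotonic() - started)
            return result
        return wrapper
    return decorator


@monitored_job("trigger-message-cleaner")
def run_cleaner() -> None:
    ...  # the actual pruning logic lives here
```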
What we're doing next
Beyond the immediate fixes, we're investing in longer-term improvements to prevent this class of incident.
Replacing the database-backed queue. The current architecture, which uses a database table as a message queue, functions adequately but depends on the cleaner job remaining healthy. We are migrating to a purpose-built message queue that handles message lifecycle natively, eliminating this class of failure entirely. While this was already on our roadmap, we have fast-tracked it given the incident.
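To make the distinction concrete, here is a sketch of the consumer side against a managed queue. Amazon SQS is used purely as an example of a broker that owns the message lifecycle; this is not a statement of which queue we are adopting, and the handler shown is hypothetical.

```python
# Illustrative consumer against a managed queue: acknowledged messages are
# deleted by the broker itself, so there is no table to prune and no cleaner
# job whose health the whole pipeline depends on.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/trigger-events"  # placeholder


def process_trigger_event(body: str) -> None:
    """Hypothetical handler; stands in for the real trigger dispatch logic."""
    print("processing", body)


def consume_batch() -> None:
    # Long-poll for up to 10 messages at a time.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    for msg in resp.get("Messages", []):
        process_trigger_event(msg["Body"])
        # Acknowledge by deleting: the broker handles message lifecycle natively.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```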
Incident communication overhaul. We did a poor job communicating during this incident. Many of you had to come to us before we came to you. We've overhauled our internal process so that the next time something goes wrong, you'll know about it much sooner. For any major incident from here on, you can expect:
An initial notification from us within 30 minutes of when we detect impact.
Status updates every 60 minutes while the incident is active.
A clear resolution notification when service is restored.
A public post-mortem like this one, with the full account of what happened.
We're also tightening up our status page so it stays a reliable place to check during disruptions.
Stricter review for high-risk one-off operations. The API disruption on April 29 was caused by a one-off table clearing job that lacked sufficient review before execution. We are instituting a more rigorous review process for high-risk recovery operations, ensuring they require independent review and approval before execution, even under time pressure.
Closing
The architectural changes above, especially isolating the trigger pipeline onto its own database and the audit of every maintenance job in the system, are designed to prevent this class of failure from recurring. And the next time something does go wrong, you will hear from us within 30 minutes of when we know about it.
If you have questions about how this incident affected your workflows, email support@composio.dev. We will work through it with you.
— Team Composio