Queues and runs have been processing at good speeds now for several hours on v2 and v3.
Before we get into what happened, I want to emphasise how important reliability is to us. This incident fell far short of the great experience and reliable service you deserve. We're really sorry for the problems it has caused for you all.
All paying customers will get a full refund for the entirety of May.
What caused this?
This issue started with a huge number of queued events in v2. Our internal system that enforces concurrency limits on v2 was slowing their processing while also adding database load to the underlying v2 queuing engine, Graphile Worker. Graphile Worker is powered by Postgres, so the extra load created a vicious cycle.
We've scaled by many orders of magnitude in the past year, and all our normal upgrades and tuning didn't work here. We upgraded servers and, most importantly, the database (twice). We tuned some parameters. But as the backlog grew, the concurrency limits built into the v2 system made it harder to recover. Ordinarily this limiter distributes v2 runs fairly and prevents very high load by smoothing out spiky demand.
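To make the vicious cycle concrete, here is a minimal, purely illustrative simulation (the numbers and the linear cost model are assumptions, not our real figures): every queued run over the limit triggers extra queue queries each tick, those queries eat database time, effective throughput drops, and the backlog grows even faster.

```python
def simulate(ticks: int, arrivals_per_tick: int, concurrency_limit: int,
             base_throughput: int, query_cost: float) -> list[int]:
    """Return the backlog size after each tick when over-limit runs
    are re-checked against the database on every tick."""
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick
        # Every queued run over the limit issues extra queue queries.
        extra_queries = max(0, backlog - concurrency_limit)
        # Database time spent on those queries reduces real throughput.
        throughput = max(1, int(base_throughput - query_cost * extra_queries))
        # We can never run more than the concurrency limit at once.
        backlog -= min(throughput, concurrency_limit)
        history.append(backlog)
    return history

# Illustrative parameters: demand slightly exceeds capacity, so the
# backlog grows, the extra queries grow with it, and recovery stalls.
history = simulate(ticks=20, arrivals_per_tick=50, concurrency_limit=40,
                   base_throughput=60, query_cost=0.5)
```

In this toy model the backlog grows every tick, and the deeper it gets, the more database time the limiter burns just re-checking it, which is the spiral we saw.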
What was impacted?
What we've done (so far)
- We upgraded some hardware. This lets us process more runs, but it didn't help us escape the spiral.
- We modified the v2 concurrency filter so it reschedules runs that are over the limit with a slight delay. Previously it re-fetched the same over-limit runs immediately, thrashing the database, which could cause huge load in edge cases like this one.
- We've upgraded Graphile Worker, the core of the v2 queuing system, to v0.16.6. This release includes many performance improvements, so we can cope with more load than before.
- We have far better diagnostic tools than before.
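The concurrency-filter change can be sketched like this (a simplified model: the names, data structures, and 5-second delay are illustrative assumptions, not the actual implementation). Over-limit runs get a future `run_at` instead of going straight back into the pollable set, so the next poll no longer re-fetches them:

```python
import heapq

RESCHEDULE_DELAY = 5  # seconds; illustrative value

def poll(queue: list, now: float, running: set, limit: int) -> int:
    """Pop due runs from a (run_at, run_id) min-heap; start those under
    the concurrency limit and reschedule the rest with a delay.
    Returns how many runs were fetched, as a proxy for database load."""
    fetches = 0
    while queue and queue[0][0] <= now:
        run_at, run_id = heapq.heappop(queue)
        fetches += 1
        if len(running) < limit:
            running.add(run_id)
        else:
            # The old behaviour effectively pushed (now, run_id) back,
            # so the very next poll fetched it again. A small delay
            # breaks that loop.
            heapq.heappush(queue, (now + RESCHEDULE_DELAY, run_id))
    return fetches
```

With a limit of 1 and three runs due at t=0, the first poll fetches all three, starts one, and delays two; a poll one second later fetches nothing, whereas the old behaviour would have re-fetched the same two runs again.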
Today we reached a new level of scale that exposed weaknesses in parts of the system that had been working well until now. Reliability is never finished, so work continues tomorrow.