Status information about the cloud.trigger.dev API and Dashboard
May 10, 2024
Resolved
May 10, 2024 21:17 (UTC)
Before we get into what happened, I want to emphasise how important reliability is to us. This incident fell far short of the great experience and reliable service we want to provide. We're really sorry for the problems this has caused for you all.
All paying customers will get a full refund for the entirety of May.
This issue started with a huge number of queued events in v2. Our internal system that handles concurrency limits on v2 was slowing down the processing of them, but it was also causing more database load on the underlying v2 queuing engine: Graphile Worker. Graphile Worker is powered by Postgres, so the extra database load slowed processing further and created a vicious cycle.
We've scaled many orders of magnitude in the past year, and none of our normal upgrades and tuning worked here. We upgraded servers and, most importantly, the database (twice). We tuned some parameters. But as the backlog grew, it became harder to recover from because of the concurrency limits built into the v2 system. Ordinarily this limiter distributes v2 runs fairly and prevents very high load by smoothing out very spiky demand.
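To illustrate the general idea of a limiter that distributes runs fairly, here is a minimal sketch in Python. It is not our actual v2 implementation: the `FairLimiter` class, its method names, and the per-tenant limit are all hypothetical, and it keeps state in memory rather than in Postgres. It shows round-robin selection across tenants combined with a cap on how many runs each tenant can have in flight.

```python
from collections import defaultdict, deque


class FairLimiter:
    """Hypothetical per-tenant concurrency limiter with round-robin dequeue.

    Illustrative only: an in-memory sketch, not the real v2 limiter.
    """

    def __init__(self, per_tenant_limit: int) -> None:
        self.per_tenant_limit = per_tenant_limit
        self.queued = defaultdict(deque)  # tenant -> waiting run ids
        self.running = defaultdict(int)   # tenant -> runs currently in flight
        self.rotation = deque()           # tenants in round-robin order

    def enqueue(self, tenant: str, run_id: str) -> None:
        """Add a run to the tenant's queue, registering the tenant if new."""
        if tenant not in self.rotation:
            self.rotation.append(tenant)
        self.queued[tenant].append(run_id)

    def next_run(self):
        """Return (tenant, run_id) for the next startable run, or None."""
        for _ in range(len(self.rotation)):
            tenant = self.rotation.popleft()
            self.rotation.append(tenant)  # move this tenant to the back
            has_work = bool(self.queued[tenant])
            under_limit = self.running[tenant] < self.per_tenant_limit
            if has_work and under_limit:
                self.running[tenant] += 1
                return tenant, self.queued[tenant].popleft()
        return None  # nothing queued, or every tenant is at its limit

    def complete(self, tenant: str) -> None:
        """Mark one of the tenant's in-flight runs as finished."""
        self.running[tenant] -= 1
```

With a limit of 2, a tenant that enqueues five runs at once cannot starve a tenant that enqueues one: the limiter alternates between them, then returns `None` once the busy tenant has two runs in flight, until `complete` frees a slot. The trade-off is the one described above: when the backlog is very large, the same cap also limits how quickly the queue can drain.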
Today we reached a new level of scale that exposed limits in parts of the system that had been working well until now. Reliability is never finished, so work continues tomorrow.
Monitoring
May 10, 2024 16:40 (UTC)
Runs have been operating with normal queue times again for a couple of hours on both version 2 and version 3. We're continuing to monitor this and are pushing some more changes which should provide better throughput.
v3 was only impacted because database load had spiked because of v2 runs.
Summary of the issue: We use Graphile Worker, which is a Postgres-based queuing system. This is the core of v2 and has scaled really well for us. We've been upgrading the database and the workers that pull jobs over the past 12 months as demand has increased, and it's worked great.
Yesterday the performance of the database started getting worse. We're 95% confident this is because of the Graphile queue table. The queue has to be in the primary database (the one that handles writes), so the queue performance problems slowed down other queries too. That's why v2 and v3 runs have been slow to process, and why dashboard performance has been degraded for everyone: some pages in the dashboard use our read replica, but some have to use the primary.
We performed the usual upgrades (and more) but that didn't fully fix the problem. It helped a bit, but it didn't put us back in a steady state where we're processing runs fast again. Since then we've been upgrading and changing settings on various parts of the infrastructure to improve throughput without overloading the database.
Identified
May 09, 2024 21:00 (UTC)
A large increase in usage has caused a big spike in database load and a backlog of runs that we are working to rectify. Service will be degraded, especially on the dashboard, while we fix things.
Apr 29, 2024
Apr 29, 2024 15:45 (UTC)
Runs and deploys are back to processing normally. It was an issue with Docker images not being pulled from the cache.
Investigating
Apr 29, 2024 14:44 (UTC)
Runs are queued for longer than normal and deploys are taking a long time. We believe this is because there was an infinite checkpoint restore which caused us to hit a Docker registry rate limit.
Apr 02, 2024
Apr 02, 2024 12:12 (UTC)
This issue has been fixed.
Apr 02, 2024 12:00 (UTC)
A WebSocket issue was causing the dashboard to crash and preventing the new servers from coming up. This only impacts the dashboard, not the job runners.
Apr 02, 2024 16:20 (UTC)
The new rate limiter was blocking:
Mar 12, 2024
Mar 12, 2024 11:39 (UTC)
Our database has recovered and runs are now processing normally again.
Mar 12, 2024 09:30 (UTC)
A very large spike in runs caused high CPU usage on the database. This made the dashboard very slow to load and reduced the speed at which runs were being processed.
We have fixed the issue but it will take a couple of hours to get back to normal run processing times.
Nov 27, 2023
Nov 27, 2023 16:52 (UTC)
The issue has been mitigated and DB CPU % has come down to normal levels.
Nov 27, 2023 16:41 (UTC)
We are encountering an issue that is causing our database CPU to spike to 100%, which is causing downstream issues with our API and Dashboard. We're currently working on a fix and hope to be back to full health shortly.