Trigger.dev Status

Stay informed about the cloud.trigger.dev API and Dashboard

Severe dashboard degradation and slower processing of runs

May 09, 2024 21:00

Monitoring
Dashboard & API

monitoring

May 10, 2024 16:40

Runs have been back operating with normal queue times for a couple of hours on both version 2 and version 3. We're continuing to monitor this and pushing some more changes which should provide better throughput.

v3 was impacted only because database load had spiked as a result of v2 runs.

Summary of the issue: We use Graphile Worker, which is a Postgres-based queuing system. This is the core of v2 and has scaled really well for us. We've been upgrading the database and the workers that pull jobs over the past 12 months as demand has increased, and it's worked great.

Yesterday the performance of the database started getting worse. We're 95% confident this is because of the Graphile queue table. The queue has to be in the primary database (the one that handles writes), so the queue performance problems slowed down other queries too. That's why v2 and v3 runs have been slow to process, and why dashboard performance has been degraded for everyone: some pages in the dashboard use our read replica, but some have to use the primary.
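To illustrate the primary/replica split described above, here is a minimal sketch of how queries might be routed. The names (`QueryInfo`, `chooseDatabase`) and the routing rules are assumptions made for illustration; they are not taken from the Trigger.dev codebase.

```typescript
// Hypothetical sketch: decide whether a query goes to the primary or a read replica.
// All names and rules here are illustrative assumptions.
type DatabaseTarget = "primary" | "replica";

interface QueryInfo {
  // True if the query inserts, updates, or deletes rows.
  writes: boolean;
  // True if the query touches the job queue table, which lives on the primary.
  touchesQueue: boolean;
  // True if the caller must see its own most recent writes.
  needsFreshData: boolean;
}

function chooseDatabase(query: QueryInfo): DatabaseTarget {
  // Anything that writes, uses the queue, or cannot tolerate replica lag
  // has to go to the primary.
  if (query.writes || query.touchesQueue || query.needsFreshData) {
    return "primary";
  }
  // Everything else can be served by the read replica.
  return "replica";
}

// A read-only dashboard page can use the replica...
const runListPage = chooseDatabase({ writes: false, touchesQueue: false, needsFreshData: false });
// ...but pulling a job off the queue must hit the primary.
const dequeueJob = chooseDatabase({ writes: true, touchesQueue: true, needsFreshData: true });
```

Under this kind of routing, a slow primary affects every page and job that cannot be moved to the replica, which matches the symptoms described above.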

We performed the usual upgrades (and more) but that didn't fully fix the problem. It helped a bit, but it didn't put us back to a steady state where we're processing runs fast again. Since then we've been upgrading and changing settings on various parts of the infrastructure to improve throughput without overloading the database.

identified

May 09, 2024 21:00

A large increase in usage has caused a big spike in database load and a backlog of runs that we are working to clear. Service will be reduced, especially on the dashboard, while we fix things.

v3 Runs and deploys experiencing increased queue times

Apr 29, 2024 14:44

Resolved
Dashboard & API

resolved

Apr 29, 2024 15:45

Runs and deploys are back to processing normally. It was an issue with Docker images not being pulled from the cache.

investigating

Apr 29, 2024 14:44

Runs are queued for longer than normal and deploys are taking a long time. We believe this is because there was an infinite checkpoint restore which caused us to hit a Docker registry rate limit.

API requests getting blocked by the rate limiter: webhooks, notifications, HTTP endpoints

Apr 02, 2024 16:20

Resolved
Dashboard & API

identified

Apr 02, 2024 16:20

The new rate limiter was blocking:

  1. Webhooks from being received.
  2. HTTP endpoint requests from being received.
  3. Callbacks from being fired. triggerAndWait and batchTriggerAndWait are impacted by this.
  4. Automatic indexing when deploying new code to your environments.
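
The incident notes do not say how the rate limiter was changed. As a rough sketch of one common approach, the limiter below skips a list of exempt path prefixes so that system traffic is never blocked. The class name, path prefixes, and limits are all hypothetical and do not reflect Trigger.dev's real routes or settings.

```typescript
// Hypothetical sketch: a fixed-window rate limiter with exempt path prefixes.
// The prefixes below are illustrative assumptions, not real Trigger.dev routes.
const EXEMPT_PREFIXES = ["/webhooks/", "/internal/callbacks/", "/internal/indexing/"];

interface WindowEntry {
  windowStart: number;
  count: number;
}

class FixedWindowLimiter {
  private counts = new Map<string, WindowEntry>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request should be allowed.
  allow(key: string, path: string, nowMs: number): boolean {
    // Exempt traffic bypasses the limiter and is not counted.
    if (EXEMPT_PREFIXES.some((prefix) => path.startsWith(prefix))) {
      return true;
    }
    const entry = this.counts.get(key);
    // Start a new window if there is none yet or the old one has expired.
    if (!entry || nowMs - entry.windowStart >= this.windowMs) {
      this.counts.set(key, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (entry.count < this.limit) {
      entry.count += 1;
      return true;
    }
    return false;
  }
}

// Usage: at most 2 requests per second per key, with webhooks exempt.
const limiter = new FixedWindowLimiter(2, 1000);
const first = limiter.allow("key-a", "/api/v1/runs", 0);
const second = limiter.allow("key-a", "/api/v1/runs", 10);
const third = limiter.allow("key-a", "/api/v1/runs", 20);
const webhook = limiter.allow("key-a", "/webhooks/example", 30);
const nextWindow = limiter.allow("key-a", "/api/v1/runs", 1000);
```

The current time is passed in as an argument rather than read from the clock so the behaviour is easy to test.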

Dashboard servers failing and not rebooting correctly.

Apr 02, 2024 12:00

Resolved
Dashboard & API

resolved

Apr 02, 2024 12:12

This issue has been fixed.

identified

Apr 02, 2024 12:00

A WebSocket issue was causing the dashboard to crash and preventing new servers from coming up. This only impacts the dashboard, not the job runners.

Degraded database performance

Mar 12, 2024 09:30

Resolved
Dashboard & API

resolved

Mar 12, 2024 11:39

Our database has recovered and runs are now processing normally again.

identified

Mar 12, 2024 09:30

A very large spike in runs caused high CPU usage on the database. This made the dashboard very slow to load and reduced the speed at which runs were being processed.

We have fixed the issue but it will take a couple of hours to get back to normal run processing times.

API and Dashboard degradation

Nov 27, 2023 16:41

Resolved
Dashboard & API

resolved

Nov 27, 2023 16:52

The issue has been mitigated and DB CPU % has come down to normal levels.

identified

Nov 27, 2023 16:41

We are encountering an issue that is causing our database CPU to spike to 100% which is causing downstream issues with our API and Dashboard. We're working on a fix currently and hope to be back to full health shortly.

API and Dashboard are down

Sep 13, 2023 17:26

Resolved
Dashboard & API

resolved

Oct 01, 2023 08:39

This issue is resolved.

resolved

Sep 13, 2023 17:26

We are currently facing issues with our logs provider which is causing our web servers to fail, which is affecting our API and dashboard. We're looking into fixing the issue now by removing the log provider.

We have pushed a fix removing the log provider and that has fixed the issue.

API & Dashboard degradation

Sep 24, 2023 21:17

Resolved
Dashboard & API

resolved

Sep 29, 2023 04:29

monitoring

Sep 24, 2023 21:24

The hotfix has been deployed and API and dashboard services are resuming normal operation.

identified

Sep 24, 2023 21:17

We're facing a degradation of our API and dashboard services and are pushing a hotfix now.