Trigger.dev Status
Stay informed about the cloud.trigger.dev API and Dashboard
May 9, 2024
2 years ago
Severe dashboard degradation and slower processing of runs
Dashboard & API
Resolved May 10 at 9:17 PM (in 1 day)

Queues and runs have been processing at good speeds now for several hours on v2 and v3.

Before we get into what happened I want to emphasise how important reliability is to us. This has fallen very short of providing you all with a great experience and a reliable service. We're really sorry for the problems this has caused for you all.

All paying customers will get a full refund for the entirety of May.

What caused this?

This issue started with a huge number of queued events in v2. Our internal system that handles concurrency limits on v2 was slowing down the processing of them, but it was also adding database load on the underlying v2 queuing engine, Graphile Worker. Graphile Worker is powered by Postgres, so this created a vicious cycle.

We've scaled many orders of magnitude in the past year and all our normal upgrades and tuning didn't work here. We upgraded servers and most importantly the database (twice). We tuned some parameters. But as the backlog was growing it became harder to recover from because of the concurrency limits built into the v2 system. Ordinarily this limiter distributes v2 runs fairly and prevents very high load by smoothing out very spiky demand.

What was impacted?

    What we've done (so far)

    • We upgraded some hardware. This means we can process more runs but it didn't help us escape the spiral.
    • We modified the v2 concurrency filter so it reschedules runs that are over the limit with a slight delay. Before it was thrashing the database with the same runs and could cause a huge load in edge cases like this.
    • We've upgraded Graphile Worker, the core of the v2 queuing system, to v0.16.6. This has a lot of performance improvements so we can cope with more load than before.
    • We have far better diagnostic tools than before.
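
    To illustrate the concurrency filter change described above, here is a minimal sketch of "reschedule with a slight delay" instead of retrying immediately. It is not the actual Trigger.dev or Graphile Worker code: the names (QueuedRun, decide), the per-organization limit, and the delay and jitter values are all assumptions made up for this example.

```typescript
// Hypothetical sketch; none of these names come from the Trigger.dev codebase.
interface QueuedRun {
  id: string;
  orgId: string;
  runAt: number; // epoch milliseconds at which the run becomes eligible
}

type Decision =
  | { action: "execute"; run: QueuedRun }
  | { action: "reschedule"; run: QueuedRun };

// Decide what to do with a run that a worker has just picked up.
// If the organization is under its concurrency limit, the run executes.
// Otherwise the run is pushed into the future by a base delay plus jitter,
// so the same over-limit runs are not fetched again on every poll.
function decide(
  run: QueuedRun,
  activeByOrg: Map<string, number>,
  limit: number,
  now: number,
  baseDelayMs: number = 5000, // assumed value
  jitterMs: number = 2000, // assumed value
  random: () => number = Math.random,
): Decision {
  const active = activeByOrg.get(run.orgId) ?? 0;

  if (active < limit) {
    activeByOrg.set(run.orgId, active + 1);
    return { action: "execute", run };
  }

  // Jitter spreads rescheduled runs out so they do not all return at once.
  const delay = baseDelayMs + Math.floor(random() * jitterMs);
  return { action: "reschedule", run: { ...run, runAt: now + delay } };
}
```

    The point of the delay is that an over-limit run stops being eligible for a few seconds, so workers polling the queue table skip it rather than repeatedly selecting and rejecting it.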

    Today we reached a new level of scale that highlighted some things that up until now had been working well. Reliability is never finished, so work continues tomorrow.

    Monitoring May 10 at 4:40 PM (5 hours earlier)

    Runs have been back operating with normal queue times for a couple of hours on both version 2 and version 3. We're continuing to monitor this and pushing some more changes which should provide better throughput.

    v3 was only impacted because the database had spiked because of v2 runs.

    Summary of the issue: We use Graphile Worker which is a Postgres based queuing system. This is the core of v2 and has scaled really well for us. We've been upgrading the database and workers that pull jobs over the past 12 months as demand has increased and it's worked great.

    Yesterday the performance of the database started getting worse. We're 95% confident this is because of the Graphile queue table. The queue has to be in the primary database (the one that handles writes). The queue performance problems had the effect of slowing down other queries too. That's why v2 and v3 runs have been slow to process, and why dashboard performance has been degraded for everyone: some pages in the dashboard use our read replica but some have to use the primary.

    We performed the usual upgrades (and more) but that didn't fully fix the problem. It helped a bit but it didn't put us back to a steady state where we're processing runs fast again. We've been upgrading and changing settings on various parts of the infrastructure since to improve throughput but not overload the database.

    Identified May 9 at 9:00 PM (20 hours earlier)

    A large increase in usage has caused a big spike in database load and a backlog of runs that we are working to rectify. Service will be degraded, especially the dashboard, while we fix things.

    Apr 29, 2024
    2 years ago
    v3 Runs and deploys experiencing increased queue times
    Dashboard & API
    Resolved April 29 at 3:45 PM (in 1 hour)

    Runs and deploys are back to processing normally. It was an issue with Docker images not being pulled from the cache.

    Investigating April 29 at 2:44 PM (1 hour earlier)

    Runs are queued for longer than normal and deploys are taking a long time. We believe this is because there was an infinite checkpoint restore which caused us to hit a Docker registry rate limit.

    Apr 2, 2024
    2 years ago
    API requests getting blocked by the rate limiter: webhooks, notifications, HTTP endpoints
    Dashboard & API
    Identified April 2 at 4:20 PM

    The new rate limiter was blocking:

      Apr 2, 2024
      2 years ago
      Dashboard servers failing and not rebooting correctly.
      Dashboard & API
      Resolved April 2 at 12:12 PM (in 12 minutes)

      This issue has been fixed.

      Identified April 2 at 12:00 PM (12 minutes earlier)

      A web socket issue was causing the dashboard to crash and the new servers not to come up. This only impacts the dashboard, not the job runners.

      Mar 12, 2024
      2 years ago
      Degraded database performance
      Dashboard & API
      Resolved March 12 at 11:39 AM (in 2 hours)

      Our database has recovered and runs are now processing normally again.

      Identified March 12 at 9:30 AM (2 hours earlier)

      A very large spike in runs caused high CPU usage on the database. This made the dashboard very slow to load and reduced the speed at which runs were being processed.

      We have fixed the issue but it will take a couple of hours to get back to normal run processing times.

      Nov 27, 2023
      2 years ago
      API and Dashboard degradation
      Dashboard & API
      Resolved November 27 at 4:52 PM (in 12 minutes)

      The issue has been mitigated and DB CPU % has come down to normal levels.

      Identified November 27 at 4:41 PM (12 minutes earlier)

      We are encountering an issue that is causing our database CPU to spike to 100% which is causing downstream issues with our API and Dashboard. We're working on a fix currently and hope to be back to full health shortly.
