Severe dashboard degradation and slower processing of runs

Started at May 09, 2024 21:00 (UTC)

Resolved
Dashboard & API

resolved

May 10, 2024 21:17 (UTC)

Queues and runs have been processing at good speeds now for several hours on v2 and v3.

Before we get into what happened, I want to emphasise how important reliability is to us. We have fallen well short of providing you all with a great experience and a reliable service. We're really sorry for the problems this has caused for you all.

All paying customers will get a full refund for the entirety of May.

What caused this?

This issue started with a huge number of queued events in v2. Our internal system that handles concurrency limits on v2 was slowing down the processing of those events, and it was also putting more load on the database behind the v2 queuing engine, Graphile Worker. Graphile Worker is powered by Postgres, so the extra database load slowed the queue down even further, which created a vicious cycle.

We've scaled many orders of magnitude in the past year, and none of our normal upgrades and tuning worked here. We upgraded servers and, most importantly, the database (twice). We tuned some parameters. But as the backlog grew, it became harder to recover because of the concurrency limits built into the v2 system. Ordinarily this limiter distributes v2 runs fairly and prevents very high load by smoothing out very spiky demand.
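To illustrate what "distributing runs fairly" can look like, here is a minimal sketch in Python. It is not the actual v2 limiter: the round-robin strategy, the function name `pick_next_runs`, and all parameters are assumptions made for illustration only.

```python
from collections import deque


def pick_next_runs(queues, running, limit_per_org, capacity):
    """Choose up to `capacity` runs, taking at most one per org per pass.

    queues        -- dict org_id -> deque of waiting run ids (hypothetical shape)
    running       -- dict org_id -> number of runs already executing
    limit_per_org -- assumed per-org concurrency limit
    """
    picked = []
    in_flight = dict(running)  # copy so the caller's counts are not mutated
    progress = True
    while len(picked) < capacity and progress:
        progress = False
        for org_id, waiting in queues.items():
            if len(picked) >= capacity:
                break
            # Skip orgs with nothing waiting or already at their limit.
            if waiting and in_flight.get(org_id, 0) < limit_per_org:
                picked.append(waiting.popleft())
                in_flight[org_id] = in_flight.get(org_id, 0) + 1
                progress = True
    return picked
```

In this sketch a very spiky org cannot starve a quiet one, because each pass takes at most one run per org and no org can exceed its limit. The trade-off is the one described above: when the backlog is already huge, the same limits cap how quickly it can drain.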

What was impacted?

  1. Queues got very long for v2 and processed slowly.
  2. Queues got long for v3. The queuing system for v3 is built on Redis, so that was fine, but the actual run data lives in Postgres, which couldn't be read because of the v2 issues. We also use Graphile Worker to trigger v3 scheduled tasks.
  3. The dashboard was very slow to load or was showing timeout errors (ironic I know).
  4. When we took the brakes off the v2 concurrency filter it caused a massive number of runs to happen very quickly. Mostly this was fine but in some cases this caused downstream issues in runs.
  5. When we took the brakes off the v2 concurrency filter it also meant in some cases v2 concurrency limits weren't respected.

What we've done (so far)

  • We upgraded some hardware. This means we can process more runs but it didn't help us escape the spiral.
  • We modified the v2 concurrency filter so it reschedules runs that are over the limit with a slight delay. Previously it was thrashing the database by repeatedly retrying the same runs, which could cause a huge load in edge cases like this.
  • We've upgraded Graphile Worker, the core of the v2 queuing system, to v0.16.6. This has a lot of performance improvements so we can cope with more load than before.
  • We have far better diagnostic tools than before.
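
The concurrency filter change described above can be sketched roughly as follows. This is a hypothetical illustration in Python, not the real v2 code: the `QueuedRun` type, the `process_tick` function, and the `RESCHEDULE_DELAY_S` value are all assumptions, and the actual delay is not stated here.

```python
import heapq
from dataclasses import dataclass, field

RESCHEDULE_DELAY_S = 5.0  # hypothetical "slight delay"; the real value is unknown


@dataclass(order=True)
class QueuedRun:
    run_at: float                        # earliest time the run may be picked up
    run_id: str = field(compare=False)
    org_id: str = field(compare=False)


def process_tick(queue, running, limits, now):
    """Pop every run that is due at `now`; start it or push it back with a delay.

    queue   -- heap of QueuedRun ordered by run_at
    running -- dict org_id -> number of runs currently executing
    limits  -- dict org_id -> concurrency limit
    Returns the ids of the runs started during this tick.
    """
    started = []
    deferred = []
    while queue and queue[0].run_at <= now:
        run = heapq.heappop(queue)
        if running.get(run.org_id, 0) < limits[run.org_id]:
            running[run.org_id] = running.get(run.org_id, 0) + 1
            started.append(run.run_id)
        else:
            # Over the limit: move the run into the future instead of leaving it
            # immediately eligible, so the next poll does not fetch it again.
            run.run_at = now + RESCHEDULE_DELAY_S
            deferred.append(run)
    for run in deferred:
        heapq.heappush(queue, run)
    return started
```

Without the delay, every poll would pull the same over-limit runs straight back out of the queue, which is the kind of repeated database work that can snowball when the backlog is large.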

Today we reached a new level of scale that exposed weaknesses in parts of the system that had been working well until now. Reliability is never finished, so work continues tomorrow.

monitoring

May 10, 2024 16:40 (UTC)

Runs have been back to normal queue times for a couple of hours on both v2 and v3. We're continuing to monitor this and are pushing some more changes which should provide better throughput.

v3 was only impacted because v2 runs had caused database load to spike.

Summary of the issue: We use Graphile Worker, which is a Postgres-based queuing system. This is the core of v2 and has scaled really well for us. Over the past 12 months we've been upgrading the database and the workers that pull jobs as demand has increased, and it's worked great.

Yesterday the performance of the database started getting worse. We're 95% confident this is because of the Graphile queue table. The queue has to be in the primary db (the one that handles writes), so the queue performance problems slowed down other queries too. That's why v2 and v3 runs have been slow to process. It's also why dashboard performance has been degraded for everyone: some pages in the dashboard use our read replica, but some have to use the primary.

We performed the usual upgrades (and more) but that didn't fully fix the problem. It helped a bit, but it didn't put us back in a steady state where we're processing runs fast again. Since then we've been upgrading and changing settings on various parts of the infrastructure to improve throughput without overloading the database.

identified

May 09, 2024 21:00 (UTC)

A large increase in usage has caused a big spike in database load and a backlog that we are working to clear. Service will be degraded, especially the dashboard, while we fix things.