fix(actions): reduce DB load from many polling runners and fix stalled task pickup by bircni · Pull Request #38150 · go-gitea/gitea

bircni · 2026-06-17T16:12:26Z

Problem

On instances with many global/instance-scoped Actions runners (we saw it with 100+ runners against a 32-core server backed by MariaDB), the server sits at ~100% CPU with heavy SQL traffic whenever all runners are active, and a single bad job can stop Actions entirely. Three issues compound:

A write on every poll. Each FetchTask poll unconditionally runs UPDATE action_runner SET last_online=? (interceptor.go), regardless of whether there's any work — so N runners × poll-rate = a constant stream of indexed writes.
A global-version thundering herd over an unbounded scan. IncreaseTaskVersion always bumps the global scope (0,0) on every job state change anywhere in the instance. Since global runners all read that scope, a single queued job invalidates all of them, and on their next poll they all enter the full PickTask → CreateTaskForRunner transaction at once. Each one scans every waiting job (unbounded), matches labels in memory, and only one wins each job via the optimistic task_id=0 update — the other 99 do wasted work. At high throughput the version-skip optimization barely fires.
One unpreparable job stalls everything (#37586). Assignment picks the oldest waiting job and returns an error if it can't be prepared — e.g. its run was deleted out from under it, or its payload won't parse. Because that job stays at the head of the queue, every runner's poll hits it, fails, and never advances. Actions stop instance-wide until the job is manually cleared.
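
The wasted work in the second item can be illustrated with a minimal sketch. It assumes the optimistic task_id=0 update behaves like a compare-and-swap; `claimJob` and `raceForJob` are hypothetical names for this illustration, not Gitea functions.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// claimJob mimics "UPDATE ... SET task_id=? WHERE task_id=0":
// it succeeds only for the first caller that still sees task_id == 0.
func claimJob(taskID *int64, newTaskID int64) bool {
	return atomic.CompareAndSwapInt64(taskID, 0, newTaskID)
}

// raceForJob lets `runners` goroutines compete for a single job and
// returns how many of them won the claim.
func raceForJob(runners int) int {
	var taskID int64 // 0 means "unassigned"
	var winners int64
	var wg sync.WaitGroup
	for i := 1; i <= runners; i++ {
		wg.Add(1)
		go func(id int64) {
			defer wg.Done()
			if claimJob(&taskID, id) {
				atomic.AddInt64(&winners, 1)
			}
		}(int64(i))
	}
	wg.Wait()
	return int(winners)
}

func main() {
	won := raceForJob(100)
	fmt.Printf("winners=%d wasted=%d\n", won, 100-won) // winners=1 wasted=99
}
```

In the real server each losing attempt is a full transaction rather than a single atomic instruction, which is why the losers matter.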

Changes

Debounce last_online. New RunnerHeartbeatInterval (30s) + ShouldPersistLastOnline helper; the interceptor persists last_online only when it's stale enough to matter, skipping the UpdateRunner call entirely otherwise. last_active is unchanged (still written on UpdateTask/UpdateLog). Safe because the offline threshold is 60s.
Throttle concurrent task assignments. New MAX_CONCURRENT_TASK_PICKS setting (default 16) backs a process-wide non-blocking semaphore via TryPickTask. When the limit is hit, the pick is skipped and — crucially — the response echoes the runner's request tasks version instead of latestVersion, so the runner retries on its next poll rather than sleeping until the next bump (which would stall the backlog).
Match labels in SQL instead of scanning in memory. Runner labels are now matched against a normalized action_run_job_label table (one row per runs_on label) via a correlated NOT EXISTS, so CreateTaskForRunner selects the oldest matchable waiting job directly (LIMIT 1) regardless of backlog size. This removes the unbounded scan + in-memory filtering entirely — without the head-of-line starvation a simple row cap would introduce. The table is backfilled for waiting/blocked jobs and kept in sync through a single InsertActionRunJob entry point.
Add a composite (status, updated) index so the "oldest waiting job" lookup is an index seek rather than a sort of the whole waiting backlog.
Make the pick resilient (fixes gitea actions stopped working #37586). A job that can't be prepared is marked failed and skipped (bounded per poll) so the next candidate is tried, instead of erroring out and stalling every poll. The failure marking uses a direct row update because the normal status path aggregates the run, which may no longer exist.

Behavior / compatibility

No runner-visible semantic change: offline/idle detection, oldest-first ordering, and label-match semantics (a runner must cover all of a job's labels) are preserved — label matching just moves from memory into SQL.
MAX_CONCURRENT_TASK_PICKS defaults to 16; lower it to protect a small DB, raise it for very high job throughput.
One migration adds the action_run_job_label table (backfilled) and the (status, updated) index.
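
The preserved label-match semantics can be stated as "no label of the job is missing from the runner", which is the shape a correlated NOT EXISTS expresses in SQL. The in-memory sketch below shows the same rule; the `job` type and both function names are hypothetical, and real runner labels may carry extra structure that plain string comparison ignores.

```go
package main

import "fmt"

// job is a hypothetical stand-in for a waiting job and its runs_on labels.
type job struct {
	ID     int64
	RunsOn []string
}

// coversAll reports whether no label in runsOn is missing from the runner's set.
func coversAll(runner map[string]struct{}, runsOn []string) bool {
	for _, l := range runsOn {
		if _, ok := runner[l]; !ok {
			return false
		}
	}
	return true
}

// oldestMatchable returns the first job the runner covers.
// jobs must already be ordered oldest first.
func oldestMatchable(jobs []job, runnerLabels []string) (job, bool) {
	set := make(map[string]struct{}, len(runnerLabels))
	for _, l := range runnerLabels {
		set[l] = struct{}{}
	}
	for _, j := range jobs {
		if coversAll(set, j.RunsOn) {
			return j, true
		}
	}
	return job{}, false
}

func main() {
	jobs := []job{
		{ID: 1, RunsOn: []string{"linux", "gpu"}},
		{ID: 2, RunsOn: []string{"linux"}},
		{ID: 3, RunsOn: []string{"linux", "x64"}},
	}
	j, ok := oldestMatchable(jobs, []string{"linux", "x64"})
	fmt.Println(j.ID, ok) // 2 true
}
```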

Under 100+ global runners, every FetchTask poll wrote action_runner.last_online and, on each global tasks-version bump, all runners simultaneously ran the full task-assignment transaction (scanning every waiting job). This pegged the DB.

- Debounce last_online: persist at most every RunnerHeartbeatInterval (30s) instead of on every poll, well under the 60s offline threshold.
- Throttle concurrent task assignments via MAX_CONCURRENT_TASK_PICKS (default 16); throttled runners echo their request tasks version so they retry next poll instead of advancing and sleeping until the next bump.
- Bound the waiting-jobs scan to 100 rows so a backlog can't grow each poll's cost.

Assisted-by: Claude:claude-opus-4-8
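
The throttling and version echo described above might be sketched as follows. The `picker` type, `tryPick`, and `responseVersion` are hypothetical names; only the idea of a non-blocking semaphore and of echoing the request version when throttled comes from the PR text.

```go
package main

import "fmt"

// picker holds a buffered channel used as a counting semaphore.
type picker struct{ sem chan struct{} }

func newPicker(maxConcurrent int) *picker {
	return &picker{sem: make(chan struct{}, maxConcurrent)}
}

// tryPick runs pick if a slot is free and reports throttled=true otherwise.
// It never blocks.
func (p *picker) tryPick(pick func()) (throttled bool) {
	select {
	case p.sem <- struct{}{}:
		defer func() { <-p.sem }()
		pick()
		return false
	default:
		return true
	}
}

// responseVersion chooses the tasks version sent back to the runner.
// A throttled runner gets its own version back so it retries on the next poll.
func responseVersion(throttled bool, requestVersion, latestVersion int64) int64 {
	if throttled {
		return requestVersion
	}
	return latestVersion
}

func main() {
	p := newPicker(1)
	var inner bool
	// The nested call runs while the only slot is held, so it is throttled.
	outer := p.tryPick(func() { inner = p.tryPick(func() {}) })
	fmt.Println(outer, inner)                // false true
	fmt.Println(responseVersion(inner, 7, 9)) // 7
}
```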

wxiaoguang · 2026-06-18T00:55:43Z

The existing indices/SQLs/logic are wrong

Improve the performance for runner of picking up task #37163 (review)
related: gitea actions stopped working #37586

But they just don't want to or don't know how to fix it.

wxiaoguang · 2026-06-18T01:12:07Z

+	// waiting jobs (which would feed back into more load). Oldest-first ordering means
+	// a label-matchable job behind maxWaitingJobsScan non-matching older jobs may be
+	// skipped this round, but it will be reconsidered on the next poll.
+	if err := e.Where("task_id=? AND status=? AND is_reusable_caller=?", 0, StatusWaiting, false).And(jobCond).Asc("updated", "id").Limit(maxWaitingJobsScan).Find(&jobs); err != nil {


Is it right? I don't understand the details of the Actions runner, but the code below checks the jobs against the labels.

What if the first 100 jobs don't match but the 101 job matches? And the first 100 jobs just get stuck?

// TODO: a more efficient way to filter labels
var job *ActionRunJob
log.Trace("runner labels: %v", runner.AgentLabels)
for _, v := range jobs {
	if runner.CanMatchLabels(v.RunsOn) {
		job = v
		break
	}
}
if job == nil {
	return nil, false, nil
}

this needs a better fix - working on it
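
The concern raised in this thread can be reproduced with a small sketch, assuming the waiting jobs stay in the queue between polls. `waitingJob` and `pickWithCap` are hypothetical and use a single label per job for brevity.

```go
package main

import "fmt"

// waitingJob is a hypothetical waiting job with one runs_on label.
type waitingJob struct {
	ID    int
	Label string
}

// pickWithCap inspects only the first `limit` jobs (oldest first) and returns
// the first one the runner matches. limit <= 0 means "no cap".
func pickWithCap(jobs []waitingJob, limit int, runnerLabel string) (int, bool) {
	for i, j := range jobs {
		if limit > 0 && i >= limit {
			break
		}
		if j.Label == runnerLabel {
			return j.ID, true
		}
	}
	return 0, false
}

func main() {
	// 100 older jobs that only a "windows" runner could take, then one "linux" job.
	jobs := make([]waitingJob, 0, 101)
	for i := 1; i <= 100; i++ {
		jobs = append(jobs, waitingJob{ID: i, Label: "windows"})
	}
	jobs = append(jobs, waitingJob{ID: 101, Label: "linux"})

	_, capped := pickWithCap(jobs, 100, "linux")
	id, uncapped := pickWithCap(jobs, 0, "linux")
	fmt.Println(capped)       // false: job 101 is never seen
	fmt.Println(id, uncapped) // 101 true
}
```

As long as the 100 older jobs remain waiting, every poll from the linux runner sees the same window and job 101 is never reached.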

CreateTaskForRunner picked the oldest waiting job and returned an error when it could not be prepared (its run was deleted - go-gitea#37586 - or its payload won't parse). Since that job stays at the head of the queue, every runner poll hit the same job and failed, stalling all Actions.

An unpreparable job is now marked failed and skipped so the next candidate is tried, bounded per poll. The marking uses a direct row update because the normal status path aggregates the run, which no longer exists.

Label matching is also moved into SQL via a normalized action_run_job_label table (kept in sync through a single InsertActionRunJob entry point), so the assignment query selects the oldest matchable job directly instead of scanning and filtering in memory. A composite (status, updated) index turns the "oldest waiting job" lookup into an index seek rather than a backlog sort.
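
The "mark failed and skip, bounded per poll" behavior in this commit message might look like the sketch below. Every name here is hypothetical, including the bound of 3; the commit message does not state the actual limit.

```go
package main

import (
	"errors"
	"fmt"
)

// candidate is a hypothetical waiting job; Broken simulates a deleted run
// or an unparsable payload.
type candidate struct {
	ID     int64
	Broken bool
	Failed bool
}

var errUnpreparable = errors.New("job cannot be prepared")

func prepare(c *candidate) error {
	if c.Broken {
		return errUnpreparable
	}
	return nil
}

// maxSkipsPerPoll is an assumed bound, not a value from the PR.
const maxSkipsPerPoll = 3

// pickResilient walks the queue oldest first. A job that cannot be prepared is
// marked failed and skipped instead of aborting the poll.
func pickResilient(queue []*candidate) (*candidate, int) {
	skipped := 0
	for _, c := range queue {
		if c.Failed {
			continue
		}
		if err := prepare(c); err != nil {
			c.Failed = true
			skipped++
			if skipped >= maxSkipsPerPoll {
				return nil, skipped // give up for this poll only
			}
			continue
		}
		return c, skipped
	}
	return nil, skipped
}

func main() {
	queue := []*candidate{{ID: 1, Broken: true}, {ID: 2, Broken: true}, {ID: 3}}
	picked, skipped := pickResilient(queue)
	fmt.Println(picked.ID, skipped) // 3 2
}
```

Because failed jobs leave the waiting set, a later poll no longer meets them at the head of the queue.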

bircni · 2026-06-20T14:39:42Z

@wxiaoguang could you check again please?

wxiaoguang · 2026-06-21T01:24:14Z

+		defer func() { <-sem }()
+	default:
+		return nil, false, true, nil
+	}


It needs to be "cancelable" if ctx is canceled.

wxiaoguang · 2026-06-21T01:25:19Z

> @wxiaoguang could you check again please?

Better to ask real Actions users to check.

GiteaBot added the lgtm/need 2 label ("This PR needs two approvals by maintainers to be considered for merging.") Jun 17, 2026

bircni requested a review from Zettat123 June 17, 2026 16:12

github-actions Bot added the docs-update-needed label ("The document needs to be updated synchronously") Jun 17, 2026

bircni marked this pull request as draft June 17, 2026 18:32

wxiaoguang reviewed Jun 18, 2026

more adjustments

c1593e2

bircni removed the request for review from Zettat123 June 19, 2026 20:28

bircni changed the title ~~perf(actions): reduce DB load from many concurrently polling runners~~ perf(actions): reduce DB load from many polling runners and fix stalled task pickup Jun 19, 2026

bircni added the topic/gitea-actions label ("related to the actions of Gitea") Jun 19, 2026

bircni added 2 commits June 20, 2026 00:03

move to 340

d4f9c04

Merge remote-tracking branch 'origin/main' into perf/actions-runner-p…

7c0c9a8

…oll-load

bircni changed the title ~~perf(actions): reduce DB load from many polling runners and fix stalled task pickup~~ fix(actions): reduce DB load from many polling runners and fix stalled task pickup Jun 20, 2026

bircni added and removed the docs-update-needed label ("The document needs to be updated synchronously") Jun 20, 2026

bircni added 2 commits June 20, 2026 15:53

cleanup

a33737f

fixes

a50ce4c

wxiaoguang reviewed Jun 21, 2026

bircni added 2 commits June 21, 2026 10:30

test(migrations): fix v340 test identity insert on mssql

a347eb6

Seed action_run_job without explicit ids so xorm lets the identity column assign them; MSSQL rejects explicit identity inserts. Read the assigned ids back to key the expected labels. Assisted-by: Claude:claude-opus-4-8

fixes

c145514

bircni wants to merge 9 commits into go-gitea:main from bircni:perf/actions-runner-poll-load

3 participants
