fix(actions): reduce DB load from many polling runners and fix stalled task pickup#38150

Draft
bircni wants to merge 9 commits into
go-gitea:mainfrom
bircni:perf/actions-runner-poll-load

Conversation

@bircni bircni commented Jun 17, 2026

Member

Problem

On instances with many global/instance-scoped Actions runners (we saw it with 100+ runners against a 32-core server backed by MariaDB), the server sits at ~100% CPU with heavy SQL traffic whenever all runners are active, and a single bad job can stop Actions entirely. Three issues compound:

  1. A write on every poll. Each FetchTask poll unconditionally runs UPDATE action_runner SET last_online=? (interceptor.go), regardless of whether there's any work — so N runners × poll-rate = a constant stream of indexed writes.

  2. A global-version thundering herd over an unbounded scan. IncreaseTaskVersion always bumps the global scope (0,0) on every job state change anywhere in the instance. Since global runners all read that scope, a single queued job invalidates all of them, and on their next poll they all enter the full PickTask → CreateTaskForRunner transaction at once. Each one scans every waiting job (unbounded), matches labels in memory, and only one wins each job via the optimistic task_id=0 update; with 100 runners, the other 99 do wasted work. At high throughput the version-skip optimization barely fires.

  3. One unpreparable job stalls everything (#37586). Assignment picks the oldest waiting job and returns an error if it can't be prepared — e.g. its run was deleted out from under it, or its payload won't parse. Because that job stays at the head of the queue, every runner's poll hits it, fails, and never advances. Actions stop instance-wide until the job is manually cleared.

Changes

  • Debounce last_online. New RunnerHeartbeatInterval (30s) + ShouldPersistLastOnline helper; the interceptor persists last_online only when it's stale enough to matter, skipping the UpdateRunner call entirely otherwise. last_active is unchanged (still written on UpdateTask/UpdateLog). Safe because the offline threshold is 60s.

  • Throttle concurrent task assignments. New MAX_CONCURRENT_TASK_PICKS setting (default 16) backs a process-wide non-blocking semaphore via TryPickTask. When the limit is hit, the pick is skipped and — crucially — the response echoes the runner's request tasks version instead of latestVersion, so the runner retries on its next poll rather than sleeping until the next bump (which would stall the backlog).

  • Match labels in SQL instead of scanning in memory. Runner labels are now matched against a normalized action_run_job_label table (one row per runs_on label) via a correlated NOT EXISTS, so CreateTaskForRunner selects the oldest matchable waiting job directly (LIMIT 1) regardless of backlog size. This removes the unbounded scan + in-memory filtering entirely — without the head-of-line starvation a simple row cap would introduce. The table is backfilled for waiting/blocked jobs and kept in sync through a single InsertActionRunJob entry point.

  • Add a composite (status, updated) index so the "oldest waiting job" lookup is an index seek rather than a sort of the whole waiting backlog.

  • Make the pick resilient (fixes #37586). A job that can't be prepared is marked failed and skipped (bounded per poll) so the next candidate is tried, instead of erroring out and stalling every poll. The failure marking uses a direct row update because the normal status path aggregates the run, which may no longer exist.

Behavior / compatibility

  • No runner-visible semantic change: offline/idle detection, oldest-first ordering, and label-match semantics (a runner must cover all of a job's labels) are preserved — label matching just moves from memory into SQL.
  • MAX_CONCURRENT_TASK_PICKS defaults to 16; lower it to protect a small DB, raise it for very high job throughput.
  • One migration adds the action_run_job_label table (backfilled) and the (status, updated) index.
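
A rough sketch of how the MAX_CONCURRENT_TASK_PICKS throttle and the version echo fit together. The helper names (tryAcquire, release, responseVersion) are hypothetical and only the idea, a non-blocking semaphore plus echoing the request version when throttled, is taken from the description.

```go
package main

// pickSem is a process-wide counting semaphore. Its capacity stands in for
// the MAX_CONCURRENT_TASK_PICKS setting (default 16 per the description).
var pickSem = make(chan struct{}, 16)

// tryAcquire takes a slot without blocking and reports whether it got one.
func tryAcquire(sem chan struct{}) bool {
	select {
	case sem <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot previously taken with tryAcquire.
func release(sem chan struct{}) { <-sem }

// responseVersion chooses the tasks version returned to the runner.
// A throttled poll echoes the runner's own version, so the runner does not
// consider itself up to date and tries again on its next poll.
func responseVersion(throttled bool, requestVersion, latestVersion int64) int64 {
	if throttled {
		return requestVersion
	}
	return latestVersion
}
```

If a throttled poll returned latestVersion instead, the runner would skip assignment until some later job change bumped the version again, which is the stall the description warns about.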

Under 100+ global runners, every FetchTask poll wrote action_runner.last_online
and, on each global tasks-version bump, all runners simultaneously ran the full
task-assignment transaction (scanning every waiting job). This pegged the DB.

- Debounce last_online: persist at most every RunnerHeartbeatInterval (30s)
  instead of on every poll, well under the 60s offline threshold.
- Throttle concurrent task assignments via MAX_CONCURRENT_TASK_PICKS (default 16);
  throttled runners echo their request tasks version so they retry next poll
  instead of advancing and sleeping until the next bump.
- Bound the waiting-jobs scan to 100 rows so a backlog can't grow each poll's cost.

Assisted-by: Claude:claude-opus-4-8
@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Jun 17, 2026
@bircni bircni requested a review from Zettat123 June 17, 2026 16:12
@github-actions github-actions Bot added the docs-update-needed The document needs to be updated synchronously label Jun 17, 2026
@bircni bircni marked this pull request as draft June 17, 2026 18:32
wxiaoguang commented Jun 18, 2026

Contributor

The existing indices/SQLs/logic are wrong

But they just don't want to or don't know how to fix it.

Comment thread models/actions/task.go Outdated
// waiting jobs (which would feed back into more load). Oldest-first ordering means
// a label-matchable job behind maxWaitingJobsScan non-matching older jobs may be
// skipped this round, but it will be reconsidered on the next poll.
if err := e.Where("task_id=? AND status=? AND is_reusable_caller=?", 0, StatusWaiting, false).And(jobCond).Asc("updated", "id").Limit(maxWaitingJobsScan).Find(&jobs); err != nil {

@wxiaoguang wxiaoguang Jun 18, 2026

Contributor

Is it right? I don't understand the details of the Actions runner, but the code below checks the jobs against the labels.

What if the first 100 jobs don't match but the 101st job matches? Do the first 100 jobs just stay stuck, so the matching job is never reached?

	// TODO: a more efficient way to filter labels
	var job *ActionRunJob
	log.Trace("runner labels: %v", runner.AgentLabels)
	for _, v := range jobs {
		if runner.CanMatchLabels(v.RunsOn) {
			job = v
			break
		}
	}
	if job == nil {
		return nil, false, nil
	}
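
The scenario in the question can be reproduced with a small model. This is a sketch under assumptions: coversAll is a simplified stand-in for runner.CanMatchLabels, and firstMatch models the capped scan with plain slices instead of a database query.

```go
package main

// coversAll reports whether the runner's labels include every label the job
// asks for. Simplified stand-in for runner.CanMatchLabels.
func coversAll(runnerLabels, runsOn []string) bool {
	have := make(map[string]struct{}, len(runnerLabels))
	for _, l := range runnerLabels {
		have[l] = struct{}{}
	}
	for _, l := range runsOn {
		if _, ok := have[l]; !ok {
			return false
		}
	}
	return true
}

// firstMatch returns the index of the first job among the oldest `limit`
// jobs that the runner can take, or -1 if none match. limit <= 0 means
// the scan is not capped.
func firstMatch(runnerLabels []string, jobs [][]string, limit int) int {
	if limit > 0 && limit < len(jobs) {
		jobs = jobs[:limit]
	}
	for i, runsOn := range jobs {
		if coversAll(runnerLabels, runsOn) {
			return i
		}
	}
	return -1
}
```

With 100 waiting jobs that need "windows" followed by one that needs "linux", a "linux" runner finds nothing when the scan is capped at 100 and finds the last job when it is uncapped. Unless some other runner drains the first 100 jobs, the next poll sees the same window, so the matching job is not reached.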

Member Author

this needs a better fix - working on it

@bircni bircni removed the request for review from Zettat123 June 19, 2026 20:28
CreateTaskForRunner picked the oldest waiting job and returned an error when it
could not be prepared (its run was deleted - go-gitea#37586 - or its payload won't
parse). Since that job stays at the head of the queue, every runner poll hit the
same job and failed, stalling all Actions.

An unpreparable job is now marked failed and skipped so the next candidate is
tried, bounded per poll. The marking uses a direct row update because the normal
status path aggregates the run, which no longer exists.

Label matching is also moved into SQL via a normalized action_run_job_label
table (kept in sync through a single InsertActionRunJob entry point), so the
assignment query selects the oldest matchable job directly instead of scanning
and filtering in memory. A composite (status, updated) index turns the
"oldest waiting job" lookup into an index seek rather than a backlog sort.
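
The "mark failed and skip, bounded per poll" behavior could look roughly like the sketch below. Everything here is illustrative: the job type, the prepare callback, and the maxSkipsPerPoll value are hypothetical, and the real code works against database rows rather than an in-memory slice.

```go
package main

import "errors"

// job is a minimal stand-in for a waiting job row.
type job struct {
	ID     int64
	Failed bool
}

// errUnpreparable models "run was deleted" or "payload won't parse".
var errUnpreparable = errors.New("job cannot be prepared")

// maxSkipsPerPoll is an assumed bound; the commit message only says the
// skipping is bounded per poll.
const maxSkipsPerPoll = 3

// pickJob walks candidates oldest-first. A candidate whose prepare step fails
// is marked failed and skipped so it cannot block the queue again. After
// maxSkipsPerPoll skips the poll gives up and returns nil.
func pickJob(candidates []*job, prepare func(*job) error) *job {
	skipped := 0
	for _, j := range candidates {
		if err := prepare(j); err != nil {
			j.Failed = true // direct mark, no aggregation over the run
			skipped++
			if skipped >= maxSkipsPerPoll {
				return nil
			}
			continue
		}
		return j
	}
	return nil
}
```

The bound keeps a single poll from spending unbounded time on broken jobs, while each skipped job leaves the waiting set, so later polls make progress.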
@bircni bircni changed the title from "perf(actions): reduce DB load from many concurrently polling runners" to "perf(actions): reduce DB load from many polling runners and fix stalled task pickup" Jun 19, 2026
@bircni bircni added the topic/gitea-actions related to the actions of Gitea label Jun 19, 2026
@bircni bircni changed the title from "perf(actions): reduce DB load from many polling runners and fix stalled task pickup" to "fix(actions): reduce DB load from many polling runners and fix stalled task pickup" Jun 20, 2026
@bircni bircni added docs-update-needed The document needs to be updated synchronously and removed docs-update-needed The document needs to be updated synchronously labels Jun 20, 2026
bircni commented Jun 20, 2026

Member Author

@wxiaoguang could you check again please?

Comment thread services/actions/task.go
defer func() { <-sem }()
default:
return nil, false, true, nil
}

Contributor

It needs to be "cancelable" if ctx is canceled.

@wxiaoguang

Contributor

> @wxiaoguang could you check again please?

Better to ask real Actions users to check.

bircni added 2 commits June 21, 2026 10:30
Seed action_run_job without explicit ids so xorm lets the identity
column assign them; MSSQL rejects explicit identity inserts. Read the
assigned ids back to key the expected labels.

Assisted-by: Claude:claude-opus-4-8

Labels

docs-update-needed The document needs to be updated synchronously lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. topic/gitea-actions related to the actions of Gitea

Development

Successfully merging this pull request may close these issues.

gitea actions stopped working

3 participants