fix(actions): reduce DB load from many polling runners and fix stalled task pickup #38150
Draft
bircni wants to merge 9 commits into
Under 100+ global runners, every FetchTask poll wrote action_runner.last_online and, on each global tasks-version bump, all runners simultaneously ran the full task-assignment transaction (scanning every waiting job). This pegged the DB.

- Debounce last_online: persist at most every RunnerHeartbeatInterval (30s) instead of on every poll, well under the 60s offline threshold.
- Throttle concurrent task assignments via MAX_CONCURRENT_TASK_PICKS (default 16); throttled runners echo their request tasks version so they retry next poll instead of advancing and sleeping until the next bump.
- Bound the waiting-jobs scan to 100 rows so a backlog can't grow each poll's cost.

Assisted-by: Claude:claude-opus-4-8
Contributor
The existing indices/SQLs/logic are wrong
But they just don't want to or don't know how to fix it.
wxiaoguang
reviewed
Jun 18, 2026
// waiting jobs (which would feed back into more load). Oldest-first ordering means
// a label-matchable job behind maxWaitingJobsScan non-matching older jobs may be
// skipped this round, but it will be reconsidered on the next poll.
if err := e.Where("task_id=? AND status=? AND is_reusable_caller=?", 0, StatusWaiting, false).And(jobCond).Asc("updated", "id").Limit(maxWaitingJobsScan).Find(&jobs); err != nil {
Contributor
Is it right? I don't understand the details of the Actions runner, but the code below checks the jobs against the labels.
What if the first 100 jobs don't match but the 101st job matches? And the first 100 jobs just get stuck?
// TODO: a more efficient way to filter labels
var job *ActionRunJob
log.Trace("runner labels: %v", runner.AgentLabels)
for _, v := range jobs {
	if runner.CanMatchLabels(v.RunsOn) {
		job = v
		break
	}
}
if job == nil {
	return nil, false, nil
}
Member
Author
this needs a better fix - working on it
CreateTaskForRunner picked the oldest waiting job and returned an error when it could not be prepared (its run was deleted - go-gitea#37586 - or its payload won't parse). Since that job stays at the head of the queue, every runner poll hit the same job and failed, stalling all Actions. An unpreparable job is now marked failed and skipped so the next candidate is tried, bounded per poll. The marking uses a direct row update because the normal status path aggregates the run, which no longer exists.

Label matching is also moved into SQL via a normalized action_run_job_label table (kept in sync through a single InsertActionRunJob entry point), so the assignment query selects the oldest matchable job directly instead of scanning and filtering in memory. A composite (status, updated) index turns the "oldest waiting job" lookup into an index seek rather than a backlog sort.
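To make the matching rule in this commit concrete, here is a minimal in-memory sketch of what a correlated NOT EXISTS over a per-label table expresses: a job is eligible when no label in its runs_on list is missing from the runner's labels, and the oldest eligible job wins. The `Job` type, the `OldestMatchable` helper, and the column names in the SQL comment are assumptions made for illustration, not code from the PR.

```go
package main

import "fmt"

// Job is a hypothetical, trimmed-down stand-in for a waiting action_run_job row.
type Job struct {
	ID      int64
	Updated int64    // Unix seconds
	RunsOn  []string // one action_run_job_label row per entry
}

// Assumed shape of the query (illustrative column names only):
//
//	SELECT j.* FROM action_run_job j
//	WHERE j.task_id = 0 AND j.status = ?
//	  AND NOT EXISTS (SELECT 1 FROM action_run_job_label l
//	                  WHERE l.job_id = j.id AND l.label NOT IN (<runner labels>))
//	ORDER BY j.updated ASC, j.id ASC LIMIT 1

// hasAll reports whether every runs_on label is present in the runner's set.
func hasAll(runsOn []string, have map[string]struct{}) bool {
	for _, label := range runsOn {
		if _, ok := have[label]; !ok {
			return false
		}
	}
	return true
}

// OldestMatchable returns the eligible job with the smallest (Updated, ID).
func OldestMatchable(jobs []Job, runnerLabels []string) (Job, bool) {
	have := make(map[string]struct{}, len(runnerLabels))
	for _, label := range runnerLabels {
		have[label] = struct{}{}
	}
	var best Job
	found := false
	for _, j := range jobs {
		if !hasAll(j.RunsOn, have) {
			continue
		}
		if !found || j.Updated < best.Updated || (j.Updated == best.Updated && j.ID < best.ID) {
			best, found = j, true
		}
	}
	return best, found
}

func main() {
	// 150 older jobs the runner cannot take, then one it can.
	var jobs []Job
	for i := int64(1); i <= 150; i++ {
		jobs = append(jobs, Job{ID: i, Updated: i, RunsOn: []string{"windows"}})
	}
	jobs = append(jobs, Job{ID: 151, Updated: 151, RunsOn: []string{"ubuntu-latest", "gpu"}})

	job, ok := OldestMatchable(jobs, []string{"ubuntu-latest", "gpu", "x64"})
	fmt.Println(job.ID, ok) // 151 true
}
```

Because eligibility is decided per job rather than over a capped prefix of the queue, the matching job is found no matter how many non-matching jobs sit ahead of it, which is the case raised in the review above.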
Member
Author
@wxiaoguang could you check again please?
wxiaoguang
reviewed
Jun 21, 2026
	defer func() { <-sem }()
default:
	return nil, false, true, nil
}
Contributor
It needs to be "cancelable" if ctx is canceled.
Contributor
Better to ask real Actions users to check.
Seed action_run_job without explicit ids so xorm lets the identity column assign them; MSSQL rejects explicit identity inserts. Read the assigned ids back to key the expected labels.

Assisted-by: Claude:claude-opus-4-8
Problem
On instances with many global/instance-scoped Actions runners (we saw it with 100+ runners against a 32-core server backed by MariaDB), the server sits at ~100% CPU with heavy SQL traffic whenever all runners are active, and a single bad job can stop Actions entirely. Three issues compound:
- A write on every poll. Each FetchTask poll unconditionally runs UPDATE action_runner SET last_online=? (interceptor.go), regardless of whether there's any work, so N runners × poll-rate = a constant stream of indexed writes.
- A global-version thundering herd over an unbounded scan. IncreaseTaskVersion always bumps the global scope (0,0) on every job state change anywhere in the instance. Since global runners all read that scope, a single queued job invalidates all of them, and on their next poll they all enter the full PickTask → CreateTaskForRunner transaction at once. Each one scans every waiting job (unbounded), matches labels in memory, and only one wins each job via the optimistic task_id=0 update; the other 99 do wasted work. At high throughput the version-skip optimization barely fires.
- One unpreparable job stalls everything (#37586). Assignment picks the oldest waiting job and returns an error if it can't be prepared, e.g. its run was deleted out from under it, or its payload won't parse. Because that job stays at the head of the queue, every runner's poll hits it, fails, and never advances. Actions stop instance-wide until the job is manually cleared.
Changes
- Debounce last_online. New RunnerHeartbeatInterval (30s) + ShouldPersistLastOnline helper; the interceptor persists last_online only when it's stale enough to matter, skipping the UpdateRunner call entirely otherwise. last_active is unchanged (still written on UpdateTask/UpdateLog). Safe because the offline threshold is 60s.
- Throttle concurrent task assignments. New MAX_CONCURRENT_TASK_PICKS setting (default 16) backs a process-wide non-blocking semaphore via TryPickTask. When the limit is hit, the pick is skipped and, crucially, the response echoes the runner's request tasks version instead of latestVersion, so the runner retries on its next poll rather than sleeping until the next bump (which would stall the backlog).
- Match labels in SQL instead of scanning in memory. Runner labels are now matched against a normalized action_run_job_label table (one row per runs_on label) via a correlated NOT EXISTS, so CreateTaskForRunner selects the oldest matchable waiting job directly (LIMIT 1) regardless of backlog size. This removes the unbounded scan + in-memory filtering entirely, without the head-of-line starvation a simple row cap would introduce. The table is backfilled for waiting/blocked jobs and kept in sync through a single InsertActionRunJob entry point.
- Add a composite (status, updated) index so the "oldest waiting job" lookup is an index seek rather than a sort of the whole waiting backlog.
- Make the pick resilient (fixes #37586, "gitea actions stopped working"). A job that can't be prepared is marked failed and skipped (bounded per poll) so the next candidate is tried, instead of erroring out and stalling every poll. The failure marking uses a direct row update because the normal status path aggregates the run, which may no longer exist.
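As a rough illustration of the last_online debounce, the sketch below writes the timestamp only once the stored value is at least one heartbeat interval old. The names RunnerHeartbeatInterval and ShouldPersistLastOnline come from this description, but the signature (Unix-second integers) and the simulated polling loop are assumptions, not the PR's code.

```go
package main

import (
	"fmt"
	"time"
)

// RunnerHeartbeatInterval is the minimum gap between last_online writes (30s,
// per the description). It stays well under the 60s offline threshold.
const RunnerHeartbeatInterval = 30 * time.Second

// ShouldPersistLastOnline reports whether the stored last_online value is old
// enough to be worth rewriting. Both arguments are Unix seconds; this
// signature is an assumption made for the sketch.
func ShouldPersistLastOnline(lastOnline, now int64) bool {
	return now-lastOnline >= int64(RunnerHeartbeatInterval/time.Second)
}

func main() {
	lastOnline := int64(1000000)
	start := lastOnline
	polls, writes := 0, 0
	// Simulate one runner polling every 2 seconds for 2 minutes.
	for now := start; now <= start+120; now += 2 {
		polls++
		if ShouldPersistLastOnline(lastOnline, now) {
			lastOnline = now // stands in for the UpdateRunner call
			writes++
		}
	}
	fmt.Printf("%d polls, %d writes\n", polls, writes) // 61 polls, 4 writes
}
```

In this simulation the stored value is never more than 30s behind the runner's latest poll, so a runner that is still polling does not cross the 60s offline threshold.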
Behavior / compatibility
- MAX_CONCURRENT_TASK_PICKS defaults to 16; lower it to protect a small DB, raise it for very high job throughput.
- Adds the action_run_job_label table (backfilled) and the (status, updated) index.
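A minimal sketch of how a limit such as MAX_CONCURRENT_TASK_PICKS could back a non-blocking semaphore, and of the version echo for throttled runners. The `PickThrottle` type, the `ResponseVersion` helper, the `simulate` function, and the `TryPickTask` signature are hypothetical; only the setting name, its default of 16, and the echo behavior come from the description. The up-front `ctx.Err()` check is one possible way to respect cancellation, as requested in review.

```go
package main

import (
	"context"
	"fmt"
)

// DefaultMaxConcurrentTaskPicks mirrors the MAX_CONCURRENT_TASK_PICKS default.
const DefaultMaxConcurrentTaskPicks = 16

// PickThrottle is a counting semaphore backed by a buffered channel.
// A process-wide throttle would be a single shared instance of it.
type PickThrottle struct{ slots chan struct{} }

func NewPickThrottle(limit int) *PickThrottle {
	return &PickThrottle{slots: make(chan struct{}, limit)}
}

// TryPickTask runs pick only if a slot is free and never waits for one.
// It returns throttled=true when every slot is busy. An already-canceled
// context is reported as an error rather than as throttling.
func (t *PickThrottle) TryPickTask(ctx context.Context, pick func(context.Context) error) (bool, error) {
	if err := ctx.Err(); err != nil {
		return false, err
	}
	select {
	case t.slots <- struct{}{}:
		defer func() { <-t.slots }()
		return false, pick(ctx)
	default:
		return true, nil
	}
}

// ResponseVersion chooses the tasks version sent back to the runner: a
// throttled runner gets its own request version back so it retries next poll.
func ResponseVersion(requestVersion, latestVersion int64, throttled bool) int64 {
	if throttled {
		return requestVersion
	}
	return latestVersion
}

// simulate occupies inFlight slots, then attempts one more pick for a runner
// that sent tasks version 41 while the server is at version 42.
func simulate(inFlight int) (throttled bool, version int64) {
	t := NewPickThrottle(DefaultMaxConcurrentTaskPicks)
	for i := 0; i < inFlight && i < DefaultMaxConcurrentTaskPicks; i++ {
		t.slots <- struct{}{}
	}
	throttled, _ = t.TryPickTask(context.Background(), func(context.Context) error { return nil })
	return throttled, ResponseVersion(41, 42, throttled)
}

func main() {
	throttled, version := simulate(15)
	fmt.Println(throttled, version) // false 42
	throttled, version = simulate(16)
	fmt.Println(throttled, version) // true 41
}
```

Lowering the limit shrinks the number of assignment transactions that can hit the database at once; raising it lets more runners pick in parallel at the cost of more concurrent queries.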