
Fix infinite retry loop when runner registration is not found#4330

Open
bkw wants to merge 3 commits into actions:main from bkw:fix/non-retryable-registration-not-found

Conversation

@bkw bkw commented Apr 9, 2026

Summary

Fixes #4191

Two related fixes for ephemeral runners that end up registered with GitHub but unable to pick up jobs, blocking scale set capacity.

Fix 1: Treat deleted runner registration as non-retryable error

When a runner's registration is garbage-collected by GitHub, the OAuth token exchange fails with invalid_client ("Registration was not found"). The runner enters an infinite retry loop because VssOAuthTokenRequestException is not classified as non-retryable in either retry layer. This leaves the pod in Running status indefinitely — the ARC controller counts it as a live runner and never creates a replacement.

Changes:

  • BrokerServer.ShouldRetryException() — stops the inner 5-retry loop in RetryRequest() immediately
  • BrokerMessageListener.IsGetNextMessageExceptionRetriable() — causes GetNextMessageAsync() to throw NonRetryableException, so Program.cs returns TerminatedError

This matches the existing precedent in CreateSessionAsync() (line 188), which already treats invalid_client as terminal.
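The classification change can be sketched as follows. This is a hedged Python model of the decision logic only, not the actual C# runner code; the exception class and function names are hypothetical stand-ins for VssOAuthTokenRequestException and ShouldRetryException():

```python
# Illustrative model: an OAuth token-request failure with error
# "invalid_client" is terminal, while other errors remain retryable.
class VssOAuthTokenRequestError(Exception):
    """Stand-in for the C# VssOAuthTokenRequestException."""
    def __init__(self, error, message):
        super().__init__(message)
        self.error = error

def should_retry_exception(ex):
    """Return False for errors that retrying can never fix."""
    if isinstance(ex, VssOAuthTokenRequestError) and ex.error == "invalid_client":
        # The registration was garbage-collected; every further token
        # exchange will fail identically, so stop retrying immediately.
        return False
    # Transient failures (network errors, other OAuth errors) still retry.
    return True
```

Both retry layers consult the same kind of predicate, which is why the fix touches BrokerServer and BrokerMessageListener together.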

Fix 2: Add idle timeout for ephemeral runners waiting for jobs

Ephemeral runners created via JIT config can end up registered with GitHub but never receive a job (e.g., due to pod recreation causing session conflicts, or the original job being reassigned to another runner). These orphaned runners sit idle indefinitely, blocking a slot in the ARC scale set.

Changes:

  • Runner.RunAsync() — adds a configurable idle timeout (default: 10 minutes) that exits the runner if no job is received after session creation. Uses the existing Task.WhenAny pattern already used for auto-update and run-once job completion.
  • Configurable via ACTIONS_RUNNER_IDLE_TIMEOUT_MINUTES env var (set to 0 to disable). Only applies to ephemeral/run-once runners.
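Assuming a Task.WhenAny-style race between the message wait and an idle timer, the flow might look like this Python asyncio sketch, where asyncio.wait plays the role of Task.WhenAny. All names here are hypothetical stand-ins for the C# implementation in Runner.RunAsync():

```python
import asyncio
import os

TERMINATED_ERROR = 1  # exit code the orchestrator treats as non-retryable

async def run_listener(get_next_message, is_ephemeral):
    """Race the next-message wait against an optional idle timeout."""
    minutes = int(os.environ.get("ACTIONS_RUNNER_IDLE_TIMEOUT_MINUTES", "10"))
    message_task = asyncio.ensure_future(get_next_message())
    waits = {message_task}
    idle_task = None
    if is_ephemeral and minutes > 0:  # 0 disables; non-ephemeral never times out
        idle_task = asyncio.ensure_future(asyncio.sleep(minutes * 60))
        waits.add(idle_task)
    done, pending = await asyncio.wait(waits, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()
    if idle_task is not None and idle_task in done:
        # No job arrived within the window: exit so a replacement is created.
        return TERMINATED_ERROR
    return await message_task  # a message arrived; proceed normally
```

Once a job message wins the race, the idle timer is cancelled, so runners that do receive work are unaffected.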

How we found this

We traced a production incident where jobs targeting an ARC scale set were stuck for 9+ minutes. Cloud Logging revealed:

  1. An ephemeral runner (jgjcv) had pod stability issues at startup — the original pod disappeared and was recreated
  2. The replacement pod hit "A session for this runner already exists" conflicts for ~2 minutes before connecting
  3. By then, the job was assigned elsewhere — the runner sat idle "Listening for Jobs" from 09:23 to 10:14 (51 minutes)
  4. At 10:14, GitHub garbage-collected the idle registration → "Registration was not found"
  5. The runner entered the infinite retry loop (fix 1)
  6. The ARC controller saw running=1, desired=1 and never created a replacement (fix 2 would have caught this at 09:33)

Test plan

  • Build succeeds (dotnet build Runner.Listener.csproj)
  • L0 test: GetNextMessage_ThrowsNonRetryableOnInvalidClientOAuth — verifies NonRetryableException without retrying
  • L0 test: ShouldRetryException_ReturnsFalseForInvalidClientOAuth — verifies invalid_client is non-retryable
  • L0 test: ShouldRetryException_ReturnsTrueForOtherOAuthErrors — verifies other OAuth errors still retry
  • Verify idle timeout exits runner after configured period when no job is received
  • Verify idle timeout does not affect runners that receive a job within the timeout
  • Verify ACTIONS_RUNNER_IDLE_TIMEOUT_MINUTES=0 disables the timeout

When a runner's registration is garbage-collected by GitHub (OAuth error
"invalid_client", message "Registration was not found"), the runner
enters an infinite retry loop instead of exiting. This leaves the pod
in Running status indefinitely, blocking the ARC controller from
creating a replacement runner and causing jobs to queue with no runner
to pick them up.

The fix adds VssOAuthTokenRequestException with Error=="invalid_client"
to the non-retryable exception lists in both BrokerServer.ShouldRetryException()
and BrokerMessageListener.IsGetNextMessageExceptionRetriable(), matching
the existing precedent in CreateSessionAsync() which already treats this
error as terminal.

With this fix, the runner exits immediately with TerminatedError on the
first "Registration was not found" error, allowing the ARC controller to
detect the exit and create a fresh replacement runner.

Fixes actions#4191
@bkw bkw requested a review from a team as a code owner April 9, 2026 11:56
Copilot AI review requested due to automatic review settings April 9, 2026 11:56
Contributor

Copilot AI left a comment


Pull request overview

Fixes an infinite retry loop when the runner’s server-side registration has been garbage-collected and the OAuth token exchange fails with invalid_client / “Registration was not found”, allowing the runner to terminate so orchestration can replace it.

Changes:

  • Classify VssOAuthTokenRequestException with Error == "invalid_client" as non-retryable in the listener’s GetNextMessage retry logic.
  • Classify the same exception as non-retryable in BrokerServer.ShouldRetryException() to stop the inner RetryRequest() loop immediately.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Changed files:

  • src/Runner.Listener/BrokerMessageListener.cs: marks OAuth invalid_client as non-retryable in GetNextMessage exception classification.
  • src/Runner.Common/BrokerServer.cs: stops RetryRequest() retries for OAuth invalid_client by updating ShouldRetryException().

- GetNextMessage_ThrowsNonRetryableOnInvalidClientOAuth: verifies that
  VssOAuthTokenRequestException with Error="invalid_client" causes
  GetNextMessageAsync to throw NonRetryableException without retrying
- ShouldRetryException_ReturnsFalseForInvalidClientOAuth: verifies
  BrokerServer.ShouldRetryException returns false for invalid_client
- ShouldRetryException_ReturnsTrueForOtherOAuthErrors: verifies other
  OAuth errors are still retried normally
@bkw
Author

bkw commented Apr 9, 2026

Pushed tests in 4f6c72e addressing comments 2 and 3. Replying inline on comment 1:

Re: auth migration mode — Good observation. The !AllowAuthMigration guard is an existing pattern that gates all non-retryable checks in GetNextMessageAsync() (see the same guard on line 341 for AccessDeniedException, RunnerNotFoundException, etc.). During auth migration, the runner defers migration and retries with refreshed credentials, which is the correct behavior for that mode. Changing the auth-migration behavior would be a broader design decision beyond the scope of this fix. The primary impact of #4191 is on ARC ephemeral runners which do not use auth migration.
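The guard pattern described here can be sketched as follows (illustrative Python with hypothetical names; the actual code checks AllowAuthMigration in C#):

```python
def is_exception_retriable(ex, allow_auth_migration):
    """Non-retryable classification only applies outside auth migration."""
    is_invalid_client = getattr(ex, "error", None) == "invalid_client"
    if not allow_auth_migration and is_invalid_client:
        # Outside auth migration, a dead registration cannot recover:
        # surface a terminal error instead of retrying.
        return False
    # During auth migration the runner defers migration and retries
    # with refreshed credentials; everything else retries as before.
    return True
```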

Re: tests — Added three L0 tests:

  • GetNextMessage_ThrowsNonRetryableOnInvalidClientOAuth — verifies GetNextMessageAsync throws NonRetryableException without retrying when it gets invalid_client
  • ShouldRetryException_ReturnsFalseForInvalidClientOAuth — verifies BrokerServer.ShouldRetryException returns false for invalid_client
  • ShouldRetryException_ReturnsTrueForOtherOAuthErrors — verifies other OAuth errors still retry normally

Ephemeral runners created via JIT config can end up registered with
GitHub but never receive a job (e.g., due to pod recreation causing
session conflicts, or the original job being reassigned). These orphaned
runners sit idle indefinitely, blocking a slot in the ARC scale set
and preventing replacement runners from being created.

This adds a configurable idle timeout (default: 10 minutes) that exits
the runner if no job is received after session creation. The timeout
only applies to ephemeral/run-once runners and is disabled once a job
is received.

Configurable via ACTIONS_RUNNER_IDLE_TIMEOUT_MINUTES env var:
- Default: 10 (minutes) for ephemeral runners
- Set to 0 to disable
- Has no effect on non-ephemeral runners

Uses the existing Task.WhenAny pattern already used for auto-update
and run-once job completion checks. Returns TerminatedError (exit
code 1) so the ARC controller treats it as a non-retryable exit and
creates a fresh replacement.


Development

Successfully merging this pull request may close these issues.

Runner gets stuck in infinite loop of retries when registration is not found