
Fix infinite retry loop when runner registration is not found#4330

Open
bkw wants to merge 3 commits into actions:main from bkw:fix/non-retryable-registration-not-found

Conversation

@bkw bkw commented Apr 9, 2026

Summary

Fixes #4191

Two related fixes for ephemeral runners that end up registered with GitHub but unable to pick up jobs, blocking scale set capacity.

Fix 1: Treat deleted runner registration as non-retryable error

When a runner's registration is garbage-collected by GitHub, the OAuth token exchange fails with invalid_client ("Registration was not found"). The runner enters an infinite retry loop because VssOAuthTokenRequestException is not classified as non-retryable in either retry layer. This leaves the pod in Running status indefinitely — the ARC controller counts it as a live runner and never creates a replacement.

Changes:

  • BrokerServer.ShouldRetryException() — stops the inner 5-retry loop in RetryRequest() immediately
  • BrokerMessageListener.IsGetNextMessageExceptionRetriable() — causes GetNextMessageAsync() to throw NonRetryableException, so Program.cs returns TerminatedError

This matches the existing precedent in CreateSessionAsync() (line 188), which already treats invalid_client as terminal.
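The classification change can be sketched as follows. This is a hedged Python model of the decision logic only, not the actual C# runner code; the exception class and function names are hypothetical stand-ins for VssOAuthTokenRequestException and ShouldRetryException():

```python
# Illustrative model: an OAuth token-request failure with error
# "invalid_client" is terminal, while other errors remain retryable.
class VssOAuthTokenRequestError(Exception):
    """Stand-in for the C# VssOAuthTokenRequestException."""
    def __init__(self, error, message):
        super().__init__(message)
        self.error = error

def should_retry_exception(ex):
    """Return False for errors that retrying can never fix."""
    if isinstance(ex, VssOAuthTokenRequestError) and ex.error == "invalid_client":
        # The registration was garbage-collected; every further token
        # exchange will fail identically, so stop retrying immediately.
        return False
    # Transient failures (network errors, other OAuth errors) still retry.
    return True
```

Both retry layers consult the same kind of predicate, which is why the fix touches BrokerServer and BrokerMessageListener together.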

Fix 2: Add idle timeout for ephemeral runners waiting for jobs

Ephemeral runners created via JIT config can end up registered with GitHub but never receive a job (e.g., due to pod recreation causing session conflicts, or the original job being reassigned to another runner). These orphaned runners sit idle indefinitely, blocking a slot in the ARC scale set.

Changes:

  • Runner.RunAsync() — adds a configurable idle timeout (default: 10 minutes) that exits the runner if no job is received after session creation. Uses the existing Task.WhenAny pattern already used for auto-update and run-once job completion.
  • Configurable via ACTIONS_RUNNER_IDLE_TIMEOUT_MINUTES env var (set to 0 to disable). Only applies to ephemeral/run-once runners.
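Assuming a Task.WhenAny-style race between the message wait and an idle timer, the flow might look like this Python asyncio sketch, where asyncio.wait plays the role of Task.WhenAny. All names here are hypothetical stand-ins for the C# implementation in Runner.RunAsync():

```python
import asyncio
import os

TERMINATED_ERROR = 1  # exit code the orchestrator treats as non-retryable

async def run_listener(get_next_message, is_ephemeral):
    """Race the next-message wait against an optional idle timeout."""
    minutes = int(os.environ.get("ACTIONS_RUNNER_IDLE_TIMEOUT_MINUTES", "10"))
    message_task = asyncio.ensure_future(get_next_message())
    waits = {message_task}
    idle_task = None
    if is_ephemeral and minutes > 0:  # 0 disables; non-ephemeral never times out
        idle_task = asyncio.ensure_future(asyncio.sleep(minutes * 60))
        waits.add(idle_task)
    done, pending = await asyncio.wait(waits, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()
    if idle_task is not None and idle_task in done:
        # No job arrived within the window: exit so a replacement is created.
        return TERMINATED_ERROR
    return await message_task  # a message arrived; proceed normally
```

Once a job message wins the race, the idle timer is cancelled, so runners that do receive work are unaffected.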

How we found this

We traced a production incident where jobs targeting an ARC scale set were stuck for 9+ minutes. Cloud Logging revealed:

  1. An ephemeral runner (jgjcv) had pod stability issues at startup — the original pod disappeared and was recreated
  2. The replacement pod hit "A session for this runner already exists" conflicts for ~2 minutes before connecting
  3. By then, the job was assigned elsewhere — the runner sat idle "Listening for Jobs" from 09:23 to 10:14 (51 minutes)
  4. At 10:14, GitHub garbage-collected the idle registration → "Registration was not found"
  5. The runner entered the infinite retry loop (fix 1)
  6. The ARC controller saw running=1, desired=1 and never created a replacement (fix 2 would have caught this at 09:33)

Test plan

  • Build succeeds (dotnet build Runner.Listener.csproj)
  • L0 test: GetNextMessage_ThrowsNonRetryableOnInvalidClientOAuth — verifies NonRetryableException without retrying
  • L0 test: ShouldRetryException_ReturnsFalseForInvalidClientOAuth — verifies invalid_client is non-retryable
  • L0 test: ShouldRetryException_ReturnsTrueForOtherOAuthErrors — verifies other OAuth errors still retry
  • Verify idle timeout exits runner after configured period when no job is received
  • Verify idle timeout does not affect runners that receive a job within the timeout
  • Verify ACTIONS_RUNNER_IDLE_TIMEOUT_MINUTES=0 disables the timeout

When a runner's registration is garbage-collected by GitHub (OAuth error
"invalid_client", message "Registration was not found"), the runner
enters an infinite retry loop instead of exiting. This leaves the pod
in Running status indefinitely, blocking the ARC controller from
creating a replacement runner and causing jobs to queue with no runner
to pick them up.

The fix adds VssOAuthTokenRequestException with Error=="invalid_client"
to the non-retryable exception lists in both BrokerServer.ShouldRetryException()
and BrokerMessageListener.IsGetNextMessageExceptionRetriable(), matching
the existing precedent in CreateSessionAsync() which already treats this
error as terminal.

With this fix, the runner exits immediately with TerminatedError on the
first "Registration was not found" error, allowing the ARC controller to
detect the exit and create a fresh replacement runner.

Fixes actions#4191
@bkw bkw requested a review from a team as a code owner April 9, 2026 11:56
Copilot AI review requested due to automatic review settings April 9, 2026 11:56
Contributor

Copilot AI left a comment


Pull request overview

Fixes an infinite retry loop when the runner’s server-side registration has been garbage-collected and the OAuth token exchange fails with invalid_client / “Registration was not found”, allowing the runner to terminate so orchestration can replace it.

Changes:

  • Classify VssOAuthTokenRequestException with Error == "invalid_client" as non-retryable in the listener’s GetNextMessage retry logic.
  • Classify the same exception as non-retryable in BrokerServer.ShouldRetryException() to stop the inner RetryRequest() loop immediately.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Changed files:

  • src/Runner.Listener/BrokerMessageListener.cs: marks OAuth invalid_client as non-retryable in GetNextMessage exception classification.
  • src/Runner.Common/BrokerServer.cs: stops RetryRequest() retries for OAuth invalid_client by updating ShouldRetryException().

- GetNextMessage_ThrowsNonRetryableOnInvalidClientOAuth: verifies that
  VssOAuthTokenRequestException with Error="invalid_client" causes
  GetNextMessageAsync to throw NonRetryableException without retrying
- ShouldRetryException_ReturnsFalseForInvalidClientOAuth: verifies
  BrokerServer.ShouldRetryException returns false for invalid_client
- ShouldRetryException_ReturnsTrueForOtherOAuthErrors: verifies other
  OAuth errors are still retried normally
@bkw
Author

bkw commented Apr 9, 2026

Pushed tests in 4f6c72e addressing comments 2 and 3. Replying inline on comment 1:

Re: auth migration mode — Good observation. The !AllowAuthMigration guard is an existing pattern that gates all non-retryable checks in GetNextMessageAsync() (see the same guard on line 341 for AccessDeniedException, RunnerNotFoundException, etc.). During auth migration, the runner defers migration and retries with refreshed credentials, which is the correct behavior for that mode. Changing the auth-migration behavior would be a broader design decision beyond the scope of this fix. The primary impact of #4191 is on ARC ephemeral runners which do not use auth migration.
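The guard pattern described here can be sketched as follows (illustrative Python with hypothetical names; the actual code checks AllowAuthMigration in C#):

```python
def is_exception_retriable(ex, allow_auth_migration):
    """Non-retryable classification only applies outside auth migration."""
    is_invalid_client = getattr(ex, "error", None) == "invalid_client"
    if not allow_auth_migration and is_invalid_client:
        # Outside auth migration, a dead registration cannot recover:
        # surface a terminal error instead of retrying.
        return False
    # During auth migration the runner defers migration and retries
    # with refreshed credentials; everything else retries as before.
    return True
```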

Re: tests — Added three L0 tests:

  • GetNextMessage_ThrowsNonRetryableOnInvalidClientOAuth — verifies GetNextMessageAsync throws NonRetryableException without retrying when it gets invalid_client
  • ShouldRetryException_ReturnsFalseForInvalidClientOAuth — verifies BrokerServer.ShouldRetryException returns false for invalid_client
  • ShouldRetryException_ReturnsTrueForOtherOAuthErrors — verifies other OAuth errors still retry normally

Ephemeral runners created via JIT config can end up registered with
GitHub but never receive a job (e.g., due to pod recreation causing
session conflicts, or the original job being reassigned). These orphaned
runners sit idle indefinitely, blocking a slot in the ARC scale set
and preventing replacement runners from being created.

This adds a configurable idle timeout (default: 10 minutes) that exits
the runner if no job is received after session creation. The timeout
only applies to ephemeral/run-once runners and is disabled once a job
is received.

Configurable via ACTIONS_RUNNER_IDLE_TIMEOUT_MINUTES env var:
- Default: 10 (minutes) for ephemeral runners
- Set to 0 to disable
- Has no effect on non-ephemeral runners

Uses the existing Task.WhenAny pattern already used for auto-update
and run-once job completion checks. Returns TerminatedError (exit
code 1) so the ARC controller treats it as a non-retryable exit and
creates a fresh replacement.


Development

Successfully merging this pull request may close these issues.

Runner gets stuck in infinite loop of retries when registration is not found