Fix infinite retry loop when runner registration is not found #4330
bkw wants to merge 3 commits into actions:main
When a runner's registration is garbage-collected by GitHub (OAuth error "invalid_client", message "Registration was not found"), the runner enters an infinite retry loop instead of exiting. This leaves the pod in Running status indefinitely, blocking the ARC controller from creating a replacement runner and causing jobs to queue with no runner to pick them up.
The fix adds VssOAuthTokenRequestException with Error=="invalid_client" to the non-retryable exception lists in both BrokerServer.ShouldRetryException() and BrokerMessageListener.IsGetNextMessageExceptionRetriable(), matching the existing precedent in CreateSessionAsync() which already treats this error as terminal.
With this fix, the runner exits immediately with TerminatedError on the first "Registration was not found" error, allowing the ARC controller to detect the exit and create a fresh replacement runner.
Fixes actions#4191
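The classification described above can be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: `OAuthTokenRequestException` is a hypothetical stand-in for `VssOAuthTokenRequestException`, `RetryPolicy.ShouldRetry` is a hypothetical name, and the sketch assumes every other exception is retryable, which simplifies the real retry lists.

```csharp
using System;

// Hypothetical stand-in for the runner's OAuth exception type; only the
// Error property matters for this sketch.
public sealed class OAuthTokenRequestException : Exception
{
    public string Error { get; }

    public OAuthTokenRequestException(string error, string message)
        : base(message)
    {
        Error = error;
    }
}

public static class RetryPolicy
{
    // Returns false for errors that retrying cannot fix.
    // Assumption: everything else is treated as retryable here.
    public static bool ShouldRetry(Exception ex)
    {
        if (ex is OAuthTokenRequestException oauth && oauth.Error == "invalid_client")
        {
            // The registration no longer exists server-side, so another
            // attempt would fail the same way. Fail fast instead.
            return false;
        }

        return true;
    }
}

public static class Demo
{
    public static void Main()
    {
        var deleted = new OAuthTokenRequestException(
            "invalid_client", "Registration was not found");
        var other = new OAuthTokenRequestException(
            "temporarily_unavailable", "try again later");

        Console.WriteLine(RetryPolicy.ShouldRetry(deleted)); // False
        Console.WriteLine(RetryPolicy.ShouldRetry(other));   // True
    }
}
```

Keeping the check narrow (exact exception type plus exact error code) means other OAuth failures keep their existing retry behavior, which is what the third L0 test in this PR verifies.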
Pull request overview
Fixes an infinite retry loop when the runner’s server-side registration has been garbage-collected and the OAuth token exchange fails with invalid_client / “Registration was not found”, allowing the runner to terminate so orchestration can replace it.
Changes:
- Classify VssOAuthTokenRequestException with Error == "invalid_client" as non-retryable in the listener’s GetNextMessage retry logic.
- Classify the same exception as non-retryable in BrokerServer.ShouldRetryException() to stop the inner RetryRequest() loop immediately.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/Runner.Listener/BrokerMessageListener.cs | Marks OAuth invalid_client as non-retryable in GetNextMessage exception classification. |
| src/Runner.Common/BrokerServer.cs | Stops RetryRequest() retries for OAuth invalid_client by updating ShouldRetryException(). |
- GetNextMessage_ThrowsNonRetryableOnInvalidClientOAuth: verifies that VssOAuthTokenRequestException with Error="invalid_client" causes GetNextMessageAsync to throw NonRetryableException without retrying
- ShouldRetryException_ReturnsFalseForInvalidClientOAuth: verifies BrokerServer.ShouldRetryException returns false for invalid_client
- ShouldRetryException_ReturnsTrueForOtherOAuthErrors: verifies other OAuth errors are still retried normally
Pushed tests in 4f6c72e addressing comments 2 and 3. Replying inline on comment 1: Re: auth migration mode — Good observation. The Re: tests — Added three L0 tests:
Ephemeral runners created via JIT config can end up registered with GitHub but never receive a job (e.g., due to pod recreation causing session conflicts, or the original job being reassigned). These orphaned runners sit idle indefinitely, blocking a slot in the ARC scale set and preventing replacement runners from being created.
This adds a configurable idle timeout (default: 10 minutes) that exits the runner if no job is received after session creation. The timeout only applies to ephemeral/run-once runners and is disabled once a job is received.
Configurable via ACTIONS_RUNNER_IDLE_TIMEOUT_MINUTES env var:
- Default: 10 (minutes) for ephemeral runners
- Set to 0 to disable
- Has no effect on non-ephemeral runners
Uses the existing Task.WhenAny pattern already used for auto-update and run-once job completion checks. Returns TerminatedError (exit code 1) so the ARC controller treats it as a non-retryable exit and creates a fresh replacement.
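The timeout behavior described above could look roughly like this. It is a sketch under stated assumptions rather than the code in this PR: `IdleTimeoutSketch`, `ResolveIdleTimeout`, and `WaitForJobAsync` are hypothetical names, and falling back to the default for unparseable or negative values is an assumption the description does not cover.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class IdleTimeoutSketch
{
    // Resolves the idle timeout from the raw env var value.
    // Returns null when no timeout applies (non-ephemeral runner, or 0).
    // Assumption: unparseable or negative values fall back to 10 minutes.
    public static TimeSpan? ResolveIdleTimeout(string rawValue, bool isEphemeral)
    {
        if (!isEphemeral)
        {
            return null;
        }

        int minutes = 10;
        if (int.TryParse(rawValue, out int parsed) && parsed >= 0)
        {
            minutes = parsed;
        }

        return minutes == 0 ? (TimeSpan?)null : TimeSpan.FromMinutes(minutes);
    }

    // Races the pending job message against the idle timer.
    // Returns true if the message arrived first, false if the timer fired.
    public static async Task<bool> WaitForJobAsync(
        Task messageTask, TimeSpan? idleTimeout, CancellationToken token)
    {
        if (idleTimeout == null)
        {
            await messageTask;
            return true;
        }

        using var timerCts = CancellationTokenSource.CreateLinkedTokenSource(token);
        Task timer = Task.Delay(idleTimeout.Value, timerCts.Token);

        Task first = await Task.WhenAny(messageTask, timer);
        if (first == messageTask)
        {
            timerCts.Cancel();  // a job arrived, so the timer is no longer needed
            await messageTask;  // surface any exception from the message task
            return true;
        }

        await timer;            // throws if the outer token was cancelled
        return false;
    }

    public static async Task Main()
    {
        string raw = Environment.GetEnvironmentVariable("ACTIONS_RUNNER_IDLE_TIMEOUT_MINUTES");
        Console.WriteLine(ResolveIdleTimeout(raw, isEphemeral: false) == null); // True

        // Simulate a runner that never receives a job, with a short timer.
        Task neverArrives = new TaskCompletionSource<bool>().Task;
        bool gotJob = await WaitForJobAsync(
            neverArrives, TimeSpan.FromMilliseconds(50), CancellationToken.None);
        Console.WriteLine(gotJob); // False
    }
}
```

In this sketch, a caller that gets `false` back would exit with the terminal return code so the orchestrator can replace the runner.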
Summary
Fixes #4191
Two related fixes for ephemeral runners that end up registered with GitHub but unable to pick up jobs, blocking scale set capacity.
Fix 1: Treat deleted runner registration as non-retryable error
When a runner's registration is garbage-collected by GitHub, the OAuth token exchange fails with invalid_client ("Registration was not found"). The runner enters an infinite retry loop because VssOAuthTokenRequestException is not classified as non-retryable in either retry layer. This leaves the pod in Running status indefinitely: the ARC controller counts it as a live runner and never creates a replacement.
Changes:
- BrokerServer.ShouldRetryException(): stops the inner 5-retry loop in RetryRequest() immediately
- BrokerMessageListener.IsGetNextMessageExceptionRetriable(): causes GetNextMessageAsync() to throw NonRetryableException, so Program.cs returns TerminatedError
This matches the existing precedent in CreateSessionAsync() (line 188), which already treats invalid_client as terminal.
Fix 2: Add idle timeout for ephemeral runners waiting for jobs
Ephemeral runners created via JIT config can end up registered with GitHub but never receive a job (e.g., due to pod recreation causing session conflicts, or the original job being reassigned to another runner). These orphaned runners sit idle indefinitely, blocking a slot in the ARC scale set.
Changes:
- Runner.RunAsync(): adds a configurable idle timeout (default: 10 minutes) that exits the runner if no job is received after session creation. Uses the existing Task.WhenAny pattern already used for auto-update and run-once job completion.
- ACTIONS_RUNNER_IDLE_TIMEOUT_MINUTES env var (set to 0 to disable). Only applies to ephemeral/run-once runners.
How we found this
We traced a production incident where jobs targeting an ARC scale set were stuck for 9+ minutes. Cloud Logging revealed:
- jgjcv) had pod stability issues at startup: the original pod disappeared and was recreated
- running=1, desired=1 and never created a replacement (fix 2 would have caught this at 09:33)
Test plan
- dotnet build Runner.Listener.csproj)
- GetNextMessage_ThrowsNonRetryableOnInvalidClientOAuth: verifies NonRetryableException without retrying
- ShouldRetryException_ReturnsFalseForInvalidClientOAuth: verifies invalid_client is non-retryable
- ShouldRetryException_ReturnsTrueForOtherOAuthErrors: verifies other OAuth errors still retry
- ACTIONS_RUNNER_IDLE_TIMEOUT_MINUTES=0 disables the timeout