Commit Graph

47 Commits

Author SHA1 Message Date
Rob Howie ff7b81dc8f fix: add health check to detect and recover stale EphemeralRunner registrations
fix: add health check to detect and recover stale EphemeralRunner registrations

## Summary

Adds a GitHub-side registration health check to the `EphemeralRunnerReconciler`. During reconciliation of a running EphemeralRunner that has a valid `RunnerId`, the controller now calls `GetRunner()` against the GitHub Actions API to verify the registration still exists. If the API returns 404 (registration gone), the runner is marked as failed so the `EphemeralRunnerSet` can provision a replacement.

This also fixes `deleteRunnerFromService` to tolerate 404 from `RemoveRunner`, since a runner whose registration was already invalidated will also 404 on removal. Previously this caused `markAsFailed` to error and trigger a requeue loop, preventing cleanup of stale runners.

A redundant pod phase check was removed from the health check path — `checkRunnerRegistration` is only called from within the `cs.State.Terminated == nil` branch (container confirmed running), so the phase check was unnecessary and could cause false negatives since the local `EphemeralRunner` object may not have the updated phase yet.

## Changes

- **`ephemeralrunner_controller.go`**: New `checkRunnerRegistration()` method that calls `GetRunner()` and returns unhealthy only on confirmed 404. Transient API errors (500, timeouts, etc.) are logged and ignored to avoid false positives. Called during reconciliation when a running pod has a non-zero `RunnerId`.
- **`ephemeralrunner_controller.go`**: `deleteRunnerFromService()` now treats 404 from `RemoveRunner` as success, since the runner is already gone.
- **`ephemeralrunner_controller_test.go`**: Two new test cases — one verifying a 404 from `GetRunner` marks the runner as failed, another verifying a 500 does not. Both simulate a running pod container status before the health check triggers.
- **`fake/client.go`**: New `WithRemoveRunnerError` option for the fake client to support testing the 404 removal path.

## Issues Fixed

### Directly fixes

- **#4396** — *ARC runners did not recover automatically after GitHub Outage*: Runners that lost their GitHub-side registration during the [2026-03-05 GitHub Actions incident](https://www.githubstatus.com/incidents/g9j4tmfqdd09) remained stuck indefinitely because the `EphemeralRunnerReconciler` never verified registration validity. This change adds exactly that verification — on each reconciliation of a running runner, `GetRunner()` confirms the registration exists. If it returns 404, the runner is marked failed and replaced.

- **#4395** — *GHA Self-hosted runner pods are running but the runner status is showing offline*: Runner pods were running for 5-7+ hours showing "Listening for Jobs" but marked offline in GitHub. The pods' registrations were invalidated server-side (same incident), but the controller had no mechanism to detect this. The new health check detects the 404 and triggers cleanup, preventing these zombie runners from accumulating.

### Partially addresses

- **#4397** — *Stale TotalAssignedJobs causes permanent over-provisioning after platform incidents*: This issue has two components: (1) zombie runner pods occupying slots, and (2) the listener's `TotalAssignedJobs` remaining inflated. This PR fixes component (1) — by detecting and cleaning up runners with invalidated registrations, the controller stops accumulating zombie pods. However, the listener-side `TotalAssignedJobs` inflation (a separate code path in `listener.go` → `worker.go`) is not addressed by this change and still requires either a session reconnect mechanism or CR deletion to clear.

- **#4307** — *Ephemeral Runners seem to get stuck when job is canceled or interrupted*: Reports runners getting stuck with `Registration <uuid> was not found` errors in BrokerServer backoff loops after job cancellation. The health check would detect that the registration is gone (404 from `GetRunner`) and mark these runners as failed rather than leaving them in an infinite backoff loop.

- **#4155** — *EphemeralRunner and its pods left stuck Running after runner OOMKill*: Runners that are OOMKilled can end up in a state where the pod is technically running but the runner process is non-functional, and the registration may become stale. The health check provides a secondary detection mechanism — if the GitHub-side registration is invalidated for a non-functional runner, it will be caught and cleaned up.

- **#3821** — *AutoscalingRunnerSet gets stuck thinking runners exist when they do not*: Reports the AutoscalingRunnerSet believing runners exist when no pods are present, requiring CR deletion to recover. While the root cause may vary, stale registrations that the controller cannot clean up (due to `RemoveRunner` 404 errors causing requeue loops) are one contributing factor. The 404 tolerance in `deleteRunnerFromService` directly addresses this cleanup path.

## Test Plan

- [x] Unit test: `GetRunner` returning 404 → runner marked as failed
- [x] Unit test: `GetRunner` returning 500 → runner NOT marked as failed (transient error tolerance)
- [x] Unit test: `RemoveRunner` returning 404 → deletion succeeds (runner already gone)
- [ ] Integration: Deploy to a test cluster, manually delete a runner's registration via the GitHub API, verify the controller detects and replaces it
- [ ] Integration: Simulate incident conditions by blocking `broker.actions.githubusercontent.com` for running runners, then unblocking — verify stale runners are detected and replaced
2026-03-09 09:31:17 +00:00
Nikola Jokic 8b7fd9ffef
Switch client to scaleset library for the listener and update mocks (#4383) 2026-02-24 14:17:31 +01:00
Nikola Jokic c6e4c94a6a
Fix tests and generate mocks (#4384) 2026-02-24 13:36:01 +01:00
dhawalseth 9de09f56eb
Include the HTTP status code in jit error (#4361)
Co-authored-by: Dhawal Seth <dseth@linkedin.com>
2026-01-29 16:40:17 +01:00
Caius Durling 02aa70a64a
Fix `AcivityId` typo in error strings (#4359) 2026-01-21 01:14:26 +01:00
Nikola Jokic 540269880f
Typo in test name caused test to not execute (#4330) 2025-11-27 15:31:57 +01:00
Nikola Jokic 088e2a3a90
Remove ephemeral runner when exit code != 0 and is patched with the job (#4239) 2025-09-17 21:40:37 +02:00
Nikola Jokic e46c929241
Azure Key Vault integration to resolve secrets (#4090) 2025-06-11 15:53:33 +02:00
Nash Luffman e335f53037
Add response body to error when fetching access token (#4005)
Co-authored-by: mluffman <mluffman@thoughtmachine.net>
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
2025-06-11 15:52:38 +02:00
Nikola Jokic 1dbb88cb9e
Allow use of client id as an app id (#4057) 2025-05-16 16:21:06 +02:00
Ryosei Karaki f832b0b254
upgrade(golangci-lint): v2.1.2 (#4023)
Signed-off-by: karamaru-alpha <mrnk3078@gmail.com>
2025-04-17 16:14:31 +02:00
Nikola Jokic 15990d492d
Include more context to errors raised by github/actions client (#4032)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-04-11 11:36:15 +02:00
Nikola Jokic fb9b96bf75
Update all dependencies, conforming to the new controller-runtime API (#3949) 2025-03-11 15:52:52 +01:00
Bassem Dghaidi 90b68fec1a
Add exponential backoff when generating runner reg tokens (#3724) 2024-09-04 12:23:31 +02:00
Nikola Jokic b2c6992e84
Check status code of fetch access token for github app (#3568) 2024-06-21 12:10:56 +02:00
Nikola Jokic 4ee49fee14
Propagate max capacity information to the actions back-end (#3431) 2024-04-16 14:00:40 +02:00
Nikola Jokic 8075e5ee74
Refactor actions client error to include request id (#3430)
Co-authored-by: Francesco Renzi <rentziass@gmail.com>
2024-04-16 12:57:44 +02:00
Nikola Jokic 1d1790614b
Add retry on 401 and 403 for runner-registration (#3377)
Co-authored-by: Francesco Renzi <rentziass@gmail.com>
2024-03-27 10:55:17 +01:00
Nikola Jokic 728f05c844
Delete message session when `listener.Listen` returns (#3240) 2024-01-25 15:12:19 +01:00
Nikola Jokic b78cadd901
Refactoring listener app with configurable fallback (#3096) 2023-12-08 13:41:06 +01:00
Nikola Jokic 202a97ab12
Modify user agent format with subsystem and is proxy configured information (#3116) 2023-12-08 13:16:29 +01:00
Nikola Jokic b202be712e
Set actions client timeout to 5 minutes, add logging to client (#3103) 2023-11-24 17:04:21 +01:00
Nikola Jokic 2646456677
Update authorization for PAT to be Bearer as documented (#3039) 2023-11-07 14:19:53 +01:00
Nikola Jokic 07bff8aa1e
Extend the user agent and fix the build version for the listener app (#2892) 2023-09-14 20:10:49 +02:00
Nikola Jokic 52fc819339
Fix parsing AcquireJob MessageQueueTokenExpiredError (#2837) 2023-08-25 20:35:01 +02:00
Nikola Jokic a0a3916c80
Provide scale-set listener metrics (#2559)
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
2023-08-21 13:50:07 +02:00
Nikola Jokic fde1893494
Add status check before deserializing runner-registration response (#2699) 2023-07-05 21:09:07 +02:00
Tingluo Huang 8fa4520376
Treat `.ghe.com` domain as hosted environment (#2480)
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
2023-04-04 14:43:45 -04:00
Tingluo Huang 08acb1b831
Get RunnerScaleSet based on both RunnerGroupId and Name. (#2413) 2023-03-15 11:10:09 -04:00
Francesco Renzi c569304271
Add support for self-signed CA certificates (#2268)
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
2023-03-09 17:23:32 +00:00
Tingluo Huang a462ecbe79
Trim slash for configure URL. (#2381) 2023-03-09 09:02:05 -05:00
Chris Patterson 41f2ca3ed9
Adding parameter to configure the runner set name. (#2279)
Co-authored-by: TingluoHuang <TingluoHuang@github.com>
2023-03-03 08:36:14 -05:00
Francesco Renzi 6b4250ca90
Add support for proxy (#2286)
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
Co-authored-by: Ferenc Hammerl <fhammerl@github.com>
2023-02-21 17:33:48 +00:00
Francesco Renzi dd8ec1a055
Add testserver package (#2281) 2023-02-14 12:11:46 +01:00
Francesco Renzi 8f62e35f6b
Add options to multi client (#2257) 2023-02-07 08:47:59 +01:00
Nikola Jokic c4297d25bb
Avoid deleting scale set if annotation is not parsable or if it does not exist (#2239) 2023-02-03 17:27:31 +01:00
Francesco Renzi 92ab11b4d2
Use UUID v5 for client identifiers (#2241) 2023-02-02 09:28:34 +01:00
Francesco Renzi 7414dc6568
Add Identifier to actions.Client (#2237) 2023-02-01 14:47:54 +01:00
Tingluo Huang a5cef7e47b
Resolve CI break due to bad merge. (#2236) 2023-01-31 22:00:26 +01:00
Tingluo Huang 1f4fe4681e
Delete RunnerScaleSet on service when AutoScalingRunnerSet is deleted. (#2223) 2023-01-31 15:03:11 -05:00
Francesco Renzi df12e00c9e
Remove network requests from actions.NewClient (#2219)
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
2023-01-31 10:55:23 +00:00
Tingluo Huang 803818162c
Allow update runner group for AutoScalingRunnerSet (#2216) 2023-01-27 09:27:52 -05:00
Tingluo Huang b09e3a2dc9
Return error for non-existing runner group. (#2215) 2023-01-26 12:19:52 -05:00
Francesco Renzi c8918f5a7b
Fix URL for authenticating using a GitHub app (#2206)
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
2023-01-24 18:02:23 +01:00
Francesco Renzi d57d17f161
Add support for custom CA in actions.Client (#2199) 2023-01-23 17:36:57 -05:00
Francesco Renzi 3327f620fb
Refactor actions.Client with options to help extensibility (#2193) 2023-01-23 11:50:14 +00:00
Tingluo Huang 622eaa34f8
Introduce new preview auto-scaling mode for ARC. (#2153)
Co-authored-by: Cory Miller <cory-miller@github.com>
Co-authored-by: Nikola Jokic <nikola-jokic@github.com>
Co-authored-by: Ava Stancu <AvaStancu@github.com>
Co-authored-by: Ferenc Hammerl <fhammerl@github.com>
Co-authored-by: Francesco Renzi <rentziass@github.com>
Co-authored-by: Bassem Dghaidi <Link-@github.com>
2023-01-17 12:06:20 -05:00