Commit Graph

406 Commits

Author SHA1 Message Date
Rob Howie ff7b81dc8f fix: add health check to detect and recover stale EphemeralRunner registrations
fix: add health check to detect and recover stale EphemeralRunner registrations

## Summary

Adds a GitHub-side registration health check to the `EphemeralRunnerReconciler`. During reconciliation of a running EphemeralRunner that has a valid `RunnerId`, the controller now calls `GetRunner()` against the GitHub Actions API to verify the registration still exists. If the API returns 404 (registration gone), the runner is marked as failed so the `EphemeralRunnerSet` can provision a replacement.

This also fixes `deleteRunnerFromService` to tolerate 404 from `RemoveRunner`, since a runner whose registration was already invalidated will also 404 on removal. Previously this caused `markAsFailed` to error and trigger a requeue loop, preventing cleanup of stale runners.

A redundant pod phase check was removed from the health check path — `checkRunnerRegistration` is only called from within the `cs.State.Terminated == nil` branch (container confirmed running), so the phase check was unnecessary and could cause false negatives since the local `EphemeralRunner` object may not have the updated phase yet.

## Changes

- **`ephemeralrunner_controller.go`**: New `checkRunnerRegistration()` method that calls `GetRunner()` and returns unhealthy only on confirmed 404. Transient API errors (500, timeouts, etc.) are logged and ignored to avoid false positives. Called during reconciliation when a running pod has a non-zero `RunnerId`.
- **`ephemeralrunner_controller.go`**: `deleteRunnerFromService()` now treats 404 from `RemoveRunner` as success, since the runner is already gone.
- **`ephemeralrunner_controller_test.go`**: Two new test cases — one verifying a 404 from `GetRunner` marks the runner as failed, another verifying a 500 does not. Both simulate a running pod container status before the health check triggers.
- **`fake/client.go`**: New `WithRemoveRunnerError` option for the fake client to support testing the 404 removal path.

## Issues Fixed

### Directly fixes

- **#4396** — *ARC runners did not recover automatically after GitHub Outage*: Runners that lost their GitHub-side registration during the [2026-03-05 GitHub Actions incident](https://www.githubstatus.com/incidents/g9j4tmfqdd09) remained stuck indefinitely because the `EphemeralRunnerReconciler` never verified registration validity. This change adds exactly that verification — on each reconciliation of a running runner, `GetRunner()` confirms the registration exists. If it returns 404, the runner is marked failed and replaced.

- **#4395** — *GHA Self-hosted runner pods are running but the runner status is showing offline*: Runner pods were running for 5-7+ hours showing "Listening for Jobs" but marked offline in GitHub. The pods' registrations were invalidated server-side (same incident), but the controller had no mechanism to detect this. The new health check detects the 404 and triggers cleanup, preventing these zombie runners from accumulating.

### Partially addresses

- **#4397** — *Stale TotalAssignedJobs causes permanent over-provisioning after platform incidents*: This issue has two components: (1) zombie runner pods occupying slots, and (2) the listener's `TotalAssignedJobs` remaining inflated. This PR fixes component (1) — by detecting and cleaning up runners with invalidated registrations, the controller stops accumulating zombie pods. However, the listener-side `TotalAssignedJobs` inflation (a separate code path in `listener.go` → `worker.go`) is not addressed by this change and still requires either a session reconnect mechanism or CR deletion to clear.

- **#4307** — *Ephemeral Runners seem to get stuck when job is canceled or interrupted*: Reports runners getting stuck with `Registration <uuid> was not found` errors in BrokerServer backoff loops after job cancellation. The health check would detect that the registration is gone (404 from `GetRunner`) and mark these runners as failed rather than leaving them in an infinite backoff loop.

- **#4155** — *EphemeralRunner and its pods left stuck Running after runner OOMKill*: Runners that are OOMKilled can end up in a state where the pod is technically running but the runner process is non-functional, and the registration may become stale. The health check provides a secondary detection mechanism — if the GitHub-side registration is invalidated for a non-functional runner, it will be caught and cleaned up.

- **#3821** — *AutoscalingRunnerSet gets stuck thinking runners exist when they do not*: Reports the AutoscalingRunnerSet believing runners exist when no pods are present, requiring CR deletion to recover. While the root cause may vary, stale registrations that the controller cannot clean up (due to `RemoveRunner` 404 errors causing requeue loops) are one contributing factor. The 404 tolerance in `deleteRunnerFromService` directly addresses this cleanup path.

## Test Plan

- [x] Unit test: `GetRunner` returning 404 → runner marked as failed
- [x] Unit test: `GetRunner` returning 500 → runner NOT marked as failed (transient error tolerance)
- [x] Unit test: `RemoveRunner` returning 404 → deletion succeeds (runner already gone)
- [ ] Integration: Deploy to a test cluster, manually delete a runner's registration via the GitHub API, verify the controller detects and replaces it
- [ ] Integration: Simulate incident conditions by blocking `broker.actions.githubusercontent.com` for running runners, then unblocking — verify stale runners are detected and replaced
2026-03-09 09:31:17 +00:00
gateixeira 1f615c1a33
feat: add default linux nodeSelector to listener pod (#4377)
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
2026-02-24 17:56:39 +01:00
Nikola Jokic 8b7fd9ffef
Switch client to scaleset library for the listener and update mocks (#4383) 2026-02-24 14:17:31 +01:00
Jiaren Wu d3ca9de3ca
Potential fix for code scanning alert no. 7: Use of a broken or weak cryptographic hashing algorithm on sensitive data (#4353)
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2026-01-14 21:04:02 -08:00
Nikola Jokic bfe78ccd5d
Make restart pod more flexible to different failure scenarios (#4340) 2025-12-19 15:49:42 +01:00
Nikola Jokic 50038fba61
Re-schedule if the failed reason starts with `OutOf` (#4336) 2025-12-16 13:26:44 +01:00
Nikola Jokic 82d5579696
Restart the listener if pod is evicted (#4332) 2025-12-09 17:55:09 +01:00
Nikola Jokic 95d2107a6a
Code style changes on the controller (#4324) 2025-11-21 14:20:44 +01:00
Nikola Jokic 6d07b8d853
Add ephemeral runner finalizer during creation and check finalizer without requeue (#4320) 2025-11-20 23:06:27 +01:00
Nikola Jokic 9f9409a4c1
Handle resource quota on status forbidden by retrying (#4305)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-10 13:58:25 +01:00
Nikola Jokic 3d73636407
Use combination of namespace, GitHub URL, and runner group when hashing the listener name (#4299) 2025-11-10 13:58:16 +01:00
Nikola Jokic 4d22089978
Delete listener resources without requeueing on each call (#4289) 2025-10-29 13:01:00 +01:00
Nikola Jokic 634e42c916
Bump all dependencies (#4266) 2025-10-14 13:24:25 +02:00
Nikola Jokic 94a6f3cc3a
Ensure ephemeral runner is deleted from the service on exit != 0 (#4260) 2025-10-06 11:38:56 +02:00
Nikola Jokic 088e2a3a90
Remove ephemeral runner when exit code != 0 and is patched with the job (#4239) 2025-09-17 21:40:37 +02:00
Nikola Jokic ddc2918a48
Requeue if create pod returns already exists error (#4201) 2025-08-14 17:00:48 +02:00
Nikola Jokic c27541140a
Remove JIT config from ephemeral runner status field (#4191) 2025-08-04 12:35:04 +02:00
Ho Kim aa14f50e45
feat(runner): add ubuntu 24.04 support (#3598) 2025-07-01 18:34:52 +09:00
Nikola Jokic 9890c0592d
Explicitly requeue during backoff ephemeral runner (#4152) 2025-06-27 12:05:43 +02:00
Nikola Jokic 3b5693eecb
Remove check if runner exists after exit code 0 (#4142) 2025-06-27 11:11:39 +02:00
Nikola Jokic e46c929241
Azure Key Vault integration to resolve secrets (#4090) 2025-06-11 15:53:33 +02:00
Wim Fournier d4af75d82e
Delete config secret when listener pod gets deleted (#4033)
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
2025-06-11 15:53:04 +02:00
Nikola Jokic 1dbb88cb9e
Allow use of client id as an app id (#4057) 2025-05-16 16:21:06 +02:00
Nikola Jokic 43f1cd0dac
Refactor resource naming removing unnecessary calculations (#4076) 2025-05-15 10:56:34 +02:00
Nikola Jokic 389d842a30
Relax version requirements to allow patch version mismatch (#4080)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-05-14 21:38:16 +02:00
Nikola Jokic cae7efa2c6
Create backoff mechanism for failed runners and allow re-creation of failed ephemeral runners (#4059) 2025-05-14 15:38:50 +02:00
Ryosei Karaki f832b0b254
upgrade(golangci-lint): v2.1.2 (#4023)
Signed-off-by: karamaru-alpha <mrnk3078@gmail.com>
2025-04-17 16:14:31 +02:00
Nikola Jokic 5a960b5ebb
Create configurable metrics (#3975) 2025-03-24 15:27:42 +01:00
kahirokunn eaa3f2a3a0
chore: Added `OwnerReferences` during resource creation for `EphemeralRunnerSet`, `EphemeralRunner`, and `EphemeralRunnerPod` (#3575) 2025-03-19 15:03:20 +01:00
Nikola Jokic fb9b96bf75
Update all dependencies, conforming to the new controller-runtime API (#3949) 2025-03-11 15:52:52 +01:00
Nikola Jokic d8f1a61ab6
Clean up as much as possible in a single pass for the EphemeralRunner reconciler (#3941) 2025-03-10 11:03:45 +01:00
Nikola Jokic 2dab45c373
Wrap errors in controller helper methods and swap logic in cleanups (#3960) 2025-03-07 11:58:53 +01:00
Nikola Jokic 7a5996f467
Remove old githubrunnerscalesetlistener, remove warning and fix config bug (#3937) 2025-03-07 11:58:16 +01:00
Nikola Jokic e122615553
Use Ready from the pod conditions when setting it to the EphemeralRunner (#3891) 2025-03-05 10:21:06 +01:00
Nikola Jokic e12a892748
Rename log from target/actual to build/autoscalingRunnerSet version (#3957) 2025-03-04 17:01:34 +01:00
&es 7ccc177b84
Sanitize labels ending in hyphen, underscore, and dot (#3664) 2025-02-18 15:15:39 +01:00
Yusuke Kuoka 32ae917937
Make EphemeralRunnerReconciler create runner pods earlier (#3831)
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
2024-12-11 21:28:29 +01:00
Yusuke Kuoka 3998f6dee6
Make EphemeralRunnerController MaxConcurrentReconciles configurable (#3832)
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
2024-12-11 21:19:43 +01:00
Nikola Jokic b349ded2be
Increase test timeouts to avoid CI test failures (#3554) 2024-06-21 13:45:48 +02:00
Nikola Jokic 6276c84493
AutoscalingListener controller: Inspect listener container state instead of pod phase (#3548) 2024-06-21 13:40:08 +02:00
Nikola Jokic a62ca3d853
Exclude label prefix propagation (#3607) 2024-06-21 12:12:14 +02:00
Nikola Jokic 2cc793a835
Remove `.Named()` from the ephemeral runner controller (#3596) 2024-06-17 10:36:08 +02:00
Serge e45ac190e2
Customize work directory (#3477) 2024-06-04 15:16:45 +02:00
Katarzyna d0fb7206a4
Fix problem with ephemeralRunner Succeeded state before build executed (#3528) 2024-06-03 10:49:45 +02:00
Nikola Jokic 9afd93065f
Remove finalizers in one pass to speed up cleanups AutoscalingRunnerSet (#3536) 2024-05-27 09:21:31 +02:00
Nikola Jokic ab92e4edc3
Re-use the last desired patch on empty batch (#3453) 2024-05-17 15:12:16 +02:00
Nikola Jokic fa7a4f584e
Extract single place to set up indexers (#3454) 2024-05-17 14:42:46 +02:00
Nikola Jokic 9b51f25800
Rename imports in tests to remove double import and to improve readability (#3455) 2024-05-17 14:37:13 +02:00
Bryan Peterson 109750f816
propogate arbitrary labels from runnersets to all created resources (#3157) 2024-04-23 11:19:32 +02:00
Nikola Jokic 8075e5ee74
Refactor actions client error to include request id (#3430)
Co-authored-by: Francesco Renzi <rentziass@gmail.com>
2024-04-16 12:57:44 +02:00