fix: add health check to detect and recover stale EphemeralRunner registrations

## Summary

Adds a GitHub-side registration health check to the `EphemeralRunnerReconciler`. During reconciliation of a running EphemeralRunner that has a valid `RunnerId`, the controller now calls `GetRunner()` against the GitHub Actions API to verify the registration still exists. If the API returns 404 (registration gone), the runner is marked as failed so the `EphemeralRunnerSet` can provision a replacement.

This also fixes `deleteRunnerFromService` to tolerate 404 from `RemoveRunner`, since a runner whose registration was already invalidated will also 404 on removal. Previously this caused `markAsFailed` to error and trigger a requeue loop, preventing cleanup of stale runners.

A redundant pod phase check was removed from the health check path — `checkRunnerRegistration` is only called from within the `cs.State.Terminated == nil` branch (container confirmed running), so the phase check was unnecessary and could cause false negatives since the local `EphemeralRunner` object may not have the updated phase yet.

## Changes

- **`ephemeralrunner_controller.go`**: New `checkRunnerRegistration()` method that calls `GetRunner()` and returns unhealthy only on confirmed 404. Transient API errors (500, timeouts, etc.) are logged and ignored to avoid false positives. Called during reconciliation when a running pod has a non-zero `RunnerId`.
- **`ephemeralrunner_controller.go`**: `deleteRunnerFromService()` now treats 404 from `RemoveRunner` as success, since the runner is already gone.
- **`ephemeralrunner_controller_test.go`**: Two new test cases — one verifying a 404 from `GetRunner` marks the runner as failed, another verifying a 500 does not. Both simulate a running pod container status before the health check triggers.
- **`fake/client.go`**: New `WithRemoveRunnerError` option for the fake client to support testing the 404 removal path.

## Issues Fixed

### Directly fixes

- **#4396** — *ARC runners did not recover automatically after GitHub Outage*: Runners that lost their GitHub-side registration during the [2026-03-05 GitHub Actions incident](https://www.githubstatus.com/incidents/g9j4tmfqdd09) remained stuck indefinitely because the `EphemeralRunnerReconciler` never verified registration validity. This change adds exactly that verification — on each reconciliation of a running runner, `GetRunner()` confirms the registration exists. If it returns 404, the runner is marked failed and replaced.

- **#4395** — *GHA Self-hosted runner pods are running but the runner status is showing offline*: Runner pods were running for 5-7+ hours showing "Listening for Jobs" but marked offline in GitHub. The pods' registrations were invalidated server-side (same incident), but the controller had no mechanism to detect this. The new health check detects the 404 and triggers cleanup, preventing these zombie runners from accumulating.

### Partially addresses

- **#4397** — *Stale TotalAssignedJobs causes permanent over-provisioning after platform incidents*: This issue has two components: (1) zombie runner pods occupying slots, and (2) the listener's `TotalAssignedJobs` remaining inflated. This PR fixes component (1) — by detecting and cleaning up runners with invalidated registrations, the controller stops accumulating zombie pods. However, the listener-side `TotalAssignedJobs` inflation (a separate code path in `listener.go` → `worker.go`) is not addressed by this change and still requires either a session reconnect mechanism or CR deletion to clear.

- **#4307** — *Ephemeral Runners seem to get stuck when job is canceled or interrupted*: Reports runners getting stuck with `Registration <uuid> was not found` errors in BrokerServer backoff loops after job cancellation. The health check would detect that the registration is gone (404 from `GetRunner`) and mark these runners as failed rather than leaving them in an infinite backoff loop.

- **#4155** — *EphemeralRunner and its pods left stuck Running after runner OOMKill*: Runners that are OOMKilled can end up in a state where the pod is technically running but the runner process is non-functional, and the registration may become stale. The health check provides a secondary detection mechanism — if the GitHub-side registration is invalidated for a non-functional runner, it will be caught and cleaned up.

- **#3821** — *AutoscalingRunnerSet gets stuck thinking runners exist when they do not*: Reports the AutoscalingRunnerSet believing runners exist when no pods are present, requiring CR deletion to recover. While the root cause may vary, stale registrations that the controller cannot clean up (due to `RemoveRunner` 404 errors causing requeue loops) are one contributing factor. The 404 tolerance in `deleteRunnerFromService` directly addresses this cleanup path.

## Test Plan

- [x] Unit test: `GetRunner` returning 404 → runner marked as failed
- [x] Unit test: `GetRunner` returning 500 → runner NOT marked as failed (transient error tolerance)
- [x] Unit test: `RemoveRunner` returning 404 → deletion succeeds (runner already gone)
- [ ] Integration: Deploy to a test cluster, manually delete a runner's registration via the GitHub API, verify the controller detects and replaces it
- [ ] Integration: Simulate incident conditions by blocking `broker.actions.githubusercontent.com` for running runners, then unblocking — verify stale runners are detected and replaced
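The first integration step (deleting a registration out from under a running pod) can be sketched with the `gh` CLI against the organization self-hosted runner endpoints. The org name, runner id, and namespace below are placeholders, and this assumes org-admin credentials:

```shell
# List GitHub-side runner registrations to find the target's id (org name is a placeholder)
gh api /orgs/my-org/actions/runners --jq '.runners[] | [.id, .name] | @tsv'

# Delete the registration while the pod is still running
gh api -X DELETE /orgs/my-org/actions/runners/RUNNER_ID

# Watch the controller detect the 404 and replace the EphemeralRunner
kubectl get ephemeralrunners -n arc-runners -w
```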
---

Commit `ff7b81dc8f` (parent `396ee88f5a`), authored by Rob Howie on 2026-03-09 09:31:17 +00:00. 3 changed files with 273 additions and 0 deletions.

**`ephemeralrunner_controller.go`**

```diff
@@ -356,6 +356,19 @@ func (r *EphemeralRunnerReconciler) Reconcile(ctx context.Context, req ctrl.Requ
		log.Info("Failed to update ephemeral runner status. Requeue to not miss this event")
		return ctrl.Result{}, err
	}

	// Check if runner's GitHub registration is still valid
	healthy, err := r.checkRunnerRegistration(ctx, ephemeralRunner, log)
	if err != nil {
		return ctrl.Result{}, err
	}
	if !healthy {
		if err := r.markAsFailed(ctx, ephemeralRunner, "Runner registration no longer exists on GitHub", "RegistrationInvalidated", log); err != nil {
			return ctrl.Result{}, err
		}
		return ctrl.Result{}, nil
	}

	return ctrl.Result{}, nil
case cs.State.Terminated.ExitCode != 0: // failed
@@ -804,6 +817,43 @@ func (r *EphemeralRunnerReconciler) updateRunStatusFromPod(ctx context.Context,
	return nil
}

// checkRunnerRegistration verifies the runner's GitHub-side registration is still valid.
// Returns true if the runner is healthy or we can't determine status (API errors).
// Returns false if the registration is confirmed gone (404).
func (r *EphemeralRunnerReconciler) checkRunnerRegistration(ctx context.Context, ephemeralRunner *v1alpha1.EphemeralRunner, log logr.Logger) (bool, error) {
	// Only check runners that have been registered (have a RunnerId)
	if ephemeralRunner.Status.RunnerId == 0 {
		return true, nil
	}

	actionsClient, err := r.GetActionsService(ctx, ephemeralRunner)
	if err != nil {
		log.Error(err, "Failed to get actions client for health check, skipping")
		return true, nil
	}

	_, err = actionsClient.GetRunner(ctx, int64(ephemeralRunner.Status.RunnerId))
	if err == nil {
		return true, nil
	}

	var actionsErr *actions.ActionsError
	if errors.As(err, &actionsErr) && actionsErr.StatusCode == 404 {
		log.Info("Runner registration not found on GitHub, marking as failed",
			"runnerId", ephemeralRunner.Status.RunnerId,
			"runnerName", ephemeralRunner.Status.RunnerName,
		)
		return false, nil
	}

	// For non-404 errors (transient API issues), don't take action
	log.Info("Health check API call failed, will retry on next reconciliation",
		"runnerId", ephemeralRunner.Status.RunnerId,
		"error", err.Error(),
	)
	return true, nil
}

func (r *EphemeralRunnerReconciler) deleteRunnerFromService(ctx context.Context, ephemeralRunner *v1alpha1.EphemeralRunner, log logr.Logger) error {
	client, err := r.GetActionsService(ctx, ephemeralRunner)
	if err != nil {
@@ -813,6 +863,11 @@ func (r *EphemeralRunnerReconciler) deleteRunnerFromService(ctx context.Context,
	log.Info("Removing runner from the service", "runnerId", ephemeralRunner.Status.RunnerId)
	err = client.RemoveRunner(ctx, int64(ephemeralRunner.Status.RunnerId))
	if err != nil {
		var actionsErr *actions.ActionsError
		if errors.As(err, &actionsErr) && actionsErr.StatusCode == 404 {
			log.Info("Runner already removed from service", "runnerId", ephemeralRunner.Status.RunnerId)
			return nil
		}
		return fmt.Errorf("failed to remove runner from the service: %w", err)
	}
```

**`ephemeralrunner_controller_test.go`**

```diff
@@ -1422,4 +1422,216 @@ var _ = Describe("EphemeralRunner", func() {
			).Should(BeTrue(), "failed to contact server")
		})
	})

	Describe("Stale runner health check", func() {
		var ctx context.Context
		var mgr ctrl.Manager
		var autoscalingNS *corev1.Namespace
		var configSecret *corev1.Secret
		var controller *EphemeralRunnerReconciler

		Context("when GitHub registration is invalidated (404)", func() {
			var ephemeralRunner *v1alpha1.EphemeralRunner

			BeforeEach(func() {
				ctx = context.Background()
				autoscalingNS, mgr = createNamespace(GinkgoT(), k8sClient)
				configSecret = createDefaultSecret(GinkgoT(), k8sClient, autoscalingNS.Name)

				// GetRunner returns 404 — simulates GitHub forgetting the registration
				// RemoveRunner also returns 404 — the runner is already gone
				fakeClient := fake.NewFakeClient(
					fake.WithGetRunner(nil, &actions.ActionsError{
						StatusCode: 404,
						Err: &actions.ActionsExceptionError{
							ExceptionName: "AgentNotFoundException",
							Message:       "runner not found",
						},
					}),
					fake.WithRemoveRunnerError(&actions.ActionsError{
						StatusCode: 404,
						Err: &actions.ActionsExceptionError{
							ExceptionName: "AgentNotFoundException",
							Message:       "runner not found",
						},
					}),
				)

				controller = &EphemeralRunnerReconciler{
					Client: mgr.GetClient(),
					Scheme: mgr.GetScheme(),
					Log:    logf.Log,
					ResourceBuilder: ResourceBuilder{
						SecretResolver: &SecretResolver{
							k8sClient:   mgr.GetClient(),
							multiClient: fake.NewMultiClient(fake.WithDefaultClient(fakeClient, nil)),
						},
					},
				}
				err := controller.SetupWithManager(mgr)
				Expect(err).To(BeNil(), "failed to setup controller")

				ephemeralRunner = newExampleRunner("stale-runner", autoscalingNS.Name, configSecret.Name)
				err = k8sClient.Create(ctx, ephemeralRunner)
				Expect(err).To(BeNil(), "failed to create ephemeral runner")

				startManagers(GinkgoT(), mgr)
			})

			It("should mark the runner as failed when GetRunner returns 404", func() {
				// Wait for the controller to set up the runner (finalizers, secret, pod, status)
				Eventually(
					func() (int, error) {
						updated := new(v1alpha1.EphemeralRunner)
						err := k8sClient.Get(ctx, client.ObjectKey{Name: ephemeralRunner.Name, Namespace: ephemeralRunner.Namespace}, updated)
						if err != nil {
							return 0, err
						}
						return updated.Status.RunnerId, nil
					},
					ephemeralRunnerTimeout,
					ephemeralRunnerInterval,
				).ShouldNot(Equal(0), "runner should have a RunnerId set")

				// Wait for the pod to exist, then simulate a running container
				pod := new(corev1.Pod)
				Eventually(
					func() (bool, error) {
						if err := k8sClient.Get(ctx, client.ObjectKey{Name: ephemeralRunner.Name, Namespace: ephemeralRunner.Namespace}, pod); err != nil {
							return false, err
						}
						return true, nil
					},
					ephemeralRunnerTimeout,
					ephemeralRunnerInterval,
				).Should(BeEquivalentTo(true), "pod should exist")

				// Set pod to running with a running container status
				pod.Status.Phase = corev1.PodRunning
				pod.Status.ContainerStatuses = []corev1.ContainerStatus{
					{
						Name: v1alpha1.EphemeralRunnerContainerName,
						State: corev1.ContainerState{
							Running: &corev1.ContainerStateRunning{
								StartedAt: metav1.Now(),
							},
						},
					},
				}
				err := k8sClient.Status().Update(ctx, pod)
				Expect(err).To(BeNil(), "failed to update pod status to running")

				// Now the health check should detect the stale registration and mark it failed
				Eventually(
					func() (corev1.PodPhase, error) {
						updated := new(v1alpha1.EphemeralRunner)
						err := k8sClient.Get(ctx, client.ObjectKey{Name: ephemeralRunner.Name, Namespace: ephemeralRunner.Namespace}, updated)
						if err != nil {
							return "", err
						}
						return updated.Status.Phase, nil
					},
					ephemeralRunnerTimeout,
					ephemeralRunnerInterval,
				).Should(Equal(corev1.PodFailed), "runner with invalidated registration should be marked failed")
			})
		})

		Context("when GetRunner returns a transient error (500)", func() {
			var ephemeralRunner *v1alpha1.EphemeralRunner

			BeforeEach(func() {
				ctx = context.Background()
				autoscalingNS, mgr = createNamespace(GinkgoT(), k8sClient)
				configSecret = createDefaultSecret(GinkgoT(), k8sClient, autoscalingNS.Name)

				// GetRunner returns 500 — transient error, should NOT kill the runner
				fakeClient := fake.NewFakeClient(
					fake.WithGetRunner(nil, &actions.ActionsError{
						StatusCode: 500,
						Err:        fmt.Errorf("internal server error"),
					}),
				)

				controller = &EphemeralRunnerReconciler{
					Client: mgr.GetClient(),
					Scheme: mgr.GetScheme(),
					Log:    logf.Log,
					ResourceBuilder: ResourceBuilder{
						SecretResolver: &SecretResolver{
							k8sClient:   mgr.GetClient(),
							multiClient: fake.NewMultiClient(fake.WithDefaultClient(fakeClient, nil)),
						},
					},
				}
				err := controller.SetupWithManager(mgr)
				Expect(err).To(BeNil(), "failed to setup controller")

				ephemeralRunner = newExampleRunner("transient-error-runner", autoscalingNS.Name, configSecret.Name)
				err = k8sClient.Create(ctx, ephemeralRunner)
				Expect(err).To(BeNil(), "failed to create ephemeral runner")

				startManagers(GinkgoT(), mgr)
			})

			It("should NOT mark the runner as failed on transient errors", func() {
				// Wait for runner to get a RunnerId
				Eventually(
					func() (int, error) {
						updated := new(v1alpha1.EphemeralRunner)
						err := k8sClient.Get(ctx, client.ObjectKey{Name: ephemeralRunner.Name, Namespace: ephemeralRunner.Namespace}, updated)
						if err != nil {
							return 0, err
						}
						return updated.Status.RunnerId, nil
					},
					ephemeralRunnerTimeout,
					ephemeralRunnerInterval,
				).ShouldNot(Equal(0), "runner should have a RunnerId set")

				// Wait for pod and set it to running (same as the 404 test)
				pod := new(corev1.Pod)
				Eventually(
					func() (bool, error) {
						if err := k8sClient.Get(ctx, client.ObjectKey{Name: ephemeralRunner.Name, Namespace: ephemeralRunner.Namespace}, pod); err != nil {
							return false, err
						}
						return true, nil
					},
					ephemeralRunnerTimeout,
					ephemeralRunnerInterval,
				).Should(BeEquivalentTo(true), "pod should exist")

				pod.Status.Phase = corev1.PodRunning
				pod.Status.ContainerStatuses = []corev1.ContainerStatus{
					{
						Name: v1alpha1.EphemeralRunnerContainerName,
						State: corev1.ContainerState{
							Running: &corev1.ContainerStateRunning{
								StartedAt: metav1.Now(),
							},
						},
					},
				}
				err := k8sClient.Status().Update(ctx, pod)
				Expect(err).To(BeNil(), "failed to update pod status to running")

				// Give the controller time to process — it should NOT mark as failed
				Consistently(
					func() (corev1.PodPhase, error) {
						updated := new(v1alpha1.EphemeralRunner)
						err := k8sClient.Get(ctx, client.ObjectKey{Name: ephemeralRunner.Name, Namespace: ephemeralRunner.Namespace}, updated)
						if err != nil {
							return "", err
						}
						return updated.Status.Phase, nil
					},
					time.Second*3,
					ephemeralRunnerInterval,
				).ShouldNot(Equal(corev1.PodFailed), "runner should NOT be marked failed on transient errors")
			})
		})
	})
})
```

**`fake/client.go`**

```diff
@@ -45,6 +45,12 @@ func WithUpdateRunnerScaleSet(scaleSet *actions.RunnerScaleSet, err error) Optio
	}
}

func WithRemoveRunnerError(err error) Option {
	return func(f *FakeClient) {
		f.removeRunnerResult.err = err
	}
}

var defaultRunnerScaleSet = &actions.RunnerScaleSet{
	Id:   1,
	Name: "testset",
```