Handle missing runner ID more gracefully (#1855)
so that ARC respect the registration timeout, terminationGracePeriodSeconds and RUNNER_GRACEFUL_STOP_TIMEOUT(#1759) when the runner pod was terminated externally too early after its creation While I was running E2E tests for #1759, I discovered a potential issue that ARC can terminate runner pods without waiting for the registration timeout of 10 minutes. You won't be affected by this in normal circumstances, as this failure scenario can be triggered only when you or another K8s controller like cluster-autoscaler deleted the runner or the runner pod immediately after the runner or the runner pod has been created. But probably is it worth fixing it anyway because it's not impossible to trigger it?
This commit is contained in:
parent
6aaff4ecee
commit
7ff5b7da8c
|
|
@ -98,11 +98,27 @@ func ensureRunnerUnregistration(ctx context.Context, retryDelay time.Duration, l
|
||||||
// If it's already unregistered in the previous reconcilation loop,
|
// If it's already unregistered in the previous reconcilation loop,
|
||||||
// you can safely assume that it won't get registered again so it's safe to delete the runner pod.
|
// you can safely assume that it won't get registered again so it's safe to delete the runner pod.
|
||||||
log.Info("Runner pod is marked as already unregistered.")
|
log.Info("Runner pod is marked as already unregistered.")
|
||||||
} else if runnerID == nil {
|
} else if runnerID == nil && !runnerPodOrContainerIsStopped(pod) && !podConditionTransitionTimeAfter(pod, corev1.PodReady, registrationTimeout) {
|
||||||
log.Info(
|
log.Info(
|
||||||
"Unregistration started before runner ID is assigned. " +
|
"Unregistration started before runner obtains ID. Waiting for the regisration timeout to elapse, or the runner to obtain ID, or the runner pod to stop",
|
||||||
"Perhaps the runner pod was terminated by anyone other than ARC? Was it OOM killed? " +
|
"registrationTimeout", registrationTimeout,
|
||||||
|
)
|
||||||
|
return &ctrl.Result{RequeueAfter: retryDelay}, nil
|
||||||
|
} else if runnerID == nil && runnerPodOrContainerIsStopped(pod) {
|
||||||
|
log.Info(
|
||||||
|
"Unregistration started before runner ID is assigned and the runner stopped before obtaining ID within registration timeout. "+
|
||||||
|
"Perhaps the runner successfully ran the job and stopped normally before the runner ID becomes visible via GitHub API? "+
|
||||||
|
"Perhaps the runner pod was terminated by anyone other than ARC? Was it OOM killed? "+
|
||||||
"Marking unregistration as completed anyway because there's nothing ARC can do.",
|
"Marking unregistration as completed anyway because there's nothing ARC can do.",
|
||||||
|
"registrationTimeout", registrationTimeout,
|
||||||
|
)
|
||||||
|
} else if runnerID == nil && podConditionTransitionTimeAfter(pod, corev1.PodReady, registrationTimeout) {
|
||||||
|
log.Info(
|
||||||
|
"Unregistration started before runner ID is assigned and the runner was unable to obtain ID within registration timeout. "+
|
||||||
|
"Perhaps the runner has communication issue, or a firewall egress rule is dropping traffic to GitHub API, or GitHub API is unavailable? "+
|
||||||
|
"Marking unregistration as completed anyway because there's nothing ARC can do. "+
|
||||||
|
"This may result in in cancelling the job depending on your terminationGracePeriodSeconds and RUNNER_GRACEFUL_STOP_TIMEOUT settings.",
|
||||||
|
"registrationTimeout", registrationTimeout,
|
||||||
)
|
)
|
||||||
} else if pod != nil && runnerPodOrContainerIsStopped(pod) {
|
} else if pod != nil && runnerPodOrContainerIsStopped(pod) {
|
||||||
// If it's an ephemeral runner with the actions/runner container exited with 0,
|
// If it's an ephemeral runner with the actions/runner container exited with 0,
|
||||||
|
|
@ -351,21 +367,22 @@ func setRunnerEnv(pod *corev1.Pod, key, value string) {
|
||||||
// Case 1. (true, nil) when it has successfully unregistered the runner.
|
// Case 1. (true, nil) when it has successfully unregistered the runner.
|
||||||
// Case 2. (false, nil) when (2-1.) the runner has been already unregistered OR (2-2.) the runner will never be created OR (2-3.) the runner is not created yet and it is about to be registered(hence we couldn't see it's existence from GitHub Actions API yet)
|
// Case 2. (false, nil) when (2-1.) the runner has been already unregistered OR (2-2.) the runner will never be created OR (2-3.) the runner is not created yet and it is about to be registered(hence we couldn't see it's existence from GitHub Actions API yet)
|
||||||
// Case 3. (false, err) when it postponed unregistration due to the runner being busy, or it tried to unregister the runner but failed due to
|
// Case 3. (false, err) when it postponed unregistration due to the runner being busy, or it tried to unregister the runner but failed due to
|
||||||
// an error returned by GitHub API.
|
//
|
||||||
|
// an error returned by GitHub API.
|
||||||
//
|
//
|
||||||
// When the returned values is "Case 2. (false, nil)", the caller must handle the three possible sub-cases appropriately.
|
// When the returned values is "Case 2. (false, nil)", the caller must handle the three possible sub-cases appropriately.
|
||||||
// In other words, all those three sub-cases cannot be distinguished by this function alone.
|
// In other words, all those three sub-cases cannot be distinguished by this function alone.
|
||||||
//
|
//
|
||||||
// - Case "2-1." can happen when e.g. ARC has successfully unregistered in a previous reconcilation loop or it was an ephemeral runner that finished it's job run(an ephemeral runner is designed to stop after a job run).
|
// - Case "2-1." can happen when e.g. ARC has successfully unregistered in a previous reconcilation loop or it was an ephemeral runner that finished it's job run(an ephemeral runner is designed to stop after a job run).
|
||||||
// You'd need to maintain the runner state(i.e. if it's already unregistered or not) somewhere,
|
// You'd need to maintain the runner state(i.e. if it's already unregistered or not) somewhere,
|
||||||
// so that you can either not call this function at all if the runner state says it's already unregistered, or determine that it's case "2-1." when you got (false, nil).
|
// so that you can either not call this function at all if the runner state says it's already unregistered, or determine that it's case "2-1." when you got (false, nil).
|
||||||
//
|
//
|
||||||
// - Case "2-2." can happen when e.g. the runner registration token was somehow broken so that `config.sh` within the runner container was never meant to succeed.
|
// - Case "2-2." can happen when e.g. the runner registration token was somehow broken so that `config.sh` within the runner container was never meant to succeed.
|
||||||
// Waiting and retrying forever on this case is not a solution, because `config.sh` won't succeed with a wrong token hence the runner gets stuck in this state forever.
|
// Waiting and retrying forever on this case is not a solution, because `config.sh` won't succeed with a wrong token hence the runner gets stuck in this state forever.
|
||||||
// There isn't a perfect solution to this, but a practical workaround would be implement a "grace period" in the caller side.
|
// There isn't a perfect solution to this, but a practical workaround would be implement a "grace period" in the caller side.
|
||||||
//
|
//
|
||||||
// - Case "2-3." can happen when e.g. ARC recreated an ephemral runner pod in a previous reconcilation loop and then it was requested to delete the runner before the runner comes up.
|
// - Case "2-3." can happen when e.g. ARC recreated an ephemral runner pod in a previous reconcilation loop and then it was requested to delete the runner before the runner comes up.
|
||||||
// If handled inappropriately, this can cause a race condition betweeen a deletion of the runner pod and GitHub scheduling a workflow job onto the runner.
|
// If handled inappropriately, this can cause a race condition betweeen a deletion of the runner pod and GitHub scheduling a workflow job onto the runner.
|
||||||
//
|
//
|
||||||
// Once successfully detected case "2-1." or "2-2.", you can safely delete the runner pod because you know that the runner won't come back
|
// Once successfully detected case "2-1." or "2-2.", you can safely delete the runner pod because you know that the runner won't come back
|
||||||
// as long as you recreate the runner pod.
|
// as long as you recreate the runner pod.
|
||||||
|
|
|
||||||
Loading…
Reference in New Issue