* 1770 update log format and add runID and Id to workflow logs
update tests, change log format for controllers.HorizontalRunnerAutoscalerGitHubWebhook
use logging package
remove unused modules
add setup name to setuplog
add flag to change log format
change flag name to enableProdLogConfig
move log opts to logger package
remove empty else and reset timeEncoder
update flag description
use get function to handle nil
rename flag and update logger function
Update main.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Update controllers/horizontal_runner_autoscaler_webhook.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Update logging/logger.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
copy log opt per each NewLogger call
revert to use autoscaler.log
update flag description and remove unused imports
add logFormat to readme
rename setupLog to logger
make fmt
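
For context, this is roughly what the new logFormat switch amounts to. A minimal sketch assuming controller-runtime's zap helpers; the function signature and the "json" flag value are assumptions for illustration, not ARC's exact code:

```go
// logging/logger.go (illustrative sketch, not the actual ARC implementation)
package logging

import (
	"github.com/go-logr/logr"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

// NewLogger returns a logr.Logger whose output format is selected by the
// logFormat flag: development mode gives a human-readable console encoder,
// while disabling it switches to a production-style JSON encoder.
func NewLogger(logFormat string) logr.Logger {
	// Build a fresh options slice on every call so one caller cannot
	// mutate options shared with another ("copy log opt per each
	// NewLogger call" above).
	opts := []zap.Opts{
		zap.UseDevMode(logFormat != "json"),
	}
	return zap.New(opts...)
}
```

In main.go the logger could then be created once from the flag value, handed to controller-runtime via ctrl.SetLogger, and passed as a named child logger to controllers such as controllers.HorizontalRunnerAutoscalerGitHubWebhook.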
* Fix E2E along the way
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
so that ARC respects the registration timeout, terminationGracePeriodSeconds, and RUNNER_GRACEFUL_STOP_TIMEOUT (#1759) when the runner pod is terminated externally too soon after its creation
While I was running E2E tests for #1759, I discovered a potential issue: ARC could terminate runner pods without waiting for the 10-minute registration timeout.
You won't be affected by this under normal circumstances, as the failure scenario can be triggered only when you or another K8s controller like cluster-autoscaler deletes the runner or the runner pod immediately after it has been created. But it is probably worth fixing anyway, because it is not impossible to trigger.
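
For illustration only (the helper name and signature are hypothetical), the guard described above boils down to not finalizing an unregistered runner pod until the 10-minute registration timeout has elapsed since the pod's creation:

```go
// Illustrative sketch of the registration-timeout guard (not ARC's code).
package sketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// registrationTimeout is the 10-minute window a freshly created runner gets
// to register itself with GitHub before ARC gives up on it.
const registrationTimeout = 10 * time.Minute

// okToCleanUpRunnerPod reports whether a pod that is being terminated
// externally may be cleaned up now, or whether the reconciler should
// requeue and keep waiting for the runner to register.
func okToCleanUpRunnerPod(pod corev1.Pod, registered bool, now time.Time) bool {
	if registered {
		return true
	}
	// Not registered yet: respect the registration timeout measured from
	// the pod's creation, even if the pod was deleted right after it was
	// created.
	return now.Sub(pod.CreationTimestamp.Time) > registrationTimeout
}
```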
This is intended to fix #1369, mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might also work for RunnerDeployment in cases where #1395 does not, such as when the user explicitly sets the runner pod restart policy to anything other than "Never".
Ref #1369
The unregistration timeout of 1 minute (no matter how long it is) can negatively impact the availability of a static runner that constantly runs workflow jobs, and of an ephemeral runner that runs a long-running job.
We deal with that by completely removing the unregistration timeout, so that regardless of the type of runner (static or ephemeral), it waits indefinitely until it is successfully unregistered before being terminated.
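
Schematically, and with hypothetical names (the real logic lives in ARC's runner and runner pod reconcilers), removing the timeout means the reconciler simply keeps requeueing until unregistration succeeds:

```go
// Illustrative sketch: wait indefinitely for unregistration (not ARC's code).
package sketch

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Runner and unregisterRunner are stand-ins for ARC's own runner type and
// its GitHub-API-backed unregistration helper.
type Runner struct{ Name string }

func unregisterRunner(ctx context.Context, r Runner) (bool, error) {
	// In ARC this would call the GitHub API to remove the runner and
	// report whether the removal succeeded.
	return false, nil
}

// ensureUnregistered requeues until unregistration succeeds, instead of
// giving up (and letting the pod be deleted) after a fixed timeout.
func ensureUnregistered(ctx context.Context, r Runner) (ctrl.Result, error) {
	ok, err := unregisterRunner(ctx, r)
	if err != nil {
		return ctrl.Result{}, err
	}
	if !ok {
		// Still registered, possibly busy running a job: retry later,
		// with no upper bound on how long we wait overall.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	// Unregistered: it is now safe to proceed with pod deletion.
	return ctrl.Result{}, nil
}
```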
Since #1127 and #1167, we had been retrying the `RemoveRunner` API call on every graceful runner stop attempt while the runner was still busy.
There was no reliable way to throttle the retry attempts, and the combination of these resulted in ARC spamming RemoveRunner calls (one call per reconciliation loop, but the loop runs quite often due to how the controller works) once a call had failed because the runner was in the middle of running a workflow job.
This fixes that by adding a few short-circuit conditions that work for ephemeral runners. An ephemeral runner can unregister itself on completion, so in most cases ARC can just wait for the runner to stop if it's already running a job. Since a RemoveRunner response with status 422 implies that the runner is running a job, we can use it as the trigger to start the runner stop waiter.
The end result is that a 422 error is observed at most once over the whole graceful termination process of an ephemeral runner pod. RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less of the GitHub API rate-limit budget, and the logs are much cleaner than before.
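
Roughly, the short-circuit looks like the following sketch; the wrapper, its parameters, and the go-github import version are assumptions made for illustration, not ARC's actual code:

```go
// Illustrative sketch of the 422 short-circuit for ephemeral runners.
package sketch

import (
	"context"
	"errors"
	"net/http"

	"github.com/google/go-github/v47/github"
)

// removeRunnerOnce issues a single RemoveRunner call. For an ephemeral
// runner, a 422 response implies the runner is busy running a job, so the
// caller should stop retrying RemoveRunner and instead wait for the runner
// to unregister itself once the job completes.
func removeRunnerOnce(ctx context.Context, c *github.Client, owner, repo string, runnerID int64, ephemeral bool) (waitForCompletion bool, err error) {
	_, err = c.Actions.RemoveRunner(ctx, owner, repo, runnerID)
	if err == nil {
		return false, nil
	}
	var ghErr *github.ErrorResponse
	if ephemeral && errors.As(err, &ghErr) && ghErr.Response != nil &&
		ghErr.Response.StatusCode == http.StatusUnprocessableEntity {
		// 422: the runner is mid-job. Don't retry RemoveRunner; start the
		// runner stop waiter and let the runner finish and unregister itself.
		return true, nil
	}
	return false, err
}
```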
Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1167#issuecomment-1064213271
This eliminates the race condition that resulted in the runner being terminated prematurely when RunnerSet triggered unregistration of a StatefulSet that had been added just a few seconds earlier.
Enhances the runner controller and the runner pod controller to use consistent timeouts for runner unregistration and runner pod deletion,
so that we are very unlikely to terminate pods that are running any jobs.
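
One way to picture the "consistent timeouts" part: both controllers derive their waits from the same values, so neither can decide to delete a pod while the other is still unregistering the runner. The constant names and the retry interval below are made up for illustration:

```go
// Illustrative sketch of timeouts shared by the runner controller and the
// runner pod controller (names and values are not ARC's actual ones).
package sketch

import "time"

const (
	// How long a newly created runner gets to register before either
	// controller may consider deleting its pod.
	runnerRegistrationTimeout = 10 * time.Minute

	// How often either controller re-checks whether unregistration has
	// completed before proceeding with pod deletion.
	unregistrationRetryDelay = 30 * time.Second
)
```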