This fixes the said issue that I found while I was running a series of E2E tests to test other features and pull requestes I have recently contributed.
Setting SecurityContext.Privileged bit to false, which is default,
prevents GKE from admitting Windows pods. Privileged bit is not
supported on Windows.
* 1770 update log format and add runID and Id to worflow logs
update tests, change log format for controllers.HorizontalRunnerAutoscalerGitHubWebhook
use logging package
remove unused modules
add setup name to setuplog
add flag to change log format
change flag name to enableProdLogConfig
move log opts to logger package
remove empty else and reset timeEncoder
update flag description
use get function to handle nil
rename flag and update logger function
Update main.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Update controllers/horizontal_runner_autoscaler_webhook.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Update logging/logger.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
copy log opt per each NewLogger call
revert to use autoscaler.log
update flag descript and remove unused imports
add logFormat to readme
rename setupLog to logger
make fmt
* Fix E2E along the way
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
so that ARC respect the registration timeout, terminationGracePeriodSeconds and RUNNER_GRACEFUL_STOP_TIMEOUT(#1759) when the runner pod was terminated externally too early after its creation
While I was running E2E tests for #1759, I discovered a potential issue that ARC can terminate runner pods without waiting for the registration timeout of 10 minutes.
You won't be affected by this in normal circumstances, as this failure scenario can be triggered only when you or another K8s controller like cluster-autoscaler deleted the runner or the runner pod immediately after the runner or the runner pod has been created. But probably is it worth fixing it anyway because it's not impossible to trigger it?
This introduces a linter to PRs to help with code reviews and code hygiene. I've also gone ahead and fixed (or ignored) the existing lints.
I've only setup the default linters right now. There are many more options that are documented at https://golangci-lint.run/.
The GitHub Action should add appropriate annotations to the lint job for the PR. Contributors can also lint locally using `make lint`.
* Add prometheus metrics for autoscaling
* Add desc for prometheus-metrics
* FIX: Typo
* Remove replicas_desired_before in metrics
* Remove Num prefix in metricws
This removes the flag and code for the legacy GitHub API cache. We already migrated to fully use the new HTTP cache based API cache functionality which had been added via #1127 and available since ARC 0.22.0. Since then, the legacy one had been no-op and therefore removing it is safe.
Ref #1412
* feat: allow to discover runner statuses
* fix manifests
* Bump runner version to 2.289.1 which includes the hooks support
* Add feedback from review
* Update reference to newRunnerPod
* Fix TestNewRunnerPodFromRunnerController and make hooks file names job specific
* Fix additional TestNewRunnerPod test
* Cover additional feedback from review
* fix rbac manager role
* Add permissions to service account for container mode if not provided
* Rename flag to runner.statusUpdateHook.enabled and fix needsServiceAccount
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
The regression resulted in the webhook-based autoscaler be unable to find visible runner groups and therefore unable to scale up and down the target RunnerDeployment/RunnerSet at all when the webhook-based autoscaler was provided GitHub API credentials to enable the runner groups support. This fixes that.
The regression was introduced via #1578 which is not released yet. Users of existing ARC releases are therefore not affected.
* added containerMode=kubernetes env variables to the runner
* removed unused logging
* restored configs and charts
* restored makefile cert version and acceptance/run
* added workVolumeClaimTemplate in pod definition, including logic
* added claim template name based on the runner
* Apply suggestions from code review
update errors
* added concurrent cleanup before runner pod is deleted
* update manifests
* added retry after 30s if pod cleanup contains err
* added admission webhook check, made workVolumeClaimTemplate mandatory for k8s
* style changes and added comments
* added izZero timestamp check for deleting runner-linked pods
* changed order of local variable to avoid copy if p is deleted
* removed docker from container mode k8s
* restored charts, config, makefile
* restored forked files back and not the ARC ones
* created PersistentVolume on containerMode k8s
* create pv only if storage class name is local-storage
* removed actions if storage class name is local-storage
* added service account validation if container mode kubernetes
* changed the coding style to match rest of the ARC
* added validation to the runnerdeployment webhook
* specified fields more precisely, added webhook validation to the replicaset as well
* remake manifests
* wraped delete runner-linked-pods in kube mode
* fixed empty line
* fixed import
* makefile changes for hooks
* added cleanup secrets
* create manifests
* docs
* update access modes
* update dockerfile
* nit changes
* fixed dockerfile
* rewrite allowing reuse for runners and runnersets
* deepcopy forgot to stage
* changed privileged
* make manifests
* partly moved to finalizer, still need to apply finalizer first
* finalizer added if env variable used in container mode exists
* bump runner version
* error message moved from Error to Info on cleanup pods/secrets
* removed useless dereferencing, added transformation tests of workVolumeClaimTemplate
* Apply suggestions from code review
* Update controllers/utils_test.go
Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com>
* Update controllers/utils_test.go
Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com>
* add hook version to cli, update to 0.1.2
* Apply suggestions from code review
* Update controllers/utils_test.go
* Update runner/Makefile
* Fix missing secret permission and the error handling
* Fix a runnerpod reconciler finalizer to not trigger unnecessary retry
Co-authored-by: Nikola Jokic <nikola-jokic@github.com>
Co-authored-by: Nikola Jokic <97525037+nikola-jokic@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Make webhook-based scale operation asynchronous
This prevents race condition in the webhook-based autoscaler when it received another webhook event while processing another webhook event and both ended up
scaling up the same horizontal runner autoscaler.
Ref #1321
* Fix typos
* Update rather than Patch HRA to avoid race among webhook-based autoscaler servers
* Batch capacity reservation updates for efficient use of apiserver
* Fix potential never-ending HRA update conflicts in batch update
* Extract batchScaler out of webhook-based autoscaler for testability
* Fix log levels and batch scaler hang on start
* Correlate webhook event with scale trigger amount in logs
* Fix log message
This adds the test to verify the runner pod generation logic for the case that you use a generic ephemeral volume as "work".
It is almost an adaptation of the test cases writetn for RunnerSet in #1471, to RunnerDeployment and Runner.
* fix: Avoid duplicate volume and mount name error for generic ephemeral volume as "work"
While manually testing configurations being documented in #1464, I discovered that the use of dynamic ephemeral volume for "work" directory was not working correctly due to the valiadation error.
This fixes the runner pod generation logic to not add the default volume and volume mount for "work" dir, so that the error disappears.
Ref #1464
* e2e: Ensure work generic ephemeral volume to work as expected
We had some dead code left over from the removal of registration runners. Registration runners were removed in #859#1207
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Enhance RunnerSet to optionally retain PVs accross restarts
This is our initial attempt to bring back the ability to retain PVs across runner pod restarts when using RunnerSet.
The implementation is composed of two new controllers, `runnerpersistentvolumeclaim-controller` and `runnerpersistentvolume-controller`.
It all starts from our existing `runnerset-controller`. The controller now tries to mark any PVCs created by StatefulSets created for the RunnerSet.
Once the controller terminated statefulsets, their corresponding PVCs are clean up by `runnerpersistentvolumeclaim-controller`, then PVs are unbound from their corresponding PVCs by `runnerpersistentvolume-controller` so that they can be reused by future PVCs createf for future StatefulSets that shares the same same StorageClass.
Ref #1286
* Update E2E test suite to cover runner, docker, and go caching with RunnerSet + PVs
Ref #1286
This is intended to fix#1369 mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might work well for RunnerDeployment in cases that #1395 does not work, like in a case that the user explicitly set the runner pod restart policy to anything other than "Never".
Ref #1369
This feature flag was provided from ARC to runner container automatically to let it use `--ephemeral` instead of `--once` by default. As the support for `--once` is being dropped from the runner image via #1384, we no longer need that.
Ref #1196
In #1373 we made two mistakes:
- We mistakenly checked if all the runner labels are included in the job labels and only after that it marked the target as eligible for scale. It should definitely be the opposite!
- We mistakenly checked for the existence of `self-hosted` labe l in the job. [Although it should be a good practice to explicitly say `runs-on: ["self-hosted", "custom-label"]`](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-using-labels-for-runner-selection), that's not a requirement so we should code accordingly.
The consequence of those two mistakes was that, for example, jobs with `self-hosted` + `custom` labels didn't result in scaling runner with `self-hosted` + `custom` + `custom2`. This should fix that.
Ref #1056
Ref #1373
Adds some unit tests for the runner pod generation logic that is used internally by runner deployment and runner set controllers as preparation for #1282