* Make webhook-based scale operation asynchronous
This prevents a race condition in the webhook-based autoscaler: when it received a webhook event while still processing another one, both could end up
scaling up the same horizontal runner autoscaler.
Ref #1321
* Fix typos
* Update rather than Patch HRA to avoid race among webhook-based autoscaler servers
* Batch capacity reservation updates for efficient use of apiserver
* Fix potential never-ending HRA update conflicts in batch update
* Extract batchScaler out of webhook-based autoscaler for testability
* Fix log levels and batch scaler hang on start
* Correlate webhook event with scale trigger amount in logs
* Fix log message
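The batching idea in the commits above can be sketched as follows. This is a minimal illustration, not the actual `batchScaler` code: the `scaleRequest` type, the `drainBatch` function, and the HRA names are hypothetical. It only shows how queued scale requests could be summed per HRA so that each HRA needs a single update against the apiserver.

```go
package main

import "fmt"

// scaleRequest is a hypothetical stand-in for one webhook-triggered scale operation.
type scaleRequest struct {
	hra    string // namespaced name of the target HRA (assumed format)
	amount int    // number of replicas to add
}

// drainBatch collects every request currently queued on ch without blocking
// and sums the amounts per HRA, so that each HRA needs only one update.
func drainBatch(ch <-chan scaleRequest) map[string]int {
	totals := make(map[string]int)
	for {
		select {
		case req, ok := <-ch:
			if !ok {
				// The channel was closed; return what has been collected so far.
				return totals
			}
			totals[req.hra] += req.amount
		default:
			// Nothing else is queued right now; the batch is complete.
			return totals
		}
	}
}

func main() {
	// Webhook handlers would enqueue and return immediately (the asynchronous part).
	queue := make(chan scaleRequest, 16)
	queue <- scaleRequest{hra: "default/hra-a", amount: 1}
	queue <- scaleRequest{hra: "default/hra-a", amount: 2}
	queue <- scaleRequest{hra: "default/hra-b", amount: 1}

	// A single worker would then apply one update per HRA.
	for hra, total := range drainBatch(queue) {
		fmt.Printf("update %s: +%d\n", hra, total)
	}
}
```

Because only one worker drains the queue, two webhook events for the same HRA are merged into one update instead of racing each other.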
This overhaul turns it into a shellcheck-valid script with explicit error handling for all possible situations I could think of. This change takes https://github.com/actions-runner-controller/actions-runner-controller/pull/1409 into account, and the two can be merged in any order. There are a few important changes to the logic here:
- The wait logic for checking whether docker comes up was fundamentally flawed because it checked for the PID. Docker will always come up and thus become visible in the process list, only to die immediately when it encounters an issue, after which supervisor starts it again. This means that our check so far was flaky: because of the `sleep 1`, it might or might not encounter a PID, and the existence of the PID does not mean anything. The `docker ps` check we have in the `entrypoint.sh` script does not suffer from this, as it checks for a feature of docker and not a PID. I thus removed the PID check entirely, and instead I am handing things over to our `entrypoint.sh` script by setting the environment variables correctly.
- This change affects the `docker0` interface MTU configuration, because the interface might or might not exist after we start docker. Hence, I changed this to a time-boxed loop that tries for one minute to set up the interface's MTU. If the command fails, we log an error and continue with the run.
- I changed the entire MTU handling to validate the value before configuring it. If it is set incorrectly, we log an error and continue without it. This ensures that we are not going to send our users on a bug hunt.
- The way we started supervisord did not make much sense to me. It sends itself into the background automatically, there is no need for us to do so with Bash.
The decision to not fail on errors but continue is a deliberate choice, because I believe that running a build is more important than having a perfectly configured system. However, this strategy might also hide issues for all users who are not properly checking their logs. It also makes testing harder. Hence, we could change all error conditions from graceful to panicking. We should then align the exit codes across `startup.sh` and `entrypoint.sh` to ensure that every possible error condition has its own unique error code for easy debugging.
* ci: align pipeline files and setups
* ci: more changes
* ci: various changes
* ci: fix setup-helm action ref
* ci: better pipeline name
* ci: more format aligning
* ci: more format aligning
* ci: better job name
* ci: supports multiple languages
* ci: better pipeline and job names
* ci: do a verb-noun thing for consistency
* ci: use 'arc' when talking holistically
* ci: add caching scope
* ci: put canary in a scope
* ci: fix syntax error
* ci: better pipeline and job names
* ci: better job name
Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
* Fix example manifests for webhook based scaling
I tried running these on my k8s cluster and I got some easy to fix errors, so I am committing them here.
* Fix example manifests for webhook autoscaling with workflow_jobs
* Fix the explanation on how to set up webhooks on your cluster
* Replace unclear comment with actual code examples
There was a comment instructing users to add minReplicas and
maxReplicas to all the HRA yamls, so I just removed it and added
these attributes to the yamls themselves for clarity.
* Make clear that using the ingress example is just a suggestion
* Apply some text improvements suggested by @mumoshu
* Update examples so the webhook server is exposed on a NodePort
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Remove an unnecessary field from one of the examples
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Apply suggestion from @mumoshu
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Remove namespace fields from webhook autoscaler examples
This change was suggested by @mumoshu
* Apply final suggestion from @mumoshu
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
A small improvement to our E2E test suite: you can now set `ARC_E2E_NO_CLEANUP=whatever` to skip the kind cluster cleanup after a successful test run, so that you can rerun the suite without waiting for a new kind cluster to come up.
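A minimal sketch of how such a switch could be checked, assuming any non-empty value of `ARC_E2E_NO_CLEANUP` disables the cleanup. The `noCleanup` helper and the printed messages are hypothetical and not taken from the actual test suite.

```go
package main

import (
	"fmt"
	"os"
)

// noCleanup reports whether the kind cluster cleanup should be skipped.
// Assumption: any non-empty value of ARC_E2E_NO_CLEANUP counts as "set".
// The lookup function is injected so the check can be tested without
// touching the real environment.
func noCleanup(getenv func(string) string) bool {
	return getenv("ARC_E2E_NO_CLEANUP") != ""
}

func main() {
	if noCleanup(os.Getenv) {
		fmt.Println("keeping the kind cluster for the next run")
		return
	}
	fmt.Println("deleting the kind cluster")
}
```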
* doc: Use RunnerSet to retain various caches
In relation to #1286 and as a follow-up for #1340
* docs: clarify client vs daemon
* docs: better wording
* Separate RunnerSet examples for docker image layer caching
* Revert changes on testdata as it is going to be added via #1471 instead
* Update README.md
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
* fixup! Update README.md
* Remove the outdated RunnerSet limitation
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
This adds the test to verify the runner pod generation logic for the case that you use a generic ephemeral volume as "work".
It is almost an adaptation of the test cases written for RunnerSet in #1471, to RunnerDeployment and Runner.
* fix: Avoid duplicate volume and mount name error for generic ephemeral volume as "work"
While manually testing the configurations being documented in #1464, I discovered that the use of a dynamic ephemeral volume for the "work" directory was not working correctly due to a validation error.
This fixes the runner pod generation logic to not add the default volume and volume mount for "work" dir, so that the error disappears.
Ref #1464
* e2e: Ensure the generic ephemeral volume for "work" works as expected
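The fix for the duplicate "work" volume can be illustrated with a small sketch. The `volume` type and `withDefaultWorkVolume` function are hypothetical simplifications, not the controller's real pod generation code, and the `emptyDir` default is an assumption made for the example. The point is only that the default is added when no user-defined volume named "work" exists.

```go
package main

import "fmt"

// volume is a hypothetical, simplified stand-in for a pod volume entry.
type volume struct {
	name   string
	source string // e.g. "emptyDir" or "ephemeral"; illustrative values only
}

// withDefaultWorkVolume appends the default "work" volume only when the user
// has not already declared a volume with that name, which avoids the
// duplicate-name validation error.
func withDefaultWorkVolume(vols []volume) []volume {
	for _, v := range vols {
		if v.name == "work" {
			// The user brought their own "work" volume; leave the list as-is.
			return vols
		}
	}
	// Assumption for this sketch: the default is an emptyDir-backed volume.
	return append(vols, volume{name: "work", source: "emptyDir"})
}

func main() {
	userDefined := []volume{{name: "work", source: "ephemeral"}}
	fmt.Println(withDefaultWorkVolume(userDefined))
	fmt.Println(withDefaultWorkVolume(nil))
}
```

The same name check would apply to the corresponding volume mount.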
As a part of #1298, I'm going to use Go fuzzing, which is available since Go 1.18.
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
Renamed the runner dockerfiles so that we have proper syntax highlighting for them, as well as a consistent way to map from the image name to the dockerfile. Added a `.dockerignore` file to avoid uploading things to the daemon that we never use.
* chart: Add extraPaths to Ingress of GitHub Webhook Server
* Update charts/actions-runner-controller/templates/githubwebhook.ingress.yaml
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Prefix the toYaml expression to remove the extra newline before extra paths
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
We had some dead code left over from the removal of registration runners. Registration runners were removed in #859#1207
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Enhance RunnerSet to optionally retain PVs across restarts
This is our initial attempt to bring back the ability to retain PVs across runner pod restarts when using RunnerSet.
The implementation is composed of two new controllers, `runnerpersistentvolumeclaim-controller` and `runnerpersistentvolume-controller`.
It all starts from our existing `runnerset-controller`. The controller now tries to mark any PVCs created by StatefulSets created for the RunnerSet.
Once the controller has terminated the StatefulSets, their corresponding PVCs are cleaned up by `runnerpersistentvolumeclaim-controller`, then the PVs are unbound from their corresponding PVCs by `runnerpersistentvolume-controller` so that they can be reused by future PVCs created for future StatefulSets that share the same StorageClass.
Ref #1286
* Update E2E test suite to cover runner, docker, and go caching with RunnerSet + PVs
Ref #1286
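The PV retention flow described above can be sketched roughly as follows. This is a simplified model, not the code of the new controllers: the `persistentVolume` type and `releaseForReuse` function are hypothetical, and it assumes that unbinding a PV amounts to clearing its reference to a PVC that has already been deleted.

```go
package main

import "fmt"

// persistentVolume is a hypothetical, simplified model of a PV.
type persistentVolume struct {
	name         string
	storageClass string
	claimRef     string // name of the bound PVC; empty when unbound
}

// releaseForReuse clears the claim reference of every PV whose PVC has been
// deleted and returns the names of the PVs that became reusable.
func releaseForReuse(pvs []persistentVolume, deletedClaims map[string]bool) []string {
	var released []string
	for i := range pvs {
		if pvs[i].claimRef != "" && deletedClaims[pvs[i].claimRef] {
			pvs[i].claimRef = "" // unbound: a future PVC of the same StorageClass can claim it
			released = append(released, pvs[i].name)
		}
	}
	return released
}

func main() {
	pvs := []persistentVolume{
		{name: "pv-1", storageClass: "fast", claimRef: "work-runner-0"},
		{name: "pv-2", storageClass: "fast", claimRef: "work-runner-1"},
	}
	// Pretend the PVC of the first runner pod was cleaned up after its
	// StatefulSet was terminated.
	deleted := map[string]bool{"work-runner-0": true}
	fmt.Println(releaseForReuse(pvs, deleted))
}
```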
Give up pinning deps with commit IDs: the resulting PRs were unreviewable due to missing changelogs, and a PR was sent for every commit to the master/main branch of the deps, which is undesired. We only need updates for tagged releases!