* Enhance RunnerSet to optionally retain PVs across restarts
This is our initial attempt to bring back the ability to retain PVs across runner pod restarts when using RunnerSet.
The implementation is composed of two new controllers, `runnerpersistentvolumeclaim-controller` and `runnerpersistentvolume-controller`.
It all starts with our existing `runnerset-controller`. The controller now tries to mark any PVCs created by the StatefulSets it creates for the RunnerSet.
Once the controller terminates StatefulSets, their corresponding PVCs are cleaned up by `runnerpersistentvolumeclaim-controller`, and the PVs are then unbound from those PVCs by `runnerpersistentvolume-controller`, so that they can be reused by PVCs created for future StatefulSets that share the same StorageClass.
Ref #1286
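A minimal sketch of the PV release step, modeled with plain Go structs rather than the real Kubernetes API types. The label key, the type names, and `releaseVolume` are hypothetical; the behavior it mirrors is standard Kubernetes: a PV with the `Retain` reclaim policy moves to the `Released` phase once its PVC is deleted, and clearing its `spec.claimRef` lets it become `Available` for a new claim with a matching StorageClass.

```go
package main

// runnerSetLabel is a hypothetical label key marking PVs whose claims were
// created by a RunnerSet's StatefulSets.
const runnerSetLabel = "example.com/runnerset"

// ClaimRef identifies the PVC a PV is bound to (simplified).
type ClaimRef struct {
	Namespace, Name string
}

// PersistentVolume is a simplified stand-in for corev1.PersistentVolume.
type PersistentVolume struct {
	Name             string
	Labels           map[string]string
	StorageClassName string
	Phase            string // "Bound", "Released", or "Available"
	ClaimRef         *ClaimRef
}

// releaseVolume clears the claim reference of a released PV that belongs to
// a RunnerSet, so a future PVC with the same StorageClass can bind it.
// It reports whether the PV was changed.
func releaseVolume(pv *PersistentVolume) bool {
	if _, ok := pv.Labels[runnerSetLabel]; !ok {
		return false // not managed by a RunnerSet; leave it alone
	}
	if pv.Phase != "Released" || pv.ClaimRef == nil {
		return false // still bound, or already available
	}
	pv.ClaimRef = nil
	// In a real cluster the PV controller moves the volume to Available
	// once claimRef is cleared; the sketch simulates that transition.
	pv.Phase = "Available"
	return true
}
```

Scoping the release to labeled PVs keeps the sketch from touching volumes that belong to unrelated workloads.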
* Update E2E test suite to cover runner, docker, and go caching with RunnerSet + PVs
Ref #1286
This feature flag was provided automatically by ARC to the runner container so that it used `--ephemeral` instead of `--once` by default. As support for `--once` is being dropped from the runner image via #1384, the flag is no longer needed.
Ref #1196
* runner: Remove the ability to use the deprecated `--once` flag
Ref #1196
* runner: Ability to opt out of using `--ephemeral`
Although we are going to eventually remove the ability to use the legacy `--once` flag as proposed in #1196, some folks might still be using legacy GHES versions 3.2 or earlier.
This commit removes the existing feature flag for opting in to `--ephemeral`, and adds another feature flag, `RUNNER_FEATURE_FLAG_ONCE`, for opting in to `--once`, so that folks stuck on legacy GHES versions can still use ARC.
With this change, every user uses `--ephemeral` by default. Anyone who sees issues on a legacy GHES instance can set `RUNNER_FEATURE_FLAG_ONCE=true` to keep using `--once`, which buys them one more ARC release to upgrade their GHES instance.
But beware, we won't support legacy GHES instances forever as it's going to be a maintenance nightmare. Please upgrade!
Ref #1196
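A minimal sketch of the opt-in logic described above, written in Go for illustration even though ARC's entrypoint is a shell script (entrypoint.sh). Only `RUNNER_FEATURE_FLAG_ONCE`, `--ephemeral`, and `--once` come from the text; the `runnerModeFlag` helper and passing the environment lookup as a parameter are assumptions.

```go
package main

// runnerModeFlag is a hypothetical helper that picks the flag used to make a
// runner exit after a single job. --ephemeral is the default; setting
// RUNNER_FEATURE_FLAG_ONCE=true opts back into the legacy --once flag for
// GHES 3.2 or earlier.
func runnerModeFlag(getenv func(string) string) string {
	if getenv("RUNNER_FEATURE_FLAG_ONCE") == "true" {
		return "--once"
	}
	return "--ephemeral"
}
```

In a real program the lookup would be `os.Getenv`; taking it as a parameter keeps the sketch testable without mutating the process environment.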
The unregistration timeout of 1 minute (or any fixed timeout, however long) can negatively impact the availability of static runners that constantly run workflow jobs, and of ephemeral runners that run long-running jobs.
We deal with that by removing the unregistration timeout entirely, so that regardless of the type of runner (static or ephemeral), it waits indefinitely until it has been successfully unregistered before being terminated.
* Add env variable to configure the `disableupdate` flag
* Write test for entrypoint disable update
* Rename flag, update docs for DISABLE_RUNNER_UPDATE
* chore: bump runner version in makefile
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
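A minimal sketch of how `DISABLE_RUNNER_UPDATE` could translate into the runner's `--disableupdate` configuration flag, assuming the value `true` enables it. The `configArgs` helper and its base arguments are hypothetical and illustrative; ARC's real entrypoint does this in shell.

```go
package main

// configArgs is a hypothetical helper that builds the arguments passed to
// config.sh. Only DISABLE_RUNNER_UPDATE and --disableupdate come from the
// change description; the base arguments are illustrative.
func configArgs(getenv func(string) string) []string {
	args := []string{"--unattended", "--replace"}
	if getenv("DISABLE_RUNNER_UPDATE") == "true" {
		// Ask the runner not to self-update, e.g. when the image pins a version.
		args = append(args, "--disableupdate")
	}
	return args
}
```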
I have a dedicated GitHub organization and a private repository to run this E2E test. After a few fixes included in this change, it has successfully passed.
This was something that was missing in #707.
Adding a new test to make sure the ephemeral feature flag from upstream
is set up correctly by the script.
The unit tests simulate a run of entrypoint.sh. They create dummy
config.sh and runsvc.sh scripts and make sure the logic behind
entrypoint.sh is correct.
Unfortunately, entrypoint.sh contains some sections that cannot be
mocked, so I had to put some logic in there too.
For now, the tests cover:
- the normal scenario
- the normal non-ephemeral scenario
- the configuration failure scenario
I also tested entrypoint.sh on a real runner, and it still works as expected.
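The three scenarios above fit a table-driven layout. This Go sketch is hypothetical: the real tests drive entrypoint.sh with dummy config.sh and runsvc.sh scripts, whereas here `simulateEntrypoint` stands in for that run, with the configuration step injected as a function so that a failure can be simulated.

```go
package main

import (
	"errors"
	"fmt"
)

// simulateEntrypoint is a hypothetical stand-in for running entrypoint.sh
// against a dummy config.sh: it builds the config arguments and passes them
// to configure, returning the arguments on success.
func simulateEntrypoint(ephemeral bool, configure func([]string) error) ([]string, error) {
	args := []string{"--unattended"}
	if ephemeral {
		args = append(args, "--ephemeral")
	}
	if err := configure(args); err != nil {
		return nil, fmt.Errorf("runner configuration failed: %w", err)
	}
	return args, nil
}

type scenario struct {
	name          string
	ephemeral     bool
	configErr     error
	wantEphemeral bool
	wantErr       bool
}

// scenarios mirrors the cases listed above.
var scenarios = []scenario{
	{name: "normal", ephemeral: true, wantEphemeral: true},
	{name: "normal non-ephemeral", ephemeral: false},
	{name: "configuration failure", ephemeral: true, configErr: errors.New("config.sh exited with status 1"), wantErr: true},
}

// runScenario reports whether the simulated run matched expectations.
func runScenario(s scenario) bool {
	args, err := simulateEntrypoint(s.ephemeral, func([]string) error { return s.configErr })
	if (err != nil) != s.wantErr {
		return false
	}
	hasEphemeral := false
	for _, a := range args {
		if a == "--ephemeral" {
			hasEphemeral = true
		}
	}
	return hasEphemeral == s.wantEphemeral
}
```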
Previously, the E2E test suite covered only RunnerSet. This refactors the existing E2E test code to extract the common test structure into an `env` struct and its methods, and uses it to write two very similar tests, one for RunnerSet and another for RunnerDeployment.
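A minimal sketch of the refactoring idea: common steps live on an `env` struct, and the RunnerSet and RunnerDeployment tests differ only in how the struct is configured. Apart from the `env` name, every field, method, and step here is hypothetical, and the steps only record what they would do.

```go
package main

import "fmt"

// env holds state shared by the E2E steps. The fields are hypothetical.
type env struct {
	runnerKind string   // "RunnerSet" or "RunnerDeployment"
	steps      []string // record of the steps executed, for illustration
}

func (e *env) installController() {
	e.steps = append(e.steps, "install actions-runner-controller")
}

func (e *env) deployRunners() {
	e.steps = append(e.steps, fmt.Sprintf("deploy %s", e.runnerKind))
}

func (e *env) verifyWorkflowRun() {
	e.steps = append(e.steps, "verify workflow run")
}

// runCommonSteps is the shared structure both tests reuse.
func runCommonSteps(kind string) []string {
	e := &env{runnerKind: kind}
	e.installController()
	e.deployRunners()
	e.verifyWorkflowRun()
	return e.steps
}
```

Each test function would construct its own `env` and call the same methods, so adding a third deployment option means adding a configuration, not a copy of the whole test.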
Enhances our existing E2E test suite to additionally support triggering two or more concurrent workflow jobs and verifying all of their results, so that you can ensure the runners managed by the controller handle jobs reliably under load.
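A minimal sketch of triggering several jobs concurrently and verifying every result. `runJob` is a hypothetical callback standing in for triggering a workflow job and reading back the test ID it produced; the sketch fans out with goroutines and a `sync.WaitGroup`, then checks that each job reported its own ID.

```go
package main

import (
	"fmt"
	"sync"
)

// verifyConcurrentJobs starts n jobs at once and returns an error listing
// every job that failed or reported an ID different from the one it was given.
func verifyConcurrentJobs(n int, runJob func(id string) (string, error)) error {
	type result struct {
		want, got string
		err       error
	}
	results := make([]result, n)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			id := fmt.Sprintf("test-%d", i)
			got, err := runJob(id)
			results[i] = result{want: id, got: got, err: err} // each goroutine writes only its own slot
		}(i)
	}
	wg.Wait()

	var failures []string
	for _, r := range results {
		if r.err != nil {
			failures = append(failures, fmt.Sprintf("%s: %v", r.want, r.err))
		} else if r.got != r.want {
			failures = append(failures, fmt.Sprintf("%s: got %q", r.want, r.got))
		}
	}
	if len(failures) > 0 {
		return fmt.Errorf("%d of %d jobs failed: %v", len(failures), n, failures)
	}
	return nil
}
```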
This enhances the E2E test suite introduced in #658 to also include the following steps:
- Install GitHub Actions workflow
- Trigger a workflow run via a git commit
- Verify the workflow run result
In the workflow, we use `kubectl create cm --from-literal` to create a configmap that contains a unique test ID. In the last step, the E2E test obtains the configmap and checks that the test ID matches the expected one.
To install a GitHub Actions workflow, we clone the GitHub repository denoted by the TEST_REPO envvar, programmatically generate a few files with some Go code, and run `git-add`, `git-commit`, and then `git-push` to actually push the files to the repository. A single commit containing an updated workflow definition and an updated file seems to trigger a workflow run based on the definition introduced in that commit, which was somewhat surprising but useful behaviour.
At this point, the E2E test fully covers all the steps for a GitHub token-based installation. We need to add scenarios for more deployment options, like GitHub App, RunnerDeployment, HRA, and so on, but each of them is worth its own pull request.
This is the initial version of our E2E test suite which is currently a subset of the acceptance test suite reimplemented in Go.
To run it, pass `-run ^TestE2E$` to `go test`, without `-short`, like `go test -timeout 600s -run ^TestE2E$ github.com/actions-runner-controller/actions-runner-controller/test/e2e -v`.
`make test` is modified to pass `-short` to `go test` by default to skip E2E tests.
The biggest benefit of rewriting the acceptance test in Go turned out to be that you can easily rerun each step (a `go test` subtest) individually from your IDE for faster turnaround. Both VS Code and IntelliJ IDEA/GoLand are known to work.
In the near future, we will add more steps to the suite, like actually git-committing some Actions workflow and pushing a commit to trigger a workflow run, verifying the workflow and job run results, and finally running it in our `test` workflow for fully automated E2E testing. But that's another story.
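A sketch of the structure that makes this work: the top-level `TestE2E` skips itself under `-short`, which is how `make test` leaves it out, and each step is a `t.Run` subtest. The step names and empty bodies are hypothetical placeholders.

```go
package main

import "testing"

// TestE2E is skipped by `go test -short`, so `make test` leaves it out.
func TestE2E(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping E2E tests in short mode")
	}

	// Each step is a subtest, so an IDE (or `go test -run`) can rerun one
	// step on its own. The step names below are placeholders.
	t.Run("install controller", func(t *testing.T) {
		// ...
	})
	t.Run("deploy runners", func(t *testing.T) {
		// ...
	})
	t.Run("verify results", func(t *testing.T) {
		// ...
	})
}
```

From the command line, a single step can be targeted with a pattern such as `-run '^TestE2E$/^deploy_runners$'`; `go test` replaces spaces in subtest names with underscores.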