Enhances the runner controller and runner pod controller to use consistent timeouts for runner unregistration and runner pod deletion, so that we are much less likely to terminate pods that are running jobs.
There is a race condition between ARC and the GitHub service when deleting a runner pod:
- ARC uses the REST API to see that a particular runner pod is not running any jobs, so it decides to delete the pod.
- A job is queued on the GitHub service side, and the service sends the job to this idle runner right before ARC deletes the pod.
- ARC deletes the runner pod, which causes the now in-progress job to end up canceled.
To avoid this race condition, I am calling `r.unregisterRunner()` before deleting the pod (see the sketch after this list):
- If `r.unregisterRunner()` returns 204, the runner has been deleted from the GitHub service, so we should be safe to delete the pod.
- If `r.unregisterRunner()` returns 400, the runner is still running a job, so we leave the runner pod as it is.
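A minimal sketch of the intended flow, not the actual controller code: `RunnerReconciler`, `unregisterRunner`, and the requeue delay are simplified stand-ins, and the 204/400 responses are assumed to surface as the boolean/error pair shown here.
```
import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

// Sketch only: unregister the runner from GitHub first, and delete the pod
// only once GitHub has confirmed the runner is gone.
func (r *RunnerReconciler) unregisterThenDelete(ctx context.Context, pod *corev1.Pod, enterprise, org, repo, name string) (ctrl.Result, error) {
	ok, err := r.unregisterRunner(ctx, enterprise, org, repo, name)
	if err != nil {
		// GitHub answered 400: the runner is still running a job.
		// Leave the pod as it is and retry on a later reconciliation.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	if ok {
		// GitHub answered 204: the runner no longer exists on the service side,
		// so no new job can be routed to it and the pod is safe to delete.
		if err := r.Delete(ctx, pod); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
```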
TODO: I need to do some E2E tests to force the race condition to happen.
Ref #911
Apparently, we had missed taking an updated registration token into account when generating the pod template hash that is used to detect whether the runner pod needs to be recreated.
This shouldn't have been the end of the world, since the runner pod is recreated on the next reconciliation loop anyway, but this change makes the pod recreation happen one reconciliation loop earlier, so you're less likely to get runner pods with outdated registration tokens.
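A minimal sketch of the idea, assuming an FNV-style template hash; the function and parameter names are illustrative, not ARC's exact implementation.
```
import (
	"fmt"
	"hash/fnv"
)

// Sketch: include the registration token in the hash input so that a refreshed
// token changes the hash and triggers recreation of the runner pod.
func podTemplateHash(templateSpec []byte, registrationToken string) string {
	h := fnv.New32a()
	h.Write(templateSpec)
	h.Write([]byte(registrationToken)) // previously omitted from the hash input
	return fmt.Sprintf("%x", h.Sum32())
}
```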
Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1085#issuecomment-1027433365
Some logs, like `HRA keys indexed for HRA`, were so excessive that they made testing and debugging the githubwebhookserver harder. This tries to fix that.
This will work on GHES but not on GitHub Enterprise Cloud, due to the excessive number of GitHub API calls required.
More work is needed, like adding a cache layer to the GitHub client, to make it usable on GitHub Enterprise Cloud.
Fixes additional cases from https://github.com/actions-runner-controller/actions-runner-controller/pull/1012
If GitHub auth is provided in the webhooks controller, then runner groups with custom visibility are supported. Otherwise, all runner groups are assumed to be visible to all repositories.
`getScaleUpTargetWithFunction()` checks whether there is an available HRA using the following flow (see the sketch after this list):
1. Search for **repository** HRAs - if one matches, the search ends here
2. Get available HRAs in k8s
3. Compute visible runner groups
a. If GitHub auth is provided - get all the runner groups that are visible to the repository of the incoming webhook using GitHub API calls.
b. If GitHub auth is not provided - assume all runner groups are visible to all repositories
4. Search for **default organization** runners (a.k.a. runners from the organization's visible default runner group) with matching labels
5. Search for **default enterprise** runners (a.k.a. runners from the enterprise's visible default runner group) with matching labels
6. Search for **custom organization runner groups** with matching labels
7. Search for **custom enterprise runner groups** with matching labels
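A rough sketch of that lookup order; the `HRA` type and every helper function here are illustrative placeholders, not the webhook server's actual code.
```
// Illustrative only: helpers and types do not exist under these names in ARC.
func getScaleUpTarget(hras []HRA, repo, org, enterprise string, jobLabels []string, githubAuthConfigured bool) *HRA {
	// 1. A repository-scoped HRA wins outright.
	if t := matchRepositoryHRA(hras, repo, jobLabels); t != nil {
		return t
	}
	// 2-3. Determine which runner groups are visible to this repository.
	var visibleGroups []string
	if githubAuthConfigured {
		visibleGroups = queryVisibleRunnerGroups(org, enterprise, repo) // GitHub API calls
	} else {
		visibleGroups = allRunnerGroups(hras) // no auth: assume every group is visible
	}
	// 4. Default organization runners with matching labels.
	if t := matchDefaultOrgRunners(hras, org, jobLabels); t != nil {
		return t
	}
	// 5. Default enterprise runners with matching labels.
	if t := matchDefaultEnterpriseRunners(hras, enterprise, jobLabels); t != nil {
		return t
	}
	// 6. Custom organization runner groups with matching labels.
	if t := matchCustomOrgRunnerGroups(hras, org, visibleGroups, jobLabels); t != nil {
		return t
	}
	// 7. Custom enterprise runner groups with matching labels.
	return matchCustomEnterpriseRunnerGroups(hras, enterprise, visibleGroups, jobLabels)
}
```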
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
This allows providing a different `work` volume.
This should be a cloud-agnostic way of allowing the operator to use (for example) NVMe-backed storage.
This is a working example where the workDir uses the provided volume; additionally, Docker is placed on the same NVMe disk here.
```
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runner-2
spec:
  template:
    spec:
      dockerdContainerResources: {}
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      # this is to mount the docker in docker onto NVME disk
      dockerVolumeMounts:
        - mountPath: /var/lib/docker
          name: scratch
          subPathExpr: $(POD_NAME)-docker
        - mountPath: /runner/_work
          name: work
          subPathExpr: $(POD_NAME)-work
      volumeMounts:
        - mountPath: /runner/_work
          name: work
          subPathExpr: $(POD_NAME)-work
      dockerEnv:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      volumes:
        - hostPath:
            path: /mnt/disks/ssd0
          name: scratch
        - hostPath:
            path: /mnt/disks/ssd0
          name: work
      nodeSelector:
        cloud.google.com/gke-nodepool: runner-16-with-nvme
      ephemeral: false
      image: ""
      imagePullPolicy: Always
      labels:
        - runner-2
        - self-hosted
      organization: yourorganization
```
Adds handling of paginated results when calling `ListWorkflowJobs`. By default, `per_page` is 30, which means only the first 30 queued and 30 in_progress jobs would be returned.
This change should enable the autoscaler to scale workflows with more than 60 jobs to the exact number of runners needed.
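A sketch of the pagination loop, assuming the go-github client ARC uses; the wrapper function, its parameters, and the pinned module version are placeholders and may differ from what ARC actually vendors.
```
import (
	"context"

	"github.com/google/go-github/v39/github" // version may differ from what ARC vendors
)

// Sketch: collect every page of workflow jobs instead of only the first 30.
func listAllWorkflowJobs(ctx context.Context, client *github.Client, owner, repo string, runID int64) ([]*github.WorkflowJob, error) {
	opts := &github.ListWorkflowJobsOptions{
		ListOptions: github.ListOptions{PerPage: 100},
	}
	var allJobs []*github.WorkflowJob
	for {
		jobs, resp, err := client.Actions.ListWorkflowJobs(ctx, owner, repo, runID, opts)
		if err != nil {
			return nil, err
		}
		allJobs = append(allJobs, jobs.Jobs...)
		if resp.NextPage == 0 {
			break // no more pages
		}
		opts.Page = resp.NextPage
	}
	return allJobs, nil
}
```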
Problem: I did not find any support for pagination in the GitHub fake client, and I have not been able to test this (as I have not been able to push an image to an environment where I can verify it).
If anyone is able to help out verifying this PR, I would really appreciate it.
Resolves #990
The current implementation doesn't yet support runner groups with custom visibility (e.g. selected repositories only). If there are multiple runner groups with selected visibility, not all of them may be a potential target to be scaled up. This PR therefore introduces support for runner groups with selected visibility, which requires querying the GitHub API to find the runner groups that are linked to a specific repository (whether their visibility is all or selected).
This also improves resolving the `scaleTargetKey` used to match an HRA based on the inputs of the `RunnerSet`/`RunnerDeployment` spec, to better support runner groups.
This requires configuring GitHub auth in the webhook server. To keep backwards compatibility, if GitHub auth is not provided to the webhook server, it assumes no runner group has selected visibility and targets any available runner group as before.
The webhook "workflowJob" pass the labels the job needs to the controller, who in turns search for them in its RunnerDeployment / RunnerSet. The current implementation ignore the search for `self-hosted` if this is the only label, however if multiple labels are found the `self-hosted` label must be declared explicitely or the RD / RS will not be selected for the autoscaling.
This PR fixes the behavior by ignoring this label, and add documentation on this webhook for the other labels that will still require an explicit declaration (OS and architecture).
The exception should be temporary, ideally the labels implicitely created (self-hosted, OS, architecture) should be searchable alongside the explicitly declared labels.
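A minimal sketch of the matching rule described above; the function name is illustrative and not ARC's actual matcher.
```
import "strings"

// Sketch: "self-hosted" requested by the job is ignored, every other job label
// must be declared on the RunnerDeployment / RunnerSet.
func runnerLabelsCoverJob(runnerLabels, jobLabels []string) bool {
	declared := map[string]bool{}
	for _, l := range runnerLabels {
		declared[strings.ToLower(l)] = true
	}
	for _, l := range jobLabels {
		l = strings.ToLower(l)
		if l == "self-hosted" {
			continue // implicit label, never required to be declared explicitly
		}
		if !declared[l] {
			return false
		}
	}
	return true
}
```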
Code tested; it works with `["self-hosted"]` and `["self-hosted","anotherLabel"]`.
Fixes #951
* fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.0
* Fix dependencies and bump Go to 1.17 so that it builds after controller-runtime 0.11.0 upgrade
* Regenerate manifests with the latest K8s dependencies
Co-authored-by: Renovate Bot <bot@renovateapp.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Fix bug related to label matching.
Add start of test framework for Workflow Job Events
Signed-off-by: Aidan Jensen <aidan@artificial.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
This adds support for two upcoming enhancements on the GitHub side of self-hosted runners: ephemeral runners and `workflow_job` events. You can't use these yet.
**These features are not yet generally available to all GitHub users**. Please take this pull request as preparation to make them available to actions-runner-controller users as soon as possible after GitHub releases the necessary features on their end.
**Ephemeral runners**:
The former, ephemeral runners, is basically the reliable alternative to `--once`, which we've been using when you enabled `ephemeral: true` (default in actions-runner-controller).
`--once` has been suffering from a race issue #466. `--ephemeral` fixes that.
To enable ephemeral runners with `actions/runner`, you give `--ephemeral` to `config.sh`. This updated version of `actions-runner-controller` does it for you, by using `--ephemeral` instead of `--once` when you set `RUNNER_FEATURE_FLAG_EPHEMERAL=true`.
Please read the section `Ephemeral Runners` in the updated version of our README for more information.
Note that ephemeral runners are not released on GitHub yet, and `RUNNER_FEATURE_FLAG_EPHEMERAL=true` won't work at all until the feature is released on GitHub. Stay tuned for an announcement from GitHub!
**`workflow_job` events**:
`workflow_job` is the additional webhook event that corresponds to each GitHub Actions workflow job run. It provides `actions-runner-controller` a solid foundation to improve our webhook-based autoscale.
Formerly, we had been exploiting webhook events like `check_run` for autoscaling. However, as none of our supported events included `labels`, you had to configure an HRA to only match relevant `check_run` events, which wasn't trivial.
In contrast, a `workflow_job` event payload contains the `labels` of the runners requested. `actions-runner-controller` is able to automatically decide which HRA to scale by filtering the corresponding RunnerDeployment by the `labels` included in the webhook payload. So all you need for webhook-based autoscaling is to enable `workflow_job` on GitHub and expose actions-runner-controller's webhook server to the internet.
Note that the current implementation of `workflow_job` support works in two ways: increment and decrement. An increment happens when the webhook server receives a `workflow_job` event with `queued` status. A decrement happens when it receives a `workflow_job` event with `completed` status. The latter is used to make scaling down faster so that you waste less money than before. You still don't suffer from flapping, as a scale-down is still subject to `scaleDownDelaySecondsAfterScaleOut`.
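A minimal sketch of that decision, assuming go-github's `WorkflowJobEvent` type; the actual update of the HRA's capacity reservations is elided, and the module version is a placeholder.
```
import "github.com/google/go-github/v39/github" // version may differ from what ARC vendors

// Sketch: +1 on queued, -1 on completed, no change otherwise.
func capacityDelta(e *github.WorkflowJobEvent) int {
	switch e.GetAction() {
	case "queued":
		return +1 // a job is waiting for a runner: scale up
	case "completed":
		return -1 // a job finished: allow an earlier scale down
	default:
		return 0 // e.g. in_progress does not change the desired capacity
	}
}
```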
Please read the section "Example 3: Scale on each `workflow_job` event" in the updated version of our README for more information on its usage.
* Adding a default docker registry mirror
This change allows the controller to start with a specified default
docker registry mirror and avoid having to specify it in all the runner*
objects.
The change is backward compatible: if a runner has a docker registry
mirror specified, it will supersede the default one.
* Add POC of GitHub Webhook Delivery Forwarder
* multi-forwarder and ctrl-c exiting, and fix for non-working HTTP POST
* Rename source files
* Extract signal handling into a dedicated source file
* Faster ctrl-c handling
* Enable automatic creation of repo hook on startup
* Add support for forwarding org hook deliveries
* Set hook secret on hook creation via envvar (HOOK_SECRET)
* Fix org hook support
* Fix HOOK_SECRET for consistency
* Refactor to prepare for custom log position provider
* Refactor to extract inmemory log position provider
* Add configmap-based log position provider
* Rename githubwebhookdeliveryforwarder to hookdeliveryforwarder
* Refactor to rename LogPositionProvider to Checkpointer and extract ConfigMap checkpointer into a dedicated pkg
* Refactor to extract logger initialization
* Add hookdeliveryforwarder README and bump go-github to unreleased ver
`HRA.Spec.ScaleTargetRef.Kind` is added to denote that the scale-target is a RunnerSet.
It defaults to `RunnerDeployment` for backward compatibility.
```
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: myhra
spec:
  scaleTargetRef:
    kind: RunnerSet
    name: myrunnerset
```
Ref #629
Ref #613
Ref #612
This option internally exposes some `KUBERNETES_*` environment variables
that prevent the runner from using KinD (Kubernetes in Docker), since it will
try to connect to the Kubernetes cluster where the runner is running.
This option is set to `true` by default in any Kubernetes deployment.
Signed-off-by: Jonathan Gonzalez V <jonathan.gonzalez@enterprisedb.com>