Commit Graph

188 Commits

Author SHA1 Message Date
Felipe Galindo Sanchez eff0c7364f
Merge branch 'master' into improve-logs 2022-02-28 09:25:30 -08:00
Felipe Galindo Sanchez 3abecd0f19 logging: improve logs for scaling 2022-02-23 08:29:13 -08:00
Yusuke Kuoka 5bc16f2619 Enhance HRA capacity reservation update log 2022-02-21 00:06:26 +00:00
Yusuke Kuoka b8e65aa857 Prevent unnecessary ephemeral runner recreations 2022-02-20 13:45:42 +00:00
Yusuke Kuoka a6f0e0008f Make unregistration timeout and retry delay configurable in integration tests 2022-02-20 12:05:34 +00:00
Yusuke Kuoka 79a31328a5 Stop recreating ephemeral runner pod
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/911#issuecomment-1046161384
2022-02-20 04:42:19 +00:00
Yusuke Kuoka 3c16188371 Introduce consistent timeouts for runner unregistration and runner pod deletion
Enhances runner controller and runner pod controller to have consistent timeouts for runner unregistration and runner pod deletion,
so that we are very much unlikely to terminate pods that are running any jobs.
2022-02-20 04:36:35 +00:00
Tingluo Huang 0b9bef2c08
Try to unconfig runner before deleting the pod to recreate (#1125)
There is a race condition between ARC and GitHub service about deleting runner pod.

- The ARC use REST API to find a particular runner in a pod that is not running any jobs, so it decides to delete the pod.
- A job is queued on the GitHub service side, and it sends the job to this idle runner right before ARC deletes the pod.
- The ARC delete the runner pod which cause the in-progress job to end up canceled.

To avoid this race condition, I am calling `r.unregisterRunner()` before deleting the pod.
- `r.unregisterRunner()` will return 204 to indicate the runner is deleted from the GitHub service, we should be safe to delete the pod.
- `r.unregisterRunner` will return 400 to indicate the runner is still running a job, so we will leave this runner pod as it is.

TODO: I need to do some E2E tests to force the race condition to happen.

Ref #911
2022-02-19 21:22:31 +09:00
Yusuke Kuoka a5ed6bd263
Fix RunerSet managed runner pods to terminate more gracefully (#1126)
Make RunnerSet-managed runners as reliable as RunnerDeployment-managed runners.

Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/911#issuecomment-1042404460
2022-02-19 21:19:37 +09:00
Yusuke Kuoka 921f547200
fix: Do recreate runner pod on registration token update (#1087)
Apparently, we've been missed taking an updated registration token into account when generating the pod template hash which is used to detect if the runner pod needs to be recreated.

This shouldn't have been the end of the world since the runner pod is recreated on the next reconciliation loop anyway, but this change will make the pod recreation happen one reconciliation loop earlier so that you're less likely to get runner pods with outdated refresh tokens.

Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1085#issuecomment-1027433365
2022-02-19 21:18:00 +09:00
Yusuke Kuoka fcf4778bac Fix regression that prevented default organizational runner group from being scale target
Fixes #1131
2022-02-19 14:43:41 +09:00
Yusuke Kuoka e22d981d58
githubwebhookserver: Tweak log levels of various messages (#1123)
Some of logs like `HRA keys indexed for HRA` were so excessive that it made testing and debugging the githubwebhookserver harder. This tries to fix that.
2022-02-17 09:15:26 +09:00
Felipe Galindo Sanchez d0d316252e
Option to consider runner group visibility on scale based on webhook (#1062)
This will work on GHES but GitHub Enterprise Cloud due to excessive GitHub API calls required.
More work is needed, like adding a cache layer to the GitHub client, to make it usable on GitHub Enterprise Cloud.

Fixes additional cases from https://github.com/actions-runner-controller/actions-runner-controller/pull/1012

If GitHub auth is provided in the webhooks controller then runner groups with custom visibility are supported. Otherwise, all runner groups will be assumed to be visible to all repositories

`getScaleUpTargetWithFunction()` will check if there is an HRA available with the following flow:

1. Search for **repository** HRAs - if so it ends here
2. Get available HRAs in k8s
3. Compute visible runner groups
  a. If GitHub auth is provided - get all the runner groups that are visible to the repository of the incoming webhook using GitHub API calls.  
  b. If GitHub auth is not provided - assume all runner groups are visible to all repositories
4. Search for **default organization** runners (a.k.a runners from organization's visible default runner group) with matching labels
5. Search for **default enterprise** runners (a.k.a runners from enterprise's visible default runner group) with matching labels
6. Search for **custom organization runner groups** with matching labels
7. Search for **custom enterprise runner groups** with matching labels

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-02-16 19:08:56 +09:00
Daniel 8a73560dbc
if a Volume is defined by the operator don't add another "work" volume. (#1015)
This allows providing a different `work` Volume.

This should be a cloud agnostic way of allowing the operator to use (for example) NVME backed storage.

This is a working example where the workDir will use the provided volume, additionally here docker is placed on the same NVME.
```
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runner-2
spec:
  template:
    spec:
      dockerdContainerResources: {}
      env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      # this is to mount the docker in docker onto NVME disk
      dockerVolumeMounts:
      - mountPath: /var/lib/docker
        name: scratch
        subPathExpr: $(POD_NAME)-docker
      - mountPath: /runner/_work
        name: work
        subPathExpr: $(POD_NAME)-work
      volumeMounts:
      - mountPath: /runner/_work
        name: work
        subPathExpr: $(POD_NAME)-work
      dockerEnv:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      volumes:
      - hostPath:
          path: /mnt/disks/ssd0
        name: scratch
      - hostPath:
          path: /mnt/disks/ssd0
        name: work
      nodeSelector:
       cloud.google.com/gke-nodepool: runner-16-with-nvme
      ephemeral: false
      image: ""
      imagePullPolicy: Always
      labels:
        - runner-2
        - self-hosted
      organization: yourorganization

```
2022-01-07 10:01:40 +09:00
Yusuke Kuoka 01301d3ce8
Stop creating registration-only runners on scale-to-zero (#1028)
Resolves #859
2022-01-07 09:56:21 +09:00
Hyeonmin Park 1a6e5719c3
test: Add tests with self-hosted label for #953 (#1030) 2022-01-07 08:50:26 +09:00
Callum Tait ad48851dc9
feat: expose if docker is enabled and wait for docker to be ready (#962)
Resolves #897
Resolves #915
2021-12-29 10:23:35 +09:00
Lars Haugan c5950d75fa
fix: pagination for ListWorkflowJobs in autoscaler (#990) (#992)
Adding handling of paginated results when calling `ListWorkflowJobs`. By default the `per_page` is 30, which potentially would return 30 queued and 30 in_progress jobs.

This change should enable the autoscaler to scale workflows with more than 60 jobs to the exact number of runners needed.

Problem: I did not find any support for pagination in the Github fake client, and have not been able to test this (as I have not been able to push an image to an environment where I can verify this).
If anyone is able to help out verifying this PR, i would really appreciate it.

Resolves #990
2021-12-24 09:12:36 +09:00
Felipe Galindo Sanchez 608c56936e
Remove duplicate self-hosted condition (#1016)
Duplicate condition caused after merge of #953 and #1012
2021-12-21 09:08:21 +09:00
Felipe Galindo Sanchez 4ebec38208
Support runner groups with selected visibility in webhooks autoscaler (#1012)
The current implementation doesn't support yet runner groups with custom visibility (e.g selected repositories only). If there are multiple runner groups with selected visibility - not all runner groups may be a potential target to be scaled up. Thus this PR introduces support to allow having runner groups with selected visibility. This requires to query GitHub API to find what are the potential runner groups that are linked to a specific repository (whether using visibility all or selected).

This also improves resolving the `scaleTargetKey` that are used to match an HRA based on the inputs of the `RunnerSet`/`RunnerDeployment` spec to better support for runner groups.

This requires to configure github auth in the webhook server, to keep backwards compatibility if github auth is not provided to the webhook server, this will assume all runner groups have no selected visibility and it will target any available runner group as before
2021-12-19 18:29:44 +09:00
clement-loiselet-talend 0c34196d87
fix(#951): add exception for self-hosted label in webhook search (#953)
The webhook "workflowJob" pass the labels the job needs to the controller, who in turns search for them in its RunnerDeployment / RunnerSet. The current implementation ignore the search for `self-hosted` if this is the only label, however if multiple labels are found the `self-hosted` label must be declared explicitely or the RD / RS will not be selected for the autoscaling.

This PR fixes the behavior by ignoring this label, and add documentation on this webhook for the other labels that will still require an explicit declaration (OS and architecture). 

The exception should be temporary, ideally the labels implicitely created (self-hosted, OS, architecture) should be searchable alongside the explicitly declared labels.

code tested, work with `["self-hosted"]` and `["self-hosted","anotherLabel"]`

Fixes #951
2021-12-19 10:55:23 +09:00
renovate[bot] c64000e11c
fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.0 (#740)
* fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.0

* Fix dependencies and bump Go to 1.17 so that it builds after controller-runtime 0.11.0 upgrade

* Regenerate manifests with the latest K8s dependencies

Co-authored-by: Renovate Bot <bot@renovateapp.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-12-17 09:06:55 +09:00
Felipe Galindo Sanchez 9bb21aef1f
Add support for default image pull secret name (#921)
Resolves #896

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-12-15 09:29:31 +09:00
Pavel Smalenski 91102c8088
Add dockerEnv variable for RunnerDeployment (#912)
Resolves #878

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-12-14 17:13:24 +09:00
Felipe Galindo Sanchez f0fccc020b
refactor: split Reconciler from Reconcile in a few methods (#926)
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-12-12 14:22:55 +09:00
Patrick Ellis ea2dbc2807
Update go-github from v37 -> v39 (#925) 2021-12-11 21:43:40 +09:00
Yusuke Kuoka 898ad3c355
Work-around for offline+busy runners (#993)
Ref #911
2021-12-09 09:31:06 +09:00
Max N. Boyarov 88b8871830
Reduce number of http superfluous messages (#894)
write to http.ResponseWriter create HTTP OK response, so set *ok* to disable error code in defered function
2021-11-09 09:07:07 +09:00
Yusuke Kuoka 2191617eb5
Remove unnecessary scale-target-not-found error on in_progress workflow_job event (#927)
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/877#issuecomment-955614456
2021-11-09 09:05:50 +09:00
Yusuke Kuoka b305e38b17
Add webhook-based autoscale for Enterprise runners (#906)
Fixes #892
2021-11-09 09:04:19 +09:00
apr-1985 0d3de9ee2a
chore: correct logging typo (#904) 2021-10-22 09:03:23 +09:00
Maxim Pogozhiy fce7d6d2a7
Add topologySpreadConstraints (#814) 2021-10-17 21:49:44 +01:00
Callum Tait 5805e39e1f
Revert "feat: adding workflow_dispatch webhook event" (#879)
This reverts commit d36d47fe66.
2021-10-09 18:36:02 +01:00
Callum d36d47fe66 feat: adding workflow_dispatch webhook event 2021-10-09 10:07:07 +01:00
Aidan fccf29970b
Fix bug related to label matching. (#852)
* Fix bug related to label matching.

Add start of test framework for Workflow Job Events

Signed-off-by: Aidan Jensen <aidan@artificial.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-09-30 11:02:59 +09:00
Alex Kulikovskikh ea06001819
fix: scaling issue based on `workflow_job` event (#850)
This PR fix scaling issue based on `workflow_job` event discussed in #819
2021-09-30 10:36:59 +09:00
Rob Bos 3f331e9a39
Fixing capitalization and a typo (#838)
* Fixing capitalization and a typo

* typo

* Typo

* Update controllers/autoscaling.go

* Update controllers/autoscaling.go

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-09-26 14:34:55 +09:00
Tristan Keen 5e3f89bdc5
Correct test to append docker container (#837)
Fixes #835
2021-09-24 09:18:20 +09:00
Tristan Keen 1eb135cace Correct default image logic 2021-09-14 17:00:57 +09:00
Tarasovych 7008b0c257
feat: Organization RunnerDeployment with webhook-based autoscaling only for certain repositories (#766)
Resolves #765

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-08-31 09:46:36 +09:00
Yusuke Kuoka fabead8c8e
feat: Workflow job based ephemeral runner scaling (#721)
This add support for two upcoming enhancements on the GitHub side of self-hosted runners, ephemeral runners, and `workflow_jow` events. You can't use these yet.

**These features are not yet generally available to all GitHub users**. Please take this pull request as a preparation to make it available to actions-runner-controller users as soon as possible after GitHub released the necessary features on their end.

**Ephemeral runners**:

The former, ephemeral runners, is basically the reliable alternative to `--once`, which we've been using when you enabled `ephemeral: true` (default in actions-runner-controller).

`--once` has been suffering from a race issue #466. `--ephemeral` fixes that.

To enable ephemeral runners with `actions/runner`, you give `--ephemeral` to `config.sh`. This updated version of `actions-runner-controller` does it for you, by using `--ephemeral` instead of `--once` when you set `RUNNER_FEATURE_FLAG_EPHEMERAL=true`.

Please read the section `Ephemeral Runners` in the updated version of our README for more information.

Note that ephemeral runners is not released on GitHub yet. And `RUNNER_FEATURE_FLAG_EPHEMERAL=true` won't work at all until the feature gets released on GitHub. Stay tuned for an announcement from GitHub!

**`workflow_job` events**:

`workflow_job` is the additional webhook event that corresponds to each GitHub Actions workflow job run. It provides `actions-runner-controller` a solid foundation to improve our webhook-based autoscale.

Formerly, we've been exploiting webhook events like `check_run` for autoscaling. However, as none of our supported events has included `labels`, you had to configure an HRA to only match relevant `check_run` events. It wasn't trivial.

In contrast, a `workflow_job` event payload contains `labels` of runners requested. `actions-runner-controller` is able to automatically decide which HRA to scale by filtering the corresponding RunnerDeployment by `labels` included in the webhook payload. So all you need to use webhook-based autoscale will be to enable `workflow_job` on GitHub and expose actions-runner-controller's webhook server to the internet.

Note that the current implementation of `workflow_job` support works in two ways, increment, and decrement. An increment happens when the webhook server receives` workflow_job` of `queued` status. A decrement happens when it receives `workflow_job` of `completed` status. The latter is used to make scaling-down faster so that you waste money less than before. You still don't suffer from flapping, as a scale-down is still subject to `scaleDownDelaySecondsAfterScaleOut `.

Please read the section `Example 3: Scale on each `workflow_job` event` in the updated version of our README for more information on its usage.
2021-08-11 09:52:04 +09:00
Rolf Ahrenberg 14564c7b8e
Allow disabling /runner emptydir mounts and setting storage volume (#674)
* Allow disabling /runner emptydir mounts

* Support defining storage medium for emptydirs

* Fix typos
2021-07-15 06:29:58 +09:00
Sebastien Le Digabel 7f2795b5d6
Adding a default docker registry mirror (#689)
* Adding a default docker registry mirror

This change allows the controller to start with a specified default
docker registry mirror and avoid having to specify it in all the runner*
objects.

The change is backward compatible, if a runner has a docker registry
mirror specified, it will supersede the default one.
2021-07-15 06:20:08 +09:00
Yusuke Kuoka f858e2e432
Add POC of GitHub Webhook Delivery Forwarder (#682)
* Add POC of GitHub Webhook Delivery Forwarder

* multi-forwarder and ctrl-c existing and fix for non-woring http post

* Rename source files

* Extract signal handling into a dedicated source file

* Faster ctrl-c handling

* Enable automatic creation of repo hook on startup

* Add support for forwarding org hook deliveries

* Set hook secret on hook creation via envvar (HOOK_SECRET)

* Fix org hook support

* Fix HOOK_SECRET for consistency

* Refactor to prepare for custom log position provider

* Refactor to extract inmemory log position provider

* Add configmap-based log position provider

* Rename githubwebhookdeliveryforwarder to hookdeliveryforwarder

* Refactor to rename LogPositionProvider to Checkpointer and extract ConfigMap checkpointer into a dedicated pkg

* Refactor to extract logger initialization

* Add hookdeliveryforwarder README and bump go-github to unreleased ver
2021-07-14 10:18:55 +09:00
Yusuke Kuoka 6f130c2db5
Fix dockerdWithinRunnerContainer for Runner(Deployment) not working in the main branch (#696)
Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/674#issuecomment-878600993
2021-07-13 18:14:15 +09:00
Yusuke Kuoka f19e7ea8a8
chore: Upgrade go-github to v36 (#681) 2021-07-04 17:43:52 +09:00
Yusuke Kuoka acb906164b
RunnerSet: Automatic-recovery from registration timeout and deregistration on pod termination (#652)
Ref #629
Ref #613
Ref #612
2021-06-24 20:39:37 +09:00
Yusuke Kuoka 98da4c2adb
Add HRA support for RunnerSet (#647)
`HRA.Spec.ScaleTargetRef.Kind` is added to denote that the scale-target is a RunnerSet.

It defaults to `RunnerDeployment` for backward compatibility.

```
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: myhra
spec:
  scaleTargetRef:
    kind: RunnerSet
    name: myrunnerset
```

Ref #629
Ref #613
Ref #612
2021-06-23 20:25:03 +09:00
Yusuke Kuoka 8b90b0f0e3
Clean up import list (#645)
Resolves #644
2021-06-22 17:55:06 +09:00
Jonathan Gonzalez V a277489003
Added support to enable and disable enableServiceLinks. (#628)
This option expose internally some `KUBERNETES_*` environment variables
that doesn't allow the runner to use KinD (Kubernetes in Docker) since it will
try to connect to the Kubernetes cluster where the runner it's running.

This option it's set by default to `true` in any Kubernetes deployment.

Signed-off-by: Jonathan Gonzalez V <jonathan.gonzalez@enterprisedb.com>
2021-06-22 17:27:26 +09:00