Commit Graph

31 Commits

Author SHA1 Message Date
Felipe Galindo Sanchez d0d316252e
Option to consider runner group visibility on scale based on webhook (#1062)
This will work on GHES but GitHub Enterprise Cloud due to excessive GitHub API calls required.
More work is needed, like adding a cache layer to the GitHub client, to make it usable on GitHub Enterprise Cloud.

Fixes additional cases from https://github.com/actions-runner-controller/actions-runner-controller/pull/1012

If GitHub auth is provided in the webhooks controller then runner groups with custom visibility are supported. Otherwise, all runner groups will be assumed to be visible to all repositories

`getScaleUpTargetWithFunction()` will check if there is an HRA available with the following flow:

1. Search for **repository** HRAs - if so it ends here
2. Get available HRAs in k8s
3. Compute visible runner groups
  a. If GitHub auth is provided - get all the runner groups that are visible to the repository of the incoming webhook using GitHub API calls.  
  b. If GitHub auth is not provided - assume all runner groups are visible to all repositories
4. Search for **default organization** runners (a.k.a runners from organization's visible default runner group) with matching labels
5. Search for **default enterprise** runners (a.k.a runners from enterprise's visible default runner group) with matching labels
6. Search for **custom organization runner groups** with matching labels
7. Search for **custom enterprise runner groups** with matching labels

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-02-16 19:08:56 +09:00
Hyeonmin Park 1a6e5719c3
test: Add tests with self-hosted label for #953 (#1030) 2022-01-07 08:50:26 +09:00
Felipe Galindo Sanchez 608c56936e
Remove duplicate self-hosted condition (#1016)
Duplicate condition caused after merge of #953 and #1012
2021-12-21 09:08:21 +09:00
Felipe Galindo Sanchez 4ebec38208
Support runner groups with selected visibility in webhooks autoscaler (#1012)
The current implementation doesn't support yet runner groups with custom visibility (e.g selected repositories only). If there are multiple runner groups with selected visibility - not all runner groups may be a potential target to be scaled up. Thus this PR introduces support to allow having runner groups with selected visibility. This requires to query GitHub API to find what are the potential runner groups that are linked to a specific repository (whether using visibility all or selected).

This also improves resolving the `scaleTargetKey` that are used to match an HRA based on the inputs of the `RunnerSet`/`RunnerDeployment` spec to better support for runner groups.

This requires to configure github auth in the webhook server, to keep backwards compatibility if github auth is not provided to the webhook server, this will assume all runner groups have no selected visibility and it will target any available runner group as before
2021-12-19 18:29:44 +09:00
clement-loiselet-talend 0c34196d87
fix(#951): add exception for self-hosted label in webhook search (#953)
The webhook "workflowJob" pass the labels the job needs to the controller, who in turns search for them in its RunnerDeployment / RunnerSet. The current implementation ignore the search for `self-hosted` if this is the only label, however if multiple labels are found the `self-hosted` label must be declared explicitely or the RD / RS will not be selected for the autoscaling.

This PR fixes the behavior by ignoring this label, and add documentation on this webhook for the other labels that will still require an explicit declaration (OS and architecture). 

The exception should be temporary, ideally the labels implicitely created (self-hosted, OS, architecture) should be searchable alongside the explicitly declared labels.

code tested, work with `["self-hosted"]` and `["self-hosted","anotherLabel"]`

Fixes #951
2021-12-19 10:55:23 +09:00
Patrick Ellis ea2dbc2807
Update go-github from v37 -> v39 (#925) 2021-12-11 21:43:40 +09:00
Max N. Boyarov 88b8871830
Reduce number of http superfluous messages (#894)
write to http.ResponseWriter create HTTP OK response, so set *ok* to disable error code in defered function
2021-11-09 09:07:07 +09:00
Yusuke Kuoka 2191617eb5
Remove unnecessary scale-target-not-found error on in_progress workflow_job event (#927)
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/877#issuecomment-955614456
2021-11-09 09:05:50 +09:00
Yusuke Kuoka b305e38b17
Add webhook-based autoscale for Enterprise runners (#906)
Fixes #892
2021-11-09 09:04:19 +09:00
Callum Tait 5805e39e1f
Revert "feat: adding workflow_dispatch webhook event" (#879)
This reverts commit d36d47fe66.
2021-10-09 18:36:02 +01:00
Callum d36d47fe66 feat: adding workflow_dispatch webhook event 2021-10-09 10:07:07 +01:00
Aidan fccf29970b
Fix bug related to label matching. (#852)
* Fix bug related to label matching.

Add start of test framework for Workflow Job Events

Signed-off-by: Aidan Jensen <aidan@artificial.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-09-30 11:02:59 +09:00
Alex Kulikovskikh ea06001819
fix: scaling issue based on `workflow_job` event (#850)
This PR fix scaling issue based on `workflow_job` event discussed in #819
2021-09-30 10:36:59 +09:00
Yusuke Kuoka fabead8c8e
feat: Workflow job based ephemeral runner scaling (#721)
This add support for two upcoming enhancements on the GitHub side of self-hosted runners, ephemeral runners, and `workflow_jow` events. You can't use these yet.

**These features are not yet generally available to all GitHub users**. Please take this pull request as a preparation to make it available to actions-runner-controller users as soon as possible after GitHub released the necessary features on their end.

**Ephemeral runners**:

The former, ephemeral runners, is basically the reliable alternative to `--once`, which we've been using when you enabled `ephemeral: true` (default in actions-runner-controller).

`--once` has been suffering from a race issue #466. `--ephemeral` fixes that.

To enable ephemeral runners with `actions/runner`, you give `--ephemeral` to `config.sh`. This updated version of `actions-runner-controller` does it for you, by using `--ephemeral` instead of `--once` when you set `RUNNER_FEATURE_FLAG_EPHEMERAL=true`.

Please read the section `Ephemeral Runners` in the updated version of our README for more information.

Note that ephemeral runners is not released on GitHub yet. And `RUNNER_FEATURE_FLAG_EPHEMERAL=true` won't work at all until the feature gets released on GitHub. Stay tuned for an announcement from GitHub!

**`workflow_job` events**:

`workflow_job` is the additional webhook event that corresponds to each GitHub Actions workflow job run. It provides `actions-runner-controller` a solid foundation to improve our webhook-based autoscale.

Formerly, we've been exploiting webhook events like `check_run` for autoscaling. However, as none of our supported events has included `labels`, you had to configure an HRA to only match relevant `check_run` events. It wasn't trivial.

In contrast, a `workflow_job` event payload contains `labels` of runners requested. `actions-runner-controller` is able to automatically decide which HRA to scale by filtering the corresponding RunnerDeployment by `labels` included in the webhook payload. So all you need to use webhook-based autoscale will be to enable `workflow_job` on GitHub and expose actions-runner-controller's webhook server to the internet.

Note that the current implementation of `workflow_job` support works in two ways, increment, and decrement. An increment happens when the webhook server receives` workflow_job` of `queued` status. A decrement happens when it receives `workflow_job` of `completed` status. The latter is used to make scaling-down faster so that you waste money less than before. You still don't suffer from flapping, as a scale-down is still subject to `scaleDownDelaySecondsAfterScaleOut `.

Please read the section `Example 3: Scale on each `workflow_job` event` in the updated version of our README for more information on its usage.
2021-08-11 09:52:04 +09:00
Yusuke Kuoka f858e2e432
Add POC of GitHub Webhook Delivery Forwarder (#682)
* Add POC of GitHub Webhook Delivery Forwarder

* multi-forwarder and ctrl-c existing and fix for non-woring http post

* Rename source files

* Extract signal handling into a dedicated source file

* Faster ctrl-c handling

* Enable automatic creation of repo hook on startup

* Add support for forwarding org hook deliveries

* Set hook secret on hook creation via envvar (HOOK_SECRET)

* Fix org hook support

* Fix HOOK_SECRET for consistency

* Refactor to prepare for custom log position provider

* Refactor to extract inmemory log position provider

* Add configmap-based log position provider

* Rename githubwebhookdeliveryforwarder to hookdeliveryforwarder

* Refactor to rename LogPositionProvider to Checkpointer and extract ConfigMap checkpointer into a dedicated pkg

* Refactor to extract logger initialization

* Add hookdeliveryforwarder README and bump go-github to unreleased ver
2021-07-14 10:18:55 +09:00
Yusuke Kuoka f19e7ea8a8
chore: Upgrade go-github to v36 (#681) 2021-07-04 17:43:52 +09:00
Yusuke Kuoka 8b90b0f0e3
Clean up import list (#645)
Resolves #644
2021-06-22 17:55:06 +09:00
Yusuke Kuoka 9e4dbf497c
feat: RunnerSet backed by StatefulSet (#629)
* feat: RunnerSet backed by StatefulSet

Unlike a runner deployment, a runner set can manage a set of stateful runners by combining a statefulset and an admission webhook that mutates statefulset-managed pods with required envvars and registration tokens.

Resolves #613
Ref #612

* Upgrade controller-runtime to 0.9.0

* Bump Go to 1.16.x following controller-runtime 0.9.0

* Upgrade kubebuilder to 2.3.2 for updated etcd and apiserver following local setup

* Fix startup failure due to missing LeaderElectionID

* Fix the issue that any pods become unable to start once actions-runner-controller got failed after the mutating webhook has been registered

* Allow force-updating statefulset

* Fix runner container missing work and certs-client volume mounts and DOCKER_HOST and DOCKER_TLS_VERIFY envvars when dockerdWithinRunner=false

* Fix runnerset-controller not applying statefulset.spec.template.spec changes when there were no changes in runnerset spec

* Enable running acceptance tests against arbitrary kind cluster

* RunnerSet supports non-ephemeral runners only today

* fix: docker-build from root Makefile on intel mac

* fix: arch check fixes for mac and ARM

* ci: aligning test data format and patching checks

* fix: removing namespace in test data

* chore: adding more ignores

* chore: removing leading space in shebang

* Re-add metrics to org hra testdata

* Bump cert-manager to v1.1.1 and fix deploy.sh

Co-authored-by: toast-gear <15716903+toast-gear@users.noreply.github.com>
Co-authored-by: Callum James Tait <callum.tait@photobox.com>
2021-06-22 17:10:09 +09:00
Yusuke Kuoka 25f5817a5e Improve debug log in webhook-based autoscaling
Adds some helpful debug log messages I have used while verifying #534
2021-05-11 15:49:03 +09:00
Yusuke Kuoka f5c639ae28
Make webhook-based autoscaler github event logs more operator-friendly (#384)
Adds fields like `pullRequest.base.ref` and `checkRun.status` that are useful for verifying the autoscaling behaviour without browsing GitHub.
Ref https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-794175312
2021-03-10 09:40:44 +09:00
Yusuke Kuoka 1b8a656051 Use --watch-namespace flag to restrict the namespace to watch
Ref https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-793172995
2021-03-09 09:46:21 +09:00
Rob Whitby 1753fa3530
handle GET requests in webhook hra (#378) 2021-03-09 08:46:27 +09:00
Yusuke Kuoka 584590e97c
Use patch instead of update to alleviate HRA conflict on webhook (#358)
We sometimes see that integration test fails due to runner replicas not meeting the expected number in a timely manner. After investigating a bit, this turned out to be due to that HRA updates on webhook-based autoscaler and HRA controller are conflicting. This changes the controllers to use Patch instead of Update to make conflicts less likely to happen.

I have also updated the hra controller to use Patch when updating RunnerDeployment, too.

Overall, these changes should make the webhook-based autoscaling more reliable due to less conflicts.
2021-02-26 10:17:09 +09:00
Yusuke Kuoka e9eef04993
Fix old HRA capacity reservations not cleaned up (#354)
Similar to #348 for #346, but for HRA.Spec.CapacityReservations usually modified by the webhook-based autoscaler controller.
This patch tries to fix that by improving the webhook-based autoscaler controller to omit expired reservations on updating HRA spec.
2021-02-25 11:08:00 +09:00
Yusuke Kuoka 9890a90e69
Improve webhook-based autoscaler log (#352)
The controller had been writing confusing messages like the below on missing scale target:

```
Found too many scale targets: It must be exactly one to avoid ambiguity. Either set WatchNamespace for the webhook-based autoscaler to let it only find HRAs in the namespace, or update Repository or Organization fields in your RunnerDeployment resources to fix the ambiguity.{"scaleTargets": ""}
```

This fixes that, while improving many kinds of messages written while reconcilation, so that the error message is more actionable.
2021-02-25 10:07:41 +09:00
Yusuke Kuoka 991535e567
Fix panic on webhook for user-owned repository (#344)
* Fix panic on webhook for user-owned repository

Follow-up for #282 and #334
2021-02-23 08:05:25 +09:00
Hidetake Iwata b0e74bebab Fix index key to find HRA in GitHub webhook handler 2021-02-20 21:25:23 +09:00
Yusuke Kuoka ebc3970b84 Add integration test for autoscaling on check_run webhook event 2021-02-19 10:33:04 +09:00
Hidetake Iwata 1ddcf6946a
Fix nil pointer error on received check_run event (#331)
* Reproduce nil pointer error on received check_run event

* Fix nil pointer error on received check_run event
2021-02-18 20:22:36 +09:00
Yusuke Kuoka 7d024a6c05
Fix "duplicate metrics collector registration attempted" errors at startup (#317)
I have seen this error a lot in our integration test. It turned out due to https://github.com/kubernetes-sigs/controller-runtime/issues/484 and is being fixed with this change.
2021-02-16 18:51:33 +09:00
Yusuke Kuoka ab1c39de57
feat: HorizontalRunnerAutoscaler Webhook server (#282)
* feat: HorizontalRunnerAutoscaler Webhook server

This introduces a Webhook server that responds GitHub `check_run`, `pull_request`, and `push` events by scaling up matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting next sync period to add insufficient runners.

This feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. terraform-aws-github-runner can manage one set of runners per deployment, where actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.

On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up. The webhook server tries to match HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest] against the event and if it finds only one HRA, it is the scale target. If none or two or more targets are found for repository-wide runners, it does the same on organizational runners.

Changes:

* Fix integration test
* Update manifests
* chart: Add support for github webhook server
* dockerfile: Include github-webhook-server binary
* Do not import unversioned go-github
* Update README
2021-02-07 17:37:27 +09:00