In #1373 we made two mistakes:
- We mistakenly checked if all the runner labels are included in the job labels and only after that it marked the target as eligible for scale. It should definitely be the opposite!
- We mistakenly checked for the existence of `self-hosted` labe l in the job. [Although it should be a good practice to explicitly say `runs-on: ["self-hosted", "custom-label"]`](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-using-labels-for-runner-selection), that's not a requirement so we should code accordingly.
The consequence of those two mistakes was that, for example, jobs with `self-hosted` + `custom` labels didn't result in scaling runner with `self-hosted` + `custom` + `custom2`. This should fix that.
Ref #1056
Ref #1373
* Remove legacy GitHub API cache of HRA.Status.CachedEntries
We migrated to the transport-level cache introduced in #1127 so not only this is useless, it is harder to deduce which cache resulted in the desired replicas number calculated by HRA.
Just remove the legacy cache to keep it simple and easy to understand.
* Deprecate githubAPICacheDuration helm chart value and the --github-api-cache-duration as well
* Fix integration test
Adding handling of paginated results when calling `ListWorkflowJobs`. By default the `per_page` is 30, which potentially would return 30 queued and 30 in_progress jobs.
This change should enable the autoscaler to scale workflows with more than 60 jobs to the exact number of runners needed.
Problem: I did not find any support for pagination in the Github fake client, and have not been able to test this (as I have not been able to push an image to an environment where I can verify this).
If anyone is able to help out verifying this PR, i would really appreciate it.
Resolves#990
`HRA.Spec.ScaleTargetRef.Kind` is added to denote that the scale-target is a RunnerSet.
It defaults to `RunnerDeployment` for backward compatibility.
```
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
name: myhra
spec:
scaleTargetRef:
kind: RunnerSet
name: myrunnerset
```
Ref #629
Ref #613
Ref #612
`PercentageRunnersBusy`, in combination with a secondary `TotalInProgressAndQueuedWorkflowRuns` metric, enables scale-from-zero for PercentageRunnersBusy.
Please see the new `Autoscaling to/from 0` section in the updated documentation about how it works.
Resolves#522
As part of #282, I have introduced some caching mechanism to avoid excessive GitHub API calls due to the autoscaling calculation involving GitHub API calls is executed on each Webhook event.
Apparently, it was saving the wrong value in the cache- The value was one after applying `HRA.Spec.{Max,Min}Replicas` so manual changes to {Max,Min}Replicas doesn't affect RunnerDeployment.Spec.Replicas until the cache expires. This isn't what I had wanted.
This patch fixes that, by changing the value being cached to one before applying {Min,Max}Replicas.
Additionally, I've also updated logging so that you observe which number was fetched from cache, and what number was suggested by either TotalNumberOfQueuedAndInProgressWorkflowRuns or PercentageRunnersBusy, and what was the final number used as the desired-replicas(after applying {Min,Max}Replicas).
Follow-up for #282
PercentageRunnerBusy seems to have regressed since #355 due to that RunnerDeployment.Spec.Selector is empty by default and the HRA controller was using that empty selector to query runners, which somehow returned 0 runners. This fixes that by using the newly added automatic `runner-deployment-name` label for the default runner label and the selector, which avoids querying with empty selector.
Ref https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-795200205
When we used `QueuedAndInProgressWorkflowRuns`-based autoscaling, it only fetched and considered only the first 30 workflow runs at the reconcilation time. This may have resulted in unreliable scaling behaviour, like scale-in/out not happening when it was expected.
* feat: HorizontalRunnerAutoscaler Webhook server
This introduces a Webhook server that responds GitHub `check_run`, `pull_request`, and `push` events by scaling up matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting next sync period to add insufficient runners.
This feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. terraform-aws-github-runner can manage one set of runners per deployment, where actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.
On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up. The webhook server tries to match HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest] against the event and if it finds only one HRA, it is the scale target. If none or two or more targets are found for repository-wide runners, it does the same on organizational runners.
Changes:
* Fix integration test
* Update manifests
* chart: Add support for github webhook server
* dockerfile: Include github-webhook-server binary
* Do not import unversioned go-github
* Update README
* ensure that minReplicas <= desiredReplicas <= maxReplicas no matter what
* before this change, if the number of runners was much larger than the max number, the applied scale down factor might still result in a desired value > maxReplicas
* if for resource constraints in the cluster, runners would be permanently restarted, the number of runners could go up more than the reverse scale down factor until the next reconciliation round, resulting in a situation where the number of runners climbs up even though it should actually go down
* by checking whether the desiredReplicas is always <= maxReplicas, infinite scaling up loops can be prevented
Enhances #57 to add support for organizational runners.
As GitHub Actions does not have an appropriate API for this, this is the spec you need:
```
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
name: myrunners
spec:
minReplicas: 1
maxReplicas: 3
autoscaling:
metrics:
- type: TotalNumberOfQueuedAndProgressingWorkflowRuns
repositories:
# Assumes that you have `github.com/myorg/myrepo1` repo
- myrepo1
- myrepo2
template:
spec:
organization: myorg
```
It works by collecting "in_progress" and "queued" workflow runs for the repositories `myrepo1` and `myrepo2` to autoscale the number of replicas, assuming you have this organizational runner deployment only for those two repositories.
For example, if `myrepo1` had 1 `in_progress` and 2 `queued` workflow runs, and `myrepo2` had 4 `in_progress` and 8 `queued` workflow runs at the time of running the reconcilation loop on the runner deployment, it will scale replicas to 1 + 2 + 4 + 8 = 15.
Perhaps we might be better add a kind of "ratio" setting so that you can configure the controller to create e.g. 2x runners than demanded. But that's another story.
Ref #10
* feat: Repository-wide RunnerDeployment Autoscaling
This adds `maxReplicas` and `minReplicas` to the RunnerDeploymentSpec. If and only if both fields are set, the controller computes and sets desired `replicas` automatically depending on the demand.
The number of demanded runner replicas is computed by `queued workflow runs + in_progress workflow runs` for the repository. The support for organizational runners is not included.
Ref https://github.com/summerwind/actions-runner-controller/issues/10