We sometimes see that integration test fails due to runner replicas not meeting the expected number in a timely manner. After investigating a bit, this turned out to be due to that HRA updates on webhook-based autoscaler and HRA controller are conflicting. This changes the controllers to use Patch instead of Update to make conflicts less likely to happen.
I have also updated the hra controller to use Patch when updating RunnerDeployment, too.
Overall, these changes should make the webhook-based autoscaling more reliable due to less conflicts.
`kubectl get horizontalrunnerautoscalers.actions.summerwind.dev` shows HRA.status.desiredReplicas as the DESIRED count. However the value had been not taking capacityReservations into account, which resulted in showing incorrect count when you used webhook-based autoscaler, or capacityReservations API directly. This fixes that.
I have heard from some user that they have hundred thousands of `status=completed` workflow runs in their repository which effectively blocked TotalNumberOfQueuedAndInProgressWorkflowRuns from working because of GitHub API rate limit due to excessive paginated requests.
This fixes that by separating list-workflow-runs calls to two - one for `queued` and one for `in_progress`, which can make the minimum API call from 1 to 2, but allows it to work regardless of number of remaining `completed` workflow runs.
* feat: HorizontalRunnerAutoscaler Webhook server
This introduces a Webhook server that responds GitHub `check_run`, `pull_request`, and `push` events by scaling up matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting next sync period to add insufficient runners.
This feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. terraform-aws-github-runner can manage one set of runners per deployment, where actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.
On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up. The webhook server tries to match HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest] against the event and if it finds only one HRA, it is the scale target. If none or two or more targets are found for repository-wide runners, it does the same on organizational runners.
Changes:
* Fix integration test
* Update manifests
* chart: Add support for github webhook server
* dockerfile: Include github-webhook-server binary
* Do not import unversioned go-github
* Update README
* Adds RUNNER_GROUP argument to the runner registration
Adds the ability to register a runner to a predefined runner_group
Resolves#137
* Update README with runner group example
- Updates the README with instructions of how to add the runner to a
group
- Fix code fencing for shell and yaml blocks in the README
- Use consistent bullet points (dash not asterisk)
Currently, after refreshing the token, the controller re-creates the runner with the new token. This results in jobs being interrupted. This PR makes sure the pod is not restarted if it is busy.
Closes#74
This is a follow-up for #66.
The reconciler for the new HorizontalRunnerDeploymentAutoscaler had a terrible flaw that prevented the controller to fail launching due to an error like:
```
indexer conflict: map[field:.metadata.controller:{}]
```
This fixes that, while adding `integration_test.go` to verify its actually fixed and prevent regression in the future.