We sometimes see integration tests fail because the runner replicas do not reach the expected number in a timely manner. After some investigation, this turned out to be caused by conflicting HRA updates from the webhook-based autoscaler and the HRA controller. This changes the controllers to use Patch instead of Update, making such conflicts less likely.
The HRA controller now also uses Patch when updating RunnerDeployment.
Overall, these changes should make webhook-based autoscaling more reliable thanks to fewer conflicts.
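Below is a minimal sketch of the Update-to-Patch switch using the controller-runtime client; the HRA import path, type name, and the mutation callback are assumptions for illustration, not the controller's actual code.

```go
package sketch

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/summerwind/actions-runner-controller/api/v1alpha1"
)

// patchHRA applies a mutation and writes it back with Patch. MergeFrom computes
// a merge patch relative to the copy taken before the mutation, so only the
// changed fields are sent and unrelated concurrent updates are far less likely
// to produce "the object has been modified" conflicts than a full Update.
func patchHRA(ctx context.Context, c client.Client, hra *v1alpha1.HorizontalRunnerAutoscaler, mutate func(*v1alpha1.HorizontalRunnerAutoscaler)) (ctrl.Result, error) {
	before := hra.DeepCopy()
	mutate(hra)
	if err := c.Patch(ctx, hra, client.MergeFrom(before)); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```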
* Add replicaCount
* Add authSecret.existingSecret
* Set image.tag to null by default
* Implement ingress for the github webhook server
* Fix deprecated and secretName template
* Keep backward compatibility for .authSecret.enabled
* Support existingSecret for the github webhook secret
* Use the secretName template
* Set default secret names
* Do not use the app-version-based image tag
* Create and name a variable for secrets
Similar to #348 for #346, but for HRA.Spec.CapacityReservations, which is usually modified by the webhook-based autoscaler controller.
This patch addresses that by improving the webhook-based autoscaler controller to omit expired reservations when updating the HRA spec.
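A minimal sketch of the idea, with illustrative type and field names (the actual CapacityReservation API may differ, e.g. by using metav1.Time):

```go
package sketch

import "time"

// CapacityReservation is an illustrative stand-in for the HRA spec entry.
type CapacityReservation struct {
	Replicas       int
	ExpirationTime time.Time
}

// omitExpired drops reservations whose expiration has already passed, so the
// webhook-based autoscaler never writes stale reservations back into the HRA
// spec when it adds a new one.
func omitExpired(reservations []CapacityReservation, now time.Time) []CapacityReservation {
	var kept []CapacityReservation
	for _, r := range reservations {
		if r.ExpirationTime.After(now) {
			kept = append(kept, r)
		}
	}
	return kept
}
```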
`kubectl get horizontalrunnerautoscalers.actions.summerwind.dev` shows HRA.status.desiredReplicas as the DESIRED count. However, that value did not take capacityReservations into account, which resulted in an incorrect count when you used the webhook-based autoscaler or the capacityReservations API directly. This fixes that.
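A minimal sketch of the intended calculation, with illustrative names: the DESIRED count becomes the suggested replicas plus the replicas of all capacity reservations, clamped to the configured bounds.

```go
package sketch

// desiredReplicas sums the base suggestion and every reservation's replicas,
// then clamps the result to [minReplicas, maxReplicas].
func desiredReplicas(suggested, minReplicas, maxReplicas int, reservedReplicas []int) int {
	desired := suggested
	for _, n := range reservedReplicas {
		desired += n
	}
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}
```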
The controller had been writing confusing messages like the one below when the scale target was missing:
```
Found too many scale targets: It must be exactly one to avoid ambiguity. Either set WatchNamespace for the webhook-based autoscaler to let it only find HRAs in the namespace, or update Repository or Organization fields in your RunnerDeployment resources to fix the ambiguity.{"scaleTargets": ""}
```
This fixes that, while also improving many of the messages written during reconciliation so that the errors are more actionable.
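As an illustration only (not the controller's exact wording), the key is to distinguish the zero-target case from the multiple-target case so the suggested action matches what actually happened:

```go
package sketch

import "fmt"

// scaleTargetError returns distinct, actionable errors for "none found" and
// "too many found" instead of a single ambiguous message for both cases.
func scaleTargetError(targets []string) error {
	switch len(targets) {
	case 1:
		return nil
	case 0:
		return fmt.Errorf("no scale target found: ensure the Repository or Organization on your RunnerDeployment matches the webhook event, or set WatchNamespace so the autoscaler searches the right namespace")
	default:
		return fmt.Errorf("found %d scale targets (%v): it must be exactly one to avoid ambiguity; either set WatchNamespace or fix the Repository/Organization fields on your RunnerDeployments", len(targets), targets)
	}
}
```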
* if a new runner pod was scheduled to start up right before a registration token expired, it would not get a new registration token and would go into an infinite update loop (until #341 kicks in)
* if registration tokens are refreshed a little before they actually expire, pods that are just starting up are far more likely to get a working token (see the sketch after this list)
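A minimal sketch of the timing idea; the helper name and the buffer length are illustrative, not the controller's actual values:

```go
package sketch

import "time"

// tokenNeedsRefresh treats a registration token as due for renewal some buffer
// before its real expiration, so pods that are just starting up still receive
// a token with useful lifetime left.
func tokenNeedsRefresh(expiresAt, now time.Time) bool {
	const refreshBuffer = 30 * time.Minute
	return now.After(expiresAt.Add(-refreshBuffer))
}
```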
We occasionally see logs like the one below:
```
2021-02-24T02:48:26.769ZERRORFailed to update runner status{"runnerreplicaset": "testns-244ol/example-runnerdeploy-j5wzf", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"example-runnerdeploy-j5wzf\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/home/runner/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/summerwind/actions-runner-controller/controllers.(*RunnerReplicaSetReconciler).Reconcile
/home/runner/work/actions-runner-controller/actions-runner-controller/controllers/runnerreplicaset_controller.go:207
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
2021-02-24T02:48:26.769ZERRORcontroller-runtime.controllerReconciler error{"controller": "testns-244olrunnerreplicaset", "request": "testns-244ol/example-runnerdeploy-j5wzf", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"example-runnerdeploy-j5wzf\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/home/runner/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
```
These can be compacted into a single line, without the unhelpful stack trace and without double-logging the same error from both the logger and the controller.
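A minimal sketch of the compact form, assuming a recent controller-runtime client: conflicts are detected with the standard apierrors.IsConflict check, logged once on a single line, and retried via a clean requeue instead of being returned as an error (which is what triggers the second, stack-traced log entry).

```go
package sketch

import (
	"context"

	"github.com/go-logr/logr"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateStatus writes the status subresource and logs update conflicts as one
// informational line, requeueing instead of returning the error so the
// controller does not log the same failure a second time.
func updateStatus(ctx context.Context, c client.Client, log logr.Logger, obj client.Object) (ctrl.Result, error) {
	if err := c.Status().Update(ctx, obj); err != nil {
		if apierrors.IsConflict(err) {
			log.V(1).Info("Retrying status update due to conflict", "name", obj.GetName())
			return ctrl.Result{Requeue: true}, nil
		}
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```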
Apparently I had mistakenly removed the `push` option from the workflow in #323, which resulted in the new runner build from #323 not being pushed. This fixes that.
* if a runner pod starts up with an invalid token, it goes into an infinite retry loop, appearing as RUNNING from the outside
* normally, this error situation is detected because no corresponding runner object exists in GitHub, and the pod is removed after the registration timeout
* if the GitHub runner object already existed before - e.g. because a finalizer was not properly run as part of a partial Kubernetes crash - the runner stays in a running state forever, and even updating the registration token will not kill the problematic pod
* introduce a RunnerOffline error that can be handled in the runner controller and the replicaset controller (a minimal sketch follows after this list)
* as runners are offline when a pod has completed and is marked for restart, only do additional restart checks if no restart was already decided, making the code a bit cleaner and saving GitHub API calls after each job completion
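A minimal sketch of what such an error type could look like; the real type, fields, and call sites in the controllers may differ:

```go
package sketch

import (
	"errors"
	"fmt"
)

// RunnerOffline signals that the runner is registered on GitHub but currently
// offline, which callers may treat differently from other API errors.
type RunnerOffline struct {
	RunnerName string
}

func (e *RunnerOffline) Error() string {
	return fmt.Sprintf("runner %q exists on GitHub but is offline", e.RunnerName)
}

// isRunnerOffline lets the runner and runnerreplicaset reconcilers branch on
// the offline condition without string-matching error messages.
func isRunnerOffline(err error) bool {
	var offline *RunnerOffline
	return errors.As(err, &offline)
}
```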
Changes:
1. Fix the length of the github-webhook-server port name (Kubernetes limits port names to 15 characters)
2. Add a cluster role binding for github-webhook-server
3. Remove --enable-leader-election from github-webhook-server
* Update runner to 2.277.1
* Update build-and-release-runners.yml
* integration test condition
Don't run integration tests when only updating the runner image
* fixup! integration test condition
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* so far, only push events would trigger the DockerHub login step
* hence, attempts to release would fail because of a permission problem (tested locally)
* add an OR condition to also log in when a release gets published
I have heard from a user that they have hundreds of thousands of `status=completed` workflow runs in their repository, which effectively blocked TotalNumberOfQueuedAndInProgressWorkflowRuns from working because the excessive paginated requests hit the GitHub API rate limit.
This fixes that by splitting the list-workflow-runs call into two - one for `queued` and one for `in_progress` - which raises the minimum number of API calls from 1 to 2, but lets it work regardless of how many `completed` workflow runs remain.
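A minimal sketch of the split using go-github; the module version, pagination, and error handling are simplified and only illustrate the idea of querying each status separately:

```go
package sketch

import (
	"context"

	"github.com/google/go-github/v33/github"
)

// queuedAndInProgress issues one list call per status so that `completed` runs
// are never paginated through, at the cost of a second API call.
func queuedAndInProgress(ctx context.Context, c *github.Client, owner, repo string) (int, error) {
	total := 0
	for _, status := range []string{"queued", "in_progress"} {
		runs, _, err := c.Actions.ListRepositoryWorkflowRuns(ctx, owner, repo, &github.ListWorkflowRunsOptions{
			Status:      status,
			ListOptions: github.ListOptions{PerPage: 100},
		})
		if err != nil {
			return 0, err
		}
		total += runs.GetTotalCount()
	}
	return total, nil
}
```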