* fix(chart): allow to use basic auth when authSecret.create is false
When secret is created outside of the ARC chart using authSecret.create=false and basicAuth,
the controller fails as we're not including the basic password as environment variable as
the password value won't be inside the helm values.
This PR includes both environment variables for consistent regardless if
those are set or not similar as the rest of the other auth options (e.g
app_id, private key, etc)
* chart: Add back the conditional block for .Values.authSecret.github_basicauth_username
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
While testing #1179, I discovered that ARC sometimes stop resyncing RunnerReplicaSet when the desired replicas is greater than the actual number of runner pods.
This seems to happen when ARC missed receiving a workflow_job completion event but it has no way to decide if it is either (1) something went wrong on ARC or (2) a loadbalancer in the middle or GitHub or anything not ARC went wrong. It needs a standard to decide it, or if it's not impossible, how to deal with it.
In this change, I added a hard-coded 10 minutes timeout(can be made customizable later) to prevent runner pod recreation.
Now, a RunnerReplicaSet/RunnerSet to restart runner pod recreation 10 minutes after the last scale-up. If the workflow completion event arrived after the timeout, it will decrease the desired replicas number that results in the removal of a runner pod. The removed runner pod might be deleted without ever being used, but I think that's better than leaving the desired replicas and the actual number of replicas diverged forever.
Refactor Runner and RunnerSet so that they use the same library code that powers RunnerSet.
RunnerSet is StatefulSet-based and RunnerSet/Runner is Pod-based so it had been hard to unify the implementation although they look very similar in many aspects.
This change finally resolves that issue, by first introducing a library that implements the generic logic that is used to reconcile RunnerSet, then adding an adapter that can be used to let the generic logic manage runner pods via Runner, instead of via StatefulSet.
Follow-up for #1127, #1167, and 1178
This eliminates the race condition that results in the runner terminated prematurely when RunnerSet triggered unregistration of StatefulSet that added just a few seconds ago.
I can't find any requests made by user agent `actions-runner-controller` in GitHub.com's telemetry in the past 7 days.
Turns out we only set user agent `actions-runner-controller` if we are configured to use BasicAuth which is not the case for most customers I think.
I update the code a little bit to make sure it always set `actions-runner-controller` as UserAgent for the GitHub HttpClient in ARC.
A further step would be somehow baking the ARC release version into the UserAgent as well.
Enhances ARC(both the controller-manager and github-webhook-server) to cache any GitHub API responses with HTTP GET and an appropriate Cache-Control header.
Ref #920
## Cache Implementation
`gregjones/httpcache` has been chosen as a library to implement this feature, as it is as recommended in `go-github`'s documentation:
https://github.com/google/go-github#conditional-requests
`gregjones/httpcache` supports a number of cache backends like `diskcache`, `s3cache`, and so on:
https://github.com/gregjones/httpcache#cache-backends
We stick to the built-in in-memory cache as a starter. Probably this will never becomes an issue as long as various HTTP responses for all the GitHub API calls that ARC makes, list-runners, list-workflow-jobs, list-runner-groups, etc., doesn't overflow the in-memory cache.
`httpcache` has an known unfixed issue that it doesn't update cache on chunked responses. But we assume that the APIs that we call doesn't use chunked responses. See #1503 for more information on that.
## Ephemeral runner pods are no longer recreated
The addition of the cache layer resulted in a slow down of a scale-down process and a trade-off between making the runner pod termination process fragile to various race conditions(shorter grace period before runner deletion) or delaying runner pod deletion depending on how long the grace period is(longer grace period). A grace period needs to be at least longer than 60s (which is the same as cache duration of ListRunners API) to not prematurely delete a runner pod that was just created.
But once I disabled automatic recreation of ephemeral runner pod, it turned out to be no more of an issue when it's being scaled via workflow_job webhook.
Ephemeral runner resources are still automatically added on demand by RunnerDeployment via RunnerReplicaSet(I've added `EffectiveTime` fields to our CRDs but that's an implementation detail so let's omit). A good side-effect of disabling ephemeral runner pod recreations is that ARC will no longer create redundant ephemeral runners when used with webhook-based autoscaler.
Basically, autoscaling still works as everyone might expect. It's just better than before overall.