Commit Graph

91 Commits

Author SHA1 Message Date
Yusuke Kuoka 5f2b5327f7 integration: Reduce error logs to ease debugging 2022-03-03 18:47:54 +09:00
Tingluo Huang 516695b275
Set UserAgent to 'actions-runner-controller' for all Http Client. (#1140)
I can't find any requests made by user agent `actions-runner-controller` in GitHub.com's telemetry in the past 7 days.

Turns out we only set user agent `actions-runner-controller` if we are configured to use BasicAuth which is not the case for most customers I think.

I update the code a little bit to make sure it always set `actions-runner-controller` as UserAgent for the GitHub HttpClient in ARC.

A further step would be somehow baking the ARC release version into the UserAgent as well.
2022-02-28 09:17:58 +09:00
Yusuke Kuoka 4b557dc54c Add logging transport to log HTTP requests in log level -3
The log level -3 is the minimum log level that is supported today, smaller than debug(-1) and -2(used to log some HRA related logs).

This commit adds a logging HTTP transport to log HTTP requests and responses to that log level.

It implements http.RoundTripper so that it can log each HTTP request with useful metadata like `from_cache` and `ratelimit_remaining`.
The former is set to `true` only when the logged request's response was served from ARC's in-memory cache.
The latter is set to X-RateLimit-Remaining response header value if and only if the response was served by GitHub, not by ARC's cache.
2022-02-19 12:22:53 +00:00
Yusuke Kuoka 4c53e3aa75 Add GitHub API cache to avoid rate limit
This will cache any GitHub API responses with correct Cache-Control header.

`gregjones/httpcache` has been chosen as a library to implement this feature, as it is as recommended in `go-github`'s documentation:

https://github.com/google/go-github#conditional-requests

`gregjones/httpcache` supports a number of cache backends like `diskcache`, `s3cache`, and so on:

https://github.com/gregjones/httpcache#cache-backends

We stick to the built-in in-memory cache as a starter. Probably this will never becomes an issue as long as various HTTP responses for all the GitHub API calls that ARC makes, list-runners, list-workflow-jobs, list-runner-groups, etc., doesn't overflow the in-memory cache.

`httpcache` has an known unfixed issue that it doesn't update cache on chunked responses. But we assume that the APIs that we call doesn't use chunked responses. See #1503 for more information on that.

Ref #920
2022-02-19 12:22:53 +00:00
Felipe Galindo Sanchez d0d316252e
Option to consider runner group visibility on scale based on webhook (#1062)
This will work on GHES but GitHub Enterprise Cloud due to excessive GitHub API calls required.
More work is needed, like adding a cache layer to the GitHub client, to make it usable on GitHub Enterprise Cloud.

Fixes additional cases from https://github.com/actions-runner-controller/actions-runner-controller/pull/1012

If GitHub auth is provided in the webhooks controller then runner groups with custom visibility are supported. Otherwise, all runner groups will be assumed to be visible to all repositories

`getScaleUpTargetWithFunction()` will check if there is an HRA available with the following flow:

1. Search for **repository** HRAs - if so it ends here
2. Get available HRAs in k8s
3. Compute visible runner groups
  a. If GitHub auth is provided - get all the runner groups that are visible to the repository of the incoming webhook using GitHub API calls.  
  b. If GitHub auth is not provided - assume all runner groups are visible to all repositories
4. Search for **default organization** runners (a.k.a runners from organization's visible default runner group) with matching labels
5. Search for **default enterprise** runners (a.k.a runners from enterprise's visible default runner group) with matching labels
6. Search for **custom organization runner groups** with matching labels
7. Search for **custom enterprise runner groups** with matching labels

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-02-16 19:08:56 +09:00
Felipe Galindo Sanchez de1f48111a
feat: support routing GitHub API calls to custom proxy API (#1017)
GitHub currently has some limitations w.r.t permissions management on
runner groups as they all require org admin, however at our company
we're using runner groups to serve different internal teams (with
different permissions), thus we needed to deploy a custom proxy API with
our internal authentication to provide who has access to certain APIs
depending on the repository/runner group on a given org/enterprise

This change just allows to optionally send the GitHub API calls to an alternate custom
proxy URL instead of cloud github (github.com) or an enterprise URL with
basic authentication

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-12-23 09:24:10 +09:00
Felipe Galindo Sanchez 4ebec38208
Support runner groups with selected visibility in webhooks autoscaler (#1012)
The current implementation doesn't support yet runner groups with custom visibility (e.g selected repositories only). If there are multiple runner groups with selected visibility - not all runner groups may be a potential target to be scaled up. Thus this PR introduces support to allow having runner groups with selected visibility. This requires to query GitHub API to find what are the potential runner groups that are linked to a specific repository (whether using visibility all or selected).

This also improves resolving the `scaleTargetKey` that are used to match an HRA based on the inputs of the `RunnerSet`/`RunnerDeployment` spec to better support for runner groups.

This requires to configure github auth in the webhook server, to keep backwards compatibility if github auth is not provided to the webhook server, this will assume all runner groups have no selected visibility and it will target any available runner group as before
2021-12-19 18:29:44 +09:00
Gabriele Mambrini 7261d927fb
fix: report busy status for offline workers (#1009)
ref #911 

Fix #993 cannot work because the runner busy status is not reported when offline
2021-12-15 08:57:13 +09:00
Patrick Ellis ea2dbc2807
Update go-github from v37 -> v39 (#925) 2021-12-11 21:43:40 +09:00
Yusuke Kuoka f858e2e432
Add POC of GitHub Webhook Delivery Forwarder (#682)
* Add POC of GitHub Webhook Delivery Forwarder

* multi-forwarder and ctrl-c existing and fix for non-woring http post

* Rename source files

* Extract signal handling into a dedicated source file

* Faster ctrl-c handling

* Enable automatic creation of repo hook on startup

* Add support for forwarding org hook deliveries

* Set hook secret on hook creation via envvar (HOOK_SECRET)

* Fix org hook support

* Fix HOOK_SECRET for consistency

* Refactor to prepare for custom log position provider

* Refactor to extract inmemory log position provider

* Add configmap-based log position provider

* Rename githubwebhookdeliveryforwarder to hookdeliveryforwarder

* Refactor to rename LogPositionProvider to Checkpointer and extract ConfigMap checkpointer into a dedicated pkg

* Refactor to extract logger initialization

* Add hookdeliveryforwarder README and bump go-github to unreleased ver
2021-07-14 10:18:55 +09:00
Yusuke Kuoka f19e7ea8a8
chore: Upgrade go-github to v36 (#681) 2021-07-04 17:43:52 +09:00
Yusuke Kuoka 8b90b0f0e3
Clean up import list (#645)
Resolves #644
2021-06-22 17:55:06 +09:00
Yusuke Kuoka e7020c7c0f
Fix scale-from-zero to retain the reg-only runner until other pods come up (#523)
Fixes #516
2021-05-05 12:13:51 +09:00
Yusuke Kuoka 3f23501b8e
Reduce "No runner matching the specified labels was found" errors while runner replacement (#392)
We occasionally encountered those errors while the underlying RunnerReplicaSet is being recreated/replaced on RunnerDeployment.Spec.Template update. It turned out to be due to that the RunnerDeployment controller was waiting for the runner pod becomes `Running`, intead of the new replacement runner to have registered to GitHub. This fixes that, by trying to Runner.Status.Phase to `Running` only after the runner in the runner pod appears to be registered.

A side-effect of this change is that runner controller would call more "ListRunners" GitHub Actions API. I've reviewed and improved the runner controller code and Runner CRD to make make the number of calls minimum. In most cases, ListRunners should be called only twice for each runner creation.
2021-03-16 10:52:30 +09:00
Yusuke Kuoka 81016154c0
GITHUB_APP_PRIVATE_KEY can now be the content of the key (#383)
Resolves #382
2021-03-10 09:37:15 +09:00
Yusuke Kuoka 9ae3551744
Remove unnecessary GitHub API calls (#363)
The controller had the 2 extra and redundant calls to List Workflow Runs API.

Ref #362
2021-03-02 10:55:30 +09:00
Yusuke Kuoka 9da123ae5e
Fix integration test flakiness (#351)
Ref https://github.com/summerwind/actions-runner-controller/pull/345#issuecomment-785015406
2021-02-25 09:30:32 +09:00
Johannes Nicolai 4d4137aa28
Avoid zombie runners that missed token expiration by a bit (#345)
* if a new runner pod was just scheduled to start up right before a 
registration expired, it will not get a new registration token and go in 
an infinite update loop (until #341) kicks in
* if registzration tokens got updated a little bit before they actually 
expired, just starting up pods will way more likely get a working token
2021-02-25 09:07:49 +09:00
Johannes Nicolai 2d7fbbfb68
Handle offline runners gracefully (#341)
* if a runner pod starts up with an invalid token, it will go in an 
infinite retry loop, appearing as RUNNING from the outside
* normally, this error situation is detected because no corresponding 
runner objects exists in GitHub and the pod will get removed after 
registration timeout
* if the GitHub runner object already existed before - e.g. because a 
finalizer was not properly run as part of a partial Kubernetes crash, 
the runner will always stay in a running mode, even updating the 
registration token will not kill the problematic pod
* introducing RunnerOffline exception that can be handled in runner 
controller and replicaset controller
* as runners are offline when a pod is completed and marked for restart, 
only do additional restart checks if no restart was already decided, 
making code a bit cleaner and saving GitHub API calls after each job 
completion
2021-02-22 10:08:04 +09:00
Yusuke Kuoka eb2eaf8130
Fix TotalNumberOfQueuedAndInProgressWorkflowRuns to work with a lot of remaining `completed` jobs (#316)
I have heard from some user that they have hundred thousands of `status=completed` workflow runs in their repository which effectively blocked TotalNumberOfQueuedAndInProgressWorkflowRuns from working because of GitHub API rate limit due to excessive paginated requests.

This fixes that by separating list-workflow-runs calls to two - one for `queued` and one for `in_progress`, which can make the minimum API call from 1 to 2, but allows it to work regardless of number of remaining `completed` workflow runs.
2021-02-16 18:55:55 +09:00
Yusuke Kuoka 35d047db01
Fix enterprise runners misusing cached token (#314)
Follow-up for #290
2021-02-16 12:56:52 +09:00
Hidetake Iwata 4f3f2fb60d
Add metrics for GitHub API rate limit (#312) 2021-02-16 09:58:09 +09:00
Johannes Nicolai bc8bc70f69
Fix rate limit and runner registration logic (#309)
* errors.Is compares all members of a struct to return true which never 
happened
* switched to type check instead of exact value check
* notRegistered was using double negation in if statement which lead to 
unregistering runners after the registration timeout
2021-02-15 09:36:49 +09:00
Yusuke Kuoka bbb036e732
feat: Prevent blocking on transient runner registration failure (#297)
This enhances the controller to recreate the runner pod if the corresponding runner has failed to register itself to GitHub within 10 minutes(currently hard-coded).

It should alleviate #288 in case the root cause is some kind of transient failures(network unreliability, GitHub down, temporarly compute resource shortage, etc).

Formerly you had to manually detect and delete such pods or even force-delete corresponding runners to unblock the controller.

Since this enhancement, the controller does the pod deletion automatically after 10 minutes after pod creation, which result in the controller create another pod that might work.

Ref #288
2021-02-09 10:17:52 +09:00
Yusuke Kuoka 9301409aec
fix: Paginate ListRepositoryWorkflowRuns (#295)
When we used `QueuedAndInProgressWorkflowRuns`-based autoscaling, it only fetched and considered only the first 30 workflow runs at the reconcilation time. This may have resulted in unreliable scaling behaviour, like scale-in/out not happening when it was expected.
2021-02-09 10:13:53 +09:00
Yusuke Kuoka ab1c39de57
feat: HorizontalRunnerAutoscaler Webhook server (#282)
* feat: HorizontalRunnerAutoscaler Webhook server

This introduces a Webhook server that responds GitHub `check_run`, `pull_request`, and `push` events by scaling up matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting next sync period to add insufficient runners.

This feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. terraform-aws-github-runner can manage one set of runners per deployment, where actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.

On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up. The webhook server tries to match HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest] against the event and if it finds only one HRA, it is the scale target. If none or two or more targets are found for repository-wide runners, it does the same on organizational runners.

Changes:

* Fix integration test
* Update manifests
* chart: Add support for github webhook server
* dockerfile: Include github-webhook-server binary
* Do not import unversioned go-github
* Update README
2021-02-07 17:37:27 +09:00
Jesse Haka 28e80a2d28
Add support for enterprise runners (#290)
* Add support for enterprise runners

* update docs
2021-02-05 09:31:06 +09:00
Reinier Timmer 8d6f77e07c
Remove beta GitHub client implementations (#228) 2020-12-10 09:08:51 +09:00
Erik Nobel a2b335ad6a
Github pkg: Bump github package to version 33 (#222) 2020-12-06 10:01:47 +09:00
ZacharyBenamram df99f394b4
Remove 10 minute buffer to token expiration (#214)
Co-authored-by: Zachary Benamram <zacharybenamram@blend.com>
2020-11-30 09:03:27 +09:00
Yusuke Kuoka 4eb45d3c7f Fix build error 2020-11-10 17:09:16 +09:00
Juho Saarinen 1c30bdf35b
Add GHE URL to transport (#152)
Fixes #149
2020-11-10 17:05:09 +09:00
Yusuke Kuoka 3f335ca628
Fix panic on startup when misconfigured (#154)
Fixes #153
2020-11-10 17:03:33 +09:00
Juho Saarinen 40c5050978
Added support for other than public GitHub URL (#146)
Refactoring a bit
2020-10-28 22:15:53 +09:00
Dominic LoBue a63860029a
Prefer autoscaling based on jobs rather than workflows if available (#114)
Adds the ability to autoscale on jobs in addition to workflows. We fall back to using workflow metrics if job details are not present.

Resolves #89
2020-10-08 09:00:44 +09:00
Helder Moreira 7a2fa7fbce
runner-controller: do not delete runner if it is busy (#103)
Currently, after refreshing the token, the controller re-creates the runner with the new token. This results in jobs being interrupted. This PR makes sure the pod is not restarted if it is busy.

Closes #74
2020-10-05 09:06:37 +09:00
Yusuke Kuoka 4733edc20d Add scaling-down scenario to integration test 2020-08-02 16:10:01 +09:00
KUOKA Yusuke 5bb2694349
feat: Repository-wide RunnerDeployment Autoscaling (#57)
* feat: Repository-wide RunnerDeployment Autoscaling

This adds `maxReplicas` and `minReplicas` to the RunnerDeploymentSpec. If and only if both fields are set, the controller computes and sets desired `replicas` automatically depending on the demand.

The number of demanded runner replicas is computed by `queued workflow runs + in_progress workflow runs` for the repository. The support for organizational runners is not included.

Ref https://github.com/summerwind/actions-runner-controller/issues/10
2020-06-27 17:26:46 +09:00
Reinier Timmer 9f57f52e36 organization and repository are now exclusive 2020-04-28 11:14:31 +02:00
Reinier Timmer fb35dd4131 support for organization runners 2020-04-28 11:14:31 +02:00
Moto Ishizawa 5f608058cd Add github package 2020-04-13 22:27:05 +09:00