* Changed folder structure to allow multi-group registration
* included actions.github.com directory for resources and controllers
* updated go module to actions/actions-runner-controller
* publish arc packages under actions-runner-controller
* Update charts/actions-runner-controller/docs/UPGRADING.md
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Add workflow job metrics to GitHub webhook server
* Fix handling of workflow_job.Conclusion
* Make the prometheus metrics exporter for the workflow jobs a dedicated application
* chart: Add support for deploying actions-metrics-server
* A few improvements to make it easy to cover in E2E
* chart: Add missing actionsmetrics.service.yaml
* chart: Do not modify actionsMetricsServer.replicaCount
* chart: Add documentation for actionsMetrics and actionsMetricsServer
Co-authored-by: Colin Heathman <cheathman@benchsci.com>
* Use quote on caBundle values for the webhook deployment
* Drop unrecognized --log-format arg on the manager container
* Update custom cert docs with the default san/secret names
* Revert "Drop unrecognized --log-format arg on the manager container"
This reverts commit d76dd67317.
* 1770 update log format and add runID and ID to workflow logs
update tests, change log format for controllers.HorizontalRunnerAutoscalerGitHubWebhook
use logging package
remove unused modules
add setup name to setuplog
add flag to change log format
change flag name to enableProdLogConfig
move log opts to logger package
remove empty else and reset timeEncoder
update flag description
use get function to handle nil
rename flag and update logger function
Update main.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Update controllers/horizontal_runner_autoscaler_webhook.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Update logging/logger.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
copy log opts for each NewLogger call
revert to use autoscaler.log
update flag description and remove unused imports
add logFormat to readme
rename setupLog to logger
make fmt
* Fix E2E along the way
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* chart: Remove support for extensions/v1beta1 and networking.k8s.io/v1beta1
`networking.k8s.io/v1` has been available since v1.19.
As of today, AWS EKS supports v1.19+ and Oracle Cloud supports v1.20+. GKE and AKS support v1.21+. The upstream Kubernetes project maintains v1.22+.
So it should be safe to remove it now.
* fixup! chart: Remove support for extensions/v1beta1 and networking.k8s.io/v1beta1
This removes the flag and code for the legacy GitHub API cache. We have already migrated fully to the new HTTP-cache-based API cache functionality that was added via #1127 and has been available since ARC 0.22.0. Since then, the legacy cache has been a no-op, so removing it is safe.
Ref #1412
* feat: allow to discover runner statuses
* fix manifests
* Bump runner version to 2.289.1 which includes the hooks support
* Add feedback from review
* Update reference to newRunnerPod
* Fix TestNewRunnerPodFromRunnerController and make hooks file names job specific
* Fix additional TestNewRunnerPod test
* Cover additional feedback from review
* fix rbac manager role
* Add permissions to service account for container mode if not provided
* Rename flag to runner.statusUpdateHook.enabled and fix needsServiceAccount
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* added containerMode=kubernetes env variables to the runner
* removed unused logging
* restored configs and charts
* restored makefile cert version and acceptance/run
* added workVolumeClaimTemplate in pod definition, including logic
* added claim template name based on the runner
* Apply suggestions from code review
update errors
* added concurrent cleanup before runner pod is deleted
* update manifests
* added retry after 30s if pod cleanup contains err
* added admission webhook check, made workVolumeClaimTemplate mandatory for k8s
* style changes and added comments
* added IsZero timestamp check for deleting runner-linked pods
* changed order of local variable to avoid copy if p is deleted
* removed docker from container mode k8s
* restored charts, config, makefile
* restored forked files back and not the ARC ones
* created PersistentVolume on containerMode k8s
* create pv only if storage class name is local-storage
* removed actions if storage class name is local-storage
* added service account validation if container mode kubernetes
* changed the coding style to match rest of the ARC
* added validation to the runnerdeployment webhook
* specified fields more precisely, added webhook validation to the replicaset as well
* remake manifests
* wrapped delete runner-linked-pods in kube mode
* fixed empty line
* fixed import
* makefile changes for hooks
* added cleanup secrets
* create manifests
* docs
* update access modes
* update dockerfile
* nit changes
* fixed dockerfile
* rewrite allowing reuse for runners and runnersets
* deepcopy forgot to stage
* changed privileged
* make manifests
* partly moved to finalizer, still need to apply finalizer first
* finalizer added if env variable used in container mode exists
* bump runner version
* error message moved from Error to Info on cleanup pods/secrets
* removed useless dereferencing, added transformation tests of workVolumeClaimTemplate
* Apply suggestions from code review
* Update controllers/utils_test.go
Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com>
* Update controllers/utils_test.go
Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com>
* add hook version to cli, update to 0.1.2
* Apply suggestions from code review
* Update controllers/utils_test.go
* Update runner/Makefile
* Fix missing secret permission and the error handling
* Fix a runnerpod reconciler finalizer to not trigger unnecessary retry
Co-authored-by: Nikola Jokic <nikola-jokic@github.com>
Co-authored-by: Nikola Jokic <97525037+nikola-jokic@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* chart: Add extraPaths to Ingress of GitHub Webhook Server
* Update charts/actions-runner-controller/templates/githubwebhook.ingress.yaml
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Prefix the toYaml expression to remove the extra newline before extra paths
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Enhance RunnerSet to optionally retain PVs across restarts
This is our initial attempt to bring back the ability to retain PVs across runner pod restarts when using RunnerSet.
The implementation is composed of two new controllers, `runnerpersistentvolumeclaim-controller` and `runnerpersistentvolume-controller`.
It all starts from our existing `runnerset-controller`. The controller now tries to mark any PVCs created by the StatefulSets it creates for the RunnerSet.
Once the controller has terminated the StatefulSets, their corresponding PVCs are cleaned up by `runnerpersistentvolumeclaim-controller`, and the PVs are then unbound from those PVCs by `runnerpersistentvolume-controller` so that they can be reused by future PVCs created for future StatefulSets that share the same StorageClass.
Ref #1286
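A minimal sketch of the PV-recycling step described above, assuming a `Retain` reclaim policy: once a PVC is deleted and its PV reports `Released`, clearing the `claimRef` lets the volume become `Available` again for the next StatefulSet's PVCs. The helper name is hypothetical and this is not necessarily the exact logic of `runnerpersistentvolume-controller`.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// unbindReleasedPV (hypothetical name) clears the claimRef of a Released PV so
// that it becomes Available again and can be bound by a PVC created for the
// next StatefulSet of the same RunnerSet, assuming a Retain reclaim policy.
func unbindReleasedPV(ctx context.Context, c client.Client, pvName string) error {
	var pv corev1.PersistentVolume
	if err := c.Get(ctx, client.ObjectKey{Name: pvName}, &pv); err != nil {
		return client.IgnoreNotFound(err)
	}
	// Leave Bound (or already Available) volumes alone; only recycle volumes
	// whose PVC has already been deleted.
	if pv.Status.Phase != corev1.VolumeReleased {
		return nil
	}
	// Dropping the claimRef lets the PV controller mark the volume Available,
	// so a new PVC with the same StorageClass can claim it.
	pv.Spec.ClaimRef = nil
	return c.Update(ctx, &pv)
}
```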
* Update E2E test suite to cover runner, docker, and go caching with RunnerSet + PVs
Ref #1286
Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
> chart test is failing due to `flag provided but not defined: -default-scale-down-delay` which seems to come from the fact that we still use ARC 0.22.3 for chart testing.
>
> Probably we'd better figure out how to test it against both the latest release version of ARC and the canary version of ARC?
>
> Or just test it against the canary version so that it won't fail when the chart depends on features that are available only in the canary version of ARC? 🤔
yup, let's get this merged though so we can do a release today
Bumps the chart version along with the controller version.
We bump the patch number for the chart as the release for the controller is a patch release.
That's the same handling as we've done in the previous version ecc8b4472a and #1300
As always, be sure to upgrade CRDs before updating the controller version!
Otherwise it can break in interesting ways.
* chore: bump chart to latest
Bumps the chart version along with the controller version.
We bump the patch number for the chart as the release for the controller is a patch release.
That's the same handling as we've done in the previous version ecc8b4472a
As always, be sure to upgrade CRDs before updating the controller version!
Otherwise it can break in interesting ways.
* docs: expand on CRD upgrade requirement
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
* Remove legacy GitHub API cache of HRA.Status.CachedEntries
We migrated to the transport-level cache introduced in #1127, so not only is this useless, it also makes it harder to deduce which cache resulted in the desired replicas number calculated by the HRA.
Just remove the legacy cache to keep it simple and easy to understand.
* Deprecate the githubAPICacheDuration helm chart value and the --github-api-cache-duration flag as well
* Fix integration test
* fix(chart): allow to use basic auth when authSecret.create is false
When the secret is created outside of the ARC chart using authSecret.create=false and basic auth,
the controller fails because we are not including the basic-auth password as an environment variable,
since the password value won't be inside the Helm values.
This PR includes both environment variables for consistency, regardless of whether
they are set, in line with the rest of the auth options (e.g.
app_id, private key, etc.)
* chart: Add back the conditional block for .Values.authSecret.github_basicauth_username
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Enhances ARC (both the controller-manager and the github-webhook-server) to cache any GitHub API responses to HTTP GET requests that carry an appropriate Cache-Control header.
Ref #920
## Cache Implementation
`gregjones/httpcache` has been chosen as the library to implement this feature, as it is recommended in `go-github`'s documentation:
https://github.com/google/go-github#conditional-requests
`gregjones/httpcache` supports a number of cache backends like `diskcache`, `s3cache`, and so on:
https://github.com/gregjones/httpcache#cache-backends
We stick to the built-in in-memory cache as a starter. This will probably never become an issue, as long as the various HTTP responses for all the GitHub API calls that ARC makes (list-runners, list-workflow-jobs, list-runner-groups, etc.) don't overflow the in-memory cache.
`httpcache` has a known, unfixed issue where it doesn't update the cache on chunked responses, but we assume that the APIs we call don't use chunked responses. See #1503 for more information on that.
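For reference, wiring `gregjones/httpcache` in front of a `go-github` client roughly looks like the sketch below, using the in-memory backend. The function name `newCachingClient` and the go-github major version are illustrative, and the real client construction in ARC also has to combine this with its authentication transport (PAT or GitHub App).

```go
package arcgithub

import (
	"github.com/google/go-github/v45/github" // major version is illustrative
	"github.com/gregjones/httpcache"
)

// newCachingClient (hypothetical name) returns a go-github client whose GET
// responses are cached in memory and revalidated via conditional requests,
// honoring Cache-Control headers.
func newCachingClient() *github.Client {
	// NewMemoryCacheTransport uses the built-in in-memory backend; other
	// backends such as diskcache or s3cache can be plugged in via
	// httpcache.NewTransport(cache).
	t := httpcache.NewMemoryCacheTransport()
	return github.NewClient(t.Client())
}
```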
## Ephemeral runner pods are no longer recreated
The addition of the cache layer slowed down the scale-down process and introduced a trade-off between making the runner pod termination process fragile to various race conditions (a shorter grace period before runner deletion) and delaying runner pod deletion (a longer grace period). The grace period needs to be at least 60s (the cache duration of the ListRunners API) to avoid prematurely deleting a runner pod that was just created.
But once I disabled automatic recreation of ephemeral runner pods, this turned out to no longer be an issue when scaling via the workflow_job webhook.
Ephemeral runner resources are still automatically added on demand by RunnerDeployment via RunnerReplicaSet (I've added `EffectiveTime` fields to our CRDs, but that's an implementation detail, so let's omit it). A good side effect of disabling ephemeral runner pod recreation is that ARC will no longer create redundant ephemeral runners when used with the webhook-based autoscaler.
Basically, autoscaling still works as everyone might expect. It's just better than before overall.
so that you can run `kubectl logs` on controller pods without specifying the container name.
It is especially useful when you want to run `kubectl logs` on all ARC pods across the controller-manager and the github-webhook-server, like:
```
kubectl -n actions-runner-system logs -l app.kubernetes.io/name=actions-runner-controller
```
That was previously impossible because the selector matches pods from both the controller-manager and the github-webhook-server, and kubectl does not provide a way to specify container names for the respective pods.
* chart: Allow using different secrets for controller-manager and gh-webhook-server
This is entirely possible because they are two different K8s Deployments. It may provide better scalability because each component then gets its own GitHub API quota.
This will work on GHES but not on GitHub Enterprise Cloud, due to the excessive number of GitHub API calls required.
More work is needed, like adding a cache layer to the GitHub client, to make it usable on GitHub Enterprise Cloud.
Fixes additional cases from https://github.com/actions-runner-controller/actions-runner-controller/pull/1012
If GitHub auth is provided in the webhooks controller, then runner groups with custom visibility are supported. Otherwise, all runner groups will be assumed to be visible to all repositories.
`getScaleUpTargetWithFunction()` will check if there is an HRA available with the following flow:
1. Search for **repository** HRAs - if one is found, the flow ends here
2. Get available HRAs in k8s
3. Compute visible runner groups
a. If GitHub auth is provided - get all the runner groups that are visible to the repository of the incoming webhook using GitHub API calls.
b. If GitHub auth is not provided - assume all runner groups are visible to all repositories
4. Search for **default organization** runners (a.k.a runners from organization's visible default runner group) with matching labels
5. Search for **default enterprise** runners (a.k.a runners from enterprise's visible default runner group) with matching labels
6. Search for **custom organization runner groups** with matching labels
7. Search for **custom enterprise runner groups** with matching labels
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
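A self-contained sketch of the priority-ordered resolution above, in Go. All names here (`Target`, `finder`, `resolveScaleUpTarget`) are illustrative and do not match ARC's actual code; the real `getScaleUpTargetWithFunction()` has different signatures, and the custom-group finders consult the computed set of visible runner groups (all groups when no GitHub auth is configured on the webhook server).

```go
package main

import "fmt"

// Target identifies the HRA (and hence the RunnerDeployment/RunnerSet) that
// should be scaled up for an incoming workflow_job webhook.
type Target struct{ Name string }

// finder checks one scope (repository HRA, default org runner group, custom
// enterprise runner group, ...) and returns nil when nothing there matches
// the webhook's runner labels.
type finder func() *Target

// resolveScaleUpTarget walks the scopes in the documented priority order and
// stops at the first match.
func resolveScaleUpTarget(scopes ...finder) *Target {
	for _, find := range scopes {
		if t := find(); t != nil {
			return t
		}
	}
	return nil
}

func main() {
	target := resolveScaleUpTarget(
		func() *Target { return nil },                            // 1. repository HRAs
		func() *Target { return nil },                            // 4. default org runner group
		func() *Target { return nil },                            // 5. default enterprise runner group
		func() *Target { return &Target{Name: "team-a-group"} },  // 6. custom org runner groups
		func() *Target { return nil },                            // 7. custom enterprise runner groups
	)
	fmt.Println(target.Name) // team-a-group
}
```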
GitHub currently has some limitations w.r.t. permissions management on runner groups, as they all require org admin. However, at our company we're using runner groups to serve different internal teams (with different permissions), so we needed to deploy a custom proxy API with our internal authentication to control who has access to certain APIs depending on the repository/runner group in a given org/enterprise.
This change just allows optionally sending the GitHub API calls to an alternate custom proxy URL, with basic authentication, instead of cloud GitHub (github.com) or an enterprise URL.
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
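For illustration, pointing `go-github` at a custom API endpoint with basic authentication looks roughly like the following. The function and parameter names are hypothetical and the go-github major version is just an example, not necessarily what ARC pins:

```go
package arcgithub

import "github.com/google/go-github/v45/github" // major version is illustrative

// newProxyClient (hypothetical name) builds a go-github client that sends all
// API calls to a custom proxy URL instead of api.github.com, authenticating
// with basic auth.
func newProxyClient(proxyURL, username, password string) (*github.Client, error) {
	tp := &github.BasicAuthTransport{
		Username: username,
		Password: password,
	}
	// NewEnterpriseClient rewrites BaseURL/UploadURL so that every request
	// goes to the proxy (or GHES) endpoint rather than to github.com.
	return github.NewEnterpriseClient(proxyURL, proxyURL, tp.Client())
}
```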
The current implementation doesn't yet support runner groups with custom visibility (e.g. selected repositories only). If there are multiple runner groups with selected visibility, not all runner groups may be a potential target to be scaled up. Thus this PR introduces support for runner groups with selected visibility. This requires querying the GitHub API to find the potential runner groups that are linked to a specific repository (whether using visibility 'all' or 'selected').
This also improves resolving the `scaleTargetKey` values that are used to match an HRA based on the inputs of the `RunnerSet`/`RunnerDeployment` spec, to better support runner groups.
This requires configuring GitHub auth in the webhook server. To keep backwards compatibility, if GitHub auth is not provided to the webhook server, it will assume that no runner groups have selected visibility and will target any available runner group, as before.
* fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.0
* Fix dependencies and bump Go to 1.17 so that it builds after controller-runtime 0.11.0 upgrade
* Regenerate manifests with the latest K8s dependencies
Co-authored-by: Renovate Bot <bot@renovateapp.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
When false, the chart deployment template will not add GITHUB_*
environment variables to the manager container. In addition, the `volume`
and `volumeMount` for the secret will also be omitted from the
deployment manifest.
Signed-off-by: Piaras Hoban <phoban01@gmail.com>
* chore(deps): update quay.io/brancz/kube-rbac-proxy docker tag to v0.11.0
* chore(deps): update quay.io/brancz/kube-rbac-proxy make tag to v0.11.0
Co-authored-by: Renovate Bot <bot@renovateapp.com>
Co-authored-by: Callum Tait <15716903+toast-gear@users.noreply.github.com>
* docs: clean up of autoscaling section
* docs: clarifying anti-flapping
* docs: more improvements
* docs: more improvements
* docs: adding duration details and caveats
* docs: smaller english and better structure
* docs: use consistent wording
* docs: adding limitation caveat for RunnerSets
* docs: correct helm upgrade order
* docs: lines helm upgrade command with help switch
* docs: use existing limitations section
* docs: fix table of headers and contents
* docs: add link to runnersets on first mention
* docs: adding runnerset limitation
* chore: use new enterprise permission for PAT
* docs: bump example deploy to latest version
* docs: adding oauth apps link
* docs: adding caveat to the oauth apps doc
* Revert "chore: support app ids as int or strings (#869)"
This reverts commit 0a3d2b686e.
* docs: adding some comments to the code
* docs: adding comment to the chart values
* Create optional serviceAnnotations value in helm chart
* update annotation key
* update annotation key - webhook service
* fix README.md
* docs: using consistent tense
* docs: making the code comments more generic
* feat: Have githubwebhook monitor a single namespace
When using `scope.singleNamespace: true` in Helm, the expected behaviour is
that the GitHub webhook server behaves the same way as the controller.
The current behaviour is that the webhook server monitors all namespaces.
* Changing the chart README.md to reflect the scope
The documentation now mentions that both the controller and the github
webhook server will make use of the `scope.watchNamespace` field if
`scope.singleNamespace` is set to `true`.
Co-authored-by: Sebastien Le Digabel <sebastien.ledigabel@skyscanner.net>
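A minimal sketch of how a controller-runtime based process (such as the github-webhook-server) can be limited to a single namespace, assuming controller-runtime v0.11. The `--watch-namespace` flag name is illustrative; the chart wires `scope.watchNamespace` into whatever flag the binary actually exposes.

```go
package main

import (
	"flag"
	"fmt"
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// --watch-namespace is an illustrative flag name; the chart derives the
	// actual value from scope.watchNamespace when scope.singleNamespace is true.
	var watchNamespace string
	flag.StringVar(&watchNamespace, "watch-namespace", "", "namespace to watch; empty watches all namespaces")
	flag.Parse()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Namespace restricts the manager's caches and watches to a single
		// namespace (controller-runtime v0.11); an empty value watches the cluster.
		Namespace: watchNamespace,
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "unable to create manager:", err)
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		fmt.Fprintln(os.Stderr, "manager exited:", err)
		os.Exit(1)
	}
}
```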