Commit Graph

380 Commits

Author SHA1 Message Date
Thang Le e7ec736738
Use head_branch metric (#2549)
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2023-05-28 16:36:55 +09:00
Daniel Hobley 90ea691e72
feat: allow for modifying `var-run` mount maximum size limit (#2624) 2023-05-27 11:47:23 +09:00
Bassem Dghaidi 8afef51c8b
Add DrainJobsMode (aka UpdateStrategy feature) (#2569) 2023-05-23 07:42:30 -04:00
argokasper a2d4b95b79
Fix GET validation for lowercase http methods (#2497)
Some requests send method in lowercase (verified with curl and as a default for AWS ALB health check requests), but Go HTTP library constant MethodGet is in upper.
2023-04-27 13:22:41 +09:00
Nuru 9bd4025e9c
Stricter filtering of check run completion events (#2520)
I observed that 100% of canceled jobs in my runner pool were not causing scale down events. This PR fixes that.

The problem was caused by #2119. 

#2119 ignores certain webhook events in order to fix #2118. However, #2119 overdoes it and filters out valid job cancellation events. This PR uses stricter filtering and add visibility for future troubleshooting.

<details><summary>Example cancellation event</summary>

This is the redacted top portion of a valid cancellation event my runner pool received and ignored.

```json
{
  "action": "completed",
  "workflow_job": {
    "id": 12848997134,
    "run_id": 4738060033,
    "workflow_name": "slack-notifier",
    "head_branch": "auto-update/slack-notifier-0.5.1",
    "run_url": "https://api.github.com/repos/nuru/<redacted>/actions/runs/4738060033",
    "run_attempt": 1,
    "node_id": "CR_kwDOB8Xtbc8AAAAC_dwjDg",
    "head_sha": "55bada8f3d0d3e12a510a1bf34d0c3e169b65f89",
    "url": "https://api.github.com/repos/nuru/<redacted>/actions/jobs/12848997134",
    "html_url": "https://github.com/nuru/<redacted>/actions/runs/4738060033/jobs/8411515430",
    "status": "completed",
    "conclusion": "cancelled",
    "created_at": "2023-04-19T00:03:12Z",
    "started_at": "2023-04-19T00:03:42Z",
    "completed_at": "2023-04-19T00:03:42Z",
    "name": "build (arm64)",
    "steps": [

    ],
    "check_run_url": "https://api.github.com/repos/nuru/<redacted>/check-runs/12848997134",
    "labels": [
      "self-hosted",
      "arm64"
    ],
    "runner_id": 0,
    "runner_name": "",
    "runner_group_id": 0,
    "runner_group_name": ""
  },
```

</details>
2023-04-27 13:15:23 +09:00
Yusuke Kuoka 94c089c407
Revert docker.sock path to /var/run/docker.sock (#2536)
Starting ARC v0.27.2, we've changed the `docker.sock` path from `/var/run/docker.sock` to `/var/run/docker/docker.sock`. That resulted in breaking some container-based actions due to the hard-coded `docker.sock` path in various places.

Even `actions/runner` seem to use `/var/run/docker.sock` for building container-based actions and for service containers?

Anyway, this fixes that by moving the sock file back to the previous location.

Once this gets merged, users stuck at ARC v0.27.1, previously upgraded to 0.27.2 or 0.27.3 and reverted back to v0.27.1 due to #2519, should be able to upgrade to the upcoming v0.27.4.

Resolves #2519
Resolves #2538
2023-04-27 13:06:35 +09:00
Nikola Jokic 9859bbc7f2
Use build.Version to check if resource version is a mismatch (#2521)
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
2023-04-24 10:40:15 +02:00
Yusuke Kuoka 3e0bc3f7be
Fix docker.sock permission error for non-dind Ubuntu 20.04 runners since v0.27.2 (#2499)
#2490 has been happening since v0.27.2 for non-dind runners based on Ubuntu 20.04 runner images. It does not affect Ubuntu 22.04 non-dind runners(i.e. runners with dockerd sidecars) and Ubuntu 20.04/22.04 dind runners(i.e. runners without dockerd sidecars). However, presuming many folks are still using Ubuntu 20.04 runners and non-dind runners, we should fix it.

This change tries to fix it by defaulting to the docker group id 1001 used by Ubuntu 20.04 runners, and use gid 121 for Ubuntu 22.04 runners. We use the image tag to see which Ubuntu version the runner is based on. The algorithm is so simple- we assume it's Ubuntu-22.04-based if the image tag contains "22.04".

This might be a breaking change for folks who have already upgraded to Ubuntu 22.04 runners using their own custom runner images. Note again; we rely on the image tag to detect Ubuntu 22.04 runner images and use the proper docker gid- Folks using our official Ubuntu 22.04 runner images are not affected. It is a breaking change anyway, so I have added a remedy-

ARC got a new flag, `--docker-gid`, which defaults to `1001` but can be set to `121` or whatever gid the operator/admin likes. This can be set to `--docker-gid=121`, for example, if you are using your own custom runner image based on Ubuntu 22.04 and the image tag does not contain "22.04".

Fixes #2490
2023-04-17 21:30:41 +09:00
Nikola Jokic ba1ac0990b
Reordering methods and constants so it is easier to look it up (#2501) 2023-04-12 09:50:23 +02:00
Nikola Jokic b86af190f7
Extend manager roles to accept ephemeralrunnerset/finalizers (#2493) 2023-04-10 08:49:32 +02:00
Nikola Jokic a804bf8b00
Add ImagePullPolicy to the AutoscalingListener, configurable through Manager env (#2477) 2023-04-04 19:07:20 +02:00
Nikola Jokic 5dea6db412
Fix helm uninstall cleanup by adding finalizers and cleaning them from the controller (#2433)
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
2023-04-03 21:06:12 +02:00
Stewart Thomson a7ef871248
Check if appID and instID are non-empty before attempting to parseInt (#2463) 2023-04-03 09:06:59 +09:00
Milas Bowman 878c9b8b49
runner: Use Docker socket via shared emptyDir instead of TCP/mTLS (#2324)
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2023-03-28 11:29:16 +09:00
Nikola Jokic 56e1c62ac2
Add labels to autoscaling runner set subresources to allow easier inspection (#2391)
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
2023-03-27 11:19:34 +02:00
Tingluo Huang 08acb1b831
Get RunnerScaleSet based on both RunnerGroupId and Name. (#2413) 2023-03-15 11:10:09 -04:00
Tingluo Huang 2bf83d0d7f
Remove list/watch secrets permission from the manager cluster role. (#2276) 2023-03-14 09:23:14 -04:00
Tingluo Huang 261d4371b5
Update E2E test workflow. (#2395) 2023-03-14 09:00:07 -04:00
Nikola Jokic babbfc77d5
Surface EphemeralRunnerSet stats to AutoscalingRunnerSet (#2382) 2023-03-13 16:16:28 +01:00
Tingluo Huang d7b589bed5
Helm chart react changes for the new runner image. (#2348) 2023-03-10 11:18:21 +00:00
Francesco Renzi c569304271
Add support for self-signed CA certificates (#2268)
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
2023-03-09 17:23:32 +00:00
Chris Patterson 41f2ca3ed9
Adding parameter to configure the runner set name. (#2279)
Co-authored-by: TingluoHuang <TingluoHuang@github.com>
2023-03-03 08:36:14 -05:00
Francesco Renzi 40c905f25d
Simplify the setup of controller tests (#2352) 2023-03-02 18:55:49 +00:00
Nikola Jokic 2984de912c
Split listener pod label to avoid long names issue (#2341) 2023-03-02 17:25:50 +01:00
Alex Williams 69abd51f30
Ensure that EffectiveTime is updated on webhook scale down (#2258)
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2023-03-01 08:27:37 +09:00
Francesco Renzi 73e22a1756
Disable metrics serving in proxy tests (#2307) 2023-02-22 16:57:59 +00:00
Francesco Renzi 6b4250ca90
Add support for proxy (#2286)
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
Co-authored-by: Ferenc Hammerl <fhammerl@github.com>
2023-02-21 17:33:48 +00:00
Nathan Klick ced88228fc
Resolves the erroneous webhook scale down due to check runs (#2119)
Signed-off-by: Nathan Klick <nathan@swirldslabs.com>
2023-02-21 10:56:46 +09:00
Andrei Vydrin 44c06c21ce
fix: case-insensitive webhook label matching (#2302)
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2023-02-21 09:37:42 +09:00
Nikola Jokic 8e52a6d2cf
EphemeralRunner: On cleanup, if pod is pending, delete from service (#2255)
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
2023-02-11 19:55:12 -05:00
Nikola Jokic 9990243520
Early return if finalizer does not exist to make it more readable (#2262) 2023-02-08 15:21:13 +01:00
Tingluo Huang facae69e0b
Remove un-required permissions for the manager-role of the new `AutoScalingRunnerSet` (#2260) 2023-02-07 12:37:09 -05:00
Nikola Jokic c4297d25bb
Avoid deleting scale set if annotation is not parsable or if it does not exist (#2239) 2023-02-03 17:27:31 +01:00
Tingluo Huang 1f4fe4681e
Delete RunnerScaleSet on service when AutoScalingRunnerSet is deleted. (#2223) 2023-01-31 15:03:11 -05:00
Tingluo Huang 803818162c
Allow update runner group for AutoScalingRunnerSet (#2216) 2023-01-27 09:27:52 -05:00
dependabot[bot] 219ba5b477
chore(deps): bump sigs.k8s.io/controller-runtime from 0.13.1 to 0.14.1 (#2132)
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2023-01-27 09:23:28 +09:00
Nikola Jokic 882bfab569
Renaming autoScaling to autoscaling in tests matching the convention (#2201) 2023-01-23 17:03:01 +01:00
Tingluo Huang 4932412cd6
Fix L0 test to make it more reliable. (#2178) 2023-01-19 07:33:04 -05:00
Tingluo Huang bb61bb1342
Include extra user-agent for runners created by actions-runner-controller. (#2177) 2023-01-18 07:38:59 +09:00
Tingluo Huang 622eaa34f8
Introduce new preview auto-scaling mode for ARC. (#2153)
Co-authored-by: Cory Miller <cory-miller@github.com>
Co-authored-by: Nikola Jokic <nikola-jokic@github.com>
Co-authored-by: Ava Stancu <AvaStancu@github.com>
Co-authored-by: Ferenc Hammerl <fhammerl@github.com>
Co-authored-by: Francesco Renzi <rentziass@github.com>
Co-authored-by: Bassem Dghaidi <Link-@github.com>
2023-01-17 12:06:20 -05:00
Tingluo Huang 044c8ad4d5
Include actions-runner-controller in runner's User-Agent for better telemetry in Actions service. (#2155) 2023-01-15 09:35:56 +09:00
Tingluo Huang eaa451df32
Update controller package names to match the owning API group name (#2150)
* Update controller package names to match the owning API group name

* feedback.

Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
2023-01-13 08:24:11 +09:00
Yusuke Kuoka bc4f4fee12
Fix various golangci-lint errors (#2147)
that we introduced via controller-runtime upgrade and via the removal of legacy pull-based scale triggers (#2001).
2023-01-13 07:14:36 +09:00
Nikola Jokic aa6dab5a9a
Changes to folder structure to allow multigroups and changed go mod name (#2105)
* Changed folder structure to allow multi group registration

* included actions.github.com directory for resources and controllers

* updated go module to actions/actions-runner-controller

* publish arc packages under actions-runner-controller

* Update charts/actions-runner-controller/docs/UPGRADING.md

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-12-28 09:38:34 +09:00
Callum Tait 418f719bdf
chore: highlight watch namespace (#2087)
* chore: highlight watch namespace

* chore: wording

Co-authored-by: toast-gear <toast-gear@users.noreply.github.com>
2022-12-12 08:39:04 +09:00
Yusuke Kuoka 96a930bfd9
Fix runner pod to not stuck in Terminating when runner got deleted before pod scheduling (#2043)
This fixes the said issue that I found while I was running a series of E2E tests to test other features and pull requestes I have recently contributed.
2022-11-27 11:13:38 +09:00
Yusuke Kuoka ae86b1a011
Use the patch API instead to prevent unnecessary field updates (#1998)
Fixes #1916
2022-11-22 12:09:24 +09:00
Yusuke Kuoka 86d7893d61
breaking: Make legacy webhook scale triggers no-op (#2001)
Ref #1607
2022-11-22 12:08:29 +09:00
Igor Sarkisov 8f374d561f
Do not explicitly set Privileged to false. (#2009)
Setting SecurityContext.Privileged bit to false, which is default,
prevents GKE from admitting Windows pods.  Privileged bit is not
supported on Windows.
2022-11-15 11:29:37 +09:00
malachiobadeyi fbdfe0df8c
1770 update log format and add additional fields to webhook server logs (#1771)
* 1770 update log format and add runID and Id to worflow logs

update tests, change log format for controllers.HorizontalRunnerAutoscalerGitHubWebhook

use logging package

remove unused modules

add setup name to setuplog

add flag to change log format

change flag name to enableProdLogConfig

move log opts to logger package

remove empty else and reset timeEncoder

update flag description

use get function to handle nil

rename flag and update logger function

Update main.go

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>

Update controllers/horizontal_runner_autoscaler_webhook.go

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>

Update logging/logger.go

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>

copy log opt per each NewLogger call

revert to use autoscaler.log

update flag descript and remove unused imports

add logFormat to readme

 rename setupLog to logger

make fmt

* Fix E2E along the way

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-11-04 10:46:58 +09:00
Yusuke Kuoka 23c8fe4a8b
Fix dead-lock when runner unregistration triggered before PV attachment (#1975)
This fixes an issue discovered while I was testing #1759. Please see the new comment in code for more information.
2022-11-04 06:29:19 +09:00
Yusuke Kuoka c74ad6195f
Fix runners to do their best to gracefully stop on pod eviction (#1759)
Ref #1535
Ref #1581

Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-11-01 20:30:10 +09:00
Yusuke Kuoka 710e2fbc3a
Prevent runner controller from recreating runner pod when pod was terminated externally (#1851) 2022-10-13 09:04:50 +09:00
Yusuke Kuoka 7ff5b7da8c
Handle missing runner ID more gracefully (#1855)
so that ARC respect the registration timeout, terminationGracePeriodSeconds and RUNNER_GRACEFUL_STOP_TIMEOUT(#1759) when the runner pod was terminated externally too early after its creation

While I was running E2E tests for #1759, I discovered a potential issue that ARC can terminate runner pods without waiting for the registration timeout of 10 minutes.

You won't be affected by this in normal circumstances, as this failure scenario can be triggered only when you or another K8s controller like cluster-autoscaler deleted the runner or the runner pod immediately after the runner or the runner pod has been created. But probably is it worth fixing it anyway because it's not impossible to trigger it?
2022-10-09 16:52:51 +09:00
Nicholas Farley a389292478
Allow `RunnerDeployment`s to configure `dnsPolicy` for runners (#1892)
* Add DnsPolicy field to RunnerPodSpec struct

* Ensure the runnerSpec's DNSPolicy is mirrored to the pod.Spec

* Run `make manifests`
2022-10-05 08:16:11 +09:00
Yusuke Kuoka e4879e7ae4 Tweak E2E and documentation about MTU configuration 2022-09-25 07:50:12 +09:00
renovate[bot] 0deb6809b9
fix(deps): update module sigs.k8s.io/controller-runtime to v0.13.0 (#1775)
* fix(deps): update module sigs.k8s.io/controller-runtime to v0.13.0

* fixup! fix(deps): update module sigs.k8s.io/controller-runtime to v0.13.0

* fixup! fixup! fix(deps): update module sigs.k8s.io/controller-runtime to v0.13.0

* fixup! fixup! fixup! fix(deps): update module sigs.k8s.io/controller-runtime to v0.13.0

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-09-21 11:04:07 +09:00
Cory Miller c91e76f169
Add golangci-lilnt to CI (#1794)
This introduces a linter to PRs to help with code reviews and code hygiene. I've also gone ahead and fixed (or ignored) the existing lints.

I've only setup the default linters right now. There are many more options that are documented at https://golangci-lint.run/.

The GitHub Action should add appropriate annotations to the lint job for the PR. Contributors can also lint locally using `make lint`.
2022-09-21 09:08:22 +09:00
Barun Mishra 921daff61b
Add cmd line arg for enterprise url. Fix enterprise bug. (#1)
* Add cmd line arg for enterprise url. Fix enterprise bug.

* Fix package import order

* Fix comment
2022-09-05 13:50:17 +01:00
Yusuke Kuoka d4fb6204cb Add TODO comment to the PVC reconciler 2022-08-27 07:14:16 +00:00
Yusuke Kuoka bdcde44642
chore: Bump go-github and minimum GHES version to 3.6 (#1747)
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1574
2022-08-24 13:08:40 +09:00
João Carlos Ferra de Almeida 36e95dad47
Fix/multitenancy enterprise url (#1725)
* Fix #1714

* Add Comment
2022-08-16 20:20:06 +09:00
Rahul Kumar 72ca998266
Add Additional Autoscaling Metrics to Prometheus (#1720)
* Add prometheus metrics for autoscaling

* Add desc for prometheus-metrics

* FIX: Typo

* Remove replicas_desired_before in metrics

* Remove Num prefix in metricws
2022-08-15 23:12:00 +09:00
oreonl e511401e51
fix: don't base64 decode secret strings (#1683) 2022-08-03 11:47:07 +09:00
Yusuke Kuoka 97404144eb
Fix excessive runnerreplicaset update issue since 0.25.0 (#1650)
Fixes #1643
2022-07-17 19:43:24 +09:00
Felipe Galindo Sanchez 584745b67d Minor improvements for runner groups
- Add group in runners columns
- Add constant for runner group and labels
2022-07-15 09:47:25 +09:00
Yusuke Kuoka 8071ac7066
Remove github-api-cache-duration flag and code (#1631)
This removes the flag and code for the legacy GitHub API cache. We already migrated to fully use the new HTTP cache based API cache functionality which had been added via #1127 and available since ARC 0.22.0. Since then, the legacy one had been no-op and therefore removing it is safe.

Ref #1412
2022-07-12 20:37:24 +09:00
Yusuke Kuoka 618276e3d3
Enhance support for multi-tenancy (#1371)
This enhances every ARC controller and the various K8s custom resources so that the user can now configure a custom GitHub API credentials (that is different from the default one configured per the ARC instance).

Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1067#issuecomment-1043716646
2022-07-12 09:45:00 +09:00
Yusuke Kuoka 1cfe1974c4 Add missing job-related permissions to runner pods with k8s container mode 2022-07-10 16:16:32 +09:00
Felipe Galindo Sanchez 11cb9b7882
feat: allow to discover runner statuses (#1268)
* feat: allow to discover runner statuses

* fix manifests

* Bump runner version to 2.289.1 which includes the hooks support

* Add feedback from review

* Update reference to newRunnerPod

* Fix TestNewRunnerPodFromRunnerController and make hooks file names job specific

* Fix additional TestNewRunnerPod test

* Cover additional feedback from review

* fix rbac manager role

* Add permissions to service account for container mode if not provided

* Rename flag to runner.statusUpdateHook.enabled and fix needsServiceAccount

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-07-10 15:11:29 +09:00
everpcpc fea1457f12
fix: annotate pod instead of label on UnregistrationFailureMessage (#1615)
The error message should go to annotation instead of pod label, since we only check annotations on autoscale:

https://github.com/actions-runner-controller/actions-runner-controller/blob/master/controllers/autoscaling.go#L338-L342

Then there is no need to truncate or normalize the error message.
2022-07-09 11:45:05 +09:00
Yusuke Kuoka bfc5ea4727
Fix a regression in webhook-based autoscaler (#1596)
The regression resulted in the webhook-based autoscaler be unable to find visible runner groups and therefore unable to scale up and down the target RunnerDeployment/RunnerSet at all when the webhook-based autoscaler was provided GitHub API credentials to enable the runner groups support. This fixes that.

The regression was introduced via #1578 which is not released yet. Users of existing ARC releases are therefore not affected.
2022-07-04 20:17:09 +09:00
Yusuke Kuoka a517c1ff66
Fix old runner pods stuck in Terminating since #1579 (#1585)
Ref #1579
2022-06-29 22:02:42 +09:00
Yusuke Kuoka 8161136cbd
Fix PercentageRunnersBusy scaling delay (#1579)
* Use a dedicated pod label to say it is a runner pod

Follow-up for #1546

* Fix PercentageRunnersBusy scaling delay

Ref #1374
2022-06-29 20:49:21 +09:00
Yusuke Kuoka ddd417f756
Bump go-github to v45 (#1573)
* Bump go-github to v45

Ref #1402

* fixup! Bump go-github to v45
2022-06-29 06:34:58 +09:00
Thomas Boop 0386c0734c
`containerMode` option to allow running jobs in k8's instead of docker (#1546)
* added containerMode=kubernetes env variables to the runner

* removed unused logging

* restored configs and charts

* restored makefile cert version and acceptance/run

* added workVolumeClaimTemplate in pod definition, including logic

* added claim template name based on the runner

* Apply suggestions from code review

update errors

* added concurrent cleanup before runner pod is deleted

* update manifests

* added retry after 30s if pod cleanup contains err

* added admission webhook check, made workVolumeClaimTemplate mandatory for k8s

* style changes and added comments

* added izZero timestamp check for deleting runner-linked pods

* changed order of local variable to avoid copy if p is deleted

* removed docker from container mode k8s

* restored charts, config, makefile

* restored forked files back and not the ARC ones

* created PersistentVolume on containerMode k8s

* create pv only if storage class name is local-storage

* removed actions if storage class name is local-storage

* added service account validation if container mode kubernetes

* changed the coding style to match rest of the ARC

* added validation to the runnerdeployment webhook

* specified fields more precisely, added webhook validation to the replicaset as well

* remake manifests

* wraped delete runner-linked-pods in kube mode

* fixed empty line

* fixed import

* makefile changes for hooks

* added cleanup secrets

* create manifests

* docs

* update access modes

* update dockerfile

* nit changes

* fixed dockerfile

* rewrite allowing reuse for runners and runnersets

* deepcopy forgot to stage

* changed privileged

* make manifests

* partly moved to finalizer, still need to apply finalizer first

* finalizer added if env variable used in container mode exists

* bump runner version

* error message moved from Error to Info on cleanup pods/secrets

* removed useless dereferencing, added transformation tests of workVolumeClaimTemplate

* Apply suggestions from code review

* Update controllers/utils_test.go

Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com>

* Update controllers/utils_test.go

Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com>

* add hook version to cli, update to 0.1.2

* Apply suggestions from code review

* Update controllers/utils_test.go

* Update runner/Makefile

* Fix missing secret permission and the error handling

* Fix a runnerpod reconciler finalizer to not trigger unnecessary retry

Co-authored-by: Nikola Jokic <nikola-jokic@github.com>
Co-authored-by: Nikola Jokic <97525037+nikola-jokic@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-06-28 14:12:40 +09:00
Yusuke Kuoka af96de6184
Fix completed runner pod recreation not to be blocked after max out (#1568)
Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1477#issuecomment-1164154496
2022-06-28 13:50:07 +09:00
Sam Weston bc7a3cab1b
Add priorityClassName to CRDs (#1513)
* Add pod priorityClassName to controller and crds

* Add missing bits in bases directory

* Regenerate crds
2022-06-28 08:45:19 +09:00
Yusuke Kuoka e2c8163b8c
Make webhook-based scale race-free (#1477)
* Make webhook-based scale operation asynchronous

This prevents race condition in the webhook-based autoscaler when it received another webhook event while processing another webhook event and both ended up
scaling up the same horizontal runner autoscaler.

Ref #1321

* Fix typos

* Update rather than Patch HRA to avoid race among webhook-based autoscaler servers

* Batch capacity reservation updates for efficient use of apiserver

* Fix potential never-ending HRA update conflicts in batch update

* Extract batchScaler out of webhook-based autoscaler for testability

* Fix log levels and batch scaler hang on start

* Correlate webhook event with scale trigger amount in logs

* Fix log message
2022-06-27 18:31:48 +09:00
Yusuke Kuoka 0ef9a22cd4
Fix confusing PV controller log (#1526)
Ref #1511
2022-06-14 08:35:04 +09:00
Yusuke Kuoka 9dd26168d6
Fix confusing logs from pv and pvc controllers (#1487)
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1321#issuecomment-1137431212
2022-05-26 18:21:23 +09:00
Yusuke Kuoka c7eea169ad
test: Add test for runner with generic ephemeral volume as "work" (#1472)
This adds the test to verify the runner pod generation logic for the case that you use a generic ephemeral volume as "work".
It is almost an adaptation of the test cases writetn for RunnerSet in #1471, to RunnerDeployment and Runner.
2022-05-25 10:37:23 +09:00
Yusuke Kuoka 63be0223ad
fix: Avoid duplicate volume and mount name error for generic ephemeral volume as "work" (#1471)
* fix: Avoid duplicate volume and mount name error for generic ephemeral volume as "work"

While manually testing configurations being documented in #1464, I discovered that the use of dynamic ephemeral volume for "work" directory was not working correctly due to the valiadation error.

This fixes the runner pod generation logic to not add the default volume and volume mount for "work" dir, so that the error disappears.

Ref #1464

* e2e: Ensure work generic ephemeral volume to work as expected
2022-05-22 10:25:50 +09:00
Yusuke Kuoka 3e988afc09
test: add fuzzing to the test suite (#1463)
As a part of #1298, we add fuzzing based on Go test's fuzzing support to the test suite
2022-05-19 13:34:23 +01:00
Felipe Galindo Sanchez 78c01fd31d
cleanup some unused code and minor refactors (#1274)
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-05-16 18:38:32 +09:00
Callum Tait 65f7ee92a6
refactor: remove registration runner dead code (#1260)
We had some dead code left over from the removal of registration runners. Registration runners were removed in #859 #1207

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-05-16 11:23:39 +09:00
Yusuke Kuoka b5194fd75a
Enhance RunnerSet to optionally retain PVs accross restarts (#1340)
* Enhance RunnerSet to optionally retain PVs accross restarts

This is our initial attempt to bring back the ability to retain PVs across runner pod restarts when using RunnerSet.
The implementation is composed of two new controllers, `runnerpersistentvolumeclaim-controller` and `runnerpersistentvolume-controller`.
It all starts from our existing `runnerset-controller`. The controller now tries to mark any PVCs created by StatefulSets created for the RunnerSet.
Once the controller terminated statefulsets, their corresponding PVCs are clean up by `runnerpersistentvolumeclaim-controller`, then PVs are unbound from their corresponding PVCs by `runnerpersistentvolume-controller` so that they can be reused by future PVCs createf for future StatefulSets that shares the same same StorageClass.

Ref #1286

* Update E2E test suite to cover runner, docker, and go caching with RunnerSet + PVs

Ref #1286
2022-05-16 09:26:48 +09:00
Yusuke Kuoka 759349de11
fix: force restartPolicy "Never" to prevent runner pods from stucking in Terminating when the container disappeared (#1395)
Ref #1369
2022-05-14 09:07:17 +01:00
Yusuke Kuoka e46b90f758
fix: runner pods managed by RunnerSet to not stuck in Terminating (#1420)
This is intended to fix #1369 mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might work well for RunnerDeployment in cases that #1395 does not work, like in a case that the user explicitly set the runner pod restart policy to anything other than "Never".

Ref #1369
2022-05-12 09:34:27 +01:00
Yusuke Kuoka 3a7e8c844b
feat: Support arbitrarily setting `privileged: true` for runner container (#1383)
Resolves #1282
2022-05-12 09:25:51 +01:00
Yusuke Kuoka dabbc99c78
refactor(controller): stop auto-setting RUNNER_FEATURE_FLAG_EPHEMERAL (#1385)
This feature flag was provided from ARC to runner container automatically to let it use `--ephemeral` instead of `--once` by default. As the support for `--once` is being dropped from the runner image via #1384, we no longer need that.

Ref #1196
2022-05-11 11:42:55 +01:00
Yusuke Kuoka 4053ab3e11
Fix label support for TotalNumberOfQueuedAndInProgressWorkflowRuns metric (#1390)
In #1373 we made two mistakes:

- We mistakenly checked if all the runner labels are included in the job labels and only after that it marked the target as eligible for scale. It should definitely be the opposite!
- We mistakenly checked for the existence of `self-hosted` labe l in the job. [Although it should be a good practice to explicitly say `runs-on: ["self-hosted", "custom-label"]`](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-using-labels-for-runner-selection), that's not a requirement so we should code accordingly.

The consequence of those two mistakes was that, for example, jobs with `self-hosted` + `custom` labels didn't result in scaling runner with `self-hosted` + `custom` + `custom2`. This should fix that.

Ref #1056
Ref #1373
2022-04-27 16:24:21 +01:00
Yusuke Kuoka 1c726ae20c
chore: Add unit test for RunnerReconciler.newPod (#1382)
Adds some unit tests for the runner pod generation logic that is used internally by runner controller as preparation for #1282
2022-04-25 09:59:21 +09:00
Yusuke Kuoka d6cdd5964c
chore: Add unit test for newRunnerPod (#1381)
Adds some unit tests for the runner pod generation logic that is used internally by runner deployment and runner set controllers as preparation for #1282
2022-04-25 08:52:58 +09:00
Yusuke Kuoka a622968ff2
feat: Add label support to TotalNumberOfQueuedAndInProgressWorkflowRuns metric (#1373)
This is an implementation for my intepretation of the "bronze" case proposed in #1056

Ref #1056
2022-04-24 14:41:34 +09:00
Yusuke Kuoka 1551f3b5fc
Remove the default `githubEvent: {}` requiring a event to be defined (#1379)
Ref #1358
2022-04-24 13:37:26 +09:00
Yusuke Kuoka 3ba7179995
Do not enable TotalNumberOfQueuedAndInProgressWorkflowRuns by default (#1372)
Previously, omitting hra.spec.metrics at all resulted in ARC enabling the TotalNumberOfQueuedAndInProgressWorkflowRuns.
That turned out not a good idea so since this change it is not enabled by default.

Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/728
2022-04-24 13:36:42 +09:00
Callum Tait 24aae58dbc
feat: default scale down flag (#963)
Resolves #899

Co-authored-by: Callum <callum@domain.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-04-20 11:09:09 +09:00
Jeff Billimek 13bfa2da4e
Fix runner pod dnsConfig (#1227)
Fixes #1226
Fixes #1224

Signed-off-by: Jeff Billimek <jeff@billimek.com>
2022-04-20 10:55:20 +09:00
Patrick Ellis 7a5a6381c3
Add WorkflowJob to GitHubEventScaleUpTriggerSpec types (#922) 2022-04-20 09:59:08 +09:00