Commit Graph

121 Commits

Author SHA1 Message Date
Callum Tait 65f7ee92a6
refactor: remove registration runner dead code (#1260)
We had some dead code left over from the removal of registration runners. Registration runners were removed in #859 #1207

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-05-16 11:23:39 +09:00
Yusuke Kuoka 759349de11
fix: force restartPolicy "Never" to prevent runner pods from stucking in Terminating when the container disappeared (#1395)
Ref #1369
2022-05-14 09:07:17 +01:00
Yusuke Kuoka e46b90f758
fix: runner pods managed by RunnerSet to not stuck in Terminating (#1420)
This is intended to fix #1369 mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might work well for RunnerDeployment in cases that #1395 does not work, like in a case that the user explicitly set the runner pod restart policy to anything other than "Never".

Ref #1369
2022-05-12 09:34:27 +01:00
Yusuke Kuoka 3a7e8c844b
feat: Support arbitrarily setting `privileged: true` for runner container (#1383)
Resolves #1282
2022-05-12 09:25:51 +01:00
Yusuke Kuoka dabbc99c78
refactor(controller): stop auto-setting RUNNER_FEATURE_FLAG_EPHEMERAL (#1385)
This feature flag was provided from ARC to runner container automatically to let it use `--ephemeral` instead of `--once` by default. As the support for `--once` is being dropped from the runner image via #1384, we no longer need that.

Ref #1196
2022-05-11 11:42:55 +01:00
Jeff Billimek 13bfa2da4e
Fix runner pod dnsConfig (#1227)
Fixes #1226
Fixes #1224

Signed-off-by: Jeff Billimek <jeff@billimek.com>
2022-04-20 10:55:20 +09:00
Yusuke Kuoka b09c54045a
Prevent runners from stuck in Terminating when pod disappeared without standard termination process (#1318)
This fixes the said issue by additionally treating any runner pod whose phase is Failed or the runner container exited with non-zero code as "complete" so that ARC gives up unregistering the runner from Actions, deletes the runner pod anyway.

Note that there are a plenty of causes for that. If you are deploying runner pods on AWS spot instances or GCE preemptive instances and a job assigned to a runner took more time than the shutdown grace period provided by your cloud provider (2 minutes for AWS spot instances), the runner pod would be terminated prematurely without letting actions/runner unregisters itself from Actions. If your VM or hypervisor failed then runner pods that were running on the node will become PodFailed without unregistering runners from Actions.

Please beware that it is currently users responsibility to clean up any dangling runner resources on GitHub Actions.

Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1307
Might also relate to https://github.com/actions-runner-controller/actions-runner-controller/issues/1273
2022-04-08 10:17:33 +09:00
Felipe Galindo Sanchez e7e48a77e4 Merge remote-tracking branch 'upstream/master' into patch-4 2022-04-04 09:54:08 -07:00
Yusuke Kuoka 631a70a35f
Fix runner pod to be cleaned up earlier regardless of the sync period (#1299)
Ref #1291
2022-04-03 11:12:44 +09:00
Felipe Galindo Sanchez 9657d3e5b3
Fix deleting a runner when pod was deleted
With the current implementation if a pod is deleted, controller is failing to delete the runner as it's trying to annotate a pod that doesn't exist as we're passing a new pod object that is not an existing resource
2022-03-22 14:44:50 -07:00
Yusuke Kuoka 8ca39caff5 Fix log message on runner deletion 2022-03-13 12:11:11 +00:00
Yusuke Kuoka c4b24f8366 Prevent static runners from terminating due to unregister timeout
The unregister timeout of 1 minute (no matter how long it is) can negatively impact availability of static runner constantly running workflow jobs, and ephemeral runner that runs a long-running job.
We deal with that by completely removing the unregistaration timeout, so that regarldess of the type of runner(static or ephemeral) it waits forever until it successfully to get unregistered before being terminated.
2022-03-13 07:26:36 +00:00
Yusuke Kuoka fa287c4395 Fix RunnerDeployment-managed runner pods to not get RUNNER_NAME and RUNNER_TOKEN injected twice
Since #1179, runner pods managed by RunnerDeployment had two duplicate environment variables for RUNNER_NAME and RUNNER_TOKEN. This fixes that.
2022-03-12 13:49:50 +00:00
Yusuke Kuoka 051089733b Use --ephemeral by default
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1189
2022-03-12 13:20:07 +00:00
Yusuke Kuoka 9628bb2937
Prevent RemoveRunner spam on busy ephemeral runner scale down (#1204)
Since #1127 and #1167, we had been retrying `RemoveRunner` API call on each graceful runner stop attempt when the runner was still busy.
There was no reliable way to throttle the retry attempts. The combination of these resulted in ARC spamming RemoveRunner calls(one call per reconciliation loop but the loop runs quite often due to how the controller works) when it failed once due to that the runner is in the middle of running a workflow job.

This fixes that, by adding a few short-circuit conditions that would work for ephemeral runners. An ephemeral runner can unregister itself on completion so in most of cases ARC can just wait for the runner to stop if it's already running a job. As a RemoveRunner response of status 422 implies that the runner is running a job, we can use that as a trigger to start the runner stop waiter.

The end result is that 422 errors will be observed at most once per the whole graceful termination process of an ephemeral runner pod. RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less GitHub API rate limit budget and logs are much cleaner than before.

Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1167#issuecomment-1064213271
2022-03-11 19:03:17 +09:00
Yusuke Kuoka 14a878bfae refactor: Make RunnerReplicaSet and Runner backed by the same logic that backs RunnerSet 2022-03-06 05:53:26 +00:00
Felipe Galindo Sanchez d20ad71071
Fix minor log in runner controller (#1175)
Log is mentioning registration only but this is about the standard runner pod
2022-03-03 09:51:30 +09:00
Felipe Galindo Sanchez 27563c4378
Remove unused function (#1173) 2022-03-03 09:02:47 +09:00
Felipe Galindo Sanchez 4a0f68bfe3
Cleanup extra block in runner controller (#1174) 2022-03-03 09:01:34 +09:00
Yusuke Kuoka a3072c110d Prevent runnerset pod unregistration until it gets runner ID
This eliminates the race condition that results in the runner terminated prematurely when RunnerSet triggered unregistration of StatefulSet that added just a few seconds ago.
2022-03-02 19:03:20 +09:00
Yusuke Kuoka 11be6c1fb6 Prevent runner pod deletion delay when pod disappeared before unregistration 2022-03-02 19:03:20 +09:00
Yusuke Kuoka a6f0e0008f Make unregistration timeout and retry delay configurable in integration tests 2022-02-20 12:05:34 +00:00
Yusuke Kuoka 79a31328a5 Stop recreating ephemeral runner pod
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/911#issuecomment-1046161384
2022-02-20 04:42:19 +00:00
Yusuke Kuoka 3c16188371 Introduce consistent timeouts for runner unregistration and runner pod deletion
Enhances runner controller and runner pod controller to have consistent timeouts for runner unregistration and runner pod deletion,
so that we are very much unlikely to terminate pods that are running any jobs.
2022-02-20 04:36:35 +00:00
Tingluo Huang 0b9bef2c08
Try to unconfig runner before deleting the pod to recreate (#1125)
There is a race condition between ARC and GitHub service about deleting runner pod.

- The ARC use REST API to find a particular runner in a pod that is not running any jobs, so it decides to delete the pod.
- A job is queued on the GitHub service side, and it sends the job to this idle runner right before ARC deletes the pod.
- The ARC delete the runner pod which cause the in-progress job to end up canceled.

To avoid this race condition, I am calling `r.unregisterRunner()` before deleting the pod.
- `r.unregisterRunner()` will return 204 to indicate the runner is deleted from the GitHub service, we should be safe to delete the pod.
- `r.unregisterRunner` will return 400 to indicate the runner is still running a job, so we will leave this runner pod as it is.

TODO: I need to do some E2E tests to force the race condition to happen.

Ref #911
2022-02-19 21:22:31 +09:00
Yusuke Kuoka a5ed6bd263
Fix RunerSet managed runner pods to terminate more gracefully (#1126)
Make RunnerSet-managed runners as reliable as RunnerDeployment-managed runners.

Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/911#issuecomment-1042404460
2022-02-19 21:19:37 +09:00
Yusuke Kuoka 921f547200
fix: Do recreate runner pod on registration token update (#1087)
Apparently, we've been missed taking an updated registration token into account when generating the pod template hash which is used to detect if the runner pod needs to be recreated.

This shouldn't have been the end of the world since the runner pod is recreated on the next reconciliation loop anyway, but this change will make the pod recreation happen one reconciliation loop earlier so that you're less likely to get runner pods with outdated refresh tokens.

Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1085#issuecomment-1027433365
2022-02-19 21:18:00 +09:00
Daniel 8a73560dbc
if a Volume is defined by the operator don't add another "work" volume. (#1015)
This allows providing a different `work` Volume.

This should be a cloud agnostic way of allowing the operator to use (for example) NVME backed storage.

This is a working example where the workDir will use the provided volume, additionally here docker is placed on the same NVME.
```
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runner-2
spec:
  template:
    spec:
      dockerdContainerResources: {}
      env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      # this is to mount the docker in docker onto NVME disk
      dockerVolumeMounts:
      - mountPath: /var/lib/docker
        name: scratch
        subPathExpr: $(POD_NAME)-docker
      - mountPath: /runner/_work
        name: work
        subPathExpr: $(POD_NAME)-work
      volumeMounts:
      - mountPath: /runner/_work
        name: work
        subPathExpr: $(POD_NAME)-work
      dockerEnv:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      volumes:
      - hostPath:
          path: /mnt/disks/ssd0
        name: scratch
      - hostPath:
          path: /mnt/disks/ssd0
        name: work
      nodeSelector:
       cloud.google.com/gke-nodepool: runner-16-with-nvme
      ephemeral: false
      image: ""
      imagePullPolicy: Always
      labels:
        - runner-2
        - self-hosted
      organization: yourorganization

```
2022-01-07 10:01:40 +09:00
Callum Tait ad48851dc9
feat: expose if docker is enabled and wait for docker to be ready (#962)
Resolves #897
Resolves #915
2021-12-29 10:23:35 +09:00
renovate[bot] c64000e11c
fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.0 (#740)
* fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.0

* Fix dependencies and bump Go to 1.17 so that it builds after controller-runtime 0.11.0 upgrade

* Regenerate manifests with the latest K8s dependencies

Co-authored-by: Renovate Bot <bot@renovateapp.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-12-17 09:06:55 +09:00
Felipe Galindo Sanchez 9bb21aef1f
Add support for default image pull secret name (#921)
Resolves #896

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-12-15 09:29:31 +09:00
Pavel Smalenski 91102c8088
Add dockerEnv variable for RunnerDeployment (#912)
Resolves #878

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-12-14 17:13:24 +09:00
Felipe Galindo Sanchez f0fccc020b
refactor: split Reconciler from Reconcile in a few methods (#926)
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-12-12 14:22:55 +09:00
Patrick Ellis ea2dbc2807
Update go-github from v37 -> v39 (#925) 2021-12-11 21:43:40 +09:00
Yusuke Kuoka 898ad3c355
Work-around for offline+busy runners (#993)
Ref #911
2021-12-09 09:31:06 +09:00
apr-1985 0d3de9ee2a
chore: correct logging typo (#904) 2021-10-22 09:03:23 +09:00
Maxim Pogozhiy fce7d6d2a7
Add topologySpreadConstraints (#814) 2021-10-17 21:49:44 +01:00
Tristan Keen 5e3f89bdc5
Correct test to append docker container (#837)
Fixes #835
2021-09-24 09:18:20 +09:00
Tristan Keen 1eb135cace Correct default image logic 2021-09-14 17:00:57 +09:00
Rolf Ahrenberg 14564c7b8e
Allow disabling /runner emptydir mounts and setting storage volume (#674)
* Allow disabling /runner emptydir mounts

* Support defining storage medium for emptydirs

* Fix typos
2021-07-15 06:29:58 +09:00
Sebastien Le Digabel 7f2795b5d6
Adding a default docker registry mirror (#689)
* Adding a default docker registry mirror

This change allows the controller to start with a specified default
docker registry mirror and avoid having to specify it in all the runner*
objects.

The change is backward compatible, if a runner has a docker registry
mirror specified, it will supersede the default one.
2021-07-15 06:20:08 +09:00
Yusuke Kuoka f858e2e432
Add POC of GitHub Webhook Delivery Forwarder (#682)
* Add POC of GitHub Webhook Delivery Forwarder

* multi-forwarder and ctrl-c existing and fix for non-woring http post

* Rename source files

* Extract signal handling into a dedicated source file

* Faster ctrl-c handling

* Enable automatic creation of repo hook on startup

* Add support for forwarding org hook deliveries

* Set hook secret on hook creation via envvar (HOOK_SECRET)

* Fix org hook support

* Fix HOOK_SECRET for consistency

* Refactor to prepare for custom log position provider

* Refactor to extract inmemory log position provider

* Add configmap-based log position provider

* Rename githubwebhookdeliveryforwarder to hookdeliveryforwarder

* Refactor to rename LogPositionProvider to Checkpointer and extract ConfigMap checkpointer into a dedicated pkg

* Refactor to extract logger initialization

* Add hookdeliveryforwarder README and bump go-github to unreleased ver
2021-07-14 10:18:55 +09:00
Yusuke Kuoka 6f130c2db5
Fix dockerdWithinRunnerContainer for Runner(Deployment) not working in the main branch (#696)
Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/674#issuecomment-878600993
2021-07-13 18:14:15 +09:00
Yusuke Kuoka f19e7ea8a8
chore: Upgrade go-github to v36 (#681) 2021-07-04 17:43:52 +09:00
Yusuke Kuoka acb906164b
RunnerSet: Automatic-recovery from registration timeout and deregistration on pod termination (#652)
Ref #629
Ref #613
Ref #612
2021-06-24 20:39:37 +09:00
Yusuke Kuoka 8b90b0f0e3
Clean up import list (#645)
Resolves #644
2021-06-22 17:55:06 +09:00
Jonathan Gonzalez V a277489003
Added support to enable and disable enableServiceLinks. (#628)
This option expose internally some `KUBERNETES_*` environment variables
that doesn't allow the runner to use KinD (Kubernetes in Docker) since it will
try to connect to the Kubernetes cluster where the runner it's running.

This option it's set by default to `true` in any Kubernetes deployment.

Signed-off-by: Jonathan Gonzalez V <jonathan.gonzalez@enterprisedb.com>
2021-06-22 17:27:26 +09:00
Yusuke Kuoka 9e4dbf497c
feat: RunnerSet backed by StatefulSet (#629)
* feat: RunnerSet backed by StatefulSet

Unlike a runner deployment, a runner set can manage a set of stateful runners by combining a statefulset and an admission webhook that mutates statefulset-managed pods with required envvars and registration tokens.

Resolves #613
Ref #612

* Upgrade controller-runtime to 0.9.0

* Bump Go to 1.16.x following controller-runtime 0.9.0

* Upgrade kubebuilder to 2.3.2 for updated etcd and apiserver following local setup

* Fix startup failure due to missing LeaderElectionID

* Fix the issue that any pods become unable to start once actions-runner-controller got failed after the mutating webhook has been registered

* Allow force-updating statefulset

* Fix runner container missing work and certs-client volume mounts and DOCKER_HOST and DOCKER_TLS_VERIFY envvars when dockerdWithinRunner=false

* Fix runnerset-controller not applying statefulset.spec.template.spec changes when there were no changes in runnerset spec

* Enable running acceptance tests against arbitrary kind cluster

* RunnerSet supports non-ephemeral runners only today

* fix: docker-build from root Makefile on intel mac

* fix: arch check fixes for mac and ARM

* ci: aligning test data format and patching checks

* fix: removing namespace in test data

* chore: adding more ignores

* chore: removing leading space in shebang

* Re-add metrics to org hra testdata

* Bump cert-manager to v1.1.1 and fix deploy.sh

Co-authored-by: toast-gear <15716903+toast-gear@users.noreply.github.com>
Co-authored-by: Callum James Tait <callum.tait@photobox.com>
2021-06-22 17:10:09 +09:00
Jonah Back 8c42f99d0b
feat: avoid setting privileged flag if seLinuxOptions is not null (#599)
Sets the privileged flag to false if SELinuxOptions are present/defined. This is needed because containerd treats SELinux and Privileged controls as mutually exclusive. Also see https://github.com/containerd/cri/blob/aa2d5a97c/pkg/server/container_create.go#L164.

This allows users who use SELinux for managing privileged processes to use GH Actions - otherwise, based on the SELinux policy, the Docker in Docker container might not be privileged enough. 

Signed-off-by: Jonah Back <jonah@jonahback.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-06-04 08:59:11 +09:00
Ameer Ghani 7523ea44f1
feat: allow specifying runtime class in runner spec (#580)
This allows using the `runtimeClassName` directive in the runner's spec.

One of the use-cases for this is Kata Containers, which use `runtimeClassName` in a pod spec as an indicator that the pod should run inside a Kata container. This allows us a greater degree of pod isolation.
2021-06-04 08:56:43 +09:00