Port ADRs from internal repo (#2267)

Ferenc Hammerl 2023-02-08 14:42:45 +01:00 committed by GitHub
parent facae69e0b
commit 08919814b1
5 changed files with 363 additions and 0 deletions


@@ -0,0 +1,90 @@
# ADR 0001: Produce the runner image for the scaleset client
**Date**: 2022-10-17
**Status**: Done
# Context
Users can bring their own runner images. The contract we have is:
- It must have the runner binary under /actions-runner (/actions-runner/run.sh exists)
- The WORKDIR is set to /actions-runner
- If the user inside the container is root, the environment variable RUNNER_ALLOW_RUNASROOT should be set to 1
The existing ARC runner images will not work with the new ARC mode out of the box, for the following reasons:
- The current runner image requires the caller to pass runner configuration info, e.g. URL and config token.
- The current runner image has the runner binary under /runner.
- The current runner image requires a special entrypoint script to work around some volume mount limitations for setting up DinD.
However, since we expose the raw runner pod spec to our users, advanced users can modify the Helm values.yaml to make everything line up properly, as sketched below.
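For illustration, here is a minimal sketch of such a values.yaml override, assuming the chart exposes the raw pod spec under a `template` key (the key names and layout below are assumptions for illustration, not the chart's confirmed schema):
```yaml
# Hypothetical values.yaml fragment for a bring-your-own runner image.
# The `template` key and the field layout are assumptions for illustration only.
template:
  spec:
    containers:
      - name: runner
        image: my-registry.example.com/custom-runner:latest  # user-provided image
        command: ["/runner/run.sh"]            # the runner binary lives under /runner in this image
        env:
          - name: RUNNER_ALLOW_RUNASROOT       # required when the container user is root
            value: "1"
```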
# Guiding Principles
- The image build is separated into two stages.
## The first stage (build)
- Reuses the same base image, so it is faster to build.
- Installs utilities needed to download assets (runner and runner-container-hooks).
- Downloads the runner and stores it into `/actions-runner` directory.
- Downloads the runner-container-hooks and stores it into `/actions-runner/k8s` directory.
- You can use build arguments to control the runner version, the target platform, and the runner-container-hooks version.
Preview:
```Dockerfile
FROM mcr.microsoft.com/dotnet/runtime-deps:6.0 as build
ARG RUNNER_ARCH="x64"
ARG RUNNER_VERSION=2.298.2
ARG RUNNER_CONTAINER_HOOKS_VERSION=0.1.3
RUN apt update -y && apt install curl unzip -y
WORKDIR /actions-runner
RUN curl -f -L -o runner.tar.gz https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-${RUNNER_ARCH}-${RUNNER_VERSION}.tar.gz \
&& tar xzf ./runner.tar.gz \
&& rm runner.tar.gz
RUN curl -f -L -o runner-container-hooks.zip https://github.com/actions/runner-container-hooks/releases/download/v${RUNNER_CONTAINER_HOOKS_VERSION}/actions-runner-hooks-k8s-${RUNNER_CONTAINER_HOOKS_VERSION}.zip \
&& unzip ./runner-container-hooks.zip -d ./k8s \
&& rm runner-container-hooks.zip
```
## The second stage (main image)
- Copies assets from the build stage to `/actions-runner`
- Does not provide an entrypoint. The entrypoint should be set within the container definition.
Preview:
```Dockerfile
FROM mcr.microsoft.com/dotnet/runtime-deps:6.0
WORKDIR /actions-runner
COPY --from=build /actions-runner .
```
## Example of pod spec with the init container copying assets
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: <name>
spec:
  containers:
    - name: runner
      image: <image>
      command: ["/runner/run.sh"]
      volumeMounts:
        - name: runner
          mountPath: /runner
  initContainers:
    - name: setup
      image: <image>
      command: ["sh", "-c", "cp -r /actions-runner/* /runner/"]
      volumeMounts:
        - name: runner
          mountPath: /runner
  volumes:
    - name: runner
      emptyDir: {}
```


@@ -0,0 +1,54 @@
# ADR 0003: Lifetime of RunnerScaleSet on Service
**Date**: 2022-10-27
**Status**: Done
## Context
We have created the RunnerScaleSet object and APIs around it on the GitHub Actions service for better support of any self-hosted runner auto-scale solution, like [actions-runner-controller](https://github.com/actions-runner-controller/actions-runner-controller).
The `RunnerScaleSet` object will represent a set of homogeneous self-hosted runners to the Actions service job routing system.
A `RunnerScaleSet` client (ARC) needs to communicate with the Actions service via HTTP long-poll, using a certain protocol, to get a workflow job to land successfully on one of its homogeneous self-hosted runners.
In this ADR, I want to discuss the following within the context of actions-runner-controller's new scaling mode:
- Who creates a RunnerScaleSet on the service, and how?
- Who deletes a RunnerScaleSet on the service, and how?
- What happens to the runners and jobs when that deletion happens?
## RunnerScaleSet creation
- `AutoScalingRunnerSet` custom resource controller will create the `RunnerScaleSet` object in the Actions service on any `AutoScalingRunnerSet` resource deployment.
- The creation is via REST API on Actions service `POST _apis/runtime/runnerscalesets`
- The creation needs to use the runner registration token (admin).
- `RunnerScaleSet.Name` == `AutoScalingRunnerSet.metadata.Name`
- The created `RunnerScaleSet` will have only one label, and it's the `RunnerScaleSet`'s name.
- The `AutoScalingRunnerSet` controller will store the `RunnerScaleSet.Id` as an annotation on the k8s resource for future lookup (sketched below).
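As a rough sketch (the annotation key below is an assumption for illustration, not necessarily the exact name the controller uses), the deployed resource could carry the stored id like this:
```yaml
# Hypothetical AutoScalingRunnerSet after creation; the annotation key is illustrative only.
apiVersion: actions.github.com/v1alpha1
kind: AutoscalingRunnerSet
metadata:
  name: my-scale-set                                 # RunnerScaleSet.Name == metadata.name
  annotations:
    actions.github.com/runner-scale-set-id: "42"     # RunnerScaleSet.Id stored for future lookup
spec:
  githubConfigUrl: https://github.com/my-org/my-repo
```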
## RunnerScaleSet modification
- When the user patches an existing `AutoScalingRunnerSet`'s RunnerScaleSet-related properties, e.g. `runnerGroupName` or `runnerWorkDir`, the controller needs to make an HTTP PATCH call to the `_apis/runtime/runnerscalesets/2` endpoint in order to update the object on the service (see the sketch after this list).
- We will put the deployed `AutoScalingRunnerSet` resource in an error state when the user tries to patch the resource with a different `githubConfigUrl`.
> Basically, you can't move a deployed `AutoScalingRunnerSet` across GitHub entities, e.g. repoA->repoB, repoA->OrgC, etc.
> We evaluated blocking the change up front instead of erroring at runtime, but decided not to go down this route because it would force us to re-introduce admission webhooks (which require cert-manager).
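A minimal sketch of an allowed patch, assuming the spec field names match the properties mentioned above (they may differ in the final CRD):
```yaml
# Hypothetical patched AutoScalingRunnerSet spec.
apiVersion: actions.github.com/v1alpha1
kind: AutoscalingRunnerSet
metadata:
  name: my-scale-set
spec:
  githubConfigUrl: https://github.com/my-org/my-repo   # changing this would put the resource in an error state
  runnerGroupName: my-new-runner-group                 # allowed change; the controller PATCHes the service-side object
```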
## RunnerScaleSet deletion
- `AutoScalingRunnerSet` custom resource controller will delete the `RunnerScaleSet` object in the Actions service on any `AutoScalingRunnerSet` resource deletion.
> `AutoScalingRunnerSet` deletion will contain several steps:
> - Stop the listener app so no new jobs come in and no more scaling up/down happens.
> - Request scale down to 0
> - Force stop all runners
> - Wait for the scale down to 0
> - Delete the `RunnerScaleSet` object from service via REST API
- The deletion is via REST API on Actions service `DELETE _apis/runtime/runnerscalesets/1`
- The deletion needs to use the runner registration token (admin).
The user's `RunnerScaleSet` will be deleted from the service by `DormantRunnerScaleSetCleanupJob` if the particular `AutoScalingRunnerSet` has not connected to the service for the past 7 days. We have a similar rule for self-hosted runners.
## Jobs and Runners on deletion
- `RunnerScaleSet` deletion will be blocked if there is any job assigned to a runner within the `RunnerScaleSet`, so the scale set has to scale down to 0 before deletion.
- Any job that has been assigned to the `RunnerScaleSet` but hasn't been assigned to a runner within the `RunnerScaleSet` will get thrown back to the queue and wait for assignment again.
- Any offline runners within the `RunnerScaleSet` will be deleted from the service side.


@@ -0,0 +1,51 @@
# ADR 0004: Technical detail about actions-runner-controller repository transfer
**Date**: 2022-11-04
**Status**: Done
# Context
As part of ARC Private Beta: Repository Migration & Open Sourcing Process, we have decided to transfer the current [actions-runner-controller repository](https://github.com/actions-runner-controller/actions-runner-controller) into the [Actions org](https://github.com/actions).
**Goals:**
- A clear signal that GitHub will start taking over ARC and provide support.
- Since we are going to deprecate the existing auto-scale mode in ARC at some point, we want to have a clear separation between the legacy mode (not supported) and the new mode (supported).
- Avoid disrupting users as much as we can: existing ARC users will not notice any difference after the repository transfer; they can keep upgrading to newer versions of ARC and keep using the legacy mode.
**Challenges:**
- The original creator's name (`summerwind`) is all over the place, including some critical parts of ARC:
  - The k8s user resource API's full name is `actions.summerwind.dev/v1alpha1/RunnerDeployment`; renaming it to `actions.github.com` is a breaking change and would force users to rebuild their entire k8s cluster.
  - All Docker images around ARC (controller + default runner) are published to [dockerhub/summerwind](https://hub.docker.com/u/summerwind).
  - The Helm chart for ARC is currently hosted on [GitHub Pages](https://actions-runner-controller.github.io/actions-runner-controller) for https://github.com/actions-runner-controller/actions-runner-controller; moving the repository means we would break users who install ARC via the Helm chart.
# Decisions
## API group names for k8s custom resources: `actions.summerwind` or `actions.github`
- We will not rename any existing ARC resource API names after moving the repository under the Actions org (keep `summerwind` for the old resources).
- Any new resource API we add will be named properly under GitHub, e.g. `actions.github.com/v1alpha1/AutoScalingRunnerSet`.
Benefits:
- A clear separation from existing ARC:
- Easy for support engineers to triage incoming tickets and figure out whether we need to support the use case from the user
- We won't break existing users when they upgrade to a newer version of ARC after the repository transfer
Based on the spike done by `@nikola-jokic`, we have confidence that we can host multiple resources with different API names under the same repository, and the published ARC controller can handle both resources properly.
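For illustration, the two API groups living side by side in the same cluster might look like this (the resource names are placeholders):
```yaml
# Legacy resource, kept under the original creator's API group:
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: legacy-runners
---
# New resource, added under the GitHub-owned API group:
apiVersion: actions.github.com/v1alpha1
kind: AutoscalingRunnerSet
metadata:
  name: new-scale-set
```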
## ARC Docker images
We will not start using the GitHub container registry for hosting ARC images (controller + runner images) right after the repository transfer.
Over time, however, we will start using GHCR to host those images, in line with our deprecation story.
## Helm chart
We will recreate the https://github.com/actions-runner-controller/actions-runner-controller repository after the repository transfer.
The recreated repository will contain only the Helm chart assets, which keep powering https://actions-runner-controller.github.io/actions-runner-controller so users can keep installing ARC via Helm.
Long term, we will switch to hosting the Helm chart on GHCR (OCI) instead of using GitHub Pages.
This will require a one-time change from our users: running
`helm repo remove actions-runner-controller` and `helm repo add actions-runner-controller oci://ghcr.io/actions`


@@ -0,0 +1,80 @@
# ADR 0007: Adding labels to our resources
**Date**: 2022-12-05
**Status**: Done
## Context
Users need to provide us with logs so that we can help support and troubleshoot their issues. We need a way for our users to filter and retrieve the logs we need.
## Proposal
A good start would be a catch-all label to get all ARC-related logs: one of the [recommended labels](https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/)
is `app.kubernetes.io/part-of`, and we can set it to `actions-runner-controller`
for all ARC components.
Assuming standard logging, that would allow us to get all ARC logs by running
```bash
kubectl logs -l 'app.kubernetes.io/part-of=actions-runner-controller'
```
which would be very useful for development to begin with.
The proposal is to add these sets of labels to the pods ARC creates:
#### controller-manager
Labels to be set by the Helm chart:
```yaml
metadata:
  labels:
    app.kubernetes.io/part-of: actions-runner-controller
    app.kubernetes.io/component: controller-manager
    app.kubernetes.io/version: "x.x.x"
```
#### Listener
Labels to be set by controller at creation:
```yaml
metadata:
  labels:
    app.kubernetes.io/part-of: actions-runner-controller
    app.kubernetes.io/component: runner-scale-set-listener
    app.kubernetes.io/version: "x.x.x"
    actions.github.com/scale-set-name: scale-set-name # this corresponds to metadata.name as set for AutoscalingRunnerSet
    # the following labels are extracted from the config URL
    actions.github.com/enterprise: enterprise
    actions.github.com/organization: organization
    actions.github.com/repository: repository
```
#### Runner
Labels to be set by controller at creation:
```yaml
metadata:
  labels:
    app.kubernetes.io/part-of: actions-runner-controller
    app.kubernetes.io/component: runner
    app.kubernetes.io/version: "x.x.x"
    actions.github.com/scale-set-name: scale-set-name # this corresponds to metadata.name as set for AutoscalingRunnerSet
    actions.github.com/runner-name: runner-name
    actions.github.com/runner-group-name: runner-group-name
    # the following labels are extracted from the config URL
    actions.github.com/enterprise: enterprise
    actions.github.com/organization: organization
    actions.github.com/repository: repository
```
This would allow us to ask users:
> Can you please send us the logs coming from pods labelled 'app.kubernetes.io/part-of=actions-runner-controller'?
Or, for example, if they're having problems specifically with runners:
> Can you please send us the logs coming from pods labelled 'app.kubernetes.io/component=runner'?
This way users don't have to understand ARC's moving parts, but we still have a
way to target them specifically if we need to.


@@ -0,0 +1,88 @@
# ADR 0008: Pick the right runner to scale down
**Date**: 2022-12-27
**Status**: Done
## Context
- A custom resource `EphemeralRunnerSet` manages a set of `EphemeralRunner` custom resources.
- The `EphemeralRunnerSet` has `Replicas` in its `Spec`, and the responsibility of the `EphemeralRunnerSet_controller` is to reconcile a given `EphemeralRunnerSet` to have the same number of `EphemeralRunners` as `Spec.Replicas` defines.
- This means the `EphemeralRunnerSet_controller` will scale up the `EphemeralRunnerSet` by creating more `EphemeralRunners` when `Spec.Replicas` is higher than the current number of `EphemeralRunners`.
- This also means the `EphemeralRunnerSet_controller` will scale down the `EphemeralRunnerSet` by finding existing `EphemeralRunners` to delete when `Spec.Replicas` is less than the current number of `EphemeralRunners`.
This ADR is about how we find the right existing `EphemeralRunner` to delete when we need to scale down; a sketch of the resource involved follows.
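For reference, here is a minimal sketch of the resource relationship, with illustrative field names that may not match the final CRD exactly:
```yaml
# Hypothetical EphemeralRunnerSet; the controller reconciles the number of
# EphemeralRunners it owns to match spec.replicas.
apiVersion: actions.github.com/v1alpha1
kind: EphemeralRunnerSet
metadata:
  name: my-scale-set-runners
spec:
  replicas: 2   # patching this from 10 down to 2 forces the controller to pick 8 EphemeralRunners to delete
```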
## Current approach
1. `EphemeralRunnerSet_controller` figures out how many `EphemeralRunners` it needs to delete, e.g. scaling down from 10 to 2 means we need to delete 8 `EphemeralRunners`.
2. `EphemeralRunnerSet_controller` finds all `EphemeralRunners` that are in the `Running` or `Pending` phase.
> `Pending` means the `EphemeralRunner` is probably still being created and a runner has not yet been configured with the Actions service.
> `Running` means the `EphemeralRunner` is created and a runner has probably been configured with the Actions service; the runner may be sitting idle,
> or actively running a workflow job. We don't have a clear answer for it from the ARC side. (The Actions service knows for sure.)
3. `EphemeralRunnerSet_controller` makes an HTTP DELETE request to the Actions service for each `EphemeralRunner` from the previous step and asks the Actions service to delete the runner via its `RunnerId`.
(The `RunnerId` is generated after the runner registers with the Actions service, and is stored in `EphemeralRunner.Status.RunnerId`.)
> - The HTTP DELETE request looks like the following:
> `DELETE https://pipelines.actions.githubusercontent.com/WoxlUxJHrKEzIp4Nz3YmrmLlZBonrmj9xCJ1lrzcJ9ZsD1Tnw7/_apis/distributedtask/pools/0/agents/1024`
> The Actions service will return one of two responses:
> 1. 204 (No Content): The runner with Id 1024 has been successfully removed from the service, or the runner with Id 1024 doesn't exist.
> 2. 400 (Bad Request) with a JSON body that contains an error message like `JobStillRunningException`: the service can't remove this runner at this point since it has been
> assigned to a job request; the client won't be able to remove the runner until it finishes its currently assigned job request.
4. `EphemeralRunnerSet_controller` will ignore any deletion error from runners that are still running a job, and keep retrying deletion until the number of `204` responses equals the number of
`EphemeralRunners` it needs to delete.
## The problem with the current approach
In a busy `AutoScalingRunnerSet`, scale up and scale down may happen all the time as jobs are queued and finished.
We will make far too many HTTP requests to the Actions service asking it to try to delete a certain runner, relying on the exception from the service to figure out what to do next.
The runner deletion request is not cheap for the service: for synchronization, the `JobStillRunningException` is raised from the DB call that handles the request.
So we are wasting resources on both the Actions service (extra load to the database) and the actions-runner-controller (useless outgoing HTTP requests).
In the test ARC that I deployed to Azure, the ARC controller tried to delete RunnerId 12408 for `bbq-beets/ting-test` a total of 35 times within 10 minutes.
## Root cause
The `EphemeralRunnerSet_controller` doesn't know whether a given `EphemeralRunner` is actually running a workflow job or not
(it only knows the runner is configured with the service), so it can't filter those `EphemeralRunners` out.
## Additional context
The legacy ARC's custom resource allows the runner image to leverage the RunnerJobHook feature to update the status of the runner custom resource in k8s (marking the runner as running workflow run Id XXX).
This brings good value to users, as it provides insight into which runner is running which job across the cluster, and it looks pretty close to what we want in order to fix the [root cause](#root-cause).
However, the legacy ARC approach means the service account running the runner pod needs elevated permission to update the custom resource;
this is a big `NO` from a security point of view, since we may not trust the code running inside the runner pod.
## Possible Solution
The nature of the k8s controller-runtime means we might reconcile a resource based on stale cache data.
I think our goal for the solution should be:
- Reduce wasteful HTTP requests on a scale-down as much as we can.
- We can accept that we might make 1 or 2 wasteful requests to the Actions service, but we can't accept making 5-10+ of them.
- See if we can reach feature parity with what the RunnerJobHook supports without compromising on any security concerns.
Since the root cause of why the reconciliation can't skip an `EphemeralRunner` is that we don't know whether an `EphemeralRunner` is running a job,
a simple thought is: how about we attach some info to the `EphemeralRunner` to indicate it's currently running a job?
How about we send this info from the service to the auto-scaling-listener via the existing HTTP long-poll
and let the listener patch the `EphemeralRunner.Status` to indicate it's running a job?
> The listener normally runs in a separate namespace with elevated permissions, and it's something we can trust.
Changes:
- Introduce a new message type `JobStarted` (in addition to the existing `JobAvailable`/`JobAssigned`/`JobCompleted`) on the service side; the message is sent when a runner of the `RunnerScaleSet` gets assigned to a job, and `RequestId`, `RunnerId`, and `RunnerName` will be included in the message.
- Add `RequestId` (int) to `EphemeralRunner.Status`; this will indicate which job the runner is running.
- The `AutoScalingListener` will use the payload of this new message to patch `EphemeralRunners/<RunnerName>/Status` with the `RequestId` (see the sketch after this list).
- When the `EphemeralRunnerSet_controller` tries to find `EphemeralRunners` to delete on a scale down, it will skip any `EphemeralRunner` that has `EphemeralRunner.Status.RequestId` set.
- In the future, we can expose more info in this `JobStarted` message and introduce more properties under `EphemeralRunner.Status` to reach feature parity with the legacy ARC's RunnerJobHook.
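A minimal sketch of what the patched resource could look like, assuming illustrative field names in the status (the exact names are not confirmed here):
```yaml
# Hypothetical EphemeralRunner after the listener processes a JobStarted message.
apiVersion: actions.github.com/v1alpha1
kind: EphemeralRunner
metadata:
  name: my-scale-set-runner-abc12
status:
  runnerId: 12408   # id assigned by the Actions service at registration
  requestId: 6789   # set by the listener from the JobStarted payload; scale-down skips runners with this set
```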