Improve ARC workflows for autoscaling runner sets

Date: 2023-03-17

Status: Done

Context

In the actions-runner-controller repository we essentially have two projects living side by side: the "legacy" actions-runner-controller and the new one GitHub is supporting (gha-runner-scale-set). To hasten progress we relied on existing workflows and added some of our own (e.g. end-to-end tests). We have now got to a point where it's confusing what does what and why, not to mention the increased running times of some of those workflows and GHA-related flaky tests getting in the way of legacy ARC and vice versa. The three main areas we want to cover are: Go code, Kubernetes manifests / Helm charts, and E2E tests.

Go code

At the moment we have three workflows that validate Go code:

  • golangci-lint: this is a collection of linters that currently runs on all PRs and pushes to master
  • Validate ARC: this is a bit of a catch-all workflow; besides Go tests it also validates Kubernetes manifests and runs go generate, go fmt and go vet
  • Run CodeQL

Proposal

I think having one Go workflow that collects everything Go-related would help a ton with the reliability and understandability of what's going on. It shouldn't be limited to the GHA-supported mode, as there are changes that, even if made outside the GHA code base, could affect us (such as a dependency update). This workflow should only run on changes to *.go files, go.mod and go.sum. It should have the following jobs, aiming to cover all existing functionality and eliminate some duplication (a rough sketch of such a workflow follows the list):

  • test: run all Go tests in the project. We currently use the -short and -coverprofile flags: while -short is used to skip old ARC E2E tests, -coverprofile adds to the test time without really giving us any value in return. We should also start using actions/setup-go@v4 to take advantage of caching (it would speed up our tests by a lot), or enable caching on v3 if we have a strong reason not to upgrade. We should keep ignoring our E2E tests too, as those will be run elsewhere (either use -short there as well or keep ignoring the package like we currently do). As a dependency for tests this job needs to run make manifests first: we should fail there and then if there is a diff.
  • fmt: we currently run go fmt ./... as part of Validate ARC but do nothing with the results. We should fail in case of a diff. We don't need caching for this job.
  • lint: this corresponds to what's currently the golangci-lint workflow (it also covers go vet, which currently happens as part of Validate ARC too).
  • generate: the current behaviour here is actually quite risky: we generate our code in the Validate ARC workflow and use the results to run the tests, but we don't validate that up-to-date generated code is checked in. This job should run go generate and fail on a diff.
  • vulncheck: (EDIT: this is covered by CodeQL.) The Go team is maintaining govulncheck, a tool that recursively analyzes all function calls in Go code and spots vulnerabilities in the call stack.
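
To make the proposal concrete, here is a rough sketch of what such a consolidated workflow could look like. Everything in it (the workflow name, the action versions, the exact commands) is an illustrative assumption to be adapted to our actual Makefile targets and module layout, not the final configuration:

```yaml
name: Go
on:
  push:
    branches: [master]
    paths: ["**.go", "go.mod", "go.sum"]
  pull_request:
    paths: ["**.go", "go.mod", "go.sum"]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # setup-go v4 enables module and build caching by default
      - uses: actions/setup-go@v4
        with:
          go-version-file: go.mod
      - name: Regenerate manifests and fail if they are out of date
        run: |
          make manifests
          git diff --exit-code
      - name: Run unit tests (-short skips the old ARC E2E tests)
        run: go test -short ./...

  fmt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v4
        with:
          go-version-file: go.mod
          cache: false  # no build involved, caching adds nothing here
      - name: Fail if gofmt produces a diff
        run: |
          go fmt ./...
          git diff --exit-code

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v4
        with:
          go-version-file: go.mod
      # covers the linters we run today, including go vet
      - uses: golangci/golangci-lint-action@v3
        with:
          version: latest

  generate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-go@v4
        with:
          go-version-file: go.mod
      - name: Fail if generated code is not checked in
        run: |
          go generate ./...
          git diff --exit-code
```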

Kubernetes manifests / Helm charts

We have recently separated Helm chart validation, and we validate the up-to-dateness of manifests as part of Go / test.
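
For reference, a chart-only workflow can stay very small. The sketch below assumes the charts live under charts/ and uses a plain helm lint; it is not necessarily what the separated workflow actually does:

```yaml
name: Validate Helm Charts
on:
  pull_request:
    paths: ["charts/**"]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: azure/setup-helm@v3
      - name: Lint every chart under charts/
        run: |
          for chart in charts/*/; do
            helm lint "$chart"
          done
```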

End to end tests

These tests are giving us really good coverage and should be one of the main actors when it comes to trusting our releases. Two improvements that could be made here are:

  • renaming the workflow to GHA E2E: since renaming our resources, the gha prefix has been used to identify things related to the mode GitHub supports, and these jobs strictly validate the GitHub-supported mode only. Having a shorter name allows for more readability of the various scenarios (e.g. GHA E2E / single-namespace-setup).
  • the test currently monitors and validates the number of pods spawning during the workflow, but not the outcome of the workflow itself. While we don't need to look at pod specifics, we should at least guarantee that the workflow can conclude successfully (a sketch of such a check follows this list).
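
A minimal sketch of that check as an extra step in the E2E jobs, assuming the target repository, the dispatched workflow name and a suitable token are exposed to the step (TARGET_REPO and TEST_WORKFLOW are placeholders, not values from our current setup):

```yaml
# Hypothetical final step of an E2E job: wait for the dispatched workflow run
# and fail if it does not conclude successfully.
- name: Assert the test workflow concluded successfully
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    # Look up the most recent run of the dispatched test workflow.
    run_id=$(gh run list --repo "$TARGET_REPO" --workflow "$TEST_WORKFLOW" \
      --limit 1 --json databaseId --jq '.[0].databaseId')
    # gh run watch blocks until the run completes; --exit-status makes this
    # step fail when the run concludes with anything other than success.
    gh run watch "$run_id" --repo "$TARGET_REPO" --exit-status
```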