Add development setup for runner scale set controller optimization

- Add CLAUDE.md with project focus on new mode only (actions.github.com API)
- Add ENV_SETUP.md for local development with Kind cluster setup
- Add tasks.md with comprehensive performance optimization plan
- Configure for justanotherspy GitHub username and danielschwartzlol Docker Hub
- Use Helm charts version 0.12.1 for runner scale set controller
- Focus exclusively on optimizing EphemeralRunnerSetReconciler parallel creation
- No cert-manager required for new mode setup

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

parent ddc2918a48
commit c73b8a2b92

@ -0,0 +1,234 @@

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Repository Information

**THIS IS A FORK**: This repository is a fork of the upstream `actions/actions-runner-controller` repository.
- **Fork Owner**: `justanotherspy`
- **Upstream**: `actions/actions-runner-controller`
- **IMPORTANT**: Always push changes to the fork (`justanotherspy/actions-runner-controller`), NEVER to upstream
- **Default Branch**: Work on feature branches, not directly on master

## Project Focus

**IMPORTANT**: We work EXCLUSIVELY on the NEW Runner Scale Set Controller mode, NOT the legacy mode.

- **NEW Mode ONLY**: Autoscaling Runner Sets using `actions.github.com` API group
- **NO Legacy Development**: Do not work on `actions.summerwind.net` resources
- **NO Cert-Manager**: The new mode doesn't use webhooks or cert-manager
- **GitHub Username**: `justanotherspy` (for test repositories)
- **Docker Hub Account**: `danielschwartzlol`

## Development Configuration

- **Controller Image**: `danielschwartzlol/gha-runner-scale-set-controller`
- **Runner Image**: Use official `ghcr.io/actions/actions-runner`
- **Helm Charts** (Version 0.12.1):
  - Controller: `gha-runner-scale-set-controller`
  - Runner Set: `gha-runner-scale-set`
- **Helm Chart Version**: Always use `0.12.1` (latest as of this setup)
- **Local Development**: Use Kind cluster without cert-manager (see ENV_SETUP.md)
- **Test Repository**: `justanotherspy/test-runner-repo`

## Key Components (New Mode Only)

### Controllers to Focus On

**AutoscalingRunnerSetReconciler** (`controllers/actions.github.com/autoscalingrunnerset_controller.go`)
- Manages runner scale set lifecycle
- Creates EphemeralRunnerSets based on demand
- Handles runner group configuration

**EphemeralRunnerSetReconciler** (`controllers/actions.github.com/ephemeralrunnerset_controller.go`)
- **CRITICAL FOR OPTIMIZATION**: Contains sequential runner creation loop
- `createEphemeralRunners()` method at lines 359-386 needs parallelization
- Manages replicas of EphemeralRunners

**EphemeralRunnerReconciler** (`controllers/actions.github.com/ephemeralrunner_controller.go`)
- Manages individual runner pods
- Handles runner registration with GitHub

**AutoscalingListenerReconciler** (`controllers/actions.github.com/autoscalinglistener_controller.go`)
- Manages the listener pod, which long-polls the GitHub Actions service for job messages (it does not receive webhooks)
- Triggers scaling events

### Resource Hierarchy (New Mode)

```text
AutoscalingRunnerSet
├── AutoscalingListener (listener pod, long-polls GitHub)
└── EphemeralRunnerSet
    └── EphemeralRunner (Pod)
```

## Performance Optimization Focus

### Current Problem
- `EphemeralRunnerSetReconciler.createEphemeralRunners()` creates runners sequentially
- Time complexity: O(n) where n = number of runners
- Bottleneck location: `controllers/actions.github.com/ephemeralrunnerset_controller.go:362-383`

### Optimization Goal
- Implement parallel runner creation with worker pool pattern
- Target: 10x improvement (create 100 runners in < 30 seconds)
- Configurable concurrency (default: 10 parallel creations)

## Build Commands

```bash
# Build controller for runner scale set mode
make docker-build
docker tag danielschwartzlol/actions-runner-controller:dev \
  danielschwartzlol/gha-runner-scale-set-controller:dev

# Run controller locally in scale set mode
make run-scaleset

# Generate CRDs (only actions.github.com ones matter)
make manifests

# Run tests for new mode controllers
go test -v ./controllers/actions.github.com/...
```

## Testing Commands

```bash
# Unit tests for runner scale set controllers
go test -v ./controllers/actions.github.com/... -run TestEphemeralRunnerSet

# Integration tests for new mode
KUBEBUILDER_ASSETS="$(setup-envtest use 1.28 -p path)" \
  go test -v ./controllers/actions.github.com/...

# Benchmark runner creation
go test -bench=BenchmarkCreateEphemeralRunners ./controllers/actions.github.com/...
```

## Local Development Workflow

```bash
# 1. Create Kind cluster (no cert-manager needed)
kind create cluster --name arc-dev

# 2. Build and load controller
VERSION=dev make docker-build
docker tag danielschwartzlol/actions-runner-controller:dev \
  danielschwartzlol/gha-runner-scale-set-controller:dev
kind load docker-image danielschwartzlol/gha-runner-scale-set-controller:dev --name arc-dev

# 3. Install controller with Helm (v0.12.1)
helm install arc-controller \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.12.1 \
  --set image.repository=danielschwartzlol/gha-runner-scale-set-controller \
  --set image.tag=dev \
  --set image.pullPolicy=Never

# 4. Deploy runner scale set (v0.12.1)
helm install arc-runner-set \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.12.1 \
  --set githubConfigUrl="https://github.com/justanotherspy/test-runner-repo" \
  --set githubConfigSecret="github-auth"
```

## Important Files for Optimization

### Primary Focus
- `controllers/actions.github.com/ephemeralrunnerset_controller.go` - Contains sequential creation logic
- `controllers/actions.github.com/ephemeralrunner_controller.go` - Individual runner management
- `controllers/actions.github.com/autoscalingrunnerset_controller.go` - Scale set orchestration

### Configuration
- `charts/gha-runner-scale-set-controller/` - Controller Helm chart
- `charts/gha-runner-scale-set/` - Runner set Helm chart
- `cmd/ghalistener/` - Listener that long-polls GitHub for job messages

### Tests
- `controllers/actions.github.com/ephemeralrunnerset_controller_test.go`
- `controllers/actions.github.com/ephemeralrunner_controller_test.go`

## Code Patterns for New Mode

### Creating Resources in Parallel
```go
// Example pattern for parallel creation
func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel(
	ctx context.Context,
	runnerSet *v1alpha1.EphemeralRunnerSet,
	count int,
	log logr.Logger,
) error {
	workers := 10 // Configurable
	jobs := make(chan int, count)
	results := make(chan error, count)

	// Start workers (createRunnerWorker is not defined in this example)
	for w := 0; w < workers; w++ {
		go r.createRunnerWorker(ctx, runnerSet, jobs, results, log)
	}

	// Queue jobs
	for i := 0; i < count; i++ {
		jobs <- i
	}
	close(jobs)

	// Collect results; each worker must send exactly one result per job,
	// otherwise this loop blocks forever
	var errs []error
	for i := 0; i < count; i++ {
		if err := <-results; err != nil {
			errs = append(errs, err)
		}
	}

	return multierr.Combine(errs...)
}
```

## GitHub API Integration

- Use `github.Client` interface for testability
- Implement exponential backoff for rate limiting
- Runner scale sets register with GitHub using JIT configuration
- Default runner group: "default"

## DO NOT Work On

- **Legacy Controllers**: Anything in `controllers/actions.summerwind.net/`
- **Cert-Manager**: Not used in new mode
- **Webhooks**: New mode uses listener pod instead
- **RunnerDeployment**: Legacy resource type
- **HorizontalRunnerAutoscaler**: Legacy autoscaling

## Testing Performance Improvements

```bash
# Create many runners to test parallel creation
kubectl -n arc-runners patch ephemeralrunnerset <name> \
  --type merge -p '{"spec":{"replicas":100}}'

# Monitor creation time
time kubectl -n arc-runners wait --for=condition=Ready \
  ephemeralrunners --all --timeout=600s

# Check controller metrics
kubectl port-forward -n arc-systems service/arc-controller 8080:80
curl http://localhost:8080/metrics | grep ephemeral_runner_creation_duration
```

## Key Metrics to Track

- `ephemeral_runner_creation_duration_seconds` - Time to create each runner
- `ephemeral_runner_set_replicas` - Current vs desired replicas
- `controller_runtime_reconcile_time_seconds` - Reconciliation performance

## Files Referenced

@ENV_SETUP.md - Complete setup guide for new mode
@tasks.md - Performance optimization task plan
@controllers/actions.github.com/ephemeralrunnerset_controller.go
@controllers/actions.github.com/ephemeralrunner_controller.go
@controllers/actions.github.com/autoscalingrunnerset_controller.go

@ -0,0 +1,382 @@

# Local Development Environment Setup - Runner Scale Set Controller

This guide sets up a local development environment for the **NEW** GitHub Actions Runner Scale Set Controller (not the legacy mode).

## Important Notes

- **NO cert-manager required** - The new mode doesn't use webhooks
- **NO legacy controller** - We only work with the new `actions.github.com` API group
- Uses separate Helm charts: `gha-runner-scale-set-controller` and `gha-runner-scale-set`
- GitHub username: `justanotherspy`
- Docker Hub account: `danielschwartzlol`

## Prerequisites

### Required Tools

1. **Docker** - For running containers and Kind cluster

   ```bash
   # Ubuntu/Debian
   sudo apt-get update
   sudo apt-get install docker.io
   sudo usermod -aG docker $USER
   # Log out and back in for group changes to take effect
   ```

2. **Kind** - Kubernetes in Docker

   ```bash
   # Install Kind
   curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
   chmod +x ./kind
   sudo mv ./kind /usr/local/bin/kind
   ```

3. **kubectl** - Kubernetes CLI

   ```bash
   curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
   chmod +x kubectl
   sudo mv kubectl /usr/local/bin/
   ```

4. **Helm** - Kubernetes package manager

   ```bash
   curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
   ```

5. **Go** - For building the controller (1.21+)

   ```bash
   # Install Go 1.21
   wget https://go.dev/dl/go1.21.5.linux-amd64.tar.gz
   sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.21.5.linux-amd64.tar.gz
   export PATH=$PATH:/usr/local/go/bin
   echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
   ```

### Environment Variables

Add these to your `.bashrc` or `.zshrc`:

```bash
# Docker Hub Configuration
export DOCKER_USER="danielschwartzlol"
export CONTROLLER_IMAGE="${DOCKER_USER}/gha-runner-scale-set-controller"
export RUNNER_IMAGE="ghcr.io/actions/actions-runner"  # Official runner image

# GitHub Configuration
export GITHUB_TOKEN="your-github-pat-token-here"
export GITHUB_USERNAME="justanotherspy"

# Or for GitHub App authentication (recommended):
# export APP_ID="your-app-id"
# export INSTALLATION_ID="your-installation-id"
# export PRIVATE_KEY_FILE_PATH="/path/to/private-key.pem"

# Test Repository Configuration
export TEST_REPO="${GITHUB_USERNAME}/test-runner-repo"
export TEST_ORG=""  # Optional: Your test organization

# Development Settings
export VERSION="dev"
export CLUSTER_NAME="arc-dev"
```

## Step 1: Build the Controller Image

```bash
# Build the controller image with scale set mode
make docker-build

# Tag it for our use
docker tag ${DOCKER_USER}/actions-runner-controller:${VERSION} \
  ${CONTROLLER_IMAGE}:${VERSION}
```

## Step 2: Create Kind Cluster

Create a simple Kind cluster (no special config needed for new mode):

```bash
# Create Kind cluster
cat <<EOF | kind create cluster --name ${CLUSTER_NAME} --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
EOF

# Verify cluster is running
kubectl cluster-info --context kind-${CLUSTER_NAME}
```

## Step 3: Load Controller Image into Kind

```bash
# Load the controller image
kind load docker-image ${CONTROLLER_IMAGE}:${VERSION} --name ${CLUSTER_NAME}

# Verify image is loaded
docker exec -it ${CLUSTER_NAME}-control-plane crictl images | grep ${DOCKER_USER}
```

## Step 4: Create GitHub Authentication Secret

```bash
# Create the namespace for the runner scale set. The secret must live in the
# same namespace as the scale set that references it (arc-runners in Step 6).
kubectl create namespace arc-runners

# Option 1: PAT authentication
kubectl create secret generic github-auth \
  --namespace=arc-runners \
  --from-literal=github_token="${GITHUB_TOKEN}"

# Option 2: GitHub App authentication (use instead of the PAT secret, not in addition)
kubectl create secret generic github-auth \
  --namespace=arc-runners \
  --from-literal=github_app_id="${APP_ID}" \
  --from-literal=github_app_installation_id="${INSTALLATION_ID}" \
  --from-file=github_app_private_key="${PRIVATE_KEY_FILE_PATH}"
```

## Step 5: Install Runner Scale Set Controller

### Option A: Using Helm (Recommended)

```bash
# Install the controller
helm install arc-controller \
  --namespace arc-systems \
  --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.12.1 \
  --set image.repository=${CONTROLLER_IMAGE} \
  --set image.tag=${VERSION} \
  --set image.pullPolicy=Never

# Verify controller is running
kubectl -n arc-systems get pods -l app.kubernetes.io/name=gha-runner-scale-set-controller
```

### Option B: Manual Deployment (for development)

```bash
# Run the controller locally (for debugging)
CONTROLLER_MANAGER_POD_NAMESPACE=arc-systems \
CONTROLLER_MANAGER_CONTAINER_IMAGE="${CONTROLLER_IMAGE}:${VERSION}" \
make run-scaleset
```

## Step 6: Deploy Runner Scale Set

Create a runner scale set for your repository:

```bash
# Install runner scale set
helm install arc-runner-set \
  --namespace arc-runners \
  --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.12.1 \
  --set githubConfigUrl="https://github.com/${TEST_REPO}" \
  --set githubConfigSecret="github-auth" \
  --set controllerServiceAccount.namespace="arc-systems" \
  --set controllerServiceAccount.name="arc-controller-gha-rs-controller" \
  --set minRunners=1 \
  --set maxRunners=10 \
  --set runnerGroup="default" \
  --set runnerScaleSetName="test-scale-set"

# Watch the runner scale set
kubectl -n arc-runners get autoscalingrunnersets -w
kubectl -n arc-runners get ephemeralrunnersets -w
kubectl -n arc-runners get ephemeralrunners -w
```

## Step 7: Verify Installation

```bash
# Check controller logs
kubectl -n arc-systems logs -l app.kubernetes.io/name=gha-runner-scale-set-controller -f

# Check listener logs
kubectl -n arc-systems logs -l app.kubernetes.io/name=arc-runner-set-listener -f

# Check runner pods
kubectl -n arc-runners get pods

# Get runner scale set status
kubectl -n arc-runners get autoscalingrunnersets -o wide
```

## Development Workflow

### Quick Iteration for Controller Changes

```bash
# 1. Make your code changes

# 2. Rebuild controller (export VERSION so the later commands use the same tag)
export VERSION="dev-$(date +%s)"
make docker-build
docker tag ${DOCKER_USER}/actions-runner-controller:${VERSION} \
  ${CONTROLLER_IMAGE}:${VERSION}

# 3. Load into Kind
kind load docker-image ${CONTROLLER_IMAGE}:${VERSION} --name ${CLUSTER_NAME}

# 4. Update the deployment
kubectl -n arc-systems set image deployment/arc-controller-gha-rs-controller \
  manager=${CONTROLLER_IMAGE}:${VERSION}

# 5. Watch logs
kubectl -n arc-systems logs -l app.kubernetes.io/name=gha-runner-scale-set-controller -f
```

### Testing Parallel Runner Creation

```bash
# Scale up to test parallel creation
kubectl -n arc-runners patch autoscalingrunnerset arc-runner-set-runner-set \
  --type merge \
  -p '{"spec":{"maxRunners":50}}'

# Trigger scale up by running workflows in your test repo
# Or manually patch the ephemeralrunnerset
kubectl -n arc-runners patch ephemeralrunnerset <name> \
  --type merge \
  -p '{"spec":{"replicas":50}}'

# Monitor creation time
time kubectl -n arc-runners wait --for=condition=Ready ephemeralrunners --all --timeout=600s

# Check metrics
kubectl -n arc-systems port-forward service/arc-controller-gha-rs-controller 8080:80
curl http://localhost:8080/metrics | grep ephemeral
```

## Debugging

### Enable Verbose Logging

```bash
# Update controller deployment with debug logging
kubectl -n arc-systems edit deployment arc-controller-gha-rs-controller

# Add to container args:
# - "--log-level=debug"
```

### Common Commands

```bash
# Get all resources
kubectl get all -n arc-systems
kubectl get all -n arc-runners

# Describe runner set
kubectl -n arc-runners describe autoscalingrunnerset

# Get events
kubectl -n arc-runners get events --sort-by='.lastTimestamp'

# Port forward for pprof debugging
kubectl -n arc-systems port-forward deployment/arc-controller-gha-rs-controller 6060:6060
go tool pprof http://localhost:6060/debug/pprof/profile
```

## Performance Testing Script

```bash
#!/bin/bash
# perf-test.sh

NAMESPACE="arc-runners"
REPLICAS="${1:-100}"

echo "Testing creation of ${REPLICAS} runners..."

# Record start time
START=$(date +%s)

# Scale up (`-o name` already returns TYPE/NAME, so it is passed to patch as-is)
kubectl -n ${NAMESPACE} patch "$(kubectl -n ${NAMESPACE} get ephemeralrunnersets -o name | head -1)" \
  --type merge \
  -p "{\"spec\":{\"replicas\":${REPLICAS}}}"

# Wait for all runners
kubectl -n ${NAMESPACE} wait --for=condition=Ready ephemeralrunners --all --timeout=600s

# Record end time
END=$(date +%s)
DURATION=$((END - START))

echo "Created ${REPLICAS} runners in ${DURATION} seconds"
# Shell arithmetic is integer-only, so use awk to keep fractions of a second
echo "Average time per runner: $(awk -v d="${DURATION}" -v n="${REPLICAS}" 'BEGIN { printf "%.2f", d / n }') seconds"

# Get runner creation events
kubectl -n ${NAMESPACE} get events --field-selector reason=Created | grep EphemeralRunner
```

## Cleanup

```bash
# Delete runner scale set
helm uninstall arc-runner-set -n arc-runners

# Delete controller
helm uninstall arc-controller -n arc-systems

# Delete namespaces
kubectl delete namespace arc-systems arc-runners

# Delete Kind cluster
kind delete cluster --name ${CLUSTER_NAME}
```

## Troubleshooting

### Runner Scale Set Not Creating Runners

```bash
# Check if runner scale set is registered
kubectl -n arc-runners get autoscalingrunnerset -o yaml | grep runnerScaleSetId

# Check GitHub API connectivity (the registration-token endpoint requires POST;
# this only works if the controller image ships curl)
kubectl -n arc-systems exec -it deployment/arc-controller-gha-rs-controller -- \
  curl -X POST -H "Authorization: token ${GITHUB_TOKEN}" \
  https://api.github.com/repos/${TEST_REPO}/actions/runners/registration-token
```

### Runners Not Picking Up Jobs

```bash
# Ensure `runs-on` in the workflow matches the runner scale set name
# In workflow file:
#   runs-on: test-scale-set  # the runnerScaleSetName from Step 6

# Check runner registration
kubectl -n arc-runners logs -l app.kubernetes.io/component=runner --tail=100
```

## Key Differences from Legacy Mode

1. **No Cert-Manager**: New mode doesn't use admission webhooks
2. **Different CRDs**: Uses `AutoscalingRunnerSet`, `EphemeralRunnerSet`, `EphemeralRunner`
3. **Separate Helm Charts**: `gha-runner-scale-set-controller` and `gha-runner-scale-set`
4. **Listener Pod**: Runs in controller namespace and long-polls GitHub for job messages instead of receiving webhooks
5. **No Runner Deployment**: Only uses ephemeral runners

## Resources

- [Runner Scale Set Documentation](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller)
- [ARC Helm Charts](https://github.com/actions/actions-runner-controller/tree/master/charts)
- [Kind Documentation](https://kind.sigs.k8s.io/)

@ -0,0 +1,246 @@

# Runner Scale Set Controller Performance Optimization

## Problem Analysis

Based on analysis of the codebase, the runner scale set controller currently spawns runners **sequentially** in the `EphemeralRunnerSetReconciler.createEphemeralRunners()` method at `/controllers/actions.github.com/ephemeralrunnerset_controller.go:359-386`.

### Current Sequential Implementation Issues:
1. **Linear time complexity O(n)**: Creating n runners takes n sequential API calls
2. **Blocking loop**: Each runner creation blocks until the API call completes
3. **Poor scalability**: Large scale-ups (e.g., 100+ runners) take minutes
4. **Resource underutilization**: Controller pod doesn't leverage available CPU/memory for parallel operations

### Key Bottlenecks Identified:
- **EphemeralRunnerSet Controller** (`ephemeralrunnerset_controller.go:362-383`): Sequential for-loop creating runners one by one
- **API Call Latency**: Each `r.Create(ctx, ephemeralRunner)` call blocks for network roundtrip
- **No batching**: Individual API calls instead of batch operations
- **No concurrency**: Single-threaded execution path

## Proposed Task List for Performance Improvement

### Phase 1: Research & Design (Week 1)
- [ ] **Task 1.1**: Benchmark current performance
  - Measure time to create 10, 50, 100, 500 runners
  - Profile CPU/memory usage during scale-up
  - Document baseline metrics for comparison

- [ ] **Task 1.2**: Research Kubernetes client-go patterns for concurrent resource creation
  - Study controller-runtime workqueue patterns
  - Investigate rate limiting considerations
  - Review best practices for bulk operations

- [ ] **Task 1.3**: Design concurrent runner creation architecture
  - Define optimal concurrency level (suggest: configurable, default 10)
  - Design error handling and retry strategy
  - Plan backward compatibility approach

### Phase 2: Implementation (Week 2-3)

- [ ] **Task 2.1**: Refactor `createEphemeralRunners` for parallel execution
  ```go
  // Suggested approach:
  // - Use worker pool pattern with configurable concurrency
  // - Implement error aggregation
  // - Add progress tracking
  ```

- [ ] **Task 2.2**: Implement configurable concurrency controls
  - Add `--runner-creation-concurrency` flag (default: 10)
  - Add `--runner-creation-timeout` flag (default: 30s)
  - Environment variable overrides for containerized deployments

- [ ] **Task 2.3**: Add comprehensive error handling
  - Implement exponential backoff for failed creations
  - Partial success handling (some runners created, some failed)
  - Detailed error reporting and metrics

- [ ] **Task 2.4**: Implement progress tracking and observability
  - Add prometheus metrics for creation time per runner
  - Log progress at intervals (e.g., "Created 50/100 runners")
  - Add events to AutoscalingRunnerSet for visibility

### Phase 3: Testing (Week 3-4)

- [ ] **Task 3.1**: Unit tests for concurrent creation
  - Test with mock client
  - Verify error handling
  - Test concurrency limits
  - Test partial failures

- [ ] **Task 3.2**: Integration tests
  - Test with real Kubernetes API
  - Verify resource creation order
  - Test rollback on failure
  - Test with various concurrency levels

- [ ] **Task 3.3**: Load testing
  - Test creating 100+ runners simultaneously
  - Monitor API server impact
  - Measure improvement vs baseline
  - Test with rate limiting

- [ ] **Task 3.4**: Chaos testing
  - Test with network failures
  - Test with API server throttling
  - Test with partial quota exhaustion
  - Test controller restart during creation
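
For the concurrency-limit and partial-failure cases in Task 3.1, one option is a small hand-written test double that records peak concurrency and injects failures. The sketch below is illustrative only: `fakeCreator` and `drive` are hypothetical names, and `drive` is a simplified stand-in for whatever parallel creation function ends up being tested.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// fakeCreator is a hypothetical stand-in for the Kubernetes client in unit tests.
type fakeCreator struct {
	mu       sync.Mutex
	inFlight int             // calls currently executing
	peak     int             // highest inFlight value observed
	created  []string        // names created successfully
	failFor  map[string]bool // names whose creation should fail
}

func (f *fakeCreator) Create(_ context.Context, name string) error {
	f.mu.Lock()
	f.inFlight++
	if f.inFlight > f.peak {
		f.peak = f.inFlight
	}
	f.mu.Unlock()

	time.Sleep(time.Millisecond) // simulate API latency so calls overlap

	f.mu.Lock()
	defer f.mu.Unlock()
	f.inFlight--
	if f.failFor[name] {
		return fmt.Errorf("create %s: injected failure", name)
	}
	f.created = append(f.created, name)
	return nil
}

// drive calls Create n times with at most limit concurrent calls
// and returns how many calls failed.
func drive(f *fakeCreator, n, limit int) int {
	sem := make(chan struct{}, limit)
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed int
	)
	for i := 0; i < n; i++ {
		sem <- struct{}{}
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := f.Create(context.Background(), fmt.Sprintf("runner-%d", i)); err != nil {
				mu.Lock()
				failed++
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	return failed
}

func main() {
	f := &fakeCreator{failFor: map[string]bool{"runner-2": true, "runner-7": true}}
	failed := drive(f, 20, 3)
	fmt.Printf("created=%d failed=%d peak=%d\n", len(f.created), failed, f.peak)
}
```

A test would then assert that `peak` never exceeds the configured limit and that the number of created runners plus failures equals the requested count.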

### Phase 4: Optimization & Tuning (Week 4-5)

- [ ] **Task 4.1**: Implement adaptive concurrency
  - Start with low concurrency, increase based on success rate
  - Back off on errors or throttling
  - Self-tuning based on cluster capacity

- [ ] **Task 4.2**: Add bulk creation API support (if available)
  - Research if Actions API supports bulk runner registration
  - Implement batch registration if supported
  - Fall back to parallel individual creation

- [ ] **Task 4.3**: Optimize resource creation
  - Pre-compute runner configurations
  - Cache common data (secrets, configs)
  - Minimize API calls per runner
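
One simple way to realize the adaptive concurrency idea in Task 4.1 is additive increase, multiplicative decrease (AIMD): grow the limit by one after a clean batch and halve it when throttling is seen. The plan does not prescribe AIMD; it is swapped in here as a well-known policy, and `adaptiveLimit` with its methods is a hypothetical type for illustration.

```go
package main

import "fmt"

// adaptiveLimit tracks a concurrency limit that is adjusted AIMD-style.
type adaptiveLimit struct {
	current int // limit to use for the next batch
	min     int // never go below this
	max     int // never go above this
}

// onSuccess is called after a batch completed without errors or throttling.
func (a *adaptiveLimit) onSuccess() {
	if a.current < a.max {
		a.current++
	}
}

// onThrottle is called when the API server throttled or a batch saw errors.
func (a *adaptiveLimit) onThrottle() {
	a.current /= 2
	if a.current < a.min {
		a.current = a.min
	}
}

func main() {
	lim := &adaptiveLimit{current: 2, min: 1, max: 20}
	for i := 0; i < 6; i++ {
		lim.onSuccess()
	}
	fmt.Println("after 6 clean batches:", lim.current) // prints 8
	lim.onThrottle()
	fmt.Println("after one throttled batch:", lim.current) // prints 4
}
```

The halving step reacts quickly to API server pressure, while the slow additive growth avoids oscillating straight back into throttling.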

### Phase 5: Documentation & Rollout (Week 5-6)

- [ ] **Task 5.1**: Document configuration options
  - Update CLAUDE.md with new flags
  - Add tuning guide for different cluster sizes
  - Document performance improvements

- [ ] **Task 5.2**: Create migration guide
  - Document any breaking changes
  - Provide upgrade path
  - Include rollback procedures

- [ ] **Task 5.3**: Performance report
  - Before/after benchmarks
  - Scalability analysis
  - Recommendations for different use cases

## Implementation Details

### Suggested Code Structure

```go
// ephemeralrunnerset_controller.go

type runnerCreationJob struct {
	runner *v1alpha1.EphemeralRunner
	index  int
	err    error
}

func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel(
	ctx context.Context,
	runnerSet *v1alpha1.EphemeralRunnerSet,
	count int,
	log logr.Logger,
) error {
	concurrency := r.getConfiguredConcurrency() // Default: 10

	jobs := make(chan runnerCreationJob, count)
	results := make(chan runnerCreationJob, count)

	// Start workers
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go r.runnerCreationWorker(ctx, runnerSet, jobs, results, &wg, log)
	}

	// Queue jobs
	for i := 0; i < count; i++ {
		jobs <- runnerCreationJob{
			runner: r.newEphemeralRunner(runnerSet),
			index:  i,
		}
	}
	close(jobs)

	// Wait for completion
	go func() {
		wg.Wait()
		close(results)
	}()

	// Collect results and handle errors
	var errs []error
	created := 0
	for result := range results {
		if result.err != nil {
			errs = append(errs, result.err)
		} else {
			created++
			if created%10 == 0 || created == count {
				log.Info("Runner creation progress", "created", created, "total", count)
			}
		}
	}

	return multierr.Combine(errs...)
}
```
|
||||||
|
|
||||||
|
## Success Metrics
|
||||||
|
|
||||||
|
1. **Performance**:
|
||||||
|
- Target: Create 100 runners in < 30 seconds (vs current ~5 minutes)
|
||||||
|
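- Reduce wall-clock creation time from roughly n sequential creations to roughly n/c, where c = concurrency (at ~3 s per runner today, c = 10 puts 100 runners at about 30 s, assuming the API server keeps up)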
|
||||||
|
|
||||||
|
2. **Reliability**:
|
||||||
|
- Handle partial failures gracefully
|
||||||
|
- No runner leaks on error
|
||||||
|
- Proper cleanup on controller restart
|
||||||
|
|
||||||
|
3. **Observability**:
|
||||||
|
- Clear progress tracking
|
||||||
|
- Detailed metrics and logs
|
||||||
|
- Actionable error messages
|
||||||
|
|
||||||
|
4. **Compatibility**:
|
||||||
|
- Backward compatible by default
|
||||||
|
- Configurable for different environments
|
||||||
|
- No breaking changes to CRDs
|
||||||
|
|
||||||
|
## Risk Mitigation
|
||||||
|
|
||||||
|
1. **API Server Overload**: Implement rate limiting and backoff
|
||||||
|
2. **Resource Exhaustion**: Add memory/CPU limits and monitoring
|
||||||
|
3. **Partial Failures**: Implement proper rollback and cleanup
|
||||||
|
4. **Race Conditions**: Use proper locking and atomic operations
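
For the backoff part of item 1, a capped exponential delay with jitter is a common starting point. The sketch below uses only the standard library. `backoffConfig`, `defaultBackoff`, `retry`, and `errThrottled` are hypothetical names, and the 200 ms base and 10 s cap are placeholder values, not measured ones.

```go
package main

import (
	"errors"
	"math/rand"
	"time"
)

// errThrottled stands in for whatever error the API client returns when throttled.
var errThrottled = errors.New("throttled")

// backoffConfig describes a capped exponential backoff (hypothetical type).
type backoffConfig struct {
	Base time.Duration // delay before the first retry
	Max  time.Duration // upper bound for any single delay
}

// defaultBackoff holds placeholder values; real ones should come from benchmarks.
var defaultBackoff = backoffConfig{Base: 200 * time.Millisecond, Max: 10 * time.Second}

// delay returns Base doubled once per attempt (zero-based), capped at Max.
func (b backoffConfig) delay(attempt int) time.Duration {
	d := b.Base
	for i := 0; i < attempt && d < b.Max; i++ {
		d *= 2
	}
	if d > b.Max {
		d = b.Max
	}
	return d
}

// jittered returns a random duration in [delay/2, delay] so that parallel
// workers do not all retry at the same instant.
func (b backoffConfig) jittered(attempt int) time.Duration {
	d := b.delay(attempt)
	half := d / 2
	return half + time.Duration(rand.Int63n(int64(d-half)+1))
}

// retry calls op up to attempts times, sleeping between failed attempts.
// It returns nil on the first success, otherwise the last error.
func retry(attempts int, b backoffConfig, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(b.jittered(i))
		}
	}
	return err
}
```

Inside a reconciler it is usually preferable to return a requeue delay (for example `ctrl.Result{RequeueAfter: ...}` in controller-runtime) rather than sleep in place, so `delay` and `jittered` are likely the more reusable parts of this sketch.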
|
||||||
|
|
||||||
|
## Testing Requirements
|
||||||
|
|
||||||
|
- Unit test coverage > 80%
|
||||||
|
- Integration tests for all scenarios
|
||||||
|
- Performance regression tests
|
||||||
|
- Documentation for all new features
|
||||||
|
- Backward compatibility tests
|
||||||
|
|
||||||
|
## Rollout Plan
|
||||||
|
|
||||||
|
1. **Alpha**: Deploy to dev environment with conservative defaults
|
||||||
|
2. **Beta**: Test with select users, gather feedback
|
||||||
|
3. **GA**: Full rollout with documentation and migration guide
|
||||||
|
|
||||||
|
## Dependencies
|
||||||
|
|
||||||
|
- No changes to CRDs required
|
||||||
|
- Compatible with existing Actions Runner Controller versions
|
||||||
|
- Requires Go 1.20+ for `errors.Join` support (already in use)
|
||||||
|
|
||||||
|
## Timeline Estimate
|
||||||
|
|
||||||
|
- Total Duration: 5-6 weeks
|
||||||
|
- Developer Resources: 1-2 engineers
|
||||||
|
- Review & Testing: Additional 1 week
|
||||||
|
|
||||||
|
## Notes for Implementation
|
||||||
|
|
||||||
|
1. Consider using `golang.org/x/sync/errgroup` for cleaner error handling
|
||||||
|
2. Leverage existing `multierr` package for error aggregation
|
||||||
|
3. Use context cancellation for proper cleanup
|
||||||
|
4. Consider implementing circuit breaker pattern for API failures
|
||||||
|
5. Add feature flag to enable/disable parallel creation
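
Notes 1 to 3 can be combined in a fairly small helper. The sketch below deliberately uses only the standard library (a buffered channel as a semaphore plus `errors.Join`) instead of `errgroup` or `multierr`, so it is a swapped-in illustration rather than the planned implementation. `createFunc`, `createBounded`, and `demoCreate` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// createFunc stands in for whatever call creates a single runner (hypothetical).
type createFunc func(ctx context.Context, index int) error

// createBounded runs create for indexes [0, count) with at most `concurrency`
// calls in flight. It stops launching new work once ctx is cancelled, waits for
// in-flight calls, and returns how many succeeded plus all errors joined.
func createBounded(ctx context.Context, count, concurrency int, create createFunc) (int, error) {
	if concurrency < 1 {
		concurrency = 1
	}
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		errs    []error
		created int
	)
	sem := make(chan struct{}, concurrency) // one slot per allowed in-flight call

launch:
	for i := 0; i < count; i++ {
		select {
		case <-ctx.Done():
			mu.Lock()
			errs = append(errs, fmt.Errorf("stopped before runner %d: %w", i, ctx.Err()))
			mu.Unlock()
			break launch
		case sem <- struct{}{}:
		}
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			err := create(ctx, i)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				errs = append(errs, fmt.Errorf("runner %d: %w", i, err))
				return
			}
			created++
		}(i)
	}
	wg.Wait()
	return created, errors.Join(errs...) // nil when errs is empty
}

// demoCreate is a usage example with a fake create function that fails on
// every failEvery-th runner (0 disables failures).
func demoCreate(count, concurrency, failEvery int) (int, error) {
	return createBounded(context.Background(), count, concurrency,
		func(ctx context.Context, i int) error {
			if failEvery > 0 && (i+1)%failEvery == 0 {
				return errors.New("simulated API failure")
			}
			return nil
		})
}
```

If `errgroup` is adopted instead, note that its `Wait` returns only the first error, so reporting every failed runner would still need `multierr` or `errors.Join` on the side.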
|
||||||