Add development setup for runner scale set controller optimization
- Add CLAUDE.md with project focus on new mode only (actions.github.com API)
- Add ENV_SETUP.md for local development with Kind cluster setup
- Add tasks.md with comprehensive performance optimization plan
- Configure for justanotherspy GitHub username and danielschwartzlol Docker Hub
- Use Helm charts version 0.12.1 for runner scale set controller
- Focus exclusively on optimizing EphemeralRunnerSetReconciler parallel creation
- No cert-manager required for new mode setup

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
parent ddc2918a48
commit c73b8a2b92
@@ -0,0 +1,234 @@
|
|||
# CLAUDE.md
|
||||
|
||||
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
|
||||
|
||||
## Repository Information
|
||||
|
||||
**THIS IS A FORK**: This repository is a fork of the upstream `actions/actions-runner-controller` repository.
|
||||
- **Fork Owner**: `justanotherspy`
|
||||
- **Upstream**: `actions/actions-runner-controller`
|
||||
- **IMPORTANT**: Always push changes to the fork (`justanotherspy/actions-runner-controller`), NEVER to upstream
|
||||
- **Default Branch**: Work on feature branches, not directly on master
|
||||
|
||||
## Project Focus
|
||||
|
||||
**IMPORTANT**: We work EXCLUSIVELY on the NEW Runner Scale Set Controller mode, NOT the legacy mode.
|
||||
|
||||
- **NEW Mode ONLY**: Autoscaling Runner Sets using `actions.github.com` API group
|
||||
- **NO Legacy Development**: Do not work on `actions.summerwind.net` resources
|
||||
- **NO Cert-Manager**: The new mode doesn't use webhooks or cert-manager
|
||||
- **GitHub Username**: `justanotherspy` (for test repositories)
|
||||
- **Docker Hub Account**: `danielschwartzlol`
|
||||
|
||||
## Development Configuration
|
||||
|
||||
- **Controller Image**: `danielschwartzlol/gha-runner-scale-set-controller`
|
||||
- **Runner Image**: Use official `ghcr.io/actions/actions-runner`
|
||||
- **Helm Charts** (Version 0.12.1):
|
||||
- Controller: `gha-runner-scale-set-controller`
|
||||
- Runner Set: `gha-runner-scale-set`
|
||||
- **Helm Chart Version**: Always use `0.12.1` (latest as of this setup)
|
||||
- **Local Development**: Use Kind cluster without cert-manager (see ENV_SETUP.md)
|
||||
- **Test Repository**: `justanotherspy/test-runner-repo`
|
||||
|
||||
## Key Components (New Mode Only)
|
||||
|
||||
### Controllers to Focus On
|
||||
|
||||
**AutoscalingRunnerSetReconciler** (`controllers/actions.github.com/autoscalingrunnerset_controller.go`)
|
||||
- Manages runner scale set lifecycle
|
||||
- Creates EphemeralRunnerSets based on demand
|
||||
- Handles runner group configuration
|
||||
|
||||
**EphemeralRunnerSetReconciler** (`controllers/actions.github.com/ephemeralrunnerset_controller.go`)
|
||||
- **CRITICAL FOR OPTIMIZATION**: Contains sequential runner creation loop
|
||||
- `createEphemeralRunners()` method at lines 359-386 needs parallelization
|
||||
- Manages replicas of EphemeralRunners
|
||||
|
||||
**EphemeralRunnerReconciler** (`controllers/actions.github.com/ephemeralrunner_controller.go`)
|
||||
- Manages individual runner pods
|
||||
- Handles runner registration with GitHub
|
||||
|
||||
**AutoscalingListenerReconciler** (`controllers/actions.github.com/autoscalinglistener_controller.go`)
|
||||
- Manages the listener pod, which long-polls the GitHub Actions service for job messages (it does not receive webhooks)
|
||||
- Triggers scaling events
|
||||
|
||||
### Resource Hierarchy (New Mode)
|
||||
|
||||
```text
|
||||
AutoscalingRunnerSet
|
||||
├── AutoscalingListener (long-polling listener pod)
|
||||
└── EphemeralRunnerSet
|
||||
└── EphemeralRunner (Pod)
|
||||
```
|
||||
|
||||
## Performance Optimization Focus
|
||||
|
||||
### Current Problem
|
||||
- `EphemeralRunnerSetReconciler.createEphemeralRunners()` creates runners sequentially
|
||||
- Time complexity: O(n) where n = number of runners
|
||||
- Bottleneck location: `controllers/actions.github.com/ephemeralrunnerset_controller.go:362-383`
|
||||
|
||||
### Optimization Goal
|
||||
- Implement parallel runner creation with worker pool pattern
|
||||
- Target: 10x improvement (create 100 runners in < 30 seconds)
|
||||
- Configurable concurrency (default: 10 parallel creations)
|
||||
|
||||
## Build Commands
|
||||
|
||||
```bash
|
||||
# Build controller for runner scale set mode
|
||||
make docker-build
|
||||
docker tag danielschwartzlol/actions-runner-controller:dev \
|
||||
danielschwartzlol/gha-runner-scale-set-controller:dev
|
||||
|
||||
# Run controller locally in scale set mode
|
||||
make run-scaleset
|
||||
|
||||
# Generate CRDs (only actions.github.com ones matter)
|
||||
make manifests
|
||||
|
||||
# Run tests for new mode controllers
|
||||
go test -v ./controllers/actions.github.com/...
|
||||
```
|
||||
|
||||
## Testing Commands
|
||||
|
||||
```bash
|
||||
# Unit tests for runner scale set controllers
|
||||
go test -v ./controllers/actions.github.com/... -run TestEphemeralRunnerSet
|
||||
|
||||
# Integration tests for new mode
|
||||
KUBEBUILDER_ASSETS="$(setup-envtest use 1.28 -p path)" \
|
||||
go test -v ./controllers/actions.github.com/...
|
||||
|
||||
# Benchmark runner creation
|
||||
go test -bench=BenchmarkCreateEphemeralRunners ./controllers/actions.github.com/...
|
||||
```
|
||||
|
||||
## Local Development Workflow
|
||||
|
||||
```bash
|
||||
# 1. Create Kind cluster (no cert-manager needed)
|
||||
kind create cluster --name arc-dev
|
||||
|
||||
# 2. Build and load controller
|
||||
VERSION=dev make docker-build
|
||||
docker tag danielschwartzlol/actions-runner-controller:dev \
|
||||
danielschwartzlol/gha-runner-scale-set-controller:dev
|
||||
kind load docker-image danielschwartzlol/gha-runner-scale-set-controller:dev --name arc-dev
|
||||
|
||||
# 3. Install controller with Helm (v0.12.1)
|
||||
helm install arc-controller \
|
||||
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
|
||||
--version 0.12.1 \
|
||||
--set image.repository=danielschwartzlol/gha-runner-scale-set-controller \
|
||||
--set image.tag=dev \
|
||||
--set image.pullPolicy=Never
|
||||
|
||||
# 4. Deploy runner scale set (v0.12.1)
|
||||
helm install arc-runner-set \
|
||||
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
|
||||
--version 0.12.1 \
|
||||
--set githubConfigUrl="https://github.com/justanotherspy/test-runner-repo" \
|
||||
--set githubConfigSecret="github-auth"
|
||||
```
|
||||
|
||||
## Important Files for Optimization
|
||||
|
||||
### Primary Focus
|
||||
- `controllers/actions.github.com/ephemeralrunnerset_controller.go` - Contains sequential creation logic
|
||||
- `controllers/actions.github.com/ephemeralrunner_controller.go` - Individual runner management
|
||||
- `controllers/actions.github.com/autoscalingrunnerset_controller.go` - Scale set orchestration
|
||||
|
||||
### Configuration
|
||||
- `charts/gha-runner-scale-set-controller/` - Controller Helm chart
|
||||
- `charts/gha-runner-scale-set/` - Runner set Helm chart
|
||||
- `cmd/ghalistener/` - Listener binary that long-polls the GitHub Actions service for scaling messages
|
||||
|
||||
### Tests
|
||||
- `controllers/actions.github.com/ephemeralrunnerset_controller_test.go`
|
||||
- `controllers/actions.github.com/ephemeralrunner_controller_test.go`
|
||||
|
||||
## Code Patterns for New Mode
|
||||
|
||||
### Creating Resources in Parallel
|
||||
```go
|
||||
// Example pattern for parallel creation.
// createRunnerWorker (not shown) is expected to read indexes from jobs and
// send exactly one result per job, so the collection loop below terminates.
func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel(
	ctx context.Context,
	runnerSet *v1alpha1.EphemeralRunnerSet,
	count int,
	log logr.Logger,
) error {
	workers := 10 // Configurable
	jobs := make(chan int, count)
	results := make(chan error, count)

	// Start workers
	for w := 0; w < workers; w++ {
		go r.createRunnerWorker(ctx, runnerSet, jobs, results, log)
	}

	// Queue jobs
	for i := 0; i < count; i++ {
		jobs <- i
	}
	close(jobs)

	// Collect results
	var errs []error
	for i := 0; i < count; i++ {
		if err := <-results; err != nil {
			errs = append(errs, err)
		}
	}

	return multierr.Combine(errs...)
}
|
||||
```
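
The pattern above calls `createRunnerWorker` without defining it. A minimal, self-contained sketch of what such a worker could look like follows. It uses only the standard library: `createFunc` is a hypothetical stand-in for the real `r.Create(ctx, ephemeralRunner)` call, `errors.Join` replaces `multierr.Combine`, and `simulate` exists only to exercise the pool. None of these names come from the controller.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// createFunc is a hypothetical stand-in for the call that creates one runner
// (in the controller this would wrap r.Create(ctx, ephemeralRunner)).
type createFunc func(ctx context.Context, index int) error

// createRunnerWorker drains jobs until the channel is closed and sends
// exactly one result per job.
func createRunnerWorker(ctx context.Context, create createFunc, jobs <-chan int, results chan<- error, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := range jobs {
		if err := ctx.Err(); err != nil {
			// Context cancelled: report the skipped job instead of calling the API.
			results <- fmt.Errorf("runner %d skipped: %w", i, err)
			continue
		}
		results <- create(ctx, i)
	}
}

// createParallel runs count creations across at most workers goroutines
// and returns the joined errors (nil when every creation succeeded).
func createParallel(ctx context.Context, create createFunc, count, workers int) error {
	if count <= 0 {
		return nil
	}
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan int, count)
	results := make(chan error, count) // buffered so workers never block on send
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go createRunnerWorker(ctx, create, jobs, results, &wg)
	}
	for i := 0; i < count; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	close(results)

	var errs []error
	for err := range results {
		if err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

// simulate runs the pool against a fake creator that fails every
// failEvery-th runner (0 disables failures).
func simulate(count, workers, failEvery int) error {
	create := func(ctx context.Context, i int) error {
		if failEvery > 0 && (i+1)%failEvery == 0 {
			return fmt.Errorf("create runner %d: simulated failure", i)
		}
		return nil
	}
	return createParallel(context.Background(), create, count, workers)
}

func main() {
	fmt.Println(simulate(20, 5, 0) == nil) // no failures injected
	fmt.Println(simulate(20, 5, 7) == nil) // runners 6 and 13 fail
}
```

Buffering `results` to `count` lets workers finish even if the caller has not started reading yet, which keeps the shutdown order (close jobs, wait, close results) simple.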
|
||||
|
||||
## GitHub API Integration
|
||||
|
||||
- Use `github.Client` interface for testability
|
||||
- Implement exponential backoff for rate limiting
|
||||
- Runner scale sets register with GitHub using JIT configuration
|
||||
- Default runner group: "default"
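
The exponential backoff mentioned above can be sketched as follows. This is an illustration under assumptions, not the controller's retry logic: `backoffDelay`, `retry`, `backoffMillis`, and `errRateLimited` are hypothetical names, the delay simply doubles per attempt up to a cap, and jitter is left out to keep the example deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRateLimited is a hypothetical sentinel for a throttled API call.
var errRateLimited = errors.New("rate limited")

// backoffDelay returns the wait before retry number attempt (0-based):
// base doubled once per attempt and capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// backoffMillis is a convenience wrapper that works in whole milliseconds.
func backoffMillis(attempt, baseMs, maxMs int) int {
	d := backoffDelay(attempt, time.Duration(baseMs)*time.Millisecond, time.Duration(maxMs)*time.Millisecond)
	return int(d / time.Millisecond)
}

// retry calls op up to maxAttempts times, sleeping with exponential
// backoff between failed attempts.
func retry(maxAttempts int, base, max time.Duration, op func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt < maxAttempts-1 {
			time.Sleep(backoffDelay(attempt, base, max))
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := retry(5, time.Millisecond, 8*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errRateLimited // simulated throttling on the first two calls
		}
		return nil
	})
	fmt.Println(calls, err) // prints "3 <nil>"
}
```

In a real client the delay would normally include random jitter and honor any retry hint returned by the API; both are omitted here.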
|
||||
|
||||
## DO NOT Work On
|
||||
|
||||
- **Legacy Controllers**: Anything in `controllers/actions.summerwind.net/`
|
||||
- **Cert-Manager**: Not used in new mode
|
||||
- **Webhooks**: New mode uses listener pod instead
|
||||
- **RunnerDeployment**: Legacy resource type
|
||||
- **HorizontalRunnerAutoscaler**: Legacy autoscaling
|
||||
|
||||
## Testing Performance Improvements
|
||||
|
||||
```bash
|
||||
# Create many runners to test parallel creation
|
||||
kubectl -n arc-runners patch ephemeralrunnerset <name> \
|
||||
--type merge -p '{"spec":{"replicas":100}}'
|
||||
|
||||
# Monitor creation time
|
||||
time kubectl -n arc-runners wait --for=jsonpath='{.status.phase}'=Running \
|
||||
ephemeralrunners --all --timeout=600s
|
||||
|
||||
# Check controller metrics
|
||||
kubectl port-forward -n arc-systems service/arc-controller 8080:80
|
||||
curl http://localhost:8080/metrics | grep ephemeral_runner_creation_duration
|
||||
```
|
||||
|
||||
## Key Metrics to Track
|
||||
|
||||
- `ephemeral_runner_creation_duration_seconds` - Time to create each runner
|
||||
- `ephemeral_runner_set_replicas` - Current vs desired replicas
|
||||
- `controller_runtime_reconcile_time_seconds` - Reconciliation performance
|
||||
|
||||
## Files Referenced
|
||||
|
||||
@ENV_SETUP.md - Complete setup guide for new mode
|
||||
@tasks.md - Performance optimization task plan
|
||||
@controllers/actions.github.com/ephemeralrunnerset_controller.go
|
||||
@controllers/actions.github.com/ephemeralrunner_controller.go
|
||||
@controllers/actions.github.com/autoscalingrunnerset_controller.go
|
||||
|
|
@@ -0,0 +1,382 @@
|
|||
# Local Development Environment Setup - Runner Scale Set Controller
|
||||
|
||||
This guide sets up a local development environment for the **NEW** GitHub Actions Runner Scale Set Controller (not the legacy mode).
|
||||
|
||||
## Important Notes
|
||||
|
||||
- **NO cert-manager required** - The new mode doesn't use webhooks
|
||||
- **NO legacy controller** - We only work with the new `actions.github.com` API group
|
||||
- Uses separate Helm charts: `gha-runner-scale-set-controller` and `gha-runner-scale-set`
|
||||
- GitHub username: `justanotherspy`
|
||||
- Docker Hub account: `danielschwartzlol`
|
||||
|
||||
## Prerequisites
|
||||
|
||||
### Required Tools
|
||||
|
||||
1. **Docker** - For running containers and Kind cluster
|
||||
|
||||
```bash
|
||||
# Ubuntu/Debian
|
||||
sudo apt-get update
|
||||
sudo apt-get install docker.io
|
||||
sudo usermod -aG docker $USER
|
||||
# Log out and back in for group changes to take effect
|
||||
```
|
||||
|
||||
2. **Kind** - Kubernetes in Docker
|
||||
|
||||
```bash
|
||||
# Install Kind
|
||||
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
|
||||
chmod +x ./kind
|
||||
sudo mv ./kind /usr/local/bin/kind
|
||||
```
|
||||
|
||||
3. **kubectl** - Kubernetes CLI
|
||||
|
||||
```bash
|
||||
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
|
||||
chmod +x kubectl
|
||||
sudo mv kubectl /usr/local/bin/
|
||||
```
|
||||
|
||||
4. **Helm** - Kubernetes package manager
|
||||
|
||||
```bash
|
||||
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
|
||||
```
|
||||
|
||||
5. **Go** - For building the controller (1.21+)
|
||||
|
||||
```bash
|
||||
# Install Go 1.21
|
||||
wget https://go.dev/dl/go1.21.5.linux-amd64.tar.gz
|
||||
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.21.5.linux-amd64.tar.gz
|
||||
export PATH=$PATH:/usr/local/go/bin
|
||||
echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
|
||||
```
|
||||
|
||||
### Environment Variables
|
||||
|
||||
Add these to your `.bashrc` or `.zshrc`:
|
||||
|
||||
```bash
|
||||
# Docker Hub Configuration
|
||||
export DOCKER_USER="danielschwartzlol"
|
||||
export CONTROLLER_IMAGE="${DOCKER_USER}/gha-runner-scale-set-controller"
|
||||
export RUNNER_IMAGE="ghcr.io/actions/actions-runner" # Official runner image
|
||||
|
||||
# GitHub Configuration
|
||||
export GITHUB_TOKEN="your-github-pat-token-here"
|
||||
export GITHUB_USERNAME="justanotherspy"
|
||||
|
||||
# Or for GitHub App authentication (recommended):
|
||||
# export APP_ID="your-app-id"
|
||||
# export INSTALLATION_ID="your-installation-id"
|
||||
# export PRIVATE_KEY_FILE_PATH="/path/to/private-key.pem"
|
||||
|
||||
# Test Repository Configuration
|
||||
export TEST_REPO="${GITHUB_USERNAME}/test-runner-repo"
|
||||
export TEST_ORG="" # Optional: Your test organization
|
||||
|
||||
# Development Settings
|
||||
export VERSION="dev"
|
||||
export CLUSTER_NAME="arc-dev"
|
||||
```
|
||||
|
||||
## Step 1: Build the Controller Image
|
||||
|
||||
```bash
|
||||
# Build the controller image with scale set mode
|
||||
make docker-build
|
||||
|
||||
# Tag it for our use
|
||||
docker tag ${DOCKER_USER}/actions-runner-controller:${VERSION} \
|
||||
${CONTROLLER_IMAGE}:${VERSION}
|
||||
```
|
||||
|
||||
## Step 2: Create Kind Cluster
|
||||
|
||||
Create a simple Kind cluster (no special config needed for new mode):
|
||||
|
||||
```bash
|
||||
# Create Kind cluster
|
||||
cat <<EOF | kind create cluster --name ${CLUSTER_NAME} --config=-
|
||||
kind: Cluster
|
||||
apiVersion: kind.x-k8s.io/v1alpha4
|
||||
nodes:
|
||||
- role: control-plane
|
||||
kubeadmConfigPatches:
|
||||
- |
|
||||
kind: InitConfiguration
|
||||
nodeRegistration:
|
||||
kubeletExtraArgs:
|
||||
node-labels: "ingress-ready=true"
|
||||
EOF
|
||||
|
||||
# Verify cluster is running
|
||||
kubectl cluster-info --context kind-${CLUSTER_NAME}
|
||||
```
|
||||
|
||||
## Step 3: Load Controller Image into Kind
|
||||
|
||||
```bash
|
||||
# Load the controller image
|
||||
kind load docker-image ${CONTROLLER_IMAGE}:${VERSION} --name ${CLUSTER_NAME}
|
||||
|
||||
# Verify image is loaded
|
||||
docker exec -it ${CLUSTER_NAME}-control-plane crictl images | grep ${DOCKER_USER}
|
||||
```
|
||||
|
||||
## Step 4: Create GitHub Authentication Secret
|
||||
|
||||
```bash
|
||||
# Create the namespace that will hold the runner scale set.
# The secret referenced by githubConfigSecret must live in the same
# namespace as the gha-runner-scale-set release (arc-runners in Step 6).
kubectl create namespace arc-runners

# For PAT authentication
kubectl create secret generic github-auth \
  --namespace=arc-runners \
  --from-literal=github_token="${GITHUB_TOKEN}"

# For GitHub App authentication (use this instead of the PAT secret, not in addition)
kubectl create secret generic github-auth \
  --namespace=arc-runners \
  --from-literal=github_app_id="${APP_ID}" \
  --from-literal=github_app_installation_id="${INSTALLATION_ID}" \
  --from-file=github_app_private_key="${PRIVATE_KEY_FILE_PATH}"
|
||||
```
|
||||
|
||||
## Step 5: Install Runner Scale Set Controller
|
||||
|
||||
### Option A: Using Helm (Recommended)
|
||||
|
||||
```bash
|
||||
# Install the controller
|
||||
helm install arc-controller \
|
||||
--namespace arc-systems \
|
||||
--create-namespace \
|
||||
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
|
||||
--version 0.12.1 \
|
||||
--set image.repository=${CONTROLLER_IMAGE} \
|
||||
--set image.tag=${VERSION} \
|
||||
--set image.pullPolicy=Never
|
||||
|
||||
# Verify controller is running
|
||||
kubectl -n arc-systems get pods -l app.kubernetes.io/name=gha-runner-scale-set-controller
|
||||
```
|
||||
|
||||
### Option B: Manual Deployment (for development)
|
||||
|
||||
```bash
|
||||
# Run the controller locally (for debugging)
|
||||
CONTROLLER_MANAGER_POD_NAMESPACE=arc-systems \
|
||||
CONTROLLER_MANAGER_CONTAINER_IMAGE="${CONTROLLER_IMAGE}:${VERSION}" \
|
||||
make run-scaleset
|
||||
```
|
||||
|
||||
## Step 6: Deploy Runner Scale Set
|
||||
|
||||
Create a runner scale set for your repository:
|
||||
|
||||
```bash
|
||||
# Install runner scale set
|
||||
helm install arc-runner-set \
|
||||
--namespace arc-runners \
|
||||
--create-namespace \
|
||||
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
|
||||
--version 0.12.1 \
|
||||
--set githubConfigUrl="https://github.com/${TEST_REPO}" \
|
||||
--set githubConfigSecret="github-auth" \
|
||||
--set controllerServiceAccount.namespace="arc-systems" \
|
||||
--set controllerServiceAccount.name="arc-controller-gha-rs-controller" \
|
||||
--set minRunners=1 \
|
||||
--set maxRunners=10 \
|
||||
--set runnerGroup="default" \
|
||||
--set runnerScaleSetName="test-scale-set"
|
||||
|
||||
# Watch the runner scale set
|
||||
kubectl -n arc-runners get autoscalingrunnersets -w
|
||||
kubectl -n arc-runners get ephemeralrunnersets -w
|
||||
kubectl -n arc-runners get ephemeralrunners -w
|
||||
```
|
||||
|
||||
## Step 7: Verify Installation
|
||||
|
||||
```bash
|
||||
# Check controller logs
|
||||
kubectl -n arc-systems logs -l app.kubernetes.io/name=gha-runner-scale-set-controller -f
|
||||
|
||||
# Check listener logs
|
||||
kubectl -n arc-systems logs -l app.kubernetes.io/name=arc-runner-set-listener -f
|
||||
|
||||
# Check runner pods
|
||||
kubectl -n arc-runners get pods
|
||||
|
||||
# Get runner scale set status
|
||||
kubectl -n arc-runners get autoscalingrunnersets -o wide
|
||||
```
|
||||
|
||||
## Development Workflow
|
||||
|
||||
### Quick Iteration for Controller Changes
|
||||
|
||||
```bash
|
||||
# 1. Make your code changes
|
||||
|
||||
# 2. Rebuild controller
|
||||
export VERSION=dev-$(date +%s)  # export so the tag, load, and set-image steps below use the same value
make docker-build
|
||||
docker tag ${DOCKER_USER}/actions-runner-controller:${VERSION} \
|
||||
${CONTROLLER_IMAGE}:${VERSION}
|
||||
|
||||
# 3. Load into Kind
|
||||
kind load docker-image ${CONTROLLER_IMAGE}:${VERSION} --name ${CLUSTER_NAME}
|
||||
|
||||
# 4. Update the deployment
|
||||
kubectl -n arc-systems set image deployment/arc-controller-gha-rs-controller \
|
||||
manager=${CONTROLLER_IMAGE}:${VERSION}
|
||||
|
||||
# 5. Watch logs
|
||||
kubectl -n arc-systems logs -l app.kubernetes.io/name=gha-runner-scale-set-controller -f
|
||||
```
|
||||
|
||||
### Testing Parallel Runner Creation
|
||||
|
||||
```bash
|
||||
# Scale up to test parallel creation
|
||||
kubectl -n arc-runners patch autoscalingrunnerset arc-runner-set-runner-set \
|
||||
--type merge \
|
||||
-p '{"spec":{"maxRunners":50}}'
|
||||
|
||||
# Trigger scale up by running workflows in your test repo
|
||||
# Or manually patch the ephemeralrunnerset
|
||||
kubectl -n arc-runners patch ephemeralrunnerset <name> \
|
||||
--type merge \
|
||||
-p '{"spec":{"replicas":50}}'
|
||||
|
||||
# Monitor creation time
|
||||
time kubectl -n arc-runners wait --for=jsonpath='{.status.phase}'=Running ephemeralrunners --all --timeout=600s
|
||||
|
||||
# Check metrics
|
||||
kubectl -n arc-systems port-forward service/arc-controller-gha-rs-controller 8080:80
|
||||
curl http://localhost:8080/metrics | grep ephemeral
|
||||
```
|
||||
|
||||
## Debugging
|
||||
|
||||
### Enable Verbose Logging
|
||||
|
||||
```bash
|
||||
# Update controller deployment with debug logging
|
||||
kubectl -n arc-systems edit deployment arc-controller-gha-rs-controller
|
||||
|
||||
# Add to container args:
|
||||
# - "--log-level=debug"
|
||||
```
|
||||
|
||||
### Common Commands
|
||||
|
||||
```bash
|
||||
# Get all resources
|
||||
kubectl get all -n arc-systems
|
||||
kubectl get all -n arc-runners
|
||||
|
||||
# Describe runner set
|
||||
kubectl -n arc-runners describe autoscalingrunnerset
|
||||
|
||||
# Get events
|
||||
kubectl -n arc-runners get events --sort-by='.lastTimestamp'
|
||||
|
||||
# Port forward for pprof debugging
|
||||
kubectl -n arc-systems port-forward deployment/arc-controller-gha-rs-controller 6060:6060
|
||||
go tool pprof http://localhost:6060/debug/pprof/profile
|
||||
```
|
||||
|
||||
## Performance Testing Script
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# perf-test.sh
|
||||
|
||||
NAMESPACE="arc-runners"
|
||||
REPLICAS="${1:-100}"
|
||||
|
||||
echo "Testing creation of ${REPLICAS} runners..."
|
||||
|
||||
# Record start time
|
||||
START=$(date +%s)
|
||||
|
||||
# Scale up
|
||||
kubectl -n ${NAMESPACE} patch "$(kubectl -n ${NAMESPACE} get ephemeralrunnersets -o name | head -1)" \
|
||||
--type merge \
|
||||
-p "{\"spec\":{\"replicas\":${REPLICAS}}}"
|
||||
|
||||
# Wait for all runners
|
||||
kubectl -n ${NAMESPACE} wait --for=jsonpath='{.status.phase}'=Running ephemeralrunners --all --timeout=600s
|
||||
|
||||
# Record end time
|
||||
END=$(date +%s)
|
||||
DURATION=$((END - START))
|
||||
|
||||
echo "Created ${REPLICAS} runners in ${DURATION} seconds"
|
||||
echo "Average time per runner: $(awk -v d="${DURATION}" -v n="${REPLICAS}" 'BEGIN { printf "%.2f", d / n }') seconds"
|
||||
|
||||
# Get runner creation events
|
||||
kubectl -n ${NAMESPACE} get events --field-selector reason=Created | grep EphemeralRunner
|
||||
```
|
||||
|
||||
## Cleanup
|
||||
|
||||
```bash
|
||||
# Delete runner scale set
|
||||
helm uninstall arc-runner-set -n arc-runners
|
||||
|
||||
# Delete controller
|
||||
helm uninstall arc-controller -n arc-systems
|
||||
|
||||
# Delete namespaces
|
||||
kubectl delete namespace arc-systems arc-runners
|
||||
|
||||
# Delete Kind cluster
|
||||
kind delete cluster --name ${CLUSTER_NAME}
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Runner Scale Set Not Creating Runners
|
||||
|
||||
```bash
|
||||
# Check if runner scale set is registered
|
||||
kubectl -n arc-runners get autoscalingrunnerset -o yaml | grep runnerScaleSetId
|
||||
|
||||
# Check GitHub API connectivity
|
||||
kubectl -n arc-systems exec -it deployment/arc-controller-gha-rs-controller -- \
|
||||
curl -X POST -H "Authorization: token ${GITHUB_TOKEN}" \
|
||||
https://api.github.com/repos/${TEST_REPO}/actions/runners/registration-token
|
||||
```
|
||||
|
||||
### Runners Not Picking Up Jobs
|
||||
|
||||
```bash
|
||||
# Ensure the workflow targets the runner scale set by name.
# In the workflow file:
#   runs-on: test-scale-set   # runnerScaleSetName from Step 6 (defaults to the Helm release name)
|
||||
|
||||
# Check runner registration
|
||||
kubectl -n arc-runners logs -l app.kubernetes.io/component=runner --tail=100
|
||||
```
|
||||
|
||||
## Key Differences from Legacy Mode
|
||||
|
||||
1. **No Cert-Manager**: New mode doesn't use admission webhooks
|
||||
2. **Different CRDs**: Uses `AutoscalingRunnerSet`, `EphemeralRunnerSet`, `EphemeralRunner`
|
||||
3. **Separate Helm Charts**: `gha-runner-scale-set-controller` and `gha-runner-scale-set`
|
||||
4. **Listener Pod**: Runs in controller namespace, long-polls the GitHub Actions service instead of receiving webhooks
|
||||
5. **No Runner Deployment**: Only uses ephemeral runners
|
||||
|
||||
## Resources
|
||||
|
||||
- [Runner Scale Set Documentation](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller)
|
||||
- [ARC Helm Charts](https://github.com/actions/actions-runner-controller/tree/master/charts)
|
||||
- [Kind Documentation](https://kind.sigs.k8s.io/)
|
||||
|
|
@@ -0,0 +1,246 @@
|
|||
# Runner Scale Set Controller Performance Optimization
|
||||
|
||||
## Problem Analysis
|
||||
|
||||
Based on analysis of the codebase, the runner scale set controller currently spawns runners **sequentially** in the `EphemeralRunnerSetReconciler.createEphemeralRunners()` method at `/controllers/actions.github.com/ephemeralrunnerset_controller.go:359-386`.
|
||||
|
||||
### Current Sequential Implementation Issues:
|
||||
1. **Linear time complexity O(n)**: Creating n runners takes n sequential API calls
|
||||
2. **Blocking loop**: Each runner creation blocks until the API call completes
|
||||
3. **Poor scalability**: Large scale-ups (e.g., 100+ runners) take minutes
|
||||
4. **Resource underutilization**: Controller pod doesn't leverage available CPU/memory for parallel operations
|
||||
|
||||
### Key Bottlenecks Identified:
|
||||
- **EphemeralRunnerSet Controller** (`ephemeralrunnerset_controller.go:362-383`): Sequential for-loop creating runners one by one
|
||||
- **API Call Latency**: Each `r.Create(ctx, ephemeralRunner)` call blocks for network roundtrip
|
||||
- **No batching**: Individual API calls instead of batch operations
|
||||
- **No concurrency**: Single-threaded execution path
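
To make the cost of the sequential loop concrete, here is a back-of-the-envelope estimate. It assumes every create call takes the same latency and that `c` calls can run at once without throttling. The 200 ms figure is an assumed value for illustration, not a measurement, and `batches` and `estimate` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// batches returns ceil(n / c): the number of sequential "waves" needed
// when c creations run at the same time.
func batches(n, c int) int {
	if n <= 0 {
		return 0
	}
	if c < 1 {
		c = 1
	}
	return (n + c - 1) / c
}

// estimate approximates wall-clock time assuming uniform per-call latency.
func estimate(n, c int, latency time.Duration) time.Duration {
	return time.Duration(batches(n, c)) * latency
}

func main() {
	latency := 200 * time.Millisecond // assumed per-call latency, not measured
	for _, c := range []int{1, 10, 25} {
		fmt.Printf("concurrency=%d -> %v\n", c, estimate(100, c, latency))
	}
}
```

With these assumptions, 100 runners take 20s sequentially, 2s at a concurrency of 10, and 800ms at 25. Real numbers depend on API server latency and client-side rate limits, which is why Task 1.1 calls for a measured baseline.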
|
||||
|
||||
## Proposed Task List for Performance Improvement
|
||||
|
||||
### Phase 1: Research & Design (Week 1)
|
||||
- [ ] **Task 1.1**: Benchmark current performance
|
||||
- Measure time to create 10, 50, 100, 500 runners
|
||||
- Profile CPU/memory usage during scale-up
|
||||
- Document baseline metrics for comparison
|
||||
|
||||
- [ ] **Task 1.2**: Research Kubernetes client-go patterns for concurrent resource creation
|
||||
- Study controller-runtime workqueue patterns
|
||||
- Investigate rate limiting considerations
|
||||
- Review best practices for bulk operations
|
||||
|
||||
- [ ] **Task 1.3**: Design concurrent runner creation architecture
|
||||
- Define optimal concurrency level (suggest: configurable, default 10)
|
||||
- Design error handling and retry strategy
|
||||
- Plan backward compatibility approach
|
||||
|
||||
### Phase 2: Implementation (Week 2-3)
|
||||
|
||||
- [ ] **Task 2.1**: Refactor `createEphemeralRunners` for parallel execution
|
||||
```go
|
||||
// Suggested approach:
|
||||
// - Use worker pool pattern with configurable concurrency
|
||||
// - Implement error aggregation
|
||||
// - Add progress tracking
|
||||
```
|
||||
|
||||
- [ ] **Task 2.2**: Implement configurable concurrency controls
|
||||
- Add `--runner-creation-concurrency` flag (default: 10)
|
||||
- Add `--runner-creation-timeout` flag (default: 30s)
|
||||
- Environment variable overrides for containerized deployments
|
||||
|
||||
- [ ] **Task 2.3**: Add comprehensive error handling
|
||||
- Implement exponential backoff for failed creations
|
||||
- Partial success handling (some runners created, some failed)
|
||||
- Detailed error reporting and metrics
|
||||
|
||||
- [ ] **Task 2.4**: Implement progress tracking and observability
|
||||
- Add Prometheus metrics for creation time per runner
|
||||
- Log progress at intervals (e.g., "Created 50/100 runners")
|
||||
- Add events to AutoscalingRunnerSet for visibility
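
One possible shape for the Task 2.2 flags is sketched below with the standard `flag` package. The flag names come from the task list; the environment variable names (`RUNNER_CREATION_CONCURRENCY`, `RUNNER_CREATION_TIMEOUT`), the `creationConfig` type, and the precedence (flag over environment over default) are assumptions made for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strconv"
	"time"
)

// creationConfig holds the proposed tuning knobs (hypothetical type).
type creationConfig struct {
	Concurrency int
	Timeout     time.Duration
}

// envInt returns the positive integer stored in key, or def when the
// variable is unset or invalid.
func envInt(key string, def int) int {
	if v, ok := os.LookupEnv(key); ok {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return def
}

// envDuration returns the positive duration stored in key, or def.
func envDuration(key string, def time.Duration) time.Duration {
	if v, ok := os.LookupEnv(key); ok {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			return d
		}
	}
	return def
}

// parseConfig applies the precedence flag > environment > built-in default
// by using the environment value as each flag's default.
func parseConfig(args []string) (creationConfig, error) {
	var cfg creationConfig
	fs := flag.NewFlagSet("manager", flag.ContinueOnError)
	fs.IntVar(&cfg.Concurrency, "runner-creation-concurrency",
		envInt("RUNNER_CREATION_CONCURRENCY", 10), "maximum parallel runner creations")
	fs.DurationVar(&cfg.Timeout, "runner-creation-timeout",
		envDuration("RUNNER_CREATION_TIMEOUT", 30*time.Second), "timeout per runner creation")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	if cfg.Concurrency < 1 {
		return cfg, fmt.Errorf("runner-creation-concurrency must be >= 1, got %d", cfg.Concurrency)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("concurrency=%d timeout=%v\n", cfg.Concurrency, cfg.Timeout)
}
```

Using the environment value as the flag default keeps a single parsing path, so an explicit flag always wins without extra branching.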
|
||||
|
||||
### Phase 3: Testing (Week 3-4)
|
||||
|
||||
- [ ] **Task 3.1**: Unit tests for concurrent creation
|
||||
- Test with mock client
|
||||
- Verify error handling
|
||||
- Test concurrency limits
|
||||
- Test partial failures
|
||||
|
||||
- [ ] **Task 3.2**: Integration tests
|
||||
- Test with real Kubernetes API
|
||||
- Verify resource creation order
|
||||
- Test rollback on failure
|
||||
- Test with various concurrency levels
|
||||
|
||||
- [ ] **Task 3.3**: Load testing
|
||||
- Test creating 100+ runners simultaneously
|
||||
- Monitor API server impact
|
||||
- Measure improvement vs baseline
|
||||
- Test with rate limiting
|
||||
|
||||
- [ ] **Task 3.4**: Chaos testing
|
||||
- Test with network failures
|
||||
- Test with API server throttling
|
||||
- Test with partial quota exhaustion
|
||||
- Test controller restart during creation
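
For the "Test concurrency limits" item in Task 3.1, one way to check that a limit is honored is to record the peak number of in-flight creations. The sketch below does this against a fake creation step (a short sleep) with a buffered-channel semaphore. `peakInFlight` is a hypothetical helper, and the real test would drive the reconciler with a mock client instead.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakInFlight runs count fake creations with at most limit in flight
// (enforced by a buffered-channel semaphore) and returns the highest
// concurrency actually observed.
func peakInFlight(count, limit int) int {
	if limit < 1 {
		limit = 1
	}
	var inFlight, peak atomic.Int64
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i := 0; i < count; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit creations are already running
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // release the slot after the counter is decremented
			cur := inFlight.Add(1)
			// Raise the recorded peak if this goroutine pushed it higher.
			for {
				old := peak.Load()
				if cur <= old || peak.CompareAndSwap(old, cur) {
					break
				}
			}
			time.Sleep(2 * time.Millisecond) // stand-in for API latency
			inFlight.Add(-1)
		}()
	}
	wg.Wait()
	return int(peak.Load())
}

func main() {
	fmt.Println(peakInFlight(40, 8) <= 8) // the limit is never exceeded
}
```

The observed peak can be lower than the limit on a busy machine, so a test should assert an upper bound rather than equality.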
|
||||
|
||||
### Phase 4: Optimization & Tuning (Week 4-5)
|
||||
|
||||
- [ ] **Task 4.1**: Implement adaptive concurrency
|
||||
- Start with low concurrency, increase based on success rate
|
||||
- Back off on errors or throttling
|
||||
- Self-tuning based on cluster capacity
|
||||
|
||||
- [ ] **Task 4.2**: Add bulk creation API support (if available)
|
||||
- Research if Actions API supports bulk runner registration
|
||||
- Implement batch registration if supported
|
||||
- Fall back to parallel individual creation
|
||||
|
||||
- [ ] **Task 4.3**: Optimize resource creation
|
||||
- Pre-compute runner configurations
|
||||
- Cache common data (secrets, configs)
|
||||
- Minimize API calls per runner
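
The adaptive concurrency idea in Task 4.1 can be sketched with an additive-increase / multiplicative-decrease (AIMD) rule: grow the limit by one after a clean batch and halve it when a batch sees failures or throttling. This is one possible policy chosen for illustration, not a decided design, and `adaptiveLimit` is a hypothetical type.

```go
package main

import "fmt"

// adaptiveLimit tracks a concurrency limit between floor and ceiling.
type adaptiveLimit struct {
	current, floor, ceiling int
}

// newAdaptiveLimit clamps the starting value into [floor, ceiling].
func newAdaptiveLimit(start, floor, ceiling int) *adaptiveLimit {
	if floor < 1 {
		floor = 1
	}
	if ceiling < floor {
		ceiling = floor
	}
	if start < floor {
		start = floor
	}
	if start > ceiling {
		start = ceiling
	}
	return &adaptiveLimit{current: start, floor: floor, ceiling: ceiling}
}

// observe updates the limit after a batch and returns the new value:
// +1 after a batch without failures, halved (not below floor) otherwise.
func (a *adaptiveLimit) observe(failures int) int {
	if failures > 0 {
		a.current /= 2
		if a.current < a.floor {
			a.current = a.floor
		}
	} else if a.current < a.ceiling {
		a.current++
	}
	return a.current
}

func main() {
	limit := newAdaptiveLimit(2, 1, 10)
	for _, failures := range []int{0, 0, 0, 2, 0} {
		fmt.Print(limit.observe(failures), " ")
	}
	fmt.Println()
}
```

Starting from 2, the limit in `main` moves through 3, 4, 5, drops to 2 after the failing batch, then recovers to 3.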
|
||||
|
||||
### Phase 5: Documentation & Rollout (Week 5-6)
|
||||
|
||||
- [ ] **Task 5.1**: Document configuration options
|
||||
- Update CLAUDE.md with new flags
|
||||
- Add tuning guide for different cluster sizes
|
||||
- Document performance improvements
|
||||
|
||||
- [ ] **Task 5.2**: Create migration guide
|
||||
- Document any breaking changes
|
||||
- Provide upgrade path
|
||||
- Include rollback procedures
|
||||
|
||||
- [ ] **Task 5.3**: Performance report
|
||||
- Before/after benchmarks
|
||||
- Scalability analysis
|
||||
- Recommendations for different use cases
|
||||
|
||||
## Implementation Details
|
||||
|
||||
### Suggested Code Structure
|
||||
|
||||
```go
|
||||
// ephemeralrunnerset_controller.go

type runnerCreationJob struct {
	runner *v1alpha1.EphemeralRunner
	index  int
	err    error
}

func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel(
	ctx context.Context,
	runnerSet *v1alpha1.EphemeralRunnerSet,
	count int,
	log logr.Logger,
) error {
	concurrency := r.getConfiguredConcurrency() // Default: 10

	jobs := make(chan runnerCreationJob, count)
	results := make(chan runnerCreationJob, count)

	// Start workers
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go r.runnerCreationWorker(ctx, runnerSet, jobs, results, &wg, log)
	}

	// Queue jobs
	for i := 0; i < count; i++ {
		jobs <- runnerCreationJob{
			runner: r.newEphemeralRunner(runnerSet),
			index:  i,
		}
	}
	close(jobs)

	// Wait for completion
	go func() {
		wg.Wait()
		close(results)
	}()

	// Collect results and handle errors
	var errs []error
	created := 0
	for result := range results {
		if result.err != nil {
			errs = append(errs, result.err)
		} else {
			created++
			if created%10 == 0 || created == count {
				log.Info("Runner creation progress", "created", created, "total", count)
			}
		}
	}

	return multierr.Combine(errs...)
}
|
||||
```
|
||||
|
||||
## Success Metrics
|
||||
|
||||
1. **Performance**:
|
||||
- Target: Create 100 runners in < 30 seconds (vs current ~5 minutes)
|
||||
- Reduce wall-clock creation time from roughly n × t to roughly ceil(n/c) × t, where t = per-call latency and c = concurrency (the number of API calls stays O(n))
|
||||
|
||||
2. **Reliability**:
|
||||
- Handle partial failures gracefully
|
||||
- No runner leaks on error
|
||||
- Proper cleanup on controller restart
|
||||
|
||||
3. **Observability**:
|
||||
- Clear progress tracking
|
||||
- Detailed metrics and logs
|
||||
- Actionable error messages
|
||||
|
||||
4. **Compatibility**:
|
||||
- Backward compatible by default
|
||||
- Configurable for different environments
|
||||
- No breaking changes to CRDs
|
||||
|
||||
## Risk Mitigation
|
||||
|
||||
1. **API Server Overload**: Implement rate limiting and backoff
|
||||
2. **Resource Exhaustion**: Add memory/CPU limits and monitoring
|
||||
3. **Partial Failures**: Implement proper rollback and cleanup
|
||||
4. **Race Conditions**: Use proper locking and atomic operations
|
||||
|
||||
## Testing Requirements
|
||||
|
||||
- Unit test coverage > 80%
|
||||
- Integration tests for all scenarios
|
||||
- Performance regression tests
|
||||
- Documentation for all new features
|
||||
- Backward compatibility tests
|
||||
|
||||
## Rollout Plan
|
||||
|
||||
1. **Alpha**: Deploy to dev environment with conservative defaults
|
||||
2. **Beta**: Test with select users, gather feedback
|
||||
3. **GA**: Full rollout with documentation and migration guide
|
||||
|
||||
## Dependencies
|
||||
|
||||
- No changes to CRDs required
|
||||
- Compatible with existing Actions Runner Controller versions
|
||||
- Requires Go 1.20+ for `errors.Join` support (already in use)
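
As an illustration of how `errors.Join` from the standard library supports partial-success handling, the sketch below aggregates per-runner results into a created count and a single error. `summarize` and `errQuota` are hypothetical names, and the controller itself may keep using `multierr`.

```go
package main

import (
	"errors"
	"fmt"
)

// errQuota is a hypothetical sentinel error used for the demonstration.
var errQuota = errors.New("quota exceeded")

// summarize counts successful creations and joins the failures into one
// error, which is nil when every creation succeeded.
func summarize(results []error) (created int, err error) {
	var errs []error
	for i, r := range results {
		if r != nil {
			errs = append(errs, fmt.Errorf("runner %d: %w", i, r))
			continue
		}
		created++
	}
	return created, errors.Join(errs...)
}

func main() {
	created, err := summarize([]error{nil, errQuota, nil})
	// The joined error still matches the wrapped sentinel via errors.Is.
	fmt.Println(created, errors.Is(err, errQuota)) // prints "2 true"
}
```

Because the joined error keeps each wrapped cause, callers can still branch on specific failures with `errors.Is` or `errors.As`.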
|
||||
|
||||
## Timeline Estimate
|
||||
|
||||
- Total Duration: 5-6 weeks
|
||||
- Developer Resources: 1-2 engineers
|
||||
- Review & Testing: Additional 1 week
|
||||
|
||||
## Notes for Implementation
|
||||
|
||||
1. Consider using `golang.org/x/sync/errgroup` for cleaner error handling
|
||||
2. Leverage existing `multierr` package for error aggregation
|
||||
3. Use context cancellation for proper cleanup
|
||||
4. Consider implementing circuit breaker pattern for API failures
|
||||
5. Add feature flag to enable/disable parallel creation
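
Note 3 (context cancellation) combines naturally with the per-creation timeout proposed in Task 2.2. A minimal sketch, assuming the create call honors its context the way Kubernetes client calls generally do: `createWithTimeout`, `slowCreate`, and `timedOut` are hypothetical helpers, and the sleep stands in for the API round trip.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// createWithTimeout bounds one creation attempt with its own deadline,
// derived from the parent context so that parent cancellation still applies.
func createWithTimeout(parent context.Context, timeout time.Duration, create func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel() // always release the timer
	return create(ctx)
}

// slowCreate simulates an API call that takes d but aborts as soon as
// its context is done.
func slowCreate(d time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// timedOut reports whether a creation taking workMs milliseconds hits a
// deadline of timeoutMs milliseconds.
func timedOut(workMs, timeoutMs int) bool {
	err := createWithTimeout(
		context.Background(),
		time.Duration(timeoutMs)*time.Millisecond,
		slowCreate(time.Duration(workMs)*time.Millisecond),
	)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(timedOut(1, 200)) // finishes well before the deadline
	fmt.Println(timedOut(200, 5)) // the deadline fires first
}
```

Deriving each deadline from the reconcile context means a controller shutdown cancels all in-flight creations at once, while a single slow call cannot stall the whole batch.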
|
||||