Add development setup for runner scale set controller optimization

- Add CLAUDE.md with project focus on new mode only (actions.github.com API)
- Add ENV_SETUP.md for local development with Kind cluster setup
- Add tasks.md with comprehensive performance optimization plan
- Configure for justanotherspy GitHub username and danielschwartzlol Docker Hub
- Use Helm charts version 0.12.1 for runner scale set controller
- Focus exclusively on optimizing EphemeralRunnerSetReconciler parallel creation
- No cert-manager required for new mode setup

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Daniel Schwartz, 2025-08-19 15:03:15 +02:00
commit c73b8a2b92, parent ddc2918a48
3 changed files with 862 additions and 0 deletions

CLAUDE.md (new file, 234 lines)
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Repository Information
**THIS IS A FORK**: This repository is a fork of the upstream `actions/actions-runner-controller` repository.
- **Fork Owner**: `justanotherspy`
- **Upstream**: `actions/actions-runner-controller`
- **IMPORTANT**: Always push changes to the fork (`justanotherspy/actions-runner-controller`), NEVER to upstream
- **Default Branch**: Work on feature branches, not directly on master
## Project Focus
**IMPORTANT**: We work EXCLUSIVELY on the NEW Runner Scale Set Controller mode, NOT the legacy mode.
- **NEW Mode ONLY**: Autoscaling Runner Sets using `actions.github.com` API group
- **NO Legacy Development**: Do not work on `actions.summerwind.net` resources
- **NO Cert-Manager**: The new mode doesn't use webhooks or cert-manager
- **GitHub Username**: `justanotherspy` (for test repositories)
- **Docker Hub Account**: `danielschwartzlol`
## Development Configuration
- **Controller Image**: `danielschwartzlol/gha-runner-scale-set-controller`
- **Runner Image**: Use official `ghcr.io/actions/actions-runner`
- **Helm Charts** (Version 0.12.1):
- Controller: `gha-runner-scale-set-controller`
- Runner Set: `gha-runner-scale-set`
- **Helm Chart Version**: Always use `0.12.1` (latest as of this setup)
- **Local Development**: Use Kind cluster without cert-manager (see ENV_SETUP.md)
- **Test Repository**: `justanotherspy/test-runner-repo`
## Key Components (New Mode Only)
### Controllers to Focus On
**AutoscalingRunnerSetReconciler** (`controllers/actions.github.com/autoscalingrunnerset_controller.go`)
- Manages runner scale set lifecycle
- Creates EphemeralRunnerSets based on demand
- Handles runner group configuration
**EphemeralRunnerSetReconciler** (`controllers/actions.github.com/ephemeralrunnerset_controller.go`)
- **CRITICAL FOR OPTIMIZATION**: Contains sequential runner creation loop
- `createEphemeralRunners()` method at lines 359-386 needs parallelization
- Manages replicas of EphemeralRunners
**EphemeralRunnerReconciler** (`controllers/actions.github.com/ephemeralrunner_controller.go`)
- Manages individual runner pods
- Handles runner registration with GitHub
**AutoscalingListenerReconciler** (`controllers/actions.github.com/autoscalinglistener_controller.go`)
- Manages the listener pod that receives GitHub webhooks
- Triggers scaling events
### Resource Hierarchy (New Mode)
```text
AutoscalingRunnerSet
├── AutoscalingListener (webhook receiver pod)
└── EphemeralRunnerSet
└── EphemeralRunner (Pod)
```
## Performance Optimization Focus
### Current Problem
- `EphemeralRunnerSetReconciler.createEphemeralRunners()` creates runners sequentially
- Time complexity: O(n) where n = number of runners
- Bottleneck location: `controllers/actions.github.com/ephemeralrunnerset_controller.go:362-383`
### Optimization Goal
- Implement parallel runner creation with worker pool pattern
- Target: 10x improvement (create 100 runners in < 30 seconds)
- Configurable concurrency (default: 10 parallel creations)
## Build Commands
```bash
# Build controller for runner scale set mode
make docker-build
docker tag danielschwartzlol/actions-runner-controller:dev \
danielschwartzlol/gha-runner-scale-set-controller:dev
# Run controller locally in scale set mode
make run-scaleset
# Generate CRDs (only actions.github.com ones matter)
make manifests
# Run tests for new mode controllers
go test -v ./controllers/actions.github.com/...
```
## Testing Commands
```bash
# Unit tests for runner scale set controllers
go test -v ./controllers/actions.github.com/... -run TestEphemeralRunnerSet
# Integration tests for new mode
KUBEBUILDER_ASSETS="$(setup-envtest use 1.28 -p path)" \
go test -v ./controllers/actions.github.com/...
# Benchmark runner creation
go test -bench=BenchmarkCreateEphemeralRunners ./controllers/actions.github.com/...
```
## Local Development Workflow
```bash
# 1. Create Kind cluster (no cert-manager needed)
kind create cluster --name arc-dev
# 2. Build and load controller
VERSION=dev make docker-build
docker tag danielschwartzlol/actions-runner-controller:dev \
danielschwartzlol/gha-runner-scale-set-controller:dev
kind load docker-image danielschwartzlol/gha-runner-scale-set-controller:dev --name arc-dev
# 3. Install controller with Helm (v0.12.1)
helm install arc-controller \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
--version 0.12.1 \
--set image.repository=danielschwartzlol/gha-runner-scale-set-controller \
--set image.tag=dev \
--set imagePullPolicy=Never
# 4. Deploy runner scale set (v0.12.1)
helm install arc-runner-set \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
--version 0.12.1 \
--set githubConfigUrl="https://github.com/justanotherspy/test-runner-repo" \
--set githubConfigSecret="github-auth"
```
## Important Files for Optimization
### Primary Focus
- `controllers/actions.github.com/ephemeralrunnerset_controller.go` - Contains sequential creation logic
- `controllers/actions.github.com/ephemeralrunner_controller.go` - Individual runner management
- `controllers/actions.github.com/autoscalingrunnerset_controller.go` - Scale set orchestration
### Configuration
- `charts/gha-runner-scale-set-controller/` - Controller Helm chart
- `charts/gha-runner-scale-set/` - Runner set Helm chart
- `cmd/ghalistener/` - Listener pod that receives GitHub webhooks
### Tests
- `controllers/actions.github.com/ephemeralrunnerset_controller_test.go`
- `controllers/actions.github.com/ephemeralrunner_controller_test.go`
## Code Patterns for New Mode
### Creating Resources in Parallel
```go
// Example pattern for parallel creation
func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel(
	ctx context.Context,
	runnerSet *v1alpha1.EphemeralRunnerSet,
	count int,
	log logr.Logger,
) error {
	workers := 10 // Configurable
	jobs := make(chan int, count)
	results := make(chan error, count)

	// Start workers
	for w := 0; w < workers; w++ {
		go r.createRunnerWorker(ctx, runnerSet, jobs, results, log)
	}

	// Queue jobs
	for i := 0; i < count; i++ {
		jobs <- i
	}
	close(jobs)

	// Collect exactly one result per job; workers must send nil on
	// success so this loop terminates.
	var errs []error
	for i := 0; i < count; i++ {
		if err := <-results; err != nil {
			errs = append(errs, err)
		}
	}
	return multierr.Combine(errs...)
}
```
## GitHub API Integration
- Use `github.Client` interface for testability
- Implement exponential backoff for rate limiting
- Runner scale sets register with GitHub using JIT configuration
- Default runner group: "default"
## DO NOT Work On
- **Legacy Controllers**: Anything in `controllers/actions.summerwind.net/`
- **Cert-Manager**: Not used in new mode
- **Webhooks**: New mode uses listener pod instead
- **RunnerDeployment**: Legacy resource type
- **HorizontalRunnerAutoscaler**: Legacy autoscaling
## Testing Performance Improvements
```bash
# Create many runners to test parallel creation
kubectl -n arc-runners patch ephemeralrunnerset <name> \
--type merge -p '{"spec":{"replicas":100}}'
# Monitor creation time
time kubectl -n arc-runners wait --for=condition=Ready \
ephemeralrunners --all --timeout=600s
# Check controller metrics
kubectl port-forward -n arc-systems service/arc-controller 8080:80
curl http://localhost:8080/metrics | grep ephemeral_runner_creation_duration
```
## Key Metrics to Track
- `ephemeral_runner_creation_duration_seconds` - Time to create each runner
- `ephemeral_runner_set_replicas` - Current vs desired replicas
- `controller_runtime_reconcile_time_seconds` - Reconciliation performance
## Files Referenced
@ENV_SETUP.md - Complete setup guide for new mode
@tasks.md - Performance optimization task plan
@controllers/actions.github.com/ephemeralrunnerset_controller.go
@controllers/actions.github.com/ephemeralrunner_controller.go
@controllers/actions.github.com/autoscalingrunnerset_controller.go

ENV_SETUP.md (new file, 382 lines)
# Local Development Environment Setup - Runner Scale Set Controller
This guide sets up a local development environment for the **NEW** GitHub Actions Runner Scale Set Controller (not the legacy mode).
## Important Notes
- **NO cert-manager required** - The new mode doesn't use webhooks
- **NO legacy controller** - We only work with the new `actions.github.com` API group
- Uses separate Helm charts: `gha-runner-scale-set-controller` and `gha-runner-scale-set`
- GitHub username: `justanotherspy`
- Docker Hub account: `danielschwartzlol`
## Prerequisites
### Required Tools
1. **Docker** - For running containers and Kind cluster
```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker.io
sudo usermod -aG docker $USER
# Log out and back in for group changes to take effect
```
2. **Kind** - Kubernetes in Docker
```bash
# Install Kind
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
```
3. **kubectl** - Kubernetes CLI
```bash
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
```
4. **Helm** - Kubernetes package manager
```bash
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```
5. **Go** - For building the controller (1.21+)
```bash
# Install Go 1.21
wget https://go.dev/dl/go1.21.5.linux-amd64.tar.gz
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.21.5.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
```
### Environment Variables
Add these to your `.bashrc` or `.zshrc`:
```bash
# Docker Hub Configuration
export DOCKER_USER="danielschwartzlol"
export CONTROLLER_IMAGE="${DOCKER_USER}/gha-runner-scale-set-controller"
export RUNNER_IMAGE="ghcr.io/actions/actions-runner" # Official runner image
# GitHub Configuration
export GITHUB_TOKEN="your-github-pat-token-here"
export GITHUB_USERNAME="justanotherspy"
# Or for GitHub App authentication (recommended):
# export APP_ID="your-app-id"
# export INSTALLATION_ID="your-installation-id"
# export PRIVATE_KEY_FILE_PATH="/path/to/private-key.pem"
# Test Repository Configuration
export TEST_REPO="${GITHUB_USERNAME}/test-runner-repo"
export TEST_ORG="" # Optional: Your test organization
# Development Settings
export VERSION="dev"
export CLUSTER_NAME="arc-dev"
```
## Step 1: Build the Controller Image
```bash
# Build the controller image with scale set mode
make docker-build
# Tag it for our use
docker tag ${DOCKER_USER}/actions-runner-controller:${VERSION} \
${CONTROLLER_IMAGE}:${VERSION}
```
## Step 2: Create Kind Cluster
Create a simple Kind cluster (no special config needed for new mode):
```bash
# Create Kind cluster
cat <<EOF | kind create cluster --name ${CLUSTER_NAME} --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
kubeadmConfigPatches:
- |
kind: InitConfiguration
nodeRegistration:
kubeletExtraArgs:
node-labels: "ingress-ready=true"
EOF
# Verify cluster is running
kubectl cluster-info --context kind-${CLUSTER_NAME}
```
## Step 3: Load Controller Image into Kind
```bash
# Load the controller image
kind load docker-image ${CONTROLLER_IMAGE}:${VERSION} --name ${CLUSTER_NAME}
# Verify image is loaded
docker exec -it ${CLUSTER_NAME}-control-plane crictl images | grep ${DOCKER_USER}
```
## Step 4: Create GitHub Authentication Secret
```bash
# Create namespace
kubectl create namespace arc-systems
# For PAT authentication
kubectl create secret generic github-auth \
--namespace=arc-systems \
--from-literal=github_token=${GITHUB_TOKEN}
# For GitHub App authentication (if using an App instead)
# Note: app ID and installation ID are literal values; only the
# private key comes from a file
kubectl create secret generic github-auth \
--namespace=arc-systems \
--from-literal=github_app_id=${APP_ID} \
--from-literal=github_app_installation_id=${INSTALLATION_ID} \
--from-file=github_app_private_key=${PRIVATE_KEY_FILE_PATH}
```
## Step 5: Install Runner Scale Set Controller
### Option A: Using Helm (Recommended)
```bash
# Install the controller
helm install arc-controller \
--namespace arc-systems \
--create-namespace \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
--version 0.12.1 \
--set image.repository=${CONTROLLER_IMAGE} \
--set image.tag=${VERSION} \
--set imagePullPolicy=Never
# Verify controller is running
kubectl -n arc-systems get pods -l app.kubernetes.io/name=gha-runner-scale-set-controller
```
### Option B: Manual Deployment (for development)
```bash
# Run the controller locally (for debugging)
CONTROLLER_MANAGER_POD_NAMESPACE=arc-systems \
CONTROLLER_MANAGER_CONTAINER_IMAGE="${CONTROLLER_IMAGE}:${VERSION}" \
make run-scaleset
```
## Step 6: Deploy Runner Scale Set
Create a runner scale set for your repository:
```bash
# Install runner scale set
helm install arc-runner-set \
--namespace arc-runners \
--create-namespace \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
--version 0.12.1 \
--set githubConfigUrl="https://github.com/${TEST_REPO}" \
--set githubConfigSecret="github-auth" \
--set controllerServiceAccount.namespace="arc-systems" \
--set controllerServiceAccount.name="arc-controller-gha-rs-controller" \
--set minRunners=1 \
--set maxRunners=10 \
--set runnerGroup="default" \
--set runnerScaleSetName="test-scale-set"
# Watch the runner scale set
kubectl -n arc-runners get autoscalingrunnersets -w
kubectl -n arc-runners get ephemeralrunnersets -w
kubectl -n arc-runners get ephemeralrunners -w
```
## Step 7: Verify Installation
```bash
# Check controller logs
kubectl -n arc-systems logs -l app.kubernetes.io/name=gha-runner-scale-set-controller -f
# Check listener logs
kubectl -n arc-systems logs -l app.kubernetes.io/name=arc-runner-set-listener -f
# Check runner pods
kubectl -n arc-runners get pods
# Get runner scale set status
kubectl -n arc-runners get autoscalingrunnersets -o wide
```
## Development Workflow
### Quick Iteration for Controller Changes
```bash
# 1. Make your code changes
# 2. Rebuild controller
VERSION=dev-$(date +%s) make docker-build
docker tag ${DOCKER_USER}/actions-runner-controller:${VERSION} \
${CONTROLLER_IMAGE}:${VERSION}
# 3. Load into Kind
kind load docker-image ${CONTROLLER_IMAGE}:${VERSION} --name ${CLUSTER_NAME}
# 4. Update the deployment
kubectl -n arc-systems set image deployment/arc-controller-gha-rs-controller \
manager=${CONTROLLER_IMAGE}:${VERSION}
# 5. Watch logs
kubectl -n arc-systems logs -l app.kubernetes.io/name=gha-runner-scale-set-controller -f
```
### Testing Parallel Runner Creation
```bash
# Scale up to test parallel creation
kubectl -n arc-runners patch autoscalingrunnerset arc-runner-set-runner-set \
--type merge \
-p '{"spec":{"maxRunners":50}}'
# Trigger scale up by running workflows in your test repo
# Or manually patch the ephemeralrunnerset
kubectl -n arc-runners patch ephemeralrunnerset <name> \
--type merge \
-p '{"spec":{"replicas":50}}'
# Monitor creation time
time kubectl -n arc-runners wait --for=condition=Ready ephemeralrunners --all --timeout=600s
# Check metrics
kubectl -n arc-systems port-forward service/arc-controller-gha-rs-controller 8080:80
curl http://localhost:8080/metrics | grep ephemeral
```
## Debugging
### Enable Verbose Logging
```bash
# Update controller deployment with debug logging
kubectl -n arc-systems edit deployment arc-controller-gha-rs-controller
# Add to container args:
# - "--log-level=debug"
```
### Common Commands
```bash
# Get all resources
kubectl get all -n arc-systems
kubectl get all -n arc-runners
# Describe runner set
kubectl -n arc-runners describe autoscalingrunnerset
# Get events
kubectl -n arc-runners get events --sort-by='.lastTimestamp'
# Port forward for pprof debugging
kubectl -n arc-systems port-forward deployment/arc-controller-gha-rs-controller 6060:6060
go tool pprof http://localhost:6060/debug/pprof/profile
```
## Performance Testing Script
```bash
#!/bin/bash
# perf-test.sh
NAMESPACE="arc-runners"
REPLICAS="${1:-100}"
echo "Testing creation of ${REPLICAS} runners..."
# Record start time
START=$(date +%s)
# Scale up
kubectl -n ${NAMESPACE} patch ephemeralrunnerset $(kubectl -n ${NAMESPACE} get ers -o name | head -1) \
--type merge \
-p "{\"spec\":{\"replicas\":${REPLICAS}}}"
# Wait for all runners
kubectl -n ${NAMESPACE} wait --for=condition=Ready ephemeralrunners --all --timeout=600s
# Record end time
END=$(date +%s)
DURATION=$((END - START))
echo "Created ${REPLICAS} runners in ${DURATION} seconds"
# awk avoids shell integer division rounding sub-second averages to 0
echo "Average time per runner: $(awk "BEGIN {printf \"%.2f\", ${DURATION} / ${REPLICAS}}") seconds"
# Get runner creation events
kubectl -n ${NAMESPACE} get events --field-selector reason=Created | grep EphemeralRunner
```
## Cleanup
```bash
# Delete runner scale set
helm uninstall arc-runner-set -n arc-runners
# Delete controller
helm uninstall arc-controller -n arc-systems
# Delete namespaces
kubectl delete namespace arc-systems arc-runners
# Delete Kind cluster
kind delete cluster --name ${CLUSTER_NAME}
```
## Troubleshooting
### Runner Scale Set Not Creating Runners
```bash
# Check if runner scale set is registered
kubectl -n arc-runners get autoscalingrunnerset -o yaml | grep runnerScaleSetId
# Check GitHub API connectivity
kubectl -n arc-systems exec -it deployment/arc-controller-gha-rs-controller -- \
curl -H "Authorization: token ${GITHUB_TOKEN}" \
https://api.github.com/repos/${TEST_REPO}/actions/runners/registration-token
```
### Runners Not Picking Up Jobs
```bash
# Ensure the workflow targets the scale set by its installation name;
# the new mode matches on runnerScaleSetName, not self-hosted labels
# In workflow file:
# runs-on: test-scale-set
# Check runner registration
kubectl -n arc-runners logs -l app.kubernetes.io/component=runner --tail=100
```
## Key Differences from Legacy Mode
1. **No Cert-Manager**: New mode doesn't use admission webhooks
2. **Different CRDs**: Uses `AutoscalingRunnerSet`, `EphemeralRunnerSet`, `EphemeralRunner`
3. **Separate Helm Charts**: `gha-runner-scale-set-controller` and `gha-runner-scale-set`
4. **Listener Pod**: Runs in controller namespace, handles GitHub webhooks
5. **No Runner Deployment**: Only uses ephemeral runners
## Resources
- [Runner Scale Set Documentation](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller)
- [ARC Helm Charts](https://github.com/actions/actions-runner-controller/tree/master/charts)
- [Kind Documentation](https://kind.sigs.k8s.io/)

tasks.md (new file, 246 lines)
# Runner Scale Set Controller Performance Optimization
## Problem Analysis
Based on analysis of the codebase, the runner scale set controller currently spawns runners **sequentially** in the `EphemeralRunnerSetReconciler.createEphemeralRunners()` method at `/controllers/actions.github.com/ephemeralrunnerset_controller.go:359-386`.
### Current Sequential Implementation Issues:
1. **Linear time complexity O(n)**: Creating n runners takes n sequential API calls
2. **Blocking loop**: Each runner creation blocks until the API call completes
3. **Poor scalability**: Large scale-ups (e.g., 100+ runners) take minutes
4. **Resource underutilization**: Controller pod doesn't leverage available CPU/memory for parallel operations
### Key Bottlenecks Identified:
- **EphemeralRunnerSet Controller** (`ephemeralrunnerset_controller.go:362-383`): Sequential for-loop creating runners one by one
- **API Call Latency**: Each `r.Create(ctx, ephemeralRunner)` call blocks for network roundtrip
- **No batching**: Individual API calls instead of batch operations
- **No concurrency**: Single-threaded execution path
## Proposed Task List for Performance Improvement
### Phase 1: Research & Design (Week 1)
- [ ] **Task 1.1**: Benchmark current performance
- Measure time to create 10, 50, 100, 500 runners
- Profile CPU/memory usage during scale-up
- Document baseline metrics for comparison
- [ ] **Task 1.2**: Research Kubernetes client-go patterns for concurrent resource creation
- Study controller-runtime workqueue patterns
- Investigate rate limiting considerations
- Review best practices for bulk operations
- [ ] **Task 1.3**: Design concurrent runner creation architecture
- Define optimal concurrency level (suggest: configurable, default 10)
- Design error handling and retry strategy
- Plan backward compatibility approach
### Phase 2: Implementation (Week 2-3)
- [ ] **Task 2.1**: Refactor `createEphemeralRunners` for parallel execution
```go
// Suggested approach:
// - Use worker pool pattern with configurable concurrency
// - Implement error aggregation
// - Add progress tracking
```
- [ ] **Task 2.2**: Implement configurable concurrency controls
- Add `--runner-creation-concurrency` flag (default: 10)
- Add `--runner-creation-timeout` flag (default: 30s)
- Environment variable overrides for containerized deployments
- [ ] **Task 2.3**: Add comprehensive error handling
- Implement exponential backoff for failed creations
- Partial success handling (some runners created, some failed)
- Detailed error reporting and metrics
- [ ] **Task 2.4**: Implement progress tracking and observability
- Add prometheus metrics for creation time per runner
- Log progress at intervals (e.g., "Created 50/100 runners")
- Add events to AutoscalingRunnerSet for visibility
### Phase 3: Testing (Week 3-4)
- [ ] **Task 3.1**: Unit tests for concurrent creation
- Test with mock client
- Verify error handling
- Test concurrency limits
- Test partial failures
- [ ] **Task 3.2**: Integration tests
- Test with real Kubernetes API
- Verify resource creation order
- Test rollback on failure
- Test with various concurrency levels
- [ ] **Task 3.3**: Load testing
- Test creating 100+ runners simultaneously
- Monitor API server impact
- Measure improvement vs baseline
- Test with rate limiting
- [ ] **Task 3.4**: Chaos testing
- Test with network failures
- Test with API server throttling
- Test with partial quota exhaustion
- Test controller restart during creation
### Phase 4: Optimization & Tuning (Week 4-5)
- [ ] **Task 4.1**: Implement adaptive concurrency
- Start with low concurrency, increase based on success rate
- Back off on errors or throttling
- Self-tuning based on cluster capacity
- [ ] **Task 4.2**: Add bulk creation API support (if available)
- Research if Actions API supports bulk runner registration
- Implement batch registration if supported
- Fall back to parallel individual creation
- [ ] **Task 4.3**: Optimize resource creation
- Pre-compute runner configurations
- Cache common data (secrets, configs)
- Minimize API calls per runner
### Phase 5: Documentation & Rollout (Week 5-6)
- [ ] **Task 5.1**: Document configuration options
- Update CLAUDE.md with new flags
- Add tuning guide for different cluster sizes
- Document performance improvements
- [ ] **Task 5.2**: Create migration guide
- Document any breaking changes
- Provide upgrade path
- Include rollback procedures
- [ ] **Task 5.3**: Performance report
- Before/after benchmarks
- Scalability analysis
- Recommendations for different use cases
## Implementation Details
### Suggested Code Structure
```go
// ephemeralrunnerset_controller.go
type runnerCreationJob struct {
	runner *v1alpha1.EphemeralRunner
	index  int
	err    error
}

func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel(
	ctx context.Context,
	runnerSet *v1alpha1.EphemeralRunnerSet,
	count int,
	log logr.Logger,
) error {
	concurrency := r.getConfiguredConcurrency() // Default: 10
	jobs := make(chan runnerCreationJob, count)
	results := make(chan runnerCreationJob, count)

	// Start workers
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go r.runnerCreationWorker(ctx, runnerSet, jobs, results, &wg, log)
	}

	// Queue jobs
	for i := 0; i < count; i++ {
		jobs <- runnerCreationJob{
			runner: r.newEphemeralRunner(runnerSet),
			index:  i,
		}
	}
	close(jobs)

	// Close results once every worker has drained the job channel
	go func() {
		wg.Wait()
		close(results)
	}()

	// Collect results and handle errors
	var errs []error
	created := 0
	for result := range results {
		if result.err != nil {
			errs = append(errs, result.err)
		} else {
			created++
			if created%10 == 0 || created == count {
				log.Info("Runner creation progress", "created", created, "total", count)
			}
		}
	}
	return multierr.Combine(errs...)
}
```
## Success Metrics
1. **Performance**:
- Target: Create 100 runners in < 30 seconds (vs current ~5 minutes)
- Reduce time complexity from O(n) to O(n/c) where c = concurrency
2. **Reliability**:
- Handle partial failures gracefully
- No runner leaks on error
- Proper cleanup on controller restart
3. **Observability**:
- Clear progress tracking
- Detailed metrics and logs
- Actionable error messages
4. **Compatibility**:
- Backward compatible by default
- Configurable for different environments
- No breaking changes to CRDs
## Risk Mitigation
1. **API Server Overload**: Implement rate limiting and backoff
2. **Resource Exhaustion**: Add memory/CPU limits and monitoring
3. **Partial Failures**: Implement proper rollback and cleanup
4. **Race Conditions**: Use proper locking and atomic operations
## Testing Requirements
- Unit test coverage > 80%
- Integration tests for all scenarios
- Performance regression tests
- Documentation for all new features
- Backward compatibility tests
## Rollout Plan
1. **Alpha**: Deploy to dev environment with conservative defaults
2. **Beta**: Test with select users, gather feedback
3. **GA**: Full rollout with documentation and migration guide
## Dependencies
- No changes to CRDs required
- Compatible with existing Actions Runner Controller versions
- Requires Go 1.20+ for `errors.Join` support (already satisfied; the repo targets 1.21)
## Timeline Estimate
- Total Duration: 5-6 weeks
- Developer Resources: 1-2 engineers
- Review & Testing: Additional 1 week
## Notes for Implementation
1. Consider using `golang.org/x/sync/errgroup` for cleaner error handling
2. Leverage existing `multierr` package for error aggregation
3. Use context cancellation for proper cleanup
4. Consider implementing circuit breaker pattern for API failures
5. Add feature flag to enable/disable parallel creation