Add development setup for runner scale set controller optimization

- Add CLAUDE.md with project focus on new mode only (actions.github.com API)
- Add ENV_SETUP.md for local development with Kind cluster setup
- Add tasks.md with comprehensive performance optimization plan
- Configure for justanotherspy GitHub username and danielschwartzlol Docker Hub
- Use Helm charts version 0.12.1 for runner scale set controller
- Focus exclusively on optimizing EphemeralRunnerSetReconciler parallel creation
- No cert-manager required for new mode setup

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Daniel Schwartz, 2025-08-19 15:03:15 +02:00
commit c73b8a2b92, parent ddc2918a48
3 changed files with 862 additions and 0 deletions

CLAUDE.md (new file, 234 lines)
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Repository Information
**THIS IS A FORK**: This repository is a fork of the upstream `actions/actions-runner-controller` repository.
- **Fork Owner**: `justanotherspy`
- **Upstream**: `actions/actions-runner-controller`
- **IMPORTANT**: Always push changes to the fork (`justanotherspy/actions-runner-controller`), NEVER to upstream
- **Default Branch**: Work on feature branches, not directly on master
## Project Focus
**IMPORTANT**: We work EXCLUSIVELY on the NEW Runner Scale Set Controller mode, NOT the legacy mode.
- **NEW Mode ONLY**: Autoscaling Runner Sets using `actions.github.com` API group
- **NO Legacy Development**: Do not work on `actions.summerwind.net` resources
- **NO Cert-Manager**: The new mode doesn't use webhooks or cert-manager
- **GitHub Username**: `justanotherspy` (for test repositories)
- **Docker Hub Account**: `danielschwartzlol`
## Development Configuration
- **Controller Image**: `danielschwartzlol/gha-runner-scale-set-controller`
- **Runner Image**: Use official `ghcr.io/actions/actions-runner`
- **Helm Charts** (Version 0.12.1):
- Controller: `gha-runner-scale-set-controller`
- Runner Set: `gha-runner-scale-set`
- **Helm Chart Version**: Always use `0.12.1` (latest as of this setup)
- **Local Development**: Use Kind cluster without cert-manager (see ENV_SETUP.md)
- **Test Repository**: `justanotherspy/test-runner-repo`
## Key Components (New Mode Only)
### Controllers to Focus On
**AutoscalingRunnerSetReconciler** (`controllers/actions.github.com/autoscalingrunnerset_controller.go`)
- Manages runner scale set lifecycle
- Creates EphemeralRunnerSets based on demand
- Handles runner group configuration
**EphemeralRunnerSetReconciler** (`controllers/actions.github.com/ephemeralrunnerset_controller.go`)
- **CRITICAL FOR OPTIMIZATION**: Contains sequential runner creation loop
- `createEphemeralRunners()` method at lines 359-386 needs parallelization
- Manages replicas of EphemeralRunners
**EphemeralRunnerReconciler** (`controllers/actions.github.com/ephemeralrunner_controller.go`)
- Manages individual runner pods
- Handles runner registration with GitHub
**AutoscalingListenerReconciler** (`controllers/actions.github.com/autoscalinglistener_controller.go`)
- Manages the listener pod that receives GitHub webhooks
- Triggers scaling events
### Resource Hierarchy (New Mode)
```text
AutoscalingRunnerSet
├── AutoscalingListener (webhook receiver pod)
└── EphemeralRunnerSet
└── EphemeralRunner (Pod)
```
## Performance Optimization Focus
### Current Problem
- `EphemeralRunnerSetReconciler.createEphemeralRunners()` creates runners sequentially
- Time complexity: O(n) where n = number of runners
- Bottleneck location: `controllers/actions.github.com/ephemeralrunnerset_controller.go:362-383`
### Optimization Goal
- Implement parallel runner creation with worker pool pattern
- Target: 10x improvement (create 100 runners in < 30 seconds)
- Configurable concurrency (default: 10 parallel creations)
## Build Commands
```bash
# Build controller for runner scale set mode
make docker-build
docker tag danielschwartzlol/actions-runner-controller:dev \
danielschwartzlol/gha-runner-scale-set-controller:dev
# Run controller locally in scale set mode
make run-scaleset
# Generate CRDs (only actions.github.com ones matter)
make manifests
# Run tests for new mode controllers
go test -v ./controllers/actions.github.com/...
```
## Testing Commands
```bash
# Unit tests for runner scale set controllers
go test -v ./controllers/actions.github.com/... -run TestEphemeralRunnerSet
# Integration tests for new mode
KUBEBUILDER_ASSETS="$(setup-envtest use 1.28 -p path)" \
go test -v ./controllers/actions.github.com/...
# Benchmark runner creation
go test -bench=BenchmarkCreateEphemeralRunners ./controllers/actions.github.com/...
```
## Local Development Workflow
```bash
# 1. Create Kind cluster (no cert-manager needed)
kind create cluster --name arc-dev
# 2. Build and load controller
VERSION=dev make docker-build
docker tag danielschwartzlol/actions-runner-controller:dev \
danielschwartzlol/gha-runner-scale-set-controller:dev
kind load docker-image danielschwartzlol/gha-runner-scale-set-controller:dev --name arc-dev
# 3. Install controller with Helm (v0.12.1)
helm install arc-controller \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
--version 0.12.1 \
--set image.repository=danielschwartzlol/gha-runner-scale-set-controller \
--set image.tag=dev \
--set imagePullPolicy=Never
# 4. Deploy runner scale set (v0.12.1)
helm install arc-runner-set \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
--version 0.12.1 \
--set githubConfigUrl="https://github.com/justanotherspy/test-runner-repo" \
--set githubConfigSecret="github-auth"
```
## Important Files for Optimization
### Primary Focus
- `controllers/actions.github.com/ephemeralrunnerset_controller.go` - Contains sequential creation logic
- `controllers/actions.github.com/ephemeralrunner_controller.go` - Individual runner management
- `controllers/actions.github.com/autoscalingrunnerset_controller.go` - Scale set orchestration
### Configuration
- `charts/gha-runner-scale-set-controller/` - Controller Helm chart
- `charts/gha-runner-scale-set/` - Runner set Helm chart
- `cmd/ghalistener/` - Listener pod that receives GitHub webhooks
### Tests
- `controllers/actions.github.com/ephemeralrunnerset_controller_test.go`
- `controllers/actions.github.com/ephemeralrunner_controller_test.go`
## Code Patterns for New Mode
### Creating Resources in Parallel
```go
// Example pattern for parallel creation
func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel(
	ctx context.Context,
	runnerSet *v1alpha1.EphemeralRunnerSet,
	count int,
	log logr.Logger,
) error {
	workers := 10 // Configurable
	jobs := make(chan int, count)
	results := make(chan error, count)

	// Start workers
	for w := 0; w < workers; w++ {
		go r.createRunnerWorker(ctx, runnerSet, jobs, results, log)
	}

	// Queue jobs
	for i := 0; i < count; i++ {
		jobs <- i
	}
	close(jobs)

	// Collect exactly one result per job; workers must send nil on
	// success so this loop terminates.
	var errs []error
	for i := 0; i < count; i++ {
		if err := <-results; err != nil {
			errs = append(errs, err)
		}
	}
	return multierr.Combine(errs...)
}
```
## GitHub API Integration
- Use `github.Client` interface for testability
- Implement exponential backoff for rate limiting
- Runner scale sets register with GitHub using JIT configuration
- Default runner group: "default"
## DO NOT Work On
- **Legacy Controllers**: Anything in `controllers/actions.summerwind.net/`
- **Cert-Manager**: Not used in new mode
- **Webhooks**: New mode uses listener pod instead
- **RunnerDeployment**: Legacy resource type
- **HorizontalRunnerAutoscaler**: Legacy autoscaling
## Testing Performance Improvements
```bash
# Create many runners to test parallel creation
kubectl -n arc-runners patch ephemeralrunnerset <name> \
--type merge -p '{"spec":{"replicas":100}}'
# Monitor creation time
time kubectl -n arc-runners wait --for=condition=Ready \
ephemeralrunners --all --timeout=600s
# Check controller metrics
kubectl port-forward -n arc-systems service/arc-controller 8080:80
curl http://localhost:8080/metrics | grep ephemeral_runner_creation_duration
```
## Key Metrics to Track
- `ephemeral_runner_creation_duration_seconds` - Time to create each runner
- `ephemeral_runner_set_replicas` - Current vs desired replicas
- `controller_runtime_reconcile_time_seconds` - Reconciliation performance
## Files Referenced
@ENV_SETUP.md - Complete setup guide for new mode
@tasks.md - Performance optimization task plan
@controllers/actions.github.com/ephemeralrunnerset_controller.go
@controllers/actions.github.com/ephemeralrunner_controller.go
@controllers/actions.github.com/autoscalingrunnerset_controller.go

ENV_SETUP.md (new file, 382 lines)
# Local Development Environment Setup - Runner Scale Set Controller
This guide sets up a local development environment for the **NEW** GitHub Actions Runner Scale Set Controller (not the legacy mode).
## Important Notes
- **NO cert-manager required** - The new mode doesn't use webhooks
- **NO legacy controller** - We only work with the new `actions.github.com` API group
- Uses separate Helm charts: `gha-runner-scale-set-controller` and `gha-runner-scale-set`
- GitHub username: `justanotherspy`
- Docker Hub account: `danielschwartzlol`
## Prerequisites
### Required Tools
1. **Docker** - For running containers and Kind cluster
```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker.io
sudo usermod -aG docker $USER
# Log out and back in for group changes to take effect
```
2. **Kind** - Kubernetes in Docker
```bash
# Install Kind
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
```
3. **kubectl** - Kubernetes CLI
```bash
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
```
4. **Helm** - Kubernetes package manager
```bash
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```
5. **Go** - For building the controller (1.21+)
```bash
# Install Go 1.21
wget https://go.dev/dl/go1.21.5.linux-amd64.tar.gz
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.21.5.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
```
### Environment Variables
Add these to your `.bashrc` or `.zshrc`:
```bash
# Docker Hub Configuration
export DOCKER_USER="danielschwartzlol"
export CONTROLLER_IMAGE="${DOCKER_USER}/gha-runner-scale-set-controller"
export RUNNER_IMAGE="ghcr.io/actions/actions-runner" # Official runner image
# GitHub Configuration
export GITHUB_TOKEN="your-github-pat-token-here"
export GITHUB_USERNAME="justanotherspy"
# Or for GitHub App authentication (recommended):
# export APP_ID="your-app-id"
# export INSTALLATION_ID="your-installation-id"
# export PRIVATE_KEY_FILE_PATH="/path/to/private-key.pem"
# Test Repository Configuration
export TEST_REPO="${GITHUB_USERNAME}/test-runner-repo"
export TEST_ORG="" # Optional: Your test organization
# Development Settings
export VERSION="dev"
export CLUSTER_NAME="arc-dev"
```
## Step 1: Build the Controller Image
```bash
# Build the controller image with scale set mode
make docker-build
# Tag it for our use
docker tag ${DOCKER_USER}/actions-runner-controller:${VERSION} \
${CONTROLLER_IMAGE}:${VERSION}
```
## Step 2: Create Kind Cluster
Create a simple Kind cluster (no special config needed for new mode):
```bash
# Create Kind cluster
cat <<EOF | kind create cluster --name ${CLUSTER_NAME} --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
kubeadmConfigPatches:
- |
kind: InitConfiguration
nodeRegistration:
kubeletExtraArgs:
node-labels: "ingress-ready=true"
EOF
# Verify cluster is running
kubectl cluster-info --context kind-${CLUSTER_NAME}
```
## Step 3: Load Controller Image into Kind
```bash
# Load the controller image
kind load docker-image ${CONTROLLER_IMAGE}:${VERSION} --name ${CLUSTER_NAME}
# Verify image is loaded
docker exec -it ${CLUSTER_NAME}-control-plane crictl images | grep ${DOCKER_USER}
```
## Step 4: Create GitHub Authentication Secret
```bash
# Create namespace
kubectl create namespace arc-systems
# For PAT authentication
kubectl create secret generic github-auth \
--namespace=arc-systems \
--from-literal=github_token=${GITHUB_TOKEN}
# For GitHub App authentication (if using an App instead)
# Note: app ID and installation ID are literal values; only the
# private key comes from a file
kubectl create secret generic github-auth \
--namespace=arc-systems \
--from-literal=github_app_id=${APP_ID} \
--from-literal=github_app_installation_id=${INSTALLATION_ID} \
--from-file=github_app_private_key=${PRIVATE_KEY_FILE_PATH}
```
## Step 5: Install Runner Scale Set Controller
### Option A: Using Helm (Recommended)
```bash
# Install the controller
helm install arc-controller \
--namespace arc-systems \
--create-namespace \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
--version 0.12.1 \
--set image.repository=${CONTROLLER_IMAGE} \
--set image.tag=${VERSION} \
--set imagePullPolicy=Never
# Verify controller is running
kubectl -n arc-systems get pods -l app.kubernetes.io/name=gha-runner-scale-set-controller
```
### Option B: Manual Deployment (for development)
```bash
# Run the controller locally (for debugging)
CONTROLLER_MANAGER_POD_NAMESPACE=arc-systems \
CONTROLLER_MANAGER_CONTAINER_IMAGE="${CONTROLLER_IMAGE}:${VERSION}" \
make run-scaleset
```
## Step 6: Deploy Runner Scale Set
Create a runner scale set for your repository:
```bash
# Install runner scale set
helm install arc-runner-set \
--namespace arc-runners \
--create-namespace \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
--version 0.12.1 \
--set githubConfigUrl="https://github.com/${TEST_REPO}" \
--set githubConfigSecret="github-auth" \
--set controllerServiceAccount.namespace="arc-systems" \
--set controllerServiceAccount.name="arc-controller-gha-rs-controller" \
--set minRunners=1 \
--set maxRunners=10 \
--set runnerGroup="default" \
--set runnerScaleSetName="test-scale-set"
# Watch the runner scale set
kubectl -n arc-runners get autoscalingrunnersets -w
kubectl -n arc-runners get ephemeralrunnersets -w
kubectl -n arc-runners get ephemeralrunners -w
```
## Step 7: Verify Installation
```bash
# Check controller logs
kubectl -n arc-systems logs -l app.kubernetes.io/name=gha-runner-scale-set-controller -f
# Check listener logs
kubectl -n arc-systems logs -l app.kubernetes.io/name=arc-runner-set-listener -f
# Check runner pods
kubectl -n arc-runners get pods
# Get runner scale set status
kubectl -n arc-runners get autoscalingrunnersets -o wide
```
## Development Workflow
### Quick Iteration for Controller Changes
```bash
# 1. Make your code changes
# 2. Rebuild controller
VERSION=dev-$(date +%s) make docker-build
docker tag ${DOCKER_USER}/actions-runner-controller:${VERSION} \
${CONTROLLER_IMAGE}:${VERSION}
# 3. Load into Kind
kind load docker-image ${CONTROLLER_IMAGE}:${VERSION} --name ${CLUSTER_NAME}
# 4. Update the deployment
kubectl -n arc-systems set image deployment/arc-controller-gha-rs-controller \
manager=${CONTROLLER_IMAGE}:${VERSION}
# 5. Watch logs
kubectl -n arc-systems logs -l app.kubernetes.io/name=gha-runner-scale-set-controller -f
```
### Testing Parallel Runner Creation
```bash
# Scale up to test parallel creation
kubectl -n arc-runners patch autoscalingrunnerset arc-runner-set-runner-set \
--type merge \
-p '{"spec":{"maxRunners":50}}'
# Trigger scale up by running workflows in your test repo
# Or manually patch the ephemeralrunnerset
kubectl -n arc-runners patch ephemeralrunnerset <name> \
--type merge \
-p '{"spec":{"replicas":50}}'
# Monitor creation time
time kubectl -n arc-runners wait --for=condition=Ready ephemeralrunners --all --timeout=600s
# Check metrics
kubectl -n arc-systems port-forward service/arc-controller-gha-rs-controller 8080:80
curl http://localhost:8080/metrics | grep ephemeral
```
## Debugging
### Enable Verbose Logging
```bash
# Update controller deployment with debug logging
kubectl -n arc-systems edit deployment arc-controller-gha-rs-controller
# Add to container args:
# - "--log-level=debug"
```
### Common Commands
```bash
# Get all resources
kubectl get all -n arc-systems
kubectl get all -n arc-runners
# Describe runner set
kubectl -n arc-runners describe autoscalingrunnerset
# Get events
kubectl -n arc-runners get events --sort-by='.lastTimestamp'
# Port forward for pprof debugging
kubectl -n arc-systems port-forward deployment/arc-controller-gha-rs-controller 6060:6060
go tool pprof http://localhost:6060/debug/pprof/profile
```
## Performance Testing Script
```bash
#!/bin/bash
# perf-test.sh
NAMESPACE="arc-runners"
REPLICAS="${1:-100}"
echo "Testing creation of ${REPLICAS} runners..."
# Record start time
START=$(date +%s)
# Scale up
kubectl -n ${NAMESPACE} patch ephemeralrunnerset $(kubectl -n ${NAMESPACE} get ers -o name | head -1) \
--type merge \
-p "{\"spec\":{\"replicas\":${REPLICAS}}}"
# Wait for all runners
kubectl -n ${NAMESPACE} wait --for=condition=Ready ephemeralrunners --all --timeout=600s
# Record end time
END=$(date +%s)
DURATION=$((END - START))
echo "Created ${REPLICAS} runners in ${DURATION} seconds"
# awk avoids shell integer division rounding sub-second averages to 0
echo "Average time per runner: $(awk "BEGIN {printf \"%.2f\", ${DURATION} / ${REPLICAS}}") seconds"
# Get runner creation events
kubectl -n ${NAMESPACE} get events --field-selector reason=Created | grep EphemeralRunner
```
## Cleanup
```bash
# Delete runner scale set
helm uninstall arc-runner-set -n arc-runners
# Delete controller
helm uninstall arc-controller -n arc-systems
# Delete namespaces
kubectl delete namespace arc-systems arc-runners
# Delete Kind cluster
kind delete cluster --name ${CLUSTER_NAME}
```
## Troubleshooting
### Runner Scale Set Not Creating Runners
```bash
# Check if runner scale set is registered
kubectl -n arc-runners get autoscalingrunnerset -o yaml | grep runnerScaleSetId
# Check GitHub API connectivity
kubectl -n arc-systems exec -it deployment/arc-controller-gha-rs-controller -- \
curl -H "Authorization: token ${GITHUB_TOKEN}" \
https://api.github.com/repos/${TEST_REPO}/actions/runners/registration-token
```
### Runners Not Picking Up Jobs
```bash
# Ensure the workflow targets the scale set by its installation name;
# the new mode matches on runnerScaleSetName, not self-hosted labels
# In workflow file:
# runs-on: test-scale-set
# Check runner registration
kubectl -n arc-runners logs -l app.kubernetes.io/component=runner --tail=100
```
## Key Differences from Legacy Mode
1. **No Cert-Manager**: New mode doesn't use admission webhooks
2. **Different CRDs**: Uses `AutoscalingRunnerSet`, `EphemeralRunnerSet`, `EphemeralRunner`
3. **Separate Helm Charts**: `gha-runner-scale-set-controller` and `gha-runner-scale-set`
4. **Listener Pod**: Runs in controller namespace, handles GitHub webhooks
5. **No Runner Deployment**: Only uses ephemeral runners
## Resources
- [Runner Scale Set Documentation](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller)
- [ARC Helm Charts](https://github.com/actions/actions-runner-controller/tree/master/charts)
- [Kind Documentation](https://kind.sigs.k8s.io/)

tasks.md (new file, 246 lines)
# Runner Scale Set Controller Performance Optimization
## Problem Analysis
Based on analysis of the codebase, the runner scale set controller currently spawns runners **sequentially** in the `EphemeralRunnerSetReconciler.createEphemeralRunners()` method at `/controllers/actions.github.com/ephemeralrunnerset_controller.go:359-386`.
### Current Sequential Implementation Issues:
1. **Linear time complexity O(n)**: Creating n runners takes n sequential API calls
2. **Blocking loop**: Each runner creation blocks until the API call completes
3. **Poor scalability**: Large scale-ups (e.g., 100+ runners) take minutes
4. **Resource underutilization**: Controller pod doesn't leverage available CPU/memory for parallel operations
### Key Bottlenecks Identified:
- **EphemeralRunnerSet Controller** (`ephemeralrunnerset_controller.go:362-383`): Sequential for-loop creating runners one by one
- **API Call Latency**: Each `r.Create(ctx, ephemeralRunner)` call blocks for network roundtrip
- **No batching**: Individual API calls instead of batch operations
- **No concurrency**: Single-threaded execution path
## Proposed Task List for Performance Improvement
### Phase 1: Research & Design (Week 1)
- [ ] **Task 1.1**: Benchmark current performance
- Measure time to create 10, 50, 100, 500 runners
- Profile CPU/memory usage during scale-up
- Document baseline metrics for comparison
- [ ] **Task 1.2**: Research Kubernetes client-go patterns for concurrent resource creation
- Study controller-runtime workqueue patterns
- Investigate rate limiting considerations
- Review best practices for bulk operations
- [ ] **Task 1.3**: Design concurrent runner creation architecture
- Define optimal concurrency level (suggest: configurable, default 10)
- Design error handling and retry strategy
- Plan backward compatibility approach
### Phase 2: Implementation (Week 2-3)
- [ ] **Task 2.1**: Refactor `createEphemeralRunners` for parallel execution
```go
// Suggested approach:
// - Use worker pool pattern with configurable concurrency
// - Implement error aggregation
// - Add progress tracking
```
- [ ] **Task 2.2**: Implement configurable concurrency controls
- Add `--runner-creation-concurrency` flag (default: 10)
- Add `--runner-creation-timeout` flag (default: 30s)
- Environment variable overrides for containerized deployments
- [ ] **Task 2.3**: Add comprehensive error handling
- Implement exponential backoff for failed creations
- Partial success handling (some runners created, some failed)
- Detailed error reporting and metrics
- [ ] **Task 2.4**: Implement progress tracking and observability
- Add prometheus metrics for creation time per runner
- Log progress at intervals (e.g., "Created 50/100 runners")
- Add events to AutoscalingRunnerSet for visibility
### Phase 3: Testing (Week 3-4)
- [ ] **Task 3.1**: Unit tests for concurrent creation
- Test with mock client
- Verify error handling
- Test concurrency limits
- Test partial failures
- [ ] **Task 3.2**: Integration tests
- Test with real Kubernetes API
- Verify resource creation order
- Test rollback on failure
- Test with various concurrency levels
- [ ] **Task 3.3**: Load testing
- Test creating 100+ runners simultaneously
- Monitor API server impact
- Measure improvement vs baseline
- Test with rate limiting
- [ ] **Task 3.4**: Chaos testing
- Test with network failures
- Test with API server throttling
- Test with partial quota exhaustion
- Test controller restart during creation
### Phase 4: Optimization & Tuning (Week 4-5)
- [ ] **Task 4.1**: Implement adaptive concurrency
- Start with low concurrency, increase based on success rate
- Back off on errors or throttling
- Self-tuning based on cluster capacity
- [ ] **Task 4.2**: Add bulk creation API support (if available)
- Research if Actions API supports bulk runner registration
- Implement batch registration if supported
- Fall back to parallel individual creation
- [ ] **Task 4.3**: Optimize resource creation
- Pre-compute runner configurations
- Cache common data (secrets, configs)
- Minimize API calls per runner
### Phase 5: Documentation & Rollout (Week 5-6)
- [ ] **Task 5.1**: Document configuration options
- Update CLAUDE.md with new flags
- Add tuning guide for different cluster sizes
- Document performance improvements
- [ ] **Task 5.2**: Create migration guide
- Document any breaking changes
- Provide upgrade path
- Include rollback procedures
- [ ] **Task 5.3**: Performance report
- Before/after benchmarks
- Scalability analysis
- Recommendations for different use cases
## Implementation Details
### Suggested Code Structure
```go
// ephemeralrunnerset_controller.go
type runnerCreationJob struct {
	runner *v1alpha1.EphemeralRunner
	index  int
	err    error
}

func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel(
	ctx context.Context,
	runnerSet *v1alpha1.EphemeralRunnerSet,
	count int,
	log logr.Logger,
) error {
	concurrency := r.getConfiguredConcurrency() // Default: 10
	jobs := make(chan runnerCreationJob, count)
	results := make(chan runnerCreationJob, count)

	// Start workers
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go r.runnerCreationWorker(ctx, runnerSet, jobs, results, &wg, log)
	}

	// Queue jobs
	for i := 0; i < count; i++ {
		jobs <- runnerCreationJob{
			runner: r.newEphemeralRunner(runnerSet),
			index:  i,
		}
	}
	close(jobs)

	// Close results once every worker has drained the job channel
	go func() {
		wg.Wait()
		close(results)
	}()

	// Collect results and handle errors
	var errs []error
	created := 0
	for result := range results {
		if result.err != nil {
			errs = append(errs, result.err)
		} else {
			created++
			if created%10 == 0 || created == count {
				log.Info("Runner creation progress", "created", created, "total", count)
			}
		}
	}
	return multierr.Combine(errs...)
}
```
## Success Metrics
1. **Performance**:
- Target: Create 100 runners in < 30 seconds (vs current ~5 minutes)
- Reduce time complexity from O(n) to O(n/c) where c = concurrency
2. **Reliability**:
- Handle partial failures gracefully
- No runner leaks on error
- Proper cleanup on controller restart
3. **Observability**:
- Clear progress tracking
- Detailed metrics and logs
- Actionable error messages
4. **Compatibility**:
- Backward compatible by default
- Configurable for different environments
- No breaking changes to CRDs
## Risk Mitigation
1. **API Server Overload**: Implement rate limiting and backoff
2. **Resource Exhaustion**: Add memory/CPU limits and monitoring
3. **Partial Failures**: Implement proper rollback and cleanup
4. **Race Conditions**: Use proper locking and atomic operations
## Testing Requirements
- Unit test coverage > 80%
- Integration tests for all scenarios
- Performance regression tests
- Documentation for all new features
- Backward compatibility tests
## Rollout Plan
1. **Alpha**: Deploy to dev environment with conservative defaults
2. **Beta**: Test with select users, gather feedback
3. **GA**: Full rollout with documentation and migration guide
## Dependencies
- No changes to CRDs required
- Compatible with existing Actions Runner Controller versions
- Requires Go 1.20+ for `errors.Join` support (already satisfied; the repo targets 1.21)
## Timeline Estimate
- Total Duration: 5-6 weeks
- Developer Resources: 1-2 engineers
- Review & Testing: Additional 1 week
## Notes for Implementation
1. Consider using `golang.org/x/sync/errgroup` for cleaner error handling
2. Leverage existing `multierr` package for error aggregation
3. Use context cancellation for proper cleanup
4. Consider implementing circuit breaker pattern for API failures
5. Add feature flag to enable/disable parallel creation