Add development setup for runner scale set controller optimization

- Add CLAUDE.md with project focus on new mode only (actions.github.com API)
- Add ENV_SETUP.md for local development with Kind cluster setup
- Add tasks.md with comprehensive performance optimization plan
- Configure for justanotherspy GitHub username and danielschwartzlol Docker Hub
- Use Helm charts version 0.12.1 for runner scale set controller
- Focus exclusively on optimizing EphemeralRunnerSetReconciler parallel creation
- No cert-manager required for new mode setup

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-19 15:03:15 +02:00 · 2025-08-19 15:03:15 +02:00 · c73b8a2b92
parent ddc2918a48
commit c73b8a2b92
3 changed files with 862 additions and 0 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,234 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Repository Information
+
+**THIS IS A FORK**: This repository is a fork of the upstream `actions/actions-runner-controller` repository.
+- **Fork Owner**: `justanotherspy`
+- **Upstream**: `actions/actions-runner-controller`
+- **IMPORTANT**: Always push changes to the fork (`justanotherspy/actions-runner-controller`), NEVER to upstream
+- **Default Branch**: Work on feature branches, not directly on master
+
+## Project Focus
+
+**IMPORTANT**: We work EXCLUSIVELY on the NEW Runner Scale Set Controller mode, NOT the legacy mode.
+
+- **NEW Mode ONLY**: Autoscaling Runner Sets using `actions.github.com` API group
+- **NO Legacy Development**: Do not work on `actions.summerwind.net` resources
+- **NO Cert-Manager**: The new mode doesn't use admission webhooks or cert-manager
+- **GitHub Username**: `justanotherspy` (for test repositories)
+- **Docker Hub Account**: `danielschwartzlol`
+
+## Development Configuration
+
+- **Controller Image**: `danielschwartzlol/gha-runner-scale-set-controller`
+- **Runner Image**: Use official `ghcr.io/actions/actions-runner`
+- **Helm Charts** (Version 0.12.1):
+  - Controller: `gha-runner-scale-set-controller`
+  - Runner Set: `gha-runner-scale-set`
+- **Helm Chart Version**: Always use `0.12.1` (latest as of this setup)
+- **Local Development**: Use Kind cluster without cert-manager (see ENV_SETUP.md)
+- **Test Repository**: `justanotherspy/test-runner-repo`
+
+## Key Components (New Mode Only)
+
+### Controllers to Focus On
+
+**AutoscalingRunnerSetReconciler** (`controllers/actions.github.com/autoscalingrunnerset_controller.go`)
+- Manages runner scale set lifecycle
+- Creates EphemeralRunnerSets based on demand
+- Handles runner group configuration
+
+**EphemeralRunnerSetReconciler** (`controllers/actions.github.com/ephemeralrunnerset_controller.go`)
+- **CRITICAL FOR OPTIMIZATION**: Contains sequential runner creation loop
+- `createEphemeralRunners()` method at lines 359-386 needs parallelization
+- Manages replicas of EphemeralRunners
+
+**EphemeralRunnerReconciler** (`controllers/actions.github.com/ephemeralrunner_controller.go`)
+- Manages individual runner pods
+- Handles runner registration with GitHub
+
+**AutoscalingListenerReconciler** (`controllers/actions.github.com/autoscalinglistener_controller.go`)
+- Manages the listener pod, which long-polls the GitHub Actions service for job messages (it does not receive inbound webhooks)
+- Triggers scaling events
+
+### Resource Hierarchy (New Mode)
+
+```text
+AutoscalingRunnerSet
+  ├── AutoscalingListener (long-polling listener pod)
+  └── EphemeralRunnerSet
+      └── EphemeralRunner (Pod)
+```
+
+## Performance Optimization Focus
+
+### Current Problem
+- `EphemeralRunnerSetReconciler.createEphemeralRunners()` creates runners sequentially
+- Time complexity: O(n) where n = number of runners
+- Bottleneck location: `controllers/actions.github.com/ephemeralrunnerset_controller.go:362-383`
+
+### Optimization Goal
+- Implement parallel runner creation with worker pool pattern
+- Target: 10x improvement (create 100 runners in < 30 seconds)
+- Configurable concurrency (default: 10 parallel creations)
+
+## Build Commands
+
+```bash
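+# Build controller for runner scale set mode (DOCKER_USER and VERSION set explicitly so the tag below matches the built image)
+DOCKER_USER=danielschwartzlol VERSION=dev make docker-build
+docker tag danielschwartzlol/actions-runner-controller:dev \
+           danielschwartzlol/gha-runner-scale-set-controller:dev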
+
+# Run controller locally in scale set mode
+make run-scaleset
+
+# Generate CRDs (only actions.github.com ones matter)
+make manifests
+
+# Run tests for new mode controllers
+go test -v ./controllers/actions.github.com/...
+```
+
+## Testing Commands
+
+```bash
+# Unit tests for runner scale set controllers
+go test -v ./controllers/actions.github.com/... -run TestEphemeralRunnerSet
+
+# Integration tests for new mode
+KUBEBUILDER_ASSETS="$(setup-envtest use 1.28 -p path)" \
+  go test -v ./controllers/actions.github.com/...
+
+# Benchmark runner creation (-run '^$' skips unit tests so only the benchmark runs)
+go test -run '^$' -bench=BenchmarkCreateEphemeralRunners ./controllers/actions.github.com/...
+```
+
+## Local Development Workflow
+
+```bash
+# 1. Create Kind cluster (no cert-manager needed)
+kind create cluster --name arc-dev
+
+# 2. Build and load controller
+DOCKER_USER=danielschwartzlol VERSION=dev make docker-build
+docker tag danielschwartzlol/actions-runner-controller:dev \
+           danielschwartzlol/gha-runner-scale-set-controller:dev
+kind load docker-image danielschwartzlol/gha-runner-scale-set-controller:dev --name arc-dev
+
+# 3. Install controller with Helm (v0.12.1)
+helm install arc-controller --namespace arc-systems --create-namespace \
+  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
+  --version 0.12.1 \
+  --set image.repository=danielschwartzlol/gha-runner-scale-set-controller \
+  --set image.tag=dev \
+  --set image.pullPolicy=Never
+
+# 4. Deploy runner scale set (v0.12.1); the github-auth secret must exist in the same namespace
+helm install arc-runner-set --namespace arc-runners --create-namespace \
+  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
+  --version 0.12.1 \
+  --set githubConfigUrl="https://github.com/justanotherspy/test-runner-repo" \
+  --set githubConfigSecret="github-auth"
+```
+
+## Important Files for Optimization
+
+### Primary Focus
+- `controllers/actions.github.com/ephemeralrunnerset_controller.go` - Contains sequential creation logic
+- `controllers/actions.github.com/ephemeralrunner_controller.go` - Individual runner management
+- `controllers/actions.github.com/autoscalingrunnerset_controller.go` - Scale set orchestration
+
+### Configuration
+- `charts/gha-runner-scale-set-controller/` - Controller Helm chart
+- `charts/gha-runner-scale-set/` - Runner set Helm chart
+- `cmd/ghalistener/` - Listener pod that long-polls the GitHub Actions service for job messages
+
+### Tests
+- `controllers/actions.github.com/ephemeralrunnerset_controller_test.go`
+- `controllers/actions.github.com/ephemeralrunner_controller_test.go`
+
+## Code Patterns for New Mode
+
+### Creating Resources in Parallel
+```go
+// Example pattern for parallel creation
+func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel(
+    ctx context.Context,
+    runnerSet *v1alpha1.EphemeralRunnerSet,
+    count int,
+    log logr.Logger,
+) error {
+    workers := 10 // Configurable
+    jobs := make(chan int, count)
+    results := make(chan error, count)
+    
+    // Start workers
+    for w := 0; w < workers; w++ {
+        go r.createRunnerWorker(ctx, runnerSet, jobs, results, log)
+    }
+    
+    // Queue jobs
+    for i := 0; i < count; i++ {
+        jobs <- i
+    }
+    close(jobs)
+    
+    // Collect results
+    var errs []error
+    for i := 0; i < count; i++ {
+        if err := <-results; err != nil {
+            errs = append(errs, err)
+        }
+    }
+    
+    return multierr.Combine(errs...)
+}
+```
+
+## GitHub API Integration
+
+- Use `github.Client` interface for testability
+- Implement exponential backoff for rate limiting
+- Runner scale sets register with GitHub using JIT configuration
+- Default runner group: "default"
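The exponential backoff mentioned in the list above can be sketched as a pure function. This is an illustration only: `backoffDelay` and its parameters are hypothetical names, not part of the controller or of any GitHub client library.

```go
package main

import "time"

// backoffDelay returns the wait before retry number `attempt` (0-based):
// base * 2^attempt, capped at max. Name and parameters are illustrative.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	d := base
	for i := 0; i < attempt; i++ {
		// Stop doubling once the next step would exceed the cap;
		// this also avoids overflow for large attempt counts.
		if d >= max/2 {
			return max
		}
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}
```

With a base of 1 second and a cap of 30 seconds, retries wait 1 s, 2 s, 4 s, 8 s, 16 s and then 30 s from there on. Real retry loops usually add random jitter on top so that parallel workers do not retry in lockstep.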
+
+## DO NOT Work On
+
+- **Legacy Controllers**: Anything in `controllers/actions.summerwind.net/`
+- **Cert-Manager**: Not used in new mode
+- **Webhooks**: New mode uses listener pod instead
+- **RunnerDeployment**: Legacy resource type
+- **HorizontalRunnerAutoscaler**: Legacy autoscaling
+
+## Testing Performance Improvements
+
+```bash
+# Create many runners to test parallel creation
+kubectl -n arc-runners patch ephemeralrunnerset <name> \
+  --type merge -p '{"spec":{"replicas":100}}'
+
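+# Monitor creation time (EphemeralRunner reports .status.phase rather than a Ready condition)
+time kubectl -n arc-runners wait --for=jsonpath='{.status.phase}'=Running \
+  ephemeralrunners --all --timeout=600s
+
+# Check controller metrics (assumes metrics are enabled on :8080; run the port-forward in a separate terminal, it blocks)
+kubectl -n arc-systems port-forward deployment/arc-controller-gha-rs-controller 8080:8080
+curl -s http://localhost:8080/metrics | grep ephemeral_runner_creation_duration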
+```
+
+## Key Metrics to Track
+
+- `ephemeral_runner_creation_duration_seconds` - Time to create each runner (proposed metric, to be added under tasks.md Task 2.4)
+- `ephemeral_runner_set_replicas` - Current vs desired replicas
+- `controller_runtime_reconcile_time_seconds` - Reconciliation performance (exported by controller-runtime)
+
+## Files Referenced
+
+@ENV_SETUP.md - Complete setup guide for new mode
+@tasks.md - Performance optimization task plan
+@controllers/actions.github.com/ephemeralrunnerset_controller.go
+@controllers/actions.github.com/ephemeralrunner_controller.go
+@controllers/actions.github.com/autoscalingrunnerset_controller.go
--- a/ENV_SETUP.md
+++ b/ENV_SETUP.md
@@ -0,0 +1,382 @@
+# Local Development Environment Setup - Runner Scale Set Controller
+
+This guide sets up a local development environment for the **NEW** GitHub Actions Runner Scale Set Controller (not the legacy mode).
+
+## Important Notes
+
+- **NO cert-manager required** - The new mode doesn't use admission webhooks
+- **NO legacy controller** - We only work with the new `actions.github.com` API group
+- Uses separate Helm charts: `gha-runner-scale-set-controller` and `gha-runner-scale-set`
+- GitHub username: `justanotherspy`
+- Docker Hub account: `danielschwartzlol`
+
+## Prerequisites
+
+### Required Tools
+
+1. **Docker** - For running containers and Kind cluster
+
+   ```bash
+   # Ubuntu/Debian
+   sudo apt-get update
+   sudo apt-get install docker.io
+   sudo usermod -aG docker $USER
+   # Log out and back in for group changes to take effect
+   ```
+
+2. **Kind** - Kubernetes in Docker
+
+   ```bash
+   # Install Kind
+   curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
+   chmod +x ./kind
+   sudo mv ./kind /usr/local/bin/kind
+   ```
+
+3. **kubectl** - Kubernetes CLI
+
+   ```bash
+   curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
+   chmod +x kubectl
+   sudo mv kubectl /usr/local/bin/
+   ```
+
+4. **Helm** - Kubernetes package manager
+
+   ```bash
+   curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
+   ```
+
+5. **Go** - For building the controller (1.21+)
+
+   ```bash
+   # Install Go 1.21
+   wget https://go.dev/dl/go1.21.5.linux-amd64.tar.gz
+   sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.21.5.linux-amd64.tar.gz
+   export PATH=$PATH:/usr/local/go/bin
+   echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc
+   ```
+
+### Environment Variables
+
+Add these to your `.bashrc` or `.zshrc`:
+
+```bash
+# Docker Hub Configuration
+export DOCKER_USER="danielschwartzlol"
+export CONTROLLER_IMAGE="${DOCKER_USER}/gha-runner-scale-set-controller"
+export RUNNER_IMAGE="ghcr.io/actions/actions-runner"  # Official runner image
+
+# GitHub Configuration
+export GITHUB_TOKEN="your-github-pat-token-here"
+export GITHUB_USERNAME="justanotherspy"
+
+# Or for GitHub App authentication (recommended):
+# export APP_ID="your-app-id"
+# export INSTALLATION_ID="your-installation-id"
+# export PRIVATE_KEY_FILE_PATH="/path/to/private-key.pem"
+
+# Test Repository Configuration
+export TEST_REPO="${GITHUB_USERNAME}/test-runner-repo"
+export TEST_ORG=""  # Optional: Your test organization
+
+# Development Settings
+export VERSION="dev"
+export CLUSTER_NAME="arc-dev"
+```
+
+## Step 1: Build the Controller Image
+
+```bash
+# Build the controller image with scale set mode
+make docker-build
+
+# Tag it for our use
+docker tag ${DOCKER_USER}/actions-runner-controller:${VERSION} \
+           ${CONTROLLER_IMAGE}:${VERSION}
+```
+
+## Step 2: Create Kind Cluster
+
+Create a single-node Kind cluster. The new mode needs no special configuration; the `ingress-ready` node label below is optional:
+
+```bash
+# Create Kind cluster
+cat <<EOF | kind create cluster --name ${CLUSTER_NAME} --config=-
+kind: Cluster
+apiVersion: kind.x-k8s.io/v1alpha4
+nodes:
+- role: control-plane
+  kubeadmConfigPatches:
+  - |
+    kind: InitConfiguration
+    nodeRegistration:
+      kubeletExtraArgs:
+        node-labels: "ingress-ready=true"
+EOF
+
+# Verify cluster is running
+kubectl cluster-info --context kind-${CLUSTER_NAME}
+```
+
+## Step 3: Load Controller Image into Kind
+
+```bash
+# Load the controller image
+kind load docker-image ${CONTROLLER_IMAGE}:${VERSION} --name ${CLUSTER_NAME}
+
+# Verify image is loaded
+docker exec -it ${CLUSTER_NAME}-control-plane crictl images | grep ${DOCKER_USER}
+```
+
+## Step 4: Create GitHub Authentication Secret
+
+```bash
+# Create the runner namespace; the secret must live in the same namespace as the runner scale set
+kubectl create namespace arc-runners
+
+# Option 1: PAT authentication
+kubectl create secret generic github-auth \
+  --namespace=arc-runners \
+  --from-literal=github_token="${GITHUB_TOKEN}"
+
+# Option 2: GitHub App authentication (use instead of the PAT secret, not in addition to it)
+kubectl create secret generic github-auth \
+  --namespace=arc-runners \
+  --from-literal=github_app_id="${APP_ID}" \
+  --from-literal=github_app_installation_id="${INSTALLATION_ID}" \
+  --from-file=github_app_private_key="${PRIVATE_KEY_FILE_PATH}"
+```
+
+## Step 5: Install Runner Scale Set Controller
+
+### Option A: Using Helm (Recommended)
+
+```bash
+# Install the controller
+helm install arc-controller \
+  --namespace arc-systems \
+  --create-namespace \
+  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
+  --version 0.12.1 \
+  --set image.repository=${CONTROLLER_IMAGE} \
+  --set image.tag=${VERSION} \
+  --set image.pullPolicy=Never
+
+# Verify controller is running
+kubectl -n arc-systems get pods -l app.kubernetes.io/name=gha-rs-controller
+```
+
+### Option B: Manual Deployment (for development)
+
+```bash
+# Run the controller locally (for debugging)
+CONTROLLER_MANAGER_POD_NAMESPACE=arc-systems \
+CONTROLLER_MANAGER_CONTAINER_IMAGE="${CONTROLLER_IMAGE}:${VERSION}" \
+make run-scaleset
+```
+
+## Step 6: Deploy Runner Scale Set
+
+Create a runner scale set for your repository:
+
+```bash
+# Install runner scale set
+helm install arc-runner-set \
+  --namespace arc-runners \
+  --create-namespace \
+  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
+  --version 0.12.1 \
+  --set githubConfigUrl="https://github.com/${TEST_REPO}" \
+  --set githubConfigSecret="github-auth" \
+  --set controllerServiceAccount.namespace="arc-systems" \
+  --set controllerServiceAccount.name="arc-controller-gha-rs-controller" \
+  --set minRunners=1 \
+  --set maxRunners=10 \
+  --set runnerGroup="default" \
+  --set runnerScaleSetName="test-scale-set"
+
+# Watch the runner scale set (each command blocks, so run them in separate terminals)
+kubectl -n arc-runners get autoscalingrunnersets -w
+kubectl -n arc-runners get ephemeralrunnersets -w
+kubectl -n arc-runners get ephemeralrunners -w
+```
+
+## Step 7: Verify Installation
+
+```bash
+# Check controller logs
+kubectl -n arc-systems logs -l app.kubernetes.io/name=gha-rs-controller -f
+
+# Check listener logs
+kubectl -n arc-systems logs -l app.kubernetes.io/name=arc-runner-set-listener -f
+
+# Check runner pods
+kubectl -n arc-runners get pods
+
+# Get runner scale set status
+kubectl -n arc-runners get autoscalingrunnersets -o wide
+```
+
+## Development Workflow
+
+### Quick Iteration for Controller Changes
+
+```bash
+# 1. Make your code changes
+
+# 2. Rebuild controller (export VERSION so the tag, load, and set-image steps below all use the same value)
+export VERSION="dev-$(date +%s)" && make docker-build
+docker tag ${DOCKER_USER}/actions-runner-controller:${VERSION} \
+           ${CONTROLLER_IMAGE}:${VERSION}
+
+# 3. Load into Kind
+kind load docker-image ${CONTROLLER_IMAGE}:${VERSION} --name ${CLUSTER_NAME}
+
+# 4. Update the deployment
+kubectl -n arc-systems set image deployment/arc-controller-gha-rs-controller \
+  manager=${CONTROLLER_IMAGE}:${VERSION}
+
+# 5. Watch logs
+kubectl -n arc-systems logs -l app.kubernetes.io/name=gha-rs-controller -f
+```
+
+### Testing Parallel Runner Creation
+
+```bash
+# Scale up to test parallel creation (the AutoscalingRunnerSet is named after runnerScaleSetName from Step 6)
+kubectl -n arc-runners patch autoscalingrunnerset test-scale-set \
+  --type merge \
+  -p '{"spec":{"maxRunners":50}}'
+
+# Trigger scale up by running workflows in your test repo
+# Or manually patch the ephemeralrunnerset
+kubectl -n arc-runners patch ephemeralrunnerset <name> \
+  --type merge \
+  -p '{"spec":{"replicas":50}}'
+
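+# Monitor creation time (EphemeralRunner reports .status.phase rather than a Ready condition)
+time kubectl -n arc-runners wait --for=jsonpath='{.status.phase}'=Running ephemeralrunners --all --timeout=600s
+
+# Check metrics (assumes metrics are enabled on :8080; run the port-forward in a separate terminal, it blocks)
+kubectl -n arc-systems port-forward deployment/arc-controller-gha-rs-controller 8080:8080
+curl -s http://localhost:8080/metrics | grep ephemeral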
+```
+
+## Debugging
+
+### Enable Verbose Logging
+
+```bash
+# Update controller deployment with debug logging
+kubectl -n arc-systems edit deployment arc-controller-gha-rs-controller
+
+# Add to container args:
+# - "--log-level=debug"
+```
+
+### Common Commands
+
+```bash
+# Get all resources
+kubectl get all -n arc-systems
+kubectl get all -n arc-runners
+
+# Describe runner set
+kubectl -n arc-runners describe autoscalingrunnerset
+
+# Get events
+kubectl -n arc-runners get events --sort-by='.lastTimestamp'
+
+# Port forward for pprof debugging (only works if the controller is built or configured to serve pprof on :6060)
+kubectl -n arc-systems port-forward deployment/arc-controller-gha-rs-controller 6060:6060
+go tool pprof http://localhost:6060/debug/pprof/profile
+```
+
+## Performance Testing Script
+
+```bash
+#!/bin/bash
+# perf-test.sh
+
+NAMESPACE="arc-runners"
+REPLICAS="${1:-100}"
+
+echo "Testing creation of ${REPLICAS} runners..."
+
+# Record start time
+START=$(date +%s)
+
+# Scale up the first EphemeralRunnerSet (-o name already returns "<resource>/<name>", so no type argument)
+kubectl -n ${NAMESPACE} patch "$(kubectl -n ${NAMESPACE} get ephemeralrunnersets -o name | head -1)" \
+  --type merge \
+  -p "{\"spec\":{\"replicas\":${REPLICAS}}}"
+
+# Wait for all runners (checks .status.phase; only runners that exist when the wait starts are tracked)
+kubectl -n ${NAMESPACE} wait --for=jsonpath='{.status.phase}'=Running ephemeralrunners --all --timeout=600s
+
+# Record end time
+END=$(date +%s)
+DURATION=$((END - START))
+
+echo "Created ${REPLICAS} runners in ${DURATION} seconds"
+echo "Average time per runner: $(awk -v d="${DURATION}" -v n="${REPLICAS}" 'BEGIN { printf "%.2f", d / n }') seconds"
+
+# Get runner creation events
+kubectl -n ${NAMESPACE} get events --field-selector reason=Created | grep EphemeralRunner
+```
+
+## Cleanup
+
+```bash
+# Delete runner scale set
+helm uninstall arc-runner-set -n arc-runners
+
+# Delete controller
+helm uninstall arc-controller -n arc-systems
+
+# Delete namespaces
+kubectl delete namespace arc-systems arc-runners
+
+# Delete Kind cluster
+kind delete cluster --name ${CLUSTER_NAME}
+```
+
+## Troubleshooting
+
+### Runner Scale Set Not Creating Runners
+
+```bash
+# Check if runner scale set is registered (the ID may appear as an annotation such as runner-scale-set-id)
+kubectl -n arc-runners get autoscalingrunnerset -o yaml | grep -iE 'runner-?scale-?set-?id'
+
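+# Check GitHub API connectivity from a throwaway curl pod (the endpoint requires POST; HTTP 201 means success)
+kubectl -n arc-systems run gh-api-check --rm -i --restart=Never --image=curlimages/curl -- \
+  -sS -o /dev/null -w '%{http_code}\n' -X POST -H "Authorization: Bearer ${GITHUB_TOKEN}" \
+  "https://api.github.com/repos/${TEST_REPO}/actions/runners/registration-token"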
+```
+
+### Runners Not Picking Up Jobs
+
+```bash
+# Ensure runs-on matches the runner scale set name (scale sets are not selected by self-hosted labels)
+# In workflow file:
+# runs-on: test-scale-set  # value of runnerScaleSetName
+
+# Check runner registration
+kubectl -n arc-runners logs -l app.kubernetes.io/component=runner --tail=100
+```
+
+## Key Differences from Legacy Mode
+
+1. **No Cert-Manager**: New mode doesn't use admission webhooks
+2. **Different CRDs**: Uses `AutoscalingRunnerSet`, `EphemeralRunnerSet`, `EphemeralRunner`
+3. **Separate Helm Charts**: `gha-runner-scale-set-controller` and `gha-runner-scale-set`
+4. **Listener Pod**: Runs in the controller namespace and long-polls the GitHub Actions service for job messages (no inbound webhooks)
+5. **No Runner Deployment**: Only uses ephemeral runners
+
+## Resources
+
+- [Runner Scale Set Documentation](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller)
+- [ARC Helm Charts](https://github.com/actions/actions-runner-controller/tree/master/charts)
+- [Kind Documentation](https://kind.sigs.k8s.io/)
--- a/tasks.md
+++ b/tasks.md
@@ -0,0 +1,246 @@
+# Runner Scale Set Controller Performance Optimization
+
+## Problem Analysis
+
+Based on analysis of the codebase, the runner scale set controller currently spawns runners **sequentially** in the `EphemeralRunnerSetReconciler.createEphemeralRunners()` method at `/controllers/actions.github.com/ephemeralrunnerset_controller.go:359-386`.
+
+### Current Sequential Implementation Issues:
+1. **Linear time complexity O(n)**: Creating n runners takes n sequential API calls
+2. **Blocking loop**: Each runner creation blocks until the API call completes
+3. **Poor scalability**: Large scale-ups (e.g., 100+ runners) take minutes
+4. **Resource underutilization**: Controller pod doesn't leverage available CPU/memory for parallel operations
+
+### Key Bottlenecks Identified:
+- **EphemeralRunnerSet Controller** (`ephemeralrunnerset_controller.go:362-383`): Sequential for-loop creating runners one by one
+- **API Call Latency**: Each `r.Create(ctx, ephemeralRunner)` call blocks for network roundtrip
+- **No batching**: Individual API calls instead of batch operations
+- **No concurrency**: Single-threaded execution path
+
+## Proposed Task List for Performance Improvement
+
+### Phase 1: Research & Design (Week 1)
+- [ ] **Task 1.1**: Benchmark current performance
+  - Measure time to create 10, 50, 100, 500 runners
+  - Profile CPU/memory usage during scale-up
+  - Document baseline metrics for comparison
+
+- [ ] **Task 1.2**: Research Kubernetes client-go patterns for concurrent resource creation
+  - Study controller-runtime workqueue patterns
+  - Investigate rate limiting considerations
+  - Review best practices for bulk operations
+
+- [ ] **Task 1.3**: Design concurrent runner creation architecture
+  - Define optimal concurrency level (suggest: configurable, default 10)
+  - Design error handling and retry strategy
+  - Plan backward compatibility approach
+
+### Phase 2: Implementation (Week 2-3)
+
+- [ ] **Task 2.1**: Refactor `createEphemeralRunners` for parallel execution
+  ```go
+  // Suggested approach:
+  // - Use worker pool pattern with configurable concurrency
+  // - Implement error aggregation
+  // - Add progress tracking
+  ```
+
+- [ ] **Task 2.2**: Implement configurable concurrency controls
+  - Add `--runner-creation-concurrency` flag (default: 10)
+  - Add `--runner-creation-timeout` flag (default: 30s)
+  - Environment variable overrides for containerized deployments
+
+- [ ] **Task 2.3**: Add comprehensive error handling
+  - Implement exponential backoff for failed creations
+  - Partial success handling (some runners created, some failed)
+  - Detailed error reporting and metrics
+
+- [ ] **Task 2.4**: Implement progress tracking and observability
+  - Add prometheus metrics for creation time per runner
+  - Log progress at intervals (e.g., "Created 50/100 runners")
+  - Add events to AutoscalingRunnerSet for visibility
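For Task 2.2, the precedence between the flag, the environment override, and the default could look like the sketch below. It is an assumption-laden illustration: `resolveConcurrency` is a hypothetical helper, and the environment variable name `RUNNER_CREATION_CONCURRENCY` mentioned in the comment is invented here, since the plan does not name one.

```go
package main

import "strconv"

// defaultCreationConcurrency mirrors the default of 10 proposed in Task 2.2.
const defaultCreationConcurrency = 10

// resolveConcurrency picks the effective concurrency. A valid positive
// integer in envValue (for example read from a hypothetical
// RUNNER_CREATION_CONCURRENCY variable) wins over the flag value;
// anything invalid falls back to the flag, then to the default.
func resolveConcurrency(flagValue int, envValue string) int {
	if n, err := strconv.Atoi(envValue); err == nil && n > 0 {
		return n
	}
	if flagValue > 0 {
		return flagValue
	}
	return defaultCreationConcurrency
}
```

In a real binary the two inputs would typically come from `flag.Int("runner-creation-concurrency", 10, ...)` and `os.Getenv(...)`. Keeping the precedence logic in a pure function makes it easy to unit test.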
+
+### Phase 3: Testing (Week 3-4)
+
+- [ ] **Task 3.1**: Unit tests for concurrent creation
+  - Test with mock client
+  - Verify error handling
+  - Test concurrency limits
+  - Test partial failures
+
+- [ ] **Task 3.2**: Integration tests
+  - Test with real Kubernetes API
+  - Verify resource creation order
+  - Test rollback on failure
+  - Test with various concurrency levels
+
+- [ ] **Task 3.3**: Load testing
+  - Test creating 100+ runners simultaneously
+  - Monitor API server impact
+  - Measure improvement vs baseline
+  - Test with rate limiting
+
+- [ ] **Task 3.4**: Chaos testing
+  - Test with network failures
+  - Test with API server throttling
+  - Test with partial quota exhaustion
+  - Test controller restart during creation
+
+### Phase 4: Optimization & Tuning (Week 4-5)
+
+- [ ] **Task 4.1**: Implement adaptive concurrency
+  - Start with low concurrency, increase based on success rate
+  - Back off on errors or throttling
+  - Self-tuning based on cluster capacity
+
+- [ ] **Task 4.2**: Add bulk creation API support (if available)
+  - Research if Actions API supports bulk runner registration
+  - Implement batch registration if supported
+  - Fall back to parallel individual creation
+
+- [ ] **Task 4.3**: Optimize resource creation
+  - Pre-compute runner configurations
+  - Cache common data (secrets, configs)
+  - Minimize API calls per runner
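One way to read Task 4.1 is as additive-increase, multiplicative-decrease (AIMD): add one slot after each clean batch and halve the limit after a batch that saw errors or throttling. The sketch below assumes that interpretation. `adaptiveLimit` and its methods are hypothetical names, and the type is not safe for concurrent use without a lock.

```go
package main

// adaptiveLimit tracks a concurrency limit between min and max.
// Hypothetical type for illustration; guard it with a mutex if it is
// shared between goroutines.
type adaptiveLimit struct {
	current, min, max int
}

// newAdaptiveLimit starts at the minimum, matching "start with low
// concurrency" in Task 4.1.
func newAdaptiveLimit(min, max int) *adaptiveLimit {
	if min < 1 {
		min = 1
	}
	if max < min {
		max = min
	}
	return &adaptiveLimit{current: min, min: min, max: max}
}

// observe updates the limit after a batch and returns the new value:
// a clean batch adds one slot (up to max), a failed or throttled batch
// halves the limit (down to min).
func (a *adaptiveLimit) observe(batchFailed bool) int {
	if batchFailed {
		a.current /= 2
		if a.current < a.min {
			a.current = a.min
		}
	} else if a.current < a.max {
		a.current++
	}
	return a.current
}
```

Halving on failure backs off quickly when the API server starts throttling, while the slow additive climb avoids overshooting again right away.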
+
+### Phase 5: Documentation & Rollout (Week 5-6)
+
+- [ ] **Task 5.1**: Document configuration options
+  - Update CLAUDE.md with new flags
+  - Add tuning guide for different cluster sizes
+  - Document performance improvements
+
+- [ ] **Task 5.2**: Create migration guide
+  - Document any breaking changes
+  - Provide upgrade path
+  - Include rollback procedures
+
+- [ ] **Task 5.3**: Performance report
+  - Before/after benchmarks
+  - Scalability analysis
+  - Recommendations for different use cases
+
+## Implementation Details
+
+### Suggested Code Structure
+
+```go
+// ephemeralrunnerset_controller.go
+
+type runnerCreationJob struct {
+    runner *v1alpha1.EphemeralRunner
+    index  int
+    err    error
+}
+
+func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel(
+    ctx context.Context, 
+    runnerSet *v1alpha1.EphemeralRunnerSet, 
+    count int, 
+    log logr.Logger,
+) error {
+    concurrency := r.getConfiguredConcurrency() // Default: 10
+    
+    jobs := make(chan runnerCreationJob, count)
+    results := make(chan runnerCreationJob, count)
+    
+    // Start workers
+    var wg sync.WaitGroup
+    for i := 0; i < concurrency; i++ {
+        wg.Add(1)
+        go r.runnerCreationWorker(ctx, runnerSet, jobs, results, &wg, log)
+    }
+    
+    // Queue jobs
+    for i := 0; i < count; i++ {
+        jobs <- runnerCreationJob{
+            runner: r.newEphemeralRunner(runnerSet),
+            index:  i,
+        }
+    }
+    close(jobs)
+    
+    // Wait for completion
+    go func() {
+        wg.Wait()
+        close(results)
+    }()
+    
+    // Collect results and handle errors
+    var errs []error
+    created := 0
+    for result := range results {
+        if result.err != nil {
+            errs = append(errs, result.err)
+        } else {
+            created++
+            if created%10 == 0 || created == count {
+                log.Info("Runner creation progress", "created", created, "total", count)
+            }
+        }
+    }
+    
+    return multierr.Combine(errs...)
+}
+```
+
+## Success Metrics
+
+1. **Performance**: 
+   - Target: Create 100 runners in < 30 seconds (vs current ~5 minutes)
+   - Cut wall-clock creation time from roughly n to roughly n/c sequential API round trips, where c = concurrency (the total number of API calls stays O(n))
+
+2. **Reliability**:
+   - Handle partial failures gracefully
+   - No runner leaks on error
+   - Proper cleanup on controller restart
+
+3. **Observability**:
+   - Clear progress tracking
+   - Detailed metrics and logs
+   - Actionable error messages
+
+4. **Compatibility**:
+   - Backward compatible by default
+   - Configurable for different environments
+   - No breaking changes to CRDs
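The performance target in item 1 can be checked with a small idealized model. The "~5 minutes" baseline for 100 runners implies about 3 seconds per runner (300 s / 100), a figure derived here rather than measured. With c calls in flight, creation takes ceil(n/c) rounds of that latency. `estimateCreation` is a hypothetical helper, and the model ignores API server throttling and scheduling overhead.

```go
package main

import "time"

// estimateCreation returns the idealized wall-clock time to create n
// runners when each API call takes perCall and c calls run at once:
// ceil(n/c) rounds of perCall each.
func estimateCreation(n, c int, perCall time.Duration) time.Duration {
	if n <= 0 {
		return 0
	}
	if c < 1 {
		c = 1
	}
	rounds := (n + c - 1) / c // integer ceiling of n/c
	return time.Duration(rounds) * perCall
}
```

Under these assumptions, 100 runners at 3 s each take 300 s sequentially and 30 s at concurrency 10. That sits exactly on the target boundary, so getting strictly under 30 seconds would need either a higher concurrency or lower per-call latency.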
+
+## Risk Mitigation
+
+1. **API Server Overload**: Implement rate limiting and backoff
+2. **Resource Exhaustion**: Add memory/CPU limits and monitoring
+3. **Partial Failures**: Implement proper rollback and cleanup
+4. **Race Conditions**: Use proper locking and atomic operations
+
+## Testing Requirements
+
+- Unit test coverage > 80%
+- Integration tests for all scenarios
+- Performance regression tests
+- Documentation for all new features
+- Backward compatibility tests
+
+## Rollout Plan
+
+1. **Alpha**: Deploy to dev environment with conservative defaults
+2. **Beta**: Test with select users, gather feedback
+3. **GA**: Full rollout with documentation and migration guide
+
+## Dependencies
+
+- No changes to CRDs required
+- Compatible with existing Actions Runner Controller versions
+- Requires Go 1.20+ for errors.Join support (already in use)
+
+## Timeline Estimate
+
+- Total Duration: 5-6 weeks
+- Developer Resources: 1-2 engineers
+- Review & Testing: Additional 1 week
+
+## Notes for Implementation
+
+1. Consider using `golang.org/x/sync/errgroup` for cleaner error handling
+2. Leverage existing `multierr` package for error aggregation
+3. Use context cancellation for proper cleanup
+4. Consider implementing circuit breaker pattern for API failures
+5. Add feature flag to enable/disable parallel creation
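The circuit breaker in note 4 could start from something as small as the sketch below, which opens after a run of consecutive failures. This is a simplified illustration under stated assumptions: `breaker`, `call`, and `reset` are hypothetical names, and a production breaker would normally also re-close on its own after a cool-down period.

```go
package main

import (
	"errors"
	"sync"
)

// errCircuitOpen is returned when the breaker refuses to run a call.
var errCircuitOpen = errors.New("circuit open: skipping call")

// breaker opens after `threshold` consecutive failures and stays open
// until reset is called. Hypothetical type for illustration.
type breaker struct {
	mu        sync.Mutex
	threshold int
	failures  int
}

// call runs fn unless the breaker is open. A success clears the
// failure streak; a failure extends it.
func (b *breaker) call(fn func() error) error {
	b.mu.Lock()
	open := b.failures >= b.threshold
	b.mu.Unlock()
	if open {
		return errCircuitOpen
	}

	err := fn() // run without holding the lock so workers are not serialized

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		return err
	}
	b.failures = 0
	return nil
}

// reset closes the breaker again, for example after a cool-down.
func (b *breaker) reset() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures = 0
}
```

Shared between the creation workers, a breaker like this stops a failing API from being hit once per remaining runner. The next reconcile can call `reset` and try again.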