Runner Scale Set Controller Performance Optimization

Problem Analysis

The runner scale set controller currently creates runners sequentially in the EphemeralRunnerSetReconciler.createEphemeralRunners() method at /controllers/actions.github.com/ephemeralrunnerset_controller.go:359-386.

Current Sequential Implementation Issues:

  1. Linear wall-clock cost: creating n runners takes n sequential API calls, so scale-up time grows as O(n) network round trips
  2. Blocking loop: Each runner creation blocks until the API call completes
  3. Poor scalability: Large scale-ups (e.g., 100+ runners) take minutes
  4. Resource underutilization: Controller pod doesn't leverage available CPU/memory for parallel operations

Key Bottlenecks Identified:

  • EphemeralRunnerSet Controller (ephemeralrunnerset_controller.go:362-383): Sequential for-loop creating runners one by one
  • API Call Latency: Each r.Create(ctx, ephemeralRunner) call blocks for network roundtrip
  • No batching: Individual API calls instead of batch operations
  • No concurrency: Single-threaded execution path

Proposed Task List for Performance Improvement

Phase 1: Research & Design (Week 1)

  • Task 1.1: Benchmark current performance

    • Measure time to create 10, 50, 100, 500 runners
    • Profile CPU/memory usage during scale-up
    • Document baseline metrics for comparison
  • Task 1.2: Research Kubernetes client-go patterns for concurrent resource creation

    • Study controller-runtime workqueue patterns
    • Investigate rate limiting considerations
    • Review best practices for bulk operations
  • Task 1.3: Design concurrent runner creation architecture

    • Define optimal concurrency level (suggested: configurable, default 10)
    • Design error handling and retry strategy
    • Plan backward compatibility approach
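For the baseline in Task 1.1, a small timing harness may be enough to start with. This is a minimal sketch: benchmarkScaleUp and the sample type are hypothetical names, and the create callback stands in for whatever actually triggers a scale-up of n runners and waits for it to finish.

```go
package main

import "time"

// sample records one benchmark run.
type sample struct {
	count   int           // number of runners requested
	elapsed time.Duration // wall-clock time for the whole scale-up
	err     error         // error returned by the create callback, if any
}

// benchmarkScaleUp calls create once per size and records how long each
// call took. create is expected to block until the scale-up has finished.
func benchmarkScaleUp(sizes []int, create func(n int) error) []sample {
	out := make([]sample, 0, len(sizes))
	for _, n := range sizes {
		start := time.Now()
		err := create(n)
		out = append(out, sample{count: n, elapsed: time.Since(start), err: err})
	}
	return out
}
```

Running it with sizes 10, 50, 100, and 500 against both the sequential and the parallel code path would give directly comparable before/after numbers.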

Phase 2: Implementation (Week 2-3)

  • Task 2.1: Refactor createEphemeralRunners for parallel execution

    // Suggested approach:
    // - Use worker pool pattern with configurable concurrency
    // - Implement error aggregation
    // - Add progress tracking
    
  • Task 2.2: Implement configurable concurrency controls

    • Add --runner-creation-concurrency flag (default: 10)
    • Add --runner-creation-timeout flag (default: 30s)
    • Environment variable overrides for containerized deployments
  • Task 2.3: Add comprehensive error handling

    • Implement exponential backoff for failed creations
    • Partial success handling (some runners created, some failed)
    • Detailed error reporting and metrics
  • Task 2.4: Implement progress tracking and observability

    • Add prometheus metrics for creation time per runner
    • Log progress at intervals (e.g., "Created 50/100 runners")
    • Add events to AutoscalingRunnerSet for visibility

Phase 3: Testing (Week 3-4)

  • Task 3.1: Unit tests for concurrent creation

    • Test with mock client
    • Verify error handling
    • Test concurrency limits
    • Test partial failures
  • Task 3.2: Integration tests

    • Test with real Kubernetes API
    • Verify resource creation order
    • Test rollback on failure
    • Test with various concurrency levels
  • Task 3.3: Load testing

    • Test creating 100+ runners simultaneously
    • Monitor API server impact
    • Measure improvement vs baseline
    • Test with rate limiting
  • Task 3.4: Chaos testing

    • Test with network failures
    • Test with API server throttling
    • Test with partial quota exhaustion
    • Test controller restart during creation
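For the unit tests in Task 3.1, a fake creator that records peak in-flight calls and injects failures makes the concurrency limit and partial-failure cases checkable without a cluster. The sketch below is illustrative only: fakeCreator and createAll are hypothetical names, and createAll is a simplified stand-in for the function under test, not the controller's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeCreator stands in for the Kubernetes client in unit tests.
type fakeCreator struct {
	mu        sync.Mutex
	inFlight  int // calls currently executing
	peak      int // highest inFlight value observed
	calls     int // total calls made
	failEvery int // fail when index%failEvery == 0; 0 disables failures
}

// Create records concurrency statistics and optionally injects a failure.
func (f *fakeCreator) Create(index int) error {
	f.mu.Lock()
	f.inFlight++
	if f.inFlight > f.peak {
		f.peak = f.inFlight
	}
	f.calls++
	f.mu.Unlock()

	defer func() {
		f.mu.Lock()
		f.inFlight--
		f.mu.Unlock()
	}()

	if f.failEvery > 0 && index%f.failEvery == 0 {
		return fmt.Errorf("runner %d: injected failure", index)
	}
	return nil
}

// createAll runs create for indexes 0..count-1 with at most limit calls in
// flight, and reports how many succeeded plus every error encountered.
func createAll(count, limit int, create func(int) error) (created int, errs []error) {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // counting semaphore
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < count; i++ {
		sem <- struct{}{} // blocks while limit calls are already running
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }()
			err := create(i)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				errs = append(errs, err)
				return
			}
			created++
		}(i)
	}
	wg.Wait()
	return created, errs
}
```

A test can then assert that peak never exceeds the configured limit and that the created and failed counts add up to the requested total.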

Phase 4: Optimization & Tuning (Week 4-5)

  • Task 4.1: Implement adaptive concurrency

    • Start with low concurrency, increase based on success rate
    • Back off on errors or throttling
    • Self-tuning based on cluster capacity
  • Task 4.2: Add bulk creation API support (if available)

    • Research if Actions API supports bulk runner registration
    • Implement batch registration if supported
    • Fall back to parallel individual creation
  • Task 4.3: Optimize resource creation

    • Pre-compute runner configurations
    • Cache common data (secrets, configs)
    • Minimize API calls per runner
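One way to approach the adaptive concurrency in Task 4.1 is additive increase, multiplicative decrease (AIMD): grow the limit by one after a healthy batch and halve it after errors or throttling. The task list does not prescribe this policy; it is one plausible choice, and adaptiveLimit with its methods is a hypothetical name for this sketch.

```go
package main

import "sync"

// adaptiveLimit tracks a concurrency limit between min and max using an
// additive-increase, multiplicative-decrease policy.
type adaptiveLimit struct {
	mu      sync.Mutex
	current int
	min     int
	max     int
}

// newAdaptiveLimit starts at the minimum so the first batch is conservative.
func newAdaptiveLimit(min, max int) *adaptiveLimit {
	if min < 1 {
		min = 1
	}
	if max < min {
		max = min
	}
	return &adaptiveLimit{current: min, min: min, max: max}
}

// Observe records the outcome of one batch and returns the limit to use
// for the next one. unhealthy should be true when the batch saw errors or
// throttling from the API server.
func (a *adaptiveLimit) Observe(unhealthy bool) int {
	a.mu.Lock()
	defer a.mu.Unlock()
	if unhealthy {
		a.current /= 2 // back off quickly
		if a.current < a.min {
			a.current = a.min
		}
	} else {
		a.current++ // probe upward slowly
		if a.current > a.max {
			a.current = a.max
		}
	}
	return a.current
}

// Current returns the limit without recording an outcome.
func (a *adaptiveLimit) Current() int {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.current
}
```

The asymmetry is deliberate: a throttled API server recovers faster when clients shed load quickly and ramp back up gradually.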

Phase 5: Documentation & Rollout (Week 5-6)

  • Task 5.1: Document configuration options

    • Update CLAUDE.md with new flags
    • Add tuning guide for different cluster sizes
    • Document performance improvements
  • Task 5.2: Create migration guide

    • Document any breaking changes
    • Provide upgrade path
    • Include rollback procedures
  • Task 5.3: Performance report

    • Before/after benchmarks
    • Scalability analysis
    • Recommendations for different use cases

Implementation Details

Suggested Code Structure

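// ephemeralrunnerset_controller.go

import (
    "context"
    "fmt"
    "sync"

    "github.com/actions/actions-runner-controller/apis/actions.github.com/v1alpha1"
    "github.com/go-logr/logr"
    "go.uber.org/multierr"
)

// runnerCreationJob carries one runner through the worker pool.
// Workers set err before sending the job back on the results channel.
type runnerCreationJob struct {
    runner *v1alpha1.EphemeralRunner
    index  int
    err    error
}

func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel(
    ctx context.Context,
    runnerSet *v1alpha1.EphemeralRunnerSet,
    count int,
    log logr.Logger,
) error {
    concurrency := r.getConfiguredConcurrency() // Proposed helper; default: 10
    if concurrency < 1 {
        concurrency = 1
    }
    if concurrency > count {
        concurrency = count // No point starting more workers than jobs
    }

    // Both channels are buffered to count, so neither queueing jobs nor
    // reporting results can block.
    jobs := make(chan runnerCreationJob, count)
    results := make(chan runnerCreationJob, count)

    // Start workers
    var wg sync.WaitGroup
    for i := 0; i < concurrency; i++ {
        wg.Add(1)
        go r.runnerCreationWorker(ctx, runnerSet, jobs, results, &wg, log)
    }

    // Queue jobs
    for i := 0; i < count; i++ {
        jobs <- runnerCreationJob{
            runner: r.newEphemeralRunner(runnerSet),
            index:  i,
        }
    }
    close(jobs)

    // Close results once every worker has exited so the loop below ends
    go func() {
        wg.Wait()
        close(results)
    }()

    // Collect results and handle errors
    var errs []error
    created := 0
    for result := range results {
        if result.err != nil {
            errs = append(errs, result.err)
        } else {
            created++
            if created%10 == 0 || created == count {
                log.Info("Runner creation progress", "created", created, "total", count)
            }
        }
    }

    // Combine returns nil when errs is empty
    return multierr.Combine(errs...)
}

// runnerCreationWorker drains jobs until the channel is closed. Once ctx is
// cancelled, remaining jobs are reported as failed without calling the API.
func (r *EphemeralRunnerSetReconciler) runnerCreationWorker(
    ctx context.Context,
    runnerSet *v1alpha1.EphemeralRunnerSet,
    jobs <-chan runnerCreationJob,
    results chan<- runnerCreationJob,
    wg *sync.WaitGroup,
    log logr.Logger,
) {
    defer wg.Done()
    for job := range jobs {
        if err := ctx.Err(); err != nil {
            job.err = fmt.Errorf("runner %d skipped: %w", job.index, err)
        } else if err := r.Create(ctx, job.runner); err != nil {
            log.Error(err, "Failed to create ephemeral runner", "index", job.index, "ephemeralRunnerSet", runnerSet.Name)
            job.err = fmt.Errorf("runner %d: %w", job.index, err)
        }
        results <- job
    }
}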

Success Metrics

  1. Performance:

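    • Target: Create 100 runners in < 30 seconds (vs. an estimated ~5 minutes today, to be confirmed against the Task 1.1 baseline)
    • Reduce wall-clock time from roughly n sequential round trips to roughly n/c, where c = concurrency (the work is still O(n) API calls; only the elapsed time shrinks)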
  2. Reliability:

    • Handle partial failures gracefully
    • No runner leaks on error
    • Proper cleanup on controller restart
  3. Observability:

    • Clear progress tracking
    • Detailed metrics and logs
    • Actionable error messages
  4. Compatibility:

    • Backward compatible by default
    • Configurable for different environments
    • No breaking changes to CRDs

Risk Mitigation

  1. API Server Overload: Implement rate limiting and backoff
  2. Resource Exhaustion: Add memory/CPU limits and monitoring
  3. Partial Failures: Implement proper rollback and cleanup
  4. Race Conditions: Use proper locking and atomic operations

Testing Requirements

  • Unit test coverage > 80%
  • Integration tests for all scenarios
  • Performance regression tests
  • Documentation for all new features
  • Backward compatibility tests

Rollout Plan

  1. Alpha: Deploy to dev environment with conservative defaults
  2. Beta: Test with select users, gather feedback
  3. GA: Full rollout with documentation and migration guide

Dependencies

  • No changes to CRDs required
  • Compatible with existing Actions Runner Controller versions
  • Requires Go 1.20+ if errors.Join is used for error aggregation (already satisfied)

Timeline Estimate

  • Total Duration: 5-6 weeks
  • Developer Resources: 1-2 engineers
  • Review & Testing: Additional 1 week

Notes for Implementation

  1. Consider using golang.org/x/sync/errgroup for cleaner error handling
  2. Leverage the existing multierr package for error aggregation (or errors.Join from the standard library); pick one and use it consistently
  3. Use context cancellation for proper cleanup
  4. Consider implementing circuit breaker pattern for API failures
  5. Add feature flag to enable/disable parallel creation