Runner Scale Set Controller Performance Optimization
Problem Analysis
Based on analysis of the codebase, the runner scale set controller currently spawns runners sequentially in the EphemeralRunnerSetReconciler.createEphemeralRunners() method at /controllers/actions.github.com/ephemeralrunnerset_controller.go:359-386.
Current Sequential Implementation Issues:
- Linear time complexity O(n): Creating n runners takes n sequential API calls
- Blocking loop: Each runner creation blocks until the API call completes
- Poor scalability: Large scale-ups (e.g., 100+ runners) take minutes
- Resource underutilization: Controller pod doesn't leverage available CPU/memory for parallel operations
Key Bottlenecks Identified:
- EphemeralRunnerSet Controller (ephemeralrunnerset_controller.go:362-383): Sequential for-loop creating runners one by one
- API Call Latency: Each r.Create(ctx, ephemeralRunner) call blocks for a network round trip
- No batching: Individual API calls instead of batch operations
- No concurrency: Single-threaded execution path
Proposed Task List for Performance Improvement
Phase 1: Research & Design (Week 1)
- Task 1.1: Benchmark current performance
  - Measure time to create 10, 50, 100, 500 runners
  - Profile CPU/memory usage during scale-up
  - Document baseline metrics for comparison
- Task 1.2: Research Kubernetes client-go patterns for concurrent resource creation
  - Study controller-runtime workqueue patterns
  - Investigate rate limiting considerations
  - Review best practices for bulk operations
- Task 1.3: Design concurrent runner creation architecture
  - Define optimal concurrency level (suggested: configurable, default 10)
  - Design error handling and retry strategy
  - Plan backward compatibility approach
Phase 2: Implementation (Week 2-3)
- Task 2.1: Refactor createEphemeralRunners for parallel execution
  // Suggested approach:
  // - Use worker pool pattern with configurable concurrency
  // - Implement error aggregation
  // - Add progress tracking
- Task 2.2: Implement configurable concurrency controls
  - Add --runner-creation-concurrency flag (default: 10)
  - Add --runner-creation-timeout flag (default: 30s)
  - Environment variable overrides for containerized deployments
- Task 2.3: Add comprehensive error handling
  - Implement exponential backoff for failed creations
  - Partial success handling (some runners created, some failed)
  - Detailed error reporting and metrics
- Task 2.4: Implement progress tracking and observability
  - Add Prometheus metrics for creation time per runner
  - Log progress at intervals (e.g., "Created 50/100 runners")
  - Add events to AutoscalingRunnerSet for visibility
Phase 3: Testing (Week 3-4)
- Task 3.1: Unit tests for concurrent creation
  - Test with mock client
  - Verify error handling
  - Test concurrency limits
  - Test partial failures
- Task 3.2: Integration tests
  - Test with real Kubernetes API
  - Verify resource creation order
  - Test rollback on failure
  - Test with various concurrency levels
- Task 3.3: Load testing
  - Test creating 100+ runners simultaneously
  - Monitor API server impact
  - Measure improvement vs baseline
  - Test with rate limiting
- Task 3.4: Chaos testing
  - Test with network failures
  - Test with API server throttling
  - Test with partial quota exhaustion
  - Test controller restart during creation
Phase 4: Optimization & Tuning (Week 4-5)
- Task 4.1: Implement adaptive concurrency
  - Start with low concurrency, increase based on success rate
  - Back off on errors or throttling
  - Self-tuning based on cluster capacity
- Task 4.2: Add bulk creation API support (if available)
  - Research if Actions API supports bulk runner registration
  - Implement batch registration if supported
  - Fall back to parallel individual creation
- Task 4.3: Optimize resource creation
  - Pre-compute runner configurations
  - Cache common data (secrets, configs)
  - Minimize API calls per runner
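One way to read Task 4.1 is as additive increase, multiplicative decrease (AIMD), the same shape TCP congestion control uses. The sketch below is one possible interpretation, not a settled design: `adaptiveLimit` and `onBatch` are hypothetical names, and the "grow by one, halve on trouble" policy is an assumption.

```go
package main

import "fmt"

// adaptiveLimit tracks a concurrency limit between lo and hi.
// It is not safe for concurrent use; callers would need their own locking.
type adaptiveLimit struct {
	lo, hi, current int
}

func newAdaptiveLimit(lo, hi int) *adaptiveLimit {
	return &adaptiveLimit{lo: lo, hi: hi, current: lo}
}

// onBatch adjusts the limit after a batch of creations and returns the new value:
// it grows by one after a clean batch and halves after failures or throttling.
func (a *adaptiveLimit) onBatch(failures int, throttled bool) int {
	if failures > 0 || throttled {
		a.current /= 2
		if a.current < a.lo {
			a.current = a.lo
		}
	} else if a.current < a.hi {
		a.current++
	}
	return a.current
}

func main() {
	lim := newAdaptiveLimit(2, 20)
	for batch := 1; batch <= 6; batch++ {
		throttled := batch == 5 // pretend the API server throttled batch 5
		fmt.Printf("batch %d -> limit %d\n", batch, lim.onBatch(0, throttled))
	}
	// Limits printed: 3, 4, 5, 6, 3, 4
}
```

Halving on the first sign of throttling recovers quickly from overload, while the slow linear growth avoids oscillating around the API server's capacity.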
Phase 5: Documentation & Rollout (Week 5-6)
- Task 5.1: Document configuration options
  - Update CLAUDE.md with new flags
  - Add tuning guide for different cluster sizes
  - Document performance improvements
- Task 5.2: Create migration guide
  - Document any breaking changes
  - Provide upgrade path
  - Include rollback procedures
- Task 5.3: Performance report
  - Before/after benchmarks
  - Scalability analysis
  - Recommendations for different use cases
Implementation Details
Suggested Code Structure
// ephemeralrunnerset_controller.go
// Requires: context, sync, github.com/go-logr/logr, go.uber.org/multierr,
// and the project's v1alpha1 API package.

// runnerCreationJob carries one runner through the pool; err is set by the worker.
type runnerCreationJob struct {
	runner *v1alpha1.EphemeralRunner
	index  int
	err    error
}

func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel(
	ctx context.Context,
	runnerSet *v1alpha1.EphemeralRunnerSet,
	count int,
	log logr.Logger,
) error {
	concurrency := r.getConfiguredConcurrency() // Default: 10

	// Both channels are buffered to count so neither queueing nor reporting blocks.
	jobs := make(chan runnerCreationJob, count)
	results := make(chan runnerCreationJob, count)

	// Start workers
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go r.runnerCreationWorker(ctx, runnerSet, jobs, results, &wg, log)
	}

	// Queue jobs
	for i := 0; i < count; i++ {
		jobs <- runnerCreationJob{
			runner: r.newEphemeralRunner(runnerSet),
			index:  i,
		}
	}
	close(jobs)

	// Wait for completion, then close results so the collection loop ends
	go func() {
		wg.Wait()
		close(results)
	}()

	// Collect results and handle errors
	var errs []error
	created := 0
	for result := range results {
		if result.err != nil {
			errs = append(errs, result.err)
		} else {
			created++
			if created%10 == 0 || created == count {
				log.Info("Runner creation progress", "created", created, "total", count)
			}
		}
	}

	// Returns nil when errs is empty
	return multierr.Combine(errs...)
}
Success Metrics
- Performance:
  - Target: Create 100 runners in < 30 seconds (vs current ~5 minutes)
  - Reduce wall-clock creation time from roughly n sequential round trips to roughly n/c, where c is the concurrency level
- Reliability:
  - Handle partial failures gracefully
  - No runner leaks on error
  - Proper cleanup on controller restart
- Observability:
  - Clear progress tracking
  - Detailed metrics and logs
  - Actionable error messages
- Compatibility:
  - Backward compatible by default
  - Configurable for different environments
  - No breaking changes to CRDs
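As a rough sanity check on the performance target, the figures above can be combined under two simplifying assumptions that are not measurements: every create call has the same latency, and the API server does not throttle. Here n is the number of runners, c the concurrency, and the per-call latency is back-computed from the "~5 minutes for 100 runners" estimate.

```latex
\begin{align*}
T_{\text{seq}} &\approx n\,\ell
  &&\Rightarrow\quad \ell \approx \frac{300\ \text{s}}{100} = 3\ \text{s per runner},\\
T_{\text{par}} &\approx \left\lceil \frac{n}{c} \right\rceil \ell
  = \left\lceil \frac{100}{10} \right\rceil \cdot 3\ \text{s} = 30\ \text{s}.
\end{align*}
```

Under these assumptions the suggested default of c = 10 lands right at the 30-second target rather than below it, so the Phase 1 baseline is worth having before the default is fixed.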
Risk Mitigation
- API Server Overload: Implement rate limiting and backoff
- Resource Exhaustion: Add memory/CPU limits and monitoring
- Partial Failures: Implement proper rollback and cleanup
- Race Conditions: Use proper locking and atomic operations
Testing Requirements
- Unit test coverage > 80%
- Integration tests for all scenarios
- Performance regression tests
- Documentation for all new features
- Backward compatibility tests
Rollout Plan
- Alpha: Deploy to dev environment with conservative defaults
- Beta: Test with select users, gather feedback
- GA: Full rollout with documentation and migration guide
Dependencies
- No changes to CRDs required
- Compatible with existing Actions Runner Controller versions
- Requires Go 1.20+ for errors.Join support (already in use)
Timeline Estimate
- Total Duration: 5-6 weeks
- Developer Resources: 1-2 engineers
- Review & Testing: Additional 1 week
Notes for Implementation
- Consider using golang.org/x/sync/errgroup for cleaner error handling
- Leverage existing multierr package for error aggregation
- Use context cancellation for proper cleanup
- Consider implementing circuit breaker pattern for API failures
- Add feature flag to enable/disable parallel creation