actions-runner-controller/tasks.md

246 lines
7.9 KiB
Markdown

# Runner Scale Set Controller Performance Optimization
## Problem Analysis
Based on analysis of the codebase, the runner scale set controller currently spawns runners **sequentially** in the `EphemeralRunnerSetReconciler.createEphemeralRunners()` method at `/controllers/actions.github.com/ephemeralrunnerset_controller.go:359-386`.
### Current Sequential Implementation Issues:
1. **Linear time complexity O(n)**: Creating n runners takes n sequential API calls
2. **Blocking loop**: Each runner creation blocks until the API call completes
3. **Poor scalability**: Large scale-ups (e.g., 100+ runners) take minutes
4. **Resource underutilization**: Controller pod doesn't leverage available CPU/memory for parallel operations
### Key Bottlenecks Identified:
- **EphemeralRunnerSet Controller** (`ephemeralrunnerset_controller.go:362-383`): Sequential for-loop creating runners one by one
- **API Call Latency**: Each `r.Create(ctx, ephemeralRunner)` call blocks for network roundtrip
- **No batching**: Individual API calls instead of batch operations
- **No concurrency**: Single-threaded execution path
## Proposed Task List for Performance Improvement
### Phase 1: Research & Design (Week 1)
- [ ] **Task 1.1**: Benchmark current performance
- Measure time to create 10, 50, 100, 500 runners
- Profile CPU/memory usage during scale-up
- Document baseline metrics for comparison
- [ ] **Task 1.2**: Research Kubernetes client-go patterns for concurrent resource creation
- Study controller-runtime workqueue patterns
- Investigate rate limiting considerations
- Review best practices for bulk operations
- [ ] **Task 1.3**: Design concurrent runner creation architecture
- Define optimal concurrency level (suggest: configurable, default 10)
- Design error handling and retry strategy
- Plan backward compatibility approach
### Phase 2: Implementation (Week 2-3)
- [ ] **Task 2.1**: Refactor `createEphemeralRunners` for parallel execution
```go
// Suggested approach:
// - Use worker pool pattern with configurable concurrency
// - Implement error aggregation
// - Add progress tracking
```
- [ ] **Task 2.2**: Implement configurable concurrency controls
- Add `--runner-creation-concurrency` flag (default: 10)
- Add `--runner-creation-timeout` flag (default: 30s)
- Environment variable overrides for containerized deployments
- [ ] **Task 2.3**: Add comprehensive error handling
- Implement exponential backoff for failed creations
- Partial success handling (some runners created, some failed)
- Detailed error reporting and metrics
- [ ] **Task 2.4**: Implement progress tracking and observability
- Add prometheus metrics for creation time per runner
- Log progress at intervals (e.g., "Created 50/100 runners")
- Add events to AutoscalingRunnerSet for visibility
### Phase 3: Testing (Week 3-4)
- [ ] **Task 3.1**: Unit tests for concurrent creation
- Test with mock client
- Verify error handling
- Test concurrency limits
- Test partial failures
- [ ] **Task 3.2**: Integration tests
- Test with real Kubernetes API
- Verify resource creation order
- Test rollback on failure
- Test with various concurrency levels
- [ ] **Task 3.3**: Load testing
- Test creating 100+ runners simultaneously
- Monitor API server impact
- Measure improvement vs baseline
- Test with rate limiting
- [ ] **Task 3.4**: Chaos testing
- Test with network failures
- Test with API server throttling
- Test with partial quota exhaustion
- Test controller restart during creation
### Phase 4: Optimization & Tuning (Week 4-5)
- [ ] **Task 4.1**: Implement adaptive concurrency
- Start with low concurrency, increase based on success rate
- Back off on errors or throttling
- Self-tuning based on cluster capacity
- [ ] **Task 4.2**: Add bulk creation API support (if available)
- Research if Actions API supports bulk runner registration
- Implement batch registration if supported
- Fall back to parallel individual creation
- [ ] **Task 4.3**: Optimize resource creation
- Pre-compute runner configurations
- Cache common data (secrets, configs)
- Minimize API calls per runner
### Phase 5: Documentation & Rollout (Week 5-6)
- [ ] **Task 5.1**: Document configuration options
- Update CLAUDE.md with new flags
- Add tuning guide for different cluster sizes
- Document performance improvements
- [ ] **Task 5.2**: Create migration guide
- Document any breaking changes
- Provide upgrade path
- Include rollback procedures
- [ ] **Task 5.3**: Performance report
- Before/after benchmarks
- Scalability analysis
- Recommendations for different use cases
## Implementation Details
### Suggested Code Structure
```go
// ephemeralrunnerset_controller.go
type runnerCreationJob struct {
runner *v1alpha1.EphemeralRunner
index int
err error
}
func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel(
ctx context.Context,
runnerSet *v1alpha1.EphemeralRunnerSet,
count int,
log logr.Logger,
) error {
concurrency := r.getConfiguredConcurrency() // Default: 10
jobs := make(chan runnerCreationJob, count)
results := make(chan runnerCreationJob, count)
// Start workers
var wg sync.WaitGroup
for i := 0; i < concurrency; i++ {
wg.Add(1)
go r.runnerCreationWorker(ctx, runnerSet, jobs, results, &wg, log)
}
// Queue jobs
for i := 0; i < count; i++ {
jobs <- runnerCreationJob{
runner: r.newEphemeralRunner(runnerSet),
index: i,
}
}
close(jobs)
// Wait for completion
go func() {
wg.Wait()
close(results)
}()
// Collect results and handle errors
var errs []error
created := 0
for result := range results {
if result.err != nil {
errs = append(errs, result.err)
} else {
created++
if created%10 == 0 || created == count {
log.Info("Runner creation progress", "created", created, "total", count)
}
}
}
return multierr.Combine(errs...)
}
```
## Success Metrics
1. **Performance**:
- Target: Create 100 runners in < 30 seconds (vs current ~5 minutes)
- Reduce time complexity from O(n) to O(n/c) where c = concurrency
2. **Reliability**:
- Handle partial failures gracefully
- No runner leaks on error
- Proper cleanup on controller restart
3. **Observability**:
- Clear progress tracking
- Detailed metrics and logs
- Actionable error messages
4. **Compatibility**:
- Backward compatible by default
- Configurable for different environments
- No breaking changes to CRDs
## Risk Mitigation
1. **API Server Overload**: Implement rate limiting and backoff
2. **Resource Exhaustion**: Add memory/CPU limits and monitoring
3. **Partial Failures**: Implement proper rollback and cleanup
4. **Race Conditions**: Use proper locking and atomic operations
## Testing Requirements
- Unit test coverage > 80%
- Integration tests for all scenarios
- Performance regression tests
- Documentation for all new features
- Backward compatibility tests
## Rollout Plan
1. **Alpha**: Deploy to dev environment with conservative defaults
2. **Beta**: Test with select users, gather feedback
3. **GA**: Full rollout with documentation and migration guide
## Dependencies
- No changes to CRDs required
- Compatible with existing Actions Runner Controller versions
- Requires Go 1.21+ for errors.Join support (already in use)
## Timeline Estimate
- Total Duration: 5-6 weeks
- Developer Resources: 1-2 engineers
- Review & Testing: Additional 1 week
## Notes for Implementation
1. Consider using `golang.org/x/sync/errgroup` for cleaner error handling
2. Leverage existing `multierr` package for error aggregation
3. Use context cancellation for proper cleanup
4. Consider implementing circuit breaker pattern for API failures
5. Add feature flag to enable/disable parallel creation