# Runner Scale Set Controller Performance Optimization

## Problem Analysis

Based on analysis of the codebase, the runner scale set controller currently spawns runners **sequentially** in the `EphemeralRunnerSetReconciler.createEphemeralRunners()` method at `/controllers/actions.github.com/ephemeralrunnerset_controller.go:359-386`.

### Current Sequential Implementation Issues:

1. **Linear wall-clock time O(n)**: Creating n runners takes n sequential API calls, so total time grows as n times the per-call latency
2. **Blocking loop**: Each runner creation blocks until the API call completes
3. **Poor scalability**: Large scale-ups (e.g., 100+ runners) take minutes
4. **Resource underutilization**: The controller pod doesn't use its available CPU/memory for parallel operations

### Key Bottlenecks Identified:

- **EphemeralRunnerSet Controller** (`ephemeralrunnerset_controller.go:362-383`): Sequential for-loop creating runners one by one
- **API Call Latency**: Each `r.Create(ctx, ephemeralRunner)` call blocks for a network round trip
- **No batching**: Individual API calls instead of batch operations
- **No concurrency**: Single-threaded execution path

## Proposed Task List for Performance Improvement

### Phase 1: Research & Design (Week 1)

- [ ] **Task 1.1**: Benchmark current performance
  - Measure time to create 10, 50, 100, 500 runners
  - Profile CPU/memory usage during scale-up
  - Document baseline metrics for comparison

- [ ] **Task 1.2**: Research Kubernetes client-go patterns for concurrent resource creation
  - Study controller-runtime workqueue patterns
  - Investigate rate limiting considerations
  - Review best practices for bulk operations

- [ ] **Task 1.3**: Design concurrent runner creation architecture
  - Define optimal concurrency level (suggest: configurable, default 10)
  - Design error handling and retry strategy
  - Plan backward compatibility approach

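The baseline measurement in Task 1.1 could start from a harness like the one below. This is a minimal sketch, not the controller's code: `createFunc`, `measureSequential`, and `errSimulated` are hypothetical names, and the 2 ms sleep only simulates an API round trip. Swapping the fake for a real client call is what would produce meaningful numbers.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errSimulated is a placeholder failure used by the fake create call.
var errSimulated = errors.New("simulated create failure")

// createFunc stands in for one runner-creation API call (hypothetical;
// the real controller would issue a Create against the API server).
type createFunc func(i int) error

// measureSequential times n back-to-back calls and counts failures.
func measureSequential(n int, create createFunc) (elapsed time.Duration, failed int) {
	start := time.Now()
	for i := 0; i < n; i++ {
		if err := create(i); err != nil {
			failed++
		}
	}
	return time.Since(start), failed
}

func main() {
	// Simulated 2 ms round trip; every 100th call fails.
	fake := func(i int) error {
		time.Sleep(2 * time.Millisecond)
		if i%100 == 99 {
			return errSimulated
		}
		return nil
	}
	for _, n := range []int{10, 50, 100, 500} {
		elapsed, failed := measureSequential(n, fake)
		fmt.Printf("n=%d elapsed=%v failed=%d\n", n, elapsed, failed)
	}
}
```

Recording the same four sizes before and after the change gives the comparison the later phases rely on.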
### Phase 2: Implementation (Week 2-3)

- [ ] **Task 2.1**: Refactor `createEphemeralRunners` for parallel execution
  ```go
  // Suggested approach:
  // - Use worker pool pattern with configurable concurrency
  // - Implement error aggregation
  // - Add progress tracking
  ```

- [ ] **Task 2.2**: Implement configurable concurrency controls
  - Add `--runner-creation-concurrency` flag (default: 10)
  - Add `--runner-creation-timeout` flag (default: 30s)
  - Environment variable overrides for containerized deployments

- [ ] **Task 2.3**: Add comprehensive error handling
  - Implement exponential backoff for failed creations
  - Partial success handling (some runners created, some failed)
  - Detailed error reporting and metrics

- [ ] **Task 2.4**: Implement progress tracking and observability
  - Add Prometheus metrics for creation time per runner
  - Log progress at intervals (e.g., "Created 50/100 runners")
  - Add events to AutoscalingRunnerSet for visibility

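The flag handling in Task 2.2 might look like the following. It is a sketch under assumptions: the two flag names and defaults come from the task list, while `creationConfig`, `parseCreationConfig`, and the environment variable names `RUNNER_CREATION_CONCURRENCY` and `RUNNER_CREATION_TIMEOUT` are hypothetical. The environment lookup is injected so the precedence rule (environment overrides flags) can be tested without touching the process environment.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strconv"
	"time"
)

// creationConfig holds the two tunables proposed in Task 2.2.
type creationConfig struct {
	Concurrency int
	Timeout     time.Duration
}

// parseCreationConfig reads the flags, then lets environment variables
// override them. The variable names are hypothetical placeholders.
func parseCreationConfig(args []string, getenv func(string) string) (creationConfig, error) {
	fs := flag.NewFlagSet("manager", flag.ContinueOnError)
	concurrency := fs.Int("runner-creation-concurrency", 10, "max parallel runner creations")
	timeout := fs.Duration("runner-creation-timeout", 30*time.Second, "runner creation timeout")
	if err := fs.Parse(args); err != nil {
		return creationConfig{}, err
	}
	cfg := creationConfig{Concurrency: *concurrency, Timeout: *timeout}

	if v := getenv("RUNNER_CREATION_CONCURRENCY"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil {
			return creationConfig{}, fmt.Errorf("RUNNER_CREATION_CONCURRENCY: %w", err)
		}
		cfg.Concurrency = n
	}
	if v := getenv("RUNNER_CREATION_TIMEOUT"); v != "" {
		d, err := time.ParseDuration(v)
		if err != nil {
			return creationConfig{}, fmt.Errorf("RUNNER_CREATION_TIMEOUT: %w", err)
		}
		cfg.Timeout = d
	}
	// A limit below 1 would stall a worker pool, so reject it up front.
	if cfg.Concurrency < 1 {
		return creationConfig{}, fmt.Errorf("concurrency must be >= 1, got %d", cfg.Concurrency)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseCreationConfig(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("concurrency=%d timeout=%s\n", cfg.Concurrency, cfg.Timeout)
}
```

Validating the value at startup keeps a misconfigured deployment from silently creating no runners at all.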
### Phase 3: Testing (Week 3-4)

- [ ] **Task 3.1**: Unit tests for concurrent creation
  - Test with mock client
  - Verify error handling
  - Test concurrency limits
  - Test partial failures

- [ ] **Task 3.2**: Integration tests
  - Test with real Kubernetes API
  - Verify resource creation order
  - Test rollback on failure
  - Test with various concurrency levels

- [ ] **Task 3.3**: Load testing
  - Test creating 100+ runners simultaneously
  - Monitor API server impact
  - Measure improvement vs baseline
  - Test with rate limiting

- [ ] **Task 3.4**: Chaos testing
  - Test with network failures
  - Test with API server throttling
  - Test with partial quota exhaustion
  - Test controller restart during creation

### Phase 4: Optimization & Tuning (Week 4-5)

- [ ] **Task 4.1**: Implement adaptive concurrency
  - Start with low concurrency, increase based on success rate
  - Back off on errors or throttling
  - Self-tuning based on cluster capacity

- [ ] **Task 4.2**: Add bulk creation API support (if available)
  - Research if Actions API supports bulk runner registration
  - Implement batch registration if supported
  - Fall back to parallel individual creation

- [ ] **Task 4.3**: Optimize resource creation
  - Pre-compute runner configurations
  - Cache common data (secrets, configs)
  - Minimize API calls per runner

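The adaptive behaviour described in Task 4.1 could be prototyped with an additive-increase, multiplicative-decrease (AIMD) rule. That choice is an assumption made for this sketch, not something the task list prescribes, and the `aimd` type and its bounds are hypothetical.

```go
package main

import "fmt"

// aimd tracks a concurrency limit that grows by one after a clean batch
// and is halved after a batch that saw errors or throttling.
// min and max are inclusive bounds.
type aimd struct {
	limit, min, max int
}

// observe records the outcome of one batch and returns the new limit.
func (a *aimd) observe(throttled bool) int {
	if throttled {
		a.limit /= 2 // multiplicative decrease
	} else {
		a.limit++ // additive increase
	}
	if a.limit < a.min {
		a.limit = a.min
	}
	if a.limit > a.max {
		a.limit = a.max
	}
	return a.limit
}

func main() {
	a := &aimd{limit: 2, min: 1, max: 20}
	for i, throttled := range []bool{false, false, false, true, false} {
		fmt.Printf("batch %d throttled=%v limit=%d\n", i, throttled, a.observe(throttled))
	}
}
```

Starting from 2, three clean batches raise the limit to 5, one throttled batch drops it to 2, and the next clean batch brings it back to 3. Halving reacts quickly to throttling while the slow climb probes for spare capacity.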
### Phase 5: Documentation & Rollout (Week 5-6)

- [ ] **Task 5.1**: Document configuration options
  - Update CLAUDE.md with new flags
  - Add tuning guide for different cluster sizes
  - Document performance improvements

- [ ] **Task 5.2**: Create migration guide
  - Document any breaking changes
  - Provide upgrade path
  - Include rollback procedures

- [ ] **Task 5.3**: Performance report
  - Before/after benchmarks
  - Scalability analysis
  - Recommendations for different use cases

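## Implementation Details

### Suggested Code Structure

```go
// ephemeralrunnerset_controller.go
// Sketch only: assumes the file imports context, sync,
// github.com/go-logr/logr and go.uber.org/multierr, and that the
// getConfiguredConcurrency and newEphemeralRunner helpers exist.

type runnerCreationJob struct {
	runner *v1alpha1.EphemeralRunner
	index  int
	err    error
}

func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel(
	ctx context.Context,
	runnerSet *v1alpha1.EphemeralRunnerSet,
	count int,
	log logr.Logger,
) error {
	concurrency := r.getConfiguredConcurrency() // Default: 10
	if concurrency < 1 {
		concurrency = 1 // with zero workers the queued jobs would be dropped silently
	}
	if concurrency > count {
		concurrency = count // never start more workers than jobs
	}

	// Both channels are buffered to count, so neither side ever blocks.
	jobs := make(chan runnerCreationJob, count)
	results := make(chan runnerCreationJob, count)

	// Start workers
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go r.runnerCreationWorker(ctx, runnerSet, jobs, results, &wg, log)
	}

	// Queue jobs
	for i := 0; i < count; i++ {
		jobs <- runnerCreationJob{
			runner: r.newEphemeralRunner(runnerSet),
			index:  i,
		}
	}
	close(jobs)

	// Close results once every worker has finished
	go func() {
		wg.Wait()
		close(results)
	}()

	// Collect results and handle errors
	var errs []error
	created := 0
	for result := range results {
		if result.err != nil {
			errs = append(errs, result.err)
		} else {
			created++
			if created%10 == 0 || created == count {
				log.Info("Runner creation progress", "created", created, "total", count)
			}
		}
	}

	// Combine returns nil when errs is empty.
	return multierr.Combine(errs...)
}

// runnerCreationWorker drains jobs until the channel is closed and reports
// every outcome on results. Jobs picked up after ctx is cancelled are
// reported as failed without calling the API.
func (r *EphemeralRunnerSetReconciler) runnerCreationWorker(
	ctx context.Context,
	runnerSet *v1alpha1.EphemeralRunnerSet,
	jobs <-chan runnerCreationJob,
	results chan<- runnerCreationJob,
	wg *sync.WaitGroup,
	log logr.Logger,
) {
	defer wg.Done()
	for job := range jobs {
		if err := ctx.Err(); err != nil {
			job.err = err
		} else {
			job.err = r.Create(ctx, job.runner)
		}
		results <- job
	}
}
```
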
## Success Metrics

1. **Performance**:
   - Target: Create 100 runners in < 30 seconds (vs current ~5 minutes)
   - Reduce wall-clock creation time from O(n) to O(n/c) sequential round trips, where c = concurrency (the total number of API calls stays at n)

2. **Reliability**:
   - Handle partial failures gracefully
   - No runner leaks on error
   - Proper cleanup on controller restart

3. **Observability**:
   - Clear progress tracking
   - Detailed metrics and logs
   - Actionable error messages

4. **Compatibility**:
   - Backward compatible by default
   - Configurable for different environments
   - No breaking changes to CRDs

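The performance target above can be sanity-checked with a back-of-the-envelope model. The sketch assumes every creation call takes the same time and that nothing else (client-side rate limiting, API server throttling) caps throughput. `estimate` is a hypothetical helper, and the 3-second per-call figure is only what "~5 minutes for 100 runners" implies.

```go
package main

import (
	"fmt"
	"time"
)

// estimate returns the idealized wall-clock time for n creations at
// concurrency c when every call takes perCall: ceil(n/c) waves of calls.
func estimate(n, c int, perCall time.Duration) time.Duration {
	if n <= 0 {
		return 0
	}
	if c < 1 {
		c = 1
	}
	waves := (n + c - 1) / c // integer ceiling of n/c
	return time.Duration(waves) * perCall
}

func main() {
	// ~5 minutes for 100 sequential creations implies about 3 s per call.
	perCall := 5 * time.Minute / 100
	for _, c := range []int{1, 10, 20} {
		fmt.Printf("c=%d -> %v\n", c, estimate(100, c, perCall))
	}
}
```

Under these assumptions the suggested default of 10 lands at 30 seconds, on the target rather than under it, so meeting "< 30 seconds" would need either a somewhat higher concurrency or a lower per-call latency than the current baseline implies.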
## Risk Mitigation

1. **API Server Overload**: Implement rate limiting and backoff
2. **Resource Exhaustion**: Add memory/CPU limits and monitoring
3. **Partial Failures**: Implement proper rollback and cleanup
4. **Race Conditions**: Use proper locking and atomic operations

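The backoff mentioned under API Server Overload could follow the shape below. This is a standard-library sketch, not the controller's implementation: `retryWithBackoff`, `delayLog`, and `errTransient` are hypothetical, and the sleep function is injected so the delay schedule can be checked without waiting.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a placeholder for a retryable API error.
var errTransient = errors.New("transient API error")

// delayLog records requested delays instead of sleeping, so the backoff
// schedule can be inspected in tests and demos.
type delayLog struct{ delays []time.Duration }

func (l *delayLog) sleep(d time.Duration) { l.delays = append(l.delays, d) }

// retryWithBackoff calls op up to attempts times, doubling the delay
// after each failure. Production code would pass time.Sleep as sleep.
func retryWithBackoff(attempts int, base time.Duration,
	sleep func(time.Duration), op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 { // no wait after the final attempt
			sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	log := &delayLog{}
	calls := 0
	err := retryWithBackoff(5, 200*time.Millisecond, log.sleep, func() error {
		calls++
		if calls < 4 {
			return errTransient
		}
		return nil
	})
	fmt.Println("err:", err, "calls:", calls, "delays:", log.delays)
}
```

In the controller itself, the retry helpers in `k8s.io/client-go/util/retry` may be a better fit than a hand-rolled loop, and a real version should also stop retrying once the context is cancelled.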
## Testing Requirements

- Unit test coverage > 80%
- Integration tests for all scenarios
- Performance regression tests
- Documentation for all new features
- Backward compatibility tests

## Rollout Plan

1. **Alpha**: Deploy to dev environment with conservative defaults
2. **Beta**: Test with select users, gather feedback
3. **GA**: Full rollout with documentation and migration guide

## Dependencies

- No changes to CRDs required
- Compatible with existing Actions Runner Controller versions
- Requires Go 1.20+ for `errors.Join` support (already in use)

## Timeline Estimate

- Total Duration: 5-6 weeks
- Developer Resources: 1-2 engineers
- Review & Testing: Additional 1 week

## Notes for Implementation

1. Consider using `golang.org/x/sync/errgroup` for cleaner error handling
2. Leverage existing `multierr` package for error aggregation
3. Use context cancellation for proper cleanup
4. Consider implementing circuit breaker pattern for API failures
5. Add feature flag to enable/disable parallel creation