diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 00000000..dd34607e --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,234 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## Repository Information + +**THIS IS A FORK**: This repository is a fork of the upstream `actions/actions-runner-controller` repository. +- **Fork Owner**: `justanotherspy` +- **Upstream**: `actions/actions-runner-controller` +- **IMPORTANT**: Always push changes to the fork (`justanotherspy/actions-runner-controller`), NEVER to upstream +- **Default Branch**: Work on feature branches, not directly on master + +## Project Focus + +**IMPORTANT**: We work EXCLUSIVELY on the NEW Runner Scale Set Controller mode, NOT the legacy mode. + +- **NEW Mode ONLY**: Autoscaling Runner Sets using `actions.github.com` API group +- **NO Legacy Development**: Do not work on `actions.summerwind.net` resources +- **NO Cert-Manager**: The new mode doesn't use webhooks or cert-manager +- **GitHub Username**: `justanotherspy` (for test repositories) +- **Docker Hub Account**: `danielschwartzlol` + +## Development Configuration + +- **Controller Image**: `danielschwartzlol/gha-runner-scale-set-controller` +- **Runner Image**: Use official `ghcr.io/actions/actions-runner` +- **Helm Charts** (Version 0.12.1): + - Controller: `gha-runner-scale-set-controller` + - Runner Set: `gha-runner-scale-set` +- **Helm Chart Version**: Always use `0.12.1` (latest as of this setup) +- **Local Development**: Use Kind cluster without cert-manager (see ENV_SETUP.md) +- **Test Repository**: `justanotherspy/test-runner-repo` + +## Key Components (New Mode Only) + +### Controllers to Focus On + +**AutoscalingRunnerSetReconciler** (`controllers/actions.github.com/autoscalingrunnerset_controller.go`) +- Manages runner scale set lifecycle +- Creates EphemeralRunnerSets based on demand +- Handles runner group configuration + +**EphemeralRunnerSetReconciler** 
(`controllers/actions.github.com/ephemeralrunnerset_controller.go`) +- **CRITICAL FOR OPTIMIZATION**: Contains sequential runner creation loop +- `createEphemeralRunners()` method at lines 359-386 needs parallelization +- Manages replicas of EphemeralRunners + +**EphemeralRunnerReconciler** (`controllers/actions.github.com/ephemeralrunner_controller.go`) +- Manages individual runner pods +- Handles runner registration with GitHub + +**AutoscalingListenerReconciler** (`controllers/actions.github.com/autoscalinglistener_controller.go`) +- Manages the listener pod that long-polls the GitHub Actions service for job events (no inbound webhooks) +- Triggers scaling events + +### Resource Hierarchy (New Mode) + +```text +AutoscalingRunnerSet + ├── AutoscalingListener (long-polling listener pod) + └── EphemeralRunnerSet + └── EphemeralRunner (Pod) +``` + +## Performance Optimization Focus + +### Current Problem +- `EphemeralRunnerSetReconciler.createEphemeralRunners()` creates runners sequentially +- Wall-clock time grows linearly, O(n), where n = number of runners +- Bottleneck location: `controllers/actions.github.com/ephemeralrunnerset_controller.go:362-383` + +### Optimization Goal +- Implement parallel runner creation with worker pool pattern +- Target: 10x improvement (create 100 runners in < 30 seconds) +- Configurable concurrency (default: 10 parallel creations) + +## Build Commands + +```bash +# Build controller for runner scale set mode +make docker-build +docker tag danielschwartzlol/actions-runner-controller:dev \ + danielschwartzlol/gha-runner-scale-set-controller:dev + +# Run controller locally in scale set mode +make run-scaleset + +# Generate CRDs (only actions.github.com ones matter) +make manifests + +# Run tests for new mode controllers +go test -v ./controllers/actions.github.com/... +``` + +## Testing Commands + +```bash +# Unit tests for runner scale set controllers +go test -v ./controllers/actions.github.com/... 
-run TestEphemeralRunnerSet + +# Integration tests for new mode +KUBEBUILDER_ASSETS="$(setup-envtest use 1.28 -p path)" \ + go test -v ./controllers/actions.github.com/... + +# Benchmark runner creation +go test -bench=BenchmarkCreateEphemeralRunners ./controllers/actions.github.com/... +``` + +## Local Development Workflow + +```bash +# 1. Create Kind cluster (no cert-manager needed) +kind create cluster --name arc-dev + +# 2. Build and load controller +VERSION=dev make docker-build +docker tag danielschwartzlol/actions-runner-controller:dev \ + danielschwartzlol/gha-runner-scale-set-controller:dev +kind load docker-image danielschwartzlol/gha-runner-scale-set-controller:dev --name arc-dev + +# 3. Install controller with Helm (v0.12.1) into the arc-systems namespace +helm install arc-controller --namespace arc-systems --create-namespace \ + oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \ + --version 0.12.1 \ + --set image.repository=danielschwartzlol/gha-runner-scale-set-controller \ + --set image.tag=dev \ + --set image.pullPolicy=Never + +# 4. 
Deploy runner scale set (v0.12.1) into the arc-runners namespace +helm install arc-runner-set --namespace arc-runners --create-namespace \ + oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \ + --version 0.12.1 \ + --set githubConfigUrl="https://github.com/justanotherspy/test-runner-repo" \ + --set githubConfigSecret="github-auth" +``` + +## Important Files for Optimization + +### Primary Focus +- `controllers/actions.github.com/ephemeralrunnerset_controller.go` - Contains sequential creation logic +- `controllers/actions.github.com/ephemeralrunner_controller.go` - Individual runner management +- `controllers/actions.github.com/autoscalingrunnerset_controller.go` - Scale set orchestration + +### Configuration +- `charts/gha-runner-scale-set-controller/` - Controller Helm chart +- `charts/gha-runner-scale-set/` - Runner set Helm chart +- `cmd/ghalistener/` - Listener that long-polls the GitHub Actions service for job events + +### Tests +- `controllers/actions.github.com/ephemeralrunnerset_controller_test.go` +- `controllers/actions.github.com/ephemeralrunner_controller_test.go` + +## Code Patterns for New Mode + +### Creating Resources in Parallel +```go +// Example pattern for parallel creation +func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel( + ctx context.Context, + runnerSet *v1alpha1.EphemeralRunnerSet, + count int, + log logr.Logger, +) error { + workers := 10 // Configurable + jobs := make(chan int, count) + results := make(chan error, count) + + // Start workers + for w := 0; w < workers; w++ { + go r.createRunnerWorker(ctx, runnerSet, jobs, results, log) + } + + // Queue jobs + for i := 0; i < count; i++ { + jobs <- i + } + close(jobs) + + // Collect results + var errs []error + for i := 0; i < count; i++ { + if err := <-results; err != nil { + errs = append(errs, err) + } + } + + return multierr.Combine(errs...) 
+} +``` + +## GitHub API Integration + +- Use `github.Client` interface for testability +- Implement exponential backoff for rate limiting +- Runner scale sets register with GitHub using JIT configuration +- Default runner group: "default" + +## DO NOT Work On + +- **Legacy Controllers**: Anything in `controllers/actions.summerwind.net/` +- **Cert-Manager**: Not used in new mode +- **Webhooks**: New mode uses listener pod instead +- **RunnerDeployment**: Legacy resource type +- **HorizontalRunnerAutoscaler**: Legacy autoscaling + +## Testing Performance Improvements + +```bash +# Create many runners to test parallel creation +kubectl -n arc-runners patch ephemeralrunnerset \ + --type merge -p '{"spec":{"replicas":100}}' + +# Monitor creation time +time kubectl -n arc-runners wait --for=condition=Ready \ + ephemeralrunners --all --timeout=600s + +# Check controller metrics +kubectl port-forward -n arc-systems service/arc-controller 8080:80 +curl http://localhost:8080/metrics | grep ephemeral_runner_creation_duration +``` + +## Key Metrics to Track + +- `ephemeral_runner_creation_duration_seconds` - Time to create each runner +- `ephemeral_runner_set_replicas` - Current vs desired replicas +- `controller_runtime_reconcile_time_seconds` - Reconciliation performance + +## Files Referenced + +@ENV_SETUP.md - Complete setup guide for new mode +@tasks.md - Performance optimization task plan +@controllers/actions.github.com/ephemeralrunnerset_controller.go +@controllers/actions.github.com/ephemeralrunner_controller.go +@controllers/actions.github.com/autoscalingrunnerset_controller.go \ No newline at end of file diff --git a/ENV_SETUP.md b/ENV_SETUP.md new file mode 100644 index 00000000..cbb66ca3 --- /dev/null +++ b/ENV_SETUP.md @@ -0,0 +1,382 @@ +# Local Development Environment Setup - Runner Scale Set Controller + +This guide sets up a local development environment for the **NEW** GitHub Actions Runner Scale Set Controller (not the legacy mode). 
+ +## Important Notes + +- **NO cert-manager required** - The new mode doesn't use webhooks +- **NO legacy controller** - We only work with the new `actions.github.com` API group +- Uses separate Helm charts: `gha-runner-scale-set-controller` and `gha-runner-scale-set` +- GitHub username: `justanotherspy` +- Docker Hub account: `danielschwartzlol` + +## Prerequisites + +### Required Tools + +1. **Docker** - For running containers and Kind cluster + + ```bash + # Ubuntu/Debian + sudo apt-get update + sudo apt-get install docker.io + sudo usermod -aG docker $USER + # Log out and back in for group changes to take effect + ``` + +2. **Kind** - Kubernetes in Docker + + ```bash + # Install Kind + curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64 + chmod +x ./kind + sudo mv ./kind /usr/local/bin/kind + ``` + +3. **kubectl** - Kubernetes CLI + + ```bash + curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" + chmod +x kubectl + sudo mv kubectl /usr/local/bin/ + ``` + +4. **Helm** - Kubernetes package manager + + ```bash + curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash + ``` + +5. 
**Go** - For building the controller (1.21+) + + ```bash + # Install Go 1.21 + wget https://go.dev/dl/go1.21.5.linux-amd64.tar.gz + sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.21.5.linux-amd64.tar.gz + export PATH=$PATH:/usr/local/go/bin + echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.bashrc + ``` + +### Environment Variables + +Add these to your `.bashrc` or `.zshrc`: + +```bash +# Docker Hub Configuration +export DOCKER_USER="danielschwartzlol" +export CONTROLLER_IMAGE="${DOCKER_USER}/gha-runner-scale-set-controller" +export RUNNER_IMAGE="ghcr.io/actions/actions-runner" # Official runner image + +# GitHub Configuration +export GITHUB_TOKEN="your-github-pat-token-here" +export GITHUB_USERNAME="justanotherspy" + +# Or for GitHub App authentication (recommended): +# export APP_ID="your-app-id" +# export INSTALLATION_ID="your-installation-id" +# export PRIVATE_KEY_FILE_PATH="/path/to/private-key.pem" + +# Test Repository Configuration +export TEST_REPO="${GITHUB_USERNAME}/test-runner-repo" +export TEST_ORG="" # Optional: Your test organization + +# Development Settings +export VERSION="dev" +export CLUSTER_NAME="arc-dev" +``` + +## Step 1: Build the Controller Image + +```bash +# Build the controller image with scale set mode +make docker-build + +# Tag it for our use +docker tag ${DOCKER_USER}/actions-runner-controller:${VERSION} \ + ${CONTROLLER_IMAGE}:${VERSION} +``` + +## Step 2: Create Kind Cluster + +Create a simple Kind cluster (no special config needed for new mode): + +```bash +# Create Kind cluster +cat < \ + --type merge \ + -p '{"spec":{"replicas":50}}' + +# Monitor creation time +time kubectl -n arc-runners wait --for=condition=Ready ephemeralrunners --all --timeout=600s + +# Check metrics +kubectl -n arc-systems port-forward service/arc-controller-gha-rs-controller 8080:80 +curl http://localhost:8080/metrics | grep ephemeral +``` + +## Debugging + +### Enable Verbose Logging + +```bash +# Update controller deployment with debug 
logging +kubectl -n arc-systems edit deployment arc-controller-gha-rs-controller + +# Add to container args: +# - "--log-level=debug" +``` + +### Common Commands + +```bash +# Get core resources (custom resources such as EphemeralRunner may need to be listed by kind) +kubectl get all -n arc-systems +kubectl get all -n arc-runners + +# Describe runner set +kubectl -n arc-runners describe autoscalingrunnerset + +# Get events +kubectl -n arc-runners get events --sort-by='.lastTimestamp' + +# Port forward for pprof debugging +kubectl -n arc-systems port-forward deployment/arc-controller-gha-rs-controller 6060:6060 +go tool pprof http://localhost:6060/debug/pprof/profile +``` + +## Performance Testing Script + +```bash +#!/bin/bash +# perf-test.sh + +NAMESPACE="arc-runners" +REPLICAS="${1:-100}" + +echo "Testing creation of ${REPLICAS} runners..." + +# Record start time +START=$(date +%s) + +# Scale up (`get -o name` already returns TYPE/NAME, so the type is not repeated) +kubectl -n ${NAMESPACE} patch "$(kubectl -n ${NAMESPACE} get ephemeralrunnersets -o name | head -1)" \ + --type merge \ + -p "{\"spec\":{\"replicas\":${REPLICAS}}}" + +# Wait for all runners +kubectl -n ${NAMESPACE} wait --for=condition=Ready ephemeralrunners --all --timeout=600s + +# Record end time +END=$(date +%s) +DURATION=$((END - START)) + +echo "Created ${REPLICAS} runners in ${DURATION} seconds" +echo "Average time per runner: $(awk "BEGIN {printf \"%.2f\", ${DURATION} / ${REPLICAS}}") seconds" + +# Get runner creation events +kubectl -n ${NAMESPACE} get events --field-selector reason=Created | grep EphemeralRunner +``` + +## Cleanup + +```bash +# Delete runner scale set +helm uninstall arc-runner-set -n arc-runners + +# Delete controller +helm uninstall arc-controller -n arc-systems + +# Delete namespaces +kubectl delete namespace arc-systems arc-runners + +# Delete Kind cluster +kind delete cluster --name ${CLUSTER_NAME} +``` + +## Troubleshooting + +### Runner Scale Set Not Creating Runners + +```bash +# Check if runner scale set is registered +kubectl -n arc-runners get autoscalingrunnerset -o yaml | grep runnerScaleSetId + +# Check GitHub API connectivity +kubectl 
-n arc-systems exec -it deployment/arc-controller-gha-rs-controller -- \ + curl -X POST -H "Authorization: token ${GITHUB_TOKEN}" \ + https://api.github.com/repos/${TEST_REPO}/actions/runners/registration-token +``` + +### Runners Not Picking Up Jobs + +```bash +# Ensure `runs-on` in your workflow matches the runner scale set name +# In workflow file: +# runs-on: arc-runner-set # Helm installation name of the runner scale set + +# Check runner registration +kubectl -n arc-runners logs -l app.kubernetes.io/component=runner --tail=100 +``` + +## Key Differences from Legacy Mode + +1. **No Cert-Manager**: New mode doesn't use admission webhooks +2. **Different CRDs**: Uses `AutoscalingRunnerSet`, `EphemeralRunnerSet`, `EphemeralRunner` +3. **Separate Helm Charts**: `gha-runner-scale-set-controller` and `gha-runner-scale-set` +4. **Listener Pod**: Runs in controller namespace, long-polls the GitHub Actions service for job events (no inbound webhooks) +5. **No Runner Deployment**: Only uses ephemeral runners + +## Resources + +- [Runner Scale Set Documentation](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller) +- [ARC Helm Charts](https://github.com/actions/actions-runner-controller/tree/master/charts) +- [Kind Documentation](https://kind.sigs.k8s.io/) \ No newline at end of file diff --git a/tasks.md b/tasks.md new file mode 100644 index 00000000..95405d40 --- /dev/null +++ b/tasks.md @@ -0,0 +1,246 @@ +# Runner Scale Set Controller Performance Optimization + +## Problem Analysis + +Based on analysis of the codebase, the runner scale set controller currently spawns runners **sequentially** in the `EphemeralRunnerSetReconciler.createEphemeralRunners()` method at `controllers/actions.github.com/ephemeralrunnerset_controller.go:359-386`. + +### Current Sequential Implementation Issues: +1. **Linear time complexity O(n)**: Creating n runners takes n sequential API calls +2. 
**Blocking loop**: Each runner creation blocks until the API call completes +3. **Poor scalability**: Large scale-ups (e.g., 100+ runners) take minutes +4. **Resource underutilization**: Controller pod doesn't leverage available CPU/memory for parallel operations + +### Key Bottlenecks Identified: +- **EphemeralRunnerSet Controller** (`ephemeralrunnerset_controller.go:362-383`): Sequential for-loop creating runners one by one +- **API Call Latency**: Each `r.Create(ctx, ephemeralRunner)` call blocks for network roundtrip +- **No batching**: Individual API calls instead of batch operations +- **No concurrency**: Single-threaded execution path + +## Proposed Task List for Performance Improvement + +### Phase 1: Research & Design (Week 1) +- [ ] **Task 1.1**: Benchmark current performance + - Measure time to create 10, 50, 100, 500 runners + - Profile CPU/memory usage during scale-up + - Document baseline metrics for comparison + +- [ ] **Task 1.2**: Research Kubernetes client-go patterns for concurrent resource creation + - Study controller-runtime workqueue patterns + - Investigate rate limiting considerations + - Review best practices for bulk operations + +- [ ] **Task 1.3**: Design concurrent runner creation architecture + - Define optimal concurrency level (suggest: configurable, default 10) + - Design error handling and retry strategy + - Plan backward compatibility approach + +### Phase 2: Implementation (Week 2-3) + +- [ ] **Task 2.1**: Refactor `createEphemeralRunners` for parallel execution + ```go + // Suggested approach: + // - Use worker pool pattern with configurable concurrency + // - Implement error aggregation + // - Add progress tracking + ``` + +- [ ] **Task 2.2**: Implement configurable concurrency controls + - Add `--runner-creation-concurrency` flag (default: 10) + - Add `--runner-creation-timeout` flag (default: 30s) + - Environment variable overrides for containerized deployments + +- [ ] **Task 2.3**: Add comprehensive error handling + - 
Implement exponential backoff for failed creations + - Partial success handling (some runners created, some failed) + - Detailed error reporting and metrics + +- [ ] **Task 2.4**: Implement progress tracking and observability + - Add prometheus metrics for creation time per runner + - Log progress at intervals (e.g., "Created 50/100 runners") + - Add events to AutoscalingRunnerSet for visibility + +### Phase 3: Testing (Week 3-4) + +- [ ] **Task 3.1**: Unit tests for concurrent creation + - Test with mock client + - Verify error handling + - Test concurrency limits + - Test partial failures + +- [ ] **Task 3.2**: Integration tests + - Test with real Kubernetes API + - Verify resource creation order + - Test rollback on failure + - Test with various concurrency levels + +- [ ] **Task 3.3**: Load testing + - Test creating 100+ runners simultaneously + - Monitor API server impact + - Measure improvement vs baseline + - Test with rate limiting + +- [ ] **Task 3.4**: Chaos testing + - Test with network failures + - Test with API server throttling + - Test with partial quota exhaustion + - Test controller restart during creation + +### Phase 4: Optimization & Tuning (Week 4-5) + +- [ ] **Task 4.1**: Implement adaptive concurrency + - Start with low concurrency, increase based on success rate + - Back off on errors or throttling + - Self-tuning based on cluster capacity + +- [ ] **Task 4.2**: Add bulk creation API support (if available) + - Research if Actions API supports bulk runner registration + - Implement batch registration if supported + - Fall back to parallel individual creation + +- [ ] **Task 4.3**: Optimize resource creation + - Pre-compute runner configurations + - Cache common data (secrets, configs) + - Minimize API calls per runner + +### Phase 5: Documentation & Rollout (Week 5-6) + +- [ ] **Task 5.1**: Document configuration options + - Update CLAUDE.md with new flags + - Add tuning guide for different cluster sizes + - Document performance improvements 
+ +- [ ] **Task 5.2**: Create migration guide + - Document any breaking changes + - Provide upgrade path + - Include rollback procedures + +- [ ] **Task 5.3**: Performance report + - Before/after benchmarks + - Scalability analysis + - Recommendations for different use cases + +## Implementation Details + +### Suggested Code Structure + +```go +// ephemeralrunnerset_controller.go + +type runnerCreationJob struct { + runner *v1alpha1.EphemeralRunner + index int + err error +} + +func (r *EphemeralRunnerSetReconciler) createEphemeralRunnersParallel( + ctx context.Context, + runnerSet *v1alpha1.EphemeralRunnerSet, + count int, + log logr.Logger, +) error { + concurrency := r.getConfiguredConcurrency() // Default: 10 + + jobs := make(chan runnerCreationJob, count) + results := make(chan runnerCreationJob, count) + + // Start workers + var wg sync.WaitGroup + for i := 0; i < concurrency; i++ { + wg.Add(1) + go r.runnerCreationWorker(ctx, runnerSet, jobs, results, &wg, log) + } + + // Queue jobs + for i := 0; i < count; i++ { + jobs <- runnerCreationJob{ + runner: r.newEphemeralRunner(runnerSet), + index: i, + } + } + close(jobs) + + // Wait for completion + go func() { + wg.Wait() + close(results) + }() + + // Collect results and handle errors + var errs []error + created := 0 + for result := range results { + if result.err != nil { + errs = append(errs, result.err) + } else { + created++ + if created%10 == 0 || created == count { + log.Info("Runner creation progress", "created", created, "total", count) + } + } + } + + return multierr.Combine(errs...) +} +``` + +## Success Metrics + +1. **Performance**: + - Target: Create 100 runners in < 30 seconds (vs current ~5 minutes) + - Reduce wall-clock creation time from about n sequential API round trips to about n/c, where c = concurrency (asymptotically still O(n)) + +2. **Reliability**: + - Handle partial failures gracefully + - No runner leaks on error + - Proper cleanup on controller restart + +3. 
**Observability**: + - Clear progress tracking + - Detailed metrics and logs + - Actionable error messages + +4. **Compatibility**: + - Backward compatible by default + - Configurable for different environments + - No breaking changes to CRDs + +## Risk Mitigation + +1. **API Server Overload**: Implement rate limiting and backoff +2. **Resource Exhaustion**: Add memory/CPU limits and monitoring +3. **Partial Failures**: Implement proper rollback and cleanup +4. **Race Conditions**: Use proper locking and atomic operations + +## Testing Requirements + +- Unit test coverage > 80% +- Integration tests for all scenarios +- Performance regression tests +- Documentation for all new features +- Backward compatibility tests + +## Rollout Plan + +1. **Alpha**: Deploy to dev environment with conservative defaults +2. **Beta**: Test with select users, gather feedback +3. **GA**: Full rollout with documentation and migration guide + +## Dependencies + +- No changes to CRDs required +- Compatible with existing Actions Runner Controller versions +- `errors.Join` requires Go 1.20+; the Go 1.21+ toolchain already in use satisfies this + +## Timeline Estimate + +- Total Duration: 5-6 weeks +- Developer Resources: 1-2 engineers +- Review & Testing: Additional 1 week + +## Notes for Implementation + +1. Consider using `golang.org/x/sync/errgroup` for cleaner error handling +2. Leverage existing `multierr` package for error aggregation +3. Use context cancellation for proper cleanup +4. Consider implementing circuit breaker pattern for API failures +5. Add feature flag to enable/disable parallel creation \ No newline at end of file