- Promote runner-class from preferred (weight 100) to required node
affinity term, matching the actual workflow pod's scheduling
- Use DoesNotExist operator when RunnerClass is unset
- AND-combine runner-class and GPU label in same matchExpressions block
- Add table-driven tests for all runner-class + GPU combinations
- Bump chart versions to jeanschmidt.9
A preferred runner-class term let placeholders land on non-matching
nodes, where the real workflow pod (which uses a required term) could
never follow, wasting the reservation. Making the term required ensures
placeholders only occupy nodes the actual pod can schedule onto.
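A minimal sketch of how the required term's expressions might be
assembled. `Requirement` is a simplified stand-in for the Kubernetes
NodeSelectorRequirement type so the example needs only the standard
library; the function name `workflowRequirements`, the `gpu` flag, and
the choice of `Exists` for the GPU label are assumptions, not the code
from this change. Expressions inside one matchExpressions block are
ANDed by the scheduler.

```go
package main

import "fmt"

// Requirement is a simplified stand-in for corev1.NodeSelectorRequirement.
type Requirement struct {
	Key      string
	Operator string // "In", "DoesNotExist", or "Exists"
	Values   []string
}

const (
	runnerClassLabel = "osdc.io/runner-class"
	gpuLabel         = "nvidia.com/gpu"
)

// workflowRequirements (hypothetical name) returns the expressions for a
// single required node-selector term. All returned expressions must match.
func workflowRequirements(runnerClass string, gpu bool) []Requirement {
	var reqs []Requirement
	if runnerClass == "" {
		// No class configured: only nodes without the label qualify.
		reqs = append(reqs, Requirement{Key: runnerClassLabel, Operator: "DoesNotExist"})
	} else {
		reqs = append(reqs, Requirement{
			Key:      runnerClassLabel,
			Operator: "In",
			Values:   []string{runnerClass},
		})
	}
	if gpu {
		// Assumption: the GPU label is matched by presence only.
		reqs = append(reqs, Requirement{Key: gpuLabel, Operator: "Exists"})
	}
	return reqs
}

func main() {
	for _, r := range workflowRequirements("", true) {
		fmt.Printf("%s %s %v\n", r.Key, r.Operator, r.Values)
	}
}
```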
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Add CleanupBroken to PlaceholderManager to find and delete pairs
in which one of the two pods was evicted or deleted
- Integrate broken-pair cleanup into reconcileProvisioning
between ListPairs and adjustPairs so replacement happens
in the same cycle
- Add "broken" delete reason with Prometheus metrics
- Add unit tests for both successful and failed cleanup
- Bump Helm chart versions to jeanschmidt.8
Without this fix, a broken pair (one pod missing) would
count as healthy in currentPairs, causing the provisioner
to believe capacity was at the desired level. Pre-warmed
capacity would stay reduced until the next full listener
restart.
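A minimal sketch of the detection step, assuming each placeholder pod
carries a slot identifier and a role. The `Pod` struct and the
`brokenSlots` helper are hypothetical stand-ins, not the
PlaceholderManager code.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a simplified stand-in for a placeholder pod. Slot and Role are
// hypothetical fields for whatever labels identify a pair and its halves.
type Pod struct {
	Name string
	Slot string
	Role string // "runner" or "workflow"
}

// brokenSlots returns, in sorted order, the slots where exactly one of the
// two expected roles is present.
func brokenSlots(pods []Pod) []string {
	roles := make(map[string]map[string]bool)
	for _, p := range pods {
		if roles[p.Slot] == nil {
			roles[p.Slot] = make(map[string]bool)
		}
		roles[p.Slot][p.Role] = true
	}
	var broken []string
	for slot, seen := range roles {
		// True only when one role is present and the other is missing.
		if seen["runner"] != seen["workflow"] {
			broken = append(broken, slot)
		}
	}
	sort.Strings(broken)
	return broken
}

func main() {
	pods := []Pod{
		{Name: "a-runner", Slot: "a", Role: "runner"},
		{Name: "a-workflow", Slot: "a", Role: "workflow"},
		{Name: "b-runner", Slot: "b", Role: "runner"}, // partner was evicted
	}
	fmt.Println(brokenSlots(pods))
}
```

Running this check before the pair count is adjusted lets the
replacement pair be created in the same cycle.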
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Update gha-runner-scale-set-controller chart and app version
- Update gha-runner-scale-set chart and app version
Reflects recent changes: runner pod listing batched into a
single API call, and MaxBurstCapacity/MaxRunners headroom
support.
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Merge two per-phase countRunnersByPhaseWithRetry calls into one
that returns a map[PodPhase]int, halving API calls to the apiserver
- Drop FieldSelector filtering; group pods by phase in code only
- Guard headroom calculation with max(0, ...) so the clamp never goes negative
- Always log maxBurstCapacity instead of conditionally appending it
The two separate List calls (Running, Pending) were redundant — one
unfiltered List grouped in-code gives the same result with half the
API traffic. The fake clientset never honored FieldSelector anyway,
so removing it also eliminates the test/prod behavioral divergence
noted in the old comment.
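A minimal sketch of the in-code grouping, assuming the caller already
holds the result of one unfiltered List. `Pod` and `PodPhase` are
simplified stand-ins for the Kubernetes types, and `countByPhase` is a
hypothetical name.

```go
package main

import "fmt"

// PodPhase and Pod are simplified stand-ins for the Kubernetes types.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
)

type Pod struct {
	Name  string
	Phase PodPhase
}

// countByPhase groups the pods from a single List result by phase.
// Phases with no pods are simply absent and read back as zero.
func countByPhase(pods []Pod) map[PodPhase]int {
	counts := make(map[PodPhase]int)
	for _, p := range pods {
		counts[p.Phase]++
	}
	return counts
}

func main() {
	pods := []Pod{
		{Name: "r1", Phase: PodRunning},
		{Name: "r2", Phase: PodRunning},
		{Name: "r3", Phase: PodPending},
	}
	counts := countByPhase(pods)
	fmt.Println(counts[PodRunning], counts[PodPending])
}
```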
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Add MaxBurstCapacity config to cap placeholder pairs per
cycle, preventing burst node provisioning from overloading
downstream services (git-cache, Harbor, pypi-cache)
- Fix MaxRunners headroom: subtract real runner pods (Running
+ Pending) so placeholders only fill remaining capacity
- Generalize countRunningRunners to countRunnersByPhase for
reuse across both provisioner and reporter loops
- Add Prometheus gauge for max_burst_capacity config value
- Add tests for burst cap, headroom clamp, pending pod
accounting, runner-count errors, and edge cases
The previous MaxRunners clamp (`min(desired, MaxRunners)`)
allowed up to MaxRunners placeholders ON TOP of real runners,
effectively doubling the cap. The headroom fix subtracts
actual Running+Pending runner pods from MaxRunners before
clamping. On runner-count failure the cycle is skipped
entirely — treating a failed count as 0 would silently
re-open the doubling bug during the failure window.
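A minimal sketch of the headroom clamp and the skip-on-error rule. The
function name `clampToHeadroom`, the `count` callback, and the
`errSkipCycle` sentinel are hypothetical; the burst cap is not modeled
here.

```go
package main

import (
	"errors"
	"fmt"
)

// errSkipCycle (hypothetical) signals that the cycle should be skipped.
var errSkipCycle = errors.New("runner count unavailable, skipping cycle")

// clampToHeadroom limits desired placeholder pairs to the capacity left
// after real runner pods (Running + Pending) are subtracted from maxRunners.
func clampToHeadroom(desired, maxRunners int, count func() (running, pending int, err error)) (int, error) {
	running, pending, err := count()
	if err != nil {
		// Never treat a failed count as zero runners.
		return 0, fmt.Errorf("%w: %v", errSkipCycle, err)
	}
	headroom := maxRunners - (running + pending)
	if headroom < 0 {
		headroom = 0
	}
	if desired > headroom {
		return headroom, nil
	}
	return desired, nil
}

func main() {
	count := func() (int, int, error) { return 6, 2, nil }
	pairs, err := clampToHeadroom(10, 10, count)
	fmt.Println(pairs, err)
}
```

With MaxRunners=10 and 8 real runner pods, the old
`min(desired, MaxRunners)` clamp would still allow 10 placeholders; the
headroom clamp allows 2.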
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Bump gha-runner-scale-set-controller chart and appVersion to 0.14.1-jeanschmidt.6
- Bump gha-runner-scale-set chart and appVersion to 0.14.1-jeanschmidt.6
Follows addition of Prometheus metrics to the capacity monitor
in the previous commit.
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Define CapacityRecorder interface with 15 metrics (gauges, counters,
histograms) covering reconcile loops, HUD API, pair lifecycle, and
placeholder pod phases
- Instrument monitor reconcile paths with duration histograms,
skip/error counters, and last-success timestamps for wedge detection
- Refactor CleanupTimedOut and CleanupOrphans to return per-pair
success/failure counts for accurate metric emission
- Add parity test enforcing Go metric registry stays in sync with
the OSDC chart's listenerMetrics allowlist via YAML fixture
- Wire CapacityRecorder into monitor via WithRecorder option pattern,
defaulting to no-op discard for backward compatibility
Reconcile-last-success gauges are seeded at startup to avoid
spurious wedge alerts during the window between listener restart
and first completed reconcile. Placeholder pod phase gauges emit
all (role x phase) combinations including zeros so Prometheus
gauges decrement correctly when phases empty out.
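A minimal sketch of the zero-filling step for the placeholder pod phase
gauges. The `key` type, the `phaseGauges` helper, and the exact set of
roles and phases are assumptions for illustration.

```go
package main

import "fmt"

// key identifies one gauge series by role and pod phase.
type key struct {
	Role  string
	Phase string
}

// Assumed label values: two placeholder roles and the standard pod phases.
var (
	roles  = []string{"runner", "workflow"}
	phases = []string{"Pending", "Running", "Succeeded", "Failed", "Unknown"}
)

// phaseGauges returns a value for every (role x phase) combination.
// Combinations with no observed pods are present with value zero, so a
// series that was non-zero last cycle is overwritten rather than left stale.
func phaseGauges(observed []key) map[key]int {
	gauges := make(map[key]int, len(roles)*len(phases))
	for _, r := range roles {
		for _, p := range phases {
			gauges[key{Role: r, Phase: p}] = 0
		}
	}
	for _, k := range observed {
		gauges[k]++
	}
	return gauges
}

func main() {
	gauges := phaseGauges([]key{
		{Role: "runner", Phase: "Running"},
		{Role: "workflow", Phase: "Pending"},
	})
	fmt.Println(len(gauges), gauges[key{Role: "runner", Phase: "Running"}])
}
```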
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Remove osdc.io/runner-class nodeSelector and toleration from
runner placeholder pods; runner-pool nodes carry no such label
or taint, causing placeholders to stay Pending forever
- Add regression test TestRunnerPlaceholder_NeverIncludesRunnerClass
covering both RunnerClass-set and RunnerClass-empty configs
- Add TestWorkflowPlaceholder_StillUsesRunnerClassInPreferredAffinity
to verify workflow placeholders still use runner-class correctly
- Update existing tests to assert runner-class is always absent
- Bump chart versions to jeanschmidt.5
The runner-pool fleet (e.g. c7i-runner) is a shared cluster-wide pool
that does not carry osdc.io/runner-class labels or taints. Only workflow
placeholders use runner-class, via preferred node affinity (weight 100).
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Add RunnerNodeFleet config to place runner placeholders on the
cluster-wide runner pool instead of the per-scale-set workflow pool
- Change git-cache-not-ready toleration to `operator: Exists` to match
the unconditional startupTaint on runner-pool nodes
- Make Config.Validate() return an error and require RunnerNodeFleet
when capacity-aware mode is enabled
- Add split-fleet tests verifying runner and workflow placeholders
never conflate each other's node-fleet values
- Bump chart versions to jeanschmidt.4
Runner and workflow pods target different node pools (e.g. c7i-runner
vs g4dn). Previously both used NodeFleet, which silently landed runner
placeholders on the wrong pool — defeating the topology separation the
placeholder system is meant to provide.
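A minimal sketch of the validation rule. The `Config` struct is reduced
to the fields needed here, and the `Enabled` field name and error text
are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a reduced stand-in for the capacity config. Enabled is a
// hypothetical field representing capacity-aware mode.
type Config struct {
	Enabled         bool
	NodeFleet       string // workflow pool
	RunnerNodeFleet string // cluster-wide runner pool
}

// Validate rejects a capacity-aware config that does not name the runner
// pool, instead of silently falling back to NodeFleet.
func (c Config) Validate() error {
	if !c.Enabled {
		return nil
	}
	if c.RunnerNodeFleet == "" {
		return errors.New("RunnerNodeFleet is required when capacity-aware mode is enabled")
	}
	return nil
}

func main() {
	fmt.Println(Config{Enabled: true, NodeFleet: "g4dn"}.Validate())
	fmt.Println(Config{Enabled: true, NodeFleet: "g4dn", RunnerNodeFleet: "c7i-runner"}.Validate())
}
```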
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Add sleepArg() that bounds placeholder pod lifetime to 1.5x
PlaceholderTimeout, preventing pod leaks if listener crashes
- Fix GPU node selector label from nvidia.com/gpu.present to
nvidia.com/gpu
- Bump chart versions to 0.14.1-jeanschmidt.2
- Add tests for timeout-based sleep, infinity fallback, and
sub-second floor-to-1 behavior
Previously placeholder pods ran `sleep infinity`, meaning a listener
crash would leave them running forever. The new defensive sleep
self-terminates pods after 1.5x the configured timeout, acting as a
safety net behind the normal CleanupAll/CleanupTimedOut path.
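A minimal sketch of the bounded sleep argument. The signature and the
condition for the infinity fallback (a non-positive timeout) are
assumptions; only the 1.5x factor and the floor to 1 second come from
the description above.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// sleepArg returns the argument passed to `sleep` in the placeholder pod.
// Assumed behavior: a non-positive timeout falls back to "infinity";
// otherwise the pod sleeps for 1.5x the timeout in whole seconds, with
// sub-second results floored to 1.
func sleepArg(timeout time.Duration) string {
	if timeout <= 0 {
		return "infinity"
	}
	secs := int64((timeout * 3 / 2) / time.Second)
	if secs < 1 {
		secs = 1
	}
	return strconv.FormatInt(secs, 10)
}

func main() {
	fmt.Println(sleepArg(10 * time.Minute))
	fmt.Println(sleepArg(500 * time.Millisecond))
	fmt.Println(sleepArg(0))
}
```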
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Add OCI source label to Dockerfile for jeanschmidt fork
- Bump chart/app versions to 0.14.1-jeanschmidt.1
- Redirect controller image repo to ghcr.io/jeanschmidt
Enables building and deploying the placeholder pod POC
from the forked registry for testing.
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Decouple provisioning (create/delete pairs) from capacity
reporting (count running pairs/runners, call setMaxRunners)
into independent loops with separate tick intervals
- Add exponential-backoff retry for K8s API and HUD API calls
with per-loop retry budgets (reporter: 2, provisioner: 3)
- Use atomic.Int64 for slotCounter to support concurrent access
- Add ordered shutdown: reporter stops before placeholder
cleanup to prevent a flash of reportedCapacity=0
The single reconcile loop coupled provisioning and reporting at
the same interval (30s). Reporting needs to react faster when
placeholder pods become Running (node warm-up), so it now ticks
independently at 5s (configurable via CAPACITY_AWARE_REPORT_INTERVAL).
Retries prevent transient K8s API failures from causing missed
cycles or premature capacity drops.
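A minimal sketch of an exponential-backoff retry helper with a per-loop
budget. The `retry` name, its signature, and the reading of the budget
as "retries after the first attempt" are assumptions; context
cancellation is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a hypothetical error used to simulate a flaky API call.
var errTransient = errors.New("transient API failure")

// retry calls fn once and then up to `retries` more times, doubling the
// delay after each failure. It returns nil on the first success, or the
// last error wrapped with the retry count.
func retry(retries int, base time.Duration, fn func() error) error {
	var err error
	delay := base
	for attempt := 0; attempt <= retries; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt < retries {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("after %d retries: %w", retries, err)
}

func main() {
	calls := 0
	// Provisioner-style budget of 3 retries; the call succeeds on attempt 3.
	err := retry(3, time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	})
	fmt.Println(calls, err)
}
```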
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Add anchor ConfigMap as OwnerReference for placeholder pods, enabling
Kubernetes GC cascade-delete even though the listener runs in a
different namespace
- Make HUD API URL configurable via CAPACITY_AWARE_HUD_API_URL env var
with sensible default query parameters baked into the URL
- Simplify HUDClient by storing URL as instance field, removing the
need for rewriteTransport test helper
- Extend CleanupOrphans to also delete orphaned anchor ConfigMaps
The anchor ConfigMap solves cross-namespace ownership: the listener
runs in arc-systems but placeholders live in arc-runners, so the
listener pod cannot be a direct OwnerReference. The anchor lives in
the placeholder namespace and owns all placeholder pods, so deleting
it triggers automatic GC of all associated pods.
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Add capacity package with monitor, placeholder manager, HUD client,
and env-based config for proactive node pre-warming
- Create runner+workflow placeholder pod pairs that mirror the actual
runner.yaml.tpl specs (nodeSelector, tolerations, affinity, resources)
- Integrate monitor into ghalistener main.go as an optional goroutine
gated by CAPACITY_AWARE_ENABLED env var
- Query PyTorch HUD API for queued job counts to dynamically scale
placeholder count beyond the static proactive capacity baseline
- Add comprehensive test coverage (~1440 lines) including pod spec
fidelity, orphan cleanup, scale-down preference, and idempotency
The listener currently reports a static maxRunners to GitHub, which
means nodes are only provisioned after jobs are already queued. This
causes cold-start latency while Karpenter spins up nodes. Placeholder
pods reserve cluster capacity ahead of demand and are evicted via
priority-class preemption when actual work arrives, so real runner pods
can schedule immediately onto the pre-warmed nodes.
Signed-off-by: Jean Schmidt <contato@jschmidt.me>