- Add configurable HUDFailureMultiplier (default 3x) to scale
placeholder count when HUD API is unreachable
- New env var CAPACITY_AWARE_HUD_FAILURE_MULTIPLIER with clamp ≥1
in both ConfigFromEnv and Validate
- Fallback formula: ProactiveCapacity * multiplier (replaces the
previous zero-queued-jobs fallback that reduced capacity)
- Add tests for multiplier clamping, MaxRunners cap interaction,
and HUD-disabled path
- Bump chart versions to jeanschmidt.10
When HUD is down we lose visibility into queue depth, so the old
fallback of assuming 0 queued jobs was backwards — it shrank capacity
exactly when we had the least information. The multiplier-based
fallback leans toward over-provisioning instead; existing safety
bounds (MaxRunners headroom, MaxBurstCapacity) still cap the blast
radius.
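The fallback arithmetic above can be sketched as follows. This is a minimal illustration, not the code from this change: the struct name FallbackConfig, the integer multiplier type, the runningRunners parameter, and the "0 means no cap" convention for MaxRunners are all assumptions, and MaxBurstCapacity is left out for brevity.

```go
package main

import "fmt"

// FallbackConfig is a hypothetical subset of the capacity-aware settings.
type FallbackConfig struct {
	ProactiveCapacity    int
	HUDFailureMultiplier int // assumed integer; the default is 3
	MaxRunners           int // assumption: 0 means "no cap" in this sketch
}

// clampMultiplier enforces the >=1 floor on the multiplier.
func clampMultiplier(m int) int {
	if m < 1 {
		return 1
	}
	return m
}

// fallbackPlaceholders returns the placeholder count to use when HUD is
// unreachable: ProactiveCapacity * multiplier, capped by MaxRunners headroom.
func fallbackPlaceholders(cfg FallbackConfig, runningRunners int) int {
	want := cfg.ProactiveCapacity * clampMultiplier(cfg.HUDFailureMultiplier)
	if cfg.MaxRunners > 0 {
		headroom := cfg.MaxRunners - runningRunners
		if headroom < 0 {
			headroom = 0
		}
		if want > headroom {
			want = headroom
		}
	}
	return want
}

func main() {
	// 4 * 3 = 12, headroom is 20 - 5 = 15, so the result stays 12.
	fmt.Println(fallbackPlaceholders(FallbackConfig{ProactiveCapacity: 4, HUDFailureMultiplier: 3, MaxRunners: 20}, 5))
	// Same request, but headroom is only 10 - 5 = 5, so the cap wins.
	fmt.Println(fallbackPlaceholders(FallbackConfig{ProactiveCapacity: 4, HUDFailureMultiplier: 3, MaxRunners: 10}, 5))
}
```

With a multiplier below 1 the sketch clamps to 1, so the fallback never drops below ProactiveCapacity itself.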
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Promote runner-class from preferred (weight 100) to required node
affinity term, matching the actual workflow pod's scheduling
- Use DoesNotExist operator when RunnerClass is unset
- AND-combine runner-class and GPU label in same matchExpressions block
- Add table-driven tests for all runner-class + GPU combinations
- Bump chart versions to jeanschmidt.9
A preferred runner-class term let placeholders land on non-matching
nodes where the real workflow pod (which uses a required term) could
never follow — wasting the reservation. Making it required ensures
placeholders only occupy nodes the actual pod can schedule onto.
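A rough sketch of how the required term could be assembled. The Requirement type is a local stand-in for the Kubernetes node selector requirement so the example runs without dependencies, and the function names and the Exists operator on the GPU label are assumptions. Expressions inside one matchExpressions block are ANDed by the scheduler, which is what the matches helper imitates.

```go
package main

import "fmt"

const (
	runnerClassLabel = "osdc.io/runner-class"
	gpuLabel         = "nvidia.com/gpu"
)

// Requirement is a local stand-in for a node selector requirement.
type Requirement struct {
	Key      string
	Operator string // "In", "Exists", or "DoesNotExist"
	Values   []string
}

// requiredMatchExpressions builds one AND-combined expression list.
// An unset runner class becomes DoesNotExist instead of being omitted.
func requiredMatchExpressions(runnerClass string, gpu bool) []Requirement {
	var exprs []Requirement
	if runnerClass != "" {
		exprs = append(exprs, Requirement{Key: runnerClassLabel, Operator: "In", Values: []string{runnerClass}})
	} else {
		exprs = append(exprs, Requirement{Key: runnerClassLabel, Operator: "DoesNotExist"})
	}
	if gpu {
		// Assumption: the GPU label only needs to be present.
		exprs = append(exprs, Requirement{Key: gpuLabel, Operator: "Exists"})
	}
	return exprs
}

// matches reports whether a node's labels satisfy every expression.
func matches(exprs []Requirement, labels map[string]string) bool {
	for _, e := range exprs {
		v, ok := labels[e.Key]
		switch e.Operator {
		case "In":
			found := false
			for _, want := range e.Values {
				if ok && v == want {
					found = true
				}
			}
			if !found {
				return false
			}
		case "Exists":
			if !ok {
				return false
			}
		case "DoesNotExist":
			if ok {
				return false
			}
		default:
			return false
		}
	}
	return true
}

func main() {
	exprs := requiredMatchExpressions("gpu-large", true)
	node := map[string]string{runnerClassLabel: "gpu-large", gpuLabel: "true"}
	fmt.Println(matches(exprs, node))
}
```

Because both expressions share one block, a node with the right runner class but no GPU label is rejected, which is the behaviour the table-driven tests are described as covering.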
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Add CleanupBroken to PlaceholderManager to find and clean up
slots where one of the two pods was evicted or deleted
- Integrate broken-pair cleanup into reconcileProvisioning
between ListPairs and adjustPairs so replacement happens
in the same cycle
- Add "broken" delete reason with Prometheus metrics
- Add unit tests for both successful and failed cleanup
- Bump Helm chart versions to jeanschmidt.8
Without this fix, a broken pair (one pod missing) would
count as healthy in currentPairs, causing the provisioner
to believe capacity was at the desired level. Pre-warmed
capacity would stay reduced until the next full listener
restart.
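The detection step can be illustrated with a small sketch. It assumes each slot is meant to hold exactly one runner placeholder and one workflow placeholder; the Pod type, its fields, and the brokenSlots name are hypothetical and do not reflect the actual PlaceholderManager API.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a hypothetical, trimmed-down view of a placeholder pod.
type Pod struct {
	Slot string // identifier shared by both pods of a pair
	Role string // "runner" or "workflow"
}

// brokenSlots returns the slots where exactly one of the two pods exists.
func brokenSlots(pods []Pod) []string {
	type seen struct{ runner, workflow bool }
	bySlot := map[string]*seen{}
	for _, p := range pods {
		s := bySlot[p.Slot]
		if s == nil {
			s = &seen{}
			bySlot[p.Slot] = s
		}
		switch p.Role {
		case "runner":
			s.runner = true
		case "workflow":
			s.workflow = true
		}
	}
	var broken []string
	for slot, s := range bySlot {
		if s.runner != s.workflow { // one present, the other missing
			broken = append(broken, slot)
		}
	}
	sort.Strings(broken) // map iteration order is random; sort for stable output
	return broken
}

func main() {
	pods := []Pod{
		{Slot: "a", Role: "runner"}, {Slot: "a", Role: "workflow"},
		{Slot: "b", Role: "runner"},
		{Slot: "c", Role: "workflow"},
	}
	fmt.Println(brokenSlots(pods)) // prints [b c]
}
```

Running this before the pair count is adjusted means the surviving half of a broken slot is deleted first, so the slot no longer inflates currentPairs in that cycle.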
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Update gha-runner-scale-set-controller chart and app version
- Update gha-runner-scale-set chart and app version
Reflects recent changes: batched runner pod listing into a
single API call, and added MaxBurstCapacity/MaxRunners
headroom support.
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Bump gha-runner-scale-set-controller chart and appVersion to 0.14.1-jeanschmidt.6
- Bump gha-runner-scale-set chart and appVersion to 0.14.1-jeanschmidt.6
Follows addition of Prometheus metrics to the capacity monitor
in the previous commit.
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Remove osdc.io/runner-class nodeSelector and toleration from
runner placeholder pods; runner-pool nodes carry no such label
or taint, so the selector left placeholders Pending forever
- Add regression test TestRunnerPlaceholder_NeverIncludesRunnerClass
covering both RunnerClass-set and RunnerClass-empty configs
- Add TestWorkflowPlaceholder_StillUsesRunnerClassInPreferredAffinity
to verify workflow placeholders still use runner-class correctly
- Update existing tests to assert runner-class is always absent
- Bump chart versions to jeanschmidt.5
The runner-pool fleet (e.g. c7i-runner) is a shared cluster-wide pool
that does not carry osdc.io/runner-class labels or taints. Only workflow
placeholders use runner-class, via preferred node affinity (weight 100).
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Add RunnerNodeFleet config to place runner placeholders on the
cluster-wide runner pool instead of the per-scale-set workflow pool
- Change git-cache-not-ready toleration to operator:Exists to match
the unconditional startupTaint on runner-pool nodes
- Make Config.Validate() return an error and require RunnerNodeFleet
when capacity-aware mode is enabled
- Add split-fleet tests verifying runner and workflow placeholders
never conflate each other's node-fleet values
- Bump chart versions to jeanschmidt.4
Runner and workflow pods target different node pools (e.g. c7i-runner
vs g4dn). Previously both used NodeFleet, which silently landed runner
placeholders on the wrong pool — defeating the topology separation the
placeholder system is meant to provide.
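A minimal sketch of the validation rule and the fleet split, under assumed names: FleetConfig, the CapacityAware flag, and the two accessor functions are illustrative only, and the real Config has more fields.

```go
package main

import (
	"errors"
	"fmt"
)

// FleetConfig is a hypothetical subset of the listener configuration.
type FleetConfig struct {
	CapacityAware   bool
	NodeFleet       string // per-scale-set workflow pool, e.g. "g4dn"
	RunnerNodeFleet string // cluster-wide runner pool, e.g. "c7i-runner"
}

// Validate rejects capacity-aware configs that lack a runner fleet.
func (c FleetConfig) Validate() error {
	if c.CapacityAware && c.RunnerNodeFleet == "" {
		return errors.New("RunnerNodeFleet is required when capacity-aware mode is enabled")
	}
	return nil
}

// runnerFleet and workflowFleet keep the two values from being mixed up.
func runnerFleet(c FleetConfig) string   { return c.RunnerNodeFleet }
func workflowFleet(c FleetConfig) string { return c.NodeFleet }

func main() {
	cfg := FleetConfig{CapacityAware: true, NodeFleet: "g4dn", RunnerNodeFleet: "c7i-runner"}
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("runner placeholders ->", runnerFleet(cfg))
	fmt.Println("workflow placeholders ->", workflowFleet(cfg))
}
```

Failing validation up front replaces the previous silent fallback to NodeFleet with an explicit error.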
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Add sleepArg() that bounds placeholder pod lifetime to 1.5x
PlaceholderTimeout, preventing pod leaks if listener crashes
- Fix GPU node selector label from nvidia.com/gpu.present to
nvidia.com/gpu
- Bump chart versions to 0.14.1-jeanschmidt.2
- Add tests for timeout-based sleep, infinity fallback, and
sub-second floor-to-1 behavior
Previously placeholder pods ran `sleep infinity`, meaning a listener
crash would leave them running forever. The new defensive sleep
self-terminates pods after 1.5x the configured timeout, acting as a
safety net behind the normal CleanupAll/CleanupTimedOut path.
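One way the bounded sleep could look, as a sketch only. The signature and the exact rounding are assumptions; the three cases mirror the behaviours the tests are described as covering (timeout-based sleep, infinity fallback, sub-second floor to 1).

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// sleepArg returns the argument passed to `sleep` in a placeholder pod.
// Assumption: a non-positive timeout means "no timeout configured".
func sleepArg(timeout time.Duration) string {
	if timeout <= 0 {
		return "infinity"
	}
	// 1.5x the timeout, truncated to whole seconds.
	secs := int64((timeout + timeout/2) / time.Second)
	if secs < 1 {
		secs = 1 // floor sub-second results to 1 so the pod still sleeps
	}
	return strconv.FormatInt(secs, 10)
}

func main() {
	fmt.Println(sleepArg(10 * time.Minute))       // 900
	fmt.Println(sleepArg(0))                      // infinity
	fmt.Println(sleepArg(500 * time.Millisecond)) // 1
}
```

Integer duration arithmetic is used instead of floats so the result is exact for whole-second timeouts.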
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
- Add OCI source label to Dockerfile for jeanschmidt fork
- Bump chart/app versions to 0.14.1-jeanschmidt.1
- Redirect controller image repo to ghcr.io/jeanschmidt
Enables building and deploying the placeholder pod POC
from the forked registry for testing.
Signed-off-by: Jean Schmidt <contato@jschmidt.me>