Commit Graph

93 Commits

Author SHA1 Message Date
Jean Schmidt d5d94fba48 HUD failure fallback: over-provision placeholders
- Add configurable HUDFailureMultiplier (default 3x) to scale
  placeholder count when HUD API is unreachable
- New env var CAPACITY_AWARE_HUD_FAILURE_MULTIPLIER with clamp ≥1
  in both ConfigFromEnv and Validate
- Fallback formula: ProactiveCapacity * multiplier (replaces the
  previous zero-queued-jobs fallback that reduced capacity)
- Add tests for multiplier clamping, MaxRunners cap interaction,
  and HUD-disabled path
- Bump chart versions to jeanschmidt.10

When HUD is down we lose visibility into queue depth, so the old
fallback of assuming 0 queued jobs was backwards — it shrank capacity
exactly when we had the least information. The multiplier-based
fallback leans toward over-provisioning instead; existing safety
bounds (MaxRunners headroom, MaxBurstCapacity) still cap the blast
radius.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-05-15 14:23:36 -07:00
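
The fallback in d5d94fba48 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `Config` struct, the integer multiplier type, the helper names, and the treatment of `MaxRunners` as a simple upper bound (0 meaning "no cap") are all assumptions made for the example.

```go
package main

import "fmt"

// Config is a hypothetical stand-in for the capacity-aware settings.
// Field names follow the commit message; the types are assumed.
type Config struct {
	ProactiveCapacity    int
	HUDFailureMultiplier int // set from CAPACITY_AWARE_HUD_FAILURE_MULTIPLIER
	MaxRunners           int // 0 means "no cap" in this sketch only
}

// clampMultiplier enforces the multiplier >= 1 rule from the commit message.
func clampMultiplier(m int) int {
	if m < 1 {
		return 1
	}
	return m
}

// fallbackPlaceholders returns the placeholder count to use when the HUD API
// is unreachable: ProactiveCapacity * multiplier, bounded by MaxRunners.
func fallbackPlaceholders(cfg Config) int {
	n := cfg.ProactiveCapacity * clampMultiplier(cfg.HUDFailureMultiplier)
	if cfg.MaxRunners > 0 && n > cfg.MaxRunners {
		n = cfg.MaxRunners
	}
	return n
}

func main() {
	// Default 3x multiplier with plenty of headroom.
	fmt.Println(fallbackPlaceholders(Config{ProactiveCapacity: 4, HUDFailureMultiplier: 3, MaxRunners: 100}))
	// An invalid multiplier is clamped to 1, so capacity never shrinks.
	fmt.Println(fallbackPlaceholders(Config{ProactiveCapacity: 4, HUDFailureMultiplier: 0, MaxRunners: 100}))
	// The MaxRunners bound still limits over-provisioning.
	fmt.Println(fallbackPlaceholders(Config{ProactiveCapacity: 4, HUDFailureMultiplier: 3, MaxRunners: 10}))
}
```

The point of the clamp is that a misconfigured multiplier can never reduce capacity below ProactiveCapacity, which is the failure mode the old zero-queued-jobs fallback had.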
Jean Schmidt 4714643523 Require runner-class in workflow affinity
- Promote runner-class from preferred (weight 100) to required node
  affinity term, matching the actual workflow pod's scheduling
- Use DoesNotExist operator when RunnerClass is unset
- AND-combine runner-class and GPU label in same matchExpressions block
- Add table-driven tests for all runner-class + GPU combinations
- Bump chart versions to jeanschmidt.9

A preferred runner-class term let placeholders land on non-matching
nodes where the real workflow pod (which uses a required term) could
never follow — wasting the reservation. Making it required ensures
placeholders only occupy nodes the actual pod can schedule onto.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-05-05 11:42:02 -07:00
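
The required affinity term in 4714643523 might look something like the sketch below. To stay self-contained it uses a local `Requirement` type that mirrors the shape of a Kubernetes node selector requirement rather than importing the real API types. The label keys are taken from other entries in this log (`osdc.io/runner-class`, `nvidia.com/gpu`); the function name and the `Exists` operator for the GPU label are assumptions.

```go
package main

import "fmt"

// Requirement mirrors the key/operator/values shape of a Kubernetes
// node selector requirement. It is a local stand-in for illustration.
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

const (
	runnerClassLabel = "osdc.io/runner-class"
	gpuLabel         = "nvidia.com/gpu"
)

// workflowMatchExpressions builds a single matchExpressions list.
// Kubernetes ANDs all expressions within one list, so a node must satisfy
// both the runner-class and the GPU requirement to be eligible.
func workflowMatchExpressions(runnerClass string, gpu bool) []Requirement {
	var exprs []Requirement
	if runnerClass == "" {
		// No runner class configured: only nodes without the label qualify.
		exprs = append(exprs, Requirement{Key: runnerClassLabel, Operator: "DoesNotExist"})
	} else {
		exprs = append(exprs, Requirement{Key: runnerClassLabel, Operator: "In", Values: []string{runnerClass}})
	}
	if gpu {
		// Assumed operator; the commit only says the GPU label is AND-combined.
		exprs = append(exprs, Requirement{Key: gpuLabel, Operator: "Exists"})
	}
	return exprs
}

func main() {
	for _, e := range workflowMatchExpressions("gpu-large", true) {
		fmt.Println(e.Key, e.Operator, e.Values)
	}
	for _, e := range workflowMatchExpressions("", false) {
		fmt.Println(e.Key, e.Operator, e.Values)
	}
}
```

Putting both requirements in the same list matters: separate node selector terms are ORed, which would let a placeholder land on a node that satisfies only one of the two conditions.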
Jean Schmidt d219a11c89 Detect and replace broken placeholder pairs
- Add CleanupBroken to PlaceholderManager to find slots
  where one of two pods was evicted/deleted
- Integrate broken-pair cleanup into reconcileProvisioning
  between ListPairs and adjustPairs so replacement happens
  in the same cycle
- Add "broken" delete reason with Prometheus metrics
- Add unit tests for both successful and failed cleanup
- Bump Helm chart versions to jeanschmidt.8

Without this fix, a broken pair (one pod missing) would
count as healthy in currentPairs, causing the provisioner
to believe capacity was at desired level. Pre-warmed
capacity would be permanently reduced until the next full
listener restart.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-05-01 13:26:59 -07:00
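
The detection step in d219a11c89 can be illustrated with a small sketch. The `Pod` type, the `Slot` field, and the `brokenSlots` helper are hypothetical; the actual CleanupBroken method presumably works against the Kubernetes API and also deletes the surviving pod, which is omitted here.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a hypothetical, simplified view of a placeholder pod.
// Slot identifies which pair the pod belongs to.
type Pod struct {
	Name string
	Slot string
}

// brokenSlots returns the slots that have fewer than two live pods,
// i.e. pairs where one pod was evicted or deleted.
func brokenSlots(pods []Pod) []string {
	counts := make(map[string]int)
	for _, p := range pods {
		counts[p.Slot]++
	}
	var broken []string
	for slot, n := range counts {
		if n < 2 {
			broken = append(broken, slot)
		}
	}
	sort.Strings(broken) // map iteration order is random; sort for stable output
	return broken
}

func main() {
	pods := []Pod{
		{Name: "runner-a", Slot: "slot-a"},
		{Name: "workflow-a", Slot: "slot-a"},
		{Name: "runner-b", Slot: "slot-b"}, // partner pod is gone
	}
	fmt.Println(brokenSlots(pods))
}
```

Once broken slots are removed from the count of current pairs, the adjustment step sees the real shortfall and can create replacements in the same reconcile cycle.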
Jean Schmidt 89a51137dc Bump chart versions to jeanschmidt.7
- Update gha-runner-scale-set-controller chart and app version
- Update gha-runner-scale-set chart and app version

Reflects recent changes: batched runner pod listing into a
single API call, and added MaxBurstCapacity/MaxRunners
headroom support.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-30 20:34:37 -07:00
Jean Schmidt 4203ba9489 Bump chart versions to jeanschmidt.6
- Bump gha-runner-scale-set-controller chart and appVersion to 0.14.1-jeanschmidt.6
- Bump gha-runner-scale-set chart and appVersion to 0.14.1-jeanschmidt.6

Follows addition of Prometheus metrics to the capacity monitor
in the previous commit.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-29 13:44:22 -07:00
Jean Schmidt e3dfeae67f Drop runner-class from runner placeholders
- Remove osdc.io/runner-class nodeSelector and toleration from
  runner placeholder pods; runner-pool nodes carry no such label
  or taint, causing placeholders to stay Pending forever
- Add regression test TestRunnerPlaceholder_NeverIncludesRunnerClass
  covering both RunnerClass-set and RunnerClass-empty configs
- Add TestWorkflowPlaceholder_StillUsesRunnerClassInPreferredAffinity
  to verify workflow placeholders still use runner-class correctly
- Update existing tests to assert runner-class is always absent
- Bump chart versions to jeanschmidt.5

The runner-pool fleet (e.g. c7i-runner) is a shared cluster-wide pool
that does not carry osdc.io/runner-class labels or taints. Only workflow
placeholders use runner-class, via preferred node affinity (weight 100).

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-29 03:30:46 -07:00
Jean Schmidt 2c07a07a97 Split runner/workflow placeholder fleets
- Add RunnerNodeFleet config to place runner placeholders on the
  cluster-wide runner pool instead of the per-scale-set workflow pool
- Change git-cache-not-ready toleration to operator:Exists to match
  the unconditional startupTaint on runner-pool nodes
- Make Config.Validate() return an error and require RunnerNodeFleet
  when capacity-aware mode is enabled
- Add split-fleet tests verifying runner and workflow placeholders
  never conflate each other's node-fleet values
- Bump chart versions to jeanschmidt.4

Runner and workflow pods target different node pools (e.g. c7i-runner
vs g4dn). Previously both used NodeFleet, which silently landed runner
placeholders on the wrong pool — defeating the topology separation the
placeholder system is meant to provide.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-29 02:36:13 -07:00
Jean Schmidt 805c698b8e bump version
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-28 18:44:17 -07:00
Jean Schmidt 8e5df744fd Add defensive sleep timeout to placeholders
- Add sleepArg() that bounds placeholder pod lifetime to 1.5x
  PlaceholderTimeout, preventing pod leaks if listener crashes
- Fix GPU node selector label from nvidia.com/gpu.present to
  nvidia.com/gpu
- Bump chart versions to 0.14.1-jeanschmidt.2
- Add tests for timeout-based sleep, infinity fallback, and
  sub-second floor-to-1 behavior

Previously placeholder pods ran `sleep infinity`, meaning a listener
crash would leave them running forever. The new defensive sleep
self-terminates pods after 1.5x the configured timeout, acting as a
safety net behind the normal CleanupAll/CleanupTimedOut path.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-28 17:53:55 -07:00
Jean Schmidt adf2791de5 Point images and charts to personal fork
- Add OCI source label to Dockerfile for jeanschmidt fork
- Bump chart/app versions to 0.14.1-jeanschmidt.1
- Redirect controller image repo to ghcr.io/jeanschmidt

Enables building and deploying the placeholder pod POC
from the forked registry for testing.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-28 16:40:34 -07:00
Junya Okabe a401686bd5
Add option to disable workqueue bucket rate limiter (#4451) 2026-04-22 23:26:39 +02:00
Francesco Renzi 74cfc3855e
Prepare 0.14.1 release (#4448) 2026-04-14 17:03:22 +01:00
Nikola Jokic 8b7f232dc4
Prepare 0.14.0 release (#4413) 2026-03-19 18:53:37 +01:00
Nikola Jokic 802dc28d38
Add multi-label support to scalesets (#4408) 2026-03-19 15:29:40 +01:00
Nikola Jokic 9bc1c9e53e
Shutdown the scaleset when runner is deprecated (#4404)
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-19 13:30:20 +01:00
Nikola Jokic 276717a04b
Manually bump dependencies since it needs fixes related to the controller runtime API (#4406) 2026-03-16 10:09:36 +01:00
Nikola Jokic 1d9f626c53
Allow users to apply labels and annotations to internal resources (#4400) 2026-03-12 10:32:54 +01:00
Nikola Jokic a505fb5616
Prepare 0.13.1 release (#4341) 2025-12-23 14:57:05 +01:00
Nikola Jokic a0c30df25b
Prepare 0.13.0 release (#4280) 2025-10-16 19:25:56 +02:00
Nikola Jokic 634e42c916
Bump all dependencies (#4266) 2025-10-14 13:24:25 +02:00
Nikola Jokic 088e2a3a90
Remove ephemeral runner when exit code != 0 and is patched with the job (#4239) 2025-09-17 21:40:37 +02:00
Nikola Jokic c27541140a
Remove JIT config from ephemeral runner status field (#4191) 2025-08-04 12:35:04 +02:00
Alex Hatzenbuhler a07dce28bb
Remove deprecated preserveUnknownFields from CRDs (#4135) 2025-07-24 08:47:34 +02:00
Nikola Jokic 349cc0835e
Fix image pull secrets list arguments in the chart (#4164) 2025-07-01 15:28:18 +02:00
Nikola Jokic ded39bede6
Prepare 0.12.1 release (#4153) 2025-06-27 13:49:47 +02:00
Nikola Jokic d9826e5244
Prepare 0.12.0 release (#4122) 2025-06-13 14:23:26 +02:00
Nikola Jokic e46c929241
Azure Key Vault integration to resolve secrets (#4090) 2025-06-11 15:53:33 +02:00
Nikola Jokic cae7efa2c6
Create backoff mechanism for failed runners and allow re-creation of failed ephemeral runners (#4059) 2025-05-14 15:38:50 +02:00
Nikola Jokic 4ca37fbdf2
Prepare 0.11.0 release (#3992) 2025-03-25 11:09:03 +01:00
Nikola Jokic 5a960b5ebb
Create configurable metrics (#3975) 2025-03-24 15:27:42 +01:00
Nikola Jokic 7033e299cd
Add events role permission to leader_election_role (#3988) 2025-03-24 15:10:47 +01:00
J. Fernández 3c1a323381
feat: allow namespace overrides (#3797)
Signed-off-by: Jesús Fernández <7312236+fernandezcuesta@users.noreply.github.com>
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
2025-03-18 21:41:04 +01:00
Nikola Jokic fb9b96bf75
Update all dependencies, conforming to the new controller-runtime API (#3949) 2025-03-11 15:52:52 +01:00
Mikey Smet 75c6a94010
Use gha-runner-scale-set-controller.chart instead of .Chart.Version (#3729)
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
2025-03-10 11:48:30 +01:00
Nikola Jokic 7a5996f467
Remove old githubrunnerscalesetlistener, remove warning and fix config bug (#3937) 2025-03-07 11:58:16 +01:00
Nikola Jokic 66172ab0bd
Fix template tests and add go test on gha-validate-chart (#3886) 2025-01-15 15:54:33 +01:00
Bassem Dghaidi 1e10417be8
Prepare `0.10.1` release (#3859) 2024-12-18 16:22:50 +01:00
Bassem Dghaidi 1ef7196115
Fix helm chart bug related to `runnerMaxConcurrentReconciles` (#3858) 2024-12-18 16:14:55 +01:00
Bassem Dghaidi 59cb1d2c8b
Prepare `0.10.0` release (#3849) 2024-12-16 11:39:55 +01:00
Bassem Dghaidi 7e04027d19
Make k8s client rate limiter parameters configurable (#3848)
Co-authored-by: Taketoshi Fujiwara <t-b-fujiwara@mercari.com>
2024-12-13 15:37:01 +01:00
Yusuke Kuoka 3998f6dee6
Make EphemeralRunnerController MaxConcurrentReconciles configurable (#3832)
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
2024-12-11 21:19:43 +01:00
Nikola Jokic 80d848339e
Prepare 0.9.3 release (#3624) 2024-06-25 12:35:39 +02:00
Nikola Jokic a62ca3d853
Exclude label prefix propagation (#3607) 2024-06-21 12:12:14 +02:00
Nikola Jokic 3be7128f9a
Prepare 0.9.2 release (#3530) 2024-05-20 10:58:06 +02:00
Nikola Jokic ea13873f14
Remove service monitor that is not used in controller chart (#3526) 2024-05-17 13:06:57 +02:00
Nikola Jokic 9e191cdd21
Prepare 0.9.1 release (#3448) 2024-04-17 10:51:28 +02:00
Alexandre Chouinard 0006dd5eb1
Add topologySpreadConstraint to gha-runner-scale-set-controller chart (#3405) 2024-04-12 14:22:41 +02:00
Nikola Jokic 4357525445
Prepare 0.9.0 release (#3388) 2024-03-27 11:54:17 +01:00
Nikola Jokic 7a643a5107
Fix overscaling when the controller is much faster than the listener (#3371)
Co-authored-by: Francesco Renzi <rentziass@gmail.com>
2024-03-20 15:36:12 +01:00
Nikola Jokic a7af44e042
Deprecation warning of older listener for 0.9.0 release (#3280) 2024-03-18 12:59:41 +01:00