1831 Commits

Author SHA1 Message Date
Jean Schmidt 4714643523 Require runner-class in workflow affinity
- Promote runner-class from preferred (weight 100) to required node
  affinity term, matching the actual workflow pod's scheduling
- Use DoesNotExist operator when RunnerClass is unset
- AND-combine runner-class and GPU label in same matchExpressions block
- Add table-driven tests for all runner-class + GPU combinations
- Bump chart versions to jeanschmidt.9

A preferred runner-class term let placeholders land on non-matching
nodes where the real workflow pod (which uses a required term) could
never follow — wasting the reservation. Making it required ensures
placeholders only occupy nodes the actual pod can schedule onto.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-05-05 11:42:02 -07:00
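The affinity rule described in 4714643523 can be sketched roughly as follows. This is a minimal illustration, not the fork's code: `Requirement` is a simplified stand-in for the Kubernetes node-selector requirement type, `requiredTerm` is a hypothetical helper, and matching the GPU label with `Exists` is an assumption. It relies on the fact that all expressions inside one matchExpressions block are ANDed by the scheduler.

```go
package main

import "fmt"

// Requirement is a simplified stand-in for a Kubernetes node-selector
// requirement (hypothetical; real code would use k8s.io/api/core/v1).
type Requirement struct {
	Key      string
	Operator string // "In", "Exists" or "DoesNotExist"
	Values   []string
}

// Label keys as they appear in the commit log.
const (
	runnerClassKey = "osdc.io/runner-class"
	gpuKey         = "nvidia.com/gpu"
)

// requiredTerm builds the expressions of ONE matchExpressions block.
// Expressions in the same block are ANDed, so a node must satisfy both
// the runner-class rule and, when requested, the GPU rule.
func requiredTerm(runnerClass string, gpu bool) []Requirement {
	var reqs []Requirement
	if runnerClass == "" {
		// RunnerClass unset: only nodes without the label qualify.
		reqs = append(reqs, Requirement{Key: runnerClassKey, Operator: "DoesNotExist"})
	} else {
		reqs = append(reqs, Requirement{Key: runnerClassKey, Operator: "In", Values: []string{runnerClass}})
	}
	if gpu {
		// Assumption: GPU nodes are matched by presence of the label.
		reqs = append(reqs, Requirement{Key: gpuKey, Operator: "Exists"})
	}
	return reqs
}

func main() {
	for _, r := range requiredTerm("", true) {
		fmt.Printf("%s %s %v\n", r.Key, r.Operator, r.Values)
	}
}
```

Keeping both expressions in one block matters: splitting them into separate node selector terms would OR them instead.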
Jean Schmidt d219a11c89 Detect and replace broken placeholder pairs
- Add CleanupBroken to PlaceholderManager to find slots
  where one of two pods was evicted/deleted
- Integrate broken-pair cleanup into reconcileProvisioning
  between ListPairs and adjustPairs so replacement happens
  in the same cycle
- Add "broken" delete reason with Prometheus metrics
- Add unit tests for both successful and failed cleanup
- Bump Helm chart versions to jeanschmidt.8

Without this fix, a broken pair (one pod missing) would
count as healthy in currentPairs, causing the provisioner
to believe capacity was at the desired level. Pre-warmed
capacity would stay reduced until the next full listener
restart.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-05-01 13:26:59 -07:00
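The broken-pair detection in d219a11c89 could look something like the sketch below. It assumes each placeholder pod carries a slot identifier and a role; the `placeholderPod` type, its fields, and `brokenSlots` are hypothetical names for illustration, not the actual CleanupBroken implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// placeholderPod is a hypothetical, reduced view of a placeholder pod.
type placeholderPod struct {
	Slot string // pair identifier shared by both pods of a pair
	Role string // "runner" or "workflow"
}

// brokenSlots returns, in sorted order, every slot that does not have
// both a runner and a workflow pod.
func brokenSlots(pods []placeholderPod) []string {
	seen := map[string]map[string]bool{}
	for _, p := range pods {
		if seen[p.Slot] == nil {
			seen[p.Slot] = map[string]bool{}
		}
		seen[p.Slot][p.Role] = true
	}
	var broken []string
	for slot, roles := range seen {
		if !(roles["runner"] && roles["workflow"]) {
			broken = append(broken, slot)
		}
	}
	sort.Strings(broken)
	return broken
}

func main() {
	pods := []placeholderPod{
		{Slot: "slot-1", Role: "runner"},
		{Slot: "slot-1", Role: "workflow"},
		{Slot: "slot-2", Role: "runner"}, // workflow pod was evicted
	}
	fmt.Println(brokenSlots(pods))
}
```

Deleting the surviving pod of each returned slot before the pair count is adjusted lets the replacement happen in the same cycle, as the commit describes.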
Jean Schmidt a3c294ab14
Merge pull request #3 from jeanschmidt/jeanschmidt/proactive_capacity_max_runners
Add MaxBurstCapacity cap and fix MaxRunners headroom calculation
2026-05-01 12:39:52 -07:00
Jean Schmidt 89a51137dc Bump chart versions to jeanschmidt.7
- Update gha-runner-scale-set-controller chart and app version
- Update gha-runner-scale-set chart and app version

Reflects recent changes: batched runner pod listing into a
single API call, and added MaxBurstCapacity/MaxRunners
headroom support.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-30 20:34:37 -07:00
Jean Schmidt 24bf64ada6 Batch runner pod listing into single API call
- Merge two per-phase countRunnersByPhaseWithRetry calls into one
  that returns a map[PodPhase]int, halving API calls to the apiserver
- Drop FieldSelector filtering; group pods by phase in code only
- Guard headroom calculation with max(0, ...) so the clamp is never negative
- Always log maxBurstCapacity instead of conditionally appending it

The two separate List calls (Running, Pending) were redundant — one
unfiltered List grouped in-code gives the same result with half the
API traffic. The fake clientset never honored FieldSelector anyway,
so removing it also eliminates the test/prod behavioral divergence
noted in the old comment.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-30 18:20:01 -07:00
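The in-code grouping from 24bf64ada6 amounts to a single pass over one unfiltered list. A minimal sketch, assuming a reduced `Pod` type in place of the Kubernetes one (`Pod`, `PodPhase`, and `countByPhase` here are local illustrations, not the fork's identifiers):

```go
package main

import "fmt"

// PodPhase and Pod are simplified local stand-ins for the Kubernetes types.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
)

type Pod struct {
	Name  string
	Phase PodPhase
}

// countByPhase groups the result of ONE list call by phase, replacing
// separate per-phase list calls.
func countByPhase(pods []Pod) map[PodPhase]int {
	counts := make(map[PodPhase]int)
	for _, p := range pods {
		counts[p.Phase]++
	}
	return counts
}

func main() {
	counts := countByPhase([]Pod{
		{Name: "runner-a", Phase: PodRunning},
		{Name: "runner-b", Phase: PodPending},
		{Name: "runner-c", Phase: PodRunning},
	})
	fmt.Println(counts[PodRunning], counts[PodPending])
}
```

A phase with no pods simply reads as zero from the map, so callers do not need a presence check.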
Jean Schmidt f6c56d3af0 Add MaxBurstCapacity and MaxRunners headroom
- Add MaxBurstCapacity config to cap placeholder pairs per
  cycle, preventing burst node provisioning from overloading
  downstream services (git-cache, Harbor, pypi-cache)
- Fix MaxRunners headroom: subtract real runner pods (Running
  + Pending) so placeholders only fill remaining capacity
- Generalize countRunningRunners to countRunnersByPhase for
  reuse across both provisioner and reporter loops
- Add Prometheus gauge for max_burst_capacity config value
- Add tests for burst cap, headroom clamp, pending pod
  accounting, runner-count errors, and edge cases

The previous MaxRunners clamp (`min(desired, MaxRunners)`)
allowed up to MaxRunners placeholders ON TOP of real runners,
effectively doubling the cap. The headroom fix subtracts
actual Running+Pending runner pods from MaxRunners before
clamping. On runner-count failure the cycle is skipped
entirely — treating a failed count as 0 would silently
re-open the doubling bug during the failure window.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-30 12:17:55 -07:00
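The headroom arithmetic in f6c56d3af0 can be illustrated with a small sketch. The function names are hypothetical, and two details are assumptions rather than facts from the log: that the burst cap is applied to the final target, and that a cap of 0 means "uncapped".

```go
package main

import (
	"errors"
	"fmt"
)

// errCount is a sample failure used in the demo below.
var errCount = errors.New("runner count failed")

// placeholderTarget returns how many placeholder pairs may exist given
// the real runner pods already counted against MaxRunners.
func placeholderTarget(desired, maxRunners, running, pending, maxBurst int) int {
	headroom := maxRunners - (running + pending)
	if headroom < 0 {
		headroom = 0 // more real runners than MaxRunners: no room left
	}
	target := desired
	if target > headroom {
		target = headroom
	}
	// Assumption: maxBurst <= 0 disables the burst cap.
	if maxBurst > 0 && target > maxBurst {
		target = maxBurst
	}
	return target
}

// reconcileTarget skips the cycle when the runner count fails instead of
// treating the failed count as zero.
func reconcileTarget(desired, maxRunners, maxBurst int, count func() (running, pending int, err error)) (int, error) {
	running, pending, err := count()
	if err != nil {
		return 0, fmt.Errorf("skipping cycle: %w", err)
	}
	return placeholderTarget(desired, maxRunners, running, pending, maxBurst), nil
}

func main() {
	// MaxRunners=10 with 6 Running + 2 Pending leaves headroom for 2.
	fmt.Println(placeholderTarget(5, 10, 6, 2, 0))
	_, err := reconcileTarget(5, 10, 0, func() (int, int, error) { return 0, 0, errCount })
	fmt.Println(err)
}
```

With the old `min(desired, MaxRunners)` clamp the same inputs would allow 5 placeholders next to 8 real runners, 13 pods against a cap of 10.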
Jean Schmidt 69136c719d
Merge pull request #2 from jeanschmidt/jeanschmidt/proactive_capacity_metrics
Add Prometheus metrics to proactive capacity monitor
2026-04-29 14:06:57 -07:00
Jean Schmidt 4203ba9489 Bump chart versions to jeanschmidt.6
- Bump gha-runner-scale-set-controller chart and appVersion to 0.14.1-jeanschmidt.6
- Bump gha-runner-scale-set chart and appVersion to 0.14.1-jeanschmidt.6

Follows addition of Prometheus metrics to the capacity monitor
in the previous commit.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-29 13:44:22 -07:00
Jean Schmidt 24a4afc263 Add Prometheus metrics to capacity monitor
- Define CapacityRecorder interface with 15 metrics (gauges, counters,
  histograms) covering reconcile loops, HUD API, pair lifecycle, and
  placeholder pod phases
- Instrument monitor reconcile paths with duration histograms,
  skip/error counters, and last-success timestamps for wedge detection
- Refactor CleanupTimedOut and CleanupOrphans to return per-pair
  success/failure counts for accurate metric emission
- Add parity test enforcing Go metric registry stays in sync with
  the OSDC chart's listenerMetrics allowlist via YAML fixture
- Wire CapacityRecorder into monitor via WithRecorder option pattern,
  defaulting to no-op discard for backward compatibility

Reconcile-last-success gauges are seeded at startup to avoid
spurious wedge alerts during the window between listener restart
and first completed reconcile. Placeholder pod phase gauges emit
all (role x phase) combinations including zeros so Prometheus
gauges decrement correctly when phases empty out.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-29 13:35:05 -07:00
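Two ideas from 24a4afc263, the WithRecorder option with a no-op default and the zero-filled (role x phase) gauges, can be sketched together. Only the names CapacityRecorder and WithRecorder come from the log; the method `SetPlaceholderPods`, the `memRecorder` helper, and the exact role and phase lists are assumptions for illustration.

```go
package main

import "fmt"

// CapacityRecorder is reduced here to a single hypothetical method.
type CapacityRecorder interface {
	SetPlaceholderPods(role, phase string, n int)
}

// discardRecorder is the no-op default used when no recorder is wired in.
type discardRecorder struct{}

func (discardRecorder) SetPlaceholderPods(string, string, int) {}

type Monitor struct{ recorder CapacityRecorder }

type Option func(*Monitor)

// WithRecorder replaces the default no-op recorder.
func WithRecorder(r CapacityRecorder) Option {
	return func(m *Monitor) {
		if r != nil {
			m.recorder = r
		}
	}
}

func NewMonitor(opts ...Option) *Monitor {
	m := &Monitor{recorder: discardRecorder{}}
	for _, opt := range opts {
		opt(m)
	}
	return m
}

// Assumed label values; the phases are the standard Kubernetes pod phases.
var (
	roles  = []string{"runner", "workflow"}
	phases = []string{"Pending", "Running", "Succeeded", "Failed", "Unknown"}
)

// reportPhases emits every role/phase combination, zeros included, so a
// gauge drops back to 0 when a phase empties out.
func (m *Monitor) reportPhases(counts map[string]map[string]int) {
	for _, role := range roles {
		for _, phase := range phases {
			m.recorder.SetPlaceholderPods(role, phase, counts[role][phase])
		}
	}
}

// memRecorder keeps the last value per role/phase, for demonstration.
type memRecorder struct{ values map[string]int }

func (r *memRecorder) SetPlaceholderPods(role, phase string, n int) {
	r.values[role+"/"+phase] = n
}

func main() {
	rec := &memRecorder{values: map[string]int{}}
	m := NewMonitor(WithRecorder(rec))
	m.reportPhases(map[string]map[string]int{"runner": {"Running": 2}})
	m.reportPhases(nil) // all pods gone: every gauge is written as 0
	fmt.Println(len(rec.values), rec.values["runner/Running"])
}
```

If only non-zero combinations were emitted, the second call would leave the runner/Running gauge stuck at 2.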
Jean Schmidt e3dfeae67f Drop runner-class from runner placeholders
- Remove osdc.io/runner-class nodeSelector and toleration from
  runner placeholder pods; runner-pool nodes carry no such label
  or taint, causing placeholders to stay Pending forever
- Add regression test TestRunnerPlaceholder_NeverIncludesRunnerClass
  covering both RunnerClass-set and RunnerClass-empty configs
- Add TestWorkflowPlaceholder_StillUsesRunnerClassInPreferredAffinity
  to verify workflow placeholders still use runner-class correctly
- Update existing tests to assert runner-class is always absent
- Bump chart versions to jeanschmidt.5

The runner-pool fleet (e.g. c7i-runner) is a shared cluster-wide pool
that does not carry osdc.io/runner-class labels or taints. Only workflow
placeholders use runner-class, via preferred node affinity (weight 100).

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-29 03:30:46 -07:00
Jean Schmidt 2c07a07a97 Split runner/workflow placeholder fleets
- Add RunnerNodeFleet config to place runner placeholders on the
  cluster-wide runner pool instead of the per-scale-set workflow pool
- Change git-cache-not-ready toleration to operator:Exists to match
  the unconditional startupTaint on runner-pool nodes
- Make Config.Validate() return an error and require RunnerNodeFleet
  when capacity-aware mode is enabled
- Add split-fleet tests verifying runner and workflow placeholders
  never conflate each other's node-fleet values
- Bump chart versions to jeanschmidt.4

Runner and workflow pods target different node pools (e.g. c7i-runner
vs g4dn). Previously both used NodeFleet, which silently landed runner
placeholders on the wrong pool — defeating the topology separation the
placeholder system is meant to provide.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-29 02:36:13 -07:00
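The validation rule from 2c07a07a97 might be sketched as below. The `Config` struct is reduced to three fields and the `Enabled` field name is an assumption; only NodeFleet, RunnerNodeFleet, and Config.Validate() are named in the log.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a reduced, hypothetical view of the capacity config.
type Config struct {
	Enabled         bool   // capacity-aware mode on/off (assumed field name)
	NodeFleet       string // fleet for workflow placeholders
	RunnerNodeFleet string // fleet for runner placeholders
}

// Validate returns an error instead of silently falling back to NodeFleet,
// which would place runner placeholders on the workflow pool.
func (c Config) Validate() error {
	if !c.Enabled {
		return nil
	}
	if c.RunnerNodeFleet == "" {
		return errors.New("RunnerNodeFleet is required when capacity-aware mode is enabled")
	}
	return nil
}

func main() {
	fmt.Println(Config{Enabled: true, NodeFleet: "g4dn"}.Validate())
	fmt.Println(Config{Enabled: true, NodeFleet: "g4dn", RunnerNodeFleet: "c7i-runner"}.Validate())
}
```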
Jean Schmidt 805c698b8e bump version
Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-28 18:44:17 -07:00
Jean Schmidt 8e5df744fd Add defensive sleep timeout to placeholders
- Add sleepArg() that bounds placeholder pod lifetime to 1.5x
  PlaceholderTimeout, preventing pod leaks if listener crashes
- Fix GPU node selector label from nvidia.com/gpu.present to
  nvidia.com/gpu
- Bump chart versions to 0.14.1-jeanschmidt.2
- Add tests for timeout-based sleep, infinity fallback, and
  sub-second floor-to-1 behavior

Previously placeholder pods ran `sleep infinity`, meaning a listener
crash would leave them running forever. The new defensive sleep
self-terminates pods after 1.5x the configured timeout, acting as a
safety net behind the normal CleanupAll/CleanupTimedOut path.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-28 17:53:55 -07:00
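A possible shape for the sleepArg() helper from 8e5df744fd is sketched below. The signature is a guess, and falling back to `infinity` when the timeout is zero or negative is an assumption; the log only states that the tests cover a timeout-based sleep, an infinity fallback, and a sub-second floor to 1.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// sleepArg returns the argument for the placeholder's `sleep` command:
// 1.5x the timeout in whole seconds, at least 1, or "infinity" when no
// timeout is configured (assumed to mean timeout <= 0).
func sleepArg(timeout time.Duration) string {
	if timeout <= 0 {
		return "infinity"
	}
	secs := int64((timeout * 3 / 2) / time.Second) // truncates sub-second remainder
	if secs < 1 {
		secs = 1 // `sleep 0` would exit immediately
	}
	return strconv.FormatInt(secs, 10)
}

func main() {
	fmt.Println(sleepArg(10 * time.Minute))       // 900
	fmt.Println(sleepArg(200 * time.Millisecond)) // 1
	fmt.Println(sleepArg(0))                      // infinity
}
```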
Jean Schmidt 1d0ff0f1d8
Merge pull request #1 from jeanschmidt/jeanschmidt/placeholder_run_poc
Add capacity-aware placeholder pod pre-warming for GitHub Actions runners
2026-04-28 16:43:33 -07:00
Jean Schmidt adf2791de5 Point images and charts to personal fork
- Add OCI source label to Dockerfile for jeanschmidt fork
- Bump chart/app versions to 0.14.1-jeanschmidt.1
- Redirect controller image repo to ghcr.io/jeanschmidt

Enables building and deploying the placeholder pod POC
from the forked registry for testing.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-28 16:40:34 -07:00
Jean Schmidt 1944a96710 Split monitor into provisioner and reporter
- Decouple provisioning (create/delete pairs) from capacity
  reporting (count running pairs/runners, call setMaxRunners)
  into independent loops with separate tick intervals
- Add exponential-backoff retry for K8s API and HUD API calls
  with per-loop retry budgets (reporter: 2, provisioner: 3)
- Use atomic.Int64 for slotCounter to support concurrent access
- Add ordered shutdown: reporter stops before placeholder
  cleanup to prevent a flash of reportedCapacity=0

The single reconcile loop coupled provisioning and reporting at
the same interval (30s). Reporting needs to react faster when
placeholder pods become Running (node warm-up), so it now ticks
independently at 5s (configurable via CAPACITY_AWARE_REPORT_INTERVAL).
Retries prevent transient K8s API failures from causing missed
cycles or premature capacity drops.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-28 09:12:50 -07:00
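The retry behaviour described in 1944a96710 follows the usual exponential-backoff pattern, sketched here with the standard library only. The `retry` helper and its parameters are hypothetical, and whether the per-loop budget counts attempts or retries is not stated in the log; this sketch counts total attempts.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a sample error used in the demo below.
var errTransient = errors.New("transient failure")

// retry calls fn up to `attempts` times, doubling the wait after each
// failure. It returns nil on the first success, else the last error.
func retry(attempts int, base time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return err
}

func main() {
	calls := 0
	err := retry(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient // first two calls fail
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

A production version would likely also take a context so that shutdown can interrupt the wait; that is omitted here to keep the sketch short.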
Jean Schmidt 0fa4cb29ce Add anchor ConfigMap and configurable HUD URL
- Add anchor ConfigMap as OwnerReference for placeholder pods, so
  Kubernetes GC can cascade-delete them although the listener runs
  in a different namespace
- Make HUD API URL configurable via CAPACITY_AWARE_HUD_API_URL env var
  with sensible default query parameters baked into the URL
- Simplify HUDClient by storing URL as instance field, removing the
  need for rewriteTransport test helper
- Extend CleanupOrphans to also delete orphaned anchor ConfigMaps

The anchor ConfigMap solves cross-namespace ownership: the listener
runs in arc-systems but placeholders live in arc-runners, so the
listener pod cannot be a direct OwnerReference. The anchor lives in
the placeholder namespace and owns all placeholder pods, so deleting
it triggers automatic GC of all associated pods.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-27 12:56:10 -07:00
Jean Schmidt 6c56d56043 Add capacity-aware placeholder pod pre-warming
- Add capacity package with monitor, placeholder manager, HUD client,
  and env-based config for proactive node pre-warming
- Create runner+workflow placeholder pod pairs that mirror the actual
  runner.yaml.tpl specs (nodeSelector, tolerations, affinity, resources)
- Integrate monitor into ghalistener main.go as an optional goroutine
  gated by CAPACITY_AWARE_ENABLED env var
- Query PyTorch HUD API for queued job counts to dynamically scale
  placeholder count beyond the static proactive capacity baseline
- Add comprehensive test coverage (~1440 lines) including pod spec
  fidelity, orphan cleanup, scale-down preference, and idempotency

The listener currently reports a static maxRunners to GitHub, which
means nodes are only provisioned after jobs are already queued. This
causes cold-start latency while Karpenter spins up nodes. Placeholder
pods reserve cluster capacity ahead of demand. When actual work
arrives, they are evicted via priority class preemption, so real
runner pods can schedule immediately onto pre-warmed nodes.

Signed-off-by: Jean Schmidt <contato@jschmidt.me>
2026-04-23 17:01:12 -07:00
Junya Okabe a401686bd5
Add option to disable workqueue bucket rate limiter (#4451) 2026-04-22 23:26:39 +02:00
github-actions[bot] 012f1a5b23
Updates: runner to v2.334.0 (#4467)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-04-22 17:26:50 +02:00
Gleb Khaykin e0feb3b711
Fix orphan no-permission ServiceAccount in kubernetes-novolume mode (#4455) 2026-04-20 13:31:23 +02:00
Francesco Renzi 74cfc3855e
Prepare 0.14.1 release (#4448) 2026-04-14 17:03:22 +01:00
Francesco Renzi eb1544f848
Bump actions/scaleset to v0.3.0 (#4447) 2026-04-14 14:08:22 +01:00
Nikola Jokic 79e7b17b56
Fix null field for resource metadata fields in experimental chart (#4419) 2026-04-02 23:44:37 +02:00
github-actions[bot] 39934ce5eb
Updates: runner to v2.333.1 (#4427)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-03-31 19:35:28 -05:00
github-actions[bot] 5f4c132f12
Updates: runner to v2.333.0 (#4412)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-03-23 12:46:49 +01:00
Vinayak Gaikwad 0d1e2b3e74
remove redundant ticks around "name" and use plural (#3661) 2026-03-23 12:46:13 +01:00
Nikola Jokic 104bc6b0b0
Fix chart version for publishing (#4415) 2026-03-19 18:13:17 +00:00
Nikola Jokic 8b7f232dc4
Prepare 0.14.0 release (#4413) 2026-03-19 18:53:37 +01:00
Nikola Jokic 19f22b85e7
Add @steve-glass to CODEOWNERS (#4414) 2026-03-19 18:24:00 +01:00
Nikola Jokic 802dc28d38
Add multi-label support to scalesets (#4408) 2026-03-19 15:29:40 +01:00
Nikola Jokic 9bc1c9e53e
Shutdown the scaleset when runner is deprecated (#4404)
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-03-19 13:30:20 +01:00
Nikola Jokic 40595d806f
Add chart-level API to customize internal resources (#4410) 2026-03-18 14:44:30 +01:00
Nikola Jokic dc7c858e68
Remove actions client (#4405) 2026-03-16 14:39:55 +01:00
Nikola Jokic 2fc51aaf32
Regenerate manifests for experimental charts (#4407) 2026-03-16 10:42:07 +01:00
Nikola Jokic 276717a04b
Manually bump dependencies since it needs fixes related to the controller runtime API (#4406) 2026-03-16 10:09:36 +01:00
Nikola Jokic aa031d3902
Introduce experimental chart release (#4373) 2026-03-16 10:09:05 +01:00
Nikola Jokic f99c6eda0b
Moving to scaleset client for the controller (#4390) 2026-03-13 14:36:41 +01:00
Nikola Jokic 1d9f626c53
Allow users to apply labels and annotations to internal resources (#4400) 2026-03-12 10:32:54 +01:00
dependabot[bot] 1f3e5b9027
Bump the actions group across 1 directory with 6 updates (#4402)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-03-11 16:54:22 +01:00
Nikola Jokic cd5b93d1bc
Bump Go version (#4398) 2026-03-11 10:24:20 +01:00
github-actions[bot] 396ee88f5a
Updates: runner to v2.332.0 container-hooks to v0.8.1 (#4388)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-03-03 01:02:40 +01:00
gateixeira 1f615c1a33
feat: add default linux nodeSelector to listener pod (#4377)
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
2026-02-24 17:56:39 +01:00
Nikola Jokic 8b7fd9ffef
Switch client to scaleset library for the listener and update mocks (#4383) 2026-02-24 14:17:31 +01:00
Nikola Jokic c6e4c94a6a
Fix tests and generate mocks (#4384) 2026-02-24 13:36:01 +01:00
dhawalseth 9de09f56eb
Include the HTTP status code in jit error (#4361)
Co-authored-by: Dhawal Seth <dseth@linkedin.com>
2026-01-29 16:40:17 +01:00
Caius Durling 02aa70a64a
Fix `AcivityId` typo in error strings (#4359) 2026-01-21 01:14:26 +01:00
Jiaren Wu d3ca9de3ca
Potential fix for code scanning alert no. 7: Use of a broken or weak cryptographic hashing algorithm on sensitive data (#4353)
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2026-01-14 21:04:02 -08:00
github-actions[bot] a868229fe0
Updates: runner to v2.331.0 (#4351)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-01-14 13:32:39 -05:00
Nikola Jokic a505fb5616
Prepare 0.13.1 release (#4341) 2025-12-23 14:57:05 +01:00