actions-runner-controller

Commit Graph

Author	SHA1	Message	Date
Jean Schmidt	4714643523	Require runner-class in workflow affinity - Promote runner-class from preferred (weight 100) to required node affinity term, matching the actual workflow pod's scheduling - Use DoesNotExist operator when RunnerClass is unset - AND-combine runner-class and GPU label in same matchExpressions block - Add table-driven tests for all runner-class + GPU combinations - Bump chart versions to jeanschmidt.9 A preferred runner-class term let placeholders land on non-matching nodes where the real workflow pod (which uses a required term) could never follow — wasting the reservation. Making it required ensures placeholders only occupy nodes the actual pod can schedule onto. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-05-05 11:42:02 -07:00
Jean Schmidt	d219a11c89	Detect and replace broken placeholder pairs - Add CleanupBroken to PlaceholderManager to find slots where one of two pods was evicted/deleted - Integrate broken-pair cleanup into reconcileProvisioning between ListPairs and adjustPairs so replacement happens in the same cycle - Add "broken" delete reason with Prometheus metrics - Add unit tests for both successful and failed cleanup - Bump Helm chart versions to jeanschmidt.8 Without this fix, a broken pair (one pod missing) would count as healthy in currentPairs, causing the provisioner to believe capacity was at desired level. Pre-warmed capacity would be permanently reduced until the next full listener restart. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-05-01 13:26:59 -07:00
Jean Schmidt	a3c294ab14	Merge pull request #3 from jeanschmidt/jeanschmidt/proactive_capacity_max_runners Add MaxBurstCapacity cap and fix MaxRunners headroom calculation	2026-05-01 12:39:52 -07:00
Jean Schmidt	89a51137dc	Bump chart versions to jeanschmidt.7 - Update gha-runner-scale-set-controller chart and app version - Update gha-runner-scale-set chart and app version Reflects recent changes: batched runner pod listing into a single API call, and added MaxBurstCapacity/MaxRunners headroom support. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-30 20:34:37 -07:00
Jean Schmidt	24bf64ada6	Batch runner pod listing into single API call - Merge two per-phase countRunnersByPhaseWithRetry calls into one that returns a map[PodPhase]int, halving API calls to the apiserver - Drop FieldSelector filtering; group pods by phase in code only - Guard headroom calculation with max(0, ...) to prevent negative clamp - Always log maxBurstCapacity instead of conditionally appending it The two separate List calls (Running, Pending) were redundant — one unfiltered List grouped in-code gives the same result with half the API traffic. The fake clientset never honored FieldSelector anyway, so removing it also eliminates the test/prod behavioral divergence noted in the old comment. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-30 18:20:01 -07:00
Jean Schmidt	f6c56d3af0	Add MaxBurstCapacity and MaxRunners headroom - Add MaxBurstCapacity config to cap placeholder pairs per cycle, preventing burst node provisioning from overloading downstream services (git-cache, Harbor, pypi-cache) - Fix MaxRunners headroom: subtract real runner pods (Running + Pending) so placeholders only fill remaining capacity - Generalize countRunningRunners to countRunnersByPhase for reuse across both provisioner and reporter loops - Add Prometheus gauge for max_burst_capacity config value - Add tests for burst cap, headroom clamp, pending pod accounting, runner-count errors, and edge cases The previous MaxRunners clamp (`min(desired, MaxRunners)`) allowed up to MaxRunners placeholders ON TOP of real runners, effectively doubling the cap. The headroom fix subtracts actual Running+Pending runner pods from MaxRunners before clamping. On runner-count failure the cycle is skipped entirely — treating a failed count as 0 would silently re-open the doubling bug during the failure window. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-30 12:17:55 -07:00
Jean Schmidt	69136c719d	Merge pull request #2 from jeanschmidt/jeanschmidt/proactive_capacity_metrics Add Prometheus metrics to proactive capacity monitor	2026-04-29 14:06:57 -07:00
Jean Schmidt	4203ba9489	Bump chart versions to jeanschmidt.6 - Bump gha-runner-scale-set-controller chart and appVersion to 0.14.1-jeanschmidt.6 - Bump gha-runner-scale-set chart and appVersion to 0.14.1-jeanschmidt.6 Follows addition of Prometheus metrics to the capacity monitor in the previous commit. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-29 13:44:22 -07:00
Jean Schmidt	24a4afc263	Add Prometheus metrics to capacity monitor - Define CapacityRecorder interface with 15 metrics (gauges, counters, histograms) covering reconcile loops, HUD API, pair lifecycle, and placeholder pod phases - Instrument monitor reconcile paths with duration histograms, skip/error counters, and last-success timestamps for wedge detection - Refactor CleanupTimedOut and CleanupOrphans to return per-pair success/failure counts for accurate metric emission - Add parity test enforcing Go metric registry stays in sync with the OSDC chart's listenerMetrics allowlist via YAML fixture - Wire CapacityRecorder into monitor via WithRecorder option pattern, defaulting to no-op discard for backward compatibility Reconcile-last-success gauges are seeded at startup to avoid spurious wedge alerts during the window between listener restart and first completed reconcile. Placeholder pod phase gauges emit all (role x phase) combinations including zeros so Prometheus gauges decrement correctly when phases empty out. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-29 13:35:05 -07:00
Jean Schmidt	e3dfeae67f	Drop runner-class from runner placeholders - Remove osdc.io/runner-class nodeSelector and toleration from runner placeholder pods; runner-pool nodes carry no such label or taint, causing placeholders to stay Pending forever - Add regression test TestRunnerPlaceholder_NeverIncludesRunnerClass covering both RunnerClass-set and RunnerClass-empty configs - Add TestWorkflowPlaceholder_StillUsesRunnerClassInPreferredAffinity to verify workflow placeholders still use runner-class correctly - Update existing tests to assert runner-class is always absent - Bump chart versions to jeanschmidt.5 The runner-pool fleet (e.g. c7i-runner) is a shared cluster-wide pool that does not carry osdc.io/runner-class labels or taints. Only workflow placeholders use runner-class, via preferred node affinity (weight 100). Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-29 03:30:46 -07:00
Jean Schmidt	2c07a07a97	Split runner/workflow placeholder fleets - Add RunnerNodeFleet config to place runner placeholders on the cluster-wide runner pool instead of the per-scale-set workflow pool - Change git-cache-not-ready toleration to operator:Exists to match the unconditional startupTaint on runner-pool nodes - Make Config.Validate() return an error and require RunnerNodeFleet when capacity-aware mode is enabled - Add split-fleet tests verifying runner and workflow placeholders never conflate each other's node-fleet values - Bump chart versions to jeanschmidt.4 Runner and workflow pods target different node pools (e.g. c7i-runner vs g4dn). Previously both used NodeFleet, which silently landed runner placeholders on the wrong pool — defeating the topology separation the placeholder system is meant to provide. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-29 02:36:13 -07:00
Jean Schmidt	805c698b8e	bump version Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-28 18:44:17 -07:00
Jean Schmidt	8e5df744fd	Add defensive sleep timeout to placeholders - Add sleepArg() that bounds placeholder pod lifetime to 1.5x PlaceholderTimeout, preventing pod leaks if listener crashes - Fix GPU node selector label from nvidia.com/gpu.present to nvidia.com/gpu - Bump chart versions to 0.14.1-jeanschmidt.2 - Add tests for timeout-based sleep, infinity fallback, and sub-second floor-to-1 behavior Previously placeholder pods ran `sleep infinity`, meaning a listener crash would leave them running forever. The new defensive sleep self-terminates pods after 1.5x the configured timeout, acting as a safety net behind the normal CleanupAll/CleanupTimedOut path. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-28 17:53:55 -07:00
Jean Schmidt	1d0ff0f1d8	Merge pull request #1 from jeanschmidt/jeanschmidt/placeholder_run_poc Add capacity-aware placeholder pod pre-warming for GitHub Actions runners	2026-04-28 16:43:33 -07:00
Jean Schmidt	adf2791de5	Point images and charts to personal fork - Add OCI source label to Dockerfile for jeanschmidt fork - Bump chart/app versions to 0.14.1-jeanschmidt.1 - Redirect controller image repo to ghcr.io/jeanschmidt Enables building and deploying the placeholder pod POC from the forked registry for testing. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-28 16:40:34 -07:00
Jean Schmidt	1944a96710	Split monitor into provisioner and reporter - Decouple provisioning (create/delete pairs) from capacity reporting (count running pairs/runners, call setMaxRunners) into independent loops with separate tick intervals - Add exponential-backoff retry for K8s API and HUD API calls with per-loop retry budgets (reporter: 2, provisioner: 3) - Use atomic.Int64 for slotCounter to support concurrent access - Add ordered shutdown: reporter stops before placeholder cleanup to prevent a flash of reportedCapacity=0 The single reconcile loop coupled provisioning and reporting at the same interval (30s). Reporting needs to react faster when placeholder pods become Running (node warm-up), so it now ticks independently at 5s (configurable via CAPACITY_AWARE_REPORT_INTERVAL). Retries prevent transient K8s API failures from causing missed cycles or premature capacity drops. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-28 09:12:50 -07:00
Jean Schmidt	0fa4cb29ce	Add anchor ConfigMap and configurable HUD URL - Add anchor ConfigMap as OwnerReference for placeholder pods, enabling Kubernetes GC cascade-delete across namespaces - Make HUD API URL configurable via CAPACITY_AWARE_HUD_API_URL env var with sensible default query parameters baked into the URL - Simplify HUDClient by storing URL as instance field, removing the need for rewriteTransport test helper - Extend CleanupOrphans to also delete orphaned anchor ConfigMaps The anchor ConfigMap solves cross-namespace ownership: the listener runs in arc-systems but placeholders live in arc-runners, so the listener pod cannot be a direct OwnerReference. The anchor lives in the placeholder namespace and owns all placeholder pods, so deleting it triggers automatic GC of all associated pods. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-27 12:56:10 -07:00
Jean Schmidt	6c56d56043	Add capacity-aware placeholder pod pre-warming - Add capacity package with monitor, placeholder manager, HUD client, and env-based config for proactive node pre-warming - Create runner+workflow placeholder pod pairs that mirror the actual runner.yaml.tpl specs (nodeSelector, tolerations, affinity, resources) - Integrate monitor into ghalistener main.go as an optional goroutine gated by CAPACITY_AWARE_ENABLED env var - Query PyTorch HUD API for queued job counts to dynamically scale placeholder count beyond the static proactive capacity baseline - Add comprehensive test coverage (~1440 lines) including pod spec fidelity, orphan cleanup, scale-down preference, and idempotency The listener currently reports a static maxRunners to GitHub, which means nodes are only provisioned after jobs are already queued. This causes cold-start latency while Karpenter spins up nodes. Placeholder pods reserve cluster capacity ahead of demand so that real runner pods can schedule immediately onto pre-warmed nodes, then get evicted via priority class preemption when actual work arrives. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-23 17:01:12 -07:00
Junya Okabe	a401686bd5	Add option to disable workqueue bucket rate limiter (#4451)	2026-04-22 23:26:39 +02:00
github-actions[bot]	012f1a5b23	Updates: runner to v2.334.0 (#4467) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2026-04-22 17:26:50 +02:00
Gleb Khaykin	e0feb3b711	Fix orphan no-permission ServiceAccount in kubernetes-novolume mode (#4455)	2026-04-20 13:31:23 +02:00
Francesco Renzi	74cfc3855e	Prepare 0.14.1 release (#4448)	2026-04-14 17:03:22 +01:00
Francesco Renzi	eb1544f848	Bump actions/scaleset to v0.3.0 (#4447)	2026-04-14 14:08:22 +01:00
Nikola Jokic	79e7b17b56	Fix null field for resource metadata fields in experimental chart (#4419)	2026-04-02 23:44:37 +02:00
github-actions[bot]	39934ce5eb	Updates: runner to v2.333.1 (#4427) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2026-03-31 19:35:28 -05:00
github-actions[bot]	5f4c132f12	Updates: runner to v2.333.0 (#4412) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2026-03-23 12:46:49 +01:00
Vinayak Gaikwad	0d1e2b3e74	remove redundant ticks around "name" and use plural (#3661)	2026-03-23 12:46:13 +01:00
Nikola Jokic	104bc6b0b0	Fix chart version for publishing (#4415)	2026-03-19 18:13:17 +00:00
Nikola Jokic	8b7f232dc4	Prepare 0.14.0 release (#4413)	2026-03-19 18:53:37 +01:00
Nikola Jokic	19f22b85e7	Add @steve-glass to CODEOWNERS (#4414)	2026-03-19 18:24:00 +01:00
Nikola Jokic	802dc28d38	Add multi-label support to scalesets (#4408)	2026-03-19 15:29:40 +01:00
Nikola Jokic	9bc1c9e53e	Shutdown the scaleset when runner is deprecated (#4404) Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-19 13:30:20 +01:00
Nikola Jokic	40595d806f	Add chart-level API to customize internal resources (#4410)	2026-03-18 14:44:30 +01:00
Nikola Jokic	dc7c858e68	Remove actions client (#4405)	2026-03-16 14:39:55 +01:00
Nikola Jokic	2fc51aaf32	Regenerate manifests for experimental charts (#4407)	2026-03-16 10:42:07 +01:00
Nikola Jokic	276717a04b	Manually bump dependencies since it needs fixes related to the controller runtime API (#4406)	2026-03-16 10:09:36 +01:00
Nikola Jokic	aa031d3902	Introduce experimental chart release (#4373)	2026-03-16 10:09:05 +01:00
Nikola Jokic	f99c6eda0b	Moving to scaleset client for the controller (#4390)	2026-03-13 14:36:41 +01:00
Nikola Jokic	1d9f626c53	Allow users to apply labels and annotations to internal resources (#4400)	2026-03-12 10:32:54 +01:00
dependabot[bot]	1f3e5b9027	Bump the actions group across 1 directory with 6 updates (#4402) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-03-11 16:54:22 +01:00
Nikola Jokic	cd5b93d1bc	Bump Go version (#4398)	2026-03-11 10:24:20 +01:00
github-actions[bot]	396ee88f5a	Updates: runner to v2.332.0 container-hooks to v0.8.1 (#4388) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2026-03-03 01:02:40 +01:00
gateixeira	1f615c1a33	feat: add default linux nodeSelector to listener pod (#4377) Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>	2026-02-24 17:56:39 +01:00
Nikola Jokic	8b7fd9ffef	Switch client to scaleset library for the listener and update mocks (#4383)	2026-02-24 14:17:31 +01:00
Nikola Jokic	c6e4c94a6a	Fix tests and generate mocks (#4384)	2026-02-24 13:36:01 +01:00
dhawalseth	9de09f56eb	Include the HTTP status code in jit error (#4361) Co-authored-by: Dhawal Seth <dseth@linkedin.com>	2026-01-29 16:40:17 +01:00
Caius Durling	02aa70a64a	Fix `AcivityId` typo in error strings (#4359)	2026-01-21 01:14:26 +01:00
Jiaren Wu	d3ca9de3ca	Potential fix for code scanning alert no. 7: Use of a broken or weak cryptographic hashing algorithm on sensitive data (#4353) Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>	2026-01-14 21:04:02 -08:00
github-actions[bot]	a868229fe0	Updates: runner to v2.331.0 (#4351) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2026-01-14 13:32:39 -05:00
Nikola Jokic	a505fb5616	Prepare 0.13.1 release (#4341)	2025-12-23 14:57:05 +01:00


1831 Commits
