actions-runner-controller

Commit Graph

Author	SHA1	Message	Date
Jean Schmidt	d5d94fba48	HUD failure fallback: over-provision placeholders - Add configurable HUDFailureMultiplier (default 3x) to scale placeholder count when HUD API is unreachable - New env var CAPACITY_AWARE_HUD_FAILURE_MULTIPLIER with clamp ≥1 in both ConfigFromEnv and Validate - Fallback formula: ProactiveCapacity * multiplier (replaces the previous zero-queued-jobs fallback that reduced capacity) - Add tests for multiplier clamping, MaxRunners cap interaction, and HUD-disabled path - Bump chart versions to jeanschmidt.10 When HUD is down we lose visibility into queue depth, so the old fallback of assuming 0 queued jobs was backwards — it shrank capacity exactly when we had the least information. The multiplier-based fallback leans toward over-provisioning instead; existing safety bounds (MaxRunners headroom, MaxBurstCapacity) still cap the blast radius. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-05-15 14:23:36 -07:00
Jean Schmidt	4714643523	Require runner-class in workflow affinity - Promote runner-class from preferred (weight 100) to required node affinity term, matching the actual workflow pod's scheduling - Use DoesNotExist operator when RunnerClass is unset - AND-combine runner-class and GPU label in same matchExpressions block - Add table-driven tests for all runner-class + GPU combinations - Bump chart versions to jeanschmidt.9 A preferred runner-class term let placeholders land on non-matching nodes where the real workflow pod (which uses a required term) could never follow — wasting the reservation. Making it required ensures placeholders only occupy nodes the actual pod can schedule onto. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-05-05 11:42:02 -07:00
Jean Schmidt	d219a11c89	Detect and replace broken placeholder pairs - Add CleanupBroken to PlaceholderManager to find slots where one of two pods was evicted/deleted - Integrate broken-pair cleanup into reconcileProvisioning between ListPairs and adjustPairs so replacement happens in the same cycle - Add "broken" delete reason with Prometheus metrics - Add unit tests for both successful and failed cleanup - Bump Helm chart versions to jeanschmidt.8 Without this fix, a broken pair (one pod missing) would count as healthy in currentPairs, causing the provisioner to believe capacity was at desired level. Pre-warmed capacity would be permanently reduced until the next full listener restart. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-05-01 13:26:59 -07:00
Jean Schmidt	89a51137dc	Bump chart versions to jeanschmidt.7 - Update gha-runner-scale-set-controller chart and app version - Update gha-runner-scale-set chart and app version Reflects recent changes: batched runner pod listing into a single API call, and added MaxBurstCapacity/MaxRunners headroom support. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-30 20:34:37 -07:00
Jean Schmidt	4203ba9489	Bump chart versions to jeanschmidt.6 - Bump gha-runner-scale-set-controller chart and appVersion to 0.14.1-jeanschmidt.6 - Bump gha-runner-scale-set chart and appVersion to 0.14.1-jeanschmidt.6 Follows addition of Prometheus metrics to the capacity monitor in the previous commit. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-29 13:44:22 -07:00
Jean Schmidt	e3dfeae67f	Drop runner-class from runner placeholders - Remove osdc.io/runner-class nodeSelector and toleration from runner placeholder pods; runner-pool nodes carry no such label or taint, causing placeholders to stay Pending forever - Add regression test TestRunnerPlaceholder_NeverIncludesRunnerClass covering both RunnerClass-set and RunnerClass-empty configs - Add TestWorkflowPlaceholder_StillUsesRunnerClassInPreferredAffinity to verify workflow placeholders still use runner-class correctly - Update existing tests to assert runner-class is always absent - Bump chart versions to jeanschmidt.5 The runner-pool fleet (e.g. c7i-runner) is a shared cluster-wide pool that does not carry osdc.io/runner-class labels or taints. Only workflow placeholders use runner-class, via preferred node affinity (weight 100). Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-29 03:30:46 -07:00
Jean Schmidt	2c07a07a97	Split runner/workflow placeholder fleets - Add RunnerNodeFleet config to place runner placeholders on the cluster-wide runner pool instead of the per-scale-set workflow pool - Change git-cache-not-ready toleration to operator:Exists to match the unconditional startupTaint on runner-pool nodes - Make Config.Validate() return an error and require RunnerNodeFleet when capacity-aware mode is enabled - Add split-fleet tests verifying runner and workflow placeholders never conflate each other's node-fleet values - Bump chart versions to jeanschmidt.4 Runner and workflow pods target different node pools (e.g. c7i-runner vs g4dn). Previously both used NodeFleet, which silently landed runner placeholders on the wrong pool — defeating the topology separation the placeholder system is meant to provide. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-29 02:36:13 -07:00
Jean Schmidt	805c698b8e	bump version Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-28 18:44:17 -07:00
Jean Schmidt	8e5df744fd	Add defensive sleep timeout to placeholders - Add sleepArg() that bounds placeholder pod lifetime to 1.5x PlaceholderTimeout, preventing pod leaks if listener crashes - Fix GPU node selector label from nvidia.com/gpu.present to nvidia.com/gpu - Bump chart versions to 0.14.1-jeanschmidt.2 - Add tests for timeout-based sleep, infinity fallback, and sub-second floor-to-1 behavior Previously placeholder pods ran `sleep infinity`, meaning a listener crash would leave them running forever. The new defensive sleep self-terminates pods after 1.5x the configured timeout, acting as a safety net behind the normal CleanupAll/CleanupTimedOut path. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-28 17:53:55 -07:00
Jean Schmidt	adf2791de5	Point images and charts to personal fork - Add OCI source label to Dockerfile for jeanschmidt fork - Bump chart/app versions to 0.14.1-jeanschmidt.1 - Redirect controller image repo to ghcr.io/jeanschmidt Enables building and deploying the placeholder pod POC from the forked registry for testing. Signed-off-by: Jean Schmidt <contato@jschmidt.me>	2026-04-28 16:40:34 -07:00
Junya Okabe	a401686bd5	Add option to disable workqueue bucket rate limiter (#4451)	2026-04-22 23:26:39 +02:00
Francesco Renzi	74cfc3855e	Prepare 0.14.1 release (#4448)	2026-04-14 17:03:22 +01:00
Nikola Jokic	8b7f232dc4	Prepare 0.14.0 release (#4413)	2026-03-19 18:53:37 +01:00
Nikola Jokic	802dc28d38	Add multi-label support to scalesets (#4408)	2026-03-19 15:29:40 +01:00
Nikola Jokic	9bc1c9e53e	Shutdown the scaleset when runner is deprecated (#4404) Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>	2026-03-19 13:30:20 +01:00
Nikola Jokic	276717a04b	Manually bump dependencies since it needs fixes related to the controller runtime API (#4406)	2026-03-16 10:09:36 +01:00
Nikola Jokic	1d9f626c53	Allow users to apply labels and annotations to internal resources (#4400)	2026-03-12 10:32:54 +01:00
Nikola Jokic	a505fb5616	Prepare 0.13.1 release (#4341)	2025-12-23 14:57:05 +01:00
Nikola Jokic	a0c30df25b	Prepare 0.13.0 release (#4280)	2025-10-16 19:25:56 +02:00
Nikola Jokic	634e42c916	Bump all dependencies (#4266)	2025-10-14 13:24:25 +02:00
Nikola Jokic	088e2a3a90	Remove ephemeral runner when exit code != 0 and is patched with the job (#4239)	2025-09-17 21:40:37 +02:00
Nikola Jokic	c27541140a	Remove JIT config from ephemeral runner status field (#4191)	2025-08-04 12:35:04 +02:00
Alex Hatzenbuhler	a07dce28bb	Remove deprecated preserveUnknownFields from CRDs (#4135)	2025-07-24 08:47:34 +02:00
Nikola Jokic	349cc0835e	Fix image pull secrets list arguments in the chart (#4164)	2025-07-01 15:28:18 +02:00
Nikola Jokic	ded39bede6	Prepare 0.12.1 release (#4153)	2025-06-27 13:49:47 +02:00
Nikola Jokic	d9826e5244	Prepare 0.12.0 release (#4122)	2025-06-13 14:23:26 +02:00
Nikola Jokic	e46c929241	Azure Key Vault integration to resolve secrets (#4090)	2025-06-11 15:53:33 +02:00
Nikola Jokic	cae7efa2c6	Create backoff mechanism for failed runners and allow re-creation of failed ephemeral runners (#4059)	2025-05-14 15:38:50 +02:00
Nikola Jokic	4ca37fbdf2	Prepare 0.11.0 release (#3992)	2025-03-25 11:09:03 +01:00
Nikola Jokic	5a960b5ebb	Create configurable metrics (#3975)	2025-03-24 15:27:42 +01:00
Nikola Jokic	7033e299cd	Add events role permission to leader_election_role (#3988)	2025-03-24 15:10:47 +01:00
J. Fernández	3c1a323381	feat: allow namespace overrides (#3797) Signed-off-by: Jesús Fernández <7312236+fernandezcuesta@users.noreply.github.com> Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>	2025-03-18 21:41:04 +01:00
Nikola Jokic	fb9b96bf75	Update all dependencies, conforming to the new controller-runtime API (#3949)	2025-03-11 15:52:52 +01:00
Mikey Smet	75c6a94010	Use gha-runner-scale-set-controller.chart instead of .Chart.Version (#3729) Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>	2025-03-10 11:48:30 +01:00
Nikola Jokic	7a5996f467	Remove old githubrunnerscalesetlistener, remove warning and fix config bug (#3937)	2025-03-07 11:58:16 +01:00
Nikola Jokic	66172ab0bd	Fix template tests and add go test on gha-validate-chart (#3886)	2025-01-15 15:54:33 +01:00
Bassem Dghaidi	1e10417be8	Prepare `0.10.1` release (#3859)	2024-12-18 16:22:50 +01:00
Bassem Dghaidi	1ef7196115	Fix helm chart bug related to `runnerMaxConcurrentReconciles` (#3858)	2024-12-18 16:14:55 +01:00
Bassem Dghaidi	59cb1d2c8b	Prepare `0.10.0` release (#3849)	2024-12-16 11:39:55 +01:00
Bassem Dghaidi	7e04027d19	Make k8s client rate limiter parameters configurable (#3848) Co-authored-by: Taketoshi Fujiwara <t-b-fujiwara@mercari.com>	2024-12-13 15:37:01 +01:00
Yusuke Kuoka	3998f6dee6	Make EphemeralRunnerController MaxConcurrentReconciles configurable (#3832) Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>	2024-12-11 21:19:43 +01:00
Nikola Jokic	80d848339e	Prepare 0.9.3 release (#3624)	2024-06-25 12:35:39 +02:00
Nikola Jokic	a62ca3d853	Exclude label prefix propagation (#3607)	2024-06-21 12:12:14 +02:00
Nikola Jokic	3be7128f9a	Prepare 0.9.2 release (#3530)	2024-05-20 10:58:06 +02:00
Nikola Jokic	ea13873f14	Remove service monitor that is not used in controller chart (#3526)	2024-05-17 13:06:57 +02:00
Nikola Jokic	9e191cdd21	Prepare 0.9.1 release (#3448)	2024-04-17 10:51:28 +02:00
Alexandre Chouinard	0006dd5eb1	Add topologySpreadConstraint to gha-runner-scale-set-controller chart (#3405)	2024-04-12 14:22:41 +02:00
Nikola Jokic	4357525445	Prepare 0.9.0 release (#3388)	2024-03-27 11:54:17 +01:00
Nikola Jokic	7a643a5107	Fix overscaling when the controller is much faster then the listener (#3371) Co-authored-by: Francesco Renzi <rentziass@gmail.com>	2024-03-20 15:36:12 +01:00
Nikola Jokic	a7af44e042	Deprecation warning of older listener for 0.9.0 release (#3280)	2024-03-18 12:59:41 +01:00

93 Commits