* fix: resolve issues #2295, #2296, #2297 and OCI registry login
This PR fixes four related bugs affecting chart preparation, caching,
and OCI registry authentication.
Issue #2295: OCI chart cache conflicts with parallel helmfile processes
- Added filesystem-level locking using flock for cross-process sync
- Implements double-check locking pattern for efficiency
- Retry logic with 5-minute timeout and 3 retries
- Refactored into reusable acquireChartLock() helper function
- Added refresh marker coordination for cross-process cache management
Issue #2296: helmDefaults.skipDeps and helmDefaults.skipRefresh ignored
- Check both CLI options AND helmDefaults when deciding to skip repo sync
Issue #2297: Local chart + transformers causes panic
- Normalize local chart paths to absolute before calling chartify
OCI Registry Login URL Fix:
- Added extractRegistryHost() to extract just the registry host from URLs
- Fixed SyncRepos to use extracted host for OCI registry login
- e.g., "account.dkr.ecr.region.amazonaws.com/charts" ->
"account.dkr.ecr.region.amazonaws.com"
Test Plan:
- Unit tests for issues #2295, #2296, #2297
- Unit tests for OCI registry login (extractRegistryHost, SyncRepos_OCI)
- Integration tests for issues #2295 and #2297
- All existing unit tests pass (including TestLint)
Fixes#2295Fixes#2296Fixes#2297
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
* fix: replace 60s timeout with reader-writer locks for OCI chart caching
Address PR review feedback from @champtar about the OCI chart caching
mechanism. The previous implementation used a 60-second timeout which
was arbitrary and caused race conditions when helm deployments took
longer (e.g., deployments triggering scaling up/down).
Changes:
- Replace 60s refresh marker timeout with proper reader-writer locks
- Use shared locks (RLock) when using cached charts (allows concurrent reads)
- Use exclusive locks (Lock) when refreshing/downloading charts
- Hold locks during entire helm operation lifecycle (not just during download)
- Add getNamedRWMutex() for in-process RW coordination
- Update PrepareCharts() to return locks map for lifecycle management
- Add chartLockReleaser in run.go to release locks after helm callback
- Remove unused mutexMap and getNamedMutex (replaced by RW versions)
- Add comprehensive tests for shared/exclusive lock behavior
This eliminates the race condition where one process could delete a
cached chart while another process's helm command was still using it.
Fixes review comment on PR #2298
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
* fix: prevent deadlock when multiple releases share the same chart
When multiple releases use the same OCI chart (e.g., same chart different
values), workers in PrepareCharts would deadlock:
1. Worker 1 acquires lock for chart/path, downloads, adds to cache
2. Worker 2 finds chart in cache, tries to acquire lock on same path
3. Worker 2 blocks waiting for Worker 1's lock
4. Collector waits for Worker 2's result
5. Worker 1's lock held until PrepareCharts finishes -> deadlock
The fix: when using the in-memory chart cache (which means another worker
in the same process already downloaded the chart), don't acquire another
lock. This is safe because:
- The in-memory cache is only used within a single helmfile process
- The tempDir cleanup is deferred until after helm callback completes
- Cross-process coordination is still handled by file locks during downloads
This fixes the "signal: killed" test failures in CI for:
- oci_chart_pull_direct
- oci_chart_pull_once
- oci_chart_pull_once2
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
* fix: resolve deadlock by releasing OCI chart locks immediately after download
This commit simplifies the OCI chart locking mechanism to fix deadlock
issues that occurred when multiple releases shared the same chart.
Problem:
When multiple releases used the same OCI chart, workers in PrepareCharts
would deadlock because:
1. Worker 1 acquires lock for chart/path, downloads chart
2. Worker 2 tries to acquire lock on same path, blocks waiting
3. PrepareCharts waits for all workers to complete
4. Worker 1's lock held until PrepareCharts finishes -> deadlock
Solution:
Release locks immediately after chart download completes. This is safe
because:
- The tempDir cleanup is deferred until after helm operations complete
in withPreparedCharts(), so charts won't be deleted mid-use
- The in-memory chart cache prevents redundant downloads within a process
- Cross-process coordination via file locks still works during download
Changes:
- Remove chartLock field from chartPrepareResult struct
- Release locks immediately in getOCIChart() and forcedDownloadChart()
- Simplify PrepareCharts() by removing lock collection and release logic
- Update function signatures to return only (path, error)
This also fixes the "signal: killed" test failures in CI for:
- oci_chart_pull_direct
- oci_chart_pull_once
- oci_chart_pull_once2
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
* fix: add double-check locking for in-memory chart cache
When multiple workers concurrently process releases using the same chart,
they all check the in-memory cache before acquiring locks. If none have
populated the cache yet, all workers miss and try to download.
Previously, even after acquiring the exclusive lock, the code would
re-download the chart when needsRefresh=true (the default). This caused
multiple "Pulling" messages in tests like oci_chart_pull_once.
The fix adds a second in-memory cache check AFTER acquiring the lock.
This implements proper double-check locking:
1. Check cache (outside lock) → miss
2. Acquire lock
3. Check cache again (inside lock) → hit if another worker populated it
4. If still miss, download and add to cache
This ensures only one worker downloads the chart, while others use
the cached version populated by the first worker.
Changes:
- Add in-memory cache double-check in getOCIChart() after acquiring lock
- Add in-memory cache double-check in forcedDownloadChart() after acquiring lock
This fixes the oci_chart_pull_once and oci_chart_pull_direct test failures
where charts were being pulled multiple times instead of once.
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
* fix: use callback to prevent redundant chart downloads within a process
When multiple workers concurrently process releases using the same chart,
they need to coordinate to avoid redundant downloads. The previous fix
set SkipRefresh=true for OCI charts, which prevented legitimate refresh
scenarios (e.g., floating tags).
This commit implements a better solution using a callback mechanism:
1. acquireChartLock() now accepts an optional skipRefreshCheck callback
2. Before deleting a cached chart for refresh, the callback is invoked
3. If the callback returns true (in-memory cache has the chart), skip refresh
4. This allows deduplication within a process while respecting cross-run refresh
The flow is now:
- Worker 1 downloads chart, adds to in-memory cache, releases lock
- Worker 2 acquires lock, sees needsRefresh=true, but callback sees
in-memory cache is populated → uses cached instead of deleting
This correctly handles:
- Within-process deduplication: only one download per chart
- Cross-run refresh: respects --skip-refresh flag for floating tags
- Immutable versions: cached and reused as expected
Changes:
- Add skipRefreshCheck callback parameter to acquireChartLock()
- Update getOCIChart() to pass in-memory cache check callback
- Update forcedDownloadChart() to pass in-memory cache check callback
- Remove SkipRefresh=true workaround for OCI charts
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
* fix: address Copilot review comments on PR #2298
This commit addresses the automated review comments from GitHub Copilot:
1. pkg/state/state.go: Add nil check for logger in Release() method
to prevent potential nil pointer dereference when logger is nil.
2. pkg/state/state.go: Fix misleading comment about "external callers"
to accurately reflect that Logger() is used by the app package.
3. pkg/state/issue_2296_test.go: Add comment noting that boolPtr helper
is already defined in skip_test.go (shared across test files).
4. test/integration/test-cases/oci-parallel-pull.sh: Replace hardcoded
/tmp paths with a dedicated temp directory for test outputs. Add
cleanup for the output directory in the cleanup function.
5. test/integration/test-cases/issue-2297-local-chart-transformers.sh:
Add cleanup trap to remove temp directory on exit, preventing
leftover files from accumulating.
6. Remove dead code: The chartLocks map in PrepareCharts was always
empty since locks are released immediately after download. Removed
the unused return value and corresponding handling in run.go to
improve code clarity and maintainability.
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
* fix: make oci-parallel-pull test resilient to registry issues
The integration test was intermittently failing in CI due to Docker Hub
rate limiting or network issues. These failures are not helmfile bugs.
Changes:
- Add is_registry_error() function to detect external registry issues
(rate limits, network timeouts, connection refused, etc.)
- Check for the race condition bug (issue #2295) first and fail fast
- If other failures occur, check if they're registry-related
- Skip test gracefully when registry issues are detected instead of
failing CI on external infrastructure problems
This ensures the test still catches the actual race condition bug while
not causing false failures due to Docker Hub rate limits in CI.
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
* fix: make oci-parallel-pull test resilient to registry issues
The integration test was failing in CI for two reasons:
1. Docker Hub rate limiting or network issues causing helmfile to fail
2. The test script exits early due to `set -e` when `wait` returns non-zero
Changes:
- Use `wait $pid || exit=$?` pattern to capture exit codes without triggering
set -e. When wait returns non-zero, the || branch captures the exit code
into the variable, preventing script termination.
- Add is_registry_error() function to detect external registry issues
(rate limits, network timeouts, connection refused, etc.)
- Check for the race condition bug (issue #2295) first and fail fast
- Skip test gracefully when registry issues are detected instead of
failing CI on external infrastructure problems
This ensures the test still catches the actual race condition bug while
not causing false failures due to Docker Hub rate limits in CI.
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
* fix: address PR #2298 review - reinitialize fileLock after release
Address Copilot review comments:
1. pkg/state/state.go: Reinitialize fileLock after releasing shared lock
When upgrading from shared to exclusive lock, the fileLock needs to be
reinitialized with flock.New() after calling Release(). This ensures
a fresh flock object is used for the exclusive lock acquisition.
2. test/integration/test-cases/oci-parallel-pull.sh: Add lock file
verification warning if no lock files are found, to ensure the
locking mechanism is actually being tested.
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
* fix: address PR #2298 Copilot review comments (round 4)
Address 8 Copilot review comments:
1. pkg/state/state.go: Release in-process mutex during retry backoff
to avoid blocking other goroutines for up to 90 seconds.
2. pkg/state/state.go: Include chartPath in shared lock error message
for better debugging.
3. pkg/state/state.go: Document that extractRegistryHost does not handle
URLs with query parameters or fragments (uncommon for OCI registries).
4. pkg/state/state.go: Document that skipRefreshCheck callback should be
fast and non-blocking since it runs while holding exclusive lock.
5. oci-parallel-pull.sh: Use case-insensitive grep (-i flag) to catch
error variations like "I/O timeout".
6. helmfile.yaml: Expand comment explaining why library charts can't be
used for this test (they can't be templated by Helm).
Skipped (with justification):
- PrepareChartKey helper: Only 2 usages with different source structs
- Context reuse in retry: Per-attempt contexts provide clearer semantics
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
* fix: address PR #2298 Copilot review comments (round 5)
1. Make race condition detection grep more robust (oci-parallel-pull.sh)
- Use case-insensitive extended regex (-iqE)
- Add multiple pattern variations to catch different tar/helm versions
2. Remove unused Logger() method from HelmState (state.go)
- Method was never called; all lock releases use st.logger directly
3. Add clarifying comments for lock retry behavior (state.go)
- Document why file system errors are retried but timeouts are not
- Explain flock returns (false, nil) on context deadline exceeded
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
* fix: clarify lock file check is informational only
Lock files are ephemeral and may be cleaned up immediately after
helmfile processes complete. Update comments and warning message
to make clear their absence doesn't indicate locking wasn't used.
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
* fix: add HELM_BIN env var to Dockerfiles
The helm-git plugin requires HELM_BIN environment variable to be set.
Without it, the plugin fails with "HELM_BIN: parameter not set".
Add HELM_BIN=/usr/local/bin/helm to all Dockerfile variants.
Fixes#2303
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
---------
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
This commit adds comprehensive support for Helm 4 while maintaining
full backward compatibility with Helm 3. The implementation includes:
- Updated helm version detection to support both Helm 3 and Helm 4
- Added HELMFILE_HELM4 environment variable to control Helm version
- Modified helm execution paths to handle version-specific binaries
- Updated helm plugin installation to support split architecture
- Helm 4: Uses split plugin architecture (3 separate .tgz files)
- helm-secrets.tgz
- helm-secrets-getter.tgz
- helm-secrets-post-renderer.tgz
- Helm 3: Continues using single plugin installation
- Updated Dockerfiles, CI workflows, and core installation code
- Helm 4 requires post-renderers to be plugins, not executable scripts
- Created Helm plugin structure for integration tests
- Updated helmfile.yaml templates to dynamically select renderer type
- Added test plugins: add-cm, add-cm1, add-cm2
- Updated integration tests for Helm 3/4 compatibility
- Created Helm 4 variant expected output files
- Fixed test determinism issues (repo cleanup between iterations)
- Added version-specific output filtering for warnings/messages
- Updated workflows to test both Helm 3 and Helm 4
- Matrix testing across Helm versions
- Updated helm-diff to v3.14.0 for compatibility
- Updated README and docs with Helm 4 information
- Added migration guidance
- Updated version requirements
All changes are backward compatible - existing Helm 3 users will
see no behavior changes.
fix: update Helm 4 lint expected output to match filtered output
The grep filter removes the semver warning, so the expected output
should not include it. Updated lint-helm4 files to match the filtered
output (warning removed, no extra blank line).
Signed-off-by: Aditya Menon <amenon@canarytechnologies.com>
* feat: Add updateStrategy option in the state file with 'reinstall'/'reinstallIfForbidden' choices to uninstall and apply the specific release(s) (if forbidden to update)
Signed-off-by: Simon Bouchard <sbouchard@rbbn.com>
* Fix unit tests related to the new updateStrategy feature
Signed-off-by: Simon Bouchard <sbouchard@rbbn.com>
* Fix unit tests related to the new updateStrategy feature
Signed-off-by: Simon Bouchard <sbouchard@rbbn.com>
* Resolve linter issue due to cognitive complexity
Signed-off-by: Simon Bouchard <sbouchard@rbbn.com>
* Updated index.md to describe the possible values of updateStrategy
Signed-off-by: Simon Bouchard <sbouchard@rbbn.com>
* Add validation of updateStrategy parameter and unit test
Signed-off-by: Simon Bouchard <sbouchard@rbbn.com>
* Updated unit test
Signed-off-by: Simon Bouchard <sbouchard@rbbn.com>
* Removed 'reinstall' update strategy option to only have reinstallIfForbidden, cleanup of pre-sync changes, adapted unit tests
Signed-off-by: Simon Bouchard <sbouchard@rbbn.com>
* Display affected releases that were reinstalled
Signed-off-by: Simon Bouchard <sbouchard@rbbn.com>
* Make sure to add --wait when deleting a release to be reinstalled due to reinstallIfForbidden
Signed-off-by: Simon Bouchard <sbouchard@rbbn.com>
* Apply suggestions from Copilot code review
Signed-off-by: Simon Bouchard <sbouchard@rbbn.com>
---------
Signed-off-by: Simon Bouchard <sbouchard@rbbn.com>
* feat: add/expose cli flags
Signed-off-by: Hans Song <hans.m.song@gmail.com>
* fix tests
Signed-off-by: Hans Song <hans.m.song@gmail.com>
* remove skipdeps from subcommand options
Signed-off-by: Hans Song <hans.m.song@gmail.com>
* remove skip-deps from subcommand flags
Signed-off-by: Hans Song <hans.m.song@gmail.com>
* remove SkipDeps from subcommand implementations
Signed-off-by: Hans Song <hans.m.song@gmail.com>
* update doco with new flags
Signed-off-by: Hans Song <hans.m.song@gmail.com>
---------
Signed-off-by: Hans Song <hans.m.song@gmail.com>
All the dependencies get correctly installed when dealing with remote
charts.
If there's a local chart that depends on remote dependencies then those
don't get automatically installed. See #526. They end up with this
error:
```
Error: no cached repository for helm-manager-b6cf96b91af4f01317d185adfbe32610179e5246214be9646a52cb0b86032272 found. (try 'helm repo update'): open /root/.cache/helm/repository/helm-manager-b6cf96b91af4f01317d185adfbe32610179e5246214be9646a52cb0b86032272-index.yaml: no such file or directory
```
One workaround for that would be to add the repositories from the local
charts. Something like this:
```
cd local-chart/ && helm dependency list $dir 2> /dev/null | tail +2 | head -n -1 | awk '{ print "helm repo add " $1 " " $3 }' | while read cmd; do $cmd; done
```
This however is not trivial to parse and implement.
An easier fix which I did here is just to not allow doing
`--skip-refresh` for local repositories.
Fixes#526
Signed-off-by: Indrek Juhkam <indrek@urgas.eu>
Signed-off-by: Indrek Juhkam <indrek@urgas.eu>
Signed-off-by: yxxhero <aiopsclub@163.com>
1. only implement post-renderer flags this patch
2. As mumoshu advise, add helmfile flags `--post-render` and add the
postRenderer config in helmDefaults and release. the priority is
helmfile flags > release > helmDefaults.
3. fix the test case in state_test.go and some other tests.
Signed-off-by: guofutan <guofutan@tencent.com>
Signed-off-by: yxxhero <aiopsclub@163.com>
* feat: show live output from the Helm binary
Signed-off-by: Rodrigo Fior Kuntzer <rodrigo@miro.com>
* fixup! Merge branch 'main' into enable-live-output
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Make a few helmfile sub-commands to consistently support needs-related flags
* helmfile-diff adds support for --include-transitive-needs
* helmfile-template adds support for --skip-needs
* helmfile-lint adds support for --skip-needs, --include-needs, and --include-transitive-needs
Ref https://github.com/roboll/helmfile/issues/2055
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Fix a few helmfile-lint needs related bugs and add tests
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Is include-transitive-needs realy working as intended? 🤔
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Confirm that it does fail on unselected need by default
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Add missing testdata
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Test helmfile-template for include/skip needs support
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Fix a few terms
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Add more tests to better know the current helmfile-diff behavior around needs
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Fix failing tests
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Fix helmfile-diff to consistently handle skip/include-needs
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Extract testhelper.RequireLog for reusing
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Fix all bugs and test cases for TestDiff and TestDiff_2
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Fix TestDiff_2
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Fix TestDiff
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Fix TestDiffWithNeeds
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Unify behavior on including disabled releases as needs for lint and template
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
* Fix bug that --include-transitive-needs does not imply include-needs
Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
Parses a new field in repositories named `skipTLSVerify` and if set to `true`, it appends `--insecure-skip-tls-verify` in `helm repo add` command.
This should be useful with internal self-signed repos, mitm proxies etc.
Resolves#1871
This adds the ability to include the --pass-credentials flag to the helm add repo command by:
- Adding repo.passCredentials to the helmfile yaml
- Changing state, helmexec, and app to include RepositorySpec.PassCredentials
Resolves#1898
Co-authored-by: almed4 <alexandre.meddin@ingka.ikea.com>
Adds a basic support for Helm repositories hosted on Azure Container Registry (not OCI but classic ones). Add a new field to RepositorySpec to state that is externally managed and runs the `az-cli` command instead of the helm one to manage the repository.
* Parse and process helm version using github.com/Masterminds/semver/v3.
* Add --force-update only when Helm version >= 3.3.2, < 3.3.4.
See: https://github.com/helm/helm/pull/8777.
* Add test cases.
- createNamespace is a new attribute that can be added to helmDefaults
or an individual release to enforce the creation of a release namespace
during sync if the namespace does not exist. This leverages helm's
(3.2+) --create-namespace flag for the install/upgrade command. If
running helm < 3.2, the createNamespace attribute has no effect.
Resolves#891Resolves#1140
* Add option to suppress diff on apply
Add --supress-diff option on apply. Usable for fresh installs when a
lot of output is produces by diff.
Resolves#458
* fix tests for suppress-diff
Runs `helm version` in helmexec.New, and exposes a method on Interface to allow other packages to use the detected version. Preserves compatibility with previous HELMFILE_HELM3 mechanism.
Resolves#923
* fix: Fix `needs` to work for upgrades and when selectors are provided
Fixes#919
* Add test framework for `helmfile apply`
* Various enhancements and fixes to the DAG support
- Make the order of upgrades/deletes more deterministic for testability
- Fix the test framework so that we can validate log outputs and errors
- Add more test cases for `helmfile apply`, along with bug fixes.
- Make sure it fails with an intuitive error when you have non-existent releases referenced from witin "needs"