Commit Graph

264 Commits

Author SHA1 Message Date
thoro 101b63b63b
Merge 0affa9425a into b97de5d7f1 2026-01-19 14:56:52 +00:00
Mikkel Oscar Lyderik Larsen f6839f87b9
Modernize code generation (#3003)
Signed-off-by: Mikkel Oscar Lyderik Larsen <mikkel.larsen@zalando.de>
2026-01-09 14:22:10 +01:00
Thomas Rosenstein 0affa9425a Improve sync responsiveness with background execution and context cancellation
This change improves the responsiveness of the operator when handling
deletion requests by running sync operations in the background and
using context cancellation to interrupt stuck operations.

Changes:
- Add context field to Cluster struct, passed through New()
- Add Cancel() method to cancel cluster's context
- Add StartSync/EndSync/NeedsResync for managing background sync state
- Run Sync() in a background goroutine so worker can process other events
- Add context-aware DB connection methods (initDbConnWithContext)
- Add RetryWithContext() that respects context cancellation
- Cancel cluster context immediately when DeletionTimestamp detected
- Use context-aware connections in syncRoles/syncDatabases
- StartSync/NeedsResync check context cancellation to prevent new syncs
  during deletion (no need for separate deleted flag)

Flow:
1. Sync event spawns background goroutine and returns immediately
2. If another sync arrives while one is running, needsResync flag is set
3. When sync completes, it checks needsResync and requeues if needed
4. Delete cancels context -> stuck DB operations return early -> mutex released
5. StartSync/NeedsResync return false when context cancelled
6. Delete proceeds without waiting for slow/stuck sync operations
2025-12-14 20:43:53 +00:00
Thomas Rosenstein 9ca87d9db4 Fix deletion timestamp handling for clusters with finalizers
When a Postgres cluster has a finalizer, deleting it sets a DeletionTimestamp
but doesn't remove the object until the finalizer is cleared. The operator
was not properly handling these DeletionTimestamp changes:

1. postgresqlUpdate() was filtering out events where only DeletionTimestamp
   changed (it only checked Spec and Annotations), causing the delete to
   never be processed.

2. EventUpdate case in processEvent() didn't check for DeletionTimestamp,
   so even if the event reached the processor, it would run Update() instead
   of Delete().

3. removeFinalizer() used a cached object with stale resourceVersion,
   causing "object has been modified" errors.

Fixes:
- Add explicit DeletionTimestamp check in postgresqlUpdate() to queue the event
- Add DeletionTimestamp check in EventUpdate to call Delete() when set
- Fetch latest object from API before removing finalizer to avoid conflicts
2025-12-14 20:29:01 +01:00
Steven Berler cd05682482
fix switchover schedule tests (#2995)
* fix switchover schedule tests

Previously the tests would fail depending on the local time zone and the
time of day the test was being run.

---------

Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
Co-authored-by: Mikkel Oscar Lyderik Larsen <mikkeloscar@users.noreply.github.com>
2025-12-11 10:22:40 +01:00
Felix Kunde 3bc244fe39
bump dependencies and reflect linter suggestions (#2963) 2025-10-16 10:23:36 +02:00
Felix Kunde dc29425969
include external traffic policy comparison into service diffing (#2956) 2025-09-23 14:30:06 +02:00
Polina Bungina bcd729b2cc
Add selector to master service when switching to CM (#2955)
Add service selector comparison to compareServices
This is necessary for the proper switch of `kubernetes_use_configmaps` configuration value, as master service should have different label selector setup for those.
2025-09-19 14:44:17 +02:00
Felix Kunde 746df0d33d
do not remove publications of slot defined in manifest (#2868)
* do not remove publications of slot defined in manifest
* improve condition to sync streams
* init publication tables map when adding manifest slots
* need to update c.Stream when there is no update
2025-02-26 17:31:37 +01:00
Felix Kunde 2a4be1cb39
fix creating secrets for rotation users (#2863)
* fix creating secrets for rotation users
* rework annotation comparison on update to decide on when to call syncSecrets
2025-02-14 09:44:09 +01:00
Polina Bungina a56ecaace7
Critical operation PDB (#2830)
Create the second PDB to cover Pods with a special "critical operation" label set.

This label is going to be assigned to all pg cluster's Pods by the Operator during a PG major version upgrade, by Patroni during a cluster/replica bootstrap. It can also be set manually or by any other automation tool.
2025-01-29 12:41:08 +01:00
Polina Bungina f49b4f1e97
Ensure podAnnotations are removed from pods if reset in the config (#2826) 2025-01-24 16:53:14 +01:00
Polina Bungina b0cfeb30ea
Partially revert #2810 (#2849)
Only schedule switchover for pod migration, consider mainWindow for PGVERSION env change
2025-01-23 16:35:33 +01:00
Polina Bungina 8522331cf2
Extend MaintenanceWindows parameter usage (#2810)
Consider maintenance window when migrating master pods and replacing pods (rolling update)
2025-01-15 18:04:36 +01:00
Felix Kunde 8231797efa
add cluster field for PVCs (#2785)
* add cluster field for PVCs
* sync volumes on cluster creation
* fully spell pvc in log messages
2024-10-31 14:08:50 +01:00
Felix Kunde 002d0f94a1
quote schema names in case they use special characters and remove strings.Builder (#2782) 2024-10-17 16:52:24 +02:00
Felix Kunde a08d1679f2
align sync and update logs (#2738) 2024-08-27 09:58:32 +02:00
fahed dorgaa aad03f71ea
fix golangci-lint issues (#2715)
Signed-off-by: fahed dorgaa <fahed.dorgaa@gmail.com>
Co-authored-by: fahed dorgaa <fahed.dorgaa.ext@corp.ovh.com>
Co-authored-by: Matthias Adler <macedigital@users.noreply.github.com>
2024-08-14 12:54:44 +02:00
Felix Kunde 25ccc87317
sync all resources to cluster fields (#2713)
* sync all resources to cluster fields (CronJob, Streams, Patroni resources)
* separated sync and delete logic for Patroni resources
* align delete streams and secrets logic with other resources
* rename gatherApplicationIds to getDistinctApplicationIds
* improve slot check before syncing streams CRD
* add ownerReferences and annotations diff to Patroni objects
* add extra sync code for config service so it does not get too ugly
* some bugfixes when comparing annotations and return err on found
* sync Patroni resources on update event and extended unit tests
* add config service/endpoint owner references check to e2e tes
2024-08-13 10:06:46 +02:00
Felix Kunde 31f92a1aa0
extend inherited annotations unit test to include logical backup cron job (#2723)
* extend inherited annotations test to logical backup cron job
* sync on updated when enabled, not only on schedule changes
2024-08-12 13:12:51 +02:00
Felix Kunde a87307e56b
Feat: enable owner references (#2688)
* feat(498): Add ownerReferences to managed entities
* empty owner reference for cross namespace secret and more tests
* update ownerReferences of existing resources
* removing ownerReference requires Update API call
* CR ownerReference on PVC blocks pvc retention policy of statefulset
* make ownerreferences optional and disabled by default
* update unit test to check len ownerReferences
* update codegen
* add owner references e2e test
* update unit test
* add block_owner_deletion field to test owner reference
* fix typos and update docs once more
* reflect code feedback

---------

Co-authored-by: Max Begenau <max@begenau.com>
2024-08-09 17:58:25 +02:00
Felix Kunde e71891e2bd
improve logical backup comparison unit test and improve container sync (#2686)
* improve logical backup comparison unit test and improve container sync
* add new comparison function for volume mounts + unit test
2024-07-08 14:06:14 +02:00
Felix Kunde 37d6993439
remove stream resources after drop from Postgres manifest (#2563)
* remove stream resources after drop from Postgres manifest
2024-06-27 14:30:52 +02:00
Polina Bungina 47efca33c9
Improve inherited annotations (#2657)
* Annotate PVC on Sync/Update, not only change PVC template
* Don't rotate pods when only annotations changed
* Annotate Logical Backup's and Pooler's pods
* Annotate PDB, Endpoints created by the Operator, Secrets, Logical Backup jobs

Inherited annotations are only added/updated, not removed
2024-06-26 13:10:37 +02:00
Motte 13d6594cdf
Secrets deletion config (#2582)
* Secrets deletion config
* Update e2e/tests/test_e2e.py

Co-authored-by: Felix Kunde <felix-kunde@gmx.de>

---------

Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
2024-05-10 16:31:21 +02:00
Felix Kunde 83878fe447
make bucket prefix for logical backup configurable (#2609)
* make bucket prefix for logical backup configurable
* include container comparison in logical backup diff
* add unit test and update description for compareContainers
* don't rely on users putting / in the config - reflect other comments from review
2024-04-23 14:24:04 +02:00
Felix Kunde 8bd9080798
return create and sync error, not setStatus error (#2574)
* return create and sync error, not possible status set error
* update documentation and improve deletion logs
2024-03-12 16:31:59 +01:00
Felix Kunde 23f4fdb327
update go and dependencies (#2554) 2024-02-23 13:58:11 +01:00
Felix Kunde e34f19be01
update spec when updating status (#2546)
* update spec when updating status
* only setSpec of pg resource is not empty
2024-02-20 10:24:24 +01:00
Felix Kunde 4a0c483514
add unit test and documentation for finalizers (#2509)
* add unit test and documentation for finalizers
* error msg with lower case and cover sync case
* try to avoid adding json-patch dependency
* use Update to remove finalizer
* changing status and finalizer during create
* do not call Delete() twice
2024-01-22 12:13:40 +01:00
Felix Kunde 3bad9aaded
fix when syncing standby discription (#2513) 2024-01-12 10:41:17 +01:00
Christian Rohmann 743aade45f
Use finalizers to avoid losing delete events and to ensure full resource cleanup (#941)
* Add Finalizer functions to Cluster; add/remove finalizer on Create/Delete events
* Check if clusters have a deletion timestamp and we missed that event. Run Delete() and remove finalizer when done.
* Fix nil handling when using Service from map; Remove Service, Endpoint entries from their maps - just like with Secrets
* Add handling of ResourceNotFound to all delete functions (Service, Endpoint, LogicalBackup CronJob, PDB and Secret) - this is not a real error when deleting things
* Emit events when there are issues deleting resources to the user is informed
* Depend the removal of the Finalizer on all resources being deleted successfully first. Otherwise the next sync run should let us try again
* Add config option to enable finalizers
* Removed dangling whitespace at EOL
* config.EnableFinalizers is a bool pointer

---------

Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
2024-01-04 16:22:53 +01:00
Felix Kunde dad5b132ec
Standby cluster promotion by changing manifest (#2472)
* Standby cluster promotion by changing manifest
* Updated the documentation

---------

Co-authored-by: Senthilnathan M <snathanm@vmware.com>
2024-01-04 12:33:50 +01:00
seeker 7ceedead35
Fix VolumeClaimTemplates index out of range problem (#2493)
when the desired statefulset has different numbers of volume claim template with current cluster,  will be panic because of index out of range
2024-01-04 11:05:15 +01:00
Ida Novindasari 36389b27bc
Enable specifying PVC retention policy for auto deletion (#2343)
* Enable specifying PVC retention policy for auto deletion
* enable StatefulSetAutoDeletePVC in featureGates
* skip node affinity test
2023-09-08 13:17:37 +02:00
drivebyer 1e64ae788e
Fix some errors be ignored (#2290)
Signed-off-by: drivebyer <yang.wu@daocloud.io>
2023-04-17 17:25:07 +02:00
Felix Kunde 9973262b83
sync stateful set when syncing streams during ADD event (#2245) 2023-02-28 09:14:22 +01:00
Felix Kunde 7887ebbbce
set wal_level config not on empty parameters map (#2189)
* set wal_level config not on empty parameters map
* UPDATE event must trigger statefulSet sync when streams are added
2023-01-26 09:43:03 +01:00
idanovinda 486d5d66e0
Allow drop slots when it gets deleted from the manifest (#2089)
* Allow drop slots when it gets deleted from the manifest
* use leader instead replica to query slots
* fix and extend unit tests for config update checks

Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
2023-01-03 15:46:59 +01:00
Felix Kunde e80cccb93b
use random short name for stream CRDs (#2137)
* use random short name for stream CRDs
2022-12-27 16:52:01 +01:00
Felix Kunde 70f3ee8e36
skip db sync on failed initUsers during UPDATE (#2083)
* skip db sync on failed initUsers during UPDATE
* provide unit test for teams API being unavailable
* add test for 404 case
2022-10-21 17:50:14 +02:00
Felix Kunde 2aa52094db
switch to policy API v1 for PDBs (#2008)
* switch to policy API v1 for PDBs
* update e2e test dependencies
* use kind 0.14.0
* bump K8s client in e2e docker image
* bump e2e tests-runner
2022-10-06 09:43:17 +02:00
Felix Kunde a119772efb
add toggle to turn off readiness probes (#2004)
* add toggle to turn off readiness probes
* include PodManagementPolicy and ReadinessProbe in stateful set comparison
* add URI scheme to generated readiness probe
2022-10-05 18:25:24 +02:00
Felix Kunde 4c07494ac7
deprecate ClusterName field of Postgresql type and remove team from REST endpoints (#2015)
* deprecate ClusterName field of Postgresql type
* remove for teamId from operator API endpints /status /logs /history
* update dns_format_string and yaml template in UI
2022-08-29 15:00:25 +02:00
Felix Kunde ef324494a0
fetch pooler and fes_user system user only when corresponding features are used (#2009)
* fetch pooler and fes_user system user only when corresponding features are used
* cover error case in unit test
* use string formatting instead of +
2022-08-24 16:28:49 +02:00
Felix Kunde b2642fa2fc
allow in place pw rotation of system users (#1953)
* allow in place pw rotation of system users
* block postgres user from rotation
* mark pooler pods for replacement
* adding podsGetter where pooler is synced in unit tests
* move rotation code in extra function
2022-08-18 14:14:31 +02:00
Felix Kunde 268a86a045
removing inner goroutine in cluster.Switchover (#1876)
* removing inner goroutine in cluster.Switchover and resolve race between processPodEvent and unregisterPodSubscriber
* unlock mutex after handling event, now with non-blocking default case
2022-05-17 18:10:39 +02:00
Felix Kunde a77d5df158
reverse membership for additional owner roles (#1862)
* reverse membership for additional owner roles
* remove type RoleOriginSpilo
* use e2e images with cron_admin inside
* let operator resolve reversed membership
* make additional owner roles part of the sync user strategy
* add more context in the docs about additional_owner_roles
2022-04-28 11:15:40 +02:00
Felix Kunde 532772c5cd
do not call EBS api when there are no pvs (#1851)
* do not call EBS api when there are no pvs
* no extra aws api call in executeEBSMigration, operate on fetched cluster.EBSVolumes
2022-04-20 12:12:02 +02:00
Felix Kunde eecd13169c
refactor spilo env var generation (#1848)
* refactor spilo env generation
* enhance docs on env vars
* add unit test for appendEnvVar
2022-04-14 11:47:33 +02:00