* allow using both infrastructure_roles_options
* new default values for user and role definition
* use robot_zmon as parent role
* add operator log to debug
* right name for old secret
* only extract if rolesDefs is empty
* set password1 in old infrastructure role
* fix new infra rile secret
* choose different role key for new secret
* set memberof everywhere
* reenable all tests
* reflect feedback
* remove condition for rolesDefs
* change Clone attribute of PostgresSpec to *ConnectionPooler
* update go.mod from master
* fix TestConnectionPoolerSynchronization()
* Update pkg/apis/acid.zalan.do/v1/postgresql_type.go
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
Co-authored-by: Pavlo Golub <pavlo.golub@gmail.com>
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
* Extend operator configuration to allow for a pod_environment_secret just like pod_environment_configmap
* Add all keys from PodEnvironmentSecrets as ENV vars (using SecretKeyRef to protect the value)
* Apply envVars from pod_environment_configmap and pod_environment_secrets before doing the global settings from the operator config. This allows them to be overriden by the user (via configmap / secret)
* Add ability use a Secret for custom pod envVars (via pod_environment_secret) to admin documentation
* Add pod_environment_secret to Helm chart values.yaml
* Add unit tests for PodEnvironmentConfigMap and PodEnvironmentSecret - highly inspired by @kupson and his very similar PR #481
* Added new parameter pod_environment_secret to operatorconfig CRD and configmap examples
* Add pod_environment_secret to the operationconfiguration CRD
Co-authored-by: Christian Rohmann <christian.rohmann@inovex.de>
* delete secrets the right way
* make a one function
* continue deleting secrets even if one delete fails
Co-authored-by: Felix Kunde <felix.kunde@zalando.de>
* Try to resize pvc if resizing pv has failed
* added config option to switch between storage resize strategies
* changes according to requests
* Update pkg/controller/operator_config.go
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
* enable_storage_resize documented
added examples to the default configuration and helm value files
* enable_storage_resize renamed to volume_resize_mode, off by default
* volume_resize_mode renamed to storage_resize_mode
* Update pkg/apis/acid.zalan.do/v1/crds.go
* pkg/cluster/volumes.go updated
* Update docs/reference/operator_parameters.md
* Update manifests/postgresql-operator-default-configuration.yaml
* Update pkg/controller/operator_config.go
* Update pkg/util/config/config.go
* Update charts/postgres-operator/values-crd.yaml
* Update charts/postgres-operator/values.yaml
* Update docs/reference/operator_parameters.md
* added logging if no changes required
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
* try to emit error for missing team name in cluster name
* skip creation after new cluster object
* move SetStatus to k8sclient and emit event when skipping creation and rename to SetPostgresCRDStatus
Co-authored-by: Felix Kunde <felix.kunde@zalando.de>
* do not block rolling updates with lazy spilo update enabled
* treat initContainers like Spilo image
Co-authored-by: Felix Kunde <felix.kunde@zalando.de>
* Support for WAL_GS_BUCKET and GOOGLE_APPLICATION_CREDENTIALS environtment variables
* Fixed merge issue but also removed all changes to support macos.
* Updated test to new format
* Missed macos specific changes
* Added documentation and addressed comments
* Update docs/administrator.md
* Update docs/administrator.md
* Update e2e/run.sh
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
deleteConnectionPooler function incorrectly checks that the delete api response is ResourceNotFound. Looks like the only consequence is a confusing log message, but obviously it's wrong. Remove negation, since having ResourceNotFound as error is the good case.
Co-authored-by: Christian Rohmann <christian.rohmann@inovex.de>
* Initial commit
* Corrections
- set the type of the new configuration parameter to be array of
strings
- propagate the annotations to statefulset at sync
* Enable regular expression matching
* Improvements
-handle rollingUpdate flag
-modularize code
-rename config parameter name
* fix merge error
* Pass annotations to connection pooler deployment
* update code-gen
* Add documentation and update manifests
* add e2e test and introduce option in configmap
* fix service annotations test
* Add unit test
* fix e2e tests
* better key lookup of annotations tests
* add debug message for annotation tests
* Fix typos
* minor fix for looping
* Handle update path and renaming
- handle the update path to update sts and connection pooler deployment.
This way no need to wait for sync
- rename the parameter to downscaler_annotations
- handle other review comments
* another try to fix python loops
* Avoid unneccessary update events
* Update manifests
* some final polishing
* fix cluster_test after polishing
Co-authored-by: Rafia Sabih <rafia.sabih@zalando.de>
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
* PreparedDatabases with default role setup
* merge changes from master
* include preparedDatabases spec check when syncing databases
* create a default preparedDB if not specified
* add more default privileges for schemas
* use empty brackets block for undefined objects
* cover more default privilege scenarios and always define admin role
* add DefaultUsers flag
* support extensions and defaultUsers for preparedDatabases
* remove exact version in deployment manifest
* enable CRD validation for new field
* update generated code
* reflect code review
* fix typo in SQL command
* add documentation for preparedDatabases feature + minor changes
* some datname should stay
* add unit tests
* reflect some feedback
* init users for preparedDatabases also on update
* only change DB default privileges on creation
* add one more section in user docs
* one more sentence
* initial implementation
* describe forcing the rolling upgrade
* make parameter name more descriptive
* add missing pieces
* address review
* address review
* fix bug in e2e tests
* fix cluster name label in e2e test
* raise test timeout
* load spilo test image
* use available spilo image
* delete replica pod for lazy update test
* fix e2e
* fix e2e with a vengeance
* lets wait for another 30m
* print pod name in error msg
* print pod name in error msg 2
* raise timeout, comment other tests
* subsequent updates of config
* add comma
* fix e2e test
* run unit tests before e2e
* remove conflicting dependency
* Revert "remove conflicting dependency"
This reverts commit 65fc09054b.
* improve cdp build
* dont run unit before e2e tests
* Revert "improve cdp build"
This reverts commit e2a8fa12aa.
Co-authored-by: Sergey Dudoladov <sergey.dudoladov@zalando.de>
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
* Add EventsGetter to KubeClient to enable to sending K8S events
* Add eventRecorder to the controller, initialize it and hand it down to cluster via its constructor to enable it to emit events this way
* Add first set of events which then go to the postgresql custom resource the user interacts with to provide some feedback
* Add right to "create" events to operator cluster role
* Adapt cluster tests to new function sigurature with eventRecord (via NewFakeRecorder)
* Get a proper reference before sending events to a resource
Co-authored-by: Christian Rohmann <christian.rohmann@inovex.de>
* adds a Get call to Patroni interface to fetch state of a Patroni member
* postpones re-creating pods if at least one replica is currently being created
Co-authored-by: Sergey Dudoladov <sergey.dudoladov@zalando.de>
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
* further compatibility with k8sUseConfigMaps - skip further endpoints related actions
* Update pkg/cluster/cluster.go
thanks!
Co-Authored-By: Felix Kunde <felix-kunde@gmx.de>
* Update pkg/cluster/cluster.go
Co-Authored-By: Felix Kunde <felix-kunde@gmx.de>
* Update pkg/cluster/cluster.go
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
There is a possibility to pass nil as one of the specs and an empty spec
into syncConnectionPooler. In this case it will perfom a syncronization
because nil != empty struct. Avoid such cases and make it testable by
returning list of syncronization reasons on top together with the final
error.
* Allow additional Volumes to be mounted
* added TargetContainers option to determine if additional volume need to be mounter or not
* fixed dependencies
* updated manifest additional volume example
* More validation
Check that there are no volume mount path clashes or "all" vs ["a", "b"]
mixtures. Also change the default behaviour to mount to "postgres"
container.
* More documentation / example about additional volumes
* Revert go.sum and go.mod from origin/master
* Declare addictionalVolume specs in CRDs
* fixed k8sres after rebase
* resolv conflict
Co-authored-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Co-authored-by: Thierry <thierry@malt.com>
* Protected and system users can't be a connection pool user
It's not supported, neither it's a best practice. Also fix potential null
pointer access. For protected users it makes sense by intent of protecting this
users (e.g. from being overriden or used as something else than supposed). For
system users the reason is the same as for superuser, it's about replicastion
user and it's under patroni control.
This is implemented on both levels, operator config and postgresql manifest.
For the latter we just use default name in this case, assuming that operator
config is always correct. For the former, since it's a serious
misconfiguration, operator panics.
* Add patroni parameters for `synchronous_mode`
* Update complete-postgres-manifest.yaml, removed quotation marks
* Update k8sres_test.go, adjust result for `Patroni configured`
* Update k8sres_test.go, adjust result for `Patroni configured`
* Update complete-postgres-manifest.yaml, set synchronous mode to false in this example
* Update pkg/cluster/k8sres.go
Does the same but is shorter. So we fix that it if you like.
Co-Authored-By: Felix Kunde <felix-kunde@gmx.de>
* Update docs/reference/cluster_manifest.md
Co-Authored-By: Felix Kunde <felix-kunde@gmx.de>
* Add patroni's `synchronous_mode_strict`
* Extend `TestGenerateSpiloConfig` with `SynchronousModeStrict`
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
* kubernetes_use_configmap
* Update manifests/postgresql-operator-default-configuration.yaml
Co-Authored-By: Felix Kunde <felix-kunde@gmx.de>
* Update manifests/configmap.yaml
Co-Authored-By: Felix Kunde <felix-kunde@gmx.de>
* Update charts/postgres-operator/values.yaml
Co-Authored-By: Felix Kunde <felix-kunde@gmx.de>
* go.fmt
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
Connection pooler support
Add support for a connection pooler. The idea is to make it generic enough to
be able to switch between different implementations (e.g. pgbouncer or
odyssey). Operator needs to create a deployment with pooler and a service for
it to access.
For connection pool to work properly, a database needs to be prepared by
operator, namely a separate user have to be created with an access to an
installed lookup function (to fetch credential for other users).
This setups is supposed to be used only by robot/application users. Usually a
connection pool implementation is more CPU bounded, so it makes sense to create
several pods for connection pool with more emphasize on cpu resources. At the
moment there are no special affinity or tolerations assigned to bring those
pods closer to the database. For availability purposes minimal number of
connection pool pods is 2, ideally they have to be distributed between
different nodes/AZ, but it's not enforced in the operator itself. Available
configuration supposed to be ergonomic and in the normal case require minimum
changes to a manifest to enable connection pool. To have more control over the
configuration and functionality on the pool side one can customize the
corresponding docker image.
Co-authored-by: Felix Kunde <felix-kunde@gmx.de>
The [operator parameters][1] already support the
`custom_service_annotations` config.With this parameter is possible to
define custom annotations that will be used on the services created by the
operator. The `custom_service_annotations` as all the other
[operator parameters][1] are defined on the operator level and do not allow
customization on the cluster level. A cluster may require different service
annotations, as for example, set up different cloud load balancers
timeouts, different ingress annotations, and/or enable more customizable
environments.
This commit introduces a new parameter on the cluster level, called
`serviceAnnotations`, responsible for defining custom annotations just for
the services created by the operator to the specifically defined cluster.
It allows a mix of configuration between `custom_service_annotations` and
`serviceAnnotations` where the latest one will have priority. In order to
allow custom service annotations to be used on services without
LoadBalancers (as for example, service mesh services annotations) both
`custom_service_annotations` and `serviceAnnotations` are applied
independently of load-balancing configuration. For retro-compatibility
purposes, `custom_service_annotations` is still under
[Load balancer related options][2]. The two default annotations when using
LoadBalancer services, `external-dns.alpha.kubernetes.io/hostname` and
`service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout` are
still defined by the operator.
`service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout` can
be overridden by `custom_service_annotations` or `serviceAnnotations`,
allowing a more customizable environment.
`external-dns.alpha.kubernetes.io/hostname` can not be overridden once
there is no differentiation between custom service annotations for
replicas and masters.
It updates the documentation and creates the necessary unit and e2e
tests to the above-described feature too.
[1]: https://github.com/zalando/postgres-operator/blob/master/docs/reference/operator_parameters.md
[2]: https://github.com/zalando/postgres-operator/blob/master/docs/reference/operator_parameters.md#load-balancer-related-options
* add validation for PG resources and volume size
* check resource requests also on UPDATE and SYNC + update docs
* if cluster was running don't error on sync
* Added possibility to add custom annotations to LoadBalancer service.
* Added parameters for custom endpoint, access and secret key for logical backup.
* Modified dump.sh so it knows how to handle new features. Configurable S3 SSE
This will set up a continuous wal streaming cluster, by adding the corresponding section in postgres manifest. Instead of having a full-fledged standby cluster as in Patroni, here we use only the wal path of the source cluster and stream from there.
Since, standby cluster is streaming from the master and does not require to create or use databases of it's own. Hence, it bypasses the creation of users or databases.
There is a separate sample manifest added to set up a standby-cluster.
* StatefulSet fsGroup config option to allow non-root spilo
* Allow Postgres CRD to overide SpiloFSGroup of the Operator.
* Document FSGroup of a Pod cannot be changed after creation.
* database.go: substitute hardcoded .svc.cluster.local dns suffix with config parameter
Use the pod's configured dns search path, for clusters where .svc.cluster.local is not correct.
Override clone s3 bucket path
Add possibility to use a custom s3 bucket path for cloning a cluster
from an arbitrary bucket (e.g. from another k8s cluster). For that
a new config options is introduced `s3_wal_path`, that should point
to a location that spilo would understand.
* turns PostgresStatus type into a struct with field PostgresClusterStatus
* setStatus patch target is now /status subresource
* unmarshalling PostgresStatus takes care of previous status field convention
* new simple bool functions status.Running(), status.Creating()
* Config option to allow Spilo container to run non-privileged.
Runs non-privileged by default.
Fixes#395
* add spilo_privileged to manifests/configmap.yaml
* add spilo_privileged to helm chart's values.yaml
Add possibility to mount a tmpfs volume to /dev/shm to avoid issues like
[this](https://github.com/docker-library/postgres/issues/416). To achieve that
two new options were introduced:
* `enableShmVolume` to PostgreSQL manifest, to specify whether or not mount
this volume per database cluster
* `enable_shm_volume` to operator configuration, to specify whether or not mount
per operator.
The first one, `enableShmVolume` takes precedence to allow us to be more flexible.
Client-go provides a https://github.com/kubernetes/code-generator package in order to provide the API to work with CRDs similar to the one available for built-in types, i.e. Pods, Statefulsets and so on.
Use this package to generate deepcopy methods (required for CRDs), instead of using an external deepcopy package; we also generate APIs used to manipulate both Postgres and OperatorConfiguration CRDs, as well as informers and listers for the Postgres CRD, instead of using generic informers and CRD REST API; by using generated code we can get rid of some custom and obscure CRD-related code and use a better API.
All generated code resides in /pkg/generated, with an exception of zz_deepcopy.go in apis/acid.zalan.do/v1
Rename postgres-operator-configuration CRD to OperatorConfiguration, since the former broke naming convention in the code-generator.
Moved Postgresql, PostgresqlList, OperatorConfiguration and OperatorConfigurationList and other types used by them into
Change the type of the Error field in the Postgresql crd to a string, so that client-go could generate a deepcopy for it.
Use generated code to set status of CRD objects as well. Right now this is done with patch, however, Kubernetes 1.11 introduces the /status subresources, allowing us to set the status with
the special updateStatus call in the future. For now, we keep the code that is compatible with earlier versions of Kubernetes.
Rename postgresql.go to database.go and status.go to logs_and_api.go to reflect the purpose of each of those files.
Update client-go dependencies.
Minor reformatting and renaming.
Previously, the operator put pg_hba into the bootstrap/pg_hba key of
Patroni. That had 2 adverse effects:
- pg_hba.conf was shadowed by Spilo default section in the local
postgresql configuration
- when updating pg_hba in the cluster manifest, the updated lines were
not propagated to DCS, since the key was defined in the boostrap
section of Patroni.
Include some minor refactoring, moving methods to unexported when
possible and commenting out usage of md5, so that gosec won't complain.
Per https://github.com/zalando-incubator/postgres-operator/issues/330
Review by @zerg-junior
Run more linters in the gometalinter, i.e. deadcode, megacheck,
nakedret, dup.
More consistent code formatting, remove two dead functions, eliminate
naked a bunch of naked returns, refactor a few functions to avoid code
duplication.
* Allow configuring pod priority globally and per cluster.
Allow to specify pod priority class for all pods managed by the operator,
as well as for those belonging to individual clusters.
Controlled by the pod_priority_class_name operator configuration
parameter and the podPriorityClassName manifest option.
See https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass
for the explanation on how to define priority classes since Kubernetes 1.8.
Some import order changes are due to go fmt.
Removal of OrphanDependents deprecated field.
Code review by @zerg-junior
There are shortcuts in this code, i.e. we created the deepcopy function
by using the deepcopy package instead of the generated code, that will
be addressed once migrated to client-go v8. Also, some objects,
particularly statefulsets, are still taken from v1beta, this will also
be addressed in further commits once the changes are stabilized.
A repair is a sync scan that acts only on those clusters that indicate
that the last add, update or sync operation on them has failed. It is
supposed to kick in more frequently than the repair scan. The repair
scan still remains to be useful to fix the consequences of external
actions (i.e. someone deletes a postgres-related service by mistake)
unbeknownst to the operator.
The repair scan is controlled by the new repair_period parameter in the
operator configuration. It has to be at least 2 times more frequent than
a sync scan to have any effect (a normal sync scan will update both last
synced and last repaired attributes of the controller, since repair is
just a sync underneath).
A repair scan could be queued for a cluster that is already being synced
if the sync period exceeds the interval between repairs. In that case a
repair event will be discarded once the corresponding worker finds out
that the cluster is not failing anymore.
Review by @zerg-junior
* Improve generting of Scalyr container environment.
Avoid duplicating POD_NAME and POD_NAMESPACE that already bundled
every sidecar.
Do not complain on the lack of SCLALYR_SERVER_HOST, since it is set to
https://upload.eu.scalyr.com in the container we use.
Do not mentioned SCALYR_SERVER_HOST in the error messages, since it is
derived from the cluster name automatically.
Do not show 'persistent volumes are not compatible' errors for the
volumes that failed to be resized because of the other reasons (i.e.
the new size is smaller than the existing one).
* During initial Event processing submit the service account for pods and bind it to a cluster role that allows Patroni to successfully start. The cluster role is assumed to be created by the k8s cluster administrator.
* Up until now, the operator read its own configuration from the
configmap. That has a number of limitations, i.e. when the
configuration value is not a scalar, but a map or a list. We use a
custom code based on github.com/kelseyhightower/envconfig to decode
non-scalar values out of plain text keys, but that breaks when the data
inside the keys contains both YAML-special elememtns (i.e. commas) and
complex quotes, one good example for that is search_path inside
`team_api_role_configuration`. In addition, reliance on the configmap
forced a flag structure on the configuration, making it hard to write
and to read (see
https://github.com/zalando-incubator/postgres-operator/pull/308#issuecomment-395131778).
The changes allow to supply the operator configuration in a proper YAML
file. That required registering a custom CRD to support the operator
configuration and provide an example at
manifests/postgresql-operator-default-configuration.yaml. At the moment,
both old configmap and the new CRD configuration is supported, so no
compatibility issues, however, in the future I'd like to deprecate the
configmap-based configuration altogether. Contrary to the
configmap-based configuration, the CRD one doesn't embed defaults into
the operator code, however, one can use the
manifests/postgresql-operator-default-configuration.yaml as a starting
point in order to build a custom configuration.
Since previously `ReadyWaitInterval` and `ReadyWaitTimeout` parameters
used to create the CRD were taken from the operator configuration, which
is not possible if the configuration itself is stored in the CRD object,
I've added the ability to specify them as environment variables
`CRD_READY_WAIT_INTERVAL` and `CRD_READY_WAIT_TIMEOUT` respectively.
Per review by @zerg-junior and @Jan-M.
* Switchover must wait for the inner goroutine before it returns.
Otherwise, two corner cases may happen:
- waitForPodLabel writes to the podLabelErr channel that has been
already closed by the outer routine
- the outer routine exists and the caller subscribes to the pod
the inner goroutine has already subscribed to, resulting in panic.
The previous commit fe47f9ebea
that touched that code added the cancellation channel, but didn't bother
to actually wait for the goroutine to be cancelled.
Per report and review from @valer-cara.
Original issue: https://github.com/zalando-incubator/postgres-operator/issues/342
The old way of specifying it with the annotation is deprecated and not
available in recent Kubernetes versions. We will keep it there anyway
until upgrading to the new go-client that is incompatible with those
versions.
Per report from @schmitch
* Define sidecars in the operator configuration.
Right now only the name and the docker image can be defined, but with
the help of the pod_environment_configmap parameter arbitrary
environment variables can be passed to the sidecars.
* Refactoring around generatePodTemplate.
Original implementation of per-cluster sidecars by @theRealWardo
Per review by @zerg-junior and @Jan-M
Call Patroni API /config in order to set special options that are
ignored when set in the configuration file, such as max_connections.
Per https://github.com/zalando-incubator/postgres-operator/issues/297
* Some minor refacoring:
Rename Cluster ManualFailover to Swithover
Rename Patroni Failover to Switchover
Add more details to error messages and comments introduced in this PR.
Review by @zerg-junior
After an unsuccessful initial cluster sync it may happen that the
cluster statefulset is empty. This has been made more likely since
88d6a7be3, since it has introduced syncing volumes before statefulsets,
and the volume sync mail fail for different reasons (i.e. the volume has
been shrinked, or too many calls to Amazon).
Some special patroni postgresql parameters, like max_connections,
should reside in the bootstrap.dcs.postgresql.parameters section
to come into effect.
When there is an error happening upon deletion of the Kubernetes object
belonging to the cluster being removed, it makes no sense to abort the
deletion: the manifest will be removed anyway, therefore all the objects
after the one we aborted at will stay forever.
Do not use statefulset number of pods to figure out running ones
for volume resizing, since the statefulset pointer could be nil.
Instead, look at the actual running pods.
* Depreate old LB options, fix endpoint sync.
- deprecate useLoadBalancer, replicaLoadBalancer from the manifest
and enable_load_balancer from the operator configuration. The old
operator configuration options become no-op with this commit. For
the old manifest options, `useLoadBalancer` and `replicaLoadBalancer`
are still consulted, but only in the absense of the new ones
(enableMasterLoadBalancer and enableReplicaLoadBalancer).
- Make sure the endpoint being created during the sync receives proper
addresses subset. This is more critical for the replicas, as for the
masters Patroni will normally re-create the endpoint before the
operator.
- Avoid creating the replica endpoint, since it will be created automatically
by the corresponding service.
- Update the README and unit tests.
Code review by @mgomezch and @zerg-junior
* Improve the pod moving behavior during the Kubernetes cluster upgrade.
Fix an issue of not waiting for at least one replica to become ready
(if the Statefulset indicates there are replicas) when moving the master
pod off the decomissioned node. Resolves the first part of #279.
Small fixes to error messages.
* Eliminate a race condition during the swithover.
When the operator initiates the failover (switchover) that fails and
then retries it for a second time it may happen that the previous
waitForPodChannel is still active. As a result, the operator subscribes
to the former master pod two times, causing a panic.
The problem was that the original code didn't bother to cancel the
waitForPodLalbel for the new master pod in the case when the failover
fails. This commit fixes it by adding a stop channel to that function.
Code review by @zerg-junior
Avoid showing "there is no service in the cluster" when syncing a
service for the cluster if the operator has been restarted after
the cluster had been created.
Compare pods controller revisions with the one for the statefulset
to determine whether the pod is running the latest revision and,
therefore, no rolling update is necessary. This is performed only
during the operator start, afterwards the rolling update status
that is stored locally in the cluster structure is used for all
rolling update decisions.
* Remove 'team' label from the statefulset selector.
I was never supposed to be there, but implicitely statefulset
creates a selector out of meta.labels field. That is the problem
with recent Kubernetes, since statefulset cannot pick up pods
with non-matching label selectors, and we rely on statefulset
picking up old pods after statefulset replacement.
Make sure selector changes trigger replacement of the statefulset.
In the case new selector has more labels than the old one nothing
should be done with a statefulset, otherwise the new statefulset
won't see orphaned pods from the old one, as they won't match the
selector.
See https://github.com/kubernetes/kubernetes/issues/46901#issuecomment-356418393
Enhance definitions of infrastructure roles by allowing membership in multiple roles, role options and per-role configuration to be specified in the infrastructure role configmap, which must have the same name as the infrastructure role secret. See manifests/infrastructure-roles-configmap.yaml for the examples and updated README for the description of different types of database roles supposed by the operator and their purposes.
Change the logic of merging infrastructure roles with the manifest roles when they have the same name, to return the infrastructure role unchanged instead of merging. Previously, we used to propagate flags from the manifest role to the resulting infrastructure one, as there were no way to define flags for the infrastructure role; however, this is not the case anymore.
Code review and tests by @erthalion
By default, spilo sets WAL_BUCKET_SCOPE_PREFIX depending on the cluster
namespace, possibly to a non-empty string. However, we won't be able to
clone those clusters, as the clone prefix is always set to an empty string.
We could go the other way around and set both WAL_BUCKET_SCOPE_PREFIX
and CLONE_WAL_BUCKET_SCOPE_PREFIX to a non-default value that depends
on the cluster's namespace, but it seems that we don't need this
feature for now (no conflict will occur even for clusters with the
same name and different namespaces because of the SCOPE_SUFFIX) and
it requires some additional testing first.
* Track origin of roles.
* Propagate changes on infrastructure roles to corresponding secrets.
When the password in the infrastructure role is updated, re-generate the
secret for that role.
Previously, the password for an infrastructure role was always fetched from
the secret, making any updates to such role a no-op after the corresponding
secret had been generated.
There used to be a masterLess flag that was supposed to indicate whether the cluster it belongs to runs without the acting master by design. At some point, as we didn't really have support for such clusters, the flag has been misused to indicate there is no master in the cluster. However, that was not done consistently (a cluster without all pods running would never be masterless, even when the master is not among the running pods) and it was based on the wrong assumption that the masterless cluster will remain masterless until the next attempt to change that flag, ignoring the possibility of master coming up or some node doing a successful promotion. Therefore, this PR gets rid of that flag completely.
When the cluster is running with 0 instances, there is obviously no master and it makes no sense to create any database objects inside the non-existing master. Therefore, this PR introduces an additional check for that.
recreatePods were assuming that the roles of the pods recorded when the function has stared will not change; for instance, terminated replica pods should start as replicas. Revisit that assumption by looking at the actual role of the re-spawned pods; that avoids a failover if some replica has promoted to the master role while being re-spawned. In addition, if the failover from the old master was unsuccessful, we used to stop and leave the old master running on an old pod, without recording this fact anywhere. This PR makes the failover failure emit a warning, but not stop recreating the last master pod; in the worst case, the running master will be terminated, however, this case is rather unlikely one.
As a side effect, make waitForPodLabel return the pod definition it waited for, avoiding extra API calls in recreatePods and movePodFromEndOfLifeNode
This allows using S3 API in order to simplify finding all folders that are different only by a suffix, since the suffix delimiter will not occur in the suffix itself (currently being a UID).
Avoid reusing WAL S3 buckets of the older cluster with the same name as the existing one.
For the new cluster, the S3 bucket name will include a suffix that is equal to the UID of the PostgreSQL object describing the cluster. That way, the bucket name will stay the same for all members iff they correspond to the same PostgreSQL cluster object.
When "clone: uid:" key is present in the cluster manifest and the cluster is cloned from an S3 bucket (currently that happens if the endTimestamp is present in the clone description) the S3 bucket to clone from is suffixed with the -uid value.
Previously, it was set to the lifecycle-status:ready, breaking a
lot of minikube deployments. Also it was not possible befor to run
with this label set to an empty value.
Document the effect of the label in the new section of the
documentation.
Avoid migrating replica pods, since they will be handled by the
node draining anyway (the PDB specifies that only masters are to
be kept).
Allow migration of the single-pod clusters.
* Trigger the node migration on the lack of the readiness label.
* Examine the node's readiness status on node add.
Make sure we don't miss the not ready node, especially when the
operator is killed during the migration.
Introduce a new lock called specMu lock to protect the cluster spec.
This lock is held on update and sync, and when retrieving the spec in
the API code. There is no need to acquire it for cluster creation and
deletion: creation assigns the spec to the cluster before linking it to
the controller, and deletion just removes the cluster from the list in
the controller, both holding the global clustersMu Lock.
* Scalyr agent sidecar for log shipping
* Remove the default for the Scalyr image
Now the image needs to be specified explicitly to enable log shipping to
Scalyr. This removes the problem of having to generate the config file
or publish our agent image repository.
* Add configuration variable for Scalyr server URL
Defaults to the EU address.
* Alter style
Newlines are cheap and make code easier to edit/refactor, but ok.
* Fix StatefulSet comparison logic
I broke it when I made the comparison consider all containers in the
PostgreSQL pod.
* Make sure the statefulset that is deleted manually gets re-created.
Per report and analysis by Manuel Gomez.
* Move the existence checks for other objects out of the Create functions.
create{Object} for services, endpoints and PDBs refused to continue if
there is a cached definition in the cluster, however, the only place
where it makes sense is when creating a new cluster. Note that contrary
to the statefulset this doesn't fix any issues, since those definitions
were nullified correspondingly when the sync code detected there is no
object present in the Kubernetes cluster.
* Introduce higher and lower bounds for the number of instances
Reduce the number of instances to the min_instances if it is lower and
to the max_instances if it is higher. -1 for either of those means there
is no lower or upper bound.
In addition, terminate the operator when there is a nonsense in the
configuration (i.e. max_instances < min_instances).
Reviewed by Jan Mußler and Sergey Dudoladov.
They are mentioned in the documentation and the operator will emit a
warning each time the variable from the pod environment configmap is
ignored because the same variable is defined by the operator.
Some minor changes in the variable names to make the code more readable.
Per review from Sergey Dudoladov.
Inject PodEnvironmentConfigMap variables inline into the
statefulset definition in order to be able to figure out
changes to the statefulset when only PodEnvironmentConfigMap
has changed.
- make sure that the secrets for the system users (superuser, replication)
are not deleted when the main cluster is. Therefore, we can re-create
the cluster, potentially forcing Patroni to restore it from the backup
and enable Patroni to connect, since it will use the old password, not
the newly generated random one.
- when syncing users, always check whether they are already in the DB.
Previously, we did this only for the sync cluster case, but the new
cluster could be actually the one restored from the backup by Patroni,
having all or some of the users already in place.
- delete endponts last. Patroni uses the $clustername endpoint in order
to store the leader related metadata. If we remove it before removing
all pods, one of those pods running Patroni will re-create it and the
next attempt to create the cluster with the same name will stuble on
the existing endpoint.
- Use db.Exec instead of db.Query for queries that expect no result.
This also fixes the issue with the DB creation, since we didn't
release an empty Row object it was not possible to create more than
one database for a cluster.
* Avoid overwriting critical users.
Disallow defining new users either in the cluster manifest, teams
API or infrastructure roles with the names mentioned in the new
protected_role_names parameter (list of comma-separated names)
Additionally, forbid defining a user with the name matching either
super_username or replication_username, so that we don't overwrite
system roles required for correct working of the operator itself.
Also, clear PostgreSQL roles on each sync first in order to avoid using
the old definitions that are no longer present in the current manifest,
infrastructure roles secret or the teams API.
When a role is defined in the infrastructure roles and the cluster
manifest use the infrastructure role definition and add flags
defined in the manifest.
Previously the role has been overwritten by the definition from the
manifest. Because a random password is generated for each role from the
manifest the applications relying on the infrastructure role credentials
from the infrastructure roles secret were unable to connect.
Previously, the operator started to move the pods off the nodes to be
decomissioned by watching the eol_node_label value. Every new postgres
pod has been created with the anti-affinity to that label, making sure
that the pods being moved won't land on another to be decomissioned
node.
The changes introduce another label that indicates the ready node. The
new pod affinity will esnure that the pod is only scheduled to the node
marked as ready, discarding the previous anti-affinity. That way the
nodes can transition from the pending-decomission to the other statuses
(drained, terminating) without having pods suddently scaled to them.
In addition, rename the label that triggers the start of the upgrade
process to node_eol_label (for consistency with node_readiness_label)
and set its default vvalue to lifecycle-status:pending-decomission.
- fix the lack of closing the cursor for the query that returned no
rows.
- fix syncing of the user options, as previously those were not
fetched from the database.
Add options to the PgUser structure, potentially allowing to set
per-role options in the cluster definition as well.
Introduce api_roles_configuration operator option with the default
of log_statement=all
* Add node toleration config to PodSpec
This allows to taint nodes dedicated to Postgres and prevents other pods from running on these nodes.
* Document taint and toleration setup
And remove setting from default operator ConfigMap
* Allow to overwrite tolerations with Postgres manifest
- Call comparison function in the case of the sync as well as for update
- Include full cluster name in PDB name
- Assign cluster labels to the PDB object