Do not show 'persistent volumes are not compatible' errors for volumes
that failed to be resized for other reasons (e.g. the new size is
smaller than the existing one).
* During initial event processing, submit the service account for pods and bind it to a cluster role that allows Patroni to start successfully. The cluster role is assumed to be created by the Kubernetes cluster administrator.
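A rough sketch of what that could look like with client-go follows; the service account name "postgres-pod", the cluster role name "postgres-pod-reader" and the helper itself are made up for illustration and are not the operator's actual code:

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensurePodServiceAccount submits a service account for the database pods and
// binds it to a cluster role that the cluster administrator is assumed to
// have created beforehand. Names here are hypothetical.
func ensurePodServiceAccount(ctx context.Context, client kubernetes.Interface, namespace string) error {
	sa := &corev1.ServiceAccount{ObjectMeta: metav1.ObjectMeta{Name: "postgres-pod"}}
	if _, err := client.CoreV1().ServiceAccounts(namespace).Create(ctx, sa, metav1.CreateOptions{}); err != nil {
		return err
	}

	rb := &rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "postgres-pod"},
		Subjects: []rbacv1.Subject{{
			Kind:      rbacv1.ServiceAccountKind,
			Name:      sa.Name,
			Namespace: namespace,
		}},
		RoleRef: rbacv1.RoleRef{
			APIGroup: rbacv1.GroupName,
			Kind:     "ClusterRole",
			Name:     "postgres-pod-reader", // assumed to exist already
		},
	}
	_, err := client.RbacV1().RoleBindings(namespace).Create(ctx, rb, metav1.CreateOptions{})
	return err
}
```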
* Up until now, the operator read its own configuration from a
configmap. That has a number of limitations, e.g. when the
configuration value is not a scalar but a map or a list. We used
custom code based on github.com/kelseyhightower/envconfig to decode
non-scalar values out of plain-text keys, but that breaks when the data
inside the keys contains both YAML-special elements (e.g. commas) and
complex quotes; a good example is search_path inside
`team_api_role_configuration`. In addition, reliance on the configmap
forced a flat structure on the configuration, making it hard to write
and to read (see
https://github.com/zalando-incubator/postgres-operator/pull/308#issuecomment-395131778).
These changes allow supplying the operator configuration in a proper YAML
file. That required registering a custom CRD to support the operator
configuration and providing an example at
manifests/postgresql-operator-default-configuration.yaml. At the moment,
both the old configmap and the new CRD configuration are supported, so
there are no compatibility issues; however, in the future I'd like to
deprecate the configmap-based configuration altogether. Contrary to the
configmap-based configuration, the CRD-based one doesn't embed defaults into
the operator code; however, one can use
manifests/postgresql-operator-default-configuration.yaml as a starting
point in order to build a custom configuration.
Since previously the `ReadyWaitInterval` and `ReadyWaitTimeout` parameters
used to create the CRD were taken from the operator configuration, which
is not possible if the configuration itself is stored in a CRD object,
I've added the ability to specify them as the environment variables
`CRD_READY_WAIT_INTERVAL` and `CRD_READY_WAIT_TIMEOUT` respectively.
Per review by @zerg-junior and @Jan-M.
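A minimal sketch of how those two environment variables could be read, assuming simple defaults when they are unset or malformed (the helper name and the default values are illustrative, not the operator's actual code):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// durationFromEnv reads a duration from an environment variable, falling back
// to the supplied default when the variable is unset or cannot be parsed.
func durationFromEnv(name string, def time.Duration) time.Duration {
	value, ok := os.LookupEnv(name)
	if !ok {
		return def
	}
	d, err := time.ParseDuration(value)
	if err != nil {
		return def // tolerate malformed input instead of failing hard
	}
	return d
}

func main() {
	interval := durationFromEnv("CRD_READY_WAIT_INTERVAL", 4*time.Second)
	timeout := durationFromEnv("CRD_READY_WAIT_TIMEOUT", 30*time.Second)
	fmt.Printf("waiting for CRD readiness every %v, up to %v\n", interval, timeout)
}
```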
* Switchover must wait for the inner goroutine before it returns.
Otherwise, two corner cases may happen:
- waitForPodLabel writes to the podLabelErr channel that has already
been closed by the outer routine
- the outer routine exits and the caller subscribes to the pod
the inner goroutine has already subscribed to, resulting in a panic.
The previous commit fe47f9ebea that touched that code added the
cancellation channel, but didn't bother to actually wait for the
goroutine to be cancelled.
Per report and review from @valer-cara.
Original issue: https://github.com/zalando-incubator/postgres-operator/issues/342
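A condensed illustration of the fix, with stand-in functions and timings (this is not the operator's code, just the synchronization pattern it needs):

```go
package main

import (
	"fmt"
	"time"
)

// waitForPodLabel stands in for the real function: it reports a result on
// podLabelErr unless it is told to stop first.
func waitForPodLabel(stopCh <-chan struct{}, podLabelErr chan<- error) {
	select {
	case <-time.After(2 * time.Second): // pretend the pod receives its label here
		select {
		case podLabelErr <- nil:
		case <-stopCh:
		}
	case <-stopCh:
		// cancelled: never write to podLabelErr, the caller is about to close it
	}
}

func switchover() error {
	podLabelErr := make(chan error)
	stopCh := make(chan struct{})
	done := make(chan struct{})

	go func() {
		defer close(done)
		waitForPodLabel(stopCh, podLabelErr)
	}()

	var err error
	select {
	case err = <-podLabelErr:
	case <-time.After(time.Second):
		err = fmt.Errorf("timeout waiting for the pod label")
	}

	close(stopCh) // tell the inner goroutine to stop
	<-done        // and actually wait for it before closing the channel
	close(podLabelErr)
	return err
}

func main() {
	fmt.Println(switchover())
}
```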
The old way of specifying it with the annotation is deprecated and no
longer available in recent Kubernetes versions. We will keep it there
anyway until we upgrade to the new go-client, which is incompatible with
those versions.
Per report from @schmitch
* Define sidecars in the operator configuration.
Right now only the name and the docker image can be defined, but with
the help of the pod_environment_configmap parameter arbitrary
environment variables can be passed to the sidecars (see the sketch
after this entry).
* Refactoring around generatePodTemplate.
Original implementation of per-cluster sidecars by @theRealWardo
Per review by @zerg-junior and @Jan-M
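A rough sketch of how such a sidecar definition could be turned into a container spec, with the configmap referenced by pod_environment_configmap wired in as an environment source; the Sidecar type, the helper and the EnvFrom approach are illustrative assumptions, not the operator's actual implementation:

```go
package cluster

import (
	v1 "k8s.io/api/core/v1"
)

// Sidecar mirrors what the operator configuration allows today: a name and an image.
type Sidecar struct {
	Name        string `json:"name"`
	DockerImage string `json:"docker_image"`
}

// sidecarContainers builds container specs for the configured sidecars and
// sources extra environment variables from the pod environment configmap.
func sidecarContainers(sidecars []Sidecar, podEnvConfigMap string) []v1.Container {
	containers := make([]v1.Container, 0, len(sidecars))
	for _, s := range sidecars {
		c := v1.Container{
			Name:  s.Name,
			Image: s.DockerImage,
		}
		if podEnvConfigMap != "" {
			// Arbitrary environment variables come in via the configmap.
			c.EnvFrom = []v1.EnvFromSource{{
				ConfigMapRef: &v1.ConfigMapEnvSource{
					LocalObjectReference: v1.LocalObjectReference{Name: podEnvConfigMap},
				},
			}}
		}
		containers = append(containers, c)
	}
	return containers
}
```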
Call the Patroni API /config endpoint in order to set special options
that are ignored when set in the configuration file, such as max_connections.
Per https://github.com/zalando-incubator/postgres-operator/issues/297
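A minimal sketch of such a call against the Patroni REST API (the helper is illustrative; the operator's real client code differs):

```go
package patroni

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// setPostgresParameters patches the Patroni /config endpoint so that
// parameters such as max_connections take effect even though Patroni
// ignores them when they only appear in the configuration file.
func setPostgresParameters(patroniURL string, params map[string]string) error {
	payload := map[string]interface{}{
		"postgresql": map[string]interface{}{
			"parameters": params,
		},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return err
	}

	req, err := http.NewRequest(http.MethodPatch, patroniURL+"/config", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("patching Patroni config failed: %s", resp.Status)
	}
	return nil
}
```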
* Some minor refactoring:
Rename Cluster ManualFailover to Switchover
Rename Patroni Failover to Switchover
Add more details to error messages and comments introduced in this PR.
Review by @zerg-junior
After an unsuccessful initial cluster sync it may happen that the
cluster statefulset is empty. This has become more likely since
88d6a7be3, which introduced syncing volumes before statefulsets, and
the volume sync may fail for various reasons (e.g. the volume has
been shrunk, or too many calls to Amazon).
Some special Patroni PostgreSQL parameters, like max_connections,
should reside in the bootstrap.dcs.postgresql.parameters section
in order to take effect.
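For illustration, a tiny helper that nests the parameters under bootstrap.dcs.postgresql.parameters (the function is made up, not the operator's code):

```go
package spilo

import "encoding/json"

// bootstrapConfiguration places the given parameters under
// bootstrap.dcs.postgresql.parameters so that Patroni applies them
// through the DCS rather than ignoring them in the local config file.
func bootstrapConfiguration(dcsParameters map[string]string) ([]byte, error) {
	config := map[string]interface{}{
		"bootstrap": map[string]interface{}{
			"dcs": map[string]interface{}{
				"postgresql": map[string]interface{}{
					"parameters": dcsParameters,
				},
			},
		},
	}
	return json.Marshal(config)
}
```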
When an error happens upon deletion of a Kubernetes object
belonging to the cluster being removed, it makes no sense to abort the
deletion: the manifest will be removed anyway, and therefore all the
objects after the one we aborted at would stay forever.
Do not use the statefulset's number of pods to figure out the running
ones for volume resizing, since the statefulset pointer could be nil.
Instead, look at the actual running pods.
Avoid sharing pointers to the same spec data between the informer
and the clusters. The only catch is that the error field is cleared
during the deep copy, since it is an interface that may contain private
fields that cannot be copied; however, the error is only used when
the manifest is parsed and before it is queued, therefore we never
refer to that field in the cluster structure.
987b434 introduced a new function that modifies the cluster spec in
memory before the cluster processes it. Unfortunately, the instance
being modified turned out to be the one stored internally in the
PostgresInformer, causing those modifications to be propagated with
further cluster events and producing update loops on some occasions.
This commit makes sure we copy the spec before putting it into the
clusterEventQueues.
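A condensed sketch of the pattern (the types are heavily simplified, names are illustrative, and the real spec needs a proper deep copy of its nested fields):

```go
package controller

// PostgresSpec is a drastically reduced stand-in for the real spec type.
type PostgresSpec struct {
	NumberOfInstances int32
	VolumeSize        string
	// Error is an interface value that may hold unexported fields and is
	// therefore dropped by the copy; it is only needed while the manifest
	// is parsed, before the event is queued.
	Error error
}

type ClusterEvent struct {
	EventType string
	NewSpec   *PostgresSpec
}

func copySpec(spec *PostgresSpec) *PostgresSpec {
	copied := *spec    // value copy decouples the struct from the informer's instance
	copied.Error = nil // the interface field cannot be meaningfully deep-copied
	return &copied
}

// queueClusterEvent never shares the informer's pointer with the cluster.
func queueClusterEvent(queue chan<- ClusterEvent, eventType string, informerSpec *PostgresSpec) {
	queue <- ClusterEvent{
		EventType: eventType,
		NewSpec:   copySpec(informerSpec),
	}
}
```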
* Deprecate old LB options, fix endpoint sync.
- deprecate useLoadBalancer and replicaLoadBalancer from the manifest
and enable_load_balancer from the operator configuration. The old
operator configuration options become no-op with this commit. The
old manifest options `useLoadBalancer` and `replicaLoadBalancer`
are still consulted, but only in the absence of the new ones
(enableMasterLoadBalancer and enableReplicaLoadBalancer); see the
sketch after this list.
- Make sure the endpoint being created during the sync receives the
proper address subset. This is more critical for the replicas, as for the
masters Patroni will normally re-create the endpoint before the
operator.
- Avoid creating the replica endpoint, since it will be created automatically
by the corresponding service.
- Update the README and unit tests.
Code review by @mgomezch and @zerg-junior
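A compact sketch of the precedence described above; the field names mirror the manifest and operator configuration options, but the types and the helper are illustrative, and the operator's real defaulting may differ in detail:

```go
package cluster

// Spec holds the per-cluster manifest options relevant to load balancers.
type Spec struct {
	EnableMasterLoadBalancer  *bool // new manifest option
	EnableReplicaLoadBalancer *bool // new manifest option
	UseLoadBalancer           *bool // deprecated
	ReplicaLoadBalancer       *bool // deprecated
}

// OperatorConfig holds the operator-wide defaults.
type OperatorConfig struct {
	EnableMasterLoadBalancer  bool
	EnableReplicaLoadBalancer bool
}

func shouldCreateMasterLoadBalancer(spec Spec, cfg OperatorConfig) bool {
	if spec.EnableMasterLoadBalancer != nil { // the new option wins when present
		return *spec.EnableMasterLoadBalancer
	}
	if spec.UseLoadBalancer != nil { // deprecated option, consulted only in its absence
		return *spec.UseLoadBalancer
	}
	return cfg.EnableMasterLoadBalancer // fall back to the operator configuration
}
```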
* Sanity checks for the cluster name, improve tests.
- check that both normal and clone cluster names comply with the
requirements for a valid service name (see the sketch after this list).
For a clone cluster, only do it if the clone timestamp is not set;
with a clone timestamp set, the clone name points to the S3 bucket
- add tests and improve existing ones, making sure we don't call the
Error() method on an empty error, as well as that we don't miss cases
where the expected error is not empty but the actual call being tested
does not return an error.
Code review by @zerg-junior and @Jan-M
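A minimal sketch of the name check, assuming a service name must be a valid DNS-1035 label (the helper is illustrative and leans on the apimachinery validation package):

```go
package cluster

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation"
)

// validateClusterName rejects names that could not be used as a Kubernetes
// service name, i.e. names that are not valid DNS-1035 labels.
func validateClusterName(name string) error {
	if errs := validation.IsDNS1035Label(name); len(errs) > 0 {
		return fmt.Errorf("cluster name %q cannot be used as a service name: %v", name, errs)
	}
	return nil
}
```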
* Improve the pod moving behavior during the Kubernetes cluster upgrade.
Fix an issue of not waiting for at least one replica to become ready
(if the statefulset indicates there are replicas) when moving the master
pod off the decommissioned node. Resolves the first part of #279.
Small fixes to error messages.
* Eliminate a race condition during the switchover.
When the operator initiates a failover (switchover) that fails and
then retries it a second time, it may happen that the previous
waitForPodChannel is still active. As a result, the operator subscribes
to the former master pod twice, causing a panic.
The problem was that the original code didn't bother to cancel the
waitForPodLabel for the new master pod in the case when the failover
fails. This commit fixes it by adding a stop channel to that function.
Code review by @zerg-junior
Avoid showing "there is no service in the cluster" when syncing a
service for the cluster if the operator has been restarted after
the cluster had been created.
Compare the pods' controller revisions with the one of the statefulset
to determine whether a pod is running the latest revision and,
therefore, no rolling update is necessary. This is performed only
during the operator start; afterwards, the rolling update status
that is stored locally in the cluster structure is used for all
rolling update decisions.
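An abridged sketch of that comparison, assuming the pod's controller-revision-hash label is matched against the statefulset's current update revision (the helper is illustrative):

```go
package cluster

import (
	appsv1 "k8s.io/api/apps/v1"
	v1 "k8s.io/api/core/v1"
)

// podIsOnLatestRevision reports whether the pod already runs the revision the
// statefulset would roll it to, so no rolling update is needed for it.
func podIsOnLatestRevision(pod *v1.Pod, sset *appsv1.StatefulSet) bool {
	podRevision, ok := pod.Labels[appsv1.ControllerRevisionHashLabelKey]
	if !ok {
		return false // without the revision label we cannot prove the pod is current
	}
	return podRevision == sset.Status.UpdateRevision
}
```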
* Remove the 'team' label from the statefulset selector.
It was never supposed to be there, but the statefulset implicitly
creates a selector out of the meta.labels field. That is a problem
with recent Kubernetes versions, since the statefulset cannot pick up
pods with non-matching label selectors, and we rely on the statefulset
picking up old pods after statefulset replacement.
Make sure selector changes trigger replacement of the statefulset.
If the new selector merely has more labels than the old one, nothing
should be done with the statefulset; otherwise the new statefulset
wouldn't see the orphaned pods from the old one, as they wouldn't match
the selector.
See https://github.com/kubernetes/kubernetes/issues/46901#issuecomment-356418393
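A short sketch of a selector comparison along those lines (illustrative only; the operator's actual check may be structured differently):

```go
package cluster

// selectorRequiresReplacement reports whether the statefulset has to be
// replaced because its selector dropped or changed a label. When the new
// selector only adds labels, the statefulset is left alone, since a
// replacement would not adopt the orphaned pods from the old one.
func selectorRequiresReplacement(oldSelector, newSelector map[string]string) bool {
	for key, oldValue := range oldSelector {
		newValue, present := newSelector[key]
		if !present || newValue != oldValue {
			return true // a label was dropped or changed, e.g. removing 'team'
		}
	}
	return false
}
```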
Enhance the definitions of infrastructure roles by allowing membership in multiple roles, role options and per-role configuration to be specified in the infrastructure role configmap, which must have the same name as the infrastructure role secret. See manifests/infrastructure-roles-configmap.yaml for examples and the updated README for a description of the different types of database roles supported by the operator and their purposes.
Change the logic of merging infrastructure roles with manifest roles that have the same name: return the infrastructure role unchanged instead of merging. Previously, we used to propagate flags from the manifest role to the resulting infrastructure one, as there was no way to define flags for an infrastructure role; however, this is no longer the case.
Code review and tests by @erthalion
By default, spilo sets WAL_BUCKET_SCOPE_PREFIX depending on the cluster
namespace, possibly to a non-empty string. However, we won't be able to
clone those clusters, as the clone prefix is always set to an empty string.
We could go the other way around and set both WAL_BUCKET_SCOPE_PREFIX
and CLONE_WAL_BUCKET_SCOPE_PREFIX to a non-default value that depends
on the cluster's namespace, but it seems that we don't need this
feature for now (no conflict will occur even for clusters with the
same name and different namespaces because of the SCOPE_SUFFIX) and
it requires some additional testing first.
* Track origin of roles.
* Propagate changes on infrastructure roles to corresponding secrets.
When the password in the infrastructure role is updated, re-generate the
secret for that role.
Previously, the password for an infrastructure role was always fetched from
the secret, making any updates to such a role a no-op after the corresponding
secret had been generated.
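A trimmed-down sketch of that propagation, assuming the password is kept under the "password" key of the secret (the helper and key name are illustrative, not the operator's actual code):

```go
package cluster

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// syncRoleSecret regenerates the secret for an infrastructure role whenever
// the password recorded in the role definition no longer matches the secret,
// instead of treating the old secret as the source of truth.
func syncRoleSecret(ctx context.Context, client kubernetes.Interface, namespace, secretName, rolePassword string) error {
	secret, err := client.CoreV1().Secrets(namespace).Get(ctx, secretName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if string(secret.Data["password"]) == rolePassword {
		return nil // the secret already reflects the role definition
	}
	if secret.Data == nil {
		secret.Data = map[string][]byte{}
	}
	secret.Data["password"] = []byte(rolePassword)
	_, err = client.CoreV1().Secrets(namespace).Update(ctx, secret, metav1.UpdateOptions{})
	return err
}
```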
There used to be a masterLess flag that was supposed to indicate whether the cluster it belongs to runs without an acting master by design. At some point, since we didn't really have support for such clusters, the flag was misused to indicate that there is no master in the cluster. However, that was not done consistently (a cluster without all pods running would never be marked masterless, even when the master was not among the running pods), and it was based on the wrong assumption that a masterless cluster would remain masterless until the next attempt to change that flag, ignoring the possibility of the master coming up or some node doing a successful promotion. Therefore, this PR gets rid of that flag completely.
When the cluster is running with 0 instances, there is obviously no master and it makes no sense to create any database objects inside the non-existing master. Therefore, this PR introduces an additional check for that.
recreatePods used to assume that the roles of the pods recorded when the function started would not change; for instance, terminated replica pods should start as replicas. Revisit that assumption by looking at the actual role of the re-spawned pods; that avoids a failover if some replica has been promoted to the master role while being re-spawned. In addition, if the failover from the old master was unsuccessful, we used to stop and leave the old master running on an old pod, without recording this fact anywhere. This PR makes the failover failure emit a warning, but does not stop recreating the last master pod; in the worst case, the running master will be terminated, however, this case is rather unlikely.
As a side effect, make waitForPodLabel return the pod definition it waited for, avoiding extra API calls in recreatePods and movePodFromEndOfLifeNode.
This allows using the S3 API to simplify finding all folders that differ only by a suffix, since the suffix delimiter will not occur in the suffix itself (currently a UID).
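For illustration, listing such folders with the AWS SDK and a delimiter could look roughly like this (the bucket, prefix and delimiter values are made up):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession())
	client := s3.New(sess)

	// Because the delimiter never occurs inside the suffix (the UID),
	// every common prefix corresponds to exactly one cluster "folder".
	out, err := client.ListObjectsV2(&s3.ListObjectsV2Input{
		Bucket:    aws.String("example-wal-bucket"),
		Prefix:    aws.String("spilo/mycluster"),
		Delimiter: aws.String("/"),
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range out.CommonPrefixes {
		fmt.Println(aws.StringValue(p.Prefix))
	}
}
```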