From 236fbbf2c65c7b871f5c48dfdcb72cdaa7a746e5 Mon Sep 17 00:00:00 2001
From: Felix Kunde <felix-kunde@gmx.de>
Date: Mon, 24 Feb 2020 12:52:05 +0100
Subject: [PATCH] be more permissive with standbys

---
 docs/administrator.md                 | 10 +--
 docs/reference/operator_parameters.md |  5 +-
 docs/user.md                          | 87 +++++++++++++++++++++------
 pkg/cluster/k8sres.go                 | 10 +--
 4 files changed, 82 insertions(+), 30 deletions(-)

diff --git a/docs/administrator.md b/docs/administrator.md
index 3597b65ca..9d877c783 100644
--- a/docs/administrator.md
+++ b/docs/administrator.md
@@ -11,11 +11,11 @@ switchover (planned failover) of the master to the Pod with new minor version.
 The switch should usually take less than 5 seconds, still clients have to
 reconnect.
 
-Major version upgrades are supported via [cloning](user.md#clone-directly). The
-new cluster manifest must have a higher `version` string than the source cluster
-and will be created from a basebackup. Depending of the cluster size, downtime
-in this case can be significant as writes to the database should be stopped and
-all WAL files should be archived first before cloning is started.
+Major version upgrades are supported via [cloning](user.md#how-to-clone-an-existing-postgresql-cluster).
+The new cluster manifest must have a higher `version` string than the source
+cluster and will be created from a basebackup. Depending of the cluster size,
+downtime in this case can be significant as writes to the database should be
+stopped and all WAL files should be archived first before cloning is started.
 
 Note, that simply changing the version string in the `postgresql` manifest does
 not work at present and leads to errors. Neither Patroni nor Postgres Operator
diff --git a/docs/reference/operator_parameters.md b/docs/reference/operator_parameters.md
index ca972c22b..20e78fc7b 100644
--- a/docs/reference/operator_parameters.md
+++ b/docs/reference/operator_parameters.md
@@ -110,8 +110,9 @@ Those are top-level keys, containing both leaf keys and groups.
 
 * **min_instances**
   operator will run at least the number of instances for any given Postgres
-  cluster equal to the value of this parameter. When `-1` is specified, no
-  limits are applied. The default is `-1`.
+  cluster equal to the value of this parameter, except for standby clusters
+  which run with one pod if `numberOfInstances` is set to 1. When `-1` is
+  specified for `min_instances`, no limits are applied. The default is `-1`.
 
 * **resync_period**
   period between consecutive sync requests. The default is `30m`.
diff --git a/docs/user.md b/docs/user.md
index e1baf9ad1..dd9b92e60 100644
--- a/docs/user.md
+++ b/docs/user.md
@@ -317,14 +317,21 @@ spec:
     s3_force_path_style: true
 ```
 
+Note, that cloning can also be used for [major version upgrades](administrator.md#minor-and-major-version-upgrade)
+of PostgreSQL.
+
 ## Setting up a standby cluster
 
-Standby clusters are like normal cluster but they are streaming from a remote
-cluster. As the first version of this feature, the only scenario covered by
-operator is to stream from a WAL archive of the master. Following the more
-popular infrastructure of using Amazon's S3 buckets, it is mentioned as
-`s3_wal_path` here. To start a cluster as standby add the following `standby`
-section in the YAML file:
+Standby cluster is a [Patroni feature](https://github.com/zalando/patroni/blob/master/docs/replica_bootstrap.rst#standby-cluster)
+that first clones a database, and keeps replicating changes afterwards. As the
+replication is happening by the means of archived WAL files (stored on S3 or
+the equivalent of other cloud providers), the standby cluster can exist in a
+different location than its source database. Unlike cloning, the PostgreSQL
+version between source and target cluster has to be the same.
+
+To start a cluster as standby, add the following `standby` section in the YAML
+file and specify the S3 bucket path. An empty path will result in an error and
+no statefulset will be created.
 
 ```yaml
 spec:
@@ -332,20 +339,62 @@ spec:
     s3_wal_path: "s3 bucket path to the master"
 ```
 
-Things to note:
+At the moment, the operator only allows to stream from the WAL archive of the
+master. Thus, it is recommended to deploy standby clusters with only [one pod](../manifests/standby-manifest.yaml#L10).
+You can raise the instance count when detaching. Note, that the same pod role
+labels like for normal clusters are used: The standby leader is labeled as
+`master`.
 
-- An empty string in the `s3_wal_path` field of the standby cluster will result
-  in an error and no statefulset will be created.
-- Only one pod can be deployed for stand-by cluster.
-- To manually promote the standby_cluster, use `patronictl` and remove config
-  entry.
-- There is no way to transform a non-standby cluster to a standby cluster
-  through the operator. Adding the standby section to the manifest of a running
-  Postgres cluster will have no effect. However, it can be done through Patroni
-  by adding the [standby_cluster](https://github.com/zalando/patroni/blob/bd2c54581abb42a7d3a3da551edf0b8732eefd27/docs/replica_bootstrap.rst#standby-cluster)
-  section using `patronictl edit-config`. Note that the transformed standby
-  cluster will not be doing any streaming. It will be in standby mode and allow
-  read-only transactions only.
+### Providing credentials of source cluster
+
+A standby cluster is replicating the data (including users and passwords) from
+the source database and is read-only. The system and application users (like
+standby, postgres etc.) all have a password that does not match the credentials
+stored in secrets which are created by the operator. One solution is to create
+secrets beforehand and paste in the credentials of the source cluster.
+Otherwise, you will see errors in the Postgres logs saying users cannot log in
+and the operator logs will complain about not being able to sync resources.
+This, however, can safely be ignored as it will be sorted out once the cluster
+is detached from the source (and it’s still harmless if you don’t plan to).
+
+You can also edit the secrets afterwards. Find them by:
+
+```bash
+kubectl get secrets --all-namespaces | grep <postgres-cluster-name>
+```
+
+### Promote the standby
+
+One big advantage of standby clusters is that they can be promoted to a proper
+database cluster. This means it will stop replicating changes from the source,
+and start accept writes itself. This mechanism makes it possible to move
+databases from one place to another with minimal downtime. Currently, the
+operator does not support promoting a standby cluster. It has to be done
+manually using `patronictl edit-config` inside the postgres container of the
+standby leader pod. Remove the following lines from the YAML structure and the
+leader promotion happens immediately. Before doing so, make sure that the
+standby is not behind the source database.
+
+```yaml
+standby_cluster:
+  create_replica_methods:
+    - bootstrap_standby_with_wale
+    - basebackup_fast_xlog
+  restore_command: envdir "/home/postgres/etc/wal-e.d/env-standby" /scripts/restore_command.sh
+     "%f" "%p"
+```
+
+Finally, remove the `standby` section from the postgres cluster manifest.
+
+### Turn a normal cluster into a standby
+
+There is no way to transform a non-standby cluster to a standby cluster through
+the operator. Adding the `standby` section to the manifest of a running
+Postgres cluster will have no effect. But, as explained in the previous
+paragraph it can be done manually through `patronictl edit-config`. This time,
+by adding the `standby_cluster` section to the Patroni configuration. However,
+the transformed standby cluster will not be doing any streaming. It will be in
+standby mode and allow read-only transactions only.
 
 ## Sidecar Support
 
diff --git a/pkg/cluster/k8sres.go b/pkg/cluster/k8sres.go
index 4468c8428..36b4aaa5d 100644
--- a/pkg/cluster/k8sres.go
+++ b/pkg/cluster/k8sres.go
@@ -1048,11 +1048,13 @@ func (c *Cluster) getNumberOfInstances(spec *acidv1.PostgresSpec) int32 {
 	cur := spec.NumberOfInstances
 	newcur := cur
 
-	/* Limit the max number of pods to one, if this is standby-cluster */
 	if spec.StandbyCluster != nil {
-		c.logger.Info("Standby cluster can have maximum of 1 pod")
-		min = 1
-		max = 1
+		if newcur > 1 {
+			c.logger.Warningf("operator only supports standby clusters with 1 pod")
+		} else {
+			min = newcur
+			max = newcur
+		}
 	}
 	if max >= 0 && newcur > max {
 		newcur = max