Commit Graph

157 Commits

Author SHA1 Message Date
Oleksii Kliukin 87bc47d8d0 Fixes for the case of re-creating the cluster after deletion.
- make sure that the secrets for the system users (superuser, replication)
  are not deleted when the main cluster is. Therefore, we can re-create
  the cluster, potentially forcing Patroni to restore it from the backup
  and enable Patroni to connect, since it will use the old password, not
  the newly generated random one.

- when syncing users, always check whether they are already in the DB.
  Previously, we did this only for the sync cluster case, but the new
  cluster could be actually the one restored from the backup by Patroni,
  having all or some of the users already in place.

- delete endpoints last. Patroni uses the $clustername endpoint in order
  to store the leader related metadata. If we remove it before removing
  all pods, one of those pods running Patroni will re-create it and the
  next attempt to create the cluster with the same name will stumble on
  the existing endpoint.

- Use db.Exec instead of db.Query for queries that expect no result.
  This also fixes the issue with the DB creation: since we didn't
  release an empty Row object, it was not possible to create more than
  one database for a cluster.
2017-12-13 16:49:00 +01:00
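
The db.Exec point in the commit above can be illustrated with a minimal sketch. It is not the operator's code: `execer`, `quoteIdentifier`, `createDatabase` and `recordingExecer` are hypothetical names chosen for the example, and only `database/sql` from the standard library is assumed.

```go
package main

import (
	"database/sql"
	"database/sql/driver"
	"fmt"
	"strings"
)

// execer is the subset of *sql.DB used below; *sql.DB satisfies it.
// (Hypothetical interface, introduced only for this sketch.)
type execer interface {
	Exec(query string, args ...any) (sql.Result, error)
}

// quoteIdentifier double-quotes a PostgreSQL identifier and doubles any
// embedded double quotes.
func quoteIdentifier(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// createDatabase runs a statement that returns no rows. Exec hands back no
// *sql.Rows, so there is nothing left to close before the next statement.
// With Query, the returned rows would have to be closed explicitly.
func createDatabase(db execer, name, owner string) error {
	stmt := fmt.Sprintf("CREATE DATABASE %s OWNER %s",
		quoteIdentifier(name), quoteIdentifier(owner))
	if _, err := db.Exec(stmt); err != nil {
		return fmt.Errorf("could not create database %q: %w", name, err)
	}
	return nil
}

// recordingExecer stands in for a real database and only records the
// statements it receives, so the sketch can be exercised without a server.
type recordingExecer struct {
	statements []string
}

func (r *recordingExecer) Exec(query string, args ...any) (sql.Result, error) {
	r.statements = append(r.statements, query)
	return driver.RowsAffected(0), nil
}
```

Calling `createDatabase` twice against the same `recordingExecer` (or a real `*sql.DB`) issues two independent statements, which is the behavior the fix is after.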
Oleksii Kliukin 1fb8cf7ea0 Avoid overwriting critical users. (#172)
* Avoid overwriting critical users.

Disallow defining new users in the cluster manifest, the teams API or
the infrastructure roles with any of the names listed in the new
protected_role_names parameter (a comma-separated list of names).

Additionally, forbid defining a user with the name matching either
super_username or replication_username, so that we don't overwrite
system roles required for correct working of the operator itself.

Also, clear PostgreSQL roles on each sync first in order to avoid using
the old definitions that are no longer present in the current manifest,
infrastructure roles secret or the teams API.
2017-12-05 14:27:12 +01:00
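
The name checks described in the commit above might look roughly like the following sketch. The function names and signatures are assumptions made for illustration; only the protected_role_names, super_username and replication_username parameters come from the commit message.

```go
package main

import "strings"

// parseProtectedRoleNames turns the comma-separated protected_role_names
// value into a set, trimming whitespace and skipping empty items.
func parseProtectedRoleNames(list string) map[string]bool {
	names := make(map[string]bool)
	for _, item := range strings.Split(list, ",") {
		if name := strings.TrimSpace(item); name != "" {
			names[name] = true
		}
	}
	return names
}

// isNameAllowed reports whether a user-defined role may use the given name.
// System role names and protected names are both rejected.
func isNameAllowed(name, superUsername, replicationUsername string, protected map[string]bool) bool {
	if name == superUsername || name == replicationUsername {
		return false
	}
	return !protected[name]
}
```

For example, with protected_role_names set to "admin, cron" and system users "postgres" and "standby", the names "admin", "postgres" and "standby" would be refused while "app_user" would pass.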
Oleksii Kliukin 022ce29314 Make an error message more verbose. 2017-12-04 10:49:25 +01:00
Oleksii Kliukin 637921cdee Tests for initHumanUsers and initRobotUsers.
Change the Cluster class in the process to implement Teams API
calls and Oauth token fetches as interfaces, so that we can mock
them in the tests.
2017-12-04 10:49:25 +01:00
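
A rough sketch of the interface-based approach from the commit above follows. The interface names, method signatures and the `cluster` struct are hypothetical; the commit only states that Teams API calls and OAuth token fetches became interfaces so that tests can mock them.

```go
package main

import "fmt"

// teamsAPIClient and oauthTokenGetter are hypothetical interfaces standing
// in for the real Teams API and OAuth dependencies.
type teamsAPIClient interface {
	TeamMembers(teamID, token string) ([]string, error)
}

type oauthTokenGetter interface {
	Token() (string, error)
}

// cluster depends on the interfaces only, so tests can inject mocks.
type cluster struct {
	teamID     string
	teams      teamsAPIClient
	tokens     oauthTokenGetter
	humanUsers map[string]bool
}

// initHumanUsers fetches a token, asks the Teams API for the team members
// and records them as human users.
func (c *cluster) initHumanUsers() error {
	token, err := c.tokens.Token()
	if err != nil {
		return fmt.Errorf("could not get oauth token: %w", err)
	}
	members, err := c.teams.TeamMembers(c.teamID, token)
	if err != nil {
		return fmt.Errorf("could not get team members: %w", err)
	}
	c.humanUsers = make(map[string]bool, len(members))
	for _, member := range members {
		c.humanUsers[member] = true
	}
	return nil
}

// mockTeams and mockTokens are test doubles that need no network access.
type mockTeams struct {
	members []string
}

func (m mockTeams) TeamMembers(teamID, token string) ([]string, error) {
	return m.members, nil
}

type mockTokens struct{}

func (mockTokens) Token() (string, error) {
	return "test-token", nil
}
```

A test can then build a `cluster` with `mockTeams` and `mockTokens` and assert on `humanUsers` without contacting any external service.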
Oleksii Kliukin 611cfe96d6 Fix an issue when not assigning the merge result.
Add some tests.
2017-12-04 10:49:25 +01:00
Oleksii Kliukin 831ebb1f32 Fix the error reporting. 2017-12-04 10:49:25 +01:00
Oleksii Kliukin 2e226dee26 Avoid overwriting infrastructure roles.
When a role is defined in both the infrastructure roles and the cluster
manifest, use the infrastructure role definition and add the flags
defined in the manifest.

Previously, the role was overwritten by the definition from the
manifest. Because a random password is generated for each role from the
manifest, the applications relying on the infrastructure role credentials
from the infrastructure roles secret were unable to connect.
2017-12-04 10:49:25 +01:00
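
The merge rule from the commit above can be sketched as follows. The `pgUser` struct here is a simplified stand-in with assumed fields, and `mergeInfrastructureRole` is a hypothetical helper, not the operator's actual function.

```go
package main

// pgUser is a simplified, hypothetical role description.
type pgUser struct {
	Name     string
	Password string
	Flags    []string
}

// mergeInfrastructureRole keeps the name and password of the infrastructure
// role and adds the manifest flags to its flags, dropping duplicates.
// The result is returned, so the caller has to assign it.
func mergeInfrastructureRole(infra, manifest pgUser) pgUser {
	merged := pgUser{Name: infra.Name, Password: infra.Password}
	seen := make(map[string]bool)
	all := append(append([]string{}, infra.Flags...), manifest.Flags...)
	for _, flag := range all {
		if !seen[flag] {
			seen[flag] = true
			merged.Flags = append(merged.Flags, flag)
		}
	}
	return merged
}
```

Because the password always comes from the infrastructure role, applications that read their credentials from the infrastructure roles secret keep working after the merge.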
Oleksii Kliukin dd0affc390 Tweak our reaction to the cluster upgrade process.
Previously, the operator started to move the pods off the nodes to be
decommissioned by watching the eol_node_label value. Every new postgres
pod was created with the anti-affinity to that label, making sure
that the pods being moved won't land on another node that is about to
be decommissioned.

The changes introduce another label that indicates the ready node. The
new pod affinity will ensure that the pod is only scheduled to the node
marked as ready, discarding the previous anti-affinity. That way the
nodes can transition from the pending-decomission to the other statuses
(drained, terminating) without having pods suddenly scheduled to them.

In addition, rename the label that triggers the start of the upgrade
process to node_eol_label (for consistency with node_readiness_label)
and set its default value to lifecycle-status:pending-decomission.
2017-11-30 14:11:49 +01:00
Oleksii Kliukin 1ffe98ba9f Fix the connection leak and user options sync.
- close the cursor for queries that return no rows; previously it was
  left open.
- fix syncing of the user options, as previously those were not
  fetched from the database.
2017-11-27 16:46:34 +01:00
Oleksii Kliukin 975b21f633 Rename api roles configuration parameter.
Change api_roles_configuration to team_api_role_configuration
2017-11-22 10:43:35 +01:00
Oleksii Kliukin 2352fc9a39 go fmt run 2017-11-22 10:43:35 +01:00
Oleksii Kliukin 415a7fdc4d Allow global configuration options for API roles.
Add options to the PgUser structure, which potentially allows setting
per-role options in the cluster definition as well.

Introduce api_roles_configuration operator option with the default
of log_statement=all
2017-11-22 10:43:35 +01:00
Oleksii Kliukin 6dcd074ea0 Allow per-cluster setting of a docker image.
Add dockerImage cluster configuration parameter that overrides global
operator defaults when set to a non-empty value.
2017-11-14 11:53:04 +01:00
Oleksii Kliukin c25e849fe4 Fix a failure to create new statefulset at sync.
Also do a fmt run.
2017-11-08 18:24:17 +01:00
Murat Kabilov 86803406db use sync methods while updating the cluster 2017-11-03 12:00:43 +01:00
Georg Kunz 47dd766fa7 Add node toleration config to PodSpec (#151)
* Add node toleration config to PodSpec

This allows tainting nodes dedicated to Postgres, which prevents other pods from running on these nodes.

* Document taint and toleration setup

And remove setting from default operator ConfigMap

* Allow overwriting tolerations with the Postgres manifest
2017-11-02 19:10:44 +01:00
Oleksii Kliukin ce960e892a Create new databases and change owners of existing ones during sync. (#153)
* Create new databases and change owners of existing ones during sync.
2017-11-02 17:46:33 +01:00
Oleksii Kliukin 7a76be7d3e Minor fixes around PDB (pod-disruption-budget) syncing: (#147)
- Call comparison function in the case of the sync as well as for update
- Include full cluster name in PDB name
- Assign cluster labels to the PDB object
2017-10-23 12:26:59 +02:00
Murat Kabilov c17aabb642 fix pod disruption budget labels (#146) 2017-10-20 15:01:51 +02:00
Murat Kabilov 661b141849 Fix Pod Disruption Budget null pointer exception 2017-10-20 11:43:50 +02:00
Murat Kabilov a1deae198b add missing master matchLabel for the PDB (#144) 2017-10-20 11:26:40 +02:00
Oleksii Kliukin eba23279c8 Kube cluster upgrade 2017-10-19 10:49:42 +02:00
Oleksii Kliukin 1dbf259c76 Retry opening DB connections. (#140)
Make sure DB connection retry also reopens a connection after closing it
2017-10-18 16:28:00 +02:00
Oleksii Kliukin 99870d8eac Fix division by zero when connecting to the DB.
Apparently the retry function's first parameter is the duration of
a single attempt and it cannot be zero.
2017-10-18 10:44:49 +02:00
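
One way a zero interval can lead to a division by zero is sketched below. This is an assumption about the failure mode, not the retry helper the operator actually uses: the hypothetical `retry` function derives the number of attempts from `timeout / interval`, which panics for integer durations when `interval` is zero, so the sketch rejects that value up front.

```go
package main

import (
	"errors"
	"time"
)

// retry calls attempt until it reports success, returns an error, or the
// number of attempts derived from timeout/interval is used up.
func retry(interval, timeout time.Duration, attempt func() (bool, error)) error {
	if interval <= 0 {
		// Without this guard the division below would panic for interval == 0.
		return errors.New("retry interval must be positive")
	}
	maxAttempts := int(timeout / interval)
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	for i := 0; i < maxAttempts; i++ {
		done, err := attempt()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if i < maxAttempts-1 {
			time.Sleep(interval)
		}
	}
	return errors.New("still failing after all attempts")
}
```

A caller that wants to retry opening a database connection would pass a small positive interval, for example one second, together with the overall timeout.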
Murat Kabilov 202f2de988 Retry connecting to pg 2017-10-17 17:03:50 +02:00
Murat Kabilov 6c4cb4e9da Perform manual failover during the scale down 2017-10-16 17:41:23 +02:00
Murat Kabilov 5b29576a8e Remove redundant constants 2017-10-16 15:52:48 +02:00
Murat Kabilov 3b32265258 Set status of the cluster on sync fail/success 2017-10-12 15:10:42 +02:00
Jan Mussler cec695d48e Superuser toggle for team members
Make superuser toggleable for team members. Add an "admin" role to team members if superuser is disabled.
2017-10-12 15:01:54 +02:00
Murat Kabilov 8d5faaa5a5 return idle status when worker has nothing to do 2017-10-11 15:42:20 +02:00
Oleksii Kliukin 793defef72 Fix pod wait timeouts.
Previously, the timer was reset on every message received through
the pod channel, so the wait could outlast the configured timeout.
2017-10-11 14:58:37 +02:00
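
The timer problem in the commit above is a common pattern in Go select loops. The sketch below is illustrative only: `waitForEvent` and its string events are hypothetical. It creates the timeout channel once, outside the loop, so unrelated messages cannot restart the countdown.

```go
package main

import (
	"errors"
	"time"
)

// waitForEvent consumes events until match returns true or the overall
// timeout expires, whichever comes first.
func waitForEvent(events <-chan string, timeout time.Duration, match func(string) bool) error {
	// Created once. Writing `case <-time.After(timeout)` inside the select
	// would start a fresh timer on every loop iteration instead.
	deadline := time.After(timeout)
	for {
		select {
		case event, ok := <-events:
			if !ok {
				return errors.New("event channel closed")
			}
			if match(event) {
				return nil
			}
		case <-deadline:
			return errors.New("timed out waiting for pod event")
		}
	}
}
```

With this shape, a steady stream of non-matching events still ends in a timeout once the configured duration has elapsed.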
Murat Kabilov 83c8d6c419 Extend diagnostic api with worker status info 2017-10-11 12:26:09 +02:00
Murat Kabilov 71a540ff48 Merge branch 'master' into crd 2017-10-09 11:55:18 +02:00
Murat Kabilov a35e9c6119 move from tpr to crd 2017-10-06 15:12:08 +02:00
Murat Kabilov 3b8c06416e skip manual failover for 1-pod clusters 2017-10-05 13:30:15 +03:00
Jan Mussler c4af0ac6a6 Update cluster.go 2017-10-05 10:58:23 +02:00
Jan M 4a1170855a Adding '_' to allowed chars. 2017-10-05 10:53:19 +02:00
Murat Kabilov 48ec6b35b9 perform manual failover on pg cluster rolling upgrade 2017-10-04 16:56:47 +03:00
Murat Kabilov 00194d0130 create dbs on cluster create 2017-10-04 16:24:27 +03:00
Murat Kabilov 5cfdabb63e fix regexp for api endpoint urls 2017-09-28 12:00:40 +02:00
Murat Kabilov be8bf22c00 add missing return 2017-09-28 11:23:56 +02:00
Murat Kabilov 93d4bf2b55 Merge branch 'master' into api-improvements 2017-09-26 14:47:13 +02:00
Murat Kabilov 19de2a24b7 go lint 2017-09-26 13:44:30 +02:00
Murat Kabilov d876f4d88e set secret name template via config map 2017-09-18 14:25:09 +02:00
Oleksii Kliukin 7667847bfe Feature/validate role options (#101)
Be more rigorous about validating user flags.

Only accept CREATE ROLE flags that don't take any parameters (i.e.
not ADMIN or CONNECTION LIMIT). Check that both flag and NOflag
are not used at the same time.
2017-09-15 13:57:48 +02:00
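
The validation rules from the commit above could be sketched like this. The function name, the error messages and the exact flag list are assumptions. The list holds the parameterless CREATE ROLE options of recent PostgreSQL versions, each of which also has a NO-prefixed form.

```go
package main

import (
	"fmt"
	"strings"
)

// parameterlessFlags lists CREATE ROLE options that take no argument.
var parameterlessFlags = map[string]bool{
	"SUPERUSER":   true,
	"CREATEDB":    true,
	"CREATEROLE":  true,
	"INHERIT":     true,
	"LOGIN":       true,
	"REPLICATION": true,
	"BYPASSRLS":   true,
}

// normalizeUserFlags upper-cases the flags, rejects anything that is not a
// parameterless option (for example ADMIN or CONNECTION LIMIT) and rejects
// a flag that appears together with its NO-prefixed counterpart.
func normalizeUserFlags(flags []string) ([]string, error) {
	seen := make(map[string]bool)
	result := make([]string, 0, len(flags))
	for _, raw := range flags {
		flag := strings.ToUpper(strings.TrimSpace(raw))
		base := strings.TrimPrefix(flag, "NO")
		if !parameterlessFlags[base] {
			return nil, fmt.Errorf("user flag %q is not a parameterless CREATE ROLE option", raw)
		}
		opposite := "NO" + base
		if flag != base {
			opposite = base
		}
		if seen[opposite] {
			return nil, fmt.Errorf("conflicting user flags %q and %q", flag, opposite)
		}
		if !seen[flag] {
			seen[flag] = true
			result = append(result, flag)
		}
	}
	return result, nil
}
```

Under these assumptions, a manifest entry with the flags login and createdb is accepted and normalized, while LOGIN combined with NOLOGIN is refused.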
Murat Kabilov 969a06f521 Use the DCS_ENABLE_KUBERNETES_API=true environment variable to enable Kubernetes-native deployment 2017-09-14 11:39:49 +02:00
Murat Kabilov 8430ee86c9 add comments on roles 2017-09-11 17:44:32 +02:00
Murat Kabilov 90b49a24ba make postgresql roles public 2017-09-11 17:44:32 +02:00
Oleksii Kliukin 8b85935a7a Allow cloning clusters from the operator. (#90)
Allow cloning clusters from the operator.

The changes add a new JSON node `clone` with the possible keys `cluster`
and `timestamp`. `cluster` is mandatory, and setting a non-empty
`timestamp` triggers WAL-E point-in-time recovery. Spilo and Patroni do
all the heavy lifting; the operator just defines certain variables and
gathers some data about how to connect to the host to clone or the
target S3 bucket.

As a minor change, set the image pull policy to IfNotPresent instead
of Always to simplify local testing.

Change the default replication username to standby.
2017-09-08 16:47:03 +02:00
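
Reading the `clone` node described in the commit above might be sketched as follows. The key names `clone`, `cluster` and `timestamp` come from the commit message. The Go type, the helper names and the position of `clone` at the top level of the parsed document are assumptions made to keep the example short.

```go
package main

import (
	"encoding/json"
	"errors"
)

// cloneDescription mirrors the `clone` node; the type name is hypothetical.
type cloneDescription struct {
	Cluster   string `json:"cluster"`
	Timestamp string `json:"timestamp,omitempty"`
}

// pointInTimeRecovery reports whether a non-empty timestamp was given.
func (c *cloneDescription) pointInTimeRecovery() bool {
	return c.Timestamp != ""
}

// parseClone extracts the clone node from a JSON document. It returns nil
// without an error when no clone node is present.
func parseClone(document []byte) (*cloneDescription, error) {
	var spec struct {
		Clone *cloneDescription `json:"clone"`
	}
	if err := json.Unmarshal(document, &spec); err != nil {
		return nil, err
	}
	if spec.Clone == nil {
		return nil, nil
	}
	if spec.Clone.Cluster == "" {
		return nil, errors.New("clone.cluster is mandatory")
	}
	return spec.Clone, nil
}
```

A document with only `cluster` set would describe a plain clone, while adding a `timestamp` would make `pointInTimeRecovery` return true.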
Oleksii Kliukin a0a9e8f849 Feature/configure replication role (#97)
Configure superuser and replication usernames
2017-09-07 10:12:34 +02:00