Commit Graph

143 Commits

Author SHA1 Message Date
Oleksii Kliukin 55dc12e512 Examine custom environment sources when syncing.
When comparing statefulsets, make sure EnvFrom fields are compared
as well.
2017-12-14 14:39:33 +01:00
Oleksii Kliukin 87bc47d8d0 Fixes for the case of re-creating the cluster after deletion.
- make sure that the secrets for the system users (superuser, replication)
  are not deleted when the main cluster is. Therefore, we can re-create
  the cluster, potentially forcing Patroni to restore it from the backup
  and enable Patroni to connect, since it will use the old password, not
  the newly generated random one.

- when syncing users, always check whether they are already in the DB.
  Previously, we did this only for the sync cluster case, but the new
  cluster could be actually the one restored from the backup by Patroni,
  having all or some of the users already in place.

 - delete endponts last. Patroni uses the $clustername endpoint in order
   to store the leader related metadata. If we remove it before removing
   all pods, one of those pods running Patroni will re-create it and the
   next attempt to create the cluster with the same name will stuble on
   the existing endpoint.

 - Use db.Exec instead of db.Query for queries that expect no result.
   This also fixes the issue with the DB creation, since we didn't
   release an empty Row object it was not possible to create more than
   one database for a cluster.
2017-12-13 16:49:00 +01:00
Oleksii Kliukin 1fb8cf7ea0
Avoid overwriting critical users. (#172)
* Avoid overwriting critical users.

Disallow defining new users either in the cluster manifest, teams
API or infrastructure roles with the names mentioned in the new
protected_role_names parameter (list of comma-separated names)

Additionally, forbid defining a user with the name matching either
super_username or replication_username, so that we don't overwrite
system roles required for correct working of the operator itself.

Also, clear PostgreSQL roles on each sync first in order to avoid using
the old definitions that are no longer present in the current manifest,
infrastructure roles secret or the teams API.
2017-12-05 14:27:12 +01:00
Oleksii Kliukin 022ce29314 Make an error message more verbose. 2017-12-04 10:49:25 +01:00
Oleksii Kliukin 637921cdee Tests for initHumanUsers and initinitRobotUsers.
Change the Cluster class in the process to implelement Teams API
calls and Oauth token fetches as interfaces, so that we can mock
them in the tests.
2017-12-04 10:49:25 +01:00
Oleksii Kliukin 611cfe96d6 Fix an issue when not assigning the merge result.
Add some tests.
2017-12-04 10:49:25 +01:00
Oleksii Kliukin 831ebb1f32 Fix the error reporting. 2017-12-04 10:49:25 +01:00
Oleksii Kliukin 2e226dee26 Avoid overwriting infrastrure roles.
When a role is defined in the infrastructure roles and the cluster
manifest use the infrastructure role definition and add flags
defined in the manifest.

Previously the role has been overwritten by the definition from the
manifest.  Because a random password is generated for each role from the
manifest the applications relying on the infrastructure role credentials
from the infrastructure roles secret were unable to connect.
2017-12-04 10:49:25 +01:00
Oleksii Kliukin 975b21f633 Rename api roles configuration parameter.
Change api_roles_configuration to team_api_role_configuration
2017-11-22 10:43:35 +01:00
Oleksii Kliukin 2352fc9a39 go fmt run 2017-11-22 10:43:35 +01:00
Oleksii Kliukin 415a7fdc4d Allow global configuration options for API roles.
Add options to the PgUser structure, potentially allowing to set
per-role options in the cluster definition as well.

Introduce api_roles_configuration operator option with the default
of log_statement=all
2017-11-22 10:43:35 +01:00
Oleksii Kliukin c25e849fe4 Fix a failure to create new statefulset at sync.
Also do a fmt run.
2017-11-08 18:24:17 +01:00
Murat Kabilov 86803406db
use sync methods while updating the cluster 2017-11-03 12:00:43 +01:00
Oleksii Kliukin ce960e892a
Create new databases and change owners of existing ones during sync. (#153)
* Create new databases and change owners of existing ones during sync.
2017-11-02 17:46:33 +01:00
Oleksii Kliukin 7a76be7d3e Minor fixes around PDB (pod-distruption-budget) syncing: (#147)
- Call comparison function in the case of the sync as well as for update
- Include full cluster name in PDB name
- Assign cluster labels to the PDB object
2017-10-23 12:26:59 +02:00
Murat Kabilov 661b141849 Fix Pod Disruption Budget null pointer exception 2017-10-20 11:43:50 +02:00
Oleksii Kliukin eba23279c8 Kube cluster upgrade 2017-10-19 10:49:42 +02:00
Murat Kabilov 6c4cb4e9da Perform manual failover during the scale down 2017-10-16 17:41:23 +02:00
Jan Mussler cec695d48e Superuser toggle for team members
Make superuser toggleable for team members. Add and "admin" role to team members if superuser is disabled.
2017-10-12 15:01:54 +02:00
Murat Kabilov 8d5faaa5a5 return idle status when worker has nothing to do 2017-10-11 15:42:20 +02:00
Murat Kabilov 83c8d6c419 Extend diagnostic api with worker status info 2017-10-11 12:26:09 +02:00
Murat Kabilov a35e9c6119 move from tpr to crd 2017-10-06 15:12:08 +02:00
Jan Mussler c4af0ac6a6 Update cluster.go 2017-10-05 10:58:23 +02:00
Jan M 4a1170855a Adding '_' to allowed chars. 2017-10-05 10:53:19 +02:00
Murat Kabilov 48ec6b35b9 perform manual failover on pg cluster rolling upgrade 2017-10-04 16:56:47 +03:00
Murat Kabilov 00194d0130 create dbs on cluster create 2017-10-04 16:24:27 +03:00
Murat Kabilov 90b49a24ba make postgresql roles public 2017-09-11 17:44:32 +02:00
Murat Kabilov 8aa11ecee2 Add patroni api client 2017-08-30 16:01:18 +02:00
Murat Kabilov 899c0bef45 Use warningf instead of warnf 2017-08-30 14:35:56 +02:00
Murat Kabilov 5967837875 pass the name of the status in the log message on set cluster status failure 2017-08-17 12:18:53 +02:00
Murat Kabilov 272d7e1bcf rename service field to services as it contains service per role 2017-08-15 15:55:56 +02:00
Murat Kabilov 82f58b57d8 add cluster and controller methods for getting status 2017-08-15 12:11:06 +02:00
Murat Kabilov 5470f20be4 always pass a cluster name as a logger field 2017-08-15 10:29:18 +02:00
Murat Kabilov e26db66cb5 start all the log messages with lowercase letters 2017-08-15 10:12:36 +02:00
Oleksii Kliukin f15f93f479 Bugfix/close db connections (#78)
Open and close DB connections on-demand.

Previously, we used to leave the DB connection open while the
cluster was registered with the operator, potentially resutling
in dangled connections if the operator terminates abnormally.

Small refactoring around the role syncing code.
2017-08-10 10:10:00 +02:00
Murat Kabilov cf663cb841 Fix golint warnings 2017-08-01 16:08:56 +02:00
Murat Kabilov 1f8b37f33d Make use of kubernetes client-go v4
* client-go v4.0.0-beta0
* remove unnecessary methods for tpr object
* rest client: use interface instead of structure pointer
* proper names for constants; some clean up for log messages
* remove teams api client from controller and make it per cluster
2017-07-25 15:25:17 +02:00
Oleksii Kliukin 4455f1b639 Feature/unit tests (#53)
- Avoid relying on Clientset structure to call Kubernetes API functions.
While Clientset is a convinient "catch-all" abstraction for calling
REST API related to different Kubernetes objects, it's impossible to
mock. Replacing it wih the kubernetes.Interface would be quite
straightforward, but would require an exra level of mocked interfaces,
because of the versioning. Instead, a new interface is defined, which
contains only the objects we need of the pre-defined versions.

-  Move KubernetesClient to k8sutil package.
- Add more tests.
2017-07-24 16:56:46 +02:00
Oleksii Kliukin 00150711e4 Configure load balancer on a per-cluster and operator-wide level (#57)
* Deny all requests to the load balancer by default.
* Operator-wide toggle for the load-balancer.
* Define per-cluster useLoadBalancer option.

If useLoadBalancer is not set - then operator-wide defaults take place. If it
is true - the load balancer is created, otherwise a service type clusterIP is
created.

Internally, we have to completely replace the service if the service type
changes. We cannot patch, since some fields from the old service that will
remain after patch are incompatible with the new one, and handling them
explicitly when updating the service is ugly and error-prone. We cannot
update the service because of the immutable fields, that leaves us the only
option of deleting the old service and creating the new one. Unfortunately,
there is still an issue of unnecessary removal of endpoints associated with
the service, it will be addressed in future commits.

* Revert the unintended effect of go fmt

* Recreate endpoints on service update.

When the service type is changed, the service is deleted and then
the one with the new type is created. Unfortnately, endpoints are
deleted as well. Re-create them afterwards, preserving the original
addresses stored in them.

* Improve error messages and comments. Use generate instead of gen in names.
2017-06-30 13:38:49 +02:00
Murat Kabilov 1540a2ba65 fix typos;
remove unnecessary tests;
go fmt -s
2017-06-08 15:52:01 +02:00
Oleksii Kliukin bc0e9ab4bc Add error checks per report from errcheck-ng 2017-06-08 10:41:44 +02:00
Murat Kabilov 292a9bda05 Check for dns annotation of the service 2017-06-07 16:41:39 +02:00
Oleksii Kliukin dc36c4ca12 Implement replicaLoadBalancer boolean flag. (#38)
The flag adds a replica service with the name cluster_name-repl and
a DNS name that defaults to {cluster}-repl.{team}.{hostedzone}.

The implementation converted Service field of the cluster into a map
with one or two elements and deals with the cases when the new flag
is changed on a running cluster
(the update and the sync should create or delete the replica service).
In order to pick up master and replica service and master endpoint
when listing cluster resources.

* Update the spec when updating the cluster.
2017-06-07 13:54:17 +02:00
Oleksii Kliukin 7b0ca31bfb Implements EBS volume resizing #35.
In order to support volumes different from EBS and filesystems other than EXT2/3/4 the respective code parts were implemented as interfaces. Adding the new resize for the volume or the filesystem will require implementing the interface, but no other changes in the cluster code itself.

Volume resizing first changes the EBS and the filesystem, and only afterwards is reflected in the Kubernetes "PersistentVolume" object. This is done deliberately to be able to check if the volume needs resizing by peeking at the Size of the PersistentVolume structure. We recheck, nevertheless, in the EBSVolumeResizer, whether the actual EBS volume size doesn't match the spec, since call to the AWS ModifyVolume is counted against the resize limit of once every 6 hours, even for those calls that shouldn't result in an actual resize (i.e. when the size matches the one for the running volume).

As a collateral, split the constants into multiple files, move the volume code into a separate file and fix minor issues related to the error reporting.
2017-06-06 13:53:27 +02:00
Murat Kabilov 009db16c7c Use queues for the pod events (#30) 2017-05-23 15:24:14 +02:00
Oleksii Kliukin afce38f6f0 Fix error messages (#27)
Use lowercase for kubernetes objects
Use %v instead of %s for errors
Start error messages with a lowercase letter.
2017-05-22 14:12:06 +02:00
Murat Kabilov 4acaf27a5d Remove etcd requests (#25)
update glide
2017-05-19 17:18:37 +02:00
Murat Kabilov d34273543e Fix the golint, gosimple warnings 2017-05-18 17:38:54 +02:00
Murat Kabilov 233e8529c1 Return error instead of logging it 2017-05-18 17:24:44 +02:00
Murat Kabilov 3b6454c2dc add missed return (#20) 2017-05-17 11:54:50 +02:00
Oleksii Kliukin c2826b10e2 Merge branch 'master' into fix/go-vet-fixes 2017-05-17 11:30:07 +02:00
Oleksii Kliukin 4457ce4e47 Replace the statefulset if it cannot be updated. (#18)
Updates to statefulset spec for fields other than 'replicas' and
containers' are forbidden. However, it is possible to delete the old
statefulset without deleting its pods and create the new one, using the
changed specs. The new statefulset shall pick up the orphaned pods.

Change the statefulset's comparison to return the combined effect of
all checks, not just the first non-matching field.
2017-05-17 11:28:21 +02:00
Murat Kabilov 6e5d7abcc5 pass cluster by reference 2017-05-17 11:05:15 +02:00
Murat Kabilov 356be8f0f1 skip clusters with invalid spec 2017-05-16 16:46:37 +02:00
Oleksii Kliukin 03064637f1 Allow disabling access to the DB and the Teams API.
Command-line options --nodatabaseaccess and --noteamsapi disable all
teams api interaction and access to the Postgres database. This is
useful for debugging purposes when the operator runs out of cluster
(with --outofcluster flag).

The same effect can be achieved by setting enable_db_access and/or
enable_teams_api to false.
2017-05-12 17:40:48 +02:00
Murat Kabilov 92d7fbf372 replace github.bus.zalan.do with github.cm/zalando-incubator 2017-05-12 11:50:16 +02:00
Oleksii Kliukin 6983f444ed Periodically sync roles with the running clusters. (#102)
The sync adds or alters database roles based on the roles defined
in the cluster's TPR, Team API and operator's infrastructure roles.
At the moment, roles are not deleted, as it would be dangerous for
the robot roles in case TPR is misconfigured. In addition, ALTER
ROLE does not remove role options, i.e. SUPERUSER or CREATEROLE,
neither it removes role membership: only new options are added and
new role membership is granted. So far, options like NOSUPERUSER
and NOCREATEROLE won't be handed correctly, when mixed with the
non-negative counterparts, also NOLOGIN should be processed correctly.
The code assumes that only MD5 passwords are stored in the DB and
will likely break with the new SCRAM auth in PostgreSQL 10.

On the implementation side, create the new interface to abstract
roles merge and creation, move most of the role-based functionality
from cluster/pg into the new 'users' module, strip create user code
of special cases related to human-based users (moving them to init
instead) and fixed the password md5 generator to avoid processing
already encrypted passwords. In addition, moved the system roles
off the slice containing all other roles in order to avoid extra
efforts to avoid creating them.

Also, fix a leak in DB connections when the new connection is not
considered healthy and discarded without being closed. Initialize
the database during the sync phase before syncing users.
2017-05-12 11:41:35 +02:00
Murat Kabilov 2370659c69 Parallel cluster processing
Run operations concerning multiple clusters in parallel. Each cluster gets its
own worker in order to create, update, sync or delete clusters.  Each worker
acquires the lock on a cluster.  Subsequent operations on the same cluster
have to wait until the current one finishes.  There is a pool of parallel
workers, configurable with the `workers` parameter in the configmap and set by
default to 4. The cluster-related tasks  are assigned to the workers based on
a cluster name: the tasks for the same cluster will be always assigned to the
same worker. There is no blocking between workers, although there is a chance
that a single worker will become a bottleneck if too many clusters are
assigned to it; therefore, for large-scale deployments it might be necessary
to bump up workers from the default value.
2017-05-12 11:41:35 +02:00
Oleksii Kliukin 1c4bce86df Avoid "bulk-comparing" pod resources during sync. (#109)
* Avoid "bulk-comparing" pod resources during sync.

First attempt to fix bogus restarts due to the reported mismatch
of container resources where one of the resources is an empty struct,
while the other has all fields set to nil.

In addition, add an ability to set limits and requests per pod, as well as the operator-level defaults.
2017-05-12 11:41:35 +02:00
Murat Kabilov a7c57874d5 Do not create roles if cluster is masterless
fix pod deletion
2017-05-12 11:41:34 +02:00
Murat Kabilov 08c0e3b6dd Use unified type for the namespaced object names 2017-05-12 11:41:34 +02:00
Oleksii Kliukin 71b93b4cc2 Feature/infrastructure roles (#91)
* Add infrastructure roles configured globally.

Those are the roles defined in the operator itself. The operator's
configuration refers to the secret containing role names, passwords
and membership information. While they are referred to as roles, in
reality those are users.

In addition, improve the regex to filter out invalid users and
make sure user secret names are compatible with DNS name spec.

Add an example manifest for the infrastructure roles.
2017-05-12 11:41:33 +02:00
Murat Kabilov 655f6dcadb make cluster resources private 2017-05-12 11:41:33 +02:00
Murat Kabilov 322676a6b9 Skip deleting Pods and PVCs if failed to delete StatefulSet 2017-05-12 11:41:32 +02:00
Murat Kabilov bb4fec25ae Fix deletion of the failed cluster; more debug messages 2017-05-12 11:41:32 +02:00
Oleksii Kliukin 8e658174e8 Fix the issue with calling a non-existent function. 2017-05-12 11:41:31 +02:00
Murat Kabilov d4bb72989a Warn if etcd key for the new cluster already exist 2017-05-12 11:41:31 +02:00
Murat Kabilov 852c5beae5 Check etcd key availability for the new cluster 2017-05-12 11:41:31 +02:00
Oleksii Kliukin 04ed22f73f Remove unnecessary initializations. 2017-05-12 11:41:31 +02:00
Murat Kabilov ee83e196a9 Fix secrets sync
* log if secret already exists
2017-05-12 11:41:30 +02:00
Oleksii Kliukin 8268b07ad2 Set logger level per package instead of doing this globally 2017-05-12 11:41:30 +02:00
Oleksii Kliukin 3a4c6268be Increase log verbosity, namely for object updates.
- add a new environment variable for triggering debug log level
- show both new, old object and diff during syncs and updates
- use pretty package to pretty-print go structures
-
2017-05-12 11:41:29 +02:00
Murat Kabilov c2d2a67ad5 Get config from environment variables;
ignore pg major version change;
get rid of resources package;
2017-05-12 11:41:29 +02:00
Murat Kabilov 79a6726d4d Increase logging verbosity, restructure code 2017-05-12 11:41:28 +02:00
Murat Kabilov 6f7399b36f Sync clusters states
* move statefulset creation from cluster spec to the separate function
* sync cluster state with desired state;
* move out from arrays for cluster resources;
* recreate pods instead of deleting them in case of statefulset change
* check for master while creating cluster/updating pods
* simplify retryutil
* list pvc while listing resources
* name kubernetes resources with capital letter
* do rolling update in case of env variables change
2017-05-12 11:41:27 +02:00
Oleksii Kliukin 1377724b2e Fix a compliation error. 2017-05-12 11:41:27 +02:00
Oleksii Kliukin 31d7426327 ClusterTeamName -> ClusterName. Add a TODO item. 2017-05-12 11:41:27 +02:00
Oleksii Kliukin 55dbacdfa6 Assign DNS name to the cluster.
DNS name is generated from the team name and cluster name.
Use "zalando.org/dnsname" service annotation that makes 'mate' service assign a CNAME to the load balancer name.
2017-05-12 11:41:27 +02:00
Murat Kabilov 486c8ecb07 use neutral name for set cluster status function 2017-05-12 11:41:26 +02:00
Murat Kabilov 1c6e7ac2e7 loadBalancerSourceRanges update 2017-05-12 11:41:26 +02:00
Murat Kabilov fc127069ab remove unnecessary ControllerNamespace 2017-05-12 11:41:26 +02:00
Murat Kabilov 416dace289 get rid of arrays in the kuberesources;
use shorter form of checking for errors
2017-05-12 11:41:26 +02:00
Oleksii Kliukin abd313f2d9 Fix a missing colon. 2017-05-12 11:41:26 +02:00
Oleksii Kliukin f65fab00dd Fix a typo 2017-05-12 11:41:26 +02:00
Oleksii Kliukin 033c28f03a Delete persistent volumes on deletion of the cluster. 2017-05-12 11:41:26 +02:00
Murat Kabilov caa0eab19b Move statefulset creation from cluster spec to the separate function 2017-05-12 11:41:25 +02:00
Oleksii Kliukin a2e78ac2ec Feature/persistent volumes 2017-05-12 11:41:25 +02:00
Murat Kabilov ae77fa15e8 Pod Rolling update
introduce Pod events channel;
add parsing of the MaintenanceWindows section;
skip deleting Etcd key on cluster delete;
use external etcd host;
watch for tpr/pods in the namespace of the operator pod only;
2017-05-12 11:41:25 +02:00
Murat Kabilov dfde075c66 Use TPR object namespace while creating its objects 2017-05-12 11:37:09 +02:00
Murat Kabilov 6e2d64bd50 Create human users from teams api 2017-05-12 11:37:09 +02:00
Murat Kabilov 58506634c4 Create pg users 2017-05-12 11:37:09 +02:00
Murat Kabilov 7e4d0410c2 Use one secret per user 2017-05-12 11:37:09 +02:00
Murat Kabilov abb1173035 Code refactor 2017-05-12 11:37:09 +02:00