postgres-operator

Commit Graph

Author	SHA1	Message	Date
Felix Kunde	702a194c41	switch to rbac/v1 (#829 ) * switch to rbac/v1	2020-02-17 11:25:07 +01:00
Felix Kunde	1f0312a014	make minimum limits boundaries configurable (#808 ) * make minimum limits boundaries configurable * add e2e test	2020-02-03 11:43:18 +01:00
Felix Kunde	cd110aabf4	Enforce minimum cpu and memory limits (#731 ) * add validation for PG resources and volume size * check resource requests also on UPDATE and SYNC + update docs * if cluster was running don't error on sync	2019-12-12 16:43:55 +01:00
Felix Kunde	f0e29060b1	move StatefulSet to apps/v1 (#675 )	2019-09-30 16:42:04 +02:00
Felix Kunde	4a863d2280	Avoid orphaned objects on delete (#654 ) * Make setSpec function work correctly when updating cluster status fails	2019-08-27 12:54:35 +02:00
Rafia Sabih	540d58d5bd	Adding the support for standby cluster This will set up a continuous wal streaming cluster, by adding the corresponding section in postgres manifest. Instead of having a full-fledged standby cluster as in Patroni, here we use only the wal path of the source cluster and stream from there. Since, standby cluster is streaming from the master and does not require to create or use databases of it's own. Hence, it bypasses the creation of users or databases. There is a separate sample manifest added to set up a standby-cluster.	2019-06-21 10:11:39 +02:00
Felix Kunde	6918394562	Add PDB configuration toggle (#583 ) * Don't create an impossible disruption budget for smaller clusters. * sync PDB also on update	2019-06-18 10:48:21 +02:00
Rafia Sabih	2886027516	Some typos/spelling mistakes fix (#580 ) Harmless typos fix.	2019-06-06 14:20:15 +02:00
Stephane T	1f4267eb05	fix: remove headless service config when deleting cluster (#567 ) see: https://github.com/zalando/postgres-operator/issues/566 Signed-off-by: Stephane Tang <hi@stang.sh>	2019-05-21 13:49:34 +02:00
Sergey Dudoladov	f3e1e80aaf	Add logical backup (#442 ) * Add k8s cron job to spawn logical backups * Minor doc updates	2019-05-16 15:52:01 +02:00
Felix Kunde	0fbfbb23bb	Use /status subresource instead of plain manifest field (#534 ) * turns PostgresStatus type into a struct with field PostgresClusterStatus * setStatus patch target is now /status subresource * unmarshalling PostgresStatus takes care of previous status field convention * new simple bool functions status.Running(), status.Creating()	2019-05-07 12:01:45 +02:00
Stephane T	edeb06d39c	fix: update init_containers (#518 ) * fix: PATH expension in Makefile Signed-off-by: Stephane Tang <hi@stang.sh> * refact: pass list of containers to compareContainers() Signed-off-by: Stephane Tang <hi@stang.sh> * compare initContainers while comparing StatefulSet Fixes #517 Signed-off-by: Stephane Tang <hi@stang.sh> * refact: compareContainers() Signed-off-by: Stephane Tang <hi@stang.sh>	2019-03-19 17:46:12 +01:00
Felix Kunde	31e568157b	reflect change in github url (#496 ) Project was moved from the incubator to the Zalando main org, hence the rename	2019-02-25 11:26:55 +01:00
Maxim Ivanov	ed6acc1178	Correctly report success in .status on Update (#469 )	2019-01-31 13:09:17 +01:00
Jan Mussler	c70905ae8b	Modifying some of the logging to be more descriptive. (#440 ) * Modifying some of the logging to be more descriptive.	2019-01-08 13:07:36 +01:00
zerg-junior	c0b0b9a832	[WIP] Add 'admin' option to create role (#425 ) * Add 'admin' option to create role * Fix run_locally_script	2018-12-27 10:14:33 +01:00
zerg-junior	7907f95d2f	Improve reporting about rolling updates (#391 )	2018-09-24 11:57:43 +02:00
Noah Kantrowitz	0b75a89920	Fix the casing of github.com/Sirupsen/logrus to match what the project itself uses. (#380 ) Dep enforces this.	2018-09-06 10:26:48 +02:00
zerg-junior	25fa45fd58	[WIP] Grant 'superuser' to the members of Postgres admin teams (#371 ) Added support for superuser team in addition to the admin team that owns the postgres cluster.	2018-08-30 10:51:37 +02:00
zerg-junior	aeae0a6ef2	Use cluster's own namespace to patch the cluster manifest (#373 )	2018-08-22 11:07:12 +02:00
Oleksii Kliukin	e1ed4b847d	Use code-generation for CRD API and deepcopy methods (#369 ) Client-go provides a https://github.com/kubernetes/code-generator package in order to provide the API to work with CRDs similar to the one available for built-in types, i.e. Pods, Statefulsets and so on. Use this package to generate deepcopy methods (required for CRDs), instead of using an external deepcopy package; we also generate APIs used to manipulate both Postgres and OperatorConfiguration CRDs, as well as informers and listers for the Postgres CRD, instead of using generic informers and CRD REST API; by using generated code we can get rid of some custom and obscure CRD-related code and use a better API. All generated code resides in /pkg/generated, with an exception of zz_deepcopy.go in apis/acid.zalan.do/v1 Rename postgres-operator-configuration CRD to OperatorConfiguration, since the former broke naming convention in the code-generator. Moved Postgresql, PostgresqlList, OperatorConfiguration and OperatorConfigurationList and other types used by them into Change the type of the Error field in the Postgresql crd to a string, so that client-go could generate a deepcopy for it. Use generated code to set status of CRD objects as well. Right now this is done with patch, however, Kubernetes 1.11 introduces the /status subresources, allowing us to set the status with the special updateStatus call in the future. For now, we keep the code that is compatible with earlier versions of Kubernetes. Rename postgresql.go to database.go and status.go to logs_and_api.go to reflect the purpose of each of those files. Update client-go dependencies. Minor reformatting and renaming.	2018-08-15 17:22:25 +02:00
Oleksii Kliukin	e933908084	Configure pg_hba in the local postgresql configuration of Patroni. (#361 ) Previously, the operator put pg_hba into the bootstrap/pg_hba key of Patroni. That had 2 adverse effects: - pg_hba.conf was shadowed by Spilo default section in the local postgresql configuration - when updating pg_hba in the cluster manifest, the updated lines were not propagated to DCS, since the key was defined in the boostrap section of Patroni. Include some minor refactoring, moving methods to unexported when possible and commenting out usage of md5, so that gosec won't complain. Per https://github.com/zalando-incubator/postgres-operator/issues/330 Review by @zerg-junior	2018-08-08 11:01:26 +02:00
Oleksii Kliukin	b06186eb41	Linter-induced code refactoring, run round 2. (#360 ) Run more linters in the gometalinter, i.e. deadcode, megacheck, nakedret, dup. More consistent code formatting, remove two dead functions, eliminate naked a bunch of naked returns, refactor a few functions to avoid code duplication.	2018-08-06 12:09:19 +02:00
Oleksii Kliukin	59f0c5551e	Allow configuring pod priority globally and per cluster. (#353 ) * Allow configuring pod priority globally and per cluster. Allow to specify pod priority class for all pods managed by the operator, as well as for those belonging to individual clusters. Controlled by the pod_priority_class_name operator configuration parameter and the podPriorityClassName manifest option. See https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass for the explanation on how to define priority classes since Kubernetes 1.8. Some import order changes are due to go fmt. Removal of OrphanDependents deprecated field. Code review by @zerg-junior	2018-08-03 14:03:37 +02:00
Oleksii Kliukin	ac7b132314	Refactoring inspired by gometalinter. (#357 ) Among other things, fix a few issues with deepcopy implementation.	2018-08-03 11:09:45 +02:00
Oleksii Kliukin	d2d3f21dc2	Client go upgrade v6 (#352 ) There are shortcuts in this code, i.e. we created the deepcopy function by using the deepcopy package instead of the generated code, that will be addressed once migrated to client-go v8. Also, some objects, particularly statefulsets, are still taken from v1beta, this will also be addressed in further commits once the changes are stabilized.	2018-08-01 11:08:01 +02:00
Oleksii Kliukin	0181a1b5b1	Introduce a repair scan to fix failing clusters (#304 ) A repair is a sync scan that acts only on those clusters that indicate that the last add, update or sync operation on them has failed. It is supposed to kick in more frequently than the repair scan. The repair scan still remains to be useful to fix the consequences of external actions (i.e. someone deletes a postgres-related service by mistake) unbeknownst to the operator. The repair scan is controlled by the new repair_period parameter in the operator configuration. It has to be at least 2 times more frequent than a sync scan to have any effect (a normal sync scan will update both last synced and last repaired attributes of the controller, since repair is just a sync underneath). A repair scan could be queued for a cluster that is already being synced if the sync period exceeds the interval between repairs. In that case a repair event will be discarded once the corresponding worker finds out that the cluster is not failing anymore. Review by @zerg-junior	2018-07-24 11:21:45 +02:00
zerg-junior	417f13c0bd	Submit RBAC credentials during initial Event processing (#344 ) * During initial Event processing submit the service account for pods and bind it to a cluster role that allows Patroni to successfully start. The cluster role is assumed to be created by the k8s cluster administrator.	2018-07-19 16:40:40 +02:00
Oleksii Kliukin	3a9378d3b8	Allow configuring the operator via the YAML manifest. (#326 ) * Up until now, the operator read its own configuration from the configmap. That has a number of limitations, i.e. when the configuration value is not a scalar, but a map or a list. We use a custom code based on github.com/kelseyhightower/envconfig to decode non-scalar values out of plain text keys, but that breaks when the data inside the keys contains both YAML-special elememtns (i.e. commas) and complex quotes, one good example for that is search_path inside `team_api_role_configuration`. In addition, reliance on the configmap forced a flag structure on the configuration, making it hard to write and to read (see https://github.com/zalando-incubator/postgres-operator/pull/308#issuecomment-395131778). The changes allow to supply the operator configuration in a proper YAML file. That required registering a custom CRD to support the operator configuration and provide an example at manifests/postgresql-operator-default-configuration.yaml. At the moment, both old configmap and the new CRD configuration is supported, so no compatibility issues, however, in the future I'd like to deprecate the configmap-based configuration altogether. Contrary to the configmap-based configuration, the CRD one doesn't embed defaults into the operator code, however, one can use the manifests/postgresql-operator-default-configuration.yaml as a starting point in order to build a custom configuration. Since previously `ReadyWaitInterval` and `ReadyWaitTimeout` parameters used to create the CRD were taken from the operator configuration, which is not possible if the configuration itself is stored in the CRD object, I've added the ability to specify them as environment variables `CRD_READY_WAIT_INTERVAL` and `CRD_READY_WAIT_TIMEOUT` respectively. Per review by @zerg-junior and @Jan-M.	2018-07-16 16:20:46 +02:00
Oleksii Kliukin	e90a01050c	Switchover must wait for the inner goroutine before it returns. (#343 ) * Switchover must wait for the inner goroutine before it returns. Otherwise, two corner cases may happen: - waitForPodLabel writes to the podLabelErr channel that has been already closed by the outer routine - the outer routine exists and the caller subscribes to the pod the inner goroutine has already subscribed to, resulting in panic. The previous commit `fe47f9ebea` that touched that code added the cancellation channel, but didn't bother to actually wait for the goroutine to be cancelled. Per report and review from @valer-cara. Original issue: https://github.com/zalando-incubator/postgres-operator/issues/342	2018-07-16 11:50:35 +02:00
Oleksii Kliukin	48a5744314	Use Patroni API to set bootstrap-only options. (#299 ) Call Patroni API /config in order to set special options that are ignored when set in the configuration file, such as max_connections. Per https://github.com/zalando-incubator/postgres-operator/issues/297 * Some minor refacoring: Rename Cluster ManualFailover to Swithover Rename Patroni Failover to Switchover Add more details to error messages and comments introduced in this PR. Review by @zerg-junior	2018-05-29 12:35:25 +02:00
Oleksii Kliukin	27c7245fed	Avoid terminating delete on errors. When there is an error happening upon deletion of the Kubernetes object belonging to the cluster being removed, it makes no sense to abort the deletion: the manifest will be removed anyway, therefore all the objects after the one we aborted at will stay forever.	2018-05-18 18:10:37 +02:00
Oleksii Kliukin	0c616a802f	Merge branch 'master' into rolling_updates_with_statefulset_annotations # Conflicts: # pkg/cluster/k8sres.go	2018-05-15 15:33:34 +02:00
Oleksii Kliukin	987b43456b	Deprecate old LB options, fix endpoint sync. (#287 ) * Depreate old LB options, fix endpoint sync. - deprecate useLoadBalancer, replicaLoadBalancer from the manifest and enable_load_balancer from the operator configuration. The old operator configuration options become no-op with this commit. For the old manifest options, `useLoadBalancer` and `replicaLoadBalancer` are still consulted, but only in the absense of the new ones (enableMasterLoadBalancer and enableReplicaLoadBalancer). - Make sure the endpoint being created during the sync receives proper addresses subset. This is more critical for the replicas, as for the masters Patroni will normally re-create the endpoint before the operator. - Avoid creating the replica endpoint, since it will be created automatically by the corresponding service. - Update the README and unit tests. Code review by @mgomezch and @zerg-junior	2018-05-15 15:19:18 +02:00
Oleksii Kliukin	332dab5237	Merge branch 'rolling_updates_with_statefulset_annotations' of github.com:zalando-incubator/postgres-operator into rolling_updates_with_statefulset_annotations	2018-05-08 14:51:10 +02:00
Oleksii Kliukin	ce0d4af91c	Initial implementation for the statefulset annotations indicating rolling updates.	2018-05-07 08:07:37 +02:00
Oleksii Kliukin	43a1db2128	Merge branch 'master' into pending_rolling_updates	2018-05-03 11:27:16 +02:00
Oleksii Kliukin	fe47f9ebea	Improve the pod moving behavior during the Kubernetes cluster upgrade. (#281 ) * Improve the pod moving behavior during the Kubernetes cluster upgrade. Fix an issue of not waiting for at least one replica to become ready (if the Statefulset indicates there are replicas) when moving the master pod off the decomissioned node. Resolves the first part of #279. Small fixes to error messages. * Eliminate a race condition during the swithover. When the operator initiates the failover (switchover) that fails and then retries it for a second time it may happen that the previous waitForPodChannel is still active. As a result, the operator subscribes to the former master pod two times, causing a panic. The problem was that the original code didn't bother to cancel the waitForPodLalbel for the new master pod in the case when the failover fails. This commit fixes it by adding a stop channel to that function. Code review by @zerg-junior	2018-05-03 10:20:24 +02:00
Sergey Dudoladov	1b718fd4c2	Minor improvemets in reporting service account creation	2018-04-26 13:47:25 +02:00
Sergey Dudoladov	d99b553ec1	Convert default account definiton into JSON	2018-04-25 12:35:16 +02:00
Sergey Dudoladov	485ec4b8ea	Move service account to Controller	2018-04-24 15:13:08 +02:00
Sergey Dudoladov	5daf0a4172	Fix error reporting during pod service account creation	2018-04-20 14:20:38 +02:00
Sergey Dudoladov	bd51d2922b	Turn ServiceAccount into struct value to avoid race conditon during account creation	2018-04-20 13:05:05 +02:00
Sergey Dudoladov	23f893647c	Remove sync of pod service accounts	2018-04-19 15:48:58 +02:00
Sergey Dudoladov	214ae04aa7	Deploy service account for pod creation on demand	2018-04-18 16:20:20 +02:00
Oleksii Kliukin	0618723a61	Check rolling updates using controller revisions. Compare pods controller revisions with the one for the statefulset to determine whether the pod is running the latest revision and, therefore, no rolling update is necessary. This is performed only during the operator start, afterwards the rolling update status that is stored locally in the cluster structure is used for all rolling update decisions.	2018-04-09 18:07:24 +02:00
Manuel Gómez	88c68712b6	Fix statefulset label selector diffing (#273 ) Otherwise, rolling updates are done unnecessarily.	2018-04-06 17:21:57 +02:00
Oleksii Kliukin	9bf80afa6b	Remove team from statefulset selector (#271 ) * Remove 'team' label from the statefulset selector. I was never supposed to be there, but implicitely statefulset creates a selector out of meta.labels field. That is the problem with recent Kubernetes, since statefulset cannot pick up pods with non-matching label selectors, and we rely on statefulset picking up old pods after statefulset replacement. Make sure selector changes trigger replacement of the statefulset. In the case new selector has more labels than the old one nothing should be done with a statefulset, otherwise the new statefulset won't see orphaned pods from the old one, as they won't match the selector. See https://github.com/kubernetes/kubernetes/issues/46901#issuecomment-356418393	2018-04-06 13:58:47 +02:00
Oleksii Kliukin	26db91c53e	Improve infrastructure role definitions (#208 ) Enhance definitions of infrastructure roles by allowing membership in multiple roles, role options and per-role configuration to be specified in the infrastructure role configmap, which must have the same name as the infrastructure role secret. See manifests/infrastructure-roles-configmap.yaml for the examples and updated README for the description of different types of database roles supposed by the operator and their purposes. Change the logic of merging infrastructure roles with the manifest roles when they have the same name, to return the infrastructure role unchanged instead of merging. Previously, we used to propagate flags from the manifest role to the resulting infrastructure one, as there were no way to define flags for the infrastructure role; however, this is not the case anymore. Code review and tests by @erthalion	2018-04-04 17:21:36 +02:00
Sergey Dudoladov	2aeff096f7	Make ReplicaLoadBalancer a separate toggler	2018-03-02 13:35:25 +01:00
Sergey Dudoladov	2ef069ee93	Create/delete replica service regardless of load balancer setup	2018-02-27 17:10:49 +01:00
Oleksii Kliukin	2bb7e98268	update individual role secrets from infrastructure roles (#206 ) * Track origin of roles. * Propagate changes on infrastructure roles to corresponding secrets. When the password in the infrastructure role is updated, re-generate the secret for that role. Previously, the password for an infrastructure role was always fetched from the secret, making any updates to such role a no-op after the corresponding secret had been generated.	2018-02-23 17:24:04 +01:00
Dmitrii Dolgov	ef50b147c5	Use list of checks instead of a map	2018-02-23 14:24:33 +01:00
Dmitrii Dolgov	95d86c7600	Move container comparison logic to a separate function	2018-02-23 11:58:37 +01:00
Oleksii Kliukin	c4aab502b3	Remove Patroni leftover objects on cluster deletion. (#244 ) * Remove all endpoints and configmaps from Patroni when Patroni is running with Kubernetes support on cluster deletion.	2018-02-23 09:52:22 +01:00
Oleksii Kliukin	cca73e30b7	Make code around recreating pods and creating objects in the database less brittle (#213 ) There used to be a masterLess flag that was supposed to indicate whether the cluster it belongs to runs without the acting master by design. At some point, as we didn't really have support for such clusters, the flag has been misused to indicate there is no master in the cluster. However, that was not done consistently (a cluster without all pods running would never be masterless, even when the master is not among the running pods) and it was based on the wrong assumption that the masterless cluster will remain masterless until the next attempt to change that flag, ignoring the possibility of master coming up or some node doing a successful promotion. Therefore, this PR gets rid of that flag completely. When the cluster is running with 0 instances, there is obviously no master and it makes no sense to create any database objects inside the non-existing master. Therefore, this PR introduces an additional check for that. recreatePods were assuming that the roles of the pods recorded when the function has stared will not change; for instance, terminated replica pods should start as replicas. Revisit that assumption by looking at the actual role of the re-spawned pods; that avoids a failover if some replica has promoted to the master role while being re-spawned. In addition, if the failover from the old master was unsuccessful, we used to stop and leave the old master running on an old pod, without recording this fact anywhere. This PR makes the failover failure emit a warning, but not stop recreating the last master pod; in the worst case, the running master will be terminated, however, this case is rather unlikely one. As a side effect, make waitForPodLabel return the pod definition it waited for, avoiding extra API calls in recreatePods and movePodFromEndOfLifeNode	2018-02-22 10:42:05 +01:00
Sergey Dudoladov	f194a2ae5a	Introduce changes from the PR #200 by @alexeyklyukin	2018-02-07 14:02:32 +01:00
Manuel Gómez	bf4406d2a4	Consider container names in Statefulset diffs (#210 ) This includes a comparison on container names being equal in the decision of whether a Statefulset has been updated.	2018-01-16 12:06:11 +01:00
Oleksii Kliukin	9720ac1f7e	WIP: Hold the proper locks while examining the list of databases. Introduce a new lock called specMu lock to protect the cluster spec. This lock is held on update and sync, and when retrieving the spec in the API code. There is no need to acquire it for cluster creation and deletion: creation assigns the spec to the cluster before linking it to the controller, and deletion just removes the cluster from the list in the controller, both holding the global clustersMu Lock.	2017-12-22 13:06:11 +01:00
Manuel Gómez	15c278d4e8	Scalyr agent sidecar for log shipping (#190 ) * Scalyr agent sidecar for log shipping * Remove the default for the Scalyr image Now the image needs to be specified explicitly to enable log shipping to Scalyr. This removes the problem of having to generate the config file or publish our agent image repository. * Add configuration variable for Scalyr server URL Defaults to the EU address. * Alter style Newlines are cheap and make code easier to edit/refactor, but ok. * Fix StatefulSet comparison logic I broke it when I made the comparison consider all containers in the PostgreSQL pod.	2017-12-21 15:34:26 +01:00
Oleksii Kliukin	da0de8cff7	Make sure the statefulset that is deleted manually gets re-created. (#191 ) * Make sure the statefulset that is deleted manually gets re-created. Per report and analysis by Manuel Gomez. * Move the existence checks for other objects out of the Create functions. create{Object} for services, endpoints and PDBs refused to continue if there is a cached definition in the cluster, however, the only place where it makes sense is when creating a new cluster. Note that contrary to the statefulset this doesn't fix any issues, since those definitions were nullified correspondingly when the sync code detected there is no object present in the Kubernetes cluster.	2017-12-21 15:20:43 +01:00
Oleksii Kliukin	1c5451cd7d	Spelling fix.	2017-12-14 14:39:33 +01:00
Oleksii Kliukin	55dc12e512	Examine custom environment sources when syncing. When comparing statefulsets, make sure EnvFrom fields are compared as well.	2017-12-14 14:39:33 +01:00
Oleksii Kliukin	87bc47d8d0	Fixes for the case of re-creating the cluster after deletion. - make sure that the secrets for the system users (superuser, replication) are not deleted when the main cluster is. Therefore, we can re-create the cluster, potentially forcing Patroni to restore it from the backup and enable Patroni to connect, since it will use the old password, not the newly generated random one. - when syncing users, always check whether they are already in the DB. Previously, we did this only for the sync cluster case, but the new cluster could be actually the one restored from the backup by Patroni, having all or some of the users already in place. - delete endponts last. Patroni uses the $clustername endpoint in order to store the leader related metadata. If we remove it before removing all pods, one of those pods running Patroni will re-create it and the next attempt to create the cluster with the same name will stuble on the existing endpoint. - Use db.Exec instead of db.Query for queries that expect no result. This also fixes the issue with the DB creation, since we didn't release an empty Row object it was not possible to create more than one database for a cluster.	2017-12-13 16:49:00 +01:00
Oleksii Kliukin	1fb8cf7ea0	Avoid overwriting critical users. (#172 ) * Avoid overwriting critical users. Disallow defining new users either in the cluster manifest, teams API or infrastructure roles with the names mentioned in the new protected_role_names parameter (list of comma-separated names) Additionally, forbid defining a user with the name matching either super_username or replication_username, so that we don't overwrite system roles required for correct working of the operator itself. Also, clear PostgreSQL roles on each sync first in order to avoid using the old definitions that are no longer present in the current manifest, infrastructure roles secret or the teams API.	2017-12-05 14:27:12 +01:00
Oleksii Kliukin	022ce29314	Make an error message more verbose.	2017-12-04 10:49:25 +01:00
Oleksii Kliukin	637921cdee	Tests for initHumanUsers and initinitRobotUsers. Change the Cluster class in the process to implelement Teams API calls and Oauth token fetches as interfaces, so that we can mock them in the tests.	2017-12-04 10:49:25 +01:00
Oleksii Kliukin	611cfe96d6	Fix an issue when not assigning the merge result. Add some tests.	2017-12-04 10:49:25 +01:00
Oleksii Kliukin	831ebb1f32	Fix the error reporting.	2017-12-04 10:49:25 +01:00
Oleksii Kliukin	2e226dee26	Avoid overwriting infrastrure roles. When a role is defined in the infrastructure roles and the cluster manifest use the infrastructure role definition and add flags defined in the manifest. Previously the role has been overwritten by the definition from the manifest. Because a random password is generated for each role from the manifest the applications relying on the infrastructure role credentials from the infrastructure roles secret were unable to connect.	2017-12-04 10:49:25 +01:00
Oleksii Kliukin	975b21f633	Rename api roles configuration parameter. Change api_roles_configuration to team_api_role_configuration	2017-11-22 10:43:35 +01:00
Oleksii Kliukin	2352fc9a39	go fmt run	2017-11-22 10:43:35 +01:00
Oleksii Kliukin	415a7fdc4d	Allow global configuration options for API roles. Add options to the PgUser structure, potentially allowing to set per-role options in the cluster definition as well. Introduce api_roles_configuration operator option with the default of log_statement=all	2017-11-22 10:43:35 +01:00
Oleksii Kliukin	c25e849fe4	Fix a failure to create new statefulset at sync. Also do a fmt run.	2017-11-08 18:24:17 +01:00
Murat Kabilov	86803406db	use sync methods while updating the cluster	2017-11-03 12:00:43 +01:00
Oleksii Kliukin	ce960e892a	Create new databases and change owners of existing ones during sync. (#153 ) * Create new databases and change owners of existing ones during sync.	2017-11-02 17:46:33 +01:00
Oleksii Kliukin	7a76be7d3e	Minor fixes around PDB (pod-distruption-budget) syncing: (#147 ) - Call comparison function in the case of the sync as well as for update - Include full cluster name in PDB name - Assign cluster labels to the PDB object	2017-10-23 12:26:59 +02:00
Murat Kabilov	661b141849	Fix Pod Disruption Budget null pointer exception	2017-10-20 11:43:50 +02:00
Oleksii Kliukin	eba23279c8	Kube cluster upgrade	2017-10-19 10:49:42 +02:00
Murat Kabilov	6c4cb4e9da	Perform manual failover during the scale down	2017-10-16 17:41:23 +02:00
Jan Mussler	cec695d48e	Superuser toggle for team members Make superuser toggleable for team members. Add and "admin" role to team members if superuser is disabled.	2017-10-12 15:01:54 +02:00
Murat Kabilov	8d5faaa5a5	return idle status when worker has nothing to do	2017-10-11 15:42:20 +02:00
Murat Kabilov	83c8d6c419	Extend diagnostic api with worker status info	2017-10-11 12:26:09 +02:00
Murat Kabilov	a35e9c6119	move from tpr to crd	2017-10-06 15:12:08 +02:00
Jan Mussler	c4af0ac6a6	Update cluster.go	2017-10-05 10:58:23 +02:00
Jan M	4a1170855a	Adding '_' to allowed chars.	2017-10-05 10:53:19 +02:00
Murat Kabilov	48ec6b35b9	perform manual failover on pg cluster rolling upgrade	2017-10-04 16:56:47 +03:00
Murat Kabilov	00194d0130	create dbs on cluster create	2017-10-04 16:24:27 +03:00
Murat Kabilov	90b49a24ba	make postgresql roles public	2017-09-11 17:44:32 +02:00
Murat Kabilov	8aa11ecee2	Add patroni api client	2017-08-30 16:01:18 +02:00
Murat Kabilov	899c0bef45	Use warningf instead of warnf	2017-08-30 14:35:56 +02:00
Murat Kabilov	5967837875	pass the name of the status in the log message on set cluster status failure	2017-08-17 12:18:53 +02:00
Murat Kabilov	272d7e1bcf	rename service field to services as it contains service per role	2017-08-15 15:55:56 +02:00
Murat Kabilov	82f58b57d8	add cluster and controller methods for getting status	2017-08-15 12:11:06 +02:00
Murat Kabilov	5470f20be4	always pass a cluster name as a logger field	2017-08-15 10:29:18 +02:00
Murat Kabilov	e26db66cb5	start all the log messages with lowercase letters	2017-08-15 10:12:36 +02:00
Oleksii Kliukin	f15f93f479	Bugfix/close db connections (#78 ) Open and close DB connections on-demand. Previously, we used to leave the DB connection open while the cluster was registered with the operator, potentially resutling in dangled connections if the operator terminates abnormally. Small refactoring around the role syncing code.	2017-08-10 10:10:00 +02:00
Murat Kabilov	cf663cb841	Fix golint warnings	2017-08-01 16:08:56 +02:00
Murat Kabilov	1f8b37f33d	Make use of kubernetes client-go v4 * client-go v4.0.0-beta0 * remove unnecessary methods for tpr object * rest client: use interface instead of structure pointer * proper names for constants; some clean up for log messages * remove teams api client from controller and make it per cluster	2017-07-25 15:25:17 +02:00
Oleksii Kliukin	4455f1b639	Feature/unit tests (#53 ) - Avoid relying on Clientset structure to call Kubernetes API functions. While Clientset is a convinient "catch-all" abstraction for calling REST API related to different Kubernetes objects, it's impossible to mock. Replacing it wih the kubernetes.Interface would be quite straightforward, but would require an exra level of mocked interfaces, because of the versioning. Instead, a new interface is defined, which contains only the objects we need of the pre-defined versions. - Move KubernetesClient to k8sutil package. - Add more tests.	2017-07-24 16:56:46 +02:00
Oleksii Kliukin	00150711e4	Configure load balancer on a per-cluster and operator-wide level (#57 ) * Deny all requests to the load balancer by default. * Operator-wide toggle for the load-balancer. * Define per-cluster useLoadBalancer option. If useLoadBalancer is not set - then operator-wide defaults take place. If it is true - the load balancer is created, otherwise a service type clusterIP is created. Internally, we have to completely replace the service if the service type changes. We cannot patch, since some fields from the old service that will remain after patch are incompatible with the new one, and handling them explicitly when updating the service is ugly and error-prone. We cannot update the service because of the immutable fields, that leaves us the only option of deleting the old service and creating the new one. Unfortunately, there is still an issue of unnecessary removal of endpoints associated with the service, it will be addressed in future commits. * Revert the unintended effect of go fmt * Recreate endpoints on service update. When the service type is changed, the service is deleted and then the one with the new type is created. Unfortnately, endpoints are deleted as well. Re-create them afterwards, preserving the original addresses stored in them. * Improve error messages and comments. Use generate instead of gen in names.	2017-06-30 13:38:49 +02:00
Murat Kabilov	1540a2ba65	fix typos; remove unnecessary tests; go fmt -s	2017-06-08 15:52:01 +02:00
Oleksii Kliukin	bc0e9ab4bc	Add error checks per report from errcheck-ng	2017-06-08 10:41:44 +02:00
Murat Kabilov	292a9bda05	Check for dns annotation of the service	2017-06-07 16:41:39 +02:00
Oleksii Kliukin	dc36c4ca12	Implement replicaLoadBalancer boolean flag. (#38 ) The flag adds a replica service with the name cluster_name-repl and a DNS name that defaults to {cluster}-repl.{team}.{hostedzone}. The implementation converted Service field of the cluster into a map with one or two elements and deals with the cases when the new flag is changed on a running cluster (the update and the sync should create or delete the replica service). In order to pick up master and replica service and master endpoint when listing cluster resources. * Update the spec when updating the cluster.	2017-06-07 13:54:17 +02:00
Oleksii Kliukin	7b0ca31bfb	Implements EBS volume resizing #35 . In order to support volumes different from EBS and filesystems other than EXT2/3/4 the respective code parts were implemented as interfaces. Adding the new resize for the volume or the filesystem will require implementing the interface, but no other changes in the cluster code itself. Volume resizing first changes the EBS and the filesystem, and only afterwards is reflected in the Kubernetes "PersistentVolume" object. This is done deliberately to be able to check if the volume needs resizing by peeking at the Size of the PersistentVolume structure. We recheck, nevertheless, in the EBSVolumeResizer, whether the actual EBS volume size doesn't match the spec, since call to the AWS ModifyVolume is counted against the resize limit of once every 6 hours, even for those calls that shouldn't result in an actual resize (i.e. when the size matches the one for the running volume). As a collateral, split the constants into multiple files, move the volume code into a separate file and fix minor issues related to the error reporting.	2017-06-06 13:53:27 +02:00
Murat Kabilov	009db16c7c	Use queues for the pod events (#30 )	2017-05-23 15:24:14 +02:00
Oleksii Kliukin	afce38f6f0	Fix error messages (#27 ) Use lowercase for kubernetes objects Use %v instead of %s for errors Start error messages with a lowercase letter.	2017-05-22 14:12:06 +02:00
Murat Kabilov	4acaf27a5d	Remove etcd requests (#25 ) update glide	2017-05-19 17:18:37 +02:00
Murat Kabilov	d34273543e	Fix the golint, gosimple warnings	2017-05-18 17:38:54 +02:00
Murat Kabilov	233e8529c1	Return error instead of logging it	2017-05-18 17:24:44 +02:00
Murat Kabilov	3b6454c2dc	add missed return (#20 )	2017-05-17 11:54:50 +02:00
Oleksii Kliukin	c2826b10e2	Merge branch 'master' into fix/go-vet-fixes	2017-05-17 11:30:07 +02:00
Oleksii Kliukin	4457ce4e47	Replace the statefulset if it cannot be updated. (#18 ) Updates to statefulset spec for fields other than 'replicas' and containers' are forbidden. However, it is possible to delete the old statefulset without deleting its pods and create the new one, using the changed specs. The new statefulset shall pick up the orphaned pods. Change the statefulset's comparison to return the combined effect of all checks, not just the first non-matching field.	2017-05-17 11:28:21 +02:00
Murat Kabilov	6e5d7abcc5	pass cluster by reference	2017-05-17 11:05:15 +02:00
Murat Kabilov	356be8f0f1	skip clusters with invalid spec	2017-05-16 16:46:37 +02:00
Oleksii Kliukin	03064637f1	Allow disabling access to the DB and the Teams API. Command-line options --nodatabaseaccess and --noteamsapi disable all teams api interaction and access to the Postgres database. This is useful for debugging purposes when the operator runs out of cluster (with --outofcluster flag). The same effect can be achieved by setting enable_db_access and/or enable_teams_api to false.	2017-05-12 17:40:48 +02:00
Murat Kabilov	92d7fbf372	replace github.bus.zalan.do with github.cm/zalando-incubator	2017-05-12 11:50:16 +02:00
Oleksii Kliukin	6983f444ed	Periodically sync roles with the running clusters. (#102 ) The sync adds or alters database roles based on the roles defined in the cluster's TPR, Team API and operator's infrastructure roles. At the moment, roles are not deleted, as it would be dangerous for the robot roles in case TPR is misconfigured. In addition, ALTER ROLE does not remove role options, i.e. SUPERUSER or CREATEROLE, neither it removes role membership: only new options are added and new role membership is granted. So far, options like NOSUPERUSER and NOCREATEROLE won't be handed correctly, when mixed with the non-negative counterparts, also NOLOGIN should be processed correctly. The code assumes that only MD5 passwords are stored in the DB and will likely break with the new SCRAM auth in PostgreSQL 10. On the implementation side, create the new interface to abstract roles merge and creation, move most of the role-based functionality from cluster/pg into the new 'users' module, strip create user code of special cases related to human-based users (moving them to init instead) and fixed the password md5 generator to avoid processing already encrypted passwords. In addition, moved the system roles off the slice containing all other roles in order to avoid extra efforts to avoid creating them. Also, fix a leak in DB connections when the new connection is not considered healthy and discarded without being closed. Initialize the database during the sync phase before syncing users.	2017-05-12 11:41:35 +02:00
Murat Kabilov	2370659c69	Parallel cluster processing Run operations concerning multiple clusters in parallel. Each cluster gets its own worker in order to create, update, sync or delete clusters. Each worker acquires the lock on a cluster. Subsequent operations on the same cluster have to wait until the current one finishes. There is a pool of parallel workers, configurable with the `workers` parameter in the configmap and set by default to 4. The cluster-related tasks are assigned to the workers based on a cluster name: the tasks for the same cluster will be always assigned to the same worker. There is no blocking between workers, although there is a chance that a single worker will become a bottleneck if too many clusters are assigned to it; therefore, for large-scale deployments it might be necessary to bump up workers from the default value.	2017-05-12 11:41:35 +02:00
Oleksii Kliukin	1c4bce86df	Avoid "bulk-comparing" pod resources during sync. (#109 ) * Avoid "bulk-comparing" pod resources during sync. First attempt to fix bogus restarts due to the reported mismatch of container resources where one of the resources is an empty struct, while the other has all fields set to nil. In addition, add an ability to set limits and requests per pod, as well as the operator-level defaults.	2017-05-12 11:41:35 +02:00
Murat Kabilov	a7c57874d5	Do not create roles if cluster is masterless fix pod deletion	2017-05-12 11:41:34 +02:00
Murat Kabilov	08c0e3b6dd	Use unified type for the namespaced object names	2017-05-12 11:41:34 +02:00
Oleksii Kliukin	71b93b4cc2	Feature/infrastructure roles (#91 ) * Add infrastructure roles configured globally. Those are the roles defined in the operator itself. The operator's configuration refers to the secret containing role names, passwords and membership information. While they are referred to as roles, in reality those are users. In addition, improve the regex to filter out invalid users and make sure user secret names are compatible with DNS name spec. Add an example manifest for the infrastructure roles.	2017-05-12 11:41:33 +02:00
Murat Kabilov	655f6dcadb	make cluster resources private	2017-05-12 11:41:33 +02:00
Murat Kabilov	322676a6b9	Skip deleting Pods and PVCs if failed to delete StatefulSet	2017-05-12 11:41:32 +02:00
Murat Kabilov	bb4fec25ae	Fix deletion of the failed cluster; more debug messages	2017-05-12 11:41:32 +02:00
Oleksii Kliukin	8e658174e8	Fix the issue with calling a non-existent function.	2017-05-12 11:41:31 +02:00
Murat Kabilov	d4bb72989a	Warn if etcd key for the new cluster already exist	2017-05-12 11:41:31 +02:00
Murat Kabilov	852c5beae5	Check etcd key availability for the new cluster	2017-05-12 11:41:31 +02:00
Oleksii Kliukin	04ed22f73f	Remove unnecessary initializations.	2017-05-12 11:41:31 +02:00
Murat Kabilov	ee83e196a9	Fix secrets sync * log if secret already exists	2017-05-12 11:41:30 +02:00
Oleksii Kliukin	8268b07ad2	Set logger level per package instead of doing this globally	2017-05-12 11:41:30 +02:00
Oleksii Kliukin	3a4c6268be	Increase log verbosity, namely for object updates. - add a new environment variable for triggering debug log level - show both new, old object and diff during syncs and updates - use pretty package to pretty-print go structures -	2017-05-12 11:41:29 +02:00
Murat Kabilov	c2d2a67ad5	Get config from environment variables; ignore pg major version change; get rid of resources package;	2017-05-12 11:41:29 +02:00
Murat Kabilov	79a6726d4d	Increase logging verbosity, restructure code	2017-05-12 11:41:28 +02:00
Murat Kabilov	6f7399b36f	Sync clusters states * move statefulset creation from cluster spec to the separate function * sync cluster state with desired state; * move out from arrays for cluster resources; * recreate pods instead of deleting them in case of statefulset change * check for master while creating cluster/updating pods * simplify retryutil * list pvc while listing resources * name kubernetes resources with capital letter * do rolling update in case of env variables change	2017-05-12 11:41:27 +02:00
Oleksii Kliukin	1377724b2e	Fix a compliation error.	2017-05-12 11:41:27 +02:00
Oleksii Kliukin	31d7426327	ClusterTeamName -> ClusterName. Add a TODO item.	2017-05-12 11:41:27 +02:00
Oleksii Kliukin	55dbacdfa6	Assign DNS name to the cluster. DNS name is generated from the team name and cluster name. Use "zalando.org/dnsname" service annotation that makes 'mate' service assign a CNAME to the load balancer name.	2017-05-12 11:41:27 +02:00
Murat Kabilov	486c8ecb07	use neutral name for set cluster status function	2017-05-12 11:41:26 +02:00
Murat Kabilov	1c6e7ac2e7	loadBalancerSourceRanges update	2017-05-12 11:41:26 +02:00
Murat Kabilov	fc127069ab	remove unnecessary ControllerNamespace	2017-05-12 11:41:26 +02:00
Murat Kabilov	416dace289	get rid of arrays in the kuberesources; use shorter form of checking for errors	2017-05-12 11:41:26 +02:00
Oleksii Kliukin	abd313f2d9	Fix a missing colon.	2017-05-12 11:41:26 +02:00
Oleksii Kliukin	f65fab00dd	Fix a typo	2017-05-12 11:41:26 +02:00
Oleksii Kliukin	033c28f03a	Delete persistent volumes on deletion of the cluster.	2017-05-12 11:41:26 +02:00
Murat Kabilov	caa0eab19b	Move statefulset creation from cluster spec to the separate function	2017-05-12 11:41:25 +02:00
Oleksii Kliukin	a2e78ac2ec	Feature/persistent volumes	2017-05-12 11:41:25 +02:00
Murat Kabilov	ae77fa15e8	Pod Rolling update introduce Pod events channel; add parsing of the MaintenanceWindows section; skip deleting Etcd key on cluster delete; use external etcd host; watch for tpr/pods in the namespace of the operator pod only;	2017-05-12 11:41:25 +02:00

1 2 3 4 5 ...

255 Commits