postgres-operator

Commit Graph

Author	SHA1	Message	Date
Oleksii Kliukin	e1ed4b847d	Use code-generation for CRD API and deepcopy methods (#369 ) Client-go provides a https://github.com/kubernetes/code-generator package in order to provide the API to work with CRDs similar to the one available for built-in types, i.e. Pods, Statefulsets and so on. Use this package to generate deepcopy methods (required for CRDs), instead of using an external deepcopy package; we also generate APIs used to manipulate both Postgres and OperatorConfiguration CRDs, as well as informers and listers for the Postgres CRD, instead of using generic informers and CRD REST API; by using generated code we can get rid of some custom and obscure CRD-related code and use a better API. All generated code resides in /pkg/generated, with an exception of zz_deepcopy.go in apis/acid.zalan.do/v1 Rename postgres-operator-configuration CRD to OperatorConfiguration, since the former broke naming convention in the code-generator. Moved Postgresql, PostgresqlList, OperatorConfiguration and OperatorConfigurationList and other types used by them into Change the type of the Error field in the Postgresql crd to a string, so that client-go could generate a deepcopy for it. Use generated code to set status of CRD objects as well. Right now this is done with patch, however, Kubernetes 1.11 introduces the /status subresources, allowing us to set the status with the special updateStatus call in the future. For now, we keep the code that is compatible with earlier versions of Kubernetes. Rename postgresql.go to database.go and status.go to logs_and_api.go to reflect the purpose of each of those files. Update client-go dependencies. Minor reformatting and renaming.	2018-08-15 17:22:25 +02:00
Oleksii Kliukin	e933908084	Configure pg_hba in the local postgresql configuration of Patroni. (#361 ) Previously, the operator put pg_hba into the bootstrap/pg_hba key of Patroni. That had 2 adverse effects: - pg_hba.conf was shadowed by Spilo default section in the local postgresql configuration - when updating pg_hba in the cluster manifest, the updated lines were not propagated to DCS, since the key was defined in the bootstrap section of Patroni. Include some minor refactoring, moving methods to unexported when possible and commenting out usage of md5, so that gosec won't complain. Per https://github.com/zalando-incubator/postgres-operator/issues/330 Review by @zerg-junior	2018-08-08 11:01:26 +02:00
Oleksii Kliukin	199aa6508c	Populate list of clusters in the controller at startup. (#364 ) Assign the list of clusters in the controller with the up-to-date list of Postgres manifests on Kubernetes during the startup. Node migration routines launched asynchronously to the cluster processing rely on an up-to-date list of clusters in the controller to detect clusters affected by the migration of the node and lock them when doing migration of master pods. Without the initial list the operator was subject to race conditions like the one described at https://github.com/zalando-incubator/postgres-operator/issues/363 Restructure the code to decouple list cluster function required by the postgresql informer from the one that emits cluster sync events. No extra work is introduced, since cluster sync already runs in a separate goroutine (clusterResync). Introduce explicit initial cluster sync at the end of acquireInitialListOfClusters instead of relying on an implicit one coming from list function of the PostgreSQL informer. Some minor refactoring. Review by @zerg-junior	2018-08-08 11:00:56 +02:00
Oleksii Kliukin	b06186eb41	Linter-induced code refactoring, run round 2. (#360 ) Run more linters in the gometalinter, i.e. deadcode, megacheck, nakedret, dup. More consistent code formatting, remove two dead functions, eliminate a bunch of naked returns, refactor a few functions to avoid code duplication.	2018-08-06 12:09:19 +02:00
Oleksii Kliukin	59f0c5551e	Allow configuring pod priority globally and per cluster. (#353 ) * Allow configuring pod priority globally and per cluster. Allow to specify pod priority class for all pods managed by the operator, as well as for those belonging to individual clusters. Controlled by the pod_priority_class_name operator configuration parameter and the podPriorityClassName manifest option. See https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#priorityclass for the explanation on how to define priority classes since Kubernetes 1.8. Some import order changes are due to go fmt. Removal of OrphanDependents deprecated field. Code review by @zerg-junior	2018-08-03 14:03:37 +02:00
Oleksii Kliukin	ac7b132314	Refactoring inspired by gometalinter. (#357 ) Among other things, fix a few issues with deepcopy implementation.	2018-08-03 11:09:45 +02:00
Oleksii Kliukin	d2d3f21dc2	Client go upgrade v6 (#352 ) There are shortcuts in this code, i.e. we created the deepcopy function by using the deepcopy package instead of the generated code, that will be addressed once migrated to client-go v8. Also, some objects, particularly statefulsets, are still taken from v1beta, this will also be addressed in further commits once the changes are stabilized.	2018-08-01 11:08:01 +02:00
Oleksii Kliukin	0181a1b5b1	Introduce a repair scan to fix failing clusters (#304 ) A repair is a sync scan that acts only on those clusters that indicate that the last add, update or sync operation on them has failed. It is supposed to kick in more frequently than the sync scan. The sync scan still remains useful to fix the consequences of external actions (i.e. someone deletes a postgres-related service by mistake) unbeknownst to the operator. The repair scan is controlled by the new repair_period parameter in the operator configuration. It has to be at least 2 times more frequent than a sync scan to have any effect (a normal sync scan will update both last synced and last repaired attributes of the controller, since repair is just a sync underneath). A repair scan could be queued for a cluster that is already being synced if the sync period exceeds the interval between repairs. In that case a repair event will be discarded once the corresponding worker finds out that the cluster is not failing anymore. Review by @zerg-junior	2018-07-24 11:21:45 +02:00
zerg-junior	417f13c0bd	Submit RBAC credentials during initial Event processing (#344 ) * During initial Event processing submit the service account for pods and bind it to a cluster role that allows Patroni to successfully start. The cluster role is assumed to be created by the k8s cluster administrator.	2018-07-19 16:40:40 +02:00
Oleksii Kliukin	3a9378d3b8	Allow configuring the operator via the YAML manifest. (#326 ) * Up until now, the operator read its own configuration from the configmap. That has a number of limitations, i.e. when the configuration value is not a scalar, but a map or a list. We use a custom code based on github.com/kelseyhightower/envconfig to decode non-scalar values out of plain text keys, but that breaks when the data inside the keys contains both YAML-special elements (i.e. commas) and complex quotes, one good example for that is search_path inside `team_api_role_configuration`. In addition, reliance on the configmap forced a flag structure on the configuration, making it hard to write and to read (see https://github.com/zalando-incubator/postgres-operator/pull/308#issuecomment-395131778). The changes allow to supply the operator configuration in a proper YAML file. That required registering a custom CRD to support the operator configuration and provide an example at manifests/postgresql-operator-default-configuration.yaml. At the moment, both old configmap and the new CRD configuration are supported, so no compatibility issues, however, in the future I'd like to deprecate the configmap-based configuration altogether. Contrary to the configmap-based configuration, the CRD one doesn't embed defaults into the operator code, however, one can use the manifests/postgresql-operator-default-configuration.yaml as a starting point in order to build a custom configuration. Since previously `ReadyWaitInterval` and `ReadyWaitTimeout` parameters used to create the CRD were taken from the operator configuration, which is not possible if the configuration itself is stored in the CRD object, I've added the ability to specify them as environment variables `CRD_READY_WAIT_INTERVAL` and `CRD_READY_WAIT_TIMEOUT` respectively. Per review by @zerg-junior and @Jan-M.	2018-07-16 16:20:46 +02:00
Oleksii Kliukin	25a306244f	Support for per-cluster and operator global sidecars (#331 ) * Define sidecars in the operator configuration. Right now only the name and the docker image can be defined, but with the help of the pod_environment_configmap parameter arbitrary environment variables can be passed to the sidecars. * Refactoring around generatePodTemplate. Original implementation of per-cluster sidecars by @theRealWardo Per review by @zerg-junior and @Jan-M	2018-07-02 16:25:27 +02:00
zerg-junior	7394c15d0a	Make AWS region configurable in the operator config map (#333 )	2018-06-27 17:29:02 +02:00
Oleksii Kliukin	9cb48e0889	Document operator configuration parameters. (#313 )	2018-06-08 13:21:57 +02:00
Oleksii Kliukin	04b660519a	Fix exec into pods to resize volumes for multi-container pods. The original code assumed only one container per pod.	2018-06-04 14:51:39 +02:00
Oleksii Kliukin	48a5744314	Use Patroni API to set bootstrap-only options. (#299 ) Call Patroni API /config in order to set special options that are ignored when set in the configuration file, such as max_connections. Per https://github.com/zalando-incubator/postgres-operator/issues/297 * Some minor refactoring: Rename Cluster ManualFailover to Switchover Rename Patroni Failover to Switchover Add more details to error messages and comments introduced in this PR. Review by @zerg-junior	2018-05-29 12:35:25 +02:00
Sergey Dudoladov	2e041c50e6	Bump up default Spilo image	2018-05-28 16:54:27 +02:00
Manuel Gómez	32a1456a68	Update config.go	2018-05-24 16:58:46 +02:00
Sergey Dudoladov	749d723f55	Shorten the comment	2018-05-24 16:22:13 +02:00
Sergey Dudoladov	9824ddae5e	Fix etcd_host default	2018-05-24 16:05:45 +02:00
Oleksii Kliukin	11d568bf65	Address code review by @zerg-junior - new info messages, rename the annotation flag.	2018-05-15 16:50:03 +02:00
Oleksii Kliukin	0c616a802f	Merge branch 'master' into rolling_updates_with_statefulset_annotations # Conflicts: # pkg/cluster/k8sres.go	2018-05-15 15:33:34 +02:00
Oleksii Kliukin	987b43456b	Deprecate old LB options, fix endpoint sync. (#287 ) * Deprecate old LB options, fix endpoint sync. - deprecate useLoadBalancer, replicaLoadBalancer from the manifest and enable_load_balancer from the operator configuration. The old operator configuration options become no-op with this commit. For the old manifest options, `useLoadBalancer` and `replicaLoadBalancer` are still consulted, but only in the absence of the new ones (enableMasterLoadBalancer and enableReplicaLoadBalancer). - Make sure the endpoint being created during the sync receives proper addresses subset. This is more critical for the replicas, as for the masters Patroni will normally re-create the endpoint before the operator. - Avoid creating the replica endpoint, since it will be created automatically by the corresponding service. - Update the README and unit tests. Code review by @mgomezch and @zerg-junior	2018-05-15 15:19:18 +02:00
Oleksii Kliukin	332dab5237	Merge branch 'rolling_updates_with_statefulset_annotations' of github.com:zalando-incubator/postgres-operator into rolling_updates_with_statefulset_annotations	2018-05-08 14:51:10 +02:00
Sergey Dudoladov	59ded0c212	Shorten bucket name	2018-05-02 14:05:57 +02:00
Sergey Dudoladov	c45219bafa	Set up an S3 bucket for the postgres daily logs	2018-05-02 12:52:42 +02:00
Sergey Dudoladov	d99b553ec1	Convert default account definition into JSON	2018-04-25 12:35:16 +02:00
Sergey Dudoladov	e3f7fac443	Comment on the default value for pod service account name	2018-04-24 15:41:28 +02:00
Sergey Dudoladov	485ec4b8ea	Move service account to Controller	2018-04-24 15:13:08 +02:00
Sergey Dudoladov	c31c76281c	Make operator unaware of its own service account	2018-04-23 14:38:20 +02:00
Sergey Dudoladov	bd51d2922b	Turn ServiceAccount into struct value to avoid race condition during account creation	2018-04-20 13:05:05 +02:00
Sergey Dudoladov	214ae04aa7	Deploy service account for pod creation on demand	2018-04-18 16:20:20 +02:00
Sergey Dudoladov	96d46252f5	Change the default values to closer match previous behaviour	2018-03-26 11:43:46 +02:00
Sergey Dudoladov	a8862aeee1	Enable backward compatibility for enable_load_balancer setting from operator configmap	2018-03-19 17:19:50 +01:00
Sergey Dudoladov	145689c950	Disable load balancer for master service by default (it may cost money)	2018-03-16 13:18:13 +01:00
Sergey Dudoladov	0986e56226	Add separate params for master and replica load balancers to operator configuration	2018-03-14 12:12:28 +01:00
Dmitry Dolgov	bf4b0f0f33	Merge pull request #240 from zalando-incubator/feature/goreport-improvements Some improvements for golint, ineffassign and misspell	2018-02-22 11:31:08 +01:00
Oleksii Kliukin	cca73e30b7	Make code around recreating pods and creating objects in the database less brittle (#213 ) There used to be a masterLess flag that was supposed to indicate whether the cluster it belongs to runs without the acting master by design. At some point, as we didn't really have support for such clusters, the flag has been misused to indicate there is no master in the cluster. However, that was not done consistently (a cluster without all pods running would never be masterless, even when the master is not among the running pods) and it was based on the wrong assumption that the masterless cluster will remain masterless until the next attempt to change that flag, ignoring the possibility of master coming up or some node doing a successful promotion. Therefore, this PR gets rid of that flag completely. When the cluster is running with 0 instances, there is obviously no master and it makes no sense to create any database objects inside the non-existing master. Therefore, this PR introduces an additional check for that. recreatePods were assuming that the roles of the pods recorded when the function has started will not change; for instance, terminated replica pods should start as replicas. Revisit that assumption by looking at the actual role of the re-spawned pods; that avoids a failover if some replica has promoted to the master role while being re-spawned. In addition, if the failover from the old master was unsuccessful, we used to stop and leave the old master running on an old pod, without recording this fact anywhere. This PR makes the failover failure emit a warning, but not stop recreating the last master pod; in the worst case, the running master will be terminated, however, this case is a rather unlikely one. As a side effect, make waitForPodLabel return the pod definition it waited for, avoiding extra API calls in recreatePods and movePodFromEndOfLifeNode	2018-02-22 10:42:05 +01:00
Oleksii Kliukin	85f7c944c2	Improve the condition check.	2018-02-22 10:13:46 +01:00
Sergey Dudoladov	e048328d6a	Comment on special values for watched namespace	2018-02-20 17:26:17 +01:00
Sergey Dudoladov	dcfc9925f6	Respond to code review	2018-02-20 14:43:02 +01:00
Dmitrii Dolgov	a7cd859919	Some improvements for golint, ineffassign and misspell	2018-02-19 17:46:31 +01:00
Sergey Dudoladov	088bf70e7d	Merge branch 'master' into support-many-namespaces	2018-02-16 15:06:10 +01:00
Sergey Dudoladov	06fd9e33f5	Watch the namespace where operator deploys to unless told otherwise	2018-02-13 18:17:47 +01:00
Dmitrii Dolgov	4c1db33c27	Change the order of arguments	2018-02-08 10:43:27 +01:00
Sergey Dudoladov	de2a028592	Warn if the watched namespace does not exist	2018-02-07 17:43:05 +01:00
Dmitrii Dolgov	dd79fcd036	Tests for retry_utils One can argue about how necessary they are, but at least I remembered how to do golang.	2018-02-07 17:04:43 +01:00
Sergey Dudoladov	74fa7b9492	Restrict operator to single watched namespace via env var	2018-02-07 16:44:49 +01:00
Sergey Dudoladov	ea84f9d577	Rename the configmap 'namespace' entry to avoid confusion with the map's own namespace	2018-02-06 15:09:00 +01:00
Oleksii Kliukin	b90a36c909	Set node_readiness_label default to an empty value. (#204 ) Previously, it was set to the lifecycle-status:ready, breaking a lot of minikube deployments. Also it was not possible before to run with this label set to an empty value. Document the effect of the label in the new section of the documentation.	2018-01-16 15:43:03 +01:00
Oleksii Kliukin	8e99518eeb	Improve behavior on node decommissioning (#184 ) * Trigger the node migration on the lack of the readiness label. * Examine the node's readiness status on node add. Make sure we don't miss the not ready node, especially when the operator is killed during the migration.	2018-01-04 11:53:15 +01:00
Manuel Gómez	15c278d4e8	Scalyr agent sidecar for log shipping (#190 ) * Scalyr agent sidecar for log shipping * Remove the default for the Scalyr image Now the image needs to be specified explicitly to enable log shipping to Scalyr. This removes the problem of having to generate the config file or publish our agent image repository. * Add configuration variable for Scalyr server URL Defaults to the EU address. * Alter style Newlines are cheap and make code easier to edit/refactor, but ok. * Fix StatefulSet comparison logic I broke it when I made the comparison consider all containers in the PostgreSQL pod.	2017-12-21 15:34:26 +01:00
Oleksii Kliukin	bf80f5225e	Introduce higher and lower bounds for the number of instances (#178 ) * Introduce higher and lower bounds for the number of instances Reduce the number of instances to the min_instances if it is lower and to the max_instances if it is higher. -1 for either of those means there is no lower or upper bound. In addition, terminate the operator when there is a nonsense in the configuration (i.e. max_instances < min_instances). Reviewed by Jan Mußler and Sergey Dudoladov.	2017-12-15 16:02:50 +01:00
Georg Kunz	e8d9c75949	Allow custom Postgres pod environment variables	2017-12-14 14:39:33 +01:00
Oleksii Kliukin	87bc47d8d0	Fixes for the case of re-creating the cluster after deletion. - make sure that the secrets for the system users (superuser, replication) are not deleted when the main cluster is. Therefore, we can re-create the cluster, potentially forcing Patroni to restore it from the backup and enable Patroni to connect, since it will use the old password, not the newly generated random one. - when syncing users, always check whether they are already in the DB. Previously, we did this only for the sync cluster case, but the new cluster could be actually the one restored from the backup by Patroni, having all or some of the users already in place. - delete endpoints last. Patroni uses the $clustername endpoint in order to store the leader related metadata. If we remove it before removing all pods, one of those pods running Patroni will re-create it and the next attempt to create the cluster with the same name will stumble on the existing endpoint. - Use db.Exec instead of db.Query for queries that expect no result. This also fixes the issue with the DB creation, since we didn't release an empty Row object it was not possible to create more than one database for a cluster.	2017-12-13 16:49:00 +01:00
Oleksii Kliukin	1fb8cf7ea0	Avoid overwriting critical users. (#172 ) * Avoid overwriting critical users. Disallow defining new users either in the cluster manifest, teams API or infrastructure roles with the names mentioned in the new protected_role_names parameter (list of comma-separated names) Additionally, forbid defining a user with the name matching either super_username or replication_username, so that we don't overwrite system roles required for correct working of the operator itself. Also, clear PostgreSQL roles on each sync first in order to avoid using the old definitions that are no longer present in the current manifest, infrastructure roles secret or the teams API.	2017-12-05 14:27:12 +01:00
Oleksii Kliukin	637921cdee	Tests for initHumanUsers and initRobotUsers. Change the Cluster class in the process to implement Teams API calls and Oauth token fetches as interfaces, so that we can mock them in the tests.	2017-12-04 10:49:25 +01:00
Oleksii Kliukin	dd0affc390	Tweak our reaction to the cluster upgrade process. Previously, the operator started to move the pods off the nodes to be decommissioned by watching the eol_node_label value. Every new postgres pod has been created with the anti-affinity to that label, making sure that the pods being moved won't land on another to be decommissioned node. The changes introduce another label that indicates the ready node. The new pod affinity will ensure that the pod is only scheduled to the node marked as ready, discarding the previous anti-affinity. That way the nodes can transition from the pending-decomission to the other statuses (drained, terminating) without having pods suddenly scaled to them. In addition, rename the label that triggers the start of the upgrade process to node_eol_label (for consistency with node_readiness_label) and set its default value to lifecycle-status:pending-decomission.	2017-11-30 14:11:49 +01:00
Oleksii Kliukin	1ffe98ba9f	Fix the connection leak and user options sync. - fix the lack of closing the cursor for the query that returned no rows. - fix syncing of the user options, as previously those were not fetched from the database.	2017-11-27 16:46:34 +01:00
Oleksii Kliukin	086ead03f5	Warn about attempts to use escape quotes.	2017-11-22 10:43:35 +01:00
Oleksii Kliukin	975b21f633	Rename api roles configuration parameter. Change api_roles_configuration to team_api_role_configuration	2017-11-22 10:43:35 +01:00
Oleksii Kliukin	6b2f5071f7	Special case for search_path in user options. - search_path accepts a list of values that cannot be quoted, as quoting would make PostgreSQL interpret the result as a single value. Since we require quoting of values with commas in the operator's configMap in order to avoid confusing them with the separate map entities, we need to strip those quotes before passing the value to PostgreSQL. - make fmt run	2017-11-22 10:43:35 +01:00
Oleksii Kliukin	2079d811b4	Add tests for the string splitting function.	2017-11-22 10:43:35 +01:00
Oleksii Kliukin	e95f80e351	Make configMap marshaling code aware of quotes. A value in a configMap that is a map itself (a key:value string separated by commas) may include commas inside quotes (i.e. search_path:"public,"$user"). The changes make marshaling code process such cases correctly.	2017-11-22 10:43:35 +01:00
Oleksii Kliukin	2352fc9a39	go fmt run	2017-11-22 10:43:35 +01:00
Oleksii Kliukin	71f57c9fe3	Fix escaping of parameter values and extra spaces. - document the newly introduced option (for now in the main README) - make query error output more readable.	2017-11-22 10:43:35 +01:00
Oleksii Kliukin	415a7fdc4d	Allow global configuration options for API roles. Add options to the PgUser structure, potentially allowing to set per-role options in the cluster definition as well. Introduce api_roles_configuration operator option with the default of log_statement=all	2017-11-22 10:43:35 +01:00
Oleksii Kliukin	c25e849fe4	Fix a failure to create new statefulset at sync. Also do a fmt run.	2017-11-08 18:24:17 +01:00
Murat Kabilov	86803406db	use sync methods while updating the cluster	2017-11-03 12:00:43 +01:00
Georg Kunz	47dd766fa7	Add node toleration config to PodSpec (#151 ) * Add node toleration config to PodSpec This allows to taint nodes dedicated to Postgres and prevents other pods from running on these nodes. * Document taint and toleration setup And remove setting from default operator ConfigMap * Allow to overwrite tolerations with Postgres manifest	2017-11-02 19:10:44 +01:00
Oleksii Kliukin	eba23279c8	Kube cluster upgrade	2017-10-19 10:49:42 +02:00
Murat Kabilov	202f2de988	Retry connecting to pg	2017-10-17 17:03:50 +02:00
Murat Kabilov	6c4cb4e9da	Perform manual failover during the scale down	2017-10-16 17:41:23 +02:00
Murat Kabilov	5b29576a8e	Remove redundant constants	2017-10-16 15:52:48 +02:00
Jan Mussler	cec695d48e	Superuser toggle for team members Make superuser toggleable for team members. Add an "admin" role to team members if superuser is disabled.	2017-10-12 15:01:54 +02:00
Murat Kabilov	83c8d6c419	Extend diagnostic api with worker status info	2017-10-11 12:26:09 +02:00
Murat Kabilov	2f3bb1e265	set the proper name for the crd related constants file	2017-10-09 11:01:46 +02:00
Murat Kabilov	a35e9c6119	move from tpr to crd	2017-10-06 15:12:08 +02:00
Murat Kabilov	93d4bf2b55	Merge branch 'master' into api-improvements	2017-09-26 14:47:13 +02:00
Murat Kabilov	9a66e09b88	cluster history api endpoint	2017-09-26 14:30:45 +02:00
Murat Kabilov	ed476ae85d	add missing comment for the method	2017-09-26 13:39:13 +02:00
Murat Kabilov	c44cfff988	add Diff util method	2017-09-26 13:13:15 +02:00
Murat Kabilov	c67f06956e	fix comments for ringlogger	2017-09-26 13:12:38 +02:00
Murat Kabilov	d876f4d88e	set secret name template via config map	2017-09-18 14:25:09 +02:00
Oleksii Kliukin	7667847bfe	Feature/validate role options (#101 ) Be more rigorous about validating user flags. Only accept CREATE ROLE flags that don't have any params (i.e. not ADMIN or CONNECTION LIMIT). Check that both flag and NOflag are not used at the same time.	2017-09-15 13:57:48 +02:00
Oleksii Kliukin	8b85935a7a	Allow cloning clusters from the operator. (#90 ) Allow cloning clusters from the operator. The changes add a new JSON node `clone` with possible values `cluster` and `timestamp`. `cluster` is mandatory, and setting a non-empty `timestamp` triggers wal-e point in time recovery. Spilo and Patroni do the whole heavy-lifting, the operator just defines certain variables and gathers some data about how to connect to the host to clone or the target S3 bucket. As a minor change, set the image pull policy to IfNotPresent instead of Always to simplify local testing. Change the default replication username to standby.	2017-09-08 16:47:03 +02:00
Murat Kabilov	8aa11ecee2	Add patroni api client	2017-08-30 16:01:18 +02:00
Murat Kabilov	71dfb33b2b	make pod termination grace period configurable	2017-08-18 16:38:25 +02:00
Murat Kabilov	d2828e5ece	remove var shading; fix imports	2017-08-15 15:59:10 +02:00
Murat Kabilov	38e0ffecf7	make controllerinformer interface private; use named regexp groups	2017-08-15 14:07:16 +02:00
Murat Kabilov	82d5583809	add diagnostic api http server	2017-08-15 12:20:09 +02:00
Murat Kabilov	51fdfb90f7	log cluster and controller events in the ringlog via logrus hook	2017-08-15 12:16:09 +02:00
Murat Kabilov	4ee28e3818	add ringlog	2017-08-15 11:59:09 +02:00
Murat Kabilov	606d000022	fix test	2017-08-15 10:41:04 +02:00
Murat Kabilov	5470f20be4	always pass a cluster name as a logger field	2017-08-15 10:29:18 +02:00
Murat Kabilov	e26db66cb5	start all the log messages with lowercase letters	2017-08-15 10:12:36 +02:00
Oleksii Kliukin	8b58782a4a	fix pam_role_name parameter name.	2017-08-02 17:55:06 +02:00
Murat Kabilov	cf663cb841	Fix golint warnings	2017-08-01 16:08:56 +02:00
Murat Kabilov	1211220208	Skip running empty set of queries	2017-08-01 10:09:09 +02:00
Murat Kabilov	1f8b37f33d	Make use of kubernetes client-go v4 * client-go v4.0.0-beta0 * remove unnecessary methods for tpr object * rest client: use interface instead of structure pointer * proper names for constants; some clean up for log messages * remove teams api client from controller and make it per cluster	2017-07-25 15:25:17 +02:00
Oleksii Kliukin	4455f1b639	Feature/unit tests (#53 ) - Avoid relying on Clientset structure to call Kubernetes API functions. While Clientset is a convenient "catch-all" abstraction for calling REST API related to different Kubernetes objects, it's impossible to mock. Replacing it with the kubernetes.Interface would be quite straightforward, but would require an extra level of mocked interfaces, because of the versioning. Instead, a new interface is defined, which contains only the objects we need of the pre-defined versions. - Move KubernetesClient to k8sutil package. - Add more tests.	2017-07-24 16:56:46 +02:00
Murat Kabilov	4f36e447c3	Skip config params with no values (#62 )	2017-07-14 17:22:25 +02:00
Oleksii Kliukin	00150711e4	Configure load balancer on a per-cluster and operator-wide level (#57 ) * Deny all requests to the load balancer by default. * Operator-wide toggle for the load-balancer. * Define per-cluster useLoadBalancer option. If useLoadBalancer is not set - then operator-wide defaults take place. If it is true - the load balancer is created, otherwise a service type clusterIP is created. Internally, we have to completely replace the service if the service type changes. We cannot patch, since some fields from the old service that will remain after patch are incompatible with the new one, and handling them explicitly when updating the service is ugly and error-prone. We cannot update the service because of the immutable fields, that leaves us the only option of deleting the old service and creating the new one. Unfortunately, there is still an issue of unnecessary removal of endpoints associated with the service, it will be addressed in future commits. * Revert the unintended effect of go fmt * Recreate endpoints on service update. When the service type is changed, the service is deleted and then the one with the new type is created. Unfortunately, endpoints are deleted as well. Re-create them afterwards, preserving the original addresses stored in them. * Improve error messages and comments. Use generate instead of gen in names.	2017-06-30 13:38:49 +02:00
Murat Kabilov	9a6b0b8c37	Tests for teams API (#46 )	2017-06-12 17:29:32 +02:00
Oleksii Kliukin	987990fb0e	Move service annotation patch template into the constants.	2017-06-12 10:24:23 +02:00
Murat Kabilov	1540a2ba65	fix typos; remove unnecessary tests; go fmt -s	2017-06-08 15:52:01 +02:00
Murat Kabilov	e104a67260	Fix resync of the clusters	2017-06-08 11:51:48 +02:00
Murat Kabilov	bdc2db97ac	Tests for Specs and Teams API	2017-06-08 10:58:48 +02:00
Oleksii Kliukin	bc0e9ab4bc	Add error checks per report from errcheck-ng	2017-06-08 10:41:44 +02:00
Oleksii Kliukin	dc36c4ca12	Implement replicaLoadBalancer boolean flag. (#38 ) The flag adds a replica service with the name cluster_name-repl and a DNS name that defaults to {cluster}-repl.{team}.{hostedzone}. The implementation converted Service field of the cluster into a map with one or two elements and deals with the cases when the new flag is changed on a running cluster (the update and the sync should create or delete the replica service). In order to pick up master and replica service and master endpoint when listing cluster resources. * Update the spec when updating the cluster.	2017-06-07 13:54:17 +02:00
Oleksii Kliukin	7b0ca31bfb	Implements EBS volume resizing #35 . In order to support volumes different from EBS and filesystems other than EXT2/3/4 the respective code parts were implemented as interfaces. Adding the new resize for the volume or the filesystem will require implementing the interface, but no other changes in the cluster code itself. Volume resizing first changes the EBS and the filesystem, and only afterwards is reflected in the Kubernetes "PersistentVolume" object. This is done deliberately to be able to check if the volume needs resizing by peeking at the Size of the PersistentVolume structure. We recheck, nevertheless, in the EBSVolumeResizer, whether the actual EBS volume size doesn't match the spec, since call to the AWS ModifyVolume is counted against the resize limit of once every 6 hours, even for those calls that shouldn't result in an actual resize (i.e. when the size matches the one for the running volume). As a collateral, split the constants into multiple files, move the volume code into a separate file and fix minor issues related to the error reporting.	2017-06-06 13:53:27 +02:00
Murat Kabilov	1fb05212a9	Refactor teams API package	2017-05-30 10:14:30 +02:00
Murat Kabilov	1111964fee	fix password check in pguserpassword remove magic number	2017-05-26 18:19:12 +02:00
Oleksii Kliukin	afce38f6f0	Fix error messages (#27 ) Use lowercase for kubernetes objects Use %v instead of %s for errors Start error messages with a lowercase letter.	2017-05-22 14:12:06 +02:00
Murat Kabilov	d34273543e	Fix the golint, gosimple warnings	2017-05-18 17:38:54 +02:00
Murat Kabilov	95a57d1e4f	Use named arguments in the DNS name format	2017-05-18 17:23:59 +02:00
Oleksii Kliukin	c2826b10e2	Merge branch 'master' into fix/go-vet-fixes	2017-05-17 11:30:07 +02:00
Oleksii Kliukin	4457ce4e47	Replace the statefulset if it cannot be updated. (#18 ) Updates to statefulset spec for fields other than 'replicas' and 'containers' are forbidden. However, it is possible to delete the old statefulset without deleting its pods and create the new one, using the changed specs. The new statefulset shall pick up the orphaned pods. Change the statefulset's comparison to return the combined effect of all checks, not just the first non-matching field.	2017-05-17 11:28:21 +02:00
Murat Kabilov	22bcae0784	skip unused variable	2017-05-17 11:15:09 +02:00
Oleksii Kliukin	5adceceb36	go fmt run	2017-05-12 17:48:25 +02:00
Oleksii Kliukin	abd04e6f5a	Avoid abbreviations in user-facing parameters.	2017-05-12 17:44:51 +02:00
Oleksii Kliukin	03064637f1	Allow disabling access to the DB and the Teams API. Command-line options --nodatabaseaccess and --noteamsapi disable all teams api interaction and access to the Postgres database. This is useful for debugging purposes when the operator runs out of cluster (with --outofcluster flag). The same effect can be achieved by setting enable_db_access and/or enable_teams_api to false.	2017-05-12 17:40:48 +02:00
Murat Kabilov	92d7fbf372	replace github.bus.zalan.do with github.com/zalando-incubator	2017-05-12 11:50:16 +02:00
Murat Kabilov	28a74622d7	Fix typo in the teams api json spec	2017-05-12 11:41:36 +02:00
Murat Kabilov	18700b9ef7	Optimize template constant	2017-05-12 11:41:36 +02:00
Murat Kabilov	fd449342e5	Use Kubernetes API instead of API group	2017-05-12 11:41:36 +02:00
Oleksii Kliukin	6983f444ed	Periodically sync roles with the running clusters. (#102 ) The sync adds or alters database roles based on the roles defined in the cluster's TPR, Team API and operator's infrastructure roles. At the moment, roles are not deleted, as it would be dangerous for the robot roles in case TPR is misconfigured. In addition, ALTER ROLE does not remove role options, i.e. SUPERUSER or CREATEROLE, nor does it remove role membership: only new options are added and new role membership is granted. So far, options like NOSUPERUSER and NOCREATEROLE won't be handled correctly when mixed with the non-negative counterparts; NOLOGIN, however, should be processed correctly. The code assumes that only MD5 passwords are stored in the DB and will likely break with the new SCRAM auth in PostgreSQL 10. On the implementation side, create the new interface to abstract roles merge and creation, move most of the role-based functionality from cluster/pg into the new 'users' module, strip create user code of special cases related to human-based users (moving them to init instead) and fix the password md5 generator to avoid processing already encrypted passwords. In addition, move the system roles off the slice containing all other roles so that no extra effort is needed to avoid creating them. Also, fix a leak in DB connections when the new connection is not considered healthy and discarded without being closed. Initialize the database during the sync phase before syncing users.	2017-05-12 11:41:35 +02:00
Martin Linkhorst	411487e66d	update annotation for ExternalDNS (#115 )	2017-05-12 11:41:35 +02:00
Oleksii Kliukin	49cb395aed	Set ELB timeout annotation for the service. (#114 ) By default the ELB terminates the idle connection after 60 seconds. Increase this interval to a more reasonable one of 1 h.	2017-05-12 11:41:35 +02:00
Murat Kabilov	2370659c69	Parallel cluster processing Run operations concerning multiple clusters in parallel. Each cluster gets its own worker in order to create, update, sync or delete clusters. Each worker acquires the lock on a cluster. Subsequent operations on the same cluster have to wait until the current one finishes. There is a pool of parallel workers, configurable with the `workers` parameter in the configmap and set by default to 4. The cluster-related tasks are assigned to the workers based on a cluster name: the tasks for the same cluster will always be assigned to the same worker. There is no blocking between workers, although there is a chance that a single worker will become a bottleneck if too many clusters are assigned to it; therefore, for large-scale deployments it might be necessary to bump up workers from the default value.	2017-05-12 11:41:35 +02:00
Oleksii Kliukin	1c4bce86df	Avoid "bulk-comparing" pod resources during sync. (#109 ) * Avoid "bulk-comparing" pod resources during sync. First attempt to fix bogus restarts due to the reported mismatch of container resources where one of the resources is an empty struct, while the other has all fields set to nil. In addition, add an ability to set limits and requests per pod, as well as the operator-level defaults.	2017-05-12 11:41:35 +02:00
Murat Kabilov	8026c69222	update default config param values	2017-05-12 11:41:34 +02:00
Murat Kabilov	da438aab3a	Use ConfigMap to store operator's config	2017-05-12 11:41:34 +02:00
Oleksii Kliukin	47e3e29a56	Add version label to the cluster. (#96 ) * Add version label to the cluster. According to the STUPS team the daemon that exports logs to scalyr stops the export if the version label is missing. * Move label names to constants. * Run go fmt	2017-05-12 11:41:34 +02:00
Murat Kabilov	08c0e3b6dd	Use unified type for the namespaced object names	2017-05-12 11:41:34 +02:00
Oleksii Kliukin	71b93b4cc2	Feature/infrastructure roles (#91 ) * Add infrastructure roles configured globally. Those are the roles defined in the operator itself. The operator's configuration refers to the secret containing role names, passwords and membership information. While they are referred to as roles, in reality those are users. In addition, improve the regex to filter out invalid users and make sure user secret names are compatible with DNS name spec. Add an example manifest for the infrastructure roles.	2017-05-12 11:41:33 +02:00
Murat Kabilov	dd2ed5ff9d	Add team name to tpr object metadata name	2017-05-12 11:41:33 +02:00
Murat Kabilov	101dc06acb	Better logging for teams api calls	2017-05-12 11:41:32 +02:00
Oleksii Kliukin	5b66d0adba	Correct go json tags (extra space).	2017-05-12 11:41:32 +02:00
Oleksii Kliukin	3b99ce3d2e	Improve the diff in cluster resources. - Use the branch of pretty with this feature fixed: https://github.com/kr/pretty/pull/42 - Add the Limit to the resources declaration to avoid dummy differences between statefulsets (where both Resource structures are empty, but in one case the fields are not mentioned, while in another they are assigned to empty values).	2017-05-12 11:41:32 +02:00
Oleksii Kliukin	455f91128f	Move master/replica role names into the constants.	2017-05-12 11:41:32 +02:00
Oleksii Kliukin	a5f0ef10d0	go fmt run	2017-05-12 11:41:31 +02:00
Oleksii Kliukin	0764505a10	correct the wal bucket parameter name.	2017-05-12 11:41:31 +02:00
Oleksii Kliukin	7841b85892	Add configuration to support running WAL-E. - Set WAL_S3_BUCKET to point WAL-E where to fetch/store WAL files - Set annotations/iam.amazonaws.com/role to set the role to access AWS. The new env variables are PGOP_WAL_S3_BUCKET and PGOP_KUBE_IAM_ROLE.	2017-05-12 11:41:31 +02:00
Murat Kabilov	852c5beae5	Check etcd key availability for the new cluster	2017-05-12 11:41:31 +02:00
Oleksii Kliukin	8db44d6f18	Avoid unnecessary marshaling.	2017-05-12 11:41:30 +02:00
Oleksii Kliukin	b69b6b26e5	go fmt run	2017-05-12 11:41:30 +02:00
Murat Kabilov	310c119dfa	Display config on operator start up	2017-05-12 11:41:30 +02:00
Murat Kabilov	a97dfb07de	fix struct tag delimiter	2017-05-12 11:41:30 +02:00
Oleksii Kliukin	ba8e8d1857	Avoid showing objects alongside diffs. That reduces the amount of clutter in the debug output. Run go fmt on the sources.	2017-05-12 11:41:30 +02:00
Oleksii Kliukin	3a4c6268be	Increase log verbosity, namely for object updates. - add a new environment variable for triggering debug log level - show both new, old object and diff during syncs and updates - use pretty package to pretty-print go structures	2017-05-12 11:41:29 +02:00
Murat Kabilov	c2d2a67ad5	Get config from environment variables; ignore pg major version change; get rid of resources package;	2017-05-12 11:41:29 +02:00
Murat Kabilov	79a6726d4d	Increase logging verbosity, restructure code	2017-05-12 11:41:28 +02:00
Murat Kabilov	3aaa05fb96	Use encrypted passwords while creating robot users	2017-05-12 11:41:28 +02:00
Oleksii Kliukin	48ba6adf8a	Avoid calling Team API with an expired token. Previously, the controller fetched the Oauth token once at start, so eventually the token would expire and the operator could not create new users. This commit makes the operator fetch the token before each call to the Teams API.	2017-05-12 11:41:28 +02:00
Murat Kabilov	6f7399b36f	Sync clusters states * move statefulset creation from cluster spec to the separate function * sync cluster state with desired state; * move out from arrays for cluster resources; * recreate pods instead of deleting them in case of statefulset change * check for master while creating cluster/updating pods * simplify retryutil * list pvc while listing resources * name kubernetes resources with capital letter * do rolling update in case of env variables change	2017-05-12 11:41:27 +02:00
Oleksii Kliukin	814f75f7c1	Formatting changes	2017-05-12 11:41:27 +02:00
Oleksii Kliukin	7529b84b93	Move all operator-related constants together.	2017-05-12 11:41:27 +02:00
Oleksii Kliukin	55dbacdfa6	Assign DNS name to the cluster. DNS name is generated from the team name and cluster name. Use "zalando.org/dnsname" service annotation that makes 'mate' service assign a CNAME to the load balancer name.	2017-05-12 11:41:27 +02:00
Murat Kabilov	34ac47aed9	Expose container 8080 port	2017-05-12 11:41:26 +02:00
Oleksii Kliukin	776ed3fa0f	Simplify getting configuration.	2017-05-12 11:41:25 +02:00
Oleksii Kliukin	a2e78ac2ec	Feature/persistent volumes	2017-05-12 11:41:25 +02:00
Murat Kabilov	ae77fa15e8	Pod Rolling update introduce Pod events channel; add parsing of the MaintenanceWindows section; skip deleting Etcd key on cluster delete; use external etcd host; watch for tpr/pods in the namespace of the operator pod only;	2017-05-12 11:41:25 +02:00
Murat Kabilov	6e2d64bd50	Create human users from teams api	2017-05-12 11:37:09 +02:00
Murat Kabilov	58506634c4	Create pg users	2017-05-12 11:37:09 +02:00
Murat Kabilov	7e4d0410c2	Use one secret per user	2017-05-12 11:37:09 +02:00
Murat Kabilov	abb1173035	Code refactor	2017-05-12 11:37:09 +02:00

316 Commits
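
Commit 2370659c69 (Parallel cluster processing) says that tasks are assigned to workers "based on a cluster name", so that tasks for the same cluster always reach the same worker. The commit message does not say how the name is mapped to a worker. The sketch below assumes a CRC32 hash of the name taken modulo the pool size; `workerFor`, the event struct and the cluster names are hypothetical and are not taken from the operator's code.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sync"
)

// workerFor maps a cluster name to a worker index in [0, workers).
// Hashing the name is an assumption made for this sketch; the commit
// message only states that assignment depends on the cluster name.
func workerFor(clusterName string, workers uint32) uint32 {
	return crc32.ChecksumIEEE([]byte(clusterName)) % workers
}

func main() {
	const workers = 4 // default pool size named in the commit message

	// Hypothetical cluster events, used only for illustration.
	events := []struct{ cluster, op string }{
		{"team-a-db", "create"},
		{"team-b-db", "create"},
		{"team-a-db", "update"},
	}

	// One queue per worker. Each queue is drained by a single goroutine,
	// so operations on the same cluster are processed one after another.
	queues := make([]chan string, workers)
	var wg sync.WaitGroup
	for i := range queues {
		queues[i] = make(chan string, len(events))
		wg.Add(1)
		go func(id int, tasks <-chan string) {
			defer wg.Done()
			for task := range tasks {
				fmt.Printf("worker %d: %s\n", id, task)
			}
		}(i, queues[i])
	}

	// Dispatch: the same cluster name always lands in the same queue.
	for _, e := range events {
		queues[workerFor(e.cluster, workers)] <- e.cluster + ": " + e.op
	}
	for _, q := range queues {
		close(q)
	}
	wg.Wait()
}
```

With a fixed mapping like this, several busy clusters can hash to one worker while others sit idle, which matches the bottleneck the commit message warns about for large deployments.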