Kubernetes provides the https://github.com/kubernetes/code-generator package to generate an API for working with CRDs similar to the one available for built-in types, e.g. Pods, StatefulSets and so on.
Use this package to generate deepcopy methods (required for CRDs) instead of relying on an external deepcopy package. We also generate APIs for manipulating both the Postgres and OperatorConfiguration CRDs, as well as informers and listers for the Postgres CRD, instead of using generic informers and the CRD REST API; the generated code lets us get rid of some custom and obscure CRD-related code and gives us a better API.
All generated code resides in /pkg/generated, with the exception of zz_deepcopy.go in apis/acid.zalan.do/v1.
Rename the postgres-operator-configuration CRD to OperatorConfiguration, since the former broke the naming conventions assumed by the code-generator.
Move Postgresql, PostgresqlList, OperatorConfiguration and OperatorConfigurationList and other types used by them into the apis/acid.zalan.do/v1 package.
Change the type of the Error field in the Postgresql CRD to a string, so that client-go's code generator can produce a deepcopy for it.
Use generated code to set the status of CRD objects as well. Right now this is done with a patch; Kubernetes 1.11, however, introduces the /status subresource, which will allow us to set the status with
the dedicated updateStatus call in the future. For now, we keep the code compatible with earlier versions of Kubernetes.
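A minimal sketch of the patch-based approach described above, assuming a clientset generated into pkg/generated for the acid.zalan.do/v1 group and a client-go version from before context arguments were added; the import path, the AcidV1() accessor and the plain-string status are assumptions for illustration, not verified code from the repository.

```go
// Sketch only: setting the status of a Postgresql object with a merge patch.
// Once the /status subresource is available (Kubernetes 1.11+), this could be
// replaced by an UpdateStatus call on the same generated client.
package sketch

import (
	"encoding/json"
	"fmt"

	"k8s.io/apimachinery/pkg/types"

	acidclientset "github.com/zalando-incubator/postgres-operator/pkg/generated/clientset/versioned"
)

func setPostgresqlStatus(client acidclientset.Interface, namespace, name, status string) error {
	// Build a merge patch that touches only the status field.
	patch, err := json.Marshal(map[string]interface{}{"status": status})
	if err != nil {
		return fmt.Errorf("could not marshal status patch: %v", err)
	}
	_, err = client.AcidV1().Postgresqls(namespace).Patch(name, types.MergePatchType, patch)
	return err
}
```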
Rename postgresql.go to database.go and status.go to logs_and_api.go to reflect the purpose of each of those files.
Update client-go dependencies.
Minor reformatting and renaming.
Populate the controller's list of clusters with the up-to-date list of
Postgres manifests present in Kubernetes during startup.
Node migration routines, launched asynchronously to cluster
processing, rely on an up-to-date list of clusters in the controller
to detect the clusters affected by a node migration and to lock them
while migrating master pods. Without the initial list the operator was
subject to race conditions like the one described at
https://github.com/zalando-incubator/postgres-operator/issues/363
Restructure the code to decouple the list-clusters function required
by the postgresql informer from the one that emits cluster sync
events. No extra work is introduced, since cluster sync already runs
in a separate goroutine (clusterResync).
Introduce an explicit initial cluster sync at the end of
acquireInitialListOfClusters instead of relying on the implicit one
coming from the list function of the Postgresql informer.
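A minimal sketch of the startup flow just described, assuming the generated clientset for the Postgresql CRD; the registry type, the map key format and queueSync are illustrative stand-ins for the controller's real cluster map and event queue.

```go
package sketch

import (
	"sync"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	acidv1 "github.com/zalando-incubator/postgres-operator/pkg/apis/acid.zalan.do/v1"
	acidclientset "github.com/zalando-incubator/postgres-operator/pkg/generated/clientset/versioned"
)

// clusterRegistry stands in for the controller's internal list of clusters.
type clusterRegistry struct {
	mu       sync.Mutex
	clusters map[string]acidv1.Postgresql
}

// acquireInitialListOfClusters lists all Postgres manifests once at startup,
// registers them in the controller and then queues an explicit sync event for
// each of them instead of relying on the informer's list function.
func acquireInitialListOfClusters(client acidclientset.Interface, namespace string,
	reg *clusterRegistry, queueSync func(acidv1.Postgresql)) error {

	list, err := client.AcidV1().Postgresqls(namespace).List(metav1.ListOptions{})
	if err != nil {
		return err
	}

	reg.mu.Lock()
	for _, pg := range list.Items {
		reg.clusters[pg.Namespace+"/"+pg.Name] = pg
	}
	reg.mu.Unlock()

	for _, pg := range list.Items {
		queueSync(pg) // explicit initial sync for every known cluster
	}
	return nil
}
```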
Some minor refactoring.
Review by @zerg-junior
Run more linters in gometalinter, namely deadcode, megacheck,
nakedret and dupl.
More consistent code formatting, remove two dead functions, eliminate
a bunch of naked returns, refactor a few functions to avoid code
duplication.
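For illustration only (not code from the repository), the kind of change the nakedret findings lead to:

```go
package sketch

import (
	"fmt"
	"strings"
)

// Before: a naked return obscures what the function actually returns.
func splitNameOld(fullName string) (namespace, name string, err error) {
	parts := strings.Split(fullName, "/")
	if len(parts) != 2 {
		err = fmt.Errorf("incorrect resource name %q", fullName)
		return
	}
	namespace, name = parts[0], parts[1]
	return
}

// After: explicit return values on every path.
func splitName(fullName string) (string, string, error) {
	parts := strings.Split(fullName, "/")
	if len(parts) != 2 {
		return "", "", fmt.Errorf("incorrect resource name %q", fullName)
	}
	return parts[0], parts[1], nil
}
```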
There are shortcuts in this code, e.g. we created the deepcopy
function using the deepcopy package instead of the generated code;
that will be addressed once we migrate to client-go v8. Also, some
objects, particularly statefulsets, are still taken from v1beta; this
will also be addressed in further commits once the changes stabilize.
A repair is a sync scan that acts only on those clusters that indicate
that the last add, update or sync operation on them has failed. It is
supposed to kick in more frequently than the regular sync scan. The
sync scan still remains useful to fix the consequences of external
actions (e.g. someone deleting a postgres-related service by mistake)
that happen unbeknownst to the operator.
The repair scan is controlled by the new repair_period parameter in the
operator configuration. It has to be at least 2 times more frequent than
a sync scan to have any effect (a normal sync scan will update both last
synced and last repaired attributes of the controller, since repair is
just a sync underneath).
A repair scan could be queued for a cluster that is already being synced
if the sync period exceeds the interval between repairs. In that case a
repair event will be discarded once the corresponding worker finds out
that the cluster is not failing anymore.
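A sketch of how such a repair loop could be driven; repairPeriod corresponds to the new repair_period option, while the cluster registry, the failure flag and the queue are illustrative.

```go
package sketch

import (
	"sync"
	"time"
)

type clusterState struct {
	lastOperationFailed bool // set when the last add, update or sync failed
}

type controller struct {
	mu       sync.Mutex
	clusters map[string]*clusterState
	queue    chan string // cluster names to be synced by the workers
}

// runRepairLoop periodically queues a sync only for failing clusters; the
// regular, less frequent sync scan still covers external changes.
func (c *controller) runRepairLoop(repairPeriod time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(repairPeriod)
	defer ticker.Stop()

	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			var failing []string
			c.mu.Lock()
			for name, state := range c.clusters {
				if state.lastOperationFailed {
					failing = append(failing, name)
				}
			}
			c.mu.Unlock()
			for _, name := range failing {
				c.queue <- name // a worker discards it if the cluster has recovered
			}
		}
	}
}
```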
Review by @zerg-junior
* During initial event processing, submit the service account for pods and bind it to a cluster role that allows Patroni to start successfully. The cluster role is assumed to be created by the Kubernetes cluster administrator.
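A hedged sketch of that step with client-go of the era referenced in this changelog (no context arguments on Create); the "postgres-pod" names are assumptions, and a namespaced RoleBinding to the admin-provided ClusterRole is just one possible way to grant the access.

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensurePodServiceAccount creates the service account used by Postgres pods
// and binds it to a pre-existing cluster role so Patroni can start.
func ensurePodServiceAccount(client kubernetes.Interface, namespace string) error {
	sa := &v1.ServiceAccount{
		ObjectMeta: metav1.ObjectMeta{Name: "postgres-pod", Namespace: namespace},
	}
	if _, err := client.CoreV1().ServiceAccounts(namespace).Create(sa); err != nil {
		return err
	}

	rb := &rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "postgres-pod", Namespace: namespace},
		Subjects: []rbacv1.Subject{
			{Kind: rbacv1.ServiceAccountKind, Name: "postgres-pod", Namespace: namespace},
		},
		RoleRef: rbacv1.RoleRef{
			APIGroup: rbacv1.GroupName,
			Kind:     "ClusterRole",
			Name:     "postgres-pod", // assumed name of the admin-provided cluster role
		},
	}
	_, err := client.RbacV1().RoleBindings(namespace).Create(rb)
	return err
}
```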
* Up until now, the operator read its own configuration from a
configmap. That has a number of limitations, e.g. when the
configuration value is not a scalar but a map or a list. We use
custom code based on github.com/kelseyhightower/envconfig to decode
non-scalar values out of plain-text keys, but that breaks when the
data inside the keys contains both YAML-special elements (e.g.
commas) and complex quotes; one good example is search_path inside
`team_api_role_configuration`. In addition, reliance on the configmap
forced a flat structure on the configuration, making it hard to write
and to read (see
https://github.com/zalando-incubator/postgres-operator/pull/308#issuecomment-395131778).
The changes allow supplying the operator configuration in a proper
YAML file. That required registering a custom CRD to support the
operator configuration and providing an example at
manifests/postgresql-operator-default-configuration.yaml. At the
moment, both the old configmap and the new CRD configuration are
supported, so there are no compatibility issues; however, in the
future I'd like to deprecate the configmap-based configuration
altogether. Contrary to the configmap-based configuration, the CRD
one doesn't embed defaults into the operator code; however, one can
use manifests/postgresql-operator-default-configuration.yaml as a
starting point to build a custom configuration.
Previously, the `ReadyWaitInterval` and `ReadyWaitTimeout` parameters
used to create the CRD were taken from the operator configuration;
since that is not possible when the configuration itself is stored in
a CRD object, I've added the ability to specify them via the
environment variables `CRD_READY_WAIT_INTERVAL` and
`CRD_READY_WAIT_TIMEOUT` respectively.
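A small sketch of consuming the two environment variables; only the variable names come from the text above, the helper and the fallback defaults are illustrative.

```go
package sketch

import (
	"os"
	"time"
)

// durationFromEnv parses a duration from the environment, falling back to a
// default when the variable is unset or malformed.
func durationFromEnv(name string, defaultValue time.Duration) time.Duration {
	value := os.Getenv(name)
	if value == "" {
		return defaultValue
	}
	d, err := time.ParseDuration(value)
	if err != nil {
		return defaultValue
	}
	return d
}

func crdReadyWaitSettings() (interval, timeout time.Duration) {
	// Defaults here are illustrative, not the operator's actual defaults.
	interval = durationFromEnv("CRD_READY_WAIT_INTERVAL", 4*time.Second)
	timeout = durationFromEnv("CRD_READY_WAIT_TIMEOUT", 30*time.Second)
	return interval, timeout
}
```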
Per review by @zerg-junior and @Jan-M.
When an error happens during the deletion of a Kubernetes object
belonging to the cluster being removed, it makes no sense to abort the
deletion: the manifest will be removed anyway, so all the objects
after the one we aborted at would stay around forever.
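A minimal sketch of that behaviour: log each failure and keep going, so the objects after a failed deletion are still removed; the slice of delete functions is illustrative.

```go
package sketch

import "log"

// deleteClusterObjects runs every deletion step even if some of them fail.
func deleteClusterObjects(deleters []func() error) {
	for _, deleteObject := range deleters {
		if err := deleteObject(); err != nil {
			// Do not abort: report the error and continue with the rest.
			log.Printf("could not delete cluster object: %v", err)
		}
	}
}
```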
Avoid sharing pointers to the same spec data between the informer and
the clusters. The only catch is that the error field is cleared during
the deepcopy, since it is an interface that may contain private fields
that cannot be copied; however, the error is only used when the
manifest is parsed and before it is queued, so we never refer to that
field in the cluster structure.
987b434 introduced a new function that modifies the cluster spec in
memory before the cluster processes it. Unfortunately, the instance
being modified turned out to be the one stored internally in the
PostgresInformer, so those modifications were propagated with further
cluster events, producing update loops on some occasions.
This commit makes sure we copy the spec before putting it into the
clusterEventQueues.
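A self-contained sketch of the fix: the spec is cloned before it enters the per-cluster event queue, so the object held by the informer is never shared with the workers. The Postgresql stand-in and the Clone helper here are illustrative; the operator uses its own deep-copy machinery for the real type.

```go
package sketch

// Postgresql stands in for the CRD type; Clone plays the role of whatever
// deep-copy helper is available.
type Postgresql struct {
	Name string
	Spec map[string]string
}

func (p *Postgresql) Clone() *Postgresql {
	if p == nil {
		return nil
	}
	c := *p
	c.Spec = make(map[string]string, len(p.Spec))
	for k, v := range p.Spec {
		c.Spec[k] = v
	}
	return &c
}

type clusterEvent struct {
	eventType string
	newSpec   *Postgresql
}

// queueClusterEvent copies the spec before it enters the event queue, so the
// object stored inside the informer is never mutated by the workers.
func queueClusterEvent(queue chan<- clusterEvent, eventType string, fromInformer *Postgresql) {
	queue <- clusterEvent{eventType: eventType, newSpec: fromInformer.Clone()}
}
```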
* Deprecate old LB options, fix endpoint sync.
- deprecate useLoadBalancer and replicaLoadBalancer from the manifest
and enable_load_balancer from the operator configuration. The old
operator configuration options become no-ops with this commit. The
old manifest options `useLoadBalancer` and `replicaLoadBalancer` are
still consulted, but only in the absence of the new ones
(enableMasterLoadBalancer and enableReplicaLoadBalancer); see the
sketch after this list.
- Make sure the endpoint created during the sync receives the proper
address subset. This is more critical for the replicas, since for the
masters Patroni will normally re-create the endpoint before the
operator does.
- Avoid creating the replica endpoint, since it will be created automatically
by the corresponding service.
- Update the README and unit tests.
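A sketch of the fallback logic described in the first item above; the struct is a simplified stand-in for the manifest spec, and operatorDefault represents whatever operator-level default applies.

```go
package sketch

type postgresSpec struct {
	EnableMasterLoadBalancer  *bool // new option
	EnableReplicaLoadBalancer *bool // new option
	UseLoadBalancer           *bool // deprecated
	ReplicaLoadBalancer       *bool // deprecated
}

// masterLoadBalancerEnabled prefers the new flag and only falls back to the
// deprecated one when the new flag is absent from the manifest.
func masterLoadBalancerEnabled(spec postgresSpec, operatorDefault bool) bool {
	if spec.EnableMasterLoadBalancer != nil {
		return *spec.EnableMasterLoadBalancer
	}
	if spec.UseLoadBalancer != nil { // deprecated manifest option
		return *spec.UseLoadBalancer
	}
	return operatorDefault
}

func replicaLoadBalancerEnabled(spec postgresSpec, operatorDefault bool) bool {
	if spec.EnableReplicaLoadBalancer != nil {
		return *spec.EnableReplicaLoadBalancer
	}
	if spec.ReplicaLoadBalancer != nil { // deprecated manifest option
		return *spec.ReplicaLoadBalancer
	}
	return operatorDefault
}
```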
Code review by @mgomezch and @zerg-junior
* client-go v4.0.0-beta0
* remove unnecessary methods for the TPR object
* rest client: use an interface instead of a structure pointer (see the sketch after this list)
* proper names for constants; some clean-up of log messages
* remove teams api client from controller and make it per cluster
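Illustration of the rest-client point: holding the rest.Interface type instead of a *rest.RESTClient makes it possible to substitute a fake client in tests; the controller struct here is illustrative.

```go
package sketch

import (
	"k8s.io/client-go/rest"
)

type controller struct {
	// Before: restClient *rest.RESTClient (pins the concrete implementation).
	restClient rest.Interface
}

func newController(restClient rest.Interface) *controller {
	return &controller{restClient: restClient}
}
```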
In order to support volumes other than EBS and filesystems other than EXT2/3/4, the respective code parts were implemented as interfaces. Adding resize support for a new volume or filesystem type requires implementing the interface, but no other changes to the cluster code itself.
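A hypothetical shape of the two abstractions, to show the idea; the method sets below are assumptions, not the operator's actual interfaces.

```go
package sketch

// VolumeResizer abstracts the cloud-provider side of a resize (EBS today).
type VolumeResizer interface {
	// VolumeBelongsToProvider tells whether this resizer can handle the
	// volume backing a given PersistentVolume (e.g. an EBS volume ID).
	VolumeBelongsToProvider(persistentVolumeSource string) bool
	// ResizeVolume grows the underlying cloud volume to newSizeGiB.
	ResizeVolume(volumeID string, newSizeGiB int64) error
}

// FilesystemResizer abstracts growing the filesystem on the resized volume.
type FilesystemResizer interface {
	// CanResizeFilesystem reports whether the filesystem type (ext2/3/4, ...)
	// is supported by this implementation.
	CanResizeFilesystem(fstype string) bool
	// ResizeFilesystem grows the filesystem on the given device, typically by
	// executing the resize command inside the Postgres pod.
	ResizeFilesystem(deviceName string, commandExecutor func(cmd string) (string, error)) error
}
```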
Volume resizing first changes the EBS volume and the filesystem, and only afterwards is reflected in the Kubernetes "PersistentVolume" object. This is done deliberately, so that whether a volume needs resizing can be checked by peeking at the Size of the PersistentVolume structure. Nevertheless, the EBSVolumeResizer rechecks whether the actual EBS volume size already matches the spec, since a call to the AWS ModifyVolume API counts against the resize limit of once every 6 hours, even when the call would not result in an actual resize (i.e. when the requested size matches that of the running volume).
As collateral changes, split the constants into multiple files, move the volume code into a separate file and fix minor issues related to error reporting.
The sync adds or alters database roles based on the roles defined
in the cluster's TPR, the Team API and the operator's infrastructure
roles. At the moment, roles are not deleted, as that would be
dangerous for the robot roles in case the TPR is misconfigured. In
addition, ALTER ROLE does not remove role options (e.g. SUPERUSER or
CREATEROLE), nor does it remove role membership: only new options are
added and new role membership is granted. So far, options like
NOSUPERUSER and NOCREATEROLE won't be handled correctly when mixed
with their non-negative counterparts, though NOLOGIN should be
processed correctly.
The code assumes that only MD5 passwords are stored in the DB and
will likely break with the new SCRAM auth in PostgreSQL 10.
On the implementation side, create a new interface to abstract role
merging and creation, move most of the role-based functionality from
cluster/pg into the new 'users' module, strip the create-user code of
special cases related to human users (moving them to init instead)
and fix the password md5 generator to avoid processing already
encrypted passwords. In addition, move the system roles off the slice
containing all other roles in order to avoid extra effort to skip
creating them.
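The password fix boils down to recognising values that are already in PostgreSQL's MD5 format ("md5" followed by the 32-character hex digest of the password concatenated with the user name) and passing them through unchanged; a sketch with an illustrative function name:

```go
package sketch

import (
	"crypto/md5"
	"encoding/hex"
	"strings"
)

const md5prefix = "md5"

// md5Password returns the PostgreSQL MD5 form of a password, leaving values
// that are already encrypted untouched.
func md5Password(username, password string) string {
	// Already encrypted: "md5" prefix followed by a 32-character hex digest.
	if strings.HasPrefix(password, md5prefix) && len(password) == len(md5prefix)+32 {
		return password
	}
	sum := md5.Sum([]byte(password + username))
	return md5prefix + hex.EncodeToString(sum[:])
}
```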
Also, fix a leak in DB connections when a new connection is not
considered healthy and is discarded without being closed. Initialize
the database during the sync phase before syncing users.
Run operations concerning multiple clusters in parallel. Each cluster gets its
own worker in order to create, update, sync or delete it. Each worker
acquires a lock on the cluster; subsequent operations on the same cluster
have to wait until the current one finishes. There is a pool of parallel
workers, configurable with the `workers` parameter in the configmap and set by
default to 4. Cluster-related tasks are assigned to the workers based on
the cluster name: tasks for the same cluster will always be assigned to the
same worker. There is no blocking between workers, although a single worker
may become a bottleneck if too many clusters are assigned to it; therefore,
for large-scale deployments it might be necessary to bump up the number of
workers from the default value.
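A sketch of the name-based assignment; the hash function choice is illustrative, the point is that the same cluster name always maps to the same worker.

```go
package sketch

import "hash/fnv"

// workerID maps a cluster name to one of the workers in the pool.
func workerID(clusterName string, numberOfWorkers uint32) uint32 {
	h := fnv.New32a()
	_, _ = h.Write([]byte(clusterName))
	return h.Sum32() % numberOfWorkers
}
```

Because all events for a given cluster land on the same worker, per-cluster serialization comes for free without any cross-worker locking.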