Commit Graph

258 Commits

Author SHA1 Message Date
Oleksii Kliukin dc36c4ca12 Implement replicaLoadBalancer boolean flag. (#38)
The flag adds a replica service with the name cluster_name-repl and
a DNS name that defaults to {cluster}-repl.{team}.{hostedzone}.

The implementation converted Service field of the cluster into a map
with one or two elements and deals with the cases when the new flag
is changed on a running cluster
(the update and the sync should create or delete the replica service).
In order to pick up master and replica service and master endpoint
when listing cluster resources.

* Update the spec when updating the cluster.
2017-06-07 13:54:17 +02:00
Oleksii Kliukin 7b0ca31bfb Implements EBS volume resizing #35.
In order to support volumes different from EBS and filesystems other than EXT2/3/4 the respective code parts were implemented as interfaces. Adding the new resize for the volume or the filesystem will require implementing the interface, but no other changes in the cluster code itself.

Volume resizing first changes the EBS and the filesystem, and only afterwards is reflected in the Kubernetes "PersistentVolume" object. This is done deliberately to be able to check if the volume needs resizing by peeking at the Size of the PersistentVolume structure. We recheck, nevertheless, in the EBSVolumeResizer, whether the actual EBS volume size doesn't match the spec, since call to the AWS ModifyVolume is counted against the resize limit of once every 6 hours, even for those calls that shouldn't result in an actual resize (i.e. when the size matches the one for the running volume).

As a collateral, split the constants into multiple files, move the volume code into a separate file and fix minor issues related to the error reporting.
2017-06-06 13:53:27 +02:00
Murat Kabilov 1fb05212a9 Refactor teams API package 2017-05-30 10:14:30 +02:00
Murat Kabilov 1111964fee fix password check in pguserpassword
remove magic number
2017-05-26 18:19:12 +02:00
Oleksii Kliukin afce38f6f0 Fix error messages (#27)
Use lowercase for kubernetes objects
Use %v instead of %s for errors
Start error messages with a lowercase letter.
2017-05-22 14:12:06 +02:00
Murat Kabilov d34273543e Fix the golint, gosimple warnings 2017-05-18 17:38:54 +02:00
Murat Kabilov 95a57d1e4f Use named arguments in the DNS name format 2017-05-18 17:23:59 +02:00
Oleksii Kliukin c2826b10e2 Merge branch 'master' into fix/go-vet-fixes 2017-05-17 11:30:07 +02:00
Oleksii Kliukin 4457ce4e47 Replace the statefulset if it cannot be updated. (#18)
Updates to statefulset spec for fields other than 'replicas' and
containers' are forbidden. However, it is possible to delete the old
statefulset without deleting its pods and create the new one, using the
changed specs. The new statefulset shall pick up the orphaned pods.

Change the statefulset's comparison to return the combined effect of
all checks, not just the first non-matching field.
2017-05-17 11:28:21 +02:00
Murat Kabilov 22bcae0784 skip unused variable 2017-05-17 11:15:09 +02:00
Oleksii Kliukin 5adceceb36 go fmt run 2017-05-12 17:48:25 +02:00
Oleksii Kliukin abd04e6f5a Avoid abbreviations in user-facing parameters. 2017-05-12 17:44:51 +02:00
Oleksii Kliukin 03064637f1 Allow disabling access to the DB and the Teams API.
Command-line options --nodatabaseaccess and --noteamsapi disable all
teams api interaction and access to the Postgres database. This is
useful for debugging purposes when the operator runs out of cluster
(with --outofcluster flag).

The same effect can be achieved by setting enable_db_access and/or
enable_teams_api to false.
2017-05-12 17:40:48 +02:00
Murat Kabilov 92d7fbf372 replace github.bus.zalan.do with github.cm/zalando-incubator 2017-05-12 11:50:16 +02:00
Murat Kabilov 28a74622d7 Fix typo in the teams api json spec 2017-05-12 11:41:36 +02:00
Murat Kabilov 18700b9ef7 Optimize template constant 2017-05-12 11:41:36 +02:00
Murat Kabilov fd449342e5 Use Kubernetes API instead of API group 2017-05-12 11:41:36 +02:00
Oleksii Kliukin 6983f444ed Periodically sync roles with the running clusters. (#102)
The sync adds or alters database roles based on the roles defined
in the cluster's TPR, Team API and operator's infrastructure roles.
At the moment, roles are not deleted, as it would be dangerous for
the robot roles in case TPR is misconfigured. In addition, ALTER
ROLE does not remove role options, i.e. SUPERUSER or CREATEROLE,
neither it removes role membership: only new options are added and
new role membership is granted. So far, options like NOSUPERUSER
and NOCREATEROLE won't be handed correctly, when mixed with the
non-negative counterparts, also NOLOGIN should be processed correctly.
The code assumes that only MD5 passwords are stored in the DB and
will likely break with the new SCRAM auth in PostgreSQL 10.

On the implementation side, create the new interface to abstract
roles merge and creation, move most of the role-based functionality
from cluster/pg into the new 'users' module, strip create user code
of special cases related to human-based users (moving them to init
instead) and fixed the password md5 generator to avoid processing
already encrypted passwords. In addition, moved the system roles
off the slice containing all other roles in order to avoid extra
efforts to avoid creating them.

Also, fix a leak in DB connections when the new connection is not
considered healthy and discarded without being closed. Initialize
the database during the sync phase before syncing users.
2017-05-12 11:41:35 +02:00
Martin Linkhorst 411487e66d update annotation for ExternalDNS (#115) 2017-05-12 11:41:35 +02:00
Oleksii Kliukin 49cb395aed Set ELB timeout annotation for the service. (#114)
By default the ELB terminates the idle connection after 60 seconds. Increase this interval to a more reasonable one of 1 h.
2017-05-12 11:41:35 +02:00
Murat Kabilov 2370659c69 Parallel cluster processing
Run operations concerning multiple clusters in parallel. Each cluster gets its
own worker in order to create, update, sync or delete clusters.  Each worker
acquires the lock on a cluster.  Subsequent operations on the same cluster
have to wait until the current one finishes.  There is a pool of parallel
workers, configurable with the `workers` parameter in the configmap and set by
default to 4. The cluster-related tasks  are assigned to the workers based on
a cluster name: the tasks for the same cluster will be always assigned to the
same worker. There is no blocking between workers, although there is a chance
that a single worker will become a bottleneck if too many clusters are
assigned to it; therefore, for large-scale deployments it might be necessary
to bump up workers from the default value.
2017-05-12 11:41:35 +02:00
Oleksii Kliukin 1c4bce86df Avoid "bulk-comparing" pod resources during sync. (#109)
* Avoid "bulk-comparing" pod resources during sync.

First attempt to fix bogus restarts due to the reported mismatch
of container resources where one of the resources is an empty struct,
while the other has all fields set to nil.

In addition, add an ability to set limits and requests per pod, as well as the operator-level defaults.
2017-05-12 11:41:35 +02:00
Murat Kabilov 8026c69222 update default config param values 2017-05-12 11:41:34 +02:00
Murat Kabilov da438aab3a Use ConfigMap to store operator's config 2017-05-12 11:41:34 +02:00
Oleksii Kliukin 47e3e29a56 Add version label to the cluster. (#96)
* Add version label to the cluster.

According to the STUPS team the daemon that exports logs to scalyr
stops the export if the version label is missing.

* Move label names to constants. 

* Run go fmt
2017-05-12 11:41:34 +02:00
Murat Kabilov 08c0e3b6dd Use unified type for the namespaced object names 2017-05-12 11:41:34 +02:00
Oleksii Kliukin 71b93b4cc2 Feature/infrastructure roles (#91)
* Add infrastructure roles configured globally.

Those are the roles defined in the operator itself. The operator's
configuration refers to the secret containing role names, passwords
and membership information. While they are referred to as roles, in
reality those are users.

In addition, improve the regex to filter out invalid users and
make sure user secret names are compatible with DNS name spec.

Add an example manifest for the infrastructure roles.
2017-05-12 11:41:33 +02:00
Murat Kabilov dd2ed5ff9d Add team name to tpr object metadata name 2017-05-12 11:41:33 +02:00
Murat Kabilov 101dc06acb Better logging for teams api calls 2017-05-12 11:41:32 +02:00
Oleksii Kliukin 5b66d0adba Correct go json tags (extra space). 2017-05-12 11:41:32 +02:00
Oleksii Kliukin 3b99ce3d2e Improve the diff in cluster resources.
- Use the branch of pretty with this feature fixed:
  https://github.com/kr/pretty/pull/42
- Add the Limit to the resources declaration to avoid dummy
  differences between statefulsets (where both Resource structures
  are empty, but in one case the fields are not mentioned, while
  in another they are assigned to empty values).
2017-05-12 11:41:32 +02:00
Oleksii Kliukin 455f91128f Move master/replica role names into the constants. 2017-05-12 11:41:32 +02:00
Oleksii Kliukin a5f0ef10d0 go fmt run 2017-05-12 11:41:31 +02:00
Oleksii Kliukin 0764505a10 correct the wal bucket parameter name. 2017-05-12 11:41:31 +02:00
Oleksii Kliukin 7841b85892 Add configuration to support running WAL-E.
- Set WAL_S3_BUCKET to point WAL-E where to fetch/store WAL files
- Set annotations/iam.amazonaws.com/role to set the role to access AWS"

The new env vairables are PGOP_WAL_S3_BUCKET and PGOP_KUBE_IAM_ROLE.
2017-05-12 11:41:31 +02:00
Murat Kabilov 852c5beae5 Check etcd key availability for the new cluster 2017-05-12 11:41:31 +02:00
Oleksii Kliukin 8db44d6f18 Avoid unnecessary marshaling. 2017-05-12 11:41:30 +02:00
Oleksii Kliukin b69b6b26e5 git fmt run 2017-05-12 11:41:30 +02:00
Murat Kabilov 310c119dfa Display config on operator start up 2017-05-12 11:41:30 +02:00
Murat Kabilov a97dfb07de fix struct tag delimiter 2017-05-12 11:41:30 +02:00
Oleksii Kliukin ba8e8d1857 Avoid showing objects alongside diffs.
That reduces the amount of clutter in the debug output.
Run go fmt on the sources.
2017-05-12 11:41:30 +02:00
Oleksii Kliukin 3a4c6268be Increase log verbosity, namely for object updates.
- add a new environment variable for triggering debug log level
- show both new, old object and diff during syncs and updates
- use pretty package to pretty-print go structures
-
2017-05-12 11:41:29 +02:00
Murat Kabilov c2d2a67ad5 Get config from environment variables;
ignore pg major version change;
get rid of resources package;
2017-05-12 11:41:29 +02:00
Murat Kabilov 79a6726d4d Increase logging verbosity, restructure code 2017-05-12 11:41:28 +02:00
Murat Kabilov 3aaa05fb96 Use encrypted passwords while creating robot users 2017-05-12 11:41:28 +02:00
Oleksii Kliukin 48ba6adf8a Avoid calling Team API with an expired token.
Previously, the controller fetched the Oauth token once at start, so eventually the token would expire and the operator could not create new users. This commit makes the operator fetch the token before each call to the Teams API.
2017-05-12 11:41:28 +02:00
Murat Kabilov 6f7399b36f Sync clusters states
* move statefulset creation from cluster spec to the separate function
* sync cluster state with desired state;
* move out from arrays for cluster resources;
* recreate pods instead of deleting them in case of statefulset change
* check for master while creating cluster/updating pods
* simplify retryutil
* list pvc while listing resources
* name kubernetes resources with capital letter
* do rolling update in case of env variables change
2017-05-12 11:41:27 +02:00
Oleksii Kliukin 814f75f7c1 Formatting changes 2017-05-12 11:41:27 +02:00
Oleksii Kliukin 7529b84b93 Move all operator-related constants together. 2017-05-12 11:41:27 +02:00
Oleksii Kliukin 55dbacdfa6 Assign DNS name to the cluster.
DNS name is generated from the team name and cluster name.
Use "zalando.org/dnsname" service annotation that makes 'mate' service assign a CNAME to the load balancer name.
2017-05-12 11:41:27 +02:00
Murat Kabilov 34ac47aed9 Expose container 8080 port 2017-05-12 11:41:26 +02:00
Oleksii Kliukin 776ed3fa0f Simplify getting configuration. 2017-05-12 11:41:25 +02:00
Oleksii Kliukin a2e78ac2ec Feature/persistent volumes 2017-05-12 11:41:25 +02:00
Murat Kabilov ae77fa15e8 Pod Rolling update
introduce Pod events channel;
add parsing of the MaintenanceWindows section;
skip deleting Etcd key on cluster delete;
use external etcd host;
watch for tpr/pods in the namespace of the operator pod only;
2017-05-12 11:41:25 +02:00
Murat Kabilov 6e2d64bd50 Create human users from teams api 2017-05-12 11:37:09 +02:00
Murat Kabilov 58506634c4 Create pg users 2017-05-12 11:37:09 +02:00
Murat Kabilov 7e4d0410c2 Use one secret per user 2017-05-12 11:37:09 +02:00
Murat Kabilov abb1173035 Code refactor 2017-05-12 11:37:09 +02:00