orchard

Commit Graph

Author	SHA1	Message	Date
Nikolay Edigaryev	818f4288c2	Controller API: correctly detect WebSocket closure in Watch RPC (#259 )	2025-02-20 02:00:57 +04:00
Nikolay Edigaryev	61d7d34ea4	RPC v2: fix Ping() hanging due to PONG not being processed (#247 )	2025-02-07 22:05:09 +04:00
Nikolay Edigaryev	8dd74db446	Worker notification improvements (#246 ) * OpenAPI: document all default "wait" values * Re-use waitContext instead of instantiating it anew	2025-02-07 00:38:04 +04:00
Fedor Korotkov	86f0afb5a3	Small timout for worker notification (#242 ) * Small timout for worker notification It seems at the moment if a worker re-establishes notify stream (for example, if network flips or proxy breaks the connection) then we can see "no worker registered with this name" errors. This change makes Notifier to wait for 30 seconds before failing, at the time of calling `Notifier#Notify` we know such worker exists. PS not sure if we need to make the timeout configurable. * Wait via context * Make sure all `context`s for `Notify` is time bounded * Lint issues	2025-02-06 17:30:09 +00:00
Nikolay Edigaryev	26c8808506	Support scheduling by labels (#244 )	2025-02-06 18:05:36 +04:00
Nikolay Edigaryev	581de320b9	Allow creating VMs with implicit CPU and memory (#243 ) * Allow creating VMs with implicit CPU and memory * Clarify why cpu/memory can be 0 a bit better * Controller(API): don't forget to update DefaultCPU and DefaultMemory * Add an integration test for implicit CPU and memory	2025-02-06 00:50:01 +04:00
Nikolay Edigaryev	88fba8004d	Introduce WebSocket-based RPC v2 (#239 ) * Introduce WebSocket-based RPC v2 * go test: add -ldflags="-B gobuildid" * No need to change the "controller.workerNotifier.Notify()" error message * No need to modify Protocol Buffers/gRPC generated code * rpcWatch(): explain that connection shouldn't be normally be closed * Avoid "port forwarding failed: " repetition in error messages * Improve comments and avoid repetition in IP resolution errors	2025-01-30 17:33:32 +04:00
Nikolay Edigaryev	077252f6d4	Prevent goroutine leak when Close()'ing *grpc_net_conn.Conn (#237 )	2025-01-23 18:17:14 +04:00
Nikolay Edigaryev	1fce915d67	API: only overwrite specific worker fields when worker already exists (#236 ) * API: only overwrite specific worker fields when worker already exists * Don't forget to return when creating new worker * Return updated worker when updating the worker	2025-01-16 16:42:17 +04:00
Nikolay Edigaryev	d7b6f477e1	Never list workers in Update()/storeUpdate() transactions (#228 ) * POST /v1/workers: do not list workers in a single update txn * schedulingLoopIteration(): do not list workers in a single update txn * .golangci.yml: remove mentions of fully deprecated linters	2024-12-05 16:59:50 +04:00
Nikolay Edigaryev	d94690176e	Schedule opportunistically and more granularly (#225 ) * Schedule opportunistically and more granularly To avoid transaction conflicts. * Measure scheduling loop iteration duration and log it at debugging level * Use "continue NextWorker" instead of just "continue" for clarity	2024-12-03 14:11:48 +00:00
Nikolay Edigaryev	7fe0414981	"--scheduler-profile" option to allow different orchestration patterns (#224 ) * "--scheduler-profile" option to allow different orchestration patterns * API(cluster settings): provide a default value for scheduler profile	2024-11-28 20:07:46 +04:00
Nikolay Edigaryev	772336a7bd	Scheduler: stop iterating over workers when candidate worker is found (#220 )	2024-11-13 17:59:08 +04:00
Nikolay Edigaryev	60948e14fe	Rendezvous: use a buffered channel of size 1 (#219 ) * Rendezvous: use a buffered channel of size 1 * Fix spelling of "absence" in comment	2024-11-08 11:19:54 +04:00
Nikolay Edigaryev	2a2ddea62a	Controller: emit lifecycle events when the VM gets restarted or deleted (#208 ) * Controller: emit lifecycle events when the VM gets restarted or deleted * vm_{scheduling,run}_time → vm_{scheduling,run}_duration for clarity * Update VM endpoint: only update VM started time when zero	2024-09-24 17:53:10 +04:00
Nikolay Edigaryev	1730eaf67c	orchard controller: make sure that output goes through the logger (#207 ) ...which emits JSON on the production for easier processing.	2024-09-17 22:54:43 +04:00
Mark McWhirter	979af1f699	Expose 2 new metrics about worker health (#203 ) * Expose more metrics about worker health * PR feedback * PR feedback	2024-09-10 10:13:41 -04:00
Nikolay Edigaryev	8aaf05c4f7	controller run: make bootstrap process more user-friendly (#201 ) * controller run: make bootstrap process more user-friendly * Badger: log to zap instead of standard error	2024-09-03 18:54:28 +04:00
Nikolay Edigaryev	cd9794197b	API: update service account fields on PUT (#198 ) * API: update service account fields on PUT * Disable G115 integer overflow linter of gosec	2024-08-21 20:03:52 +04:00
Nikolay Edigaryev	4df43e6432	Default ?wait= to 0 seconds (#190 )	2024-07-03 23:07:14 +04:00
Nikolay Edigaryev	76f192bdb0	API endpoint and associated RPC changes to resolve VMs IP's (#188 ) * API endpoint and associated RPC changes to resolve VMs IP's * Fix "Missing expected argument '<name>'" error when doing "tart set" * Implement TestIPEndpoint() and IP() method in controller HTTP client	2024-07-03 22:56:43 +04:00
Nikolay Edigaryev	8119b22817	orchard controller run: introduce --insecure-ssh-no-client-auth (#187 )	2024-06-28 23:55:18 +04:00
Nikolay Edigaryev	ff0497b1d8	Produce OpenTelemetry metrics (#185 ) * .golangci.yml: remove mentions of deprecated linters * Fix "staticcheck" linter error by using grpc.NewClient * Configure OpenTelemetry Metrics only for now. * Produce OpenTelemetry metrics * Update DeploymentGuide.md Co-authored-by: Fedor Korotkov <fedor.korotkov@gmail.com> * Update DeploymentGuide.md Co-authored-by: Fedor Korotkov <fedor.korotkov@gmail.com> * Introduce "org.cirruslabs.orchard.controller.worker_status" --------- Co-authored-by: Fedor Korotkov <fedor.korotkov@gmail.com>	2024-06-24 18:19:51 +04:00
Nikolay Edigaryev	d59bc7f8a7	Orchard Controller: implement an SSH server that acts as a jump host (#179 ) * proxy.Connections(): require io.ReadWriteCloser instead of net.Conn * Orchard Controller: implement an SSH server that acts as a jump host * Issue a warning if the name used will be invalid in the future * Further restrict uppercase characters in names in the future The rationale is similar to https://github.com/kubernetes/kubernetes/issues/71140. We won't want to munge the user's input and introduce subtle bugs doing lowercase comparisons.	2024-06-11 19:32:45 +04:00
Nikolay Edigaryev	c845f3b2fd	API: do not return null when methods returning a list have no items (#170 ) * API: do not return null when methods returning a list have no items * Use "omitempty" in all API structs	2024-04-29 15:49:09 -04:00
Nikolay Edigaryev	7fb0a85834	API(VM): new image FQN (fully-qualified name) field (#165 )	2024-04-15 20:14:44 +04:00
Nikolay Edigaryev	13b4e192f0	Introduce "orchard {port-forward, vnc} worker WORKER_NAME" (#140 ) * Fix potential NPE in Client.wsRequest() * Introduce "orchard {port-forward, vnc} worker WORKER_NAME" * portspec.go: simplify logic and respect [LOCAL_PORT]:REMOTE_PORT format	2023-10-09 18:51:34 +04:00
Nikolay Edigaryev	40f58e4aee	More RPC-related logs (#136 ) * More RPC-related logs * Notifier should be set before we use it in the scheduler	2023-09-27 20:16:00 +04:00
Nikolay Edigaryev	8c62df0eba	Only allow simple names when creating workers, VMs, etc. and escape paths in API client (#129 ) * Controller: only allow simple names when creating workers, VMs, etc. * Client: escape paths * simplename: allow ':' character	2023-09-22 14:51:43 -04:00
Nikolay Edigaryev	036eb954be	Retry DB transactions on badger.ErrConflict (#114 ) * Log HTTP 500 errors in more detail * Log errors in storeView and storeUpdate * Retry on badger.ErrConflict	2023-08-15 15:18:47 +04:00
Nikolay Edigaryev	6759618f28	orchard create vm: support --image-pull-policy=Always (#110 )	2023-07-26 17:43:14 +04:00
Nikolay Edigaryev	fd88ce5890	Introduce ORCHARD_LICENSE_TIER environment variable (#111 ) * Introduce ORCHARD_LICENSE_TIER environment variable * Only parse ORCHARD_LICENSE_TIER if it was provided	2023-07-26 17:28:38 +04:00
Nikolay Edigaryev	a52c205c34	API(port forward endpoint): handle normal WebSocket closure gracefully (#108 )	2023-07-20 20:55:42 +00:00
Nikolay Edigaryev	d57d18d380	Support for sharing files with the host system (#103 ) * Support for sharing files with the host system * Integration tests * Added back TestVMGarbageCollection comment	2023-07-04 18:10:53 +04:00
Nikolay Edigaryev	6a325daf74	Switch from golang.org/x/net/websocket to nhooyr.io/websocket and handle NotFound errors (#105 ) * Switch from golang.org/x/net/websocket to nhooyr.io/websocket * Do not attach errors that we can handle to the Gin's context * Add missing newline to "no credentials specified or found, ..." message * Fix potential NPE in ChooseUsernameAndPassword() * Fix type in PortForward() error message in "orchard ssh vm" * Fix potential NPE in Connections() * Use header.Set() for consistency's sake for Authorization header	2023-07-04 18:10:41 +04:00
Fedor Korotkov	f6b48b7c42	Change event prefix to preserve order under load (#89 ) * Change event prefix to preserve order under load When there are a lot of events streamed from a worker, it's possible to have two batches coming for the same timestamp (which is a timestamp of the event on the worker). This way the existing logic would mess up the order because `index` and the random number doesn't guarantee the order. To fix this I've changed the format of the prefix for the event to include tro things: 1. Timestamp in nanoseconds of the injection time on the controller so two sequential batches will have guaranteed order unless they are processed within a nanosecond. 2. Made the `index` being fixed length with trailing zeros, so they are properly lexicographically sorted (`000001`, `000002`, ...). * No need to disable linting	2023-06-05 17:01:12 +00:00
Nikolay Edigaryev	60e564da88	Implement restart policy for VMs (#83 ) * Implement restart policy for VMs * Do not update VM.Resource, we only use it as a read-only specification * Err()/setErr(): use atomic.Pointer instead of sync.Mutex	2023-04-24 19:30:08 +04:00
Fedor Korotkov	010df300a3	Add basic Prometheus metrics (#82 ) Fixes #71	2023-04-21 10:05:01 +04:00
Nikolay Edigaryev	06de1094ba	Remove worker role (#77 )	2023-04-12 12:03:24 +04:00
Nikolay Edigaryev	77656517fd	Controller info endpoint and API integration examples (#75 ) * Controller API: introduce controller's information endpoint * Prevent generation of empty events after channel closure * Allow events to be buffered in the events channel * Controller API: introduce controller's information endpoint[1] * IntegrationGuide.md: a couple of Python and Golang examples * Rephrase a sentence Co-authored-by: Fedor Korotkov <fedor.korotkov@gmail.com> --------- Co-authored-by: Fedor Korotkov <fedor.korotkov@gmail.com>	2023-04-11 07:28:46 +00:00
Nikolay Edigaryev	84633d0e45	Introduce "orchard pause" and "orchard resume" commands (#73 )	2023-04-07 22:59:41 +04:00
Nikolay Edigaryev	4eafec99a5	Fail VMs if the worker had crashed/is unhealthy (#70 ) * Fail VMs if the worker had crashed/is unhealthy * OnDiskName: properly handle cases when VM's name contains hyphens * Worker: introduce Offline() method and check it before scheduling * tart.List(): use Tart's JSON output * OnDiskName: remove empty parts check * Scheduler: move health-checking logic to a separate function * Only fail "running" VMs * Only fail orphaned VMs if they're in terminal state * Integration tests * Run healthCheckingLoopIteration() before schedulingLoopIteration() * Worker: sync on-disk VMs only once at start	2023-04-03 16:47:49 +04:00
Fedor Korotkov	f152043f19	Reactive Scheduling (#67 ) Before we had two main loops: controller loop to assign VMs and worker loop to start VMs. Each of the loops was performed upon an interval every N seconds. This change introduces a mechanism for reactively requesting loop execution: 1. Controller loop will be executed upon VM creation to try to immediately schedule. 2. A worker will be notified upon a VM assigment and worker loop will be requested to sync immediately. Fixes #31	2023-03-28 20:51:41 +04:00
Fedor Korotkov	5eaf6b24d4	Make port-forward endpoint to wait for the VM (#65 ) * Make port-forward endpoint to wait for the VM Fixes #62 * Fixes after rebase	2023-03-27 23:52:21 +04:00
Nikolay Edigaryev	357a042937	REST API: provide error messages in error responses (#66 ) * REST API: provide error messages in error responses * Fix role checking logic and add tests * Ignore testpackage linter error * Rename NewError() to NewErrorResponse()	2023-03-27 14:12:03 -04:00
Nikolay Edigaryev	cb39836ee0	Resources support (#63 ) * Resources support * Ability to provide VM and worker resources via the CLI * orchard dev: always listen on :6120 * orchard dev: support --resources * REST API: provide resource defaults when creating VM * OpenAPI: document "resources" field * orchard dev: serve Swagger API documentation on /v1/ * Integration guide	2023-03-27 17:30:54 +04:00
Nikolay Edigaryev	7647ccdc10	Remove Generation field (#57 )	2023-03-24 17:23:07 +00:00
Nikolay Edigaryev	49753ebf4c	Tests: use separate controller listening ports to prevent conflicts (#58 )	2023-03-24 17:22:58 +00:00
Fedor Korotkov	63ba8b5532	Separate context for `orchard dev` (#56 ) Fixes #51	2023-03-24 13:10:35 -04:00
Fedor Korotkov	362ea85b4f	Always require a client for running a worker (#52 ) * Always require a client for running a worker * Actually validate roles * Delete worker Fixes #46 * Update internal/worker/worker.go Co-authored-by: Nikolay Edigaryev <edigaryev@gmail.com> --------- Co-authored-by: Nikolay Edigaryev <edigaryev@gmail.com>	2023-03-24 17:44:20 +04:00

66 Commits