
66 Commits

Author SHA1 Message Date
Nikolay Edigaryev 818f4288c2
Controller API: correctly detect WebSocket closure in Watch RPC (#259) 2025-02-20 02:00:57 +04:00
Nikolay Edigaryev 61d7d34ea4
RPC v2: fix Ping() hanging due to PONG not being processed (#247) 2025-02-07 22:05:09 +04:00
Nikolay Edigaryev 8dd74db446
Worker notification improvements (#246)
* OpenAPI: document all default "wait" values

* Re-use waitContext instead of instantiating it anew
2025-02-07 00:38:04 +04:00
Fedor Korotkov 86f0afb5a3
Small timeout for worker notification (#242)
* Small timeout for worker notification

At the moment, if a worker re-establishes its notify stream (for example, when the network flips or a proxy breaks the connection), we can see "no worker registered with this name" errors.

This change makes the Notifier wait for 30 seconds before failing, since at the time of calling `Notifier#Notify` we know that such a worker exists.

PS not sure if we need to make the timeout configurable.

* Wait via context

* Make sure all `context`s for `Notify` are time-bounded

* Lint issues
2025-02-06 17:30:09 +00:00
Nikolay Edigaryev 26c8808506
Support scheduling by labels (#244) 2025-02-06 18:05:36 +04:00
Nikolay Edigaryev 581de320b9
Allow creating VMs with implicit CPU and memory (#243)
* Allow creating VMs with implicit CPU and memory

* Clarify why cpu/memory can be 0 a bit better

* Controller(API): don't forget to update DefaultCPU and DefaultMemory

* Add an integration test for implicit CPU and memory
2025-02-06 00:50:01 +04:00
Nikolay Edigaryev 88fba8004d
Introduce WebSocket-based RPC v2 (#239)
* Introduce WebSocket-based RPC v2

* go test: add -ldflags="-B gobuildid"

* No need to change the "controller.workerNotifier.Notify()" error message

* No need to modify Protocol Buffers/gRPC generated code

* rpcWatch(): explain that connection shouldn't normally be closed

* Avoid "port forwarding failed: " repetition in error messages

* Improve comments and avoid repetition in IP resolution errors
2025-01-30 17:33:32 +04:00
Nikolay Edigaryev 077252f6d4
Prevent goroutine leak when Close()'ing *grpc_net_conn.Conn (#237) 2025-01-23 18:17:14 +04:00
Nikolay Edigaryev 1fce915d67
API: only overwrite specific worker fields when worker already exists (#236)
* API: only overwrite specific worker fields when worker already exists

* Don't forget to return when creating new worker

* Return updated worker when updating the worker
2025-01-16 16:42:17 +04:00
Nikolay Edigaryev d7b6f477e1
Never list workers in Update()/storeUpdate() transactions (#228)
* POST /v1/workers: do not list workers in a single update txn

* schedulingLoopIteration(): do not list workers in a single update txn

* .golangci.yml: remove mentions of fully deprecated linters
2024-12-05 16:59:50 +04:00
Nikolay Edigaryev d94690176e
Schedule opportunistically and more granularly (#225)
* Schedule opportunistically and more granularly

To avoid transaction conflicts.

* Measure scheduling loop iteration duration and log it at debugging level

* Use "continue NextWorker" instead of just "continue" for clarity
2024-12-03 14:11:48 +00:00
Nikolay Edigaryev 7fe0414981
"--scheduler-profile" option to allow different orchestration patterns (#224)
* "--scheduler-profile" option to allow different orchestration patterns

* API(cluster settings): provide a default value for scheduler profile
2024-11-28 20:07:46 +04:00
Nikolay Edigaryev 772336a7bd
Scheduler: stop iterating over workers when candidate worker is found (#220) 2024-11-13 17:59:08 +04:00
Nikolay Edigaryev 60948e14fe
Rendezvous: use a buffered channel of size 1 (#219)
* Rendezvous: use a buffered channel of size 1

* Fix spelling of "absence" in comment
2024-11-08 11:19:54 +04:00
Nikolay Edigaryev 2a2ddea62a
Controller: emit lifecycle events when the VM gets restarted or deleted (#208)
* Controller: emit lifecycle events when the VM gets restarted or deleted

* vm_{scheduling,run}_time → vm_{scheduling,run}_duration for clarity

* Update VM endpoint: only update VM started time when zero
2024-09-24 17:53:10 +04:00
Nikolay Edigaryev 1730eaf67c
orchard controller: make sure that output goes through the logger (#207)
...which emits JSON in production for easier processing.
2024-09-17 22:54:43 +04:00
Mark McWhirter 979af1f699
Expose 2 new metrics about worker health (#203)
* Expose more metrics about worker health

* PR feedback

* PR feedback
2024-09-10 10:13:41 -04:00
Nikolay Edigaryev 8aaf05c4f7
controller run: make bootstrap process more user-friendly (#201)
* controller run: make bootstrap process more user-friendly

* Badger: log to zap instead of standard error
2024-09-03 18:54:28 +04:00
Nikolay Edigaryev cd9794197b
API: update service account fields on PUT (#198)
* API: update service account fields on PUT

* Disable G115 integer overflow linter of gosec
2024-08-21 20:03:52 +04:00
Nikolay Edigaryev 4df43e6432
Default ?wait= to 0 seconds (#190) 2024-07-03 23:07:14 +04:00
Nikolay Edigaryev 76f192bdb0
API endpoint and associated RPC changes to resolve VMs IP's (#188)
* API endpoint and associated RPC changes to resolve VMs IP's

* Fix "Missing expected argument '<name>'" error when doing "tart set"

* Implement TestIPEndpoint() and IP() method in controller HTTP client
2024-07-03 22:56:43 +04:00
Nikolay Edigaryev 8119b22817
orchard controller run: introduce --insecure-ssh-no-client-auth (#187) 2024-06-28 23:55:18 +04:00
Nikolay Edigaryev ff0497b1d8
Produce OpenTelemetry metrics (#185)
* .golangci.yml: remove mentions of deprecated linters

* Fix "staticcheck" linter error by using grpc.NewClient

* Configure OpenTelemetry

Metrics only for now.

* Produce OpenTelemetry metrics

* Update DeploymentGuide.md

Co-authored-by: Fedor Korotkov <fedor.korotkov@gmail.com>

* Update DeploymentGuide.md

Co-authored-by: Fedor Korotkov <fedor.korotkov@gmail.com>

* Introduce "org.cirruslabs.orchard.controller.worker_status"

---------

Co-authored-by: Fedor Korotkov <fedor.korotkov@gmail.com>
2024-06-24 18:19:51 +04:00
Nikolay Edigaryev d59bc7f8a7
Orchard Controller: implement an SSH server that acts as a jump host (#179)
* proxy.Connections(): require io.ReadWriteCloser instead of net.Conn

* Orchard Controller: implement an SSH server that acts as a jump host

* Issue a warning if the name used will be invalid in the future

* Further restrict uppercase characters in names in the future

The rationale is similar to https://github.com/kubernetes/kubernetes/issues/71140.

We won't want to munge the user's input and introduce subtle bugs doing
lowercase comparisons.
2024-06-11 19:32:45 +04:00
Nikolay Edigaryev c845f3b2fd
API: do not return null when methods returning a list have no items (#170)
* API: do not return null when methods returning a list have no items

* Use "omitempty" in all API structs
2024-04-29 15:49:09 -04:00
Nikolay Edigaryev 7fb0a85834
API(VM): new image FQN (fully-qualified name) field (#165) 2024-04-15 20:14:44 +04:00
Nikolay Edigaryev 13b4e192f0
Introduce "orchard {port-forward, vnc} worker WORKER_NAME" (#140)
* Fix potential NPE in Client.wsRequest()

* Introduce "orchard {port-forward, vnc} worker WORKER_NAME"

* portspec.go: simplify logic and respect [LOCAL_PORT]:REMOTE_PORT format
2023-10-09 18:51:34 +04:00
Nikolay Edigaryev 40f58e4aee
More RPC-related logs (#136)
* More RPC-related logs

* Notifier should be set before we use it in the scheduler
2023-09-27 20:16:00 +04:00
Nikolay Edigaryev 8c62df0eba
Only allow simple names when creating workers, VMs, etc. and escape paths in API client (#129)
* Controller: only allow simple names when creating workers, VMs, etc.

* Client: escape paths

* simplename: allow ':' character
2023-09-22 14:51:43 -04:00
Nikolay Edigaryev 036eb954be
Retry DB transactions on badger.ErrConflict (#114)
* Log HTTP 500 errors in more detail

* Log errors in storeView and storeUpdate

* Retry on badger.ErrConflict
2023-08-15 15:18:47 +04:00
Nikolay Edigaryev 6759618f28
orchard create vm: support --image-pull-policy=Always (#110) 2023-07-26 17:43:14 +04:00
Nikolay Edigaryev fd88ce5890
Introduce ORCHARD_LICENSE_TIER environment variable (#111)
* Introduce ORCHARD_LICENSE_TIER environment variable

* Only parse ORCHARD_LICENSE_TIER if it was provided
2023-07-26 17:28:38 +04:00
Nikolay Edigaryev a52c205c34
API(port forward endpoint): handle normal WebSocket closure gracefully (#108) 2023-07-20 20:55:42 +00:00
Nikolay Edigaryev d57d18d380
Support for sharing files with the host system (#103)
* Support for sharing files with the host system

* Integration tests

* Added back TestVMGarbageCollection comment
2023-07-04 18:10:53 +04:00
Nikolay Edigaryev 6a325daf74
Switch from golang.org/x/net/websocket to nhooyr.io/websocket and handle NotFound errors (#105)
* Switch from golang.org/x/net/websocket to nhooyr.io/websocket

* Do not attach errors that we can handle to the Gin's context

* Add missing newline to "no credentials specified or found, ..." message

* Fix potential NPE in ChooseUsernameAndPassword()

* Fix type in PortForward() error message in "orchard ssh vm"

* Fix potential NPE in Connections()

* Use header.Set() for consistency's sake for Authorization header
2023-07-04 18:10:41 +04:00
Fedor Korotkov f6b48b7c42
Change event prefix to preserve order under load (#89)
* Change event prefix to preserve order under load

When a lot of events are streamed from a worker, two batches can arrive with the same timestamp (the timestamp of the event on the worker). In that case the existing logic would mess up the order, because `index` and the random number don't guarantee ordering.

To fix this I've changed the format of the event prefix to include two things:

1. Timestamp in nanoseconds of the injection time on the controller, so two sequential batches will have guaranteed order unless they are processed within the same nanosecond.
2. Made the `index` fixed-length with leading zeros, so indices sort properly in lexicographic order (`000001`, `000002`, ...).

* No need to disable linting
2023-06-05 17:01:12 +00:00
Nikolay Edigaryev 60e564da88
Implement restart policy for VMs (#83)
* Implement restart policy for VMs

* Do not update VM.Resource, we only use it as a read-only specification

* Err()/setErr(): use atomic.Pointer instead of sync.Mutex
2023-04-24 19:30:08 +04:00
Fedor Korotkov 010df300a3
Add basic Prometheus metrics (#82)
Fixes #71
2023-04-21 10:05:01 +04:00
Nikolay Edigaryev 06de1094ba
Remove worker role (#77) 2023-04-12 12:03:24 +04:00
Nikolay Edigaryev 77656517fd
Controller info endpoint and API integration examples (#75)
* Controller API: introduce controller's information endpoint

* Prevent generation of empty events after channel closure

* Allow events to be buffered in the events channel

* Controller API: introduce controller's information endpoint[1]

* IntegrationGuide.md: a couple of Python and Golang examples

* Rephrase a sentence

Co-authored-by: Fedor Korotkov <fedor.korotkov@gmail.com>

---------

Co-authored-by: Fedor Korotkov <fedor.korotkov@gmail.com>
2023-04-11 07:28:46 +00:00
Nikolay Edigaryev 84633d0e45
Introduce "orchard pause" and "orchard resume" commands (#73) 2023-04-07 22:59:41 +04:00
Nikolay Edigaryev 4eafec99a5
Fail VMs if the worker had crashed/is unhealthy (#70)
* Fail VMs if the worker had crashed/is unhealthy

* OnDiskName: properly handle cases when VM's name contains hyphens

* Worker: introduce Offline() method and check it before scheduling

* tart.List(): use Tart's JSON output

* OnDiskName: remove empty parts check

* Scheduler: move health-checking logic to a separate function

* Only fail "running" VMs

* Only fail orphaned VMs if they're in terminal state

* Integration tests

* Run healthCheckingLoopIteration() before schedulingLoopIteration()

* Worker: sync on-disk VMs only once at start
2023-04-03 16:47:49 +04:00
Fedor Korotkov f152043f19
Reactive Scheduling (#67)
Previously we had two main loops: a controller loop to assign VMs and a worker loop to start VMs. Each loop ran on a fixed interval, every N seconds.

This change introduces a mechanism for reactively requesting loop execution:

 1. The controller loop will be executed upon VM creation to try to schedule the VM immediately.
 2. A worker will be notified upon a VM assignment, and the worker loop will be requested to sync immediately.

 Fixes #31
2023-03-28 20:51:41 +04:00
Fedor Korotkov 5eaf6b24d4
Make port-forward endpoint to wait for the VM (#65)
* Make port-forward endpoint to wait for the VM

Fixes #62

* Fixes after rebase
2023-03-27 23:52:21 +04:00
Nikolay Edigaryev 357a042937
REST API: provide error messages in error responses (#66)
* REST API: provide error messages in error responses

* Fix role checking logic and add tests

* Ignore testpackage linter error

* Rename NewError() to NewErrorResponse()
2023-03-27 14:12:03 -04:00
Nikolay Edigaryev cb39836ee0
Resources support (#63)
* Resources support

* Ability to provide VM and worker resources via the CLI

* orchard dev: always listen on :6120

* orchard dev: support --resources

* REST API: provide resource defaults when creating VM

* OpenAPI: document "resources" field

* orchard dev: serve Swagger API documentation on /v1/

* Integration guide
2023-03-27 17:30:54 +04:00
Nikolay Edigaryev 7647ccdc10
Remove Generation field (#57) 2023-03-24 17:23:07 +00:00
Nikolay Edigaryev 49753ebf4c
Tests: use separate controller listening ports to prevent conflicts (#58) 2023-03-24 17:22:58 +00:00
Fedor Korotkov 63ba8b5532
Separate context for `orchard dev` (#56)
Fixes #51
2023-03-24 13:10:35 -04:00
Fedor Korotkov 362ea85b4f
Always require a client for running a worker (#52)
* Always require a client for running a worker

* Actually validate roles

* Delete worker

Fixes #46

* Update internal/worker/worker.go

Co-authored-by: Nikolay Edigaryev <edigaryev@gmail.com>

---------

Co-authored-by: Nikolay Edigaryev <edigaryev@gmail.com>
2023-03-24 17:44:20 +04:00