Commit Graph

21 Commits

Author SHA1 Message Date
Nikolay Edigaryev 9092a9f172
Support Vetu virtualization on Linux in addition to Tart on macOS (#419)
* Support Vetu virtualization on Linux in addition to Tart on macOS

* api(portForward): ensure that rendezvousConn is closed

* Re-try SSH connections in integration tests

Because a VM might be still booting.
2026-03-16 11:12:28 +01:00
Nikolay Edigaryev 3fffe5fb74
Replace Prometheus with OpenTelemetry (#413) 2026-02-23 19:01:10 +01:00
Nikolay Edigaryev 76a552bade
Ability to set VM's power state and retrieve backing Tart VM's name (#373)
* Ability to set VM's power state and retrieve backing Tart VM's name

* Validate user-provided "powerState" field

* Introduce TestSpecUpdatePowerStateSuspend

* Introduce TestSpecUpdatePowerStateStopped

* OpenAPI specification: add note about suspended VMs to "tartName" desc.

* Sometimes we need to wait more than 30 seconds
2025-12-02 16:43:17 -05:00
Nikolay Edigaryev 26668f2cbd
orchard controller run: introduce --experimental-disable-db-compression (#336) 2025-08-19 17:31:18 +04:00
Nikolay Edigaryev 39fbbbc2a6
Disable Prometheus metrics by default (#331) 2025-07-17 00:58:13 +04:00
Fedor Korotkov 86f0afb5a3
Small timeout for worker notification (#242)
* Small timeout for worker notification

It seems that, at the moment, if a worker re-establishes its notify stream (for example, if the network flips or a proxy breaks the connection), we can see "no worker registered with this name" errors.

This change makes the Notifier wait for 30 seconds before failing, because at the time of calling `Notifier#Notify` we know that such a worker exists.

P.S. Not sure if we need to make the timeout configurable.

* Wait via context

* Make sure all `context`s for `Notify` are time-bounded

* Lint issues
2025-02-06 17:30:09 +00:00
Nikolay Edigaryev 26c8808506
Support scheduling by labels (#244) 2025-02-06 18:05:36 +04:00
Nikolay Edigaryev 581de320b9
Allow creating VMs with implicit CPU and memory (#243)
* Allow creating VMs with implicit CPU and memory

* Clarify why cpu/memory can be 0 a bit better

* Controller(API): don't forget to update DefaultCPU and DefaultMemory

* Add an integration test for implicit CPU and memory
2025-02-06 00:50:01 +04:00
Nikolay Edigaryev d7b6f477e1
Never list workers in Update()/storeUpdate() transactions (#228)
* POST /v1/workers: do not list workers in a single update txn

* schedulingLoopIteration(): do not list workers in a single update txn

* .golangci.yml: remove mentions of fully deprecated linters
2024-12-05 16:59:50 +04:00
Nikolay Edigaryev d94690176e
Schedule opportunistically and more granularly (#225)
* Schedule opportunistically and more granularly

To avoid transaction conflicts.

* Measure scheduling loop iteration duration and log it at debugging level

* Use "continue NextWorker" instead of just "continue" for clarity
2024-12-03 14:11:48 +00:00
Nikolay Edigaryev 7fe0414981
"--scheduler-profile" option to allow different orchestration patterns (#224)
* "--scheduler-profile" option to allow different orchestration patterns

* API(cluster settings): provide a default value for scheduler profile
2024-11-28 20:07:46 +04:00
Nikolay Edigaryev 772336a7bd
Scheduler: stop iterating over workers when candidate worker is found (#220) 2024-11-13 17:59:08 +04:00
Nikolay Edigaryev 2a2ddea62a
Controller: emit lifecycle events when the VM gets restarted or deleted (#208)
* Controller: emit lifecycle events when the VM gets restarted or deleted

* vm_{scheduling,run}_time → vm_{scheduling,run}_duration for clarity

* Update VM endpoint: only update VM started time when zero
2024-09-24 17:53:10 +04:00
Mark McWhirter 979af1f699
Expose 2 new metrics about worker health (#203)
* Expose more metrics about worker health

* PR feedback

* PR feedback
2024-09-10 10:13:41 -04:00
Nikolay Edigaryev ff0497b1d8
Produce OpenTelemetry metrics (#185)
* .golangci.yml: remove mentions of deprecated linters

* Fix "staticcheck" linter error by using grpc.NewClient

* Configure OpenTelemetry

Metrics only for now.

* Produce OpenTelemetry metrics

* Update DeploymentGuide.md

Co-authored-by: Fedor Korotkov <fedor.korotkov@gmail.com>

* Update DeploymentGuide.md

Co-authored-by: Fedor Korotkov <fedor.korotkov@gmail.com>

* Introduce "org.cirruslabs.orchard.controller.worker_status"

---------

Co-authored-by: Fedor Korotkov <fedor.korotkov@gmail.com>
2024-06-24 18:19:51 +04:00
Nikolay Edigaryev 60e564da88
Implement restart policy for VMs (#83)
* Implement restart policy for VMs

* Do not update VM.Resource, we only use it as a read-only specification

* Err()/setErr(): use atomic.Pointer instead of sync.Mutex
2023-04-24 19:30:08 +04:00
Fedor Korotkov 010df300a3
Add basic Prometheus metrics (#82)
Fixes #71
2023-04-21 10:05:01 +04:00
Nikolay Edigaryev 84633d0e45
Introduce "orchard pause" and "orchard resume" commands (#73) 2023-04-07 22:59:41 +04:00
Nikolay Edigaryev 4eafec99a5
Fail VMs if the worker had crashed/is unhealthy (#70)
* Fail VMs if the worker had crashed/is unhealthy

* OnDiskName: properly handle cases when VM's name contains hyphens

* Worker: introduce Offline() method and check it before scheduling

* tart.List(): use Tart's JSON output

* OnDiskName: remove empty parts check

* Scheduler: move health-checking logic to a separate function

* Only fail "running" VMs

* Only fail orphaned VMs if they're in terminal state

* Integration tests

* Run healthCheckingLoopIteration() before schedulingLoopIteration()

* Worker: sync on-disk VMs only once at start
2023-04-03 16:47:49 +04:00
Fedor Korotkov f152043f19
Reactive Scheduling (#67)
Previously, we had two main loops: a controller loop to assign VMs and a worker loop to start VMs. Each loop ran on an interval, every N seconds.

This change introduces a mechanism for reactively requesting loop execution:

 1. The controller loop will be executed upon VM creation to try to schedule the VM immediately.
 2. A worker will be notified upon a VM assignment, and the worker loop will be requested to sync immediately.

 Fixes #31
2023-03-28 20:51:41 +04:00
Nikolay Edigaryev cb39836ee0
Resources support (#63)
* Resources support

* Ability to provide VM and worker resources via the CLI

* orchard dev: always listen on :6120

* orchard dev: support --resources

* REST API: provide resource defaults when creating VM

* OpenAPI: document "resources" field

* orchard dev: serve Swagger API documentation on /v1/

* Integration guide
2023-03-27 17:30:54 +04:00