maand health_check
Health check verifies workers are reachable, then checks each job. Manifest probes run on active allocations only (disabled=0). health_check commands fan out to non-removed allocations (includes disabled).
Each job may use manifest probes, a custom command, or both:
- Manifest probes: health_check.checks in manifest.json (tcp / http / ssh)
- Custom command: hook_* with executed_on: ["health_check"]
When both are defined, manifest probes run first, then health_check commands (in DB order).
Order when you run maand health_check:
- Worker health: TCP dial to each worker's SSH port (ssh_port in maand.conf, default 22).
- Job health: manifest probes (if any), then health_check command scripts (if any).
Deploy runs job health automatically after restart / job_control for jobs that define one of the above. Deploy does not re-run the worker SSH gate on every job (use maand health_check for that).
CLI
maand health_check [flags]
| Flag | Default | Description |
|---|---|---|
| --jobs | all jobs | Comma-separated job names. Unknown names error. |
| --wait | false | Retry until success. Worker SSH gate: 30 attempts, 1 second apart. Per job: health_check.wait in manifest when set, else same defaults. |
| --verbose | false | Stream command output. |
Examples:
maand health_check
maand health_check --jobs api,worker
maand health_check --jobs api --wait --verbose
Prerequisites
- maand build (allocations and hooks in DB).
- Host tools: python3, bun (if needed), bash, ssh.
- Jobs need a health_check section in manifest.json and/or a command with executed_on: ["health_check"].
If a job has neither probes nor commands, or no non-removed allocations, maand prints a skip line and continues (exit code 0 for that job):
health check skipped: <job> (no allocations)
health check skipped: <job> (no health_check config or commands)
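The targeting and skip rules described so far can be sketched in a few lines of Python. This is an illustration only, not maand's implementation: the Allocation record and its removed field are hypothetical names (the source only confirms that active allocations have disabled=0), and the order of the two skip tests is assumed from the order in which the skip lines are listed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Allocation:
    # Hypothetical record; the real schema lives in maand.db and may differ.
    worker_ip: str
    disabled: int = 0  # 0 = active, 1 = disabled
    removed: int = 0   # 1 = removed (assumed field name)


def probe_targets(allocs: List[Allocation]) -> List[Allocation]:
    """Manifest probes: active allocations only (not removed, disabled=0)."""
    return [a for a in allocs if not a.removed and not a.disabled]


def command_targets(allocs: List[Allocation]) -> List[Allocation]:
    """health_check commands: every non-removed allocation, disabled included."""
    return [a for a in allocs if not a.removed]


def skip_reason(allocs: List[Allocation], has_probes: bool,
                has_commands: bool) -> Optional[str]:
    """Return the reason shown in the skip line, or None when checks should run."""
    if not command_targets(allocs):
        return "no allocations"
    if not has_probes and not has_commands:
        return "no health_check config or commands"
    return None
```

A job whose only allocations are disabled still has command targets, so it is not skipped as long as it defines a health_check command.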
Built-in manifest health (recommended)
Declare probes next to resources.ports. Port names reference manifest keys; maand resolves the assigned port numbers at check time and probes each active allocation at worker_ip:port. Allocations are probed in rollout order, in batches of max_concurrent_upgrades workers. Disabled allocations are skipped for manifest probes (no error).
{
"selectors": ["cassandra"],
"resources": {
"ports": {
"cassandra_cql_port": {},
"cassandra_http_port": {}
}
},
"health_check": {
"checks": [
{ "type": "tcp", "port": "cassandra_cql_port" },
{ "type": "http", "port": "cassandra_http_port", "path": "/metrics", "expect_status": 200 }
],
"timeout_seconds": 5,
"wait": { "attempts": 30, "interval_seconds": 1 }
}
}
| Probe | Fields |
|---|---|
| tcp | port (required) |
| http | port, path (default /), expect_status (default 200), scheme (default http) |
| ssh | command (required): one shell line run on the worker over SSH (no job workspace staging) |
All checks must pass on every active allocation in the batch (AND). Built-in probes need no Python/Bun script. If the job has no active allocations, manifest probes are skipped (the job can still run health_check commands on disabled allocations).
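To make the field defaults in the probe table concrete, here is a small Python sketch that fills them in for one entry of health_check.checks. The function name normalize_check is hypothetical, and treating port as required for http probes is an assumption (the table marks it required only for tcp); maand's own validation may behave differently.

```python
from typing import Any, Dict

# Defaults copied from the probe table; everything else here is illustrative.
HTTP_DEFAULTS = {"path": "/", "expect_status": 200, "scheme": "http"}


def normalize_check(check: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of one checks entry with documented defaults applied."""
    kind = check.get("type")
    if kind == "tcp":
        if "port" not in check:
            raise ValueError("tcp check requires 'port'")
        return dict(check)
    if kind == "http":
        # Assumption: an http probe cannot be resolved without a port name.
        if "port" not in check:
            raise ValueError("http check requires 'port'")
        return {**HTTP_DEFAULTS, **check}  # explicit fields override defaults
    if kind == "ssh":
        if not check.get("command"):
            raise ValueError("ssh check requires 'command'")
        return dict(check)
    raise ValueError(f"unknown check type: {kind!r}")
```

Applied to the Cassandra example, the http check keeps its explicit path and expect_status and gains scheme "http".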
Example ssh probe (systemd on the worker):
{ "type": "ssh", "command": "systemctl is-active cassandra" }
Custom command health (escape hatch)
{
"selectors": ["worker"],
"hooks": {
"hook_health": {
"executed_on": ["health_check"]
}
}
}
Script: _hooks/hook_health.py. Use for cluster readiness (nodetool status, etc.) when manifest probes are not enough. You may combine with manifest probes; commands run after probes pass.
Health-fast workspace: for health_check commands only, maand stages _hooks/hook_<name>.*, embedded maand.py / maand.ts, and certs — not the full job tree (Makefile, templates, etc.).
Worker SSH health
Before job checks, maand health_check dials worker_ip:ssh_port for every worker in the catalog (same rows as maand cat workers). When the catalog is empty, the worker check is skipped and maand prints worker health check skipped: no workers.
Configure in maand.conf:
ssh_port = 22
Output:
worker health check passed
If a worker is unreachable, the check fails immediately, or retries first when --wait is set.
What happens internally
Open maand.db
Begin transaction
kv.Initialize + StartRuntimeAPI (localhost:8080 for hook scripts)
SetupRuntime (healthcheck run context)
CheckWorkers: TCP dial worker_ip:ssh_port for every catalog worker
  (--wait: up to 30 attempts, 1s apart; not controlled by manifest wait)
Resolve job list (--jobs filter or all jobs from DB)
Run up to 16 jobs in parallel (shared transaction)
For each job:
  Skip when no non-removed allocations
  Skip when no manifest probes and no health_check commands
  runChecks (with --wait retry using manifest health_check.wait when set):
    1. Manifest probes (if defined), active allocations only:
       resolve rollout order → batches of max_concurrent_upgrades
       for each batch (sequential):
         all checks × all workers in batch in parallel
    2. health_check commands (if defined), non-removed allocations:
       resolve rollout order → batches of max_concurrent_upgrades
       for each command in DB order:
         for each batch (sequential):
           run hook scripts on batch workers (parallel within batch)
Commit transaction on success (rollback on any failure)
Probes
Built-in probes dial worker_ip:assigned_port from the CLI host (or run ssh probes on the worker). No hook scripts, no batch env vars. Targets active allocations only. Within a batch, every probe runs against every worker in that batch in parallel.
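For readers who want to reproduce a probe by hand, the following Python sketch shows roughly what a tcp and an http probe amount to: a TCP connect, and a GET whose status is compared with expect_status. It is a stand-in under stated assumptions, not maand's implementation. The function names are hypothetical, the 5-second default mirrors timeout_seconds from the manifest example, and how maand itself treats redirects or proxies is not documented here (this sketch follows redirects and bypasses proxies).

```python
import socket
import urllib.error
import urllib.request

# Direct connections only: ignore any http_proxy settings in the environment.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def tcp_probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """True when a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_probe(host: str, port: int, path: str = "/", expect_status: int = 200,
               scheme: str = "http", timeout: float = 5.0) -> bool:
    """True when GET scheme://host:port/path answers with expect_status."""
    url = f"{scheme}://{host}:{port}{path}"
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return resp.status == expect_status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; they may still be expected.
        return err.code == expect_status
    except (urllib.error.URLError, OSError):
        return False
```

Calling tcp_probe("10.0.0.5", 9042) from the CLI host, with a placeholder address and port, approximates what a tcp check does for one allocation.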
Command scripts
For each non-removed allocation in the current batch, maand:
- Stages files under tmp/workers/<ip>/jobs/<job>/ (health-fast: command module, embedded maand.py / maand.ts, certs from KV).
- Runs the script on the CLI host with per-allocation env (ALLOCATION_*, JOB, EVENT, COMMAND, versions, …) plus batch env (BATCH_*, DEPLOY_PHASE=health_check, ROLLOUT_ORDER, …).
One failed allocation fails the job. Multiple failed jobs produce a batch error listing each job.
--wait behavior
When --wait is set:
Worker SSH gate (before any job):
- Up to 30 attempts, 1 second apart (fixed; not read from manifest).
- Success prints:
worker health check passed
Per job (manifest probes, then commands):
- Retry interval and attempt count come from health_check.wait in manifest.json when set; otherwise 30 attempts, 1 second apart.
- On failure, sleep and retry the whole job check (probes then commands).
- Success prints:
health check passed: <job>
- Failure returns HealthCheckError with the last underlying error.
Without --wait, worker and job checks run once; failures return immediately.
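The retry behavior can be pictured as a bounded loop around the whole check. The Python sketch below is a minimal model under assumptions: run_with_wait and the local HealthCheckError class are hypothetical stand-ins (only the error name comes from this page), and the defaults mirror the documented 30 attempts, 1 second apart.

```python
import time
from typing import Callable, Optional


class HealthCheckError(Exception):
    """Local stand-in for the error named in this document."""


def run_with_wait(check: Callable[[], None], attempts: int = 30,
                  interval_seconds: float = 1.0,
                  sleep: Callable[[float], None] = time.sleep) -> int:
    """Call check() until it stops raising; return the attempt that passed.

    After the final failed attempt, raise HealthCheckError chained to the
    last underlying error. No sleep happens after the final attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            check()  # the whole job check: probes, then commands
            return attempt
        except Exception as err:
            last_error = err
            if attempt < attempts:
                sleep(interval_seconds)
    raise HealthCheckError(f"failed after {attempts} attempts") from last_error
```

With attempts=1 the loop degenerates into the one-shot behavior used when --wait is not passed.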
Relationship to deploy
| Context | wait | verbose |
|---|---|---|
| maand health_check | User-controlled (--wait) | User-controlled (--verbose) |
| Deploy after restart / job_control | true (wait for recovery) | true |
Production deploy waits for health to pass after rolling updates; ad-hoc CLI checks can be one-shot unless you pass --wait. Use maand deploy --force to redeploy without a workspace change. See deploy.md.
Relationship to build / demands
- health_check is a valid executed_on value at build time.
- It does not affect deployment_seq unless paired with demands (demands are between jobs; health_check alone does not create edges).
- post_build hooks run at end of build; health_check does not run automatically on build.
Parallelism
| Scope | Limit |
|---|---|
| Jobs in one CLI run | Up to 16 jobs at once (goroutines; one DB transaction) |
| Manifest probes | Active allocations only; max_concurrent_upgrades per batch |
| Command scripts | Non-removed allocations (includes disabled); same batch width and rollout order |
| Within a probe batch | All workers × all checks in parallel |
| Within a command batch | All allocations in the batch in parallel |
Batches within a job run sequentially. Tune burst size with max_concurrent_upgrades in manifest.json — see manifest.md and rolling-deploy.md.
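The batching model (sequential batches, parallel work inside each batch) can be sketched in Python as follows. This is an approximation for intuition only: batches and run_in_batches are hypothetical names, width stands for max_concurrent_upgrades, and maand itself uses goroutines rather than a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def batches(rollout_order: Sequence[str], width: int) -> Iterator[List[str]]:
    """Split workers, already in rollout order, into batches of `width`."""
    if width < 1:
        raise ValueError("width must be at least 1")
    for start in range(0, len(rollout_order), width):
        yield list(rollout_order[start:start + width])


def run_in_batches(rollout_order: Sequence[str], width: int,
                   check_worker: Callable[[str], None]) -> None:
    """Run check_worker on every worker: batches in sequence, workers in parallel."""
    for batch in batches(rollout_order, width):
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            # list() drains the results and re-raises the first failure, so a
            # failed batch stops the run before the next batch starts.
            list(pool.map(check_worker, batch))
```

With five workers and a width of 2, the batches are two, two, and one workers, and the third batch starts only after the first two have finished.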
Errors
| Error | Meaning |
|---|---|
| HealthCheckError | One job failed (wraps probe, command, SSH, or script error). |
| WorkerHealthCheckError | Worker SSH port unreachable (with --wait, after retries). |
| Batch error | Multiple jobs failed; message lists each job. |
| Skip (not an error) | health check skipped: <job> (no allocations) or (no health_check config or commands) |
| jobs not in this bucket: [...] | Bad --jobs name. |
| NotFoundError | Command not registered for health_check on that job. |
Typical usage patterns
After deploy:
maand deploy -b
maand health_check --wait
Single job smoke test:
maand health_check --jobs myservice --verbose
CI gate:
maand health_check --wait && echo OK
Force redeploy without workspace or hash changes:
maand deploy --force --jobs api
Writing health check scripts
Use the same libraries as other events — see hook-api.md:
- Python: maand.py → KV get/put, demands, semaphores.
- Bun: maand.ts → same API.
Scripts should exit 0 when healthy, non-zero when not. Keep checks read-only when possible; mutations go to KV and persist only if you also run deploy or call persist APIs appropriately.
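As a starting point, a read-only Python hook might look like the sketch below. It uses only the standard library and deliberately avoids the maand.py API, whose signatures are documented in hook-api.md rather than here. The helper names are hypothetical, JOB is one of the environment variables listed on this page, and the command passed in is a placeholder to replace with a real readiness check.

```python
import os
import subprocess
import sys
from typing import Sequence


def is_healthy(cmd: Sequence[str], timeout: float = 10.0) -> bool:
    """Run a read-only command; healthy means exit code 0 within the timeout."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as err:
        print(f"check could not run: {err}", file=sys.stderr)
        return False
    if result.returncode != 0:
        print(result.stderr.strip(), file=sys.stderr)
    return result.returncode == 0


def main(cmd: Sequence[str]) -> int:
    """Return the exit code the hook should use: 0 healthy, 1 unhealthy."""
    job = os.environ.get("JOB", "<unknown>")  # set by maand per this page
    ok = is_healthy(cmd)
    print(f"{job}: {'healthy' if ok else 'unhealthy'}")
    return 0 if ok else 1


# In a real hook script, finish with something like:
#     sys.exit(main(["your-readiness-command", "--flag"]))  # placeholder command
```

Keeping the decision in main and the process exit at the very end makes the check easy to unit test without spawning the hook itself.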
Related commands
- deploy.md: automatic health check after rollout.
- hooks.md: events and patterns · hook-api.md: runtime API
- build.md: register commands in the catalog.