`maand health_check`

maand health_check first verifies that every worker is reachable, then checks each job. Manifest probes run on active allocations only (disabled=0). health_check commands fan out to all non-removed allocations, including disabled ones.

Each job may use manifest probes, a custom command, or both:

Manifest probes — health_check.checks in manifest.json (tcp / http / ssh)
Custom command — hook_* with executed_on: ["health_check"]

When both are defined, manifest probes run first, then health_check commands (in DB order).

Order when you run maand health_check:

Worker health — TCP dial to each worker’s SSH port (maand.conf ssh_port, default 22).
Job health — manifest probes (if any), then health_check command scripts (if any).

Deploy runs job health automatically after restart / job_control for jobs that define at least one of the above. Deploy does not re-run the worker SSH gate for each job; run maand health_check when you need that check.

CLI

maand health_check [flags]

| Flag | Default | Description |
| --- | --- | --- |
| `--jobs` | all jobs | Comma-separated job names. Unknown names error. |
| `--wait` | false | Retry until success. Worker SSH gate: 30 attempts, 1 second apart. Per job: `health_check.wait` in manifest when set, else same defaults. |
| `--verbose` | false | Stream command output. |

Examples:

maand health_check
maand health_check --jobs api,worker
maand health_check --jobs api --wait --verbose

Prerequisites

Run maand build first, so that allocations and hooks are registered in the DB.
Host tools: python3, bun (if needed), bash, ssh.
Jobs need a health_check section in manifest.json and/or a command with executed_on: ["health_check"].

If a job has neither probes nor commands, or no non-removed allocations, maand prints a skip line and continues (exit code 0 for that job):

health check skipped: <job> (no allocations)
health check skipped: <job> (no health_check config or commands)

Built-in manifest health (recommended)

Declare probes next to resources.ports. Each probe's port is the name of a key under resources.ports; maand resolves the assigned port number at check time and dials worker_ip:port on each active allocation. Allocations are probed in rollout order, in batches of max_concurrent_upgrades workers. Disabled allocations are skipped for manifest probes without raising an error.

{
  "selectors": ["cassandra"],
  "resources": {
    "ports": {
      "cassandra_cql_port": {},
      "cassandra_http_port": {}
    }
  },
  "health_check": {
    "checks": [
      { "type": "tcp", "port": "cassandra_cql_port" },
      { "type": "http", "port": "cassandra_http_port", "path": "/metrics", "expect_status": 200 }
    ],
    "timeout_seconds": 5,
    "wait": { "attempts": 30, "interval_seconds": 1 }
  }
}

| Probe | Fields |
| --- | --- |
| `tcp` | `port` (required) |
| `http` | `port`, `path` (default `/`), `expect_status` (default `200`), `scheme` (default `http`) |
| `ssh` | `command` (required): one shell line run on the worker over SSH (no job workspace staging) |

All checks must pass on every active allocation in the batch (AND). Built-in probes need no Python/Bun script. If the job has no active allocations, manifest probes are skipped (the job can still run health_check commands on disabled allocations).
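
Before running a build, it can help to lint the health_check section locally. The sketch below checks that every probe has a known type, carries its required field, and names a port that exists under resources.ports. It assumes the manifest shape shown above; `validate_health_checks` is a hypothetical helper, not part of maand, and treating `port` as required for `http` probes is an assumption (the table marks it required only for `tcp`).

```python
# Required fields per probe type. "port" for http is an assumption:
# the probe table lists it without a default but only marks tcp as required.
REQUIRED_FIELDS = {"tcp": ["port"], "http": ["port"], "ssh": ["command"]}


def validate_health_checks(manifest: dict) -> list[str]:
    """Return human-readable problems found in manifest["health_check"]["checks"].

    Hypothetical lint helper; maand's own validation may differ.
    """
    problems = []
    ports = manifest.get("resources", {}).get("ports", {})
    checks = manifest.get("health_check", {}).get("checks", [])
    for i, check in enumerate(checks):
        kind = check.get("type")
        if kind not in REQUIRED_FIELDS:
            problems.append(f"check {i}: unknown type {kind!r}")
            continue
        for field in REQUIRED_FIELDS[kind]:
            if field not in check:
                problems.append(f"check {i}: {kind} probe is missing {field!r}")
        # Port names must reference a key declared under resources.ports.
        port = check.get("port")
        if port is not None and port not in ports:
            problems.append(f"check {i}: port {port!r} is not in resources.ports")
    return problems
```

Load manifest.json with `json.load` and pass the resulting dict in; an empty list means the lint found nothing to report.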

Example ssh probe (systemd on the worker):

{ "type": "ssh", "command": "systemctl is-active cassandra" }

Custom command health (escape hatch)

{
  "selectors": ["worker"],
  "hooks": {
    "hook_health": {
      "executed_on": ["health_check"]
    }
  }
}

Script: _hooks/hook_health.py. Use for cluster readiness (nodetool status, etc.) when manifest probes are not enough. You may combine with manifest probes; commands run after probes pass.

Health-fast workspace: for health_check commands only, maand stages _hooks/hook_<name>.*, embedded maand.py / maand.ts, and certs — not the full job tree (Makefile, templates, etc.).

Worker SSH health

Before job checks, maand health_check dials worker_ip:ssh_port for every worker in the catalog (the same rows as maand cat workers). When the catalog is empty, it skips this step and prints worker health check skipped: no workers.

Configure in maand.conf:

ssh_port = 22

Output:

worker health check passed

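If a worker is unreachable, the command fails with WorkerHealthCheckError: immediately without --wait, or after the retries are exhausted with --wait.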
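
To reproduce the worker gate by hand, a plain TCP dial is enough. This is a minimal sketch of that idea, not maand's implementation; the function names and the 5-second timeout are assumptions chosen for illustration.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def unreachable_workers(worker_ips: list[str], ssh_port: int = 22) -> list[str]:
    """Return the workers whose SSH port did not accept a connection."""
    return [ip for ip in worker_ips if not tcp_reachable(ip, ssh_port)]
```

An empty result from `unreachable_workers` corresponds to the worker health check passed case. The dial only proves the port accepts connections; it does not authenticate over SSH.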

What happens internally

Open maand.db
Begin transaction
kv.Initialize + StartRuntimeAPI (localhost:8080 for hook scripts)
SetupRuntime (healthcheck run context)
CheckWorkers — TCP dial worker_ip:ssh_port for every catalog worker
  (--wait: up to 30 attempts, 1s apart; not controlled by manifest wait)
Resolve job list (--jobs filter or all jobs from DB)
Run up to 16 jobs in parallel (shared transaction)
  For each job:
    Skip when no non-removed allocations
    Skip when no manifest probes and no health_check commands
    runChecks (with --wait retry using manifest health_check.wait when set):
      1. Manifest probes (if defined) — active allocations only:
           resolve rollout order → batches of max_concurrent_upgrades
           for each batch (sequential):
             all checks × all workers in batch in parallel
      2. health_check commands (if defined) — non-removed allocations:
           resolve rollout order → batches of max_concurrent_upgrades
           for each command in DB order:
             for each batch (sequential):
               run hook scripts on batch workers (parallel within batch)
Commit transaction on success (rollback on any failure)
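
The batching in steps 1 and 2 can be sketched as follows: split the rollout order into batches of max_concurrent_upgrades, run batches one after another, and run the workers inside a batch in parallel. The names `make_batches` and `run_in_batches` are hypothetical, and stopping at the first failing batch is an assumption about the failure behavior, not something the flow above states.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def make_batches(workers: list[str], width: int) -> list[list[str]]:
    """Split workers (already in rollout order) into batches of at most width."""
    if width < 1:
        raise ValueError("width must be at least 1")
    return [workers[i:i + width] for i in range(0, len(workers), width)]


def run_in_batches(
    workers: list[str], width: int, check: Callable[[str], bool]
) -> list[str]:
    """Run check on each worker: batches sequentially, workers in a batch in parallel.

    Returns the failed workers of the first failing batch, or [] if all passed.
    """
    for batch in make_batches(workers, width):
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            results = list(pool.map(check, batch))
        failed = [w for w, ok in zip(batch, results) if not ok]
        if failed:
            return failed  # assumption: later batches are not started
    return []
```

With five workers and a width of 2, `make_batches` yields batches of 2, 2, and 1, so at most two workers are checked at any moment.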

Probes

Built-in probes dial worker_ip:assigned_port from the CLI host (or run ssh probes on the worker). No hook scripts, no batch env vars. Targets active allocations only. Within a batch, every probe runs against every worker in that batch in parallel.
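
For intuition, an http probe amounts to one request and a status comparison. The sketch below mirrors the documented fields (path, expect_status, scheme) with their defaults; the function name, the use of GET, and the 5-second timeout are assumptions, and maand's actual probe may behave differently (for example around redirects or TLS verification).

```python
import http.client


def http_probe(
    host: str,
    port: int,
    path: str = "/",
    expect_status: int = 200,
    scheme: str = "http",
    timeout: float = 5.0,
) -> bool:
    """Return True if GET scheme://host:port/path answers with expect_status."""
    conn_cls = (
        http.client.HTTPSConnection if scheme == "https" else http.client.HTTPConnection
    )
    conn = conn_cls(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        # http.client does not follow redirects, so a 3xx is compared as-is.
        return conn.getresponse().status == expect_status
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed response.
        return False
    finally:
        conn.close()
```

For the Cassandra example above, the equivalent call would be `http_probe(worker_ip, assigned_port, "/metrics")`, where both arguments come from the allocation being checked.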

Command scripts

For each non-removed allocation in the current batch, maand:

Stages files under tmp/workers/<ip>/jobs/<job>/ (health-fast: command module, embedded maand.py / maand.ts, certs from KV).
Runs the script on the CLI host with per-allocation env (ALLOCATION_*, JOB, EVENT, COMMAND, versions, …) plus batch env (BATCH_*, DEPLOY_PHASE=health_check, ROLLOUT_ORDER, …).

One failed allocation fails the job. Multiple failed jobs produce a batch error listing each job.

`--wait` behavior

When --wait is set:

Worker SSH gate (before any job):

Up to 30 attempts, 1 second apart (fixed; not read from manifest).
Success prints: worker health check passed

Per job (manifest probes, then commands):

Retry interval and attempt count come from health_check.wait in manifest.json when set; otherwise 30 attempts and 1 second apart.
On failure, sleep and retry the whole job check (probes then commands).
Success prints: health check passed: <job>
Failure returns HealthCheckError with the last underlying error.

Without --wait, worker and job checks run once; failures return immediately.
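
The per-job retry rule can be sketched as below: read attempts and interval from health_check.wait when present, fall back to 30 and 1 second, and re-raise the last error once attempts run out. `resolve_wait` and `retry` are hypothetical names, and falling back field by field when health_check.wait is only partly filled in is an assumption.

```python
import time
from typing import Callable

DEFAULT_ATTEMPTS = 30
DEFAULT_INTERVAL_SECONDS = 1


def resolve_wait(manifest: dict) -> tuple[int, float]:
    """Return (attempts, interval_seconds) from health_check.wait, else defaults."""
    wait = manifest.get("health_check", {}).get("wait") or {}
    return (
        wait.get("attempts", DEFAULT_ATTEMPTS),
        wait.get("interval_seconds", DEFAULT_INTERVAL_SECONDS),
    )


def retry(
    check: Callable[[], None],
    attempts: int,
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call check until it stops raising; re-raise the last error when attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            check()  # the whole job check: probes, then commands
            return
        except Exception as err:
            last_error = err
            if attempt < attempts:
                sleep(interval)  # no sleep after the final attempt
    raise last_error
```

Passing `sleep` as a parameter keeps the loop testable without real delays. A one-shot run without --wait corresponds to calling `check()` once with no retry wrapper.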

Relationship to deploy

| Context | `wait` | `verbose` |
| --- | --- | --- |
| `maand health_check` | User-controlled (`--wait`) | User-controlled (`--verbose`) |
| Deploy after restart / job_control | true (wait for recovery) | true |

Production deploy waits for health to pass after rolling updates; ad-hoc CLI checks can be one-shot unless you pass --wait. Use maand deploy --force to redeploy without a workspace change. See deploy.md.

Relationship to build / demands

health_check is a valid executed_on value at build time.
On its own it does not affect deployment_seq. Ordering edges come from demands between jobs, so health_check changes the sequence only when paired with demands.
post_build hooks run at end of build; health_check does not run automatically on build.

Parallelism

| Scope | Limit |
| --- | --- |
| Jobs in one CLI run | Up to 16 jobs at once (goroutines; one DB transaction) |
| Manifest probes | Active allocations only; `max_concurrent_upgrades` per batch |
| Command scripts | Non-removed allocations (includes disabled); same batch width and rollout order |
| Within a probe batch | All workers × all checks in parallel |
| Within a command batch | All allocations in the batch in parallel |

Batches within a job run sequentially. Tune burst size with max_concurrent_upgrades in manifest.json — see manifest.md and rolling-deploy.md.

Errors

| Error | Meaning |
| --- | --- |
| `HealthCheckError` | One job failed (wraps probe, command, SSH, or script error). |
| `WorkerHealthCheckError` | Worker SSH port unreachable (with `--wait`, after retries). |
| Batch error | Multiple jobs failed; message lists each job. |
| Skip (not an error) | `health check skipped: <job> (no allocations)` or `(no health_check config or commands)` |
| `jobs not in this bucket: [...]` | Bad `--jobs` name. |
| `NotFoundError` | Command not registered for `health_check` on that job. |

Typical usage patterns

After deploy:

maand deploy -b
maand health_check --wait

Single job smoke test:

maand health_check --jobs myservice --verbose

CI gate:

maand health_check --wait && echo OK

Force redeploy without workspace or hash changes:

maand deploy --force --jobs api

Writing health check scripts

Use the same libraries as other events — see hook-api.md:

Python: maand.py → KV get/put, demands, semaphores.
Bun: maand.ts → same API.

Scripts should exit 0 when healthy, non-zero when not. Keep checks read-only when possible; mutations go to KV and persist only if you also run deploy or call persist APIs appropriately.
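
A possible skeleton for _hooks/hook_health.py is sketched below. It uses only the standard library and the JOB environment variable named earlier; it deliberately does not call maand.py, whose API is described in hook-api.md. The helper names and the idea of wrapping a single read-only command are assumptions for illustration.

```python
import os
import subprocess
import sys


def is_healthy(command: list[str]) -> bool:
    """Run a read-only command and treat exit status 0 as healthy."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except OSError as err:
        # The executable is missing or cannot be started.
        print(f"cannot run {command[0]}: {err}", file=sys.stderr)
        return False
    if result.returncode != 0:
        print(result.stderr.strip(), file=sys.stderr)
    return result.returncode == 0


def main(command: list[str]) -> int:
    """Return the exit code the hook should finish with: 0 healthy, 1 otherwise."""
    job = os.environ.get("JOB", "unknown")  # JOB is set per allocation by maand
    healthy = is_healthy(command)
    print(f"{job}: {'healthy' if healthy else 'unhealthy'}")
    return 0 if healthy else 1
```

A real hook would end with `sys.exit(main([...]))`, passing whatever read-only command suits the job, such as the `nodetool status` check mentioned earlier.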

deploy.md — automatic health check after rollout.
hooks.md — events and patterns · hook-api.md — runtime API
build.md — register commands in the catalog.