maand health_check

The health check first verifies that workers are reachable, then checks each job. Manifest probes run on active allocations only (disabled=0). health_check commands fan out to all non-removed allocations, including disabled ones.

Each job may use manifest probes, a custom command, or both. When both are defined, manifest probes run first, then health_check commands (in DB order).

Order when you run maand health_check:

  1. Worker health — TCP dial to each worker’s SSH port (maand.conf ssh_port, default 22).
  2. Job health — manifest probes (if any), then health_check command scripts (if any).

Deploy runs job health checks automatically after restart / job_control for jobs that define at least one of the above. Deploy does not re-run the worker SSH gate for each job; use maand health_check for that.


CLI

maand health_check [flags]

Flag       Default   Description
--jobs     all jobs  Comma-separated job names. Unknown names cause an error.
--wait     false     Retry until success. Worker SSH gate: 30 attempts, 1 second apart. Per job: health_check.wait from the manifest when set, else the same defaults.
--verbose  false     Stream command output.

Examples:

maand health_check
maand health_check --jobs api,worker
maand health_check --jobs api --wait --verbose

Prerequisites

  1. maand build (allocations and hooks in DB).
  2. Host tools: python3, bun (if needed), bash, ssh.
  3. Jobs need a health_check section in manifest.json and/or a command with executed_on: ["health_check"].

If a job has neither probes nor commands, or no non-removed allocations, maand prints a skip line and continues (exit code 0 for that job):

health check skipped: <job> (no allocations)
health check skipped: <job> (no health_check config or commands)
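
The skip rules above can be sketched as a small decision function. This is an illustration only, not maand's actual code: the function name and parameters are hypothetical, and the check order (allocations first, then configuration) is taken from the internal flow described later in this document.

```python
from typing import Optional


def skip_reason(job: str, allocations: int, has_probes: bool, has_commands: bool) -> Optional[str]:
    """Return the skip line for a job, or None when the job should be checked.

    `allocations` is the number of non-removed allocations (hypothetical parameter).
    """
    if allocations == 0:
        return f"health check skipped: {job} (no allocations)"
    if not has_probes and not has_commands:
        return f"health check skipped: {job} (no health_check config or commands)"
    return None


if __name__ == "__main__":
    print(skip_reason("api", 0, True, True))
    print(skip_reason("api", 2, False, False))
    print(skip_reason("api", 2, True, False))
```

A skipped job is not a failure, so a caller built this way would treat a non-None result as "print and continue".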

Declare probes next to resources.ports. Port names reference manifest keys; maand resolves the assigned port numbers at check time and probes each active allocation at worker_ip:port, in rollout order, with max_concurrent_upgrades workers per batch. Disabled allocations are skipped for manifest probes without error.

{
  "selectors": ["cassandra"],
  "resources": {
    "ports": {
      "cassandra_cql_port": {},
      "cassandra_http_port": {}
    }
  },
  "health_check": {
    "checks": [
      { "type": "tcp", "port": "cassandra_cql_port" },
      { "type": "http", "port": "cassandra_http_port", "path": "/metrics", "expect_status": 200 }
    ],
    "timeout_seconds": 5,
    "wait": { "attempts": 30, "interval_seconds": 1 }
  }
}

Probe  Fields
tcp    port (required)
http   port, path (default /), expect_status (default 200), scheme (default http)
ssh    command (required): one shell line run on the worker over SSH (no job workspace staging)
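
To make the defaults in the probe table concrete, here is a minimal sketch of how a check entry might be normalized before it runs. The function name is hypothetical, and treating port as required for http probes is an assumption (the table marks it required only for tcp).

```python
def normalize_check(check: dict) -> dict:
    """Validate one entry of health_check.checks and fill in documented defaults."""
    kind = check.get("type")
    if kind == "tcp":
        if "port" not in check:
            raise ValueError("tcp probe requires 'port'")
        return {"type": "tcp", "port": check["port"]}
    if kind == "http":
        if "port" not in check:  # assumption: http also needs a port name
            raise ValueError("http probe requires 'port'")
        return {
            "type": "http",
            "port": check["port"],
            "path": check.get("path", "/"),
            "expect_status": check.get("expect_status", 200),
            "scheme": check.get("scheme", "http"),
        }
    if kind == "ssh":
        if not check.get("command"):
            raise ValueError("ssh probe requires 'command'")
        return {"type": "ssh", "command": check["command"]}
    raise ValueError(f"unknown probe type: {kind!r}")


if __name__ == "__main__":
    print(normalize_check({"type": "http", "port": "cassandra_http_port"}))
```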

All checks must pass on every active allocation in the batch (AND). Built-in probes need no Python/Bun script. If the job has no active allocations, manifest probes are skipped (the job can still run health_check commands on disabled allocations).

Example ssh probe (systemd on the worker):

{ "type": "ssh", "command": "systemctl is-active cassandra" }

Custom command health (escape hatch)

{
  "selectors": ["worker"],
  "hooks": {
    "hook_health": {
      "executed_on": ["health_check"]
    }
  }
}

Script: _hooks/hook_health.py. Use for cluster readiness (nodetool status, etc.) when manifest probes are not enough. You may combine with manifest probes; commands run after probes pass.

Health-fast workspace: for health_check commands only, maand stages _hooks/hook_<name>.*, embedded maand.py / maand.ts, and certs — not the full job tree (Makefile, templates, etc.).


Worker SSH health

Before job checks, maand health_check dials worker_ip:ssh_port for every worker in the catalog (same rows as maand cat workers). When the catalog is empty, the gate is skipped with the message "worker health check skipped: no workers".

Configure in maand.conf:

ssh_port = 22

Output:

worker health check passed

On failure, the check returns an error immediately, or retries first when --wait is set.
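
As an illustration of what a TCP dial to each worker's SSH port involves, the sketch below uses only the Python standard library. It is not maand's implementation; the function names and the 5-second timeout are assumptions.

```python
import socket
from typing import List


def dial(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unreachable_workers(worker_ips: List[str], ssh_port: int = 22, timeout: float = 5.0) -> List[str]:
    """Return the workers whose SSH port could not be reached (empty list means the gate passes)."""
    return [ip for ip in worker_ips if not dial(ip, ssh_port, timeout)]


if __name__ == "__main__":
    failed = unreachable_workers(["127.0.0.1"], ssh_port=22, timeout=1.0)
    print("worker health check passed" if not failed else f"unreachable: {failed}")
```

The dial only proves that the port accepts connections; it does not authenticate or run anything over SSH.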


What happens internally

Open maand.db
Begin transaction
kv.Initialize + StartRuntimeAPI (localhost:8080 for hook scripts)
SetupRuntime (healthcheck run context)
CheckWorkers — TCP dial worker_ip:ssh_port for every catalog worker
  (--wait: up to 30 attempts, 1s apart; not controlled by manifest wait)
Resolve job list (--jobs filter or all jobs from DB)
Run up to 16 jobs in parallel (shared transaction)
  For each job:
    Skip when no non-removed allocations
    Skip when no manifest probes and no health_check commands
    runChecks (with --wait retry using manifest health_check.wait when set):
      1. Manifest probes (if defined) — active allocations only:
           resolve rollout order → batches of max_concurrent_upgrades
           for each batch (sequential):
             all checks × all workers in batch in parallel
      2. health_check commands (if defined) — non-removed allocations:
           resolve rollout order → batches of max_concurrent_upgrades
           for each command in DB order:
             for each batch (sequential):
               run hook scripts on batch workers (parallel within batch)
Commit transaction on success (rollback on any failure)
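
The batching described in the flow above (rollout order split into batches of max_concurrent_upgrades, batches run sequentially, work inside a batch run in parallel) can be sketched as follows. The function names are hypothetical, and stopping at the first failing batch is an assumption rather than documented behavior.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def batches(rollout_order: List[str], width: int) -> List[List[str]]:
    """Split workers, already in rollout order, into batches of at most `width`."""
    width = max(1, width)
    return [rollout_order[i:i + width] for i in range(0, len(rollout_order), width)]


def run_in_batches(rollout_order: List[str], width: int, check: Callable[[str], bool]) -> List[str]:
    """Run `check` per worker; return the failed workers of the first failing batch."""
    for batch in batches(rollout_order, width):
        # Parallel within the batch; the next batch starts only after this one finishes.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            results = list(pool.map(check, batch))
        failed = [worker for worker, ok in zip(batch, results) if not ok]
        if failed:
            return failed  # assumption: later batches are not attempted
    return []


if __name__ == "__main__":
    workers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    print(batches(workers, 2))
    print(run_in_batches(workers, 2, lambda ip: True))
```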

Probes

Built-in probes dial worker_ip:assigned_port from the CLI host (or run ssh probes on the worker). No hook scripts, no batch env vars. Targets active allocations only. Within a batch, every probe runs against every worker in that batch in parallel.
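
A rough sketch of how port-based probe targets might be expanded for one batch: each port name is resolved to its assigned number, and every check is paired with every worker. The function name and the port numbers in the usage lines are hypothetical; ssh probes are left out because they carry a command instead of a port.

```python
from itertools import product
from typing import Dict, List, Tuple


def probe_targets(checks: List[dict], assigned_ports: Dict[str, int], batch_ips: List[str]) -> List[Tuple[str, str]]:
    """Return (probe type, "worker_ip:port") for every worker x check pair in the batch."""
    targets = []
    for ip, check in product(batch_ips, checks):
        if check["type"] not in ("tcp", "http"):
            continue  # ssh probes have no port to resolve
        port = assigned_ports[check["port"]]  # KeyError here means an unknown port name
        targets.append((check["type"], f"{ip}:{port}"))
    return targets


def batch_passes(results: List[bool]) -> bool:
    """AND semantics: the batch passes only if every probe on every worker passed."""
    return all(results)


if __name__ == "__main__":
    checks = [
        {"type": "tcp", "port": "cassandra_cql_port"},
        {"type": "http", "port": "cassandra_http_port", "path": "/metrics"},
    ]
    ports = {"cassandra_cql_port": 9042, "cassandra_http_port": 8081}  # example numbers only
    for target in probe_targets(checks, ports, ["10.0.0.1", "10.0.0.2"]):
        print(target)
```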

Command scripts

For each non-removed allocation in the current batch, maand:

  1. Stages files under tmp/workers/<ip>/jobs/<job>/ (health-fast: command module, embedded maand.py / maand.ts, certs from KV).
  2. Runs the script on the CLI host with per-allocation env (ALLOCATION_*, JOB, EVENT, COMMAND, versions, …) plus batch env (BATCH_*, DEPLOY_PHASE=health_check, ROLLOUT_ORDER, …).

One failed allocation fails the job. Multiple failed jobs produce a batch error listing each job.


--wait behavior

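When --wait is set:

Worker SSH gate (before any job): up to 30 attempts, 1 second apart. The manifest wait setting does not apply here.

Per job (manifest probes, then commands): attempts and interval come from health_check.wait in the manifest when set; otherwise the same defaults apply (30 attempts, 1 second apart).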

Without --wait, worker and job checks run once; failures return immediately.
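
The retry behavior can be pictured as a small wrapper around any check. This is a sketch under stated assumptions, not maand's code: the function name and the injectable sleep parameter are hypothetical, and the defaults mirror the documented 30 attempts at 1-second intervals.

```python
import time
from typing import Callable


def run_with_wait(
    check: Callable[[], bool],
    wait: bool = False,
    attempts: int = 30,
    interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `check` once, or up to `attempts` times when `wait` is True."""
    tries = max(1, attempts) if wait else 1
    for attempt in range(1, tries + 1):
        if check():
            return True
        if attempt < tries:
            sleep(interval_seconds)  # pause only between attempts, not after the last one
    return False


if __name__ == "__main__":
    # Per-job settings as they would appear under health_check.wait in the manifest.
    manifest_wait = {"attempts": 3, "interval_seconds": 0.1}
    ok = run_with_wait(lambda: False, wait=True, **manifest_wait)
    print("passed" if ok else "failed after retries")
```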


Relationship to deploy

Context                             wait                      verbose
maand health_check                  User-controlled (--wait)  User-controlled (--verbose)
Deploy after restart / job_control  true (wait for recovery)  true

Production deploy waits for health to pass after rolling updates; ad-hoc CLI checks can be one-shot unless you pass --wait. Use maand deploy --force to redeploy without a workspace change. See deploy.md.


Relationship to build / demands


Parallelism

Scope                   Limit
Jobs in one CLI run     Up to 16 jobs at once (goroutines; one DB transaction)
Manifest probes         Active allocations only; max_concurrent_upgrades per batch
Command scripts         Non-removed allocations (includes disabled); same batch width and rollout order
Within a probe batch    All workers × all checks in parallel
Within a command batch  All allocations in the batch in parallel

Batches within a job run sequentially. Tune burst size with max_concurrent_upgrades in manifest.json — see manifest.md and rolling-deploy.md.


Errors

Error                           Meaning
HealthCheckError                One job failed (wraps probe, command, SSH, or script error).
WorkerHealthCheckError          Worker SSH port unreachable (with --wait, after retries).
Batch error                     Multiple jobs failed; message lists each job.
Skip (not an error)             health check skipped: <job> (no allocations) or (no health_check config or commands)
jobs not in this bucket: [...]  Bad --jobs name.
NotFoundError                   Command not registered for health_check on that job.

Typical usage patterns

After deploy:

maand deploy -b
maand health_check --wait

Single job smoke test:

maand health_check --jobs myservice --verbose

CI gate:

maand health_check --wait && echo OK

Force redeploy without workspace or hash changes:

maand deploy --force --jobs api

Writing health check scripts

Use the same libraries as for other events; see hook-api.md.

Scripts should exit 0 when healthy, non-zero when not. Keep checks read-only when possible; mutations go to KV and persist only if you also run deploy or call persist APIs appropriately.
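
As a hedged illustration of the exit-code contract, the sketch below shows what a read-only _hooks/hook_health.py for a Cassandra job could look like using only the Python standard library. It deliberately does not use the maand hook API (see hook-api.md for that). It assumes nodetool is installed where the script runs and can reach the cluster, and the helper names are hypothetical.

```python
import re
import subprocess
import sys
from typing import List

# Node rows in `nodetool status` start with a two-letter code such as UN (Up/Normal) or DN (Down/Normal).
NODE_LINE = re.compile(r"^([UD][NLJM])\s+\S+")


def node_states(status_text: str) -> List[str]:
    """Extract the two-letter state of every node row."""
    states = []
    for line in status_text.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            states.append(match.group(1))
    return states


def is_healthy(status_text: str) -> bool:
    """Healthy means at least one node is listed and all nodes are UN."""
    states = node_states(status_text)
    return bool(states) and all(state == "UN" for state in states)


def main() -> int:
    """Return 0 when healthy, 1 otherwise. Read-only: nothing is mutated."""
    try:
        result = subprocess.run(["nodetool", "status"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"nodetool could not be run: {exc}", file=sys.stderr)
        return 1
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
        return 1
    return 0 if is_healthy(result.stdout) else 1

# In the hook file itself, finish with: sys.exit(main())
```

Keeping the parsing in pure functions makes the script easy to test without a running cluster; only main touches the outside world.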