Debugging deployment issues

Structured checklist for maand deploy failures, skipped jobs, partial rollouts, and worker sync problems. Start with dry-run and cat commands before SSHing to workers.


Quick diagnostic flow

1. maand deploy --dry-run [-b]     # would deploy run? which jobs?
2. maand cat deployments [--jobs J]     # per-allocation rollout state
3. maand cat allocations [--jobs J]
4. maand cat jobs                  # deployment_seq, disabled flag
5. maand build && maand deploy     # refresh plan hashes after workspace edit
6. Worker: /opt/worker/<bucket_id>/worker.json, jobs/<job>/logs/

maand info
maand deploy --dry-run
maand cat deployments --jobs api
maand cat allocations --jobs api
maand cat kv --jobs api

Dry-run first

maand deploy --dry-run
maand deploy -b -n              # build + dry-run
maand deploy --dry-run --jobs api,vault
maand deploy --dry-run --force  # preview forced restart

Dry-run stages locally and refreshes plan hashes; it does not change workers or commit hash promotion. It may not simulate stopping removed or disabled allocations (see deploy.md).

Per allocation, dry-run prints the planned action:

Action   Meaning
start    First deploy on this worker (previous_hash empty)
restart  Full recreate (default policy always, or reload + restart_globs match)
reload   Soft apply (restart_policy: reload and no glob match)
sync     Rsync + promote only (--sync-only or restart_policy: never)
skip     Already promoted

When restart_globs forces a restart, the line includes matched= with the changed paths (for example matched=Makefile,bin/app).
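
The action table can be read as a decision procedure. The sketch below is one way to encode it, not maand's implementation: the function name, its yes/no arguments, and the precedence between rows are assumptions, and --force is not modeled. The error case follows the command reference, which notes that --sync-only fails if a start is required.

```shell
# Hypothetical helper: predict the dry-run action for one allocation.
# usage: planned_action PREVIOUS_HASH PROMOTED POLICY GLOB_MATCH SYNC_ONLY
#   PREVIOUS_HASH  hash promoted on the worker, "" if never deployed
#   PROMOTED       yes|no  staged hash and version already promoted
#   POLICY         always|reload|never  (restart_policy)
#   GLOB_MATCH     yes|no  a changed path matches restart_globs
#   SYNC_ONLY      yes|no  --sync-only was passed
planned_action() {
    previous_hash=$1 promoted=$2 policy=$3 glob_match=$4 sync_only=$5

    if [ -z "$previous_hash" ]; then
        # First deploy on this worker; --sync-only cannot perform a start.
        if [ "$sync_only" = yes ]; then
            echo "error: --sync-only cannot start a new allocation" >&2
            return 1
        fi
        echo start
    elif [ "$promoted" = yes ]; then
        echo skip
    elif [ "$sync_only" = yes ] || [ "$policy" = never ]; then
        echo sync
    elif [ "$policy" = reload ] && [ "$glob_match" != yes ]; then
        echo reload
    else
        # Default policy always, or reload with a restart_globs match.
        echo restart
    fi
}
```

For example, planned_action abc no reload yes no prints restart, which corresponds to a dry-run line carrying matched=.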

Interpret output:

Message                  Meaning
deployment required      At least one job needs rollout
no deployment required   All active allocations promoted (hashes + versions match)
deploy required per job  That job will stage/rsync and run lifecycle (or sync-only)
skip / already promoted  JobNeedsRollout false for that job
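
When scripting around dry-run, note that "no deployment required" contains "deployment required" as a substring, so the negative message has to be tested first. A minimal sketch, assuming the messages appear verbatim in the command output (the function name and exit codes are hypothetical):

```shell
# Hypothetical helper: classify dry-run output read from stdin.
# Exit 0: rollout needed, 1: nothing to deploy, 2: neither message found.
deploy_needed() {
    output=$(cat)
    case $output in
        # Must come first: this text also contains "deployment required".
        *"no deployment required"*) return 1 ;;
        *"deployment required"*)    return 0 ;;
        *)                          return 2 ;;
    esac
}

# Possible use on the maand host:
#   maand deploy --dry-run | deploy_needed && maand deploy
```

Exit code 2 keeps an unrecognized output from being mistaken for either answer.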

Read allocation hash state

maand cat deployments
maand cat deployments --jobs api --active
maand cat deployments --workers 10.0.0.1

Rollout           Meaning                                            Typical action
new               Never promoted                                     First deploy → start
restart           Staged content or version differs from promoted    Lifecycle on deploy: make restart (default always), make reload (reload policy), or sync only (never / --sync-only)
promoted          In sync                                            Skipped unless --force
health_failed     Legacy hash state from prior health-check marking  Fix health, then deploy or deploy --force
disabled          Allocation disabled; stopped; catalog current      Re-enable via disable and drain
disabled_restart  Disabled; catalog has pending content/version      Deploy updates plan; still no restart until active
removed           Soft-deleted allocation                            deploy then gc

Columns current_version / new_version show a version-only rollout: hashes match but versions differ. Column post_deploy_status is pending, success, or failed when the job has post_deploy hooks.
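
The new, restart, and promoted states can be pictured as a comparison of hashes and versions. The sketch below is an illustration under assumptions, not maand's logic: the function and argument names are hypothetical, and the disabled, removed, and health_failed states are left out.

```shell
# Hypothetical helper: derive the rollout state for an active allocation.
# usage: rollout_state PROMOTED_HASH STAGED_HASH CURRENT_VERSION NEW_VERSION
rollout_state() {
    promoted_hash=$1 staged_hash=$2 current_version=$3 new_version=$4

    if [ -z "$promoted_hash" ]; then
        echo new        # never promoted
    elif [ "$promoted_hash" != "$staged_hash" ]; then
        echo restart    # staged content differs from promoted
    elif [ "$current_version" != "$new_version" ]; then
        echo restart    # version-only rollout: hashes match, versions differ
    else
        echo promoted   # in sync
    fi
}
```

For example, rollout_state h1 h1 1.0 1.1 prints restart even though the content hash is unchanged.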


Common symptoms

deploy: skip job "..." (deploy complete on all allocations)

Cause                                                                     Fix
No workspace change since last successful deploy (rollout + post_deploy)  Expected; edit job or use --force
Edited workspace but only ran deploy                                      Run maand build then maand deploy
Version bump without content change                                       build sets new_version; deploy should restart. Check cat deployments for version mismatch
TLS cert expired or expiring soon                                         maand cat certs. An expired CA fails maand build; an expiring CA warns on stderr; leaf certs auto-renew on build, then run maand deploy. See certs.md
Job fully disabled                                                        No active allocations; enable or use --jobs on active jobs only

Job not in deploy wave at all

maand cat jobs                    # job exists? deployment_seq?
maand cat allocations --jobs J    # any active rows?

Cause                              Fix
No matching workers                Fix selectors vs worker_labels; check workers.json
Job filtered out                   deploy --jobs list omits it
Lower deployment_seq jobs failing  Fix earlier wave; deploy processes seq 0 first
pre_deploy failed                  Check hook logs; job skipped from staging this run

pre_deploy / post_deploy / hook failure

Hooks run on the CLI host (Python/Bun). Failures return deploy failed with the job name.

maand cat hooks --jobs api
maand hooks hook_migrate api --verbose   # reproduce cli event

Event                                                On failure
pre_deploy                                           Job not staged this run; others continue
post_deploy                                          Job deploy fails; post_deploy_status = failed on all non-removed allocations; content may already be promoted; re-run maand deploy (step 4 when rollout complete)
after_allocation_started / after_allocation_stopped  Blocks promote (and reconcile cleanup for stop) for that batch; worker may already have restarted/stopped; fix hook and retry
job_control                                          Entire job deploy path fails

KV written during hooks commits with deploy on success; rolls back if deploy aborts before commit.

SSH / rsync errors

# From maand host
ssh -i secrets/worker.key agent@10.0.0.1 true

Symptom                       Check
Connection refused / timeout  Firewall, worker down, wrong IP
Permission denied             secrets/worker.key authorized on worker
rsync / sudo errors           maand.conf: use_sudo, ssh_user; worker needs rsync, make, python3, timeout
Host prerequisites            CLI needs bash, ssh, rsync, python3 (and bun if .ts/.js commands exist)
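
The prerequisite and connectivity checks above can be scripted. A minimal sketch: missing_tools and probe_worker are hypothetical names, the defaults mirror the example command (take the real user from ssh_user in maand.conf), and BatchMode and ConnectTimeout are standard OpenSSH options added so the probe never prompts or hangs.

```shell
# Hypothetical helper: print each named tool that is not on PATH.
# Exit status is 1 if any tool is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return $rc
}

# Hypothetical helper: probe one worker non-interactively.
# usage: probe_worker IP [SSH_USER] [KEY_FILE]
probe_worker() {
    ip=$1 user=${2:-agent} key=${3:-secrets/worker.key}
    ssh -i "$key" -o BatchMode=yes -o ConnectTimeout=5 "$user@$ip" true
}

# On the maand host:
#   missing_tools bash ssh rsync python3
#   probe_worker 10.0.0.1
```

The same missing_tools call with rsync make python3 timeout covers the worker-side list when run on the worker itself.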

Deploy logs "deploy: removed worker X unreachable, assuming dead" for off-catalog removed allocations; this is usually safe to ignore.

health check failed

maand health_check --jobs api --wait --verbose

Cause                   Fix
Probe not ready yet     Increase wait; fix startup order / deployment_seq
Command health failed   Fix script; then deploy or deploy --force
Wrong port in manifest  maand cat ports --jobs api

worker.json / update_seq mismatch (maand job only)

maand job status api    # fails sync check
maand deploy            # refreshes worker.json on workers

Deploy does not require this check; maand job does.

Partial deploy (some jobs succeeded, others failed)

Deploy commits successful jobs. A failed job may be partially promoted per batch (earlier batches promoted before a health failure) or may show post_deploy_status = failed after promotion.

maand cat deployments        # promoted vs restart vs post_deploy_status per allocation
maand deploy            # retries rollout, post_deploy-only, or skips complete jobs

Fix the failing job, then re-run maand deploy; fully complete jobs are skipped.

Template / KV errors during stage

maand cat kv --jobs api
maand cat kv get vars/job/api mykey

Error                        Fix
get / getSecret missing key  Run hook that writes KV (post_build, pre_deploy) or maand hooks
Template panic               Allowed namespaces only; see templates.md

Disabled / re-enable surprises

Symptom                      Fix
Job stopped after disable    Expected; see disable and drain
Re-enabled job not starting  Run maand build after clearing disabled.json, then deploy
KV missing on disabled job   Should be retained; if gone, check whether allocations were removed rather than disabled
cat kv --jobs J empty        Job may be fully removed (no non-removed allocations)

Removed allocation / GC

maand build && maand deploy && maand gc
maand cat allocations --jobs api

Removed rows: hash cleared on deploy; worker jobs/<job>/ tree deleted on gc. See gc.md.


Worker-side inspection

Paths on the worker (take <bucket_id> from maand info):

/opt/worker/<bucket_id>/worker.json
/opt/worker/<bucket_id>/jobs.json
/opt/worker/<bucket_id>/jobs/<job>/Makefile
/opt/worker/<bucket_id>/jobs/<job>/data/
/opt/worker/<bucket_id>/jobs/<job>/logs/
/opt/worker/<bucket_id>/bin/runner.py

maand run_command "cat /opt/worker/<bucket_id>/worker.json" --workers 10.0.0.1
maand run_command "ls -la /opt/worker/<bucket_id>/jobs/api/" --workers 10.0.0.1

Compare update_seq in worker.json with maand info.
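
The comparison can be scripted once the expected value has been read from maand info. The sketch below assumes update_seq is stored as a plain integer in worker.json; the file layout is not documented here, the helper names are hypothetical, and any extra text that maand run_command prints around the file contents may need filtering first.

```shell
# Hypothetical helper: print the first integer "update_seq" value
# found in JSON text on stdin.
update_seq_of() {
    sed -n 's/.*"update_seq"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: compare the worker's value with the expected one.
# usage: check_update_seq EXPECTED < worker.json
check_update_seq() {
    expected=$1
    actual=$(update_seq_of)
    if [ "$actual" = "$expected" ]; then
        echo "in sync: update_seq=$actual"
    else
        echo "mismatch: worker has '$actual', expected '$expected'" >&2
        return 1
    fi
}

# Possible use (42 stands for the value shown by maand info):
#   maand run_command "cat /opt/worker/<bucket_id>/worker.json" --workers 10.0.0.1 | check_update_seq 42
```

On a mismatch, maand deploy refreshes worker.json on the workers.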


Staging directory (maand host)

During deploy, inspect rendered trees before rsync:

tmp/workers/<worker_ip>/jobs/<job>/
tmp/workers/<worker_ip>/worker.json

If deploy fails mid-run, this directory is removed when the command exits.
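
Because the staging tree disappears when the command exits, one option is to copy it aside from a second shell while deploy is still running. A minimal sketch, assuming it runs from the directory that contains tmp/; the function name and the staging-snapshots default are hypothetical, and whether the tree is still present at the moment of copying depends on timing.

```shell
# Hypothetical helper: copy one staged job tree to a safe location.
# usage: snapshot_staging WORKER_IP JOB [DEST_ROOT]
snapshot_staging() {
    worker_ip=$1 job=$2 dest_root=${3:-staging-snapshots}
    src="tmp/workers/$worker_ip/jobs/$job"

    if [ ! -d "$src" ]; then
        echo "no staged tree at $src" >&2
        return 1
    fi

    dest="$dest_root/$worker_ip/$job"
    # "$src/." copies the directory contents, hidden files included.
    mkdir -p "$dest" && cp -R "$src/." "$dest/" && echo "$dest"
}

# Possible use from a second shell during deploy:
#   snapshot_staging 10.0.0.1 api
```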


Logging

Structured bucket logs: logs/<worker_ip>.log, logs/maand.log, and per-invocation logs/runs/<run_id>/.

maand logs show --worker 10.48.200.3 --job postgres --format human
maand logs show --event deploy_skip --format human

Full reference (formats, terminal output, flags): logging.


Command reference for debugging

Goal                             Command
Bucket summary                   maand info
Plan deploy                      maand deploy --dry-run
Hash / version state             maand cat deployments
Allocation flags                 maand cat allocations
Deploy order                     maand cat jobs
KV for templates/hooks           maand cat kv --jobs <job>
Ports                            maand cat ports --jobs <job>
Reproduce hook                   maand hooks <cmd> [job] --verbose
Filter bucket logs               maand logs show --worker <ip> --job <job> --format human
Health only                      maand health_check --jobs <job> --wait --verbose
Force reroll                     maand deploy --force --jobs <job>
Config-only push (no lifecycle)  maand deploy --sync-only --jobs <job> (fails if start required)

Escalation checklist

  1. maand build succeeded after last workspace edit?
  2. maand cat deployments: expected rollout state per worker?
  3. maand deploy --dry-run: job listed as required?
  4. deployment_seq: blocked by an earlier job?
  5. disabled.json: unintended drain?
  6. pre_deploy / post_deploy logs clean?
  7. Worker prerequisites and SSH from CLI host?
  8. Health probes passing with --wait?
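
When escalating, it can help to attach the command output behind items 2, 3, 4, and 8. The sketch below captures it into one directory using only commands listed in this document; capture, collect_escalation, and the output file names are hypothetical, and items 5 to 7 (disabled.json, hook logs, SSH) still need a manual look. Note that the dry-run step refreshes plan hashes locally.

```shell
# Hypothetical helper: run a command, saving stdout and stderr to
# DIR/NAME.txt. A failing command is recorded in DIR/failed.txt rather
# than aborting the collection.
# usage: capture DIR NAME COMMAND [ARGS...]
capture() {
    dir=$1 name=$2
    shift 2
    "$@" > "$dir/$name.txt" 2>&1 || echo "$name: $*" >> "$dir/failed.txt"
}

# Hypothetical helper: collect diagnostics for one job.
# usage: collect_escalation JOB [OUT_DIR]
collect_escalation() {
    job=$1 out=${2:-escalation-$job}
    mkdir -p "$out" || return 1

    capture "$out" info        maand info
    capture "$out" dry-run     maand deploy --dry-run --jobs "$job"
    capture "$out" deployments maand cat deployments --jobs "$job"
    capture "$out" allocations maand cat allocations --jobs "$job"
    capture "$out" jobs        maand cat jobs
    capture "$out" health      maand health_check --jobs "$job" --wait --verbose

    echo "$out"
}

# Possible use on the maand host:
#   collect_escalation api
```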