Debugging deployment issues
Structured checklist for maand deploy failures, skipped jobs, partial rollouts, and worker sync problems. Start with dry-run and cat commands before SSHing to workers.
Quick diagnostic flow
1. maand deploy --dry-run [-b] # would deploy run? which jobs?
2. maand cat deployments [--jobs J] # per-allocation rollout state
3. maand cat allocations [--jobs J]
4. maand cat jobs # deployment_seq, disabled flag
5. maand build && maand deploy # refresh plan hashes after workspace edit
6. Worker: /opt/worker/<bucket_id>/worker.json, jobs/<job>/logs/
maand info
maand deploy --dry-run
maand cat deployments --jobs api
maand cat allocations --jobs api
maand cat kv --jobs api
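The read-only checks above can be wrapped in a small helper so they always run in the same order for one job. This is a minimal sketch, assuming a POSIX shell; the function name `diagnose_job` and the `MAAND` override variable are hypothetical and not part of maand. Only subcommands already listed in this section are used.

```shell
#!/bin/sh
# Hypothetical helper: run the diagnostic commands from this section for one job.
# MAAND may be overridden to point at another binary (an assumption, handy for testing).
MAAND="${MAAND:-maand}"

diagnose_job() {
    job="$1"
    if [ -z "$job" ]; then
        echo "usage: diagnose_job <job>" >&2
        return 2
    fi
    # None of these commands changes workers; dry-run only stages locally.
    for args in "info" "deploy --dry-run" \
                "cat deployments --jobs $job" \
                "cat allocations --jobs $job" \
                "cat kv --jobs $job"; do
        echo "== maand $args"
        # Word splitting of $args is intentional here.
        $MAAND $args || echo "   (exit $?)" >&2
    done
}
```

Calling `diagnose_job api` prints each command under a `== maand ...` banner and keeps going when one of them fails, so a single broken step does not hide the rest of the picture.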
Dry-run first
maand deploy --dry-run
maand deploy -b -n # build + dry-run
maand deploy --dry-run --jobs api,vault
maand deploy --dry-run --force # preview forced restart
Dry-run stages locally and refreshes plan hashes; it does not change workers or commit hash promotion. It may not simulate stop of removed/disabled allocations (see deploy.md).
Per allocation, dry-run prints the planned action:
| Action | Meaning |
|---|---|
| start | First deploy on this worker (previous_hash empty) |
| restart | Full recreate — default policy always, or reload + restart_globs match |
| reload | Soft apply — restart_policy: reload and no glob match |
| sync | Rsync + promote only (--sync-only or restart_policy: never) |
| skip | Already promoted |
When restart_globs forces a restart, the line includes matched= with the changed paths (for example matched=Makefile,bin/app).
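When many allocations are listed, it can help to pull out only the paths that triggered a restart. The sketch below assumes nothing about the dry-run line format beyond the `matched=` token described above, and assumes the path list contains no whitespace; the function name `matched_paths` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read dry-run output on stdin and print each path
# that appears in a matched=... token, one per line, de-duplicated.
matched_paths() {
    # Keep only the matched=... tokens, drop the prefix, split on commas.
    grep -o 'matched=[^[:space:]]*' \
        | sed 's/^matched=//' \
        | tr ',' '\n' \
        | LC_ALL=C sort -u
}
```

A possible use is `maand deploy --dry-run --jobs api | matched_paths`, which prints nothing when no `restart_globs` pattern matched.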
Interpret output:
| Message | Meaning |
|---|---|
| deployment required | At least one job needs rollout |
| no deployment required | All active allocations promoted (hashes + versions match) |
| deploy required per job | That job will stage/rsync and run lifecycle (or sync-only) |
| skip / already promoted | JobNeedsRollout false for that job |
Read allocation hash state
maand cat deployments
maand cat deployments --jobs api --active
maand cat deployments --workers 10.0.0.1
| Rollout | Meaning | Typical action |
|---|---|---|
| new | Never promoted | First deploy → start |
| restart | Staged content or version differs from promoted | Lifecycle on deploy: make restart (default always), make reload (reload policy), or sync only (never / --sync-only) |
| promoted | In sync | Skipped unless --force |
| health_failed | Legacy hash state from prior health-check marking | Fix health, then deploy or deploy --force |
| disabled | Allocation disabled; stopped; catalog current | Re-enable via disable and drain |
| disabled_restart | Disabled; catalog has pending content/version | Deploy updates plan; still no restart until active |
| removed | Soft-deleted allocation | deploy then gc |
Columns current_version / new_version: version-only rollout when hashes match but versions differ. Column post_deploy_status: pending, success, or failed when the job has post_deploy hooks.
Common symptoms
deploy: skip job "..." (deploy complete on all allocations)
| Cause | Fix |
|---|---|
| No workspace change since last successful deploy (rollout + post_deploy) | Expected; edit job or use --force |
| Edited workspace but only ran deploy | Run maand build then maand deploy |
| Version bump without content change | build sets new_version; deploy should restart — check cat deployments for version mismatch |
| TLS cert expired or expiring soon | Run maand cat certs. An expired CA fails maand build; an expiring CA warns on stderr; leaf certs auto-renew on build. Then run maand deploy. See certs.md |
| Job fully disabled | No active allocations; enable or use --jobs on active jobs only |
Job not in deploy wave at all
maand cat jobs # job exists? deployment_seq?
maand cat allocations --jobs J # any active rows?
| Cause | Fix |
|---|---|
| No matching workers | Fix selectors vs worker_labels; check workers.json |
| Job filtered out | deploy --jobs list omits it |
| Lower deployment_seq jobs failing | Fix earlier wave; deploy processes seq 0 first |
| pre_deploy failed | Check hook logs; job skipped from staging this run |
pre_deploy / post_deploy / hook failure
Hooks run on the CLI host (Python/Bun). Failures return deploy failed with the job name.
maand cat hooks --jobs api
maand hooks hook_migrate api --verbose # reproduce cli event
| Event | On failure |
|---|---|
| pre_deploy | Job not staged this run; others continue |
| post_deploy | Job deploy fails; post_deploy_status = failed on all non-removed allocations; content may already be promoted; re-run maand deploy (step 4 when rollout complete) |
| after_allocation_started / after_allocation_stopped | Blocks promote (and reconcile cleanup for stop) for that batch; worker may already have restarted/stopped; fix hook and retry |
| job_control | Entire job deploy path fails |
KV written during hooks commits with deploy on success; rolls back if deploy aborts before commit.
SSH / rsync errors
# From maand host
ssh -i secrets/worker.key agent@10.0.0.1 true
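To run the same check across several workers without hanging on a dead host, a loop along these lines may help. This is a sketch: the function name `check_ssh` is hypothetical, the key path and `agent` user are copied from the example above (substitute the `ssh_user` from maand.conf if it differs), and `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```shell
#!/bin/sh
# Hypothetical helper: report which workers accept the key non-interactively.
check_ssh() {
    failed=0
    for ip in "$@"; do
        # BatchMode=yes fails instead of prompting for a password;
        # ConnectTimeout bounds the wait on unreachable hosts.
        if ssh -i secrets/worker.key -o BatchMode=yes -o ConnectTimeout=5 \
               "agent@$ip" true 2>/dev/null; then
            echo "ok   $ip"
        else
            echo "FAIL $ip"
            failed=1
        fi
    done
    # Non-zero when at least one worker could not be reached.
    return "$failed"
}
```

For example, `check_ssh 10.0.0.1 10.0.0.2` prints one line per worker and returns non-zero if any of them failed, which narrows the problem to specific hosts before looking at the table below.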
| Symptom | Check |
|---|---|
| Connection refused / timeout | Firewall, worker down, wrong IP |
| Permission denied | secrets/worker.key authorized on worker |
| rsync / sudo errors | maand.conf: use_sudo, ssh_user; worker needs rsync, make, python3, timeout |
| Host prerequisites | CLI needs bash, ssh, rsync, python3 (and bun if .ts/.js commands exist) |
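Missing tools can be ruled out quickly with `command -v`. A minimal sketch follows; the function name `missing_tools` is hypothetical, and the tool lists in the comments are taken from the table above.

```shell
#!/bin/sh
# Hypothetical helper: print each named tool that is not on PATH.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    # Non-zero when at least one tool is missing.
    return "$rc"
}

# On the CLI host (add bun if .ts/.js commands exist):
#   missing_tools bash ssh rsync python3
# Tools the worker needs, to be checked in a shell on the worker:
#   missing_tools rsync make python3 timeout
```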
Deploy logs "deploy: removed worker X unreachable, assuming dead" for off-catalog removed allocations; this is usually safe to ignore.
health check failed
maand health_check --jobs api --wait --verbose
| Cause | Fix |
|---|---|
| Probe not ready yet | Increase wait; fix startup order / deployment_seq |
| Command health failed | Fix script; then deploy or deploy --force |
| Wrong port in manifest | maand cat ports --jobs api |
worker.json / update_seq mismatch (maand job only)
maand job status api # fails sync check
maand deploy # refreshes worker.json on workers
Deploy does not require this check; maand job does.
Partial deploy (some jobs succeeded, others failed)
Deploy commits successful jobs. Failed jobs may have partial per-batch promote (earlier batches promoted before health failure) or post_deploy_status = failed after promote.
maand cat deployments # promoted vs restart vs post_deploy_status per allocation
maand deploy # retries rollout, post_deploy-only, or skips complete jobs
Fix the failing job, then re-run maand deploy; fully complete jobs are skipped.
Template / KV errors during stage
maand cat kv --jobs api
maand cat kv get vars/job/api mykey
| Error | Fix |
|---|---|
| get / getSecret missing key | Run hook that writes KV (post_build, pre_deploy) or maand hooks |
| Template panic | Allowed namespaces only — see templates.md |
Disabled / re-enable surprises
| Symptom | Fix |
|---|---|
| Job stopped after disable | Expected; see disable and drain |
| Re-enabled job not starting | Run maand build after clearing disabled.json, then deploy |
| KV missing on disabled job | Should be retained; if gone, check whether allocations were removed not disabled |
| cat kv --jobs J empty | Job may be fully removed (no non-removed allocations) |
Removed allocation / GC
maand build && maand deploy && maand gc
maand cat allocations --jobs api
Removed rows: hash cleared on deploy; worker jobs/<job>/ tree deleted on gc. See gc.md.
Worker-side inspection
Paths on worker (replace <bucket_id> from maand info):
/opt/worker/<bucket_id>/worker.json
/opt/worker/<bucket_id>/jobs.json
/opt/worker/<bucket_id>/jobs/<job>/Makefile
/opt/worker/<bucket_id>/jobs/<job>/data/
/opt/worker/<bucket_id>/jobs/<job>/logs/
/opt/worker/<bucket_id>/bin/runner.py
maand run_command "cat /opt/worker/<bucket_id>/worker.json" --workers 10.0.0.1
maand run_command "ls -la /opt/worker/<bucket_id>/jobs/api/" --workers 10.0.0.1
Compare update_seq in worker.json with maand info.
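For the comparison, the value can be pulled out of worker.json without opening the file by hand. The sketch below assumes `update_seq` is stored as an integer under a key of that exact name; the layout of worker.json is not specified here, so treat this as an assumption. The function name `worker_update_seq` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the integer update_seq from a worker.json on stdin.
# Assumes a key written as "update_seq": <integer>; prints nothing if it is absent.
worker_update_seq() {
    sed -n 's/.*"update_seq"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}
```

One way to feed it is `ssh -i secrets/worker.key agent@10.0.0.1 cat /opt/worker/<bucket_id>/worker.json | worker_update_seq`. Piping `maand run_command` output should also work as long as the command adds no framing that itself contains the key.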
Staging directory (maand host)
During deploy, inspect rendered trees before rsync:
tmp/workers/<worker_ip>/jobs/<job>/
tmp/workers/<worker_ip>/worker.json
If deploy fails mid-run, this directory is removed when the command exits.
Logging
Structured bucket logs: logs/<worker_ip>.log, logs/maand.log, and per-invocation logs/runs/<run_id>/.
maand logs show --worker 10.48.200.3 --job postgres --format human
maand logs show --event deploy_skip --format human
Full reference (formats, terminal output, flags): logging.
Command reference for debugging
| Goal | Command |
|---|---|
| Bucket summary | maand info |
| Plan deploy | maand deploy --dry-run |
| Hash / version state | maand cat deployments |
| Allocation flags | maand cat allocations |
| Deploy order | maand cat jobs |
| KV for templates/hooks | maand cat kv --jobs <job> |
| Ports | maand cat ports --jobs <job> |
| Reproduce hook | maand hooks <cmd> [job] --verbose |
| Filter bucket logs | maand logs show --worker <ip> --job <job> --format human |
| Health only | maand health_check --jobs <job> --wait --verbose |
| Force reroll | maand deploy --force --jobs <job> |
| Config-only push (no lifecycle) | maand deploy --sync-only --jobs <job> (fails if start required) |
Escalation checklist
- maand build succeeded after last workspace edit?
- maand cat deployments: expected rollout state per worker?
- maand deploy --dry-run: job listed as required?
- deployment_seq: blocked by an earlier job?
- disabled.json: unintended drain?
- pre_deploy / post_deploy logs clean?
- Worker prerequisites and SSH from CLI host?
- Health probes passing with --wait?
Related
- deploy.md: pipeline and failure table
- Applying changes on workers: restart_policy, restart_globs, --sync-only, dry-run actions
- disable and drain: disable and re-enable
- rolling-deploy: max_concurrent_upgrades and reboot patterns
- health-check.md
- hook-api.md: hook debugging
- day-2-ops.md: operations checklist