maand deploy
Deploy pushes job artifacts from the database to worker nodes, runs lifecycle actions (start / restart / stop), executes hook commands (pre_deploy, post_deploy, job_control), and updates allocation hashes so later deploys can skip unchanged jobs or resume after partial failure.
Requires a prior maand build (or maand deploy --build).
CLI
maand deploy [flags]
| Flag | Short | Description |
|---|---|---|
| --jobs | | Comma-separated job names. Default: all jobs (per deployment sequence). |
| --build | -b | Run maand build before deploy. |
| --dry-run | -n | Stage locally and compare allocation hashes; report whether deploy is required without changing workers or persisting hash updates. |
| --force | | Redeploy jobs even when all allocations are already promoted (restart active allocations). |
| --sync-only | | Rsync and promote without start / restart / reload. Fails when any allocation still needs start (new allocation). |
Examples:
maand deploy
maand deploy -b
maand deploy --jobs api,worker
maand deploy --dry-run
maand deploy -b -n
maand deploy --force --jobs vault
maand deploy --sync-only --jobs prometheus
maand deploy --dry-run --sync-only
Prerequisites
- Initialized bucket with maand build completed successfully.
- Host tools: bash, ssh, rsync, and python3 on the CLI host (bun when any hook uses .ts/.js). Deploy checks these before syncing.
- maand.conf: ssh_user, ssh_key (under secrets/), optional use_sudo.
- Workers reachable by SSH from the maand CLI host; ensure secrets/worker.key is authorized on workers.
- Each worker has python3, make, rsync, bash, and timeout on PATH. When use_sudo = true, sudo must work and sudo rsync --version must succeed. Deploy SSH-checks workers before syncing.
- Each active job has a Makefile unless the job uses only job_control commands.
High-level pipeline
Open DB transaction + kv.Initialize
Start hook HTTP API on host
Reconcile once: stop removed/disabled allocations → after_allocation_stopped → cleanup
  (post-reconcile health_check when a job stopped running allocations but still has survivors)
For deployment_seq = 0 .. max:
  For each job in this sequence (respect --jobs filter):
    pre_deploy (if registered) + stage + plan hash refresh (skip detection)
  When at least one job in the wave needs rollout or post_deploy retry:
    UpdateSeq (+1): once per deploy invocation, before first rsync
    Prepare worker.json / jobs.json / bin/ for all workers
    For each job to deploy:
      Stage job files + transpile .tpl + certs → tmp/workers/<ip>/jobs/<job>/
      Rsync (filtered per job) to /opt/worker/<bucket_id>/
      Update allocation content hashes (MD5 of staged tree)
      deployJob: batched lifecycle + per-batch promote + health; then post_deploy + promote sweep
      KV checkpoint
Final rsync: per successfully deployed job only (filtered + jobs.json refresh)
Commit transaction (even if some jobs failed; partial deploy)
Return joined errors if any job failed
Deployment sequence
Jobs with the same deployment_seq (from build) are processed in the same wave. Lower sequences complete before higher ones, which respects demands between jobs (e.g. job B depends on job A).
Reference: deployment-sequence.md — demand graph, deployment_seq algorithm, max_concurrent_upgrades within a wave, examples.
Within one sequence, jobs are independent except they share worker staging directories under tmp/workers/<ip>/.
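The wave ordering can be pictured with a small sketch. It is not maand's implementation: the function name plan_waves and the input shape (a dict of job name to deployment_seq) are assumptions made for illustration, and the only parameter stands in for the --jobs filter.

```python
from collections import defaultdict


def plan_waves(job_seqs, only=None):
    """Group job names into waves keyed by deployment_seq, lowest first.

    job_seqs: dict of job name -> deployment_seq (hypothetical input shape).
    only: optional set of job names, standing in for the --jobs filter.
    """
    waves = defaultdict(list)
    for job, seq in job_seqs.items():
        if only is None or job in only:
            waves[seq].append(job)
    # Sort waves by sequence and jobs by name so the order is deterministic.
    return [(seq, sorted(jobs)) for seq, jobs in sorted(waves.items())]


for seq, jobs in plan_waves({"vault": 0, "api": 1, "worker": 1, "prometheus": 2}):
    print(f"deployment sequence {seq}: {', '.join(jobs)}")
```

A sequence whose jobs are all filtered out simply produces no wave in this sketch.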
Which jobs run in a deploy wave
A job is considered only if it appears in the sequence and passes the --jobs filter.
JobNeedsRollout and JobNeedsPostDeploy decide whether a job is staged and deployed:
| Condition | Action |
|---|---|
| Active allocation has no hash row yet | Rollout (first deploy). |
| previous_hash != current_hash on an allocation | Rollout (updated content). |
| current_version != new_version on an active allocation | Rollout (version pending). |
| Rollout complete but post_deploy_status != success on any non-removed allocation | Retry step 4 only (post_deploy wave). |
| Rollout complete and post_deploy_status is success or NULL (no post_deploy hooks) | Skipped. Log: deploy: skip job "..." (deploy complete on all allocations). |
| --force | Stage and restart all active allocations (except new ones, which still start). Hooks, health, and post_deploy still apply. |
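The conditions in this table can be read as a small decision function. The sketch below is an illustration only: the Alloc class and plan_job are hypothetical names, the fields mirror the catalog columns mentioned above, and a missing hash row is modelled as previous_hash being None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alloc:
    # Hypothetical record; field names follow the catalog columns in the text.
    previous_hash: Optional[str]
    current_hash: str
    current_version: str
    new_version: str
    post_deploy_status: Optional[str] = None  # None stands in for NULL


def needs_rollout(a: Alloc) -> bool:
    return (
        a.previous_hash is None                # no promoted hash yet: first deploy
        or a.previous_hash != a.current_hash   # staged content changed
        or a.current_version != a.new_version  # version pending
    )


def plan_job(allocs, force=False) -> str:
    """Return 'rollout', 'post_deploy_only', or 'skip' for one job."""
    if force or any(needs_rollout(a) for a in allocs):
        return "rollout"
    if any(a.post_deploy_status not in (None, "success") for a in allocs):
        return "post_deploy_only"
    return "skip"


print(plan_job([Alloc("abc", "abc", "2.0.0", "2.0.0", "failed")]))
```

Checking rollout before post_deploy status matches the table's ordering: a job with pending content changes is rolled out in full, and only a fully rolled-out job is eligible for the step 4 retry.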
Per-batch promote and deploy completion
During rollout, maand promotes each allocation batch after after_allocation_started hooks succeed and before the health gate for that phase. Hook failure blocks promote for that batch; health failure does not unpromote batches that already promoted.
A job is finished for the current generation only when post_deploy succeeds (or the job has no post_deploy commands). Catalog field post_deploy_status on each allocation hash tracks this:
| post_deploy_status | Meaning |
|---|---|
| NULL | No pending post_deploy for this generation, or job has no post_deploy hooks |
| pending | Content promoted; post_deploy not yet succeeded |
| success | post_deploy succeeded; deploy complete for this generation |
| failed | post_deploy ran and failed (retry runs pre-rolling health, then step 4) |
After a successful deploy wave, promoteAllocationHash (or per-batch promote) sets previous_hash = current_hash and hash.current_version = allocations.new_version. A re-run of maand deploy continues from failed jobs only (partial deploy resume). Jobs with all allocations promoted but post_deploy still pending are staged for step 4 only. Use --force to roll the same content again without a workspace change.
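As a rough model of promotion, the sketch below treats one allocation hash row as a plain dict. The function names promote and is_complete are hypothetical, and the real catalog update happens inside the deploy transaction rather than on dicts.

```python
def promote(row, new_version):
    """Return a promoted copy of one allocation hash row."""
    promoted = dict(row)
    promoted["previous_hash"] = row["current_hash"]  # previous_hash = current_hash
    promoted["current_version"] = new_version        # current_version = new_version
    return promoted


def is_complete(row, new_version):
    """True when content and version are promoted and post_deploy is not outstanding."""
    return (
        row["previous_hash"] == row["current_hash"]
        and row["current_version"] == new_version
        and row.get("post_deploy_status") in (None, "success")
    )


row = {"previous_hash": "abc", "current_hash": "def",
       "current_version": "2.0.0", "post_deploy_status": None}
promoted = promote(row, "2.1.0")
print(is_complete(row, "2.1.0"), is_complete(promoted, "2.1.0"))
```

Promoting an already promoted row returns the same values, which is the property the idempotent promote sweep relies on.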
Allocation version tracking
Each active allocation tracks running vs target version alongside content hashes in the hash table (namespace <job>_allocation, key alloc_id).
| Field | Meaning |
|---|---|
| current_version (hash table) | Last promoted (running) version on that allocation |
| new_version (allocations table) | Target version from maand build (manifest.json → job.version) |
Defaults: If manifest.json omits version, maand uses 0.0.0 for KV and allocation version fields (build-time dependency rules still require an explicit version when the job participates in the demand graph).
Lifecycle (mirrors content hash promote):
First deploy: current_version=0.0.0 new_version=2.0.0
  → make start with CURRENT_VERSION=0.0.0 NEW_VERSION=2.0.0
  → promote → current_version=2.0.0
Upgrade: current_version=2.0.0 new_version=2.1.0
  → restart/reload (per restart_policy) → promote → current_version=2.1.0
Same version, unchanged tree: hash unchanged and `current_version = new_version` → job skipped (no lifecycle)
Version-only bump: **`build`** updates `allocations.new_version`; **`deploy`** runs lifecycle when `hash.current_version != allocations.new_version` even if the content hash is unchanged (typically **`reload`** when policy is **`reload`**)
During a rolling deploy, allocations on different workers can briefly differ (current_version updated per allocation as each wave promotes).
Where to read versions
| Surface | Keys / fields |
|---|---|
| Job-level KV (target) | maand cat kv get maand/job/<job> version |
| Catalog (per allocation) | hash.current_version (running), allocations.new_version (target) — maand cat deployments |
| Templates (.tpl) | {{ .CurrentVersion }}, {{ .NewVersion }} on allocation context |
| Worker make env | CURRENT_VERSION, NEW_VERSION on start / restart / reload |
| Hook scripts | Same env vars (plus NEW_ALLOCATIONS / UPDATED_ALLOCATIONS for job_control) |
Per-allocation KV namespace maand/job/<job>/worker/<ip> does not store a version key — deploy clears any legacy copy on promote. Use job-level KV or template context .NewVersion for the build target.
Example Makefile upgrade hook:
restart:
	@echo "Upgrading $(CURRENT_VERSION) -> $(NEW_VERSION)"
	./bin/upgrade.sh
Build-time version rules and demand min_version / max_version are separate — see manifest.md.
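A script called from a make target could branch on the two environment variables. The sketch below is one possible shape, not something shipped with maand: upgrade_kind and parse_version are hypothetical helpers, and they assume purely numeric dotted versions such as 2.1.0.

```python
import os


def parse_version(text):
    """Turn '2.1.0' into (2, 1, 0); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


def upgrade_kind(current, new):
    """Classify the step from the running version to the target version."""
    cur, nxt = parse_version(current), parse_version(new)
    if nxt == cur:
        return "same"
    if nxt < cur:
        return "downgrade"
    if nxt[0] != cur[0]:
        return "major"
    return "minor-or-patch"


if __name__ == "__main__":
    # CURRENT_VERSION and NEW_VERSION are set by deploy for start / restart / reload.
    current = os.environ.get("CURRENT_VERSION", "0.0.0")
    new = os.environ.get("NEW_VERSION", "0.0.0")
    print(f"{current} -> {new}: {upgrade_kind(current, new)}")
```

A restart target might run a data migration only when the result is "major" and fall through to a plain restart otherwise.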
Use maand deploy --dry-run to see whether a real deploy would run, without rsync, lifecycle, or hash promotion:
- Stages job files under tmp/workers/ (same as deploy), including pre_deploy hooks when registered (so plan hashes match real deploy staging). pre_deploy may SSH to workers when the hook runs hooks on allocations.
- Computes MD5 of each active allocation's staged tree and compares to previous_hash in the database (rolled back afterward).
- Prints per job whether deploy is required and per allocation the planned action: start, restart, reload, sync, or skip. When restart_globs forces a restart, the line includes matched= with the changed paths.
maand deploy --dry-run
maand deploy -b -n --jobs api
maand deploy --dry-run --force # preview forced redeploy
Example output when content changed since last promote:
deploy dry-run: deployment required
deployment sequence 0:
job "api": deploy required
10.0.0.1 reload previous_hash=abc... current_hash=def...
10.0.0.2 restart previous_hash=abc... current_hash=def... matched=Makefile
When everything is promoted:
deploy dry-run: no deployment required
deployment sequence 0:
job "api": skip (already promoted on all allocations)
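The hash comparison behind a dry run can be approximated as follows. The source only states that the hash is an MD5 of the staged tree; the exact byte layout used here (sorted relative paths plus file contents) is an assumption, and tree_md5 and deploy_required are hypothetical names.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_md5(root):
    """MD5 over each file's relative path and bytes, visited in sorted order."""
    root = Path(root)
    digest = hashlib.md5()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator so path and content cannot run together
        digest.update(path.read_bytes())
    return digest.hexdigest()


def deploy_required(previous_hash, current_hash):
    """True when nothing is promoted yet or the staged tree differs from it."""
    return previous_hash is None or previous_hash != current_hash


# Demo on a throwaway directory standing in for tmp/workers/<ip>/jobs/<job>/.
with tempfile.TemporaryDirectory() as staged:
    makefile = Path(staged) / "Makefile"
    makefile.write_text("start:\n")
    promoted = tree_md5(staged)
    unchanged = deploy_required(promoted, tree_md5(staged))
    makefile.write_text("start:\nreload:\n")
    changed = deploy_required(promoted, tree_md5(staged))
    print(unchanged, changed)
```

Including the relative path in the digest means a rename counts as a change even when file contents are identical.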
Default deploy path (deployJob steps 1–4)
For each job in the wave that needs rollout or post_deploy-only retry:
- pre_deploy (optional, before staging)
- Stage job files to tmp/workers/<ip>/jobs/<job>/
- Rsync that job to each worker allocation
- Update allocation hashes for that job
- deployJob (canonical steps below)
- KV checkpoint after deployJob
When the job has no job_control commands, deployJob runs:
| Step | Makefile path | Batch order |
|---|---|---|
| 1 | handleNewAllocations | lifecycle → after_allocation_started → promote; one health check after all start batches |
| 2 | Pre-rolling health | All active allocations (skipped when no upgrade pending) |
| 3 | handleUpdatedAllocations | reload workers first, then restart workers; per batch: lifecycle → hooks → promote → health (wait) |
| 4 | finalizeJobDeploy | job-wide post_deploy wave → mark post_deploy_status → idempotent promote sweep |
When job_control commands exist, steps 1–3 are replaced by batched job_control scripts; per batch: commands → promote → health (wait). Step 4 is unchanged.
Reconcile (before any job staging): stop removed/disabled allocations → after_allocation_stopped → cleanup. Stop hook failure aborts the entire deploy (no cleanup, no job waves). Post-reconcile health_check runs for jobs that stopped a running allocation and still have active survivors.
--sync-only: skips steps 1–3; runs step 4 only (rsync + post_deploy + promote).
Makefile lifecycle detail
- handleNewAllocations: Workers where previous_hash IS NULL (new alloc) → python3 /opt/worker/<bucket_id>/bin/runner.py <bucket_id> start --jobs <job> in batches of max_concurrent_starts (0 = all at once), ordered by rollout_order. after_allocation_started hooks run after each batch, then promote that batch. One health check runs after all start batches complete.
- Pre-rolling health: Before the first upgrade batch when any allocation needs restart/reload.
- handleUpdatedAllocations: Workers where hash or version changed → lifecycle per restart_policy (see below) in batches of max_concurrent_upgrades, ordered by rollout_order. after_allocation_started hooks run after each batch, then promote, then health_check (wait/retry) before the next batch.
- post_deploy: Hooks with event post_deploy (job-wide wave, batched per allocation).
- promoteAllocationHash: Idempotent sweep for any allocation not yet promoted (e.g. disabled, sync-only).
When allocations are stopped during reconcile (removed/disabled), after_allocation_stopped hooks run once per stopped batch; cleanup runs only after hooks pass.
Makefile on the worker (under jobs/<job>/) receives CURRENT_VERSION and NEW_VERSION in the environment for start, restart, and reload. Use them for upgrade logic (see Allocation version tracking):
start:
stop:
restart:
reload:
Data/logs/bin on workers are excluded from rsync (--exclude=jobs/*/data, etc.).
Applying changes on workers
Deploy always rsyncs staged files before it decides whether to touch running processes. That split matters: you can push config to disk while choosing how (or whether) the process reacts.
What triggers rollout
| Situation | Typical action |
|---|---|
| First deploy on a worker (previous_hash empty) | make start |
| Staged tree differs from last promote (previous_hash ≠ current_hash) | Lifecycle per restart_policy (below) |
| Version target changed (current_version ≠ new_version, same tree) | Same lifecycle policy (usually reload when policy is reload) |
| Already promoted on all active allocations | Skip: no rsync wave for that job |
| --force | Rollout all active allocations even when hash and version match |
New allocations always start. No policy or flag can replace start with rsync-only — use normal deploy for first boot, then tune policy for upgrades.
Default path: Makefile + runner.py
When the job has no job_control commands, deploy calls runner.py on the worker, which runs make targets in the job directory:
| Target | When |
|---|---|
| start | New allocation |
| restart | Full recreate / stop-start (policy always, or reload + restart_globs match) |
| reload | Soft apply: config reload, HTTP /-/reload, systemctl reload, etc. |
Each target receives CURRENT_VERSION and NEW_VERSION in the environment (see Allocation version tracking).
Rolling batches use max_concurrent_starts (starts) and max_concurrent_upgrades (restarts/reloads), ordered by rollout_order. Health checks run after start batches complete and after each update batch when a lifecycle target runs.
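Batching by a concurrency limit can be sketched as below. The function name batches and the dict shape are hypothetical, rollout_order is assumed to be a sortable number, and treating 0 as "all at once" is documented above only for max_concurrent_starts, so applying it to upgrades is an assumption.

```python
def batches(allocs, max_concurrent):
    """Split allocations into rollout batches, ordered by rollout_order.

    max_concurrent <= 0 means a single batch containing everything.
    """
    ordered = sorted(allocs, key=lambda a: a["rollout_order"])
    size = max_concurrent if max_concurrent > 0 else max(len(ordered), 1)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


allocs = [
    {"ip": "10.0.0.3", "rollout_order": 2},
    {"ip": "10.0.0.1", "rollout_order": 0},
    {"ip": "10.0.0.2", "rollout_order": 1},
]
for number, batch in enumerate(batches(allocs, 2), start=1):
    print(number, [a["ip"] for a in batch])
```

With a limit of 2 the three workers above roll in two batches, so at most two allocations are in a lifecycle target at any time.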
restart_policy (manifest)
Set in manifest.json. Default always. Applies to updated allocations only.
| Value | After rsync | Makefile |
|---|---|---|
| always | Recreate or full restart | make restart |
| reload | Soft apply when possible | make reload (see restart_globs) |
| never | Files only | none (rsync + promote) |
Example — Prometheus picks up rule and config changes without restarting the process when only non-critical files change:
{ "restart_policy": "reload" }
reload:
	curl -sf -X POST http://127.0.0.1:$(PROM_PORT)/-/reload
Prometheus needs --web.enable-lifecycle. See guides/prometheus.md.
Stateful jobs (databases, queues) usually keep always. Stateless HTTP services and monitoring stacks often use reload.
restart_globs (manifest, with reload only)
Optional list of job-relative globs (*, ?, ** — same rules as .dashboardignore). maand build rejects restart_globs unless restart_policy is reload.
Maand stores a per-file manifest on each allocation (hash.current_files / hash.previous_files): path → content MD5 of the last staged and promoted trees. On upgrade it diffs those maps:
- If any changed path matches a glob → make restart
- Otherwise → make reload
| Changed files | restart_globs | Result |
|---|---|---|
| rules/alerts.yaml | ["prometheus.yml", "Makefile"] | reload |
| Makefile | same | restart |
| Version bump only (no file diff) | any | reload |
| No promoted file manifest yet | any | reload (conservative default) |
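The rows above can be reproduced with a short sketch that diffs two path → MD5 maps. It is an approximation: lifecycle_target and changed_paths are hypothetical names, and Python's fnmatch is used as a stand-in for maand's glob rules (in fnmatch, * also crosses directory separators, which may differ from the real matcher).

```python
from fnmatch import fnmatch


def changed_paths(previous_files, current_files):
    """Paths added, removed, or whose MD5 differs between the two manifests."""
    paths = set(previous_files) | set(current_files)
    return sorted(p for p in paths if previous_files.get(p) != current_files.get(p))


def lifecycle_target(policy, previous_files, current_files, restart_globs=()):
    """Return (make target or None, matched paths) for an updated allocation."""
    if policy == "never":
        return None, []            # files only: rsync + promote
    if policy == "always":
        return "restart", []
    if not previous_files:
        return "reload", []        # no promoted file manifest yet
    matched = [
        path for path in changed_paths(previous_files, current_files)
        if any(fnmatch(path, glob) for glob in restart_globs)
    ]
    return ("restart", matched) if matched else ("reload", [])


globs = ["prometheus.yml", "Makefile"]
prev = {"prometheus.yml": "a1", "rules/alerts.yaml": "b1", "Makefile": "c1"}
print(lifecycle_target("reload", prev, {**prev, "Makefile": "c2"}, globs))
```

The matched list corresponds to what dry-run reports after matched= when a glob forces a restart.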
Example — reload for most edits, restart when compose or binaries change:
{
  "restart_policy": "reload",
  "restart_globs": [
    "docker-compose.yml",
    "docker-compose.yml.tpl",
    "Dockerfile",
    "bin/**"
  ]
}
Dry-run shows which paths triggered restart:
10.0.0.3 restart previous_hash=... current_hash=... matched=Makefile,bin/app
10.0.0.4 reload previous_hash=... current_hash=...
--sync-only (CLI)
One-deploy override: rsync, post_deploy, and promote without start, restart, or reload. Same effect as restart_policy: never for updated allocations, but chosen on the command line.
| Case | Behavior |
|---|---|
| Updated allocation | Rsync + promote; no lifecycle |
| New allocation | Error — cannot bootstrap without start |
| job_control job | Skips custom lifecycle; rsync + promote only |
| --dry-run | Reports sync; errors if start would be required |
Use when the process reads files directly, or when you will run maand job run <job> --target reload yourself afterward.
maand build
maand deploy --dry-run --sync-only --jobs api
maand deploy --sync-only --jobs api
--force --sync-only still skips lifecycle even when force would otherwise roll allocations.
job_control (custom lifecycle)
If the manifest registers job_control, default start / restart / reload are not used. Your script receives NEW_ALLOCATIONS, UPDATED_ALLOCATIONS, CURRENT_VERSION, and NEW_VERSION — implement canary, blue/green, or sync logic there. See hook-api.md.
restart_policy, restart_globs, and --sync-only do not apply on this path (except --sync-only still skips the script and only rsyncs + promotes).
Staging and rsync
Staging (tmp/workers/<worker_ip>/)
| Path | Content |
|---|---|
| worker.json | bucket_id, worker_id, labels, update_seq |
| jobs.json | Active jobs on this worker (removed jobs omitted) |
| bin/runner.py, bin/worker.py | Embedded helpers |
| jobs/<job>/ | Copy of job_files + rendered .tpl + certs/ |
Rsync
- From the bucket directory on the CLI host to agent@<worker>:/opt/worker/<bucket_id>/ (user from maand.conf).
- Staging rsync: filter includes only jobs in jobsToStage (+ jobs/<job>/, - jobs/*).
- Final rsync: one pass per successfully deployed job, only to workers with an active allocation for that job; same per-job filter so other jobs on the host are not touched.
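The per-job filter can be pictured as a list of rsync arguments. The exact command line maand builds is not shown here, so the sketch below is an assumption: job_filter_args is a hypothetical helper that combines the include/exclude rules quoted above with the jobs/*/data exclude mentioned earlier.

```python
def job_filter_args(jobs):
    """Build rsync filter arguments that sync only the given jobs.

    rsync applies the first matching rule, so each job directory is
    included before the catch-all exclude for everything else under jobs/.
    """
    args = ["--exclude=jobs/*/data"]  # runtime data stays on the worker
    for job in sorted(jobs):
        args.append(f"--filter=+ jobs/{job}/")
    args.append("--filter=- jobs/*")
    return args


print(" ".join(job_filter_args(["api", "worker"])))
```

Because the final exclude only matches entries directly under jobs/, files inside an included job directory are still transferred.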
Templates (.tpl)
Files under jobs/<job>/ ending in .tpl are rendered at staging time. Full reference: templates.md.
Prometheus job staging
When the staged job ships prometheus.yml or prometheus.yml.tpl, maand assembles monitoring artifacts before rsync (from job_files, not from maand/prometheus KV except scrape expansion):
| Output (under jobs/prometheus/ on worker) | Source |
|---|---|
| rules/<maand_job>/*.yaml | Each job's _prometheus/alerts/ (+ runbook URL injection) |
| rules/maand/certs.yaml | Embedded cert alert rules when server config exists |
| consoles/runbooks/<job>/<slug>.html | _prometheus/runbooks/*.md → HTML + index + CSS |
| consoles/dashboards/<job>/<path> | _prometheus/dashboards/** copied as-is (+ index, CSS) |
| prometheus.yml (rendered) | Template with {{ scrapeConfigs }} / {{ ruleFiles }} |
{{ scrapeConfigs }} reads scrape KV (maand/prometheus/scrape*), expands maand:port/* using active allocations, and skips jobs that would expand to zero targets (does not fail the whole render).
After deploy commits, maand best-effort pushes cert expiry metrics via Prometheus remote write (see certs.md).
Details: prometheus.md.
Removed and disabled allocations
| State | Behavior |
|---|---|
| removed=1 (worker/job dropped at build) | If previously deployed: stop, then remove deployed job files on the worker (data/ and logs/ are left in place for redeploy). Local staging under tmp/workers/<ip>/jobs/<job>/ is removed. Workers removed from workers.json: after all their removed allocations are processed, rm -rf /opt/worker/<bucket_id>/. Unreachable removed workers are assumed dead (logged, deploy continues). |
| disabled=1 | Excluded from start/restart/reload/rsync targets; stop if was running; keep deployed job files, KV, and hash/version state. Content and version changes are still staged, hashed, and promoted on deploy (rollout shows disabled or disabled_restart in maand cat deployments). After re-enable (maand build clears disabled.json), deploy starts the allocation via GetNewAllocations. |
Redeploying the same job on the same worker reuses existing data/ and logs/ (rsync excludes those paths). deploy deletes the allocation hash row when reconciling removed=1 allocations (even if the job is skipped from rollout), so a later redeploy treats it as a new rollout (make start) while worker data/ and logs/ remain. After build only, hashes still show the last promoted state until deploy runs.
When reconcile finishes and a job has no non-removed allocations (every row removed=1), deploy purges all job-scoped KV namespaces. Jobs with disabled-only allocations retain KV. maand build also clears build-owned namespaces when the job is inactive. Run maand gc to delete worker jobs/<job>/ trees and purge removed allocation rows from the catalog.
pre_deploy and post_deploy
- Registered in manifest with executed_on: pre_deploy / post_deploy.
- Run via hooks on the CLI host (Python or Bun).
- pre_deploy failure: job is not added to jobsToStage for this deploy; other jobs continue.
- post_deploy failure: fails that job's deploy; sets post_deploy_status = failed on all non-removed allocations; earlier jobs in the same run may already be promoted. Retry: maand deploy runs pre-rolling health (if prior failed), then step 4 only.
KV checkpoint
Hooks can write to the in-memory KV store (e.g. connection strings for templates). After each job’s pre_deploy and after deployJob (including post_deploy), maand flushes pending KV changes into the deploy transaction via kv.PersistToSessionTransaction.
Those writes commit when deploy commits (including partial deploy — successful jobs persist even if later jobs fail). If deploy aborts before commit, KV checkpoint writes roll back with the rest of the catalog state.
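The commit and rollback behaviour can be demonstrated with a toy model. Everything here is an assumption for illustration: the PendingKV class, the kv table layout, the dsn key, and the use of SQLite are not taken from maand; only the idea of flushing pending writes into an open transaction comes from the text.

```python
import sqlite3


class PendingKV:
    """In-memory KV writes that are flushed into a caller-owned transaction."""

    def __init__(self):
        self.pending = {}

    def put(self, namespace, key, value):
        self.pending[(namespace, key)] = value

    def persist(self, conn):
        # Writes join the open transaction; committing is the caller's decision.
        conn.executemany(
            "INSERT OR REPLACE INTO kv VALUES (?, ?, ?)",
            [(ns, key, value) for (ns, key), value in self.pending.items()],
        )
        self.pending.clear()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kv (namespace TEXT, key TEXT, value TEXT, PRIMARY KEY (namespace, key))"
)
conn.commit()

kv = PendingKV()
kv.put("maand/job/api", "dsn", "postgres://10.0.0.1/app")
kv.persist(conn)
conn.rollback()  # deploy aborted before commit: checkpoint is discarded
rows_after_abort = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]

kv.put("maand/job/api", "dsn", "postgres://10.0.0.1/app")
kv.persist(conn)
conn.commit()    # deploy committed, including a partial deploy
rows_after_commit = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
print(rows_after_abort, rows_after_commit)
```

Keeping the flush separate from the commit is what lets a checkpoint share the fate of the rest of the catalog state.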
Partial deploy and retry
- Job A succeeds → content hashes and current_version promoted; post_deploy_status = success when hooks exist.
- Job B fails mid-rollout (e.g. health on batch 2) → batch 1 allocations stay promoted; later batches not started; post_deploy skipped.
- Job C fails post_deploy → all non-removed allocations marked post_deploy_status = failed; content already promoted.
- Transaction still commits; command returns deploy failed with errors.
- Fix and run maand deploy again:
  - Job A: skipped (deploy complete).
  - Job B: resumes rollout from unpromoted allocations.
  - Job C: post_deploy-only retry (step 4).
Use maand deploy --force to redeploy promoted jobs without a workspace change.
update_seq
When deploy stages at least one job (rsync to workers), bucket.update_seq increments once before the first staging wave (committed with the deploy transaction). Workers receive the new value in worker.json so they can detect bucket-wide changes.
Skip-only runs (deploy complete on all allocations) do not bump update_seq. maand deploy --dry-run never bumps it either.
Configuration
Uses maand.conf at the bucket root (SSH user, key file under secrets/, sudo for remote rsync). See configuration.md.
Worker key path: secrets/<ssh_key> relative to the bucket root.
Inspect state
maand cat allocations
maand cat deployments
maand cat jobs
maand info
Hash state lives in table hash with namespace <job>_allocation and key alloc_id. maand cat deployments shows current_hash, previous_hash, versions, post_deploy_status, and rollout (removed, disabled, or hash-derived new / restart / promoted / health_failed). Use --active to see only allocations deploy would target.
Common failures
| Symptom | Likely cause |
|---|---|
| SSH / rsync errors | Wrong key, firewall, worker down, sudo needed (use_sudo=true) |
| deploy: skip job on first deploy | Run build first; allocation should have no hash row until first successful promote |
| deploy: skip job after workspace edit | Run maand build then maand deploy; deploy refreshes plan hashes from staged content before the skip check |
| Job stuck after promote, before post_deploy | Check post_deploy_status in cat deployments; re-run maand deploy for step 4 |
| Job not staging | pre_deploy failed or JobNeedsRollout false |
| Template panic | KV key missing; namespace not allowed in .tpl |
| health check failed | health_check command returned non-zero |
| Stale files on worker | Deploy removes deployed job files on dealloc (keeps data/ and logs/); GC deletes runtime dirs; final rsync is per-job only |
Related
- build.md: catalog and sequences
- templates.md: .tpl rendering
- KV namespaces · KV persistence
- hooks.md · hook-api.md
- disable and drain: disable/re-enable
- rolling-deploy: max_concurrent_upgrades, version upgrades
- debugging-deploy.md: dry-run, cat deployments, failures
- health-check.md: standalone health checks
- job.md: manual start/stop/restart/reload
- gc.md: purge removed allocations