Rolling upgrades and rolling worker reboots

Maand rolls out job changes through maand deploy (content hash and/or version). Worker host reboots and ad-hoc restarts use maand job and maand run_command. This guide covers both.

Rolling job upgrade (deploy)

Deploy is the primary way to roll out workspace changes. Maand compares content hashes and version targets per allocation, rsyncs when needed, then runs start, restart, reload, or nothing — according to manifest policy and optional CLI flags.

When an allocation is touched

Situation	What deploy does
First deploy on worker	`make start`
Content changed since last promote	Rsync, then lifecycle per `restart_policy`
Version bumped, files unchanged	Same lifecycle (often `reload` when policy is `reload`)
Already promoted	Skip that job
`--force`	Roll all active allocations (policy still applies to how)
`--sync-only`	Rsync + promote only; errors if `start` is required

See deploy.md for restart_policy, restart_globs, and --sync-only.

Choosing a lifecycle strategy

Job type	Typical manifest	Makefile
Stateful (Postgres, Kafka)	`"restart_policy": "always"`	`restart` recreates or restarts service
Monitoring (Prometheus)	`"restart_policy": "reload"`	`reload` calls `/-/reload`
App with compose + static config	`"restart_policy": "reload"` + `restart_globs` for compose/Dockerfile	`reload` for config; `restart` when globs match
Files consumed without process hook	`"restart_policy": "never"` or `--sync-only`	Process watches disk or you run `maand job run … --target reload`

Add a reload: target whenever policy is reload. Without it, deploy still invokes make reload and the target must exist.

Configure batch size

Two manifest fields control rollout batching:

Field	Phase	Default	Behavior
`max_concurrent_starts`	First deploy (start new allocations)	`0` (= all at once)	Start new allocations in batches of N; one health check after all batches
`max_concurrent_upgrades`	Upgrades (restart / reload changed allocations)	`1`	Lifecycle in batches of N; health check after each batch when a target runs

In workspace/jobs/<job>/manifest.json:

{
  "version": "2.0.0",
  "selectors": ["worker"],
  "max_concurrent_starts": 2,
  "max_concurrent_upgrades": 2
}

Both fields use rollout_order (KV key maand/job/<job>/rollout_order, synced from catalog on build) to pick worker order within each batch. Override in pre_deploy with put_rollout_order() — see hook-api.md. Deploy validates softly and falls back to catalog order if the list is stale.

Example: job api on workers 10.0.0.1 … 10.0.0.4 with max_concurrent_upgrades: 2 and restart_policy: always:

Batch 1: restart 10.0.0.1 and 10.0.0.2 (parallel)
         → health_check (wait) for the job
Batch 2: restart 10.0.0.3 and 10.0.0.4 (parallel)
         → health_check again

With restart_policy: reload, the same batches call make reload instead (or restart on workers where restart_globs matched).

Set max_concurrent_upgrades to the largest restart burst you can tolerate while keeping the service healthy (often 1 for stateful jobs, higher for stateless). Use max_concurrent_starts the same way for first deploy when bringing up a multi-node cluster (Vault, Cassandra, Postgres primaries/replicas).

After each batch start, restart, or reload, maand runs after_allocation_started hooks (if registered), promotes that batch in the catalog, then runs the health gate for that phase. Promote happens after hooks and before health; hook failure blocks promote; health failure does not unpromote earlier batches.

Version and Makefile env

On start, restart, and reload, the worker Makefile receives:

CURRENT_VERSION=<running, pre-promote>
NEW_VERSION=<target from build>

Use them for migration scripts:

restart:
	./bin/upgrade.sh "$(CURRENT_VERSION)" "$(NEW_VERSION)"
	$(MAKE) start

After a successful deploy wave, per-batch promote sets current_version = new_version during rollout; post_deploy (when registered) must succeed before post_deploy_status = success marks the job deploy complete.

Deployment sequence (`deployment_seq`)

Jobs with demands deploy in waves by deployment_seq (lower first). Within one sequence value, jobs are independent; each job applies its own max_concurrent_starts (starts) and max_concurrent_upgrades (restarts).

maand cat jobs                    # deployment_seq column
maand deploy --jobs api,worker    # still respects seq order

See deployment-sequence.md.

`job_control` custom rollout

If any manifest hook uses executed_on: ["job_control"], default Makefile lifecycle (start/restart/reload) is not used. Your script runs in batches of max_concurrent_upgrades; after each batch maand promotes those allocations and runs health_check (wait). Step 4 (post_deploy) runs after all command batches succeed. Script env:

NEW_ALLOCATIONS=10.0.0.1,10.0.0.2
UPDATED_ALLOCATIONS=10.0.0.3
BATCH_ALLOCATIONS=10.0.0.1,10.0.0.2
BATCH_INDEX=0
BATCH_COUNT=2
CURRENT_VERSION=...
NEW_VERSION=...

Implement canary or blue/green logic inside the command. See hook-api.md.

Recommended upgrade flow

# 1. Change workspace (manifest version, files, templates)
vim workspace/jobs/api/manifest.json

# 2. Plan
maand build
maand deploy --dry-run

# 3. Roll out
maand deploy
# or: maand deploy --jobs api

# 4. Verify
maand cat deployments --jobs api
maand health_check --jobs api --wait

Version-only upgrade (no file change)

Bump version in manifest.json only:

maand build
maand deploy --dry-run    # should show restart
maand deploy

Force redeploy (same tree)

After operator-initiated reroll:

maand deploy --force --jobs api

Rolling job restart (without deploy)

Use when the catalog is already promoted but processes need a bounce (config reload, memory leak, etc.).

All allocations

maand job restart api
maand health_check --jobs api --wait    # optional

One worker at a time

for ip in 10.0.0.1 10.0.0.2 10.0.0.3; do
  maand job restart api --allocations "$ip"
  maand health_check --jobs api --wait
done

Custom Makefile target

maand job run api --target reload

Requires maand deploy to have run at least once (worker.json / update_seq in sync).

Rolling worker reboot

Host OS reboot patterns (disable → reboot → re-enable): worker-reboot.md.

Health checks during rolling work

Step	Command
After each deploy batch	Automatic when job defines manifest probes or `health_check` commands
After manual restart	`maand health_check --jobs api --wait`
After worker reboot	`maand health_check --wait` or per-job `--jobs`
After `run_command` batch	`maand run_command ... --health_check`

Inspect rolling state

maand deploy --dry-run
maand cat deployments --jobs api
maand cat allocations --jobs api

During deploy, different workers may briefly show different current_version values until each batch promotes.

deploy.md — full deploy pipeline and version tracking
disable and drain — drain workers or jobs
job.md — manual start/stop/restart/reload
run-command.md — SSH batches and --concurrency
health-check.md — probes and wait/retry
debugging-deploy.md — when rolling upgrade stalls or fails