Tutorial: Day-2 operations

After the guided tour or quickstart, use these patterns for everyday cluster operations. This tutorial assumes a working bucket with deployed jobs.


Inspect catalog state

Quick summary:

maand info

Detailed tables:

maand cat workers
maand cat jobs
maand cat allocations
maand cat hooks
maand cat ports
maand cat certs
maand cat kv

Filter allocations:

maand cat allocations --jobs api,worker
maand cat allocations --workers 10.0.0.1

Check TLS expiry for the CA and job leaf certificates. An expired CA blocks maand build; an expiring one only prints a warning to stderr:

maand cat certs
maand cat certs --jobs api,postgres

See certs.md.

Read one KV key:

maand cat kv get maand/job/api job_name

Manual job control

maand job runs Makefile targets (or job_control commands) on workers. It first verifies that each worker’s worker.json matches the database; if you see sync errors, run maand deploy first.

maand job restart api
maand job run api --target reload
maand job stop api --allocations 10.0.0.2
maand job start api --health_check
maand job run api --target migrate
maand job status api

Use maand job run --target reload after maand deploy --sync-only when you have pushed config to disk and want to trigger the reload yourself.

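| vs deploy  | maand deploy                 | maand job               |
|------------|------------------------------|-------------------------|
| When       | Catalog or job files changed | Ops / one-off lifecycle |
| Sync check | No                           | Yes (update_seq)        |
| Hash skip  | Yes                          | No                      |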

See job.md.


Health checks

Each job may use manifest probes, a custom command, or both (probes run first):

Option A — manifest probes (tcp/http/ssh) in manifest.json — see health-check.md.

Option B — custom command:

"hooks": {
  "hook_health": {
    "executed_on": ["health_check"]
  }
}

Add workspace/jobs/api/_hooks/hook_health.py (see hook-api.md).

Run checks:

maand health_check
maand health_check --jobs api --wait --verbose

Redeploy after fixing health:

maand deploy --jobs api

Force a full redeploy without workspace changes:

maand deploy --force --jobs api

Deploy runs health checks automatically after restart when health is configured.

See health-check.md.


Prometheus monitoring

Add _prometheus/ under each job that exposes metrics (see prometheus.md):

workspace/jobs/api/_prometheus/
├── scrape.yaml              # optional
├── alerts/                  # optional
├── runbooks/                # optional
└── dashboards/              # optional

After adding or changing _prometheus/ content:

maand build
maand deploy --jobs api,...      # app jobs first
maand deploy --jobs prometheus   # assemble rules, runbooks, dashboards, scrape config

Console pages: runbooks at /consoles/runbooks/..., dashboards at /consoles/dashboards/... — see prometheus.md.
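
The layout above can be scaffolded with a few lines of Python. This is a minimal sketch, not part of maand: the function name scaffold_prometheus is hypothetical, and it assumes the workspace/jobs/<job>/_prometheus/ path shown above. It creates only the three optional directories and leaves scrape.yaml for you to write, since its contents are described in prometheus.md.

```python
from pathlib import Path


def scaffold_prometheus(workspace, job):
    """Create the optional _prometheus/ subdirectories for one job.

    Hypothetical helper: it only mirrors the directory tree from this
    tutorial and does not call maand.
    """
    base = Path(workspace) / "jobs" / job / "_prometheus"
    for sub in ("alerts", "runbooks", "dashboards"):
        # exist_ok keeps the call safe to repeat on an existing layout
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base
```

For example, scaffold_prometheus("workspace", "api") returns the path workspace/jobs/api/_prometheus. The directories have no effect until the build and deploy steps above are run.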


Ad-hoc commands on workers

maand collect facts probes host memory and CPU. To update workspace/workers.json, pass --generate-workers and redirect the output:

maand collect facts
maand collect facts --generate-workers > workspace/workers.json
maand build

See collect.md.

maand run_command runs a shell command on workers (not inside job workspaces):

maand run_command "uptime"
maand run_command "df -h /opt/worker" --workers 10.0.0.1,10.0.0.2
maand run_command "hostname" --labels worker --concurrency 4
maand run_command "systemctl status myservice" --health_check

The CLI host needs bash and ssh; workers need bash and timeout.

See run-command.md.


Disable an allocation temporarily

See disable and drain for the full guide (per-allocation, per-job, per-worker, re-enable).

Create or edit workspace/disabled.json:

{
  "jobs": {
    "api": {
      "allocations": ["10.0.0.2"]
    }
  }
}

Disable every job on a worker:

{
  "workers": ["10.0.0.3"]
}

Disable an entire job everywhere:

{
  "jobs": {
    "api": {}
  }
}

Then:

maand build
maand deploy

Disabled allocations are skipped for start/restart/reload/rsync; deploy stops them if they are running and keeps their artifacts and KV. To re-enable, clear disabled.json, then run maand build and maand deploy.
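
If you edit disabled.json from scripts, a small merge helper avoids clobbering existing entries. The sketch below is an assumption-laden illustration: the function disable and its parameters are hypothetical, and it relies only on the three JSON shapes shown in this section. An empty job entry is treated as "disabled everywhere" and is never narrowed to a list of allocations.

```python
import json
from pathlib import Path


def disable(path, job=None, allocations=None, worker=None):
    """Merge one disable entry into disabled.json and return the document.

    Hypothetical helper; it writes only the shapes shown in this tutorial.
    """
    p = Path(path)
    text = p.read_text().strip() if p.exists() else ""
    doc = json.loads(text) if text else {}

    if worker is not None:
        workers = doc.setdefault("workers", [])
        if worker not in workers:
            workers.append(worker)

    if job is not None:
        jobs = doc.setdefault("jobs", {})
        if not allocations:
            jobs[job] = {}  # empty entry: disable the job everywhere
        elif job not in jobs:
            jobs[job] = {"allocations": list(allocations)}
        elif "allocations" in jobs[job]:
            for ip in allocations:
                if ip not in jobs[job]["allocations"]:
                    jobs[job]["allocations"].append(ip)
        # else: the job is already disabled everywhere, so leave it alone

    p.write_text(json.dumps(doc, indent=2) + "\n")
    return doc
```

For example, disable("workspace/disabled.json", job="api", allocations=["10.0.0.2"]) produces the first shape above. The change still takes effect only after maand build and maand deploy.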


Remove a worker or job

  1. Remove the host from workers.json or delete workspace/jobs/<name>/
  2. maand build: marks related allocations removed = 1
  3. maand deploy: stops jobs, removes deployed job files (keeps data/ and logs/ on workers)
  4. maand gc: deletes worker data/, logs/, and bin/, allocation rows, and KV references

# after editing workspace
maand build
maand deploy
maand gc
maand gc --retain-days 7   # keep deleted KV history longer

See gc.md.
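
Because gc deletes data, it is worth stopping the sequence as soon as an earlier step fails. A minimal sketch of that ordering follows; the names REMOVAL_STEPS and run_removal are hypothetical, and it assumes maand is on PATH and exits non-zero on failure.

```python
import subprocess

# The three commands from this section, in the order they must run.
REMOVAL_STEPS = (["maand", "build"], ["maand", "deploy"], ["maand", "gc"])


def run_removal(run=subprocess.run):
    """Run build, deploy, gc in order; stop at the first failing step.

    check=True makes subprocess.run raise CalledProcessError on a
    non-zero exit, so gc never runs after a failed build or deploy.
    The run parameter exists so the runner can be swapped out in tests.
    """
    done = []
    for argv in REMOVAL_STEPS:
        run(argv, check=True)
        done.append(argv[1])
    return done
```

Calling run_removal() after editing the workspace runs all three commands, and any failure surfaces as an exception before gc is reached.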


Partial deploy and dry-run

Check whether deploy would change anything:

maand deploy --dry-run

Deploy only specific jobs (still ordered by deployment_seq):

maand deploy --jobs api,worker

If deploy fails partway, fix the issue and re-run; hash tracking skips allocations that are already up to date, so only the remaining ones are deployed.

Force redeploy when content is already promoted:

maand deploy --force --jobs api
maand deploy --dry-run --force    # preview

Push files without running lifecycle targets (rsync + promote only; fails when any allocation still needs to be started):

maand deploy --sync-only --jobs api
maand deploy --dry-run --sync-only --jobs api    # preview sync actions

For ongoing config-only rollouts, prefer restart_policy: reload in the manifest — see Applying changes on workers.

See deploy.md.
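
When deploys are driven from scripts, it can help to build the command line in one place. The sketch below assembles only the flags used in this section (--dry-run, --force, --sync-only, --jobs); the function name deploy_argv is hypothetical, and it does not check whether a given flag combination is accepted by maand.

```python
import shlex


def deploy_argv(jobs=(), force=False, dry_run=False, sync_only=False):
    """Build a `maand deploy` argument list from the flags in this section.

    Hypothetical helper: it performs no validation of flag combinations.
    """
    argv = ["maand", "deploy"]
    if dry_run:
        argv.append("--dry-run")
    if force:
        argv.append("--force")
    if sync_only:
        argv.append("--sync-only")
    if jobs:
        # --jobs takes a comma-separated list, as in `--jobs api,worker`
        argv += ["--jobs", ",".join(jobs)]
    return argv


# Preview of a sync-only push for one job, as a printable command line.
preview = shlex.join(deploy_argv(jobs=["api"], dry_run=True, sync_only=True))
```

Here preview holds the same command as the dry-run example above. The list form can be passed directly to subprocess.run without shell quoting.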


Per-job config overrides

Optional workspace/bucket.jobs.conf:

[api]
memory = "512 mb"

If maand.conf sets job_config_selector = "prod", use bucket.jobs.prod.conf instead.
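
The naming rule can be stated as a one-line function. This is a sketch under an assumption: only the "prod" case is documented here, and the generalization to other selector values (and the name jobs_conf_name) is hypothetical.

```python
def jobs_conf_name(selector=None):
    """Return the per-job override file name for a job_config_selector.

    Assumption: any selector follows the same bucket.jobs.<selector>.conf
    pattern as the documented "prod" example.
    """
    return f"bucket.jobs.{selector}.conf" if selector else "bucket.jobs.conf"
```

With no selector this gives bucket.jobs.conf, and with "prod" it gives bucket.jobs.prod.conf.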

After editing:

maand build
maand deploy -b

Upgrade maand schema

When upgrading the maand binary:

maand init    # applies DB migrations, keeps bucket_id and CA
maand build
maand deploy

Rolling upgrades

See rolling-deploy for max_concurrent_upgrades, version-only deploys, and rolling worker reboots.


Troubleshooting checklist

See debugging-deploy.md for a full deploy troubleshooting guide.

| Symptom                              | Likely fix                                                                 |
|--------------------------------------|----------------------------------------------------------------------------|
| worker.json / update_seq mismatch    | maand deploy                                                               |
| Host prerequisite error              | Install ssh/rsync/python3/bun on CLI host                                  |
| Worker prerequisite error            | Install make/python3/rsync on worker; fix sudo                             |
| No allocations for job               | Check selectors vs worker labels; run maand build                          |
| Build resource error                 | Add memory/cpu to workers or lower job limits                              |
| Port collision                       | Remove duplicate port names; maand assigns unique numbers from the pool    |
| ErrPortRangeExhausted                | Widen port_range in bucket.conf or remove unused jobs/ports                |
| ErrInvalidJobVersion                 | Add or fix version on jobs in the dependency graph                         |
| ErrHookDemandVersionMismatch         | Bump upstream job version or relax min_version/max_version                 |
| Upgrade script needs old/new release | Read CURRENT_VERSION / NEW_VERSION in Makefile or hook env (see deploy.md) |

Concept reference: concepts.md