Ops-dashboard/ops-agent/commands.yml.example
Madhura68 ab87c0fada feat(server-backup): restic dual-repo backup (NAS + B2) with dashboard UI
Adds a server-wide backup capability beyond the existing ops_dashboard
pg_dump flow:

- Daily systemd timer (03:30) runs pg_dumpall + Forgejo dump, then restic
  to a local NAS repo and an offsite Backblaze B2 repo with Object Lock.
  Phase-based script with single-instance flock, structured statusfile,
  systemd hardening, and live-datadir excludes (Postgres / Forgejo) so
  the dumps stay authoritative.
- Ops-agent gets nine new read-only/trigger commands (snapshots, stats,
  status, logs, plus two triggers) backed by sudoers-whitelisted wrapper
  scripts that source /etc/restic-backup.env so the agent never sees the
  restic password or B2 keys.
- Two new flows (server_backup_full, server_backup_restore_test) drive
  the dashboard's "Backup now" and "Restore test" buttons.
- /settings/backups gains a Server backup section with overall + per-phase
  status, NAS / B2 snapshot tables, restore-size / raw-data / dedup-ratio
  stats, and the last restore-test result. The existing pg_dump section
  is preserved unchanged.
- Runbook docs/runbooks/server-backup.md follows the tailscale-setup
  pattern (plan + addendum) and covers B2 Object Lock + scoped keys,
  Forgejo subplan with isolated restore-test stack, the off-server
  maintenance flow for B2 prune, and the integrity-check schedule.

Code-only change — installation on scrum4me-srv follows the runbook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 13:03:00 +02:00

# Whitelist of allowed commands for ops-agent.
# Copy to /etc/ops-agent/commands.yml on the host.
# Restart ops-agent after changes.
#
# Schema per command:
#   cmd:          required — command + static args as array (no shell, no interpolation)
#   cwd:          optional — working directory for the subprocess
#   cwd_pattern:  optional — working directory as a glob/pattern (resolved at runtime)
#   args:
#     allowed:    optional — whitelist of argument values accepted from the caller
#                 If absent or empty, the command takes no extra arguments.
#   description:  optional — human-readable description
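#
# Illustrative sketch only (hypothetical entry, not part of the shipped
# whitelist). It combines static args with one caller-supplied value from
# args.allowed. The assumption here is that the agent appends the accepted
# value as the final argv element, as the systemctl_status entry suggests.
#
#   example_service_active:            # hypothetical command name
#     cmd: ["systemctl", "is-active"]
#     args:
#       allowed:
#         - ops-agent
#     description: "Report whether an allowed unit is active"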
commands:
  docker_ps:
    cmd: ["docker", "ps", "--format", "table"]
    description: "List running Docker containers"
  git_status:
    cmd: ["git", "status", "--short", "--branch"]
    cwd_pattern: "/srv/"
    description: "Git status with branch info (first arg = repo path, must start with /srv/)"
  git_log_ahead:
    cmd: ["git", "log", "@{upstream}..HEAD", "--oneline"]
    cwd_pattern: "/srv/"
    description: "Local commits not yet pushed (first arg = repo path)"
  git_diff:
    cmd: ["git", "diff", "HEAD"]
    cwd_pattern: "/srv/"
    description: "Uncommitted diff against HEAD (first arg = repo path)"
  git_fetch:
    cmd: ["git", "fetch", "--quiet"]
    cwd_pattern: "/srv/"
    description: "Fetch all remotes silently (first arg = repo path)"
  systemctl_status:
    cmd: ["systemctl", "status", "--no-pager", "-l"]
    args:
      allowed:
        - scrum4me-web
        - ops-agent
        - caddy
        - docker
        - nginx
        - postgresql
    description: "Show systemctl status for an allowed service"
  journalctl_recent:
    cmd: ["journalctl", "--since", "1 hour ago", "-n", "100", "--no-pager", "-u"]
    args:
      allowed:
        - scrum4me-web
        - ops-agent
        - caddy
        - docker
        - nginx
        - postgresql
    description: "Last 100 journal lines from the past hour for an allowed service"
  caddy_show_config:
    cmd: ["caddy", "fmt", "/etc/caddy/Caddyfile"]
    description: "Print the formatted Caddy config"
  caddy_list_certs:
    cmd:
      - sh
      - -c
      - "for f in /data/caddy/certificates/*/*.crt; do [ -f \"$f\" ] || continue; echo \"CERTFILE:$f\"; openssl x509 -noout -subject -issuer -dates -in \"$f\" 2>&1; echo \"CERTEND\"; done"
    description: "List TLS cert info (subject, issuer, validity dates) from Caddy certificate store"
  # ── Destructive / write commands ──────────────────────────────────────────
  docker_compose_restart:
    cmd: ["docker", "compose", "restart"]
    cwd: "/srv/scrum4me/compose"
    args:
      allowed:
        - scrum4me-web
        - worker-idea
        - ops-dashboard
        - caddy
        - postgres
    description: "Restart a docker compose service (ops-agent user must be in the docker group)"
  docker_compose_stop:
    cmd: ["docker", "compose", "stop"]
    cwd: "/srv/scrum4me/compose"
    args:
      allowed:
        - scrum4me-web
        - worker-idea
        - ops-dashboard
        - caddy
        - postgres
    description: "Stop a docker compose service"
  docker_compose_build:
    cmd: ["docker", "compose", "build"]
    cwd: "/srv/scrum4me/compose"
    args:
      allowed:
        - scrum4me-web
        - worker-idea
        - ops-dashboard
    description: "Build a docker compose service image"
  docker_compose_build_worker_fresh:
    # The worker-idea Dockerfile clones scrum4me-mcp from GitHub in a separate
    # layer. A plain docker compose build reuses that layer as long as
    # MCP_GIT_REF stays the same (= always 'main'), so new MCP commits are
    # NOT picked up. Setting MCP_CACHE_BUST to a fresh timestamp invalidates
    # the clone layer. sh -c is needed to evaluate $(date) (no shell injection:
    # fixed string, no external input).
    cmd:
      - sh
      - -c
      - "docker compose build --build-arg MCP_CACHE_BUST=$(date +%s) worker-idea"
    cwd: "/srv/scrum4me/compose"
    description: "Rebuild worker-idea image, busting the scrum4me-mcp clone cache so the latest MCP code is pulled"
  docker_compose_up:
    cmd: ["docker", "compose", "up", "-d"]
    cwd: "/srv/scrum4me/compose"
    args:
      allowed:
        - scrum4me-web
        - worker-idea
        - ops-dashboard
    description: "Start or recreate a docker compose service in detached mode"
  docker_compose_up_recreate:
    cmd: ["docker", "compose", "up", "-d", "--force-recreate"]
    cwd: "/srv/scrum4me/compose"
    args:
      allowed:
        - scrum4me-web
        - worker-idea
        - ops-dashboard
    description: "Force-recreate a docker compose service (picks up a rebuilt image)"
  git_pull:
    cmd: ["git", "pull", "--ff-only"]
    cwd_pattern: "/srv/"
    preconditions:
      - git_status_clean
    description: "Fast-forward pull — refused when working tree is dirty"
  systemctl_restart:
    # Requires /etc/sudoers.d/ops-agent (see deploy/ops-agent/sudoers).
    cmd: ["sudo", "/usr/bin/systemctl", "restart"]
    args:
      allowed:
        - scrum4me-web
        - ops-agent
        - caddy
    description: "Restart an allowed systemd service via sudo"
  caddy_validate:
    cmd: ["caddy", "validate", "--config", "/srv/scrum4me/caddy/Caddyfile"]
    description: "Validate /srv/scrum4me/caddy/Caddyfile without reloading"
  caddy_reload:
    cmd: ["caddy", "reload", "--config", "/srv/scrum4me/caddy/Caddyfile"]
    description: "Reload Caddy with /srv/scrum4me/caddy/Caddyfile"
  caddy_write_config:
    # Writes stdin to Caddyfile.new first; mv is atomic on the same filesystem.
    # ops-agent user must own /srv/scrum4me/caddy/.
    cmd:
      - sh
      - -c
      - "cat > /srv/scrum4me/caddy/Caddyfile.new && mv /srv/scrum4me/caddy/Caddyfile.new /srv/scrum4me/caddy/Caddyfile"
    stdin_from_body: true
    description: "Atomically replace /srv/scrum4me/caddy/Caddyfile (write stdin to .new, then mv)"
  # ── Smoke tests / health checks ───────────────────────────────────────────
  curl_smoke_scrum4me_web:
    cmd: ["curl", "-sf", "--max-time", "10", "https://scrum4me.com"]
    description: "HTTP smoke test — fails (non-zero) if the site is unreachable or returns a non-2xx status"
  docker_compose_ps_worker:
    cmd: ["docker", "compose", "ps", "--filter", "status=running", "worker-idea"]
    cwd: "/srv/scrum4me/compose"
    description: "Verify worker-idea container is in the running state"
  wait_for_health_worker:
    cmd:
      - sh
      - -c
      - "timeout 60 sh -c 'until grep -q \"pre-flight passed\" /var/log/agent/current 2>/dev/null; do sleep 3; done && echo \"pre-flight passed\"'"
    description: "Wait up to 60s for MCP worker pre-flight check (/var/log/agent/current)"
  # ── Scrum4Me web deployment steps ────────────────────────────────────────
  npm_ci:
    cmd: ["npm", "ci"]
    cwd: "/srv/scrum4me/repos/Scrum4Me"
    description: "Clean-install dependencies from the lockfile for Scrum4Me web (npm ci)"
  prisma_migrate_deploy:
    cmd: ["npx", "prisma", "migrate", "deploy"]
    cwd: "/srv/scrum4me/repos/Scrum4Me"
    description: "Apply pending Prisma migrations for Scrum4Me web"
  npm_run_build:
    cmd: ["npm", "run", "build"]
    cwd: "/srv/scrum4me/repos/Scrum4Me"
    description: "Build the Scrum4Me web application (next build)"
  curl_smoke_scrum4me_thuis:
    cmd:
      - sh
      - -c
      - "code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 15 https://thuis.jp-visser.nl/api/products); echo \"HTTP $code\"; [ \"$code\" = \"200\" ] || [ \"$code\" = \"401\" ]"
    description: "Smoke test: /api/products must return 200 or 401"
  # ── Ops-dashboard database backup ────────────────────────────────────────
  pg_dump_ops_db:
    cmd:
      - sh
      - -c
      - |
        mkdir -p /srv/ops/backups
        FNAME="/srv/ops/backups/ops_db_$(date +%Y%m%d_%H%M).dump"
        docker exec postgres pg_dump -Fc ops_dashboard > "$FNAME"
        echo "Backup written: $FNAME"
        ls -lh "$FNAME"
    description: "Dump ops_dashboard DB via docker exec postgres to /srv/ops/backups/"
  list_ops_backups:
    cmd:
      - sh
      - -c
      - "find /srv/ops/backups -maxdepth 1 -name '*.dump' -printf '%f\\t%s\\n' 2>/dev/null | sort -r || true"
    description: "List ops_dashboard backup files (filename TAB size_bytes, newest-first)"
  cleanup_ops_backups:
    cmd:
      - find
      - /srv/ops/backups
      - -name
      - "*.dump"
      - -mtime
      - "+30"
      - -delete
      - -print
    description: "Delete ops_dashboard backup files older than 30 days"
  # ── Server-wide backup (restic + NAS + B2) ────────────────────────────────
  # All wrappers live under /srv/backups/scripts/wrappers/ and read
  # /etc/restic-backup.env (mode 0600 root:root) which the ops-agent user
  # cannot read directly — hence the sudo prefix. See deploy/ops-agent/sudoers
  # for the corresponding NOPASSWD entries.
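  #
  # Sketch of how a further wrapper-backed command might be registered. The
  # entry and script name below are hypothetical and not part of this file.
  # sudo -n makes sudo fail immediately instead of prompting for a password,
  # so a missing NOPASSWD rule shows up as an error rather than a hung
  # subprocess.
  #
  #   restic_check_nas:                # hypothetical command name
  #     # restic-check.sh is a hypothetical wrapper script
  #     cmd: ["sudo", "-n", "/srv/backups/scripts/wrappers/restic-check.sh", "nas"]
  #     description: "Run an integrity check against the NAS repo"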
  read_backup_status:
    cmd: ["sudo", "-n", "/srv/backups/scripts/wrappers/read-status.sh"]
    description: "Read /srv/backups/status/last-run.json + last-restore-test.json (JSON)"
  restic_snapshots_nas:
    cmd: ["sudo", "-n", "/srv/backups/scripts/wrappers/restic-snapshots.sh", "nas"]
    description: "Restic snapshots from the NAS repo (JSON array, newest first)"
  restic_snapshots_b2:
    cmd: ["sudo", "-n", "/srv/backups/scripts/wrappers/restic-snapshots.sh", "b2"]
    description: "Restic snapshots from the B2 repo (JSON array, newest first)"
  restic_stats_nas:
    cmd: ["sudo", "-n", "/srv/backups/scripts/wrappers/restic-stats.sh", "nas"]
    description: "Restic stats for the NAS repo (restore-size + raw-data + dedup ratio)"
  restic_stats_b2:
    cmd: ["sudo", "-n", "/srv/backups/scripts/wrappers/restic-stats.sh", "b2"]
    description: "Restic stats for the B2 repo (restore-size + raw-data + dedup ratio)"
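  list_backup_logs:
    # grep . ends the pipeline so it exits non-zero when nothing is listed;
    # without it the fallback echo never runs, because head always exits 0.
    cmd:
      - sh
      - -c
      - "ls -lt /srv/backups/logs/*.log 2>/dev/null | head -10 | grep . || echo 'no logs yet'"
    description: "List the 10 most recent server-backup logs"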
  tail_backup_log_today:
    cmd:
      - sh
      - -c
      - "f=/srv/backups/logs/server-backup-$(date +%F).log; [ -f \"$f\" ] && tail -200 \"$f\" || echo 'no log for today'"
    description: "Tail the last 200 lines of today's server-backup log"
  trigger_server_backup:
    cmd: ["sudo", "-n", "/srv/backups/scripts/wrappers/trigger-backup.sh"]
    description: "Trigger server-backup.service ad-hoc (refuses if already running)"
  trigger_restore_test:
    cmd: ["sudo", "-n", "/srv/backups/scripts/wrappers/trigger-restore-test.sh", "nas"]
    description: "Run restore-test.sh against the NAS repo (non-destructive, writes /tmp/restore-test/)"