feat(server-backup): restic dual-repo backup (NAS + B2) with dashboard UI

Adds a server-wide backup capability beyond the existing ops_dashboard
pg_dump flow:

- Daily systemd timer (03:30) runs pg_dumpall + Forgejo dump, then restic
  to a local NAS repo and an offsite Backblaze B2 repo with Object Lock.
  Phase-based script with single-instance flock, structured statusfile,
  systemd hardening, and live-datadir excludes (Postgres / Forgejo) so
  the dumps stay authoritative.
- Ops-agent gets nine new read-only/trigger commands (snapshots, stats,
  status, logs, plus two triggers) backed by sudoers-whitelisted wrapper
  scripts that source /etc/restic-backup.env so the agent never sees the
  restic password or B2 keys.
- Two new flows (server_backup_full, server_backup_restore_test) drive
  the dashboard's "Backup now" and "Restore test" buttons.
- /settings/backups gains a Server backup section with overall + per-phase
  status, NAS / B2 snapshot tables, restore-size / raw-data / dedup-ratio
  stats, and the last restore-test result. The existing pg_dump section
  is preserved unchanged.
- Runbook docs/runbooks/server-backup.md follows the tailscale-setup
  pattern (plan + addendum) and covers B2 Object Lock + scoped keys,
  Forgejo subplan with isolated restore-test stack, the off-server
  maintenance flow for B2 prune, and the integrity-check schedule.

Code-only change — installation on scrum4me-srv follows the runbook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@@ -0,0 +1,126 @@
# Server backup — deploy artifacts

Daily server-wide backup with restic to **NAS** (local) and **Backblaze B2** (offsite, Object Lock). Includes a structured statusfile that the ops dashboard can read.

The full description — prerequisites, B2 keys, Object Lock, Forgejo restore test, integrity-check schedule — lives in [`docs/runbooks/server-backup.md`](../../docs/runbooks/server-backup.md).

## Files

| File | Purpose | Location on host |
|---|---|---|
| `server-backup.sh` | main script (phase-based, flock, statusfile) | `/srv/backups/scripts/server-backup.sh` |
| `restore-test.sh` | restore latest snapshot + check critical files | `/srv/backups/scripts/restore-test.sh` |
| `server-backup.service` | systemd oneshot | `/etc/systemd/system/server-backup.service` |
| `server-backup.timer` | daily 03:30 + 10 min jitter | `/etc/systemd/system/server-backup.timer` |
| `restic-backup.env.example` | env template (repos, B2 keys, Forgejo) | copy to `/etc/restic-backup.env` |

Additionally to be created by hand (not in this repo, because it is a secret):

- `/etc/restic-backup.password` — the restic password only (mode `0400 root:root`).

## Quick install (see the runbook for full context)
```bash
# 1. Tools and directories
sudo apt update && sudo apt install -y restic jq
sudo mkdir -p /srv/backups/scripts /srv/backups/logs /srv/backups/status \
  /var/backups/databases
sudo chmod 0750 /srv/backups/logs /srv/backups/status

# 2. Install the scripts
sudo cp deploy/server-backup/server-backup.sh /srv/backups/scripts/
sudo cp deploy/server-backup/restore-test.sh /srv/backups/scripts/
sudo chmod 0750 /srv/backups/scripts/*.sh
sudo chown root:root /srv/backups/scripts/*.sh

# 3. Env + password
sudo cp deploy/server-backup/restic-backup.env.example /etc/restic-backup.env
sudo chmod 0600 /etc/restic-backup.env
sudo chown root:root /etc/restic-backup.env
# Generate the password — ALSO keep a copy in your password manager.
sudo sh -c 'openssl rand -hex 24 > /etc/restic-backup.password'
sudo chmod 0400 /etc/restic-backup.password

# 4. Fill in /etc/restic-backup.env (RESTIC_REPO_NAS, RESTIC_REPO_B2,
#    B2_ACCOUNT_ID, B2_ACCOUNT_KEY, FORGEJO_*). See runbook parts A+B.

# 5. Initialise the repos (see runbook part C for Object Lock + key capabilities)
sudo -E bash -c 'set -a; . /etc/restic-backup.env; set +a; \
  export RESTIC_PASSWORD_FILE=/etc/restic-backup.password; \
  restic -r "$RESTIC_REPO_NAS" init && \
  restic -r "$RESTIC_REPO_B2" init'

# 6. Systemd
sudo cp deploy/server-backup/server-backup.service /etc/systemd/system/
sudo cp deploy/server-backup/server-backup.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now server-backup.timer
systemctl list-timers | grep server-backup

# 7. First run by hand (follow along via journalctl)
sudo systemctl start server-backup.service
journalctl -u server-backup.service -f
```
## Verify

```bash
# Statusfile
sudo jq . /srv/backups/status/last-run.json

# Snapshots
sudo -E bash -c 'set -a; . /etc/restic-backup.env; set +a; \
  export RESTIC_PASSWORD_FILE=/etc/restic-backup.password; \
  restic -r "$RESTIC_REPO_NAS" snapshots; \
  restic -r "$RESTIC_REPO_B2" snapshots'

# Restore test (NAS, non-destructive — restores to /tmp/restore-test)
sudo /srv/backups/scripts/restore-test.sh nas
sudo jq . /srv/backups/status/last-restore-test.json
```
## Statusfile schema

The script writes `/srv/backups/status/last-run.json` after every run (success or failure), atomically via a temp file + `mv`. The ops dashboard reads this file through `read_backup_status` (see `ops-agent/commands.yml.example`).
```json
{
  "schema_version": 1,
  "overall_status": "success | partial_failure | failed",
  "started_at": "2026-05-15T03:30:00+02:00",
  "completed_at": "2026-05-15T03:48:21+02:00",
  "duration_seconds": 1101,
  "host": "scrum4me-srv",
  "phases": {
    "postgres_dump": { "status": "success", "exit_code": 0, "...": "..." },
    "forgejo_dump": { "status": "skipped", "exit_code": 99, "...": "..." },
    "forgejo_db_dump": { "status": "skipped", "exit_code": 99 },
    "restic_nas": { "status": "success", "exit_code": 0, "snapshot_id": "abc123" },
    "restic_b2": { "status": "degraded", "exit_code": 3, "error": "1 file unreadable" },
    "forget_nas": { "status": "success", "exit_code": 0 },
    "check_nas": { "status": "success", "exit_code": 0 },
    "check_b2": { "status": "success", "exit_code": 0 }
  }
}
```
Per-phase `status` values:

| status | meaning | counts as |
|---|---|---|
| `success` | exit 0 | success |
| `skipped` | exit 99 — phase not applicable (e.g. Forgejo not installed) | success |
| `degraded` | exit 3 — the restic snapshot was created but some files were unreadable | partial_failure |
| `failed` | any other non-zero exit | partial_failure or failed (see `overall_status`) |
| `pending` | phase never ran (script aborted before this phase) | partial_failure |

`overall_status` rules:

- **`failed`** if `postgres_dump` fails (the DB dump is authoritative), or if **both** restic repos fail.
- **`partial_failure`** on any non-critical `failed` or `degraded` phase (e.g. one restic repo down, or forgejo_dump failing while postgres succeeds).
- **`success`** if no phase is `failed` or `degraded`.
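
A run can be summarised straight from the statusfile with `jq`; a minimal sketch against the schema above (the exact output format is illustrative):

```bash
# Print the overall status plus every phase that is not success/skipped,
# e.g. "partial_failure (restic_b2=degraded)".
jq -r '
  .overall_status as $o
  | [ .phases | to_entries[]
      | select(.value.status != "success" and .value.status != "skipped")
      | "\(.key)=\(.value.status)" ]
  | $o + (if length > 0 then " (" + join(", ") + ")" else "" end)
' /srv/backups/status/last-run.json
```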

## Ordering relative to the existing `ops-db-backup.timer`

The existing `deploy/ops-agent/ops-db-backup.timer` runs at **02:00** and only does a `pg_dump` of `ops_dashboard` into `/srv/ops/backups/`. The new `server-backup.timer` runs at **03:30** and sweeps that directory into its restic backup. Both continue to run side by side.
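
To confirm the ordering on the host, both timers can be listed together (an illustrative check, not part of the install):

```bash
systemctl list-timers --all 'ops-db-backup.timer' 'server-backup.timer'
```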

@@ -0,0 +1,44 @@
# Copy to /etc/restic-backup.env on the host. Permissions: 0600 root:root.
# RESTIC_PASSWORD lives in /etc/restic-backup.password (mode 0400 root:root)
# — the backup script sets RESTIC_PASSWORD_FILE from there, so the password
# never appears in the process listing or this env file.
# ── Restic repositories ────────────────────────────────────────────────────
# Local NAS path (must be mounted before the timer fires; see runbook).
RESTIC_REPO_NAS=/mnt/backup-server/restic/scrum4me-srv
# Backblaze B2 repo, format: b2:<bucket-name>:<prefix>
# Bucket must have Object Lock (Governance) with default retention >= 30 days.
RESTIC_REPO_B2=b2:scrum4me-srv-backup:scrum4me-srv
# ── Backblaze B2 server key ────────────────────────────────────────────────
# Capabilities REQUIRED:  listBuckets, listFiles, readFiles, writeFiles
# Capabilities FORBIDDEN: deleteFiles, deleteKeys, bypassGovernance
# Create with (b2 CLI v4 syntax; older CLI versions use `b2 create-key`):
#   b2 key create \
#     --bucket scrum4me-srv-backup \
#     --name-prefix scrum4me-srv \
#     server-backup-key \
#     listBuckets,listFiles,readFiles,writeFiles
B2_ACCOUNT_ID=REPLACE_WITH_B2_KEY_ID
B2_ACCOUNT_KEY=REPLACE_WITH_B2_APPLICATION_KEY
# ── Forgejo backup target (optional) ───────────────────────────────────────
# Container name as it appears in `docker ps`. Set to "" or comment out to
# skip the Forgejo phases entirely if Forgejo is not deployed.
FORGEJO_CONTAINER=forgejo
# Path to app.ini INSIDE the Forgejo container (used by `forgejo dump -c`).
FORGEJO_CONFIG=/data/gitea/conf/app.ini
# Postgres database name for Forgejo (empty = use SQLite, skip forgejo_db_dump).
FORGEJO_DB_NAME=forgejo
# Postgres container + role for Forgejo's DB (defaults match scrum4me stack).
FORGEJO_DB_CONTAINER=scrum4me-postgres
FORGEJO_DB_USER=scrum4me
# ── Scrum4Me Postgres (required for postgres_dump phase) ───────────────────
PG_CONTAINER=scrum4me-postgres
PG_DUMPALL_USER=scrum4me
# ── Optional bandwidth limit for restic B2 upload (KiB/s; 0 = unlimited) ──
# Translated by the script into `restic --limit-upload "$BACKUP_LIMIT_UPLOAD_KIB"`.
# BACKUP_LIMIT_UPLOAD_KIB=5000

@@ -0,0 +1,177 @@
#!/usr/bin/env bash
# Restore the latest restic snapshot to /tmp/restore-test/ and assert that a
# small set of critical files came back intact. Used by the monthly maintenance
# check and by the dashboard's "Restore test" button.
#
# Usage:
#   restore-test.sh [nas|b2]
#
# Default repo is "nas" (faster, no B2 download fees).
umask 077
set -uo pipefail
REPO_LABEL="${1:-nas}"
RESTORE_DIR="${RESTORE_DIR:-/tmp/restore-test}"
RESTIC_PASSWORD_FILE_PATH="${RESTIC_PASSWORD_FILE_PATH:-/etc/restic-backup.password}"
STATUS_FILE="${STATUS_FILE:-/srv/backups/status/last-restore-test.json}"
STATUS_DIR="$(dirname "$STATUS_FILE")"
STARTED_AT="$(date -Is)"
SECONDS=0
# Load env (idempotent: ok if already in environment).
if [ -z "${RESTIC_REPO_NAS:-}" ] && [ -r /etc/restic-backup.env ]; then
# shellcheck disable=SC1091
set -a; . /etc/restic-backup.env; set +a
fi
case "$REPO_LABEL" in
nas) REPO="${RESTIC_REPO_NAS:?RESTIC_REPO_NAS not set}" ;;
b2) REPO="${RESTIC_REPO_B2:?RESTIC_REPO_B2 not set}" ;;
*) echo "ERROR: repo label must be 'nas' or 'b2', got '$REPO_LABEL'" >&2; exit 2 ;;
esac
if [ ! -r "$RESTIC_PASSWORD_FILE_PATH" ]; then
echo "ERROR: restic password file $RESTIC_PASSWORD_FILE_PATH not readable" >&2
exit 1
fi
export RESTIC_PASSWORD_FILE="$RESTIC_PASSWORD_FILE_PATH"
for tool in jq restic; do
command -v "$tool" >/dev/null 2>&1 || { echo "ERROR: '$tool' not on PATH" >&2; exit 1; }
done
mkdir -p "$STATUS_DIR"
chmod 0750 "$STATUS_DIR"
echo "════════════════════════════════════════════════════════════════"
echo " Restore test — started $STARTED_AT"
echo " Repo: $REPO_LABEL ($REPO)"
echo " Target: $RESTORE_DIR"
echo "════════════════════════════════════════════════════════════════"
# Clean previous attempt to keep results unambiguous.
rm -rf "$RESTORE_DIR"
mkdir -p "$RESTORE_DIR"
# Find latest snapshot id.
SNAPSHOT_ID=$(restic -r "$REPO" snapshots --json --latest 1 2>/dev/null \
| jq -r '.[0].short_id // .[0].id // empty')
if [ -z "$SNAPSHOT_ID" ]; then
echo "ERROR: no snapshots found in $REPO_LABEL repo"
jq -n \
--arg started "$STARTED_AT" \
--arg completed "$(date -Is)" \
--argjson duration "$SECONDS" \
--arg repo "$REPO_LABEL" \
'{
schema_version: 1,
overall_status: "failed",
started_at: $started,
completed_at: $completed,
duration_seconds: $duration,
repo: $repo,
snapshot_id: null,
error: "no snapshots in repo",
assertions: []
}' > "$STATUS_FILE"
chmod 0644 "$STATUS_FILE"
exit 1
fi
echo "Restoring snapshot $SNAPSHOT_ID"
RESTORE_RC=0
restic -r "$REPO" restore "$SNAPSHOT_ID" --target "$RESTORE_DIR" || RESTORE_RC=$?
if [ "$RESTORE_RC" -ne 0 ]; then
echo "ERROR: restic restore exited $RESTORE_RC"
fi
# Assertions: each is a path that MUST exist and be non-empty.
# Adjust to your stack after first run (and update the runbook addendum).
ASSERTION_PATHS=(
"$RESTORE_DIR/srv/scrum4me/compose/docker-compose.yml"
"$RESTORE_DIR/srv/scrum4me/caddy/Caddyfile"
"$RESTORE_DIR/etc/restic-backup.env"
)
# Latest postgres dump — match the newest file (glob may resolve to zero).
shopt -s nullglob
PG_DUMPS=("$RESTORE_DIR/var/backups/databases/"postgres-*.sql.gz)
shopt -u nullglob
if [ "${#PG_DUMPS[@]}" -gt 0 ]; then
# pick lexicographic last (= newest date, ISO format)
LATEST_PG="${PG_DUMPS[-1]}"
ASSERTION_PATHS+=("$LATEST_PG")
fi
ASSERTIONS_JSON='[]'
ANY_FAILED=0
for p in "${ASSERTION_PATHS[@]}"; do
if [ -s "$p" ]; then
status="ok"
bytes=$(stat -c %s "$p")
echo "$p ($bytes bytes)"
elif [ -e "$p" ]; then
status="empty"
bytes=0
ANY_FAILED=1
echo "$p (exists but empty)"
else
status="missing"
bytes=0
ANY_FAILED=1
echo "$p (missing)"
fi
ASSERTIONS_JSON=$(jq -c \
--arg path "$p" \
--arg status "$status" \
--argjson bytes "$bytes" \
'. + [{path: $path, status: $status, bytes: $bytes}]' \
<<< "$ASSERTIONS_JSON")
done
if [ "$RESTORE_RC" -ne 0 ]; then
OVERALL="failed"
elif [ "$ANY_FAILED" -ne 0 ]; then
OVERALL="partial_failure"
else
OVERALL="success"
fi
jq -n \
--arg started "$STARTED_AT" \
--arg completed "$(date -Is)" \
--argjson duration "$SECONDS" \
--arg repo "$REPO_LABEL" \
--arg snapshot "$SNAPSHOT_ID" \
--arg overall "$OVERALL" \
--arg target "$RESTORE_DIR" \
--argjson restore_exit "$RESTORE_RC" \
--argjson assertions "$ASSERTIONS_JSON" \
'{
schema_version: 1,
overall_status: $overall,
started_at: $started,
completed_at: $completed,
duration_seconds: $duration,
repo: $repo,
snapshot_id: $snapshot,
restore_exit_code: $restore_exit,
target: $target,
assertions: $assertions
}' > "$STATUS_FILE"
chmod 0644 "$STATUS_FILE"
echo ""
echo "════════════════════════════════════════════════════════════════"
echo " Restore test — finished $(date -Is)"
echo " Overall: $OVERALL"
echo " Status file: $STATUS_FILE"
echo "════════════════════════════════════════════════════════════════"
case "$OVERALL" in
success) exit 0 ;;
partial_failure) exit 75 ;;
failed|*) exit 1 ;;
esac

@@ -0,0 +1,33 @@
[Unit]
Description=Server-wide backup (pg_dumpall + restic to NAS + B2)
Documentation=file:///srv/ops/repos/ops-dashboard/docs/runbooks/server-backup.md
After=network-online.target docker.service
Wants=network-online.target
[Service]
Type=oneshot
EnvironmentFile=/etc/restic-backup.env
ExecStart=/srv/backups/scripts/server-backup.sh
TimeoutStartSec=4h
RuntimeMaxSec=6h
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=7
# Sandboxing — backup needs root for /etc + docker exec, but limit the rest.
ProtectSystem=strict
ReadWritePaths=/var/backups /srv/backups /run /tmp
ProtectHome=read-only
NoNewPrivileges=yes
PrivateTmp=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
ProtectControlGroups=yes
StandardOutput=journal
StandardError=journal
SyslogIdentifier=server-backup
# Exit code semantics from server-backup.sh:
# 0 = success (all phases ok)
# 75 = partial_failure (some non-critical phase failed/degraded)
# 1 = failed (a critical dump phase failed or both restic repos failed)
SuccessExitStatus=75

@@ -0,0 +1,497 @@
#!/usr/bin/env bash
# Daily server-wide backup: dumps databases, runs restic to NAS + B2,
# writes a structured statusfile that the ops-dashboard can read.
#
# Install:
# cp deploy/server-backup/server-backup.sh /srv/backups/scripts/server-backup.sh
# chmod 0750 /srv/backups/scripts/server-backup.sh
# chown root:root /srv/backups/scripts/server-backup.sh
#
# Requires: bash, jq, flock, restic, docker, gzip. See runbook for setup.
umask 077
set -uo pipefail
# ── Configuration ──────────────────────────────────────────────────────────
STATUS_DIR="${STATUS_DIR:-/srv/backups/status}"
LOG_DIR="${LOG_DIR:-/srv/backups/logs}"
DB_DUMP_DIR="${DB_DUMP_DIR:-/var/backups/databases}"
RESTIC_PASSWORD_FILE_PATH="${RESTIC_PASSWORD_FILE_PATH:-/etc/restic-backup.password}"
LOCKFILE="${LOCKFILE:-/run/server-backup.lock}"
RUN_DATE="$(date +%F)"
STARTED_AT="$(date -Is)"
SECONDS=0
# Phase order — must match write_status_json + determine_exit_code expectations.
PHASE_ORDER=(
postgres_dump
forgejo_dump
forgejo_db_dump
restic_nas
restic_b2
forget_nas
check_nas
check_b2
)
declare -A PHASE_STATUS PHASE_EXIT PHASE_START PHASE_END PHASE_ERR PHASE_EXTRA
OVERALL_STATUS="unknown"
# ── Single-instance lock ───────────────────────────────────────────────────
exec 9>"$LOCKFILE" || { echo "ERROR: cannot open lockfile $LOCKFILE" >&2; exit 1; }
if ! flock -n 9; then
echo "ERROR: another server-backup is already running (lock $LOCKFILE held)" >&2
exit 75
fi
# ── Env + secret loading ───────────────────────────────────────────────────
# When invoked via systemd, EnvironmentFile=/etc/restic-backup.env has already
# been loaded. When invoked manually for testing, source it ourselves.
if [ -z "${RESTIC_REPO_NAS:-}" ] && [ -r /etc/restic-backup.env ]; then
# shellcheck disable=SC1091
set -a; . /etc/restic-backup.env; set +a
fi
: "${RESTIC_REPO_NAS:?RESTIC_REPO_NAS not set (see /etc/restic-backup.env)}"
: "${RESTIC_REPO_B2:?RESTIC_REPO_B2 not set (see /etc/restic-backup.env)}"
if [ ! -r "$RESTIC_PASSWORD_FILE_PATH" ]; then
echo "ERROR: restic password file $RESTIC_PASSWORD_FILE_PATH not readable" >&2
exit 1
fi
export RESTIC_PASSWORD_FILE="$RESTIC_PASSWORD_FILE_PATH"
# Required tooling
for tool in jq restic docker gzip flock; do
if ! command -v "$tool" >/dev/null 2>&1; then
echo "ERROR: required tool '$tool' not on PATH" >&2
exit 1
fi
done
# ── Logging ────────────────────────────────────────────────────────────────
mkdir -p "$LOG_DIR" "$STATUS_DIR" "$DB_DUMP_DIR"
chmod 0750 "$LOG_DIR" "$STATUS_DIR"
LOG_FILE="$LOG_DIR/server-backup-$RUN_DATE.log"
# Mirror everything to LOG_FILE and the journal.
exec > >(tee -a "$LOG_FILE") 2>&1
echo "════════════════════════════════════════════════════════════════"
echo " Server backup — started $STARTED_AT"
echo " Host: $(hostname)"
echo " NAS repo: $RESTIC_REPO_NAS"
echo " B2 repo: $RESTIC_REPO_B2"
echo "════════════════════════════════════════════════════════════════"
# ── Phase runner ───────────────────────────────────────────────────────────
# Runs the function passed as first arg, captures stdout+stderr into a phase
# buffer, records status / exit_code / timestamps / error tail.
run_phase() {
local name="$1"; shift
local phase_buf
phase_buf=$(mktemp -t "backup-phase-${name}.XXXXXX")
echo ""
echo "─── phase: $name ─── $(date -Is)"
PHASE_START[$name]=$(date -Is)
local rc=0
# Run in a sub-shell so set -e inside callees doesn't kill us.
(
"$@"
) 2>&1 | tee "$phase_buf"
rc=${PIPESTATUS[0]}
PHASE_EXIT[$name]=$rc
case "$rc" in
0) PHASE_STATUS[$name]=success ;;
3) PHASE_STATUS[$name]=degraded ;; # restic: snapshot created but some files unreadable
99) PHASE_STATUS[$name]=skipped ;; # our convention for "not applicable"
*) PHASE_STATUS[$name]=failed ;;
esac
if [ "$rc" -ne 0 ] && [ "$rc" -ne 99 ] && [ -s "$phase_buf" ]; then
# Keep last few non-empty lines as a compact error summary.
PHASE_ERR[$name]=$(tail -n 5 "$phase_buf" | tr '\n' ' ' | head -c 500)
fi
PHASE_END[$name]=$(date -Is)
rm -f "$phase_buf"
echo "─── end $name (exit=$rc, status=${PHASE_STATUS[$name]})"
}
# Convention: a phase function returns 99 to mark itself "skipped" — the
# overall outcome treats this as success.
SKIPPED=99
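# Example: a phase function ends with `return "$SKIPPED"` when its target is
# absent (see dump_forgejo below).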
# ── Phase 1: pg_dumpall (Scrum4Me Postgres cluster) ────────────────────────
dump_postgres_all() {
local pg_container="${PG_CONTAINER:-scrum4me-postgres}"
local pg_user="${PG_DUMPALL_USER:-scrum4me}"
if ! docker ps --format '{{.Names}}' | grep -qx "$pg_container"; then
echo "Postgres container '$pg_container' not running — cannot continue."
return 1
fi
local tmp="$DB_DUMP_DIR/.postgres-$RUN_DATE.sql.gz.tmp"
local final="$DB_DUMP_DIR/postgres-$RUN_DATE.sql.gz"
rm -f "$tmp"
# pipefail is already set globally (top of script), so $? reflects the whole
# pipeline; no local set -o/+o toggling, which would clobber the global flag.
docker exec "$pg_container" pg_dumpall -U "$pg_user" --clean --if-exists \
| gzip -c > "$tmp"
local rc=$?
if [ "$rc" -ne 0 ]; then
rm -f "$tmp"
return "$rc"
fi
mv "$tmp" "$final"
chmod 0640 "$final"
local bytes
bytes=$(stat -c %s "$final" 2>/dev/null || echo 0)
PHASE_EXTRA[postgres_dump]="output_file=$final;bytes=$bytes"
echo "wrote $final ($bytes bytes)"
}
# ── Phase 2: Forgejo dump (filesystem + repos) ─────────────────────────────
dump_forgejo() {
local fj="${FORGEJO_CONTAINER:-}"
if [ -z "$fj" ]; then
echo "FORGEJO_CONTAINER unset — skipping Forgejo dump."
return "$SKIPPED"
fi
if ! docker ps --format '{{.Names}}' | grep -qx "$fj"; then
echo "Forgejo container '$fj' not running — skipping."
return "$SKIPPED"
fi
local config="${FORGEJO_CONFIG:-/data/gitea/conf/app.ini}"
local tmp="$DB_DUMP_DIR/.forgejo-$RUN_DATE.zip.tmp"
local final="$DB_DUMP_DIR/forgejo-$RUN_DATE.zip"
rm -f "$tmp"
# `forgejo dump -f -` streams the zip to stdout. We run as the `git` user
# inside the container (standard Forgejo image convention).
# Plain redirection, not a pipeline; no pipefail handling needed here.
docker exec -u git "$fj" forgejo dump --skip-db -c "$config" --type zip -f - > "$tmp"
local rc=$?
if [ "$rc" -ne 0 ]; then
rm -f "$tmp"
return "$rc"
fi
mv "$tmp" "$final"
chmod 0640 "$final"
local bytes
bytes=$(stat -c %s "$final" 2>/dev/null || echo 0)
PHASE_EXTRA[forgejo_dump]="output_file=$final;bytes=$bytes"
echo "wrote $final ($bytes bytes)"
}
# ── Phase 3: Forgejo Postgres DB dump (authoritative for DB restore) ───────
dump_forgejo_db() {
local db_name="${FORGEJO_DB_NAME:-}"
if [ -z "$db_name" ]; then
echo "FORGEJO_DB_NAME unset — skipping Forgejo DB dump (assume SQLite)."
return "$SKIPPED"
fi
local db_container="${FORGEJO_DB_CONTAINER:-scrum4me-postgres}"
local db_user="${FORGEJO_DB_USER:-scrum4me}"
if ! docker ps --format '{{.Names}}' | grep -qx "$db_container"; then
echo "DB container '$db_container' not running — skipping Forgejo DB dump."
return "$SKIPPED"
fi
local tmp="$DB_DUMP_DIR/.forgejo-db-$RUN_DATE.sql.gz.tmp"
local final="$DB_DUMP_DIR/forgejo-db-$RUN_DATE.sql.gz"
rm -f "$tmp"
# pipefail is set globally, so $? reflects the whole pipeline.
docker exec "$db_container" pg_dump -U "$db_user" --clean --if-exists "$db_name" \
| gzip -c > "$tmp"
local rc=$?
if [ "$rc" -ne 0 ]; then
rm -f "$tmp"
return "$rc"
fi
mv "$tmp" "$final"
chmod 0640 "$final"
local bytes
bytes=$(stat -c %s "$final" 2>/dev/null || echo 0)
PHASE_EXTRA[forgejo_db_dump]="output_file=$final;bytes=$bytes"
echo "wrote $final ($bytes bytes)"
}
# ── Phases 4 + 5: restic backup to NAS / B2 ────────────────────────────────
# Live Docker datadirs are excluded — dumps (above) are the authoritative
# restore source for Postgres and Forgejo.
RESTIC_BACKUP_PATHS=(
/etc
/home/janpeter
/root
/opt
/srv
/usr/local/bin
"$DB_DUMP_DIR"
/srv/ops/backups
)
RESTIC_EXCLUDES=(
--exclude='**/node_modules'
--exclude='**/.next/cache'
--exclude='**/.cache'
--exclude='**/.git/objects/pack'
--exclude='/srv/backups/logs'
--exclude='/tmp'
--exclude='/var/tmp'
--exclude='/srv/scrum4me/postgres' # live Postgres datadir — non-authoritative
--exclude='/srv/forgejo/data/git' # live Forgejo git objects — non-authoritative
--exclude='/srv/forgejo/data/lfs'
--exclude='/srv/forgejo/data/queues'
)
restic_backup_to() {
local repo="$1"; local label="$2"
local extra_args=()
if [ "$label" = "b2" ] && [ -n "${BACKUP_LIMIT_UPLOAD_KIB:-}" ]; then
extra_args+=(--limit-upload "$BACKUP_LIMIT_UPLOAD_KIB")
fi
# Capture restic JSON output so we can extract the snapshot id.
local json_out
json_out=$(mktemp -t "restic-backup-${label}.XXXXXX.json")
# Deliberately no --skip-if-unchanged: restic then records a snapshot on every
# run, so the dashboard sees a fresh daily entry even when nothing changed.
restic -r "$repo" backup \
--tag scheduled \
--tag "host=$(hostname)" \
--json \
"${extra_args[@]}" \
"${RESTIC_EXCLUDES[@]}" \
"${RESTIC_BACKUP_PATHS[@]}" \
| tee "$json_out"
local rc=${PIPESTATUS[0]}
# Extract snapshot id from the final summary line (last JSON object of type=summary).
local snap
snap=$(jq -rs 'map(select(.message_type=="summary")) | last | .snapshot_id // empty' < "$json_out" 2>/dev/null || true)
local files_new
files_new=$(jq -rs 'map(select(.message_type=="summary")) | last | .files_new // empty' < "$json_out" 2>/dev/null || true)
local data_added
data_added=$(jq -rs 'map(select(.message_type=="summary")) | last | .data_added // empty' < "$json_out" 2>/dev/null || true)
if [ -n "$snap" ]; then
PHASE_EXTRA["restic_$label"]="snapshot_id=$snap;files_new=${files_new:-0};data_added_bytes=${data_added:-0}"
fi
rm -f "$json_out"
return "$rc"
}
# ── Phase 6: prune NAS only (B2 is Object Lock — pruning runs off-server) ──
restic_forget_nas() {
restic -r "$RESTIC_REPO_NAS" forget \
--keep-daily 7 \
--keep-weekly 4 \
--keep-monthly 12 \
--prune
}
# ── Phase 7: integrity check (light daily; weekly read-data-subset on Sun) ─
is_sunday() {
[ "$(date +%u)" = "7" ]
}
restic_check_nas() {
if is_sunday; then
restic -r "$RESTIC_REPO_NAS" check --read-data-subset=2.5%
else
restic -r "$RESTIC_REPO_NAS" check
fi
}
restic_check_b2() {
if is_sunday; then
# On B2 a read-data-subset costs bandwidth + B2 download fees. Keep the
# subset tiny on Sundays; deeper checks run monthly off-server.
restic -r "$RESTIC_REPO_B2" check --read-data-subset=1%
else
restic -r "$RESTIC_REPO_B2" check
fi
}
# ── Statusfile writer ──────────────────────────────────────────────────────
# Builds a structured JSON statusfile in /srv/backups/status/last-run.json
# atomically (write to tmp, then mv).
write_status_json() {
local tmpfile
tmpfile=$(mktemp -t "backup-status.XXXXXX.json")
# Build the phases object incrementally with jq for safe escaping.
local phases_json='{}'
local name status exit_code started ended err extra
local snapshot_id files_new data_added output_file bytes
for name in "${PHASE_ORDER[@]}"; do
status="${PHASE_STATUS[$name]:-pending}"
exit_code="${PHASE_EXIT[$name]:-}"
started="${PHASE_START[$name]:-}"
ended="${PHASE_END[$name]:-}"
err="${PHASE_ERR[$name]:-}"
extra="${PHASE_EXTRA[$name]:-}"
snapshot_id=""
files_new=""
data_added=""
output_file=""
bytes=""
if [ -n "$extra" ]; then
# extra is a semicolon-separated list of key=value pairs
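# e.g. extra="snapshot_id=abc123;files_new=42;data_added_bytes=1048576"
# (illustrative values; the keys are set by the phase functions above)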
local pair key val pairs
IFS=';' read -ra pairs <<< "$extra"
for pair in "${pairs[@]}"; do
key="${pair%%=*}"
val="${pair#*=}"
case "$key" in
snapshot_id) snapshot_id="$val" ;;
files_new) files_new="$val" ;;
data_added_bytes) data_added="$val" ;;
output_file) output_file="$val" ;;
bytes) bytes="$val" ;;
esac
done
fi
# exit_code as JSON number when present, null otherwise.
local exit_arg='null'
if [ -n "$exit_code" ]; then
exit_arg="$exit_code"
fi
phases_json=$(
jq -c -n \
--argjson base "$phases_json" \
--arg name "$name" \
--arg status "$status" \
--argjson exit_code "$exit_arg" \
--arg started "$started" \
--arg ended "$ended" \
--arg err "$err" \
--arg snapshot_id "$snapshot_id" \
--arg files_new "$files_new" \
--arg data_added "$data_added" \
--arg output_file "$output_file" \
--arg bytes "$bytes" \
'
$base + {
($name): ({
status: $status,
exit_code: $exit_code,
started_at: (if $started == "" then null else $started end),
completed_at: (if $ended == "" then null else $ended end),
error: (if $err == "" then null else $err end)
}
+ (if $snapshot_id != "" then { snapshot_id: $snapshot_id } else {} end)
+ (if $files_new != "" then { files_new: ($files_new | tonumber? // null) } else {} end)
+ (if $data_added != "" then { data_added_bytes: ($data_added | tonumber? // null) } else {} end)
+ (if $output_file != "" then { output_file: $output_file } else {} end)
+ (if $bytes != "" then { bytes: ($bytes | tonumber? // null) } else {} end))
}'
)
done
jq -n \
--arg overall "$OVERALL_STATUS" \
--arg started "$STARTED_AT" \
--arg completed "$(date -Is)" \
--argjson duration "$SECONDS" \
--arg host "$(hostname)" \
--argjson phases "$phases_json" \
'{
schema_version: 1,
overall_status: $overall,
started_at: $started,
completed_at: $completed,
duration_seconds: $duration,
host: $host,
phases: $phases
}' > "$tmpfile"
mv "$tmpfile" "$STATUS_DIR/last-run.json"
chmod 0644 "$STATUS_DIR/last-run.json"
}
# ── Outcome aggregation ────────────────────────────────────────────────────
# Sets the globals OVERALL_STATUS and EXIT_CODE:
# success → exit 0
# partial_failure → exit 75 (visible but distinguishable from hard failure)
# failed → exit 1
determine_exit_code() {
local critical_failure=false
local has_failure=false
local has_degraded=false
local name status
for name in "${PHASE_ORDER[@]}"; do
status="${PHASE_STATUS[$name]:-pending}"
case "$status" in
success|skipped) ;;
degraded) has_degraded=true ;;
failed)
has_failure=true
case "$name" in
postgres_dump) critical_failure=true ;; # losing the DB dump is catastrophic
esac
;;
esac
done
# Losing BOTH restic repos is also catastrophic.
if [ "${PHASE_STATUS[restic_nas]:-}" = "failed" ] \
&& [ "${PHASE_STATUS[restic_b2]:-}" = "failed" ]; then
critical_failure=true
fi
if [ "$critical_failure" = true ]; then
OVERALL_STATUS="failed"
echo 1
elif [ "$has_failure" = true ] || [ "$has_degraded" = true ]; then
OVERALL_STATUS="partial_failure"
echo 75
else
OVERALL_STATUS="success"
echo 0
fi
}
# ── Main sequence ──────────────────────────────────────────────────────────
run_phase postgres_dump dump_postgres_all
run_phase forgejo_dump dump_forgejo
run_phase forgejo_db_dump dump_forgejo_db
run_phase restic_nas restic_backup_to "$RESTIC_REPO_NAS" nas
run_phase restic_b2 restic_backup_to "$RESTIC_REPO_B2" b2
run_phase forget_nas restic_forget_nas
run_phase check_nas restic_check_nas
run_phase check_b2 restic_check_b2
# Called in the current shell on purpose: capturing via $(...) would set
# OVERALL_STATUS and EXIT_CODE in a subshell, and write_status_json below
# would still see "unknown".
determine_exit_code
write_status_json
echo ""
echo "════════════════════════════════════════════════════════════════"
echo " Server backup — finished $(date -Is)"
echo " Overall status: $OVERALL_STATUS (exit $EXIT_CODE)"
echo " Duration: ${SECONDS}s"
echo " Status file: $STATUS_DIR/last-run.json"
echo " Log file: $LOG_FILE"
echo "════════════════════════════════════════════════════════════════"
exit "$EXIT_CODE"

@@ -0,0 +1,12 @@
[Unit]
Description=Daily server-wide backup (timer)
[Timer]
# Daily at 03:30 local. After ops-db-backup.timer (02:00) so the ops_dashboard
# pg_dump from /srv/ops/backups/ is fresh when restic picks it up.
OnCalendar=*-*-* 03:30:00
Persistent=true
RandomizedDelaySec=600
[Install]
WantedBy=timers.target

@@ -0,0 +1,25 @@
#!/usr/bin/env bash
# Read /srv/backups/status/last-run.json. Returns "{}" if missing, so the
# dashboard can render an "unknown" state instead of erroring.
set -uo pipefail
STATUS_FILE="${STATUS_FILE:-/srv/backups/status/last-run.json}"
RESTORE_STATUS_FILE="${RESTORE_STATUS_FILE:-/srv/backups/status/last-restore-test.json}"
# We emit a small wrapper object with both files so the UI can render the
# server-backup status AND the most recent restore-test status from one call.
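# Example output shape (illustrative): {"last_run":{...},"last_restore_test":null}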
last_run='{}'
if [ -r "$STATUS_FILE" ]; then
# Round-trip through jq so a truncated/corrupt file also degrades to "{}".
last_run=$(jq -c . "$STATUS_FILE" 2>/dev/null || echo '{}')
fi
last_restore='null'
if [ -r "$RESTORE_STATUS_FILE" ]; then
last_restore=$(jq -c . "$RESTORE_STATUS_FILE" 2>/dev/null || echo 'null')
fi
jq -n \
--argjson last_run "$last_run" \
--argjson last_restore "$last_restore" \
'{ last_run: $last_run, last_restore_test: $last_restore }'

@@ -0,0 +1,24 @@
#!/usr/bin/env bash
# Run a light restic integrity check on the given repo.
# Usage: restic-check.sh nas|b2
set -uo pipefail
LABEL="${1:-}"
if [ "$LABEL" != "nas" ] && [ "$LABEL" != "b2" ]; then
echo "label must be nas or b2" >&2
exit 2
fi
if [ -z "${RESTIC_REPO_NAS:-}" ] && [ -r /etc/restic-backup.env ]; then
set -a; . /etc/restic-backup.env; set +a
fi
case "$LABEL" in
nas) REPO="${RESTIC_REPO_NAS:?RESTIC_REPO_NAS not set}" ;;
b2) REPO="${RESTIC_REPO_B2:?RESTIC_REPO_B2 not set}" ;;
esac
export RESTIC_PASSWORD_FILE="${RESTIC_PASSWORD_FILE:-/etc/restic-backup.password}"
restic -r "$REPO" check

@@ -0,0 +1,39 @@
#!/usr/bin/env bash
# List recent restic snapshots from a labelled repo. Output: JSON array.
# Usage: restic-snapshots.sh nas|b2
set -uo pipefail
LABEL="${1:-}"
if [ "$LABEL" != "nas" ] && [ "$LABEL" != "b2" ]; then
echo '{"error":"label must be nas or b2"}' >&2
exit 2
fi
# Load env (idempotent — systemd already loaded it for service contexts).
if [ -z "${RESTIC_REPO_NAS:-}" ] && [ -r /etc/restic-backup.env ]; then
set -a; . /etc/restic-backup.env; set +a
fi
case "$LABEL" in
nas) REPO="${RESTIC_REPO_NAS:?RESTIC_REPO_NAS not set}" ;;
b2) REPO="${RESTIC_REPO_B2:?RESTIC_REPO_B2 not set}" ;;
esac
export RESTIC_PASSWORD_FILE="${RESTIC_PASSWORD_FILE:-/etc/restic-backup.password}"
# Show last 30 snapshots, newest first, with the fields the UI needs.
# Degrade to an empty array if the repo is unreachable, instead of letting
# jq fail on empty input.
snapshots_json=$(restic -r "$REPO" snapshots --json 2>/dev/null || echo '[]')
jq --arg repo "$LABEL" '
sort_by(.time) | reverse | .[0:30]
| map({
id: .id,
short_id: (.short_id // (.id[0:8])),
time: .time,
hostname: .hostname,
tags: (.tags // []),
paths: (.paths // []),
summary: (.summary // null),
repo: $repo
})
' <<< "$snapshots_json"
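# Example element of the emitted array (illustrative values only):
#   { "id": "ab12cd34ef…", "short_id": "ab12cd34",
#     "time": "2026-05-15T03:31:02+02:00", "hostname": "scrum4me-srv",
#     "tags": ["scheduled", "host=scrum4me-srv"], "paths": ["/etc", "/srv"],
#     "summary": null, "repo": "nas" }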

@@ -0,0 +1,51 @@
#!/usr/bin/env bash
# Repo stats: combines restic stats in two modes plus snapshot count.
# Output: JSON object with restore_size_bytes, raw_data_bytes, dedup_ratio.
# Usage: restic-stats.sh nas|b2
set -uo pipefail
LABEL="${1:-}"
if [ "$LABEL" != "nas" ] && [ "$LABEL" != "b2" ]; then
echo '{"error":"label must be nas or b2"}' >&2
exit 2
fi
if [ -z "${RESTIC_REPO_NAS:-}" ] && [ -r /etc/restic-backup.env ]; then
set -a; . /etc/restic-backup.env; set +a
fi
case "$LABEL" in
nas) REPO="${RESTIC_REPO_NAS:?RESTIC_REPO_NAS not set}" ;;
b2) REPO="${RESTIC_REPO_B2:?RESTIC_REPO_B2 not set}" ;;
esac
export RESTIC_PASSWORD_FILE="${RESTIC_PASSWORD_FILE:-/etc/restic-backup.password}"
# restore-size: total bytes if every file in every snapshot were re-extracted.
restore_json=$(restic -r "$REPO" stats --mode restore-size --json 2>/dev/null || echo '{}')
# raw-data: total unique blob bytes after dedup + compression.
raw_json=$(restic -r "$REPO" stats --mode raw-data --json 2>/dev/null || echo '{}')
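# dedup_ratio (computed below) = restore_size / raw_data. Illustrative: 120 GiB
# of restorable data stored as 20 GiB of unique blobs gives a ratio of 6.0.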
# Snapshot count for the same repo.
snap_count=$(restic -r "$REPO" snapshots --json 2>/dev/null | jq 'length // 0')
jq -n \
--arg repo "$LABEL" \
--argjson restore "$restore_json" \
--argjson raw "$raw_json" \
--argjson snap_count "${snap_count:-0}" \
'
{
repo: $repo,
snapshots_count: $snap_count,
restore_size_bytes: ($restore.total_size // null),
restore_size_files: ($restore.total_file_count // null),
raw_data_bytes: ($raw.total_size // null),
raw_blob_count: ($raw.total_blob_count // null),
dedup_ratio: (
if ($restore.total_size != null) and ($raw.total_size != null) and ($raw.total_size > 0)
then (($restore.total_size | tonumber) / ($raw.total_size | tonumber))
else null
end
)
}'

@@ -0,0 +1,18 @@
#!/usr/bin/env bash
# Trigger server-backup.service ad hoc. Refuses if a run is already active
# (the script itself also takes a flock, but checking here gives a friendlier
# error).
set -uo pipefail
UNIT=server-backup.service
active=$(systemctl is-active "$UNIT" 2>/dev/null || true)
if [ "$active" = "active" ] || [ "$active" = "activating" ]; then
echo "ERROR: $UNIT is already $active — refusing to trigger." >&2
exit 75
fi
# Use --no-block so we return immediately; the dashboard will poll via
# read-status.sh and tail the log to follow progress.
systemctl start --no-block "$UNIT"
echo "Triggered $UNIT. Follow with: journalctl -u $UNIT -f"

@@ -0,0 +1,15 @@
#!/usr/bin/env bash
# Run a non-destructive restore test against the NAS repo. Streams output to
# stdout (so the dashboard's StreamingTerminal can render it) and writes the
# structured result to /srv/backups/status/last-restore-test.json.
set -uo pipefail
REPO_LABEL="${1:-nas}"
if [ ! -x /srv/backups/scripts/restore-test.sh ]; then
echo "ERROR: /srv/backups/scripts/restore-test.sh not installed" >&2
exit 1
fi
exec /srv/backups/scripts/restore-test.sh "$REPO_LABEL"