Ops-dashboard/deploy/server-backup/README.md
Madhura68 ab87c0fada feat(server-backup): restic dual-repo backup (NAS + B2) with dashboard UI
Adds a server-wide backup capability beyond the existing ops_dashboard
pg_dump flow:

- Daily systemd timer (03:30) runs pg_dumpall + Forgejo dump, then restic
  to a local NAS repo and an offsite Backblaze B2 repo with Object Lock.
  Phase-based script with single-instance flock, structured statusfile,
  systemd hardening, and live-datadir excludes (Postgres / Forgejo) so
  the dumps stay authoritative.
- Ops-agent gets nine new read-only/trigger commands (snapshots, stats,
  status, logs, plus two triggers) backed by sudoers-whitelisted wrapper
  scripts that source /etc/restic-backup.env so the agent never sees the
  restic password or B2 keys.
- Two new flows (server_backup_full, server_backup_restore_test) drive
  the dashboard's "Backup now" and "Restore test" buttons.
- /settings/backups gains a Server backup section with overall + per-phase
  status, NAS / B2 snapshot tables, restore-size / raw-data / dedup-ratio
  stats, and the last restore-test result. The existing pg_dump section
  is preserved unchanged.
- Runbook docs/runbooks/server-backup.md follows the tailscale-setup
  pattern (plan + addendum) and covers B2 Object Lock + scoped keys,
  Forgejo subplan with isolated restore-test stack, the off-server
  maintenance flow for B2 prune, and the integrity-check schedule.

Code-only change — installation on scrum4me-srv follows the runbook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 13:03:00 +02:00


Server backup: deploy artifacts

Daily server-wide backup with restic to a NAS (local) and to Backblaze B2 (offsite, Object Lock). It also writes a structured status file that the ops-dashboard can read.

The full description (prerequisites, B2 keys, Object Lock, Forgejo restore test, integrity-check schedule) is in docs/runbooks/server-backup.md.

Files

| File | Purpose | Location on host |
|---|---|---|
| server-backup.sh | main script (phase-based, flock, status file) | /srv/backups/scripts/server-backup.sh |
| restore-test.sh | restore latest snapshot + check critical files | /srv/backups/scripts/restore-test.sh |
| server-backup.service | systemd oneshot | /etc/systemd/system/server-backup.service |
| server-backup.timer | daily 03:30 + 10 min jitter | /etc/systemd/system/server-backup.timer |
| restic-backup.env.example | env template (repos, B2 keys, Forgejo) | copy to /etc/restic-backup.env |

Also to be created (not in this repo, because it is a secret):

  • /etc/restic-backup.password: contains only the restic password (mode 0400, root:root).
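
The file table describes server-backup.sh as phase-based with a flock guard. The following is a minimal sketch of that pattern, not the actual script: the lock path, the function names acquire_lock and run_phase, and the placeholder commands are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: the lock path, function names and placeholder commands are
# illustrative assumptions, not taken from server-backup.sh.

LOCK_FILE="${LOCK_FILE:-${TMPDIR:-/tmp}/server-backup-sketch.lock}"
PHASE_RESULTS=""

# Open the lock file on fd 9 and take an exclusive, non-blocking lock.
# flock -n returns non-zero immediately if another process holds the lock.
acquire_lock() {
  exec 9>"$LOCK_FILE" || return 1
  flock -n 9
}

# Run one phase, remember its exit code, and keep going either way so that
# later phases still get a chance to run.
run_phase() {
  phase_name="$1"
  shift
  phase_rc=0
  "$@" || phase_rc=$?
  PHASE_RESULTS="${PHASE_RESULTS}${phase_name}=${phase_rc} "
}

if acquire_lock; then
  run_phase postgres_dump true    # placeholder for the real dump command
  run_phase restic_nas    false   # placeholder simulating a failing phase
  echo "phase results: $PHASE_RESULTS"
else
  echo "another run holds $LOCK_FILE, exiting" >&2
fi
```

The lock is tied to the open file descriptor, so it is released automatically when the process exits, including after a crash. No stale lock file has to be cleaned up.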

Quick install (see the runbook for full context)

# 1. Tools and directories
sudo apt update && sudo apt install -y restic jq

sudo mkdir -p /srv/backups/scripts /srv/backups/logs /srv/backups/status \
              /var/backups/databases
sudo chmod 0750 /srv/backups/logs /srv/backups/status

# 2. Install the scripts
sudo cp deploy/server-backup/server-backup.sh /srv/backups/scripts/
sudo cp deploy/server-backup/restore-test.sh  /srv/backups/scripts/
sudo chmod 0750 /srv/backups/scripts/*.sh
sudo chown root:root /srv/backups/scripts/*.sh

# 3. Env + password
sudo cp deploy/server-backup/restic-backup.env.example /etc/restic-backup.env
sudo chmod 0600 /etc/restic-backup.env
sudo chown root:root /etc/restic-backup.env
# Generate the password and ALSO store it in your password manager.
# umask 077 keeps the file unreadable for others from the moment it is created.
sudo sh -c 'umask 077; openssl rand -hex 24 > /etc/restic-backup.password'
sudo chmod 0400 /etc/restic-backup.password

# 4. Fill in /etc/restic-backup.env (RESTIC_REPO_NAS, RESTIC_REPO_B2,
#    B2_ACCOUNT_ID, B2_ACCOUNT_KEY, FORGEJO_*). See runbook parts A+B.

# 5. Initialize the repos (see runbook part C for Object Lock + key capabilities)
sudo -E bash -c 'set -a; . /etc/restic-backup.env; set +a; \
  export RESTIC_PASSWORD_FILE=/etc/restic-backup.password; \
  restic -r "$RESTIC_REPO_NAS" init && \
  restic -r "$RESTIC_REPO_B2"  init'

# 6. Systemd
sudo cp deploy/server-backup/server-backup.service /etc/systemd/system/
sudo cp deploy/server-backup/server-backup.timer  /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now server-backup.timer
systemctl list-timers | grep server-backup

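# 7. First run, started manually (follow it via journalctl).
#    --no-block returns immediately; a plain start of a oneshot unit waits until it finishes.
sudo systemctl start --no-block server-backup.service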
journalctl -u server-backup.service -f

Verify

# Status file
sudo jq . /srv/backups/status/last-run.json

# Snapshots
sudo -E bash -c 'set -a; . /etc/restic-backup.env; set +a; \
  export RESTIC_PASSWORD_FILE=/etc/restic-backup.password; \
  restic -r "$RESTIC_REPO_NAS" snapshots; \
  restic -r "$RESTIC_REPO_B2"  snapshots'

# Restore test (NAS, non-destructive: restores to /tmp/restore-test)
sudo /srv/backups/scripts/restore-test.sh nas
sudo jq . /srv/backups/status/last-restore-test.json
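
For scripted checks, the overall_status field of the status file can be turned into an exit code. A minimal sketch, assuming jq is installed (step 1 of the quick install does this); check_last_run is a hypothetical helper name and not part of this repo.

```shell
#!/bin/sh
# Sketch only: check_last_run is a hypothetical helper, not part of the repo.
# It relies on the overall_status field described in the status file schema.

# Print the overall status and return 0 only when it is "success".
# Returns 2 when jq cannot read or parse the file.
check_last_run() {
  status_file="$1"
  overall="$(jq -r '.overall_status // "unknown"' "$status_file")" || return 2
  echo "overall_status=$overall"
  [ "$overall" = "success" ]
}

# Only run the check when the status file is readable by the current user.
STATUS_FILE="${STATUS_FILE:-/srv/backups/status/last-run.json}"
if [ -r "$STATUS_FILE" ]; then
  check_last_run "$STATUS_FILE" || echo "last backup run needs attention" >&2
fi
```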

Status file schema

The script writes /srv/backups/status/last-run.json after every run (success or failure), atomically via a temp file + mv. The ops-dashboard reads this file via read_backup_status (see ops-agent/commands.yml.example).
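
The temp + mv pattern can be sketched as follows. This is an illustration, not the code from server-backup.sh: write_status_atomic is a hypothetical helper name, and the demo writes to a scratch directory rather than /srv/backups/status.

```shell
#!/bin/sh
# Sketch only: write_status_atomic is a hypothetical helper name.

# Write stdin to "$1" atomically. The temp file is created in the same
# directory as the target so that mv is a rename within one filesystem,
# which means readers never see a half-written file.
write_status_atomic() {
  target="$1"
  tmp="$(mktemp "$(dirname "$target")/.status.XXXXXX")" || return 1
  if cat > "$tmp" && mv -f "$tmp" "$target"; then
    return 0
  fi
  rm -f "$tmp"
  return 1
}

# Demo against a scratch directory instead of /srv/backups/status.
demo_dir="$(mktemp -d)"
printf '{"schema_version": 1, "overall_status": "success"}\n' \
  | write_status_atomic "$demo_dir/last-run.json"
cat "$demo_dir/last-run.json"
```

mktemp creates the file with mode 0600, so a real script would need to chmod the temp file before the mv if a non-root reader has to open the result.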

{
  "schema_version": 1,
  "overall_status": "success | partial_failure | failed",
  "started_at": "2026-05-15T03:30:00+02:00",
  "completed_at": "2026-05-15T03:48:21+02:00",
  "duration_seconds": 1101,
  "host": "scrum4me-srv",
  "phases": {
    "postgres_dump":   { "status": "success",  "exit_code": 0, "...": "..." },
    "forgejo_dump":    { "status": "skipped",  "exit_code": 99, "...": "..." },
    "forgejo_db_dump": { "status": "skipped",  "exit_code": 99 },
    "restic_nas":      { "status": "success",  "exit_code": 0, "snapshot_id": "abc123" },
    "restic_b2":       { "status": "degraded", "exit_code": 3, "error": "1 file unreadable" },
    "forget_nas":      { "status": "success",  "exit_code": 0 },
    "check_nas":       { "status": "success",  "exit_code": 0 },
    "check_b2":        { "status": "success",  "exit_code": 0 }
  }
}

Per-phase status:

| status | meaning | counts as |
|---|---|---|
| success | exit 0 | success |
| skipped | exit 99: phase not applicable (e.g. Forgejo not installed) | success |
| degraded | exit 3: the restic snapshot was created but some files were unreadable | partial_failure |
| failed | any other non-zero exit | partial_failure or failed (see overall_status) |
| pending | phase did not run (script aborted before this phase) | partial_failure |
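
The exit-code column of the table above maps to a status string roughly like this. A minimal sketch; phase_status is a hypothetical helper name. The pending status is not derived from an exit code, so it does not appear here.

```shell
#!/bin/sh
# Sketch only: phase_status is a hypothetical helper name.

# Map a phase exit code to the status strings from the per-phase table.
phase_status() {
  case "$1" in
    0)  echo success ;;
    99) echo skipped ;;
    3)  echo degraded ;;
    *)  echo failed ;;
  esac
}

for rc in 0 99 3 1; do
  echo "exit $rc -> $(phase_status "$rc")"
done
```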

overall_status rules:

  • failed if postgres_dump fails (the DB dump is authoritative), or if both restic repos fail.
  • partial_failure on any failed or degraded phase that is not critical (e.g. one restic repo down, or forgejo_dump fails while postgres succeeds).
  • success if no phase is failed or degraded.
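
These rules can be sketched as a small function. This is an illustration under assumptions, not the script's actual logic: overall_status here is a hypothetical helper that takes phase=status arguments instead of parsing JSON, and it treats pending like failed or degraded, following the "counts as" column of the per-phase table.

```shell
#!/bin/sh
# Sketch only: overall_status is a hypothetical helper that takes
# "phase=status" arguments instead of parsing the JSON status file.

overall_status() {
  postgres=""; nas=""; b2=""; problems=0
  for pair in "$@"; do
    name="${pair%%=*}"
    status="${pair#*=}"
    case "$name" in
      postgres_dump) postgres="$status" ;;
      restic_nas)    nas="$status" ;;
      restic_b2)     b2="$status" ;;
    esac
    # Assumption: pending counts as a problem, as in the per-phase table.
    case "$status" in
      failed|degraded|pending) problems=$((problems + 1)) ;;
    esac
  done
  if [ "$postgres" = "failed" ]; then
    echo failed                      # the DB dump is authoritative
  elif [ "$nas" = "failed" ] && [ "$b2" = "failed" ]; then
    echo failed                      # both restic repos failed
  elif [ "$problems" -gt 0 ]; then
    echo partial_failure
  else
    echo success
  fi
}

# Mirrors the example status file: restic_b2 is degraded, the rest is fine.
overall_status postgres_dump=success forgejo_dump=skipped \
  restic_nas=success restic_b2=degraded   # prints partial_failure
```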

Ordering relative to the existing ops-db-backup.timer

The existing deploy/ops-agent/ops-db-backup.timer runs at 02:00 and only does a pg_dump of ops_dashboard to /srv/ops/backups/. The new server-backup.timer runs at 03:30 and includes that directory in its restic backup. Both timers continue to exist side by side.