feat(ops): self-update script, systemd units, README install guide, recovery runbook

- deploy/ops-dashboard-updater/update.sh: git pull → docker build → force-recreate → smoke-test
- deploy/ops-dashboard-updater/install.sh: installs script + systemd units to host
- ops-dashboard-updater.service / .timer: oneshot + daily 03:00 scheduled trigger
- README.md: Installation and Configuration sections (env files, ops-agent, updater)
- docs/runbooks/recovery.md: agent-crash, DB corruption/restore, container failure, cert expiry

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author: Scrum4Me Agent
Date:   2026-05-13 20:10:21 +02:00
Commit: caeb5f3306 (parent 09050d5ce7)

6 changed files with 361 additions and 0 deletions

docs/runbooks/recovery.md (new file)
@@ -0,0 +1,173 @@
# Recovery Runbook — Ops Dashboard
This runbook covers the four most common failure scenarios. All commands must
be run as root (or with `sudo`) on the host over SSH.

---
## 1. Agent crashed / ops-agent not responding
**Symptoms:** The dashboard shows "Agent unreachable" or HTTP 502/504 on flow
endpoints; `curl http://127.0.0.1:3099/healthz` times out.
```bash
# Check current status and recent logs
systemctl status ops-agent
journalctl -u ops-agent -n 50 --no-pager
# Restart the agent
systemctl restart ops-agent
# Verify it came back
systemctl status ops-agent
curl -sf http://127.0.0.1:3099/healthz && echo OK
```
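If the agent takes a few seconds to bind the port, the single `curl` above can
report a false failure. A short retry loop is more robust (a minimal sketch;
the timings are arbitrary):
```bash
# Poll the health endpoint for up to ~30 s before declaring failure
for i in $(seq 1 10); do
  curl -sf http://127.0.0.1:3099/healthz && break
  sleep 3
done
```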
If the agent exits immediately on restart, the most common causes are listed
below (a one-pass triage sketch follows the table):

| Cause | Fix |
|---|---|
| Missing `/etc/ops-agent/secret` | `openssl rand -hex 32 \| sudo tee /etc/ops-agent/secret && sudo chown root:ops-agent /etc/ops-agent/secret && sudo chmod 0640 /etc/ops-agent/secret` |
| Missing or invalid `commands.yml` | `cp /opt/ops-agent/commands.yml.example /etc/ops-agent/commands.yml` |
| Port 3099 already in use | `ss -tlnp \| grep 3099` then kill the conflicting process |
| Node.js missing | `apt install nodejs` or reinstall via `deploy/ops-agent/setup.sh` |
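A one-pass triage sketch for the causes above (paths taken from the table;
adjust them if your install differs):
```bash
# Check each common cause in order; prints only what is wrong
[ -f /etc/ops-agent/secret ]       || echo "missing /etc/ops-agent/secret"
[ -f /etc/ops-agent/commands.yml ] || echo "missing /etc/ops-agent/commands.yml"
ss -tlnp | grep -q ':3099 '        && echo "port 3099 already in use"
command -v node >/dev/null 2>&1    || echo "nodejs not installed"
```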
If the agent's installed files are corrupt, reinstall from the repo:
```bash
cd /srv/ops/repos/ops-dashboard
sudo deploy/ops-agent/setup.sh
```
---
## 2. Database corruption (ops_dashboard DB)
**Symptoms:** Dashboard shows database errors; Prisma throws `P1001` / `P1002`;
`psql` commands fail against the `ops_dashboard` database.
### 2a. Restore from latest backup
Backups are stored in `/var/backups/ops-dashboard/` (default path from the
backup flow). Each file is a plain-SQL dump produced by `pg_dump` and named
`ops_dashboard_YYYY-MM-DDTHH-MM-SSZ.sql`.
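The sortable timestamp in the filename means the newest backup can be selected
automatically; a sketch that sets the `BACKUP` variable used below:
```bash
# Newest backup first, thanks to the sortable timestamp in the name
BACKUP=$(ls -1t /var/backups/ops-dashboard/ops_dashboard_*.sql | head -1)
echo "Restoring from: $BACKUP"
```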
```bash
# List available backups (newest first)
ls -lt /var/backups/ops-dashboard/*.sql | head -10
# Drop the damaged DB and restore
BACKUP=/var/backups/ops-dashboard/<chosen-file>.sql
sudo -u postgres psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'ops_dashboard';"
sudo -u postgres dropdb ops_dashboard
sudo -u postgres createdb ops_dashboard
sudo -u postgres psql ops_dashboard < "$BACKUP"
```
Then restart the dashboard to re-open connections:
```bash
docker compose -f /srv/scrum4me/compose/docker-compose.yml restart ops-dashboard
```
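A quick sanity check that the restore worked (this assumes the dashboard is
published on port 3000, as in section 3):
```bash
# The restored DB should list its tables; the dashboard should answer
sudo -u postgres psql ops_dashboard -c '\dt'
curl -sf http://127.0.0.1:3000/ >/dev/null && echo "dashboard OK"
```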
### 2b. No usable backup available
Re-create the schema from the Prisma migrations (all previous data is lost):
```bash
cd /srv/ops/repos/ops-dashboard
sudo -u postgres createdb ops_dashboard # if it was dropped
docker compose -f /srv/scrum4me/compose/docker-compose.yml run --rm ops-dashboard \
npx prisma migrate deploy
```
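Afterwards, `prisma migrate status` can confirm every migration was applied (a
sketch using the same container image as above):
```bash
# Each migration should be reported as applied
docker compose -f /srv/scrum4me/compose/docker-compose.yml run --rm ops-dashboard \
  npx prisma migrate status
```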
---
## 3. Container refuses to start
**Symptoms:** `docker compose up ops-dashboard` exits immediately; the container
appears in `docker ps -a` with `Exited`.
```bash
# Show exit logs
docker logs ops-dashboard --tail 50
# Inspect exit code
docker inspect ops-dashboard --format='ExitCode: {{.State.ExitCode}}'
```
### Common causes and fixes
| Exit code / symptom | Likely cause | Fix |
|---|---|---|
| `EACCES` / permission denied | Env file unreadable | `chmod 640 /srv/ops/ops-dashboard.env` |
| `DATABASE_URL` connect error | DB not ready or wrong credentials | Check `docker ps` for `postgres`; verify `DATABASE_URL` in env file |
| Port 3000 already bound | Another process on 3000 | `ss -tlnp \| grep 3000`; adjust port mapping in Compose file |
| Invalid `AUTH_SECRET` | Secret missing or too short for Auth.js (NextAuth) | Regenerate: `openssl rand -base64 32` |
| Image not found | `build` step skipped | `docker compose -f ... build ops-dashboard` then `up -d` |
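A quick pass over the causes in the table (paths and ports taken from this
runbook; adjust them if your layout differs):
```bash
# Env file readable? DB container up? Port 3000 free?
ls -l /srv/ops/ops-dashboard.env
docker ps --filter name=postgres --format '{{.Names}} {{.Status}}'
ss -tlnp | grep ':3000 ' || echo "port 3000 is free"
```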
Force a clean rebuild:
```bash
docker compose -f /srv/scrum4me/compose/docker-compose.yml build --no-cache ops-dashboard
docker compose -f /srv/scrum4me/compose/docker-compose.yml up -d --force-recreate ops-dashboard
```
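After the rebuild, confirm the container stays up and serves traffic (port
3000 assumed, as in the table above):
```bash
# Status should read "Up ..."; the curl should print OK
docker ps --filter name=ops-dashboard --format '{{.Names}} {{.Status}}'
curl -sf http://127.0.0.1:3000/ >/dev/null && echo OK
```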
---
## 4. TLS certificate expired
**Symptoms:** Browser shows `ERR_CERT_DATE_INVALID`; Caddy logs show
`certificate expired`.
Caddy manages TLS automatically via Let's Encrypt. A certificate should never
expire under normal operation, because Caddy renews once two-thirds of the
certificate's lifetime has elapsed (about 30 days before expiry for 90-day
Let's Encrypt certificates).
### Check certificate status
```bash
# View Caddy logs for ACME activity
journalctl -u caddy -n 100 --no-pager | grep -i acme
# Check the expiry of the live certificate
echo | openssl s_client -connect jp-visser.nl:443 -servername jp-visser.nl 2>/dev/null \
| openssl x509 -noout -dates
```
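To turn the `notAfter` date into a days-remaining figure (a sketch relying on
GNU `date`, as shipped on Debian/Ubuntu):
```bash
# Days until the live certificate expires
exp=$(echo | openssl s_client -connect jp-visser.nl:443 -servername jp-visser.nl 2>/dev/null \
  | openssl x509 -noout -enddate | cut -d= -f2)
echo "$(( ($(date -d "$exp" +%s) - $(date +%s)) / 86400 )) days remaining"
```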
### Force renewal
```bash
# Stop Caddy, remove the stored certificates so they are re-issued, restart
systemctl stop caddy
rm -rf /var/lib/caddy/.local/share/caddy/certificates
systemctl start caddy
journalctl -u caddy -f # watch for successful issuance
```
### If ACME fails (rate-limited or DNS not resolving)
1. Confirm DNS for `jp-visser.nl` points to this server's public IP.
2. Confirm port 80 is open inbound (required for HTTP-01 challenge).
3. If rate-limited (Let's Encrypt issues at most 5 duplicate certificates per
   week for the exact same set of names), wait until the limit resets or use
   the staging CA temporarily:
```
# Caddyfile: add a global options block at the top of the file:
{
    acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
}
# Remove the block after the rate-limit window passes, then reload Caddy.
```
### Emergency: self-signed certificate
If renewal is blocked and you need access now:
```bash
# Generate a temporary self-signed cert
openssl req -x509 -newkey rsa:4096 -keyout /etc/caddy/selfsigned.key \
-out /etc/caddy/selfsigned.crt -days 30 -nodes \
-subj "/CN=jp-visser.nl"
# Point Caddyfile at the self-signed cert (tls section):
# tls /etc/caddy/selfsigned.crt /etc/caddy/selfsigned.key
systemctl reload caddy
```
Remember to revert to automatic TLS once the ACME issue is resolved.
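Reverting is the reverse of the emergency steps; a sketch that assumes the
`tls` line was the only manual edit to the Caddyfile:
```bash
# Remove the manual tls line, delete the temporary cert, reload
sed -i '\|tls /etc/caddy/selfsigned|d' /etc/caddy/Caddyfile
rm -f /etc/caddy/selfsigned.key /etc/caddy/selfsigned.crt
systemctl reload caddy
```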