# Recovery Runbook — Ops Dashboard

This runbook covers the four most common failure scenarios. All commands must be run as root (or with `sudo`) on the host over SSH.

---

## 1. Agent crashed / ops-agent not responding

**Symptoms:** The dashboard shows "Agent unreachable" or HTTP 502/504 on flow endpoints; `curl http://127.0.0.1:3099/healthz` times out.

```bash
# Check current status and recent logs
systemctl status ops-agent
journalctl -u ops-agent -n 50 --no-pager

# Restart the agent
systemctl restart ops-agent

# Verify it came back
systemctl status ops-agent
curl -sf http://127.0.0.1:3099/healthz && echo OK
```

If the agent exits immediately on restart, the most common causes are:

| Cause | Fix |
|---|---|
| Missing `/etc/ops-agent/secret` | `openssl rand -hex 32 \| sudo tee /etc/ops-agent/secret && sudo chown root:ops-agent /etc/ops-agent/secret && sudo chmod 0640 /etc/ops-agent/secret` |
| Missing or invalid `commands.yml` | `cp /opt/ops-agent/commands.yml.example /etc/ops-agent/commands.yml` |
| Port 3099 already in use | `ss -tlnp \| grep 3099`, then kill the conflicting process |
| Node.js missing | `apt install nodejs`, or reinstall via `deploy/ops-agent/setup.sh` |

If the binary itself is corrupt, reinstall from the repo:

```bash
cd /srv/ops/repos/ops-dashboard
sudo deploy/ops-agent/setup.sh
```

---

## 2. Database corruption (ops_dashboard DB)

**Symptoms:** Dashboard shows database errors; Prisma throws `P1001`/`P1002`; `psql` commands fail against the `ops_dashboard` database.

### 2a. Restore from latest backup

Backups are stored in `/var/backups/ops-dashboard/` (the default path used by the backup flow). Each file is a plain-SQL `pg_dump` dump named `ops_dashboard_YYYY-MM-DDTHH-MM-SSZ.sql`.
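Before dropping the damaged database, it is worth sanity-checking the dump you plan to restore from. The helper below is a sketch, not part of the repo: it only verifies the file is non-empty and carries the header comment that `pg_dump` writes at the top of plain-SQL dumps.

```bash
# check_dump — sanity-check a plain-SQL pg_dump file before restoring.
# Hypothetical helper, not shipped with the repo.
# Usage: check_dump /var/backups/ops-dashboard/<file>.sql
check_dump() {
  # Refuse empty or missing files outright
  [ -s "$1" ] || { echo "ERROR: $1 is empty or missing" >&2; return 1; }
  # Plain-SQL pg_dump output starts with a "-- PostgreSQL database dump" comment
  head -n 5 "$1" | grep -q 'PostgreSQL database dump' \
    || { echo "ERROR: $1 does not look like a pg_dump file" >&2; return 1; }
  echo "OK: $1"
}
```

Run it against the newest dump (`ls -t /var/backups/ops-dashboard/*.sql | head -1`) before starting the restore procedure below.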
```bash
# List available backups (newest first)
ls -lt /var/backups/ops-dashboard/*.sql | head -10

# Drop the damaged DB and restore
BACKUP=/var/backups/ops-dashboard/.sql   # set to the chosen backup file
sudo -u postgres psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'ops_dashboard';"
sudo -u postgres dropdb ops_dashboard
sudo -u postgres createdb ops_dashboard
sudo -u postgres psql ops_dashboard < "$BACKUP"
```

Then restart the dashboard to re-open connections:

```bash
docker compose -f /srv/scrum4me/compose/docker-compose.yml restart ops-dashboard
```

### 2b. No usable backup available

Run Prisma migrations to re-create the schema (all existing data is lost):

```bash
cd /srv/ops/repos/ops-dashboard
sudo -u postgres createdb ops_dashboard   # if it was dropped
docker compose -f /srv/scrum4me/compose/docker-compose.yml run --rm ops-dashboard \
  npx prisma migrate deploy
```

---

## 3. Container refuses to start

**Symptoms:** `docker compose up ops-dashboard` exits immediately; the container appears in `docker ps -a` as `Exited`.

```bash
# Show exit logs
docker logs ops-dashboard --tail 50

# Inspect exit code
docker inspect ops-dashboard --format='ExitCode: {{.State.ExitCode}}'
```

### Common causes and fixes

| Exit code / symptom | Likely cause | Fix |
|---|---|---|
| `EACCES` / permission denied | Env file unreadable | `chmod 640 /srv/ops/ops-dashboard.env` |
| `DATABASE_URL` connect error | DB not ready or wrong credentials | Check `docker ps` for `postgres`; verify `DATABASE_URL` in the env file |
| Port 3000 already bound | Another process on 3000 | `ss -tlnp \| grep 3000`; adjust the port mapping in the Compose file |
| Invalid `AUTH_SECRET` | Secret too short for Next.js Auth | Regenerate: `openssl rand -base64 32` |
| Image not found | `build` step skipped | `docker compose -f ... build ops-dashboard`, then `up -d` |

Force a clean rebuild:

```bash
docker compose -f /srv/scrum4me/compose/docker-compose.yml build --no-cache ops-dashboard
docker compose -f /srv/scrum4me/compose/docker-compose.yml up -d --force-recreate ops-dashboard
```

---

## 4. TLS certificate expired

**Symptoms:** The browser shows `ERR_CERT_DATE_INVALID`; Caddy logs show `certificate expired`.

Caddy manages TLS automatically via Let's Encrypt. A certificate should never expire under normal operation, because Caddy renews roughly 30 days before expiry.

### Check certificate status

```bash
# View Caddy logs for ACME activity
journalctl -u caddy -n 100 --no-pager | grep -i acme

# Check the expiry of the live certificate
echo | openssl s_client -connect jp-visser.nl:443 -servername jp-visser.nl 2>/dev/null \
  | openssl x509 -noout -dates
```

### Force renewal

```bash
# Stop Caddy, clear the ACME cache, restart
systemctl stop caddy
rm -rf /var/lib/caddy/.local/share/caddy/certificates
systemctl start caddy
journalctl -u caddy -f   # watch for successful issuance
```

### If ACME fails (rate-limited or DNS not resolving)

1. Confirm DNS for `jp-visser.nl` points to this server's public IP.
2. Confirm port 80 is open inbound (required for the HTTP-01 challenge).
3. If rate-limited (Let's Encrypt issues at most 5 duplicate certificates per week), wait until the limit resets or use a staging cert temporarily:

   ```
   # In the Caddyfile, add:
   acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
   # Remove after the rate-limit window passes.
   ```

### Emergency: self-signed certificate

If renewal is blocked and you need access now:

```bash
# Generate a temporary self-signed cert
openssl req -x509 -newkey rsa:4096 -keyout /etc/caddy/selfsigned.key \
  -out /etc/caddy/selfsigned.crt -days 30 -nodes \
  -subj "/CN=jp-visser.nl"

# Point the Caddyfile at the self-signed cert (tls directive):
#   tls /etc/caddy/selfsigned.crt /etc/caddy/selfsigned.key

systemctl reload caddy
```

Remember to revert to automatic TLS once the ACME issue is resolved.
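The `openssl x509 -noout -dates` check prints raw `notBefore`/`notAfter` strings. For monitoring it can help to reduce that to a single number of days. The helper below is a sketch, not part of the repo; it reads a local PEM file rather than a live connection and assumes GNU `date -d` (standard on Linux, not BSD/macOS).

```bash
# days_left — print the number of days until a PEM certificate expires.
# Hypothetical helper; assumes GNU `date -d` and the openssl CLI.
# Usage: days_left /path/to/cert.pem
days_left() {
  # Extract the notAfter date ("Dec 31 23:59:59 2099 GMT" style)
  end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2) || return 1
  # Convert both dates to epoch seconds and report whole days remaining
  echo $(( ( $(date -d "$end" +%s) - $(date +%s) ) / 86400 ))
}
```

For example, `days_left /etc/caddy/selfsigned.crt` run right after generating the emergency certificate above should print a value near 30; a cron job could alert when the result drops below, say, 14.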