From caeb5f33064c42993d6625e88ec8567f7d2199cc Mon Sep 17 00:00:00 2001
From: Scrum4Me Agent <30029041+madhura68@users.noreply.github.com>
Date: Wed, 13 May 2026 20:10:21 +0200
Subject: [PATCH] feat(ops): self-update script, systemd units, README install
 guide, recovery runbook
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- deploy/ops-dashboard-updater/update.sh: git pull → docker build →
  force-recreate → smoke-test
- deploy/ops-dashboard-updater/install.sh: installs script + systemd units to host
- ops-dashboard-updater.service / .timer: oneshot + daily 03:00 scheduled trigger
- README.md: Installation and Configuration sections (env files, ops-agent, updater)
- docs/runbooks/recovery.md: agent-crash, DB corruption/restore, container
  failure, cert expiry

Co-Authored-By: Claude Sonnet 4.6
---
 README.md                                          |  74 ++++++++
 deploy/ops-dashboard-updater/install.sh            |  33 ++++
 .../ops-dashboard-updater.service                  |  14 ++
 .../ops-dashboard-updater.timer                    |  11 ++
 deploy/ops-dashboard-updater/update.sh             |  56 ++++++
 docs/runbooks/recovery.md                          | 173 ++++++++++++++++++
 6 files changed, 361 insertions(+)
 create mode 100644 deploy/ops-dashboard-updater/install.sh
 create mode 100644 deploy/ops-dashboard-updater/ops-dashboard-updater.service
 create mode 100644 deploy/ops-dashboard-updater/ops-dashboard-updater.timer
 create mode 100644 deploy/ops-dashboard-updater/update.sh
 create mode 100644 docs/runbooks/recovery.md

diff --git a/README.md b/README.md
index 641ce52..62e9649 100644
--- a/README.md
+++ b/README.md
@@ -4,6 +4,80 @@ Single-user ops dashboard voor jp-visser.nl.
 
 See `docs/runbooks/` for setup, deployment, and operational procedures.
 
+## Installation
+
+### Prerequisites
+
+- Docker + Docker Compose (plugin) installed on the host
+- A PostgreSQL service named `postgres` already running in the same Compose stack
+- The repository cloned to `/srv/ops/repos/ops-dashboard`
+- `/srv/scrum4me/compose/docker-compose.yml` as the shared Compose file
+
+### 1. Configure environment
+
+```
+cp deploy/ops-dashboard.env.example /srv/ops/ops-dashboard.env
+# Edit /srv/ops/ops-dashboard.env — set DATABASE_URL, AUTH_SECRET, etc.
+```
+
+### 2. Install ops-agent
+
+```
+sudo deploy/ops-agent/setup.sh
+```
+
+This creates the `ops-agent` system user, installs `/opt/ops-agent`, generates
+`/etc/ops-agent/secret`, and enables the systemd unit.
+
+Copy the generated secret into the web-app env file:
+
+```
+sudo cat /etc/ops-agent/secret
+# Paste the value as OPS_AGENT_SECRET=<value> in /srv/ops/ops-dashboard.env
+```
+
+### 3. Build and start the dashboard
+
+```
+sudo docker compose -f /srv/scrum4me/compose/docker-compose.yml build ops-dashboard
+sudo docker compose -f /srv/scrum4me/compose/docker-compose.yml up -d ops-dashboard
+```
+
+The dashboard is now reachable on `127.0.0.1:3001` (proxied by Caddy).
+
+### 4. Install the self-update script
+
+```
+sudo deploy/ops-dashboard-updater/install.sh
+```
+
+To enable scheduled updates (daily at 03:00):
+
+```
+sudo systemctl enable --now ops-dashboard-updater.timer
+```
+
+To trigger a manual update via SSH:
+
+```
+sudo systemctl start ops-dashboard-updater.service
+# or:
+sudo /opt/ops-dashboard-updater/update.sh
+```
+
+> **Never** trigger updates through the dashboard UI — the script restarts the
+> container that serves the UI.
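The secret-copying step in §2 can also be scripted. A minimal sketch, not part of the patch: the logic mirrors the README (read the agent secret, set `OPS_AGENT_SECRET=` in the env file), but the demo below uses throwaway temp files instead of the real `/etc/ops-agent/secret` and `/srv/ops/ops-dashboard.env` paths so it is safe to run anywhere.

```shell
#!/usr/bin/env bash
# Sketch: sync OPS_AGENT_SECRET in the web-app env file with the agent secret.
# Demo stand-ins; on the host you would use SECRET_FILE=/etc/ops-agent/secret
# and ENV_FILE=/srv/ops/ops-dashboard.env (paths from the README).
set -euo pipefail

SECRET_FILE=$(mktemp)   # stand-in for /etc/ops-agent/secret
ENV_FILE=$(mktemp)      # stand-in for /srv/ops/ops-dashboard.env
printf 'deadbeefcafef00d\n' > "$SECRET_FILE"
printf 'DATABASE_URL=postgres://…\nOPS_AGENT_SECRET=old\n' > "$ENV_FILE"

secret=$(<"$SECRET_FILE")
if grep -q '^OPS_AGENT_SECRET=' "$ENV_FILE"; then
  # Replace the existing value in place.
  sed -i "s|^OPS_AGENT_SECRET=.*|OPS_AGENT_SECRET=${secret}|" "$ENV_FILE"
else
  # Append the key if it is not present yet.
  printf 'OPS_AGENT_SECRET=%s\n' "$secret" >> "$ENV_FILE"
fi

grep '^OPS_AGENT_SECRET=' "$ENV_FILE"
```

Running this against the demo files prints the updated line; idempotent, so it can be re-run after rotating the agent secret.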
+
+## Configuration
+
+| File | Purpose |
+|---|---|
+| `/srv/ops/ops-dashboard.env` | Web-app environment (DATABASE_URL, AUTH_SECRET, OPS_AGENT_SECRET, …) |
+| `/etc/ops-agent/secret` | Shared HMAC secret between web-app and ops-agent |
+| `/etc/ops-agent/commands.yml` | Whitelist of commands the ops-agent may run |
+| `/etc/ops-agent/flows/` | Flow YAML files (backup, caddy reload, etc.) |
+| `/srv/scrum4me/compose/docker-compose.yml` | Main Compose file (add ops-dashboard fragment from `deploy/`) |
+
 ## Ops-agent auth
 
 The web-app communicates with the ops-agent via a shared secret stored in
diff --git a/deploy/ops-dashboard-updater/install.sh b/deploy/ops-dashboard-updater/install.sh
new file mode 100644
index 0000000..17b984c
--- /dev/null
+++ b/deploy/ops-dashboard-updater/install.sh
@@ -0,0 +1,33 @@
+#!/usr/bin/env bash
+# Install the ops-dashboard self-update script and systemd units.
+# Run as root from within the repo.
+set -euo pipefail
+
+REPO_DIR="$(cd "$(dirname "$0")/../.." && pwd)"
+INSTALL_DIR=/opt/ops-dashboard-updater
+SERVICE_DIR=/etc/systemd/system
+
+echo "==> Installing update script to ${INSTALL_DIR}"
+mkdir -p "${INSTALL_DIR}"
+install -m 0750 -o root -g root \
+    "${REPO_DIR}/deploy/ops-dashboard-updater/update.sh" \
+    "${INSTALL_DIR}/update.sh"
+
+echo "==> Installing systemd units"
+install -m 0644 -o root -g root \
+    "${REPO_DIR}/deploy/ops-dashboard-updater/ops-dashboard-updater.service" \
+    "${SERVICE_DIR}/ops-dashboard-updater.service"
+install -m 0644 -o root -g root \
+    "${REPO_DIR}/deploy/ops-dashboard-updater/ops-dashboard-updater.timer" \
+    "${SERVICE_DIR}/ops-dashboard-updater.timer"
+
+systemctl daemon-reload
+
+echo ""
+echo "==> Done. To enable automatic scheduled updates:"
+echo "      systemctl enable --now ops-dashboard-updater.timer"
+echo ""
+echo "    To run a manual update now:"
+echo "      systemctl start ops-dashboard-updater.service"
+echo "      # or directly:"
+echo "      /opt/ops-dashboard-updater/update.sh"
diff --git a/deploy/ops-dashboard-updater/ops-dashboard-updater.service b/deploy/ops-dashboard-updater/ops-dashboard-updater.service
new file mode 100644
index 0000000..13d9664
--- /dev/null
+++ b/deploy/ops-dashboard-updater/ops-dashboard-updater.service
@@ -0,0 +1,14 @@
+[Unit]
+Description=Self-update ops-dashboard (oneshot, triggered by timer or SSH)
+After=network.target docker.service
+
+[Service]
+Type=oneshot
+User=root
+ExecStart=/opt/ops-dashboard-updater/update.sh
+StandardOutput=journal
+StandardError=journal
+SyslogIdentifier=ops-dashboard-update
+
+[Install]
+WantedBy=multi-user.target
diff --git a/deploy/ops-dashboard-updater/ops-dashboard-updater.timer b/deploy/ops-dashboard-updater/ops-dashboard-updater.timer
new file mode 100644
index 0000000..4ac171d
--- /dev/null
+++ b/deploy/ops-dashboard-updater/ops-dashboard-updater.timer
@@ -0,0 +1,11 @@
+[Unit]
+Description=Scheduled self-update for ops-dashboard (optional)
+
+[Timer]
+# Check for updates every day at 03:00 local time.
+# Disable this timer if you prefer manual-only updates via SSH.
+OnCalendar=*-*-* 03:00:00
+Persistent=true
+
+[Install]
+WantedBy=timers.target
diff --git a/deploy/ops-dashboard-updater/update.sh b/deploy/ops-dashboard-updater/update.sh
new file mode 100644
index 0000000..daf9963
--- /dev/null
+++ b/deploy/ops-dashboard-updater/update.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# Self-update script for ops-dashboard.
+# Run as root via SSH or the systemd oneshot service below.
+# Do NOT invoke this through the UI — it restarts the container serving the UI.
+set -euo pipefail
+
+REPO_DIR=/srv/ops/repos/ops-dashboard
+COMPOSE_FILE=/srv/scrum4me/compose/docker-compose.yml
+SERVICE=ops-dashboard
+LOG_TAG=ops-dashboard-update
+
+# Print to stdout and duplicate the message to the journal (best-effort).
+log() {
+  local msg="[$(date -u +%FT%TZ)] $*"
+  echo "$msg"
+  echo "$msg" | systemd-cat -t "$LOG_TAG" -p info 2>/dev/null || true
+}
+die() { echo "[$(date -u +%FT%TZ)] ERROR: $*" >&2; exit 1; }
+
+# ── 1. Pull latest code ────────────────────────────────────────────────────────
+log "Pulling latest code from origin..."
+git -C "$REPO_DIR" fetch --prune origin
+CURRENT=$(git -C "$REPO_DIR" rev-parse HEAD)
+git -C "$REPO_DIR" reset --hard origin/main
+NEW=$(git -C "$REPO_DIR" rev-parse HEAD)
+
+if [[ "$CURRENT" == "$NEW" ]]; then
+  log "Already up-to-date at $NEW — nothing to rebuild."
+  exit 0
+fi
+log "Updated $CURRENT → $NEW"
+
+# ── 2. Build new image ─────────────────────────────────────────────────────────
+log "Building Docker image..."
+docker compose -f "$COMPOSE_FILE" build "$SERVICE"
+
+# ── 3. Restart container ───────────────────────────────────────────────────────
+log "Restarting $SERVICE with new image..."
+docker compose -f "$COMPOSE_FILE" up -d --force-recreate "$SERVICE"
+
+# ── 4. Smoke test ──────────────────────────────────────────────────────────────
+# Resolve the container ID via Compose: the container name only matches the
+# service name when container_name is set in the Compose file.
+CID=$(docker compose -f "$COMPOSE_FILE" ps -q "$SERVICE")
+[[ -n "$CID" ]] || die "No container found for service $SERVICE."
+
+log "Waiting for container to become healthy..."
+for i in $(seq 1 12); do
+  STATUS=$(docker inspect --format='{{.State.Health.Status}}' "$CID" 2>/dev/null || true)
+  if [[ "$STATUS" == "healthy" ]]; then
+    log "Container is healthy (attempt ${i}/12)."
+    break
+  fi
+  # Fallback: accept a running container if no HEALTHCHECK is defined.
+  RUNNING=$(docker inspect --format='{{.State.Running}}' "$CID" 2>/dev/null || echo false)
+  if [[ "$RUNNING" == "true" && -z "$STATUS" ]]; then
+    log "Container running (no HEALTHCHECK defined)."
+    break
+  fi
+  if [[ $i -eq 12 ]]; then
+    die "Container did not become healthy within 60 s. Check: docker logs $CID"
+  fi
+  sleep 5
+done
+
+log "Update complete — $SERVICE is running commit $NEW."
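The smoke-test loop in `update.sh` is an instance of a generic poll-until-ready pattern. A standalone sketch of that pattern, assuming nothing from the repo (the `wait_for` helper name is mine, not part of the patch):

```shell
#!/usr/bin/env bash
# Sketch: poll a check command until it succeeds or the attempts run out.
# wait_for is a hypothetical helper that generalises the health-wait loop:
#   wait_for <tries> <interval-seconds> <command...>
set -euo pipefail

wait_for() {
  local tries=$1 interval=$2
  shift 2
  local i
  for ((i = 1; i <= tries; i++)); do
    if "$@"; then
      echo "ready after attempt ${i}/${tries}"
      return 0
    fi
    # Only sleep between attempts, not after the last one.
    if (( i < tries )); then sleep "$interval"; fi
  done
  echo "gave up after ${tries} attempts" >&2
  return 1
}

# Example: a check that succeeds immediately.
wait_for 12 5 true
# A docker health probe (illustrative only) would wrap the status check:
#   wait_for 12 5 sh -c \
#     '[ "$(docker inspect --format="{{.State.Health.Status}}" "$CID" 2>/dev/null)" = healthy ]'
```

Passing the check as `"$@"` keeps the helper reusable for curl probes, pg_isready, or docker inspect alike; the exit status of the command is the readiness signal.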
diff --git a/docs/runbooks/recovery.md b/docs/runbooks/recovery.md
new file mode 100644
index 0000000..0517a78
--- /dev/null
+++ b/docs/runbooks/recovery.md
@@ -0,0 +1,173 @@
+# Recovery Runbook — Ops Dashboard
+
+This runbook covers the four most common failure scenarios. All commands must
+be run as root (or with `sudo`) on the host over SSH.
+
+---
+
+## 1. Agent crashed / ops-agent not responding
+
+**Symptoms:** The dashboard shows "Agent unreachable" or HTTP 502/504 on flow
+endpoints; `curl http://127.0.0.1:3099/healthz` times out.
+
+```bash
+# Check current status and recent logs
+systemctl status ops-agent
+journalctl -u ops-agent -n 50 --no-pager
+
+# Restart the agent
+systemctl restart ops-agent
+
+# Verify it came back
+systemctl status ops-agent
+curl -sf http://127.0.0.1:3099/healthz && echo OK
+```
+
+If the agent exits immediately on restart, the most common causes are:
+
+| Cause | Fix |
+|---|---|
+| Missing `/etc/ops-agent/secret` | `openssl rand -hex 32 \| sudo tee /etc/ops-agent/secret && sudo chown root:ops-agent /etc/ops-agent/secret && sudo chmod 0640 /etc/ops-agent/secret` |
+| Missing or invalid `commands.yml` | `cp /opt/ops-agent/commands.yml.example /etc/ops-agent/commands.yml` |
+| Port 3099 already in use | `ss -tlnp \| grep 3099` then kill the conflicting process |
+| Node.js missing | `apt install nodejs` or reinstall via `deploy/ops-agent/setup.sh` |
+
+If the binary itself is corrupt, reinstall from the repo:
+
+```bash
+cd /srv/ops/repos/ops-dashboard
+sudo deploy/ops-agent/setup.sh
+```
+
+---
+
+## 2. Database corruption (ops_dashboard DB)
+
+**Symptoms:** Dashboard shows database errors; Prisma throws `P1001` / `P1002`;
+`psql` commands fail against the `ops_dashboard` database.
+
+### 2a. Restore from latest backup
+
+Backups are stored in `/var/backups/ops-dashboard/` (default path from the
+backup flow). Each file is a plain-SQL `pg_dump` dump named
+`ops_dashboard_YYYY-MM-DDTHH-MM-SSZ.sql`.
+
+```bash
+# List available backups (newest first)
+ls -lt /var/backups/ops-dashboard/*.sql | head -10
+
+# Drop the damaged DB and restore (substitute the chosen backup file)
+BACKUP=/var/backups/ops-dashboard/ops_dashboard_<TIMESTAMP>.sql
+sudo -u postgres psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'ops_dashboard';"
+sudo -u postgres dropdb ops_dashboard
+sudo -u postgres createdb ops_dashboard
+sudo -u postgres psql ops_dashboard < "$BACKUP"
+```
+
+Then restart the dashboard to re-open connections:
+
+```bash
+docker compose -f /srv/scrum4me/compose/docker-compose.yml restart ops-dashboard
+```
+
+### 2b. No usable backup available
+
+Run Prisma migrations to re-create an empty schema (all previous data is lost):
+
+```bash
+cd /srv/ops/repos/ops-dashboard
+sudo -u postgres createdb ops_dashboard   # if it was dropped
+docker compose -f /srv/scrum4me/compose/docker-compose.yml run --rm ops-dashboard \
+  npx prisma migrate deploy
+```
+
+---
+
+## 3. Container refuses to start
+
+**Symptoms:** `docker compose up ops-dashboard` exits immediately; the container
+appears in `docker ps -a` with `Exited`.
+
+```bash
+# Show exit logs
+docker logs ops-dashboard --tail 50
+
+# Inspect exit code
+docker inspect ops-dashboard --format='ExitCode: {{.State.ExitCode}}'
+```
+
+### Common causes and fixes
+
+| Exit code / symptom | Likely cause | Fix |
+|---|---|---|
+| `EACCES` / permission denied | Env file unreadable | `chmod 640 /srv/ops/ops-dashboard.env` |
+| `DATABASE_URL` connect error | DB not ready or wrong credentials | Check `docker ps` for `postgres`; verify `DATABASE_URL` in env file |
+| Port 3000 already bound | Another process on 3000 | `ss -tlnp \| grep 3000`; adjust port mapping in Compose file |
+| Invalid `AUTH_SECRET` | Secret too short for Next.js Auth | Regenerate: `openssl rand -base64 32` |
+| Image not found | `build` step skipped | `docker compose -f ... build ops-dashboard` then `up -d` |
+
+Force a clean rebuild:
+
+```bash
+docker compose -f /srv/scrum4me/compose/docker-compose.yml build --no-cache ops-dashboard
+docker compose -f /srv/scrum4me/compose/docker-compose.yml up -d --force-recreate ops-dashboard
+```
+
+---
+
+## 4. TLS certificate expired
+
+**Symptoms:** Browser shows `ERR_CERT_DATE_INVALID`; Caddy logs show
+`certificate expired`.
+
+Caddy manages TLS automatically via Let's Encrypt. A certificate should never
+expire under normal operation because Caddy renews ~30 days before expiry.
+
+### Check certificate status
+
+```bash
+# View Caddy logs for ACME activity
+journalctl -u caddy -n 100 --no-pager | grep -i acme
+
+# Check the expiry of the live certificate
+echo | openssl s_client -connect jp-visser.nl:443 -servername jp-visser.nl 2>/dev/null \
+  | openssl x509 -noout -dates
+```
+
+### Force renewal
+
+```bash
+# Stop Caddy, clear the ACME cache, restart
+systemctl stop caddy
+rm -rf /var/lib/caddy/.local/share/caddy/certificates
+systemctl start caddy
+journalctl -u caddy -f   # watch for successful issuance
+```
+
+### If ACME fails (rate-limited or DNS not resolving)
+
+1. Confirm DNS for `jp-visser.nl` points to this server's public IP.
+2. Confirm port 80 is open inbound (required for the HTTP-01 challenge).
+3. If rate-limited (more than 5 identical certificates per week for the domain),
+   wait until the limit resets or use the Let's Encrypt staging CA temporarily:
+   ```
+   # In the Caddyfile, add: acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
+   # Remove after the rate-limit window passes.
+   ```
+
+### Emergency: self-signed certificate
+
+If renewal is blocked and you need access now:
+
+```bash
+# Generate a temporary self-signed cert
+openssl req -x509 -newkey rsa:4096 -keyout /etc/caddy/selfsigned.key \
+  -out /etc/caddy/selfsigned.crt -days 30 -nodes \
+  -subj "/CN=jp-visser.nl"
+
+# Point the Caddyfile at the self-signed cert (tls directive):
+#   tls /etc/caddy/selfsigned.crt /etc/caddy/selfsigned.key
+systemctl reload caddy
+```
+
+Remember to revert to automatic TLS once the ACME issue is resolved.