docs(PBI-58): add developer manual chapters under docs/manual/

Adds a 7-file English-language manual targeted at new human contributors: index, overview, statuses & transitions (with mermaid state diagrams), git workflow, MCP integration, docker, and troubleshooting. The manual is the *map*: it cross-references existing runbooks/ADRs/architecture docs rather than duplicating their content. Regenerates docs/INDEX.md and validates with check-doc-links.mjs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 17:37:43 +02:00 · 2026-05-07 17:37:43 +02:00 · e75bac9375
commit e75bac9375
parent e8562d4018
8 changed files with 873 additions and 0 deletions
--- a/docs/manual/05-docker.md
+++ b/docs/manual/05-docker.md
@@ -0,0 +1,149 @@
+---
+title: "Docker"
+status: active
+audience: [contributor]
+language: en
+last_updated: 2026-05-07
+when_to_read: "Before running the worker locally, debugging a stuck job, or operating the Mac/NAS deployment."
+---
+
+# 05 — Docker
+
+This chapter is the contributor's tour of the Docker side of Scrum4Me. Two important up-front facts:
+
+1. **The Next.js app is not containerised.** The web UI, API routes, server actions, and database connection all run on **Vercel** (serverless functions + Edge runtime). There is no `Dockerfile` in this repo and no `docker-compose.yml`.
+2. **Only the worker is containerised.** The "worker" is a Claude Code agent in a long-running container that polls the Scrum4Me job queue via MCP and executes `TASK_IMPLEMENTATION` / `IDEA_GRILL` / `IDEA_MAKE_PLAN` / `SPRINT_IMPLEMENTATION` jobs.
+
+The container image and its supporting scripts live in a **separate repo**: [`madhura68/scrum4me-docker`](https://github.com/madhura68/scrum4me-docker). This manual documents the consumer side — what the worker is, how it relates to Scrum4Me, and how to diagnose issues. The container internals (Dockerfile, entrypoint, agent provisioning) are out of scope for this manual; see that repo's README.
+
+> **Note:** A separate sandbox repo `scrum4me-sbx` ([`SC-4`](https://github.com/madhura68/scrum4me-sbx)) exists for Docker exploration. Treat it as a scratchpad, not as the production worker.
+
+## Topology
+
+```mermaid
+flowchart LR
+    subgraph Vercel
+        App[Next.js app<br/>+ API routes]
+    end
+    subgraph Neon
+        DB[(Postgres)]
+    end
+    subgraph Mac["Mac (default) / NAS (opt-in)"]
+        Worker[Worker container<br/>Claude Code + MCP]
+    end
+    Worker -- MCP over HTTPS --> App
+    App -- Prisma --> DB
+    Worker -- git push --> GH[GitHub]
+    GH -- webhooks --> App
+```
+
+- The worker **never connects to the database directly**. All state changes go through MCP tools, which call the Vercel-hosted REST API, which writes to Neon via Prisma.
+- The worker **does** push commits directly to GitHub. GitHub then notifies Vercel and the auto-PR flow ([03 — Git Workflow](./03-git-workflow.md)) takes over.
+
+## Mac vs NAS
+
+| Flow | When to use | Status |
+|---|---|---|
+| **Mac-native (arm64)** | Default for development and small teams | Active |
+| **NAS** | Self-hosted always-on worker on a Synology / Asustor / similar | Opt-in, validated by historical smoke tests in [`docs/docker-smoke/`](../docker-smoke/) |
+
+The Mac flow is the default because it doesn't require dedicated hardware. The container runs natively on Apple Silicon (arm64) — no x86 emulation overhead.
+
+## Environment variables the worker needs
+
+The worker container needs **only** what's required to authenticate to MCP and push to GitHub:
+
+| Var | Purpose |
+|---|---|
+| `SCRUM4ME_BEARER_TOKEN` | Bearer token bound to a product. Issued on the user's API-token settings page. |
+| `SCRUM4ME_BASE_URL` | Usually `https://scrum4me.vercel.app` (or the user's domain). |
+| `GITHUB_TOKEN` | Personal access token with `repo` scope, used by `git push` and `gh pr create`. |
+| `ANTHROPIC_API_KEY` | The Claude API key used by the worker process. |
+| `MIN_QUOTA_PCT` | Optional. Worker pauses if Anthropic quota drops below this percentage. |
+
+> **Hardstop:** the worker does **not** need `DATABASE_URL`, `SESSION_SECRET`, or `CRON_SECRET`. Those belong to the Next.js app; the worker only talks to MCP. If you find yourself adding DB env vars to the worker, stop — you're solving the wrong problem.
+
+The full list and provisioning instructions live in the [`scrum4me-docker` README](https://github.com/madhura68/scrum4me-docker). **TODO:** link to specific sections of that README once it's stable.
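+
+The table and the hardstop above can be turned into a small pre-flight check. The following is a minimal sketch, not the worker's actual start-up code: `checkWorkerEnv` and `EnvReport` are hypothetical names, and only the variable names and the 20% default come from this chapter.
+
+```typescript
+// Hypothetical pre-flight check; the real worker may validate its environment differently.
+const REQUIRED = [
+  "SCRUM4ME_BEARER_TOKEN",
+  "SCRUM4ME_BASE_URL",
+  "GITHUB_TOKEN",
+  "ANTHROPIC_API_KEY",
+] as const;
+
+// Variables that belong to the Next.js app and should never reach the worker.
+const FORBIDDEN = ["DATABASE_URL", "SESSION_SECRET", "CRON_SECRET"] as const;
+
+interface EnvReport {
+  missing: string[];
+  forbidden: string[];
+  minQuotaPct: number;
+}
+
+function checkWorkerEnv(env: Record<string, string | undefined>): EnvReport {
+  const missing = REQUIRED.filter((name) => !env[name]);
+  const forbidden = FORBIDDEN.filter((name) => env[name] !== undefined);
+
+  // MIN_QUOTA_PCT is optional; fall back to the documented default of 20
+  // when it is absent, empty, or not a number.
+  const raw = env["MIN_QUOTA_PCT"];
+  const parsed = raw === undefined || raw.trim() === "" ? NaN : Number(raw);
+  const minQuotaPct = Number.isFinite(parsed) ? parsed : 20;
+
+  return { missing: [...missing], forbidden: [...forbidden], minQuotaPct };
+}
+```
+
+A non-empty `forbidden` list is the signal described in the hardstop: the worker is being given credentials that only the Next.js app should hold.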
+
+## What the worker loop does, on a single iteration
+
+```mermaid
+sequenceDiagram
+    participant W as Worker
+    participant Q as worker-quota-probe.sh
+    participant M as MCP server
+    W->>Q: probe Anthropic quota
+    Q-->>W: { pct, reset_at_iso }
+    alt pct < MIN_QUOTA_PCT
+        W->>M: worker_heartbeat(pct, last_quota_check_at)
+        W->>W: sleep until reset_at_iso (cap 1h)
+    else quota ok
+        W->>M: worker_heartbeat(pct, last_quota_check_at)
+        W->>M: wait_for_job (block ≤600s, claim atomically)
+        alt queue empty
+            W->>W: continue (no work, loop again)
+        else got job
+            W->>W: execute by `kind`
+            W->>M: update_job_status(done|failed)
+        end
+    end
+    Note over W: continue forever
+```
+
+The loop is described authoritatively in [`docs/runbooks/mcp-integration.md`](../runbooks/mcp-integration.md#batch-loop-verplichte-agent-flow) and [`docs/runbooks/worker-idempotency.md`](../runbooks/worker-idempotency.md).
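+
+As a reading aid, one iteration of the diagram can be sketched in TypeScript with the MCP calls injected as dependencies. This is an illustration under assumptions, not the worker's implementation: the `WorkerDeps` interface, the `Job` and `Quota` shapes, and the `Outcome` labels are hypothetical, and the real tool payloads are defined in the runbooks linked above.
+
+```typescript
+// Hypothetical shapes; the real MCP tool payloads live in the runbooks.
+type Job = { id: string; kind: string };
+type Quota = { pct: number; reset_at_iso: string };
+
+interface WorkerDeps {
+  probeQuota(): Promise<Quota>;
+  heartbeat(pct: number, checkedAtIso: string): Promise<void>;
+  waitForJob(): Promise<Job | null>; // null when the queue stayed empty
+  execute(job: Job): Promise<void>;
+  updateJobStatus(jobId: string, status: "done" | "failed"): Promise<void>;
+  sleepUntil(iso: string): Promise<void>;
+}
+
+type Outcome = "paused" | "idle" | "done" | "failed";
+
+async function runIteration(deps: WorkerDeps, minQuotaPct: number): Promise<Outcome> {
+  const quota = await deps.probeQuota();
+
+  // The heartbeat is sent on both branches of the diagram.
+  await deps.heartbeat(quota.pct, new Date().toISOString());
+
+  if (quota.pct < minQuotaPct) {
+    await deps.sleepUntil(quota.reset_at_iso);
+    return "paused";
+  }
+
+  const job = await deps.waitForJob();
+  if (job === null) return "idle"; // queue empty: the caller loops again
+
+  // Report exactly one terminal status, whether execution succeeded or threw.
+  let status: "done" | "failed" = "done";
+  try {
+    await deps.execute(job);
+  } catch {
+    status = "failed";
+  }
+  await deps.updateJobStatus(job.id, status);
+  return status;
+}
+```
+
+Injecting the dependencies keeps the control flow testable without a live MCP server; the caller would wrap `runIteration` in an endless loop.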
+
+### Quota probe
+
+`bin/worker-quota-probe.sh` (in `scrum4me-docker`) makes a tiny call to the Anthropic API to read the current quota percentage and reset time. Cost: ~1 output token per probe (~12 tokens/hour at 5-minute intervals). The default `MIN_QUOTA_PCT` is **20%**. On Pro/Max plans the remaining quota typically stays above this threshold, so the worker does not normally pause during regular working hours.
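+
+The pause rule from the loop diagram (`pct < MIN_QUOTA_PCT`, then sleep until `reset_at_iso`, capped at one hour) amounts to a small calculation. A minimal sketch, assuming the probe returns an ISO-8601 timestamp; the function name `pauseMs` and the fallback for an unparseable timestamp are assumptions, not documented behaviour.
+
+```typescript
+const ONE_HOUR_MS = 60 * 60 * 1000;
+
+// Returns how long the worker should sleep before probing again, in milliseconds.
+function pauseMs(pct: number, minQuotaPct: number, resetAtIso: string, now: Date): number {
+  // Quota at or above the threshold: no pause.
+  if (pct >= minQuotaPct) return 0;
+
+  const untilReset = Date.parse(resetAtIso) - now.getTime();
+
+  // Assumption: if the timestamp cannot be parsed, sleep for the full cap.
+  if (Number.isNaN(untilReset)) return ONE_HOUR_MS;
+
+  // Never sleep a negative amount, and never longer than the 1h cap.
+  return Math.min(Math.max(untilReset, 0), ONE_HOUR_MS);
+}
+```
+
+The cap means a worker whose quota resets many hours from now still wakes up and probes again at least once an hour.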
+
+### Heartbeat
+
+Every iteration the worker calls `worker_heartbeat({ last_quota_pct, last_quota_check_at })`. The MCP server emits an SSE event so the NavBar in the Next.js app shows the worker as live. A heartbeat older than 15 seconds is rendered as "offline" / "stand-by" in the UI.
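+
+The 15-second staleness rule can be expressed as a pure function. This is a sketch of the rule as stated above, not the NavBar's actual code; `isWorkerLive` and its parameters are hypothetical names.
+
+```typescript
+const HEARTBEAT_STALE_MS = 15 * 1000;
+
+// True when the last heartbeat is at most 15 seconds old.
+function isWorkerLive(lastHeartbeatIso: string | null, now: Date): boolean {
+  // No heartbeat seen yet: treat the worker as not live.
+  if (lastHeartbeatIso === null) return false;
+
+  const ageMs = now.getTime() - Date.parse(lastHeartbeatIso);
+
+  // An unparseable timestamp yields NaN and is treated as not live.
+  return !Number.isNaN(ageMs) && ageMs <= HEARTBEAT_STALE_MS;
+}
+```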
+
+### Stale-claim recovery
+
+If a worker dies mid-job (process crash, container kill, network partition), its claimed job stays as `CLAIMED` in the database. After **30 minutes** the next `wait_for_job` call automatically requeues it (`CLAIMED → QUEUED`) before claiming a fresh one. No manual intervention is required for clean recovery.
+
+When you **do** need to manually requeue a job (e.g. you killed it intentionally and don't want to wait 30 min), the operator route is the admin board → "Requeue job" button. **TODO:** confirm the exact UI path; this is not yet documented in `docs/runbooks/`.
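+
+The automatic requeue rule described above can be illustrated with an in-memory sketch. The real logic runs server-side inside `wait_for_job` against the database; the `QueueJob` shape, the `claimedAt` field, and the function name below are assumptions made for illustration only.
+
+```typescript
+const STALE_CLAIM_MS = 30 * 60 * 1000;
+
+// Hypothetical in-memory view of a queued job.
+interface QueueJob {
+  id: string;
+  status: string; // this chapter names "QUEUED" and "CLAIMED"
+  claimedAt: Date | null;
+}
+
+// Returns a new list in which claims of 30 minutes or older are back in the queue.
+function requeueStaleClaims(jobs: QueueJob[], now: Date): QueueJob[] {
+  return jobs.map((job) => {
+    const stale =
+      job.status === "CLAIMED" &&
+      job.claimedAt !== null &&
+      now.getTime() - job.claimedAt.getTime() >= STALE_CLAIM_MS;
+    // Stale claim: CLAIMED -> QUEUED, and the claim timestamp is cleared.
+    return stale ? { ...job, status: "QUEUED", claimedAt: null } : job;
+  });
+}
+```
+
+Running this step before claiming a fresh job is what lets a crashed worker's job recover without operator action.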
+
+## Running the worker locally
+
+The intended local workflow is **Mac-native Docker** (recorded in the project's standing memory as `project_docker_default_target`). The high-level steps follow; verify them against the [scrum4me-docker README](https://github.com/madhura68/scrum4me-docker) for exact commands:
+
+1. Clone `scrum4me-docker` next to `Scrum4Me/` (so `~/Development/Scrum4Me/scrum4me-docker/`).
+2. Provision the env vars above (typically a `.env` file in that repo, **not committed**).
+3. `docker build` the image and `docker run` it with the env file supplied (for example via `--env-file`).
+4. Watch container logs for the heartbeat/quota cycle.
+5. Trigger a job from the UI ("Voer alle uit", Dutch for "Run all", on the Solo Board) and verify the worker picks it up within ~5 seconds.
+
+> **TODO:** once the `scrum4me-docker` README has stabilised, replace the steps above with copy-paste-ready commands. Until then, defer to that repo for canonical instructions.
+
+## Debugging a stuck worker
+
+| Symptom | Likely cause | Fix |
+|---|---|---|
+| Worker shows offline in NavBar but container is running | `worker_heartbeat` not reaching MCP | Check `SCRUM4ME_BASE_URL` and `SCRUM4ME_BEARER_TOKEN`; tail container logs for HTTP errors |
+| Worker logs say "stand-by" indefinitely | `pct < MIN_QUOTA_PCT` and `reset_at_iso` not yet reached | Lower `MIN_QUOTA_PCT` for testing, or wait for the printed `reset_at_iso` |
+| Job stuck `CLAIMED` for >30 min | Worker died mid-job | Wait: the auto-requeue triggers on the next `wait_for_job` call, so make sure a worker is running |
+| Worker claims job but never updates status | Crashed before `update_job_status`; container restarted in a loop | Check `docker logs`; the next `wait_for_job` will requeue stale claims |
+| `update_job_status` returns `403` | Bearer token doesn't match `claimed_by_token_id` | The token was rotated mid-run; restart the worker with a fresh token |
+
+For deeper troubleshooting see [06 — Troubleshooting](./06-troubleshooting.md).
+
+## Smoke-test references
+
+Historical Docker smoke tests live in [`docs/docker-smoke/`](../docker-smoke/). They validated the worktree-isolation + branch-per-story flow when the Docker worker was first introduced. They are **historical** — don't expect them to be runnable as-is — but they're a useful reference when you want to verify the same flow on a new container image.
+
+## Deep links
+
+| Topic | Source |
+|---|---|
+| Container image, Dockerfile, build | [`scrum4me-docker` repo](https://github.com/madhura68/scrum4me-docker) |
+| Worker loop & quota check | [`docs/runbooks/mcp-integration.md`](../runbooks/mcp-integration.md#pre-flight-quota-check-m13) |
+| Worker idempotency / job-status protocol | [`docs/runbooks/worker-idempotency.md`](../runbooks/worker-idempotency.md) |
+| Historical smoke tests | [`docs/docker-smoke/`](../docker-smoke/) |
+| Sandbox / exploration repo | [`scrum4me-sbx` repo](https://github.com/madhura68/scrum4me-sbx) |
+
+## What's next
+
+→ [06 — Troubleshooting](./06-troubleshooting.md) covers error codes and recovery procedures across the full stack.