feat: multi-worker instance_id presence + claim routing #7

Merged
janpeter merged 10 commits from feat/multi-worker-instance-id into master 2026-05-27 10:26:21 +02:00
Owner

Hardening (Phase 1: no-new-privileges + cap_drop ALL + selected caps), entrypoint per-instance state-dir + instance-id export, run-one-job.ts passes instanceId/hostname/pid to registerWorker/startHeartbeat/tryClaimJob. Live in prod with 2 worker-idea replicas.

Brainstormed design for scaling from 1-worker QNAP-runner to N-worker
setup on scrum4me-srv (Ubuntu, Tailscale). Four parallel tracks across
scrum4me-docker (infra + compose) and scrum4me-workers (schema + UI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ProductDoc MCP requires `title:` in frontmatter; without it
create_product_doc rejects with "Invalid input: expected string,
received undefined". File now matches the version pushed to both
scrum4me-docker and scrum4me-workers ProductDocs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves three review points raised during multi-product sync:
- R-DB: §4 step 0 adds an explicit gating command for the DB target
  (scrum4me-postgres vs Neon) before the Track-II migration
- R-TrackI: §10 marks Track-I as a preparatory commit, with
  no claim of parallel-track wall-clock gains
- R-State: §9 Track-III step 5 wording clarified: the bind-mount
  is shared; isolation comes from the $(hostname) subdir write path

Pushed to both ProductDocs (scrum4me-docker and scrum4me-workers)
as revision 3, with identical content hash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 8 points found valid against the local code, no pushback:
- P0-1 schema: migration rewritten (started_at already exists,
  id @id PK retained, @@unique(token_id) → composite unique)
- P0-2 read_only: deferred to the B7 follow-up, because the runner writes to $HOME
- P1-1 instance-id: resolved in entrypoint.sh, before _lib.sh
- P1-2 SQL: task_id/finished_at instead of story_id/completed_at, scoped joins
- P1-3 rollback: rollbackClaim + resetStaleClaimedJobs NULL-clearing
- P1-4 Track-III: runner imports + MCP_GIT_REF pin as a deliverable
- P2-1 compose: gating step for the target-file source of truth
- P2-2 flock: /var/cache/repos removed from the core plan (active path = per-container $HOME)

Pushed to both ProductDocs as revision 4. The substantive architecture
(1-per-container, FOR UPDATE SKIP LOCKED, additive migrations, ops-agent
flow, scrum4me-srv host) is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Maps spec rev 4 to four executable phases:
- Phase 0: pre-flight gating (DB target, compose source, MCP imports)
- Phase 1 (Track-I): scrum4me-docker hardening + ops-agent prep
- Phase 2 (Track-II): scrum4me-workers + scrum4me-mcp schema/code
- Phase 3 (Track-III): scrum4me-docker compose multi-worker
- Phase 4 (Track-IV): scrum4me-workers UI extension

Bite-sized TDD steps with actual code, exact paths, and verification
commands per the superpowers:writing-plans skill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 0 ran and surfaced four findings that change the spec-rev-4
phase mappings:

1. DB target = local scrum4me-postgres container (NOT Neon)
2. Compose source-of-truth = server-canonical /srv/scrum4me/compose/
   docker-compose.yml (NOT git-tracked; manual edits with .bak.*)
3. Worker code lives in scrum4me-mcp (NOT scrum4me-workers as the
   spec assumed); scrum4me-workers is Next.js UI only
4. Schema source-of-truth = scrum4me-shared/prisma/schema.prisma
   (canonical, regenerated by both consumers via
   gen-consumer-schema.sh)

Phase 2 mapping table updated. Phase 2 now requires three PRs:
scrum4me-shared (schema), scrum4me-mcp (code + migration + bump),
scrum4me-workers (bump only — UI work stays in Phase 4).

Phase 1 also clarified: compose edits target /srv/scrum4me/compose/
docker-compose.yml on the server, not the repo-local file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 Task 1.3 — documents the new env var added to
/srv/scrum4me/compose/.env on the server. Default 1 (single-worker
behavior unchanged); bumped to higher values in Phase 3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 Tasks 3.1-3.3:
- entrypoint.sh resolves AGENT_INSTANCE_ID (from WORKER_INSTANCE_ID
  env or hostname), exports per-instance AGENT_STATE_DIR and
  AGENT_LOG_DIR. Replaces the := defaults in _lib.sh which were no-ops
  because the Dockerfile already exports the parents. Also exports
  SCRUM4ME_INSTANCE_ID for scrum4me-mcp's getInstanceId().
- run-one-job.ts resolves instanceId once at start, passes through
  to registerWorker, startHeartbeat, and both tryClaimJob call sites.
- Dockerfile: comment documenting the MCP_GIT_REF pin needed during
  this rollout. Default remains 'main' for the post-merge flow.
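
A minimal sketch of the instance-id resolution described in the first bullet, not the actual entrypoint.sh. The function names resolve_instance_id and export_instance_dirs are hypothetical, and the layout (one subdirectory per instance below the parent dirs exported by the Dockerfile) is an assumption based on the "$(hostname) subdir" wording used elsewhere in this PR.

```shell
#!/bin/sh
# Hypothetical helpers; names and directory layout are assumptions.

# Prefer an explicit WORKER_INSTANCE_ID, otherwise fall back to the
# container hostname (unique per replica under docker compose).
resolve_instance_id() {
  if [ -n "${WORKER_INSTANCE_ID:-}" ]; then
    printf '%s\n' "$WORKER_INSTANCE_ID"
  else
    hostname
  fi
}

# Append the instance id to the parent dirs that are already exported.
# A ':=' default cannot do this: it only assigns when the variable is
# unset or empty, so it is a no-op once the parent value exists.
export_instance_dirs() {
  AGENT_INSTANCE_ID="$(resolve_instance_id)"
  AGENT_STATE_DIR="${AGENT_STATE_DIR:?parent state dir must be set}/${AGENT_INSTANCE_ID}"
  AGENT_LOG_DIR="${AGENT_LOG_DIR:?parent log dir must be set}/${AGENT_INSTANCE_ID}"
  # Same value under the name that scrum4me-mcp's getInstanceId() reads.
  SCRUM4ME_INSTANCE_ID="$AGENT_INSTANCE_ID"
  export AGENT_INSTANCE_ID AGENT_STATE_DIR AGENT_LOG_DIR SCRUM4ME_INSTANCE_ID
}
```

In this sketch export_instance_dirs would run once in the entrypoint, before _lib.sh is sourced, so every later script sees the per-instance paths.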

Together with Phase 2 (instance-aware scrum4me-mcp at feat
commit 772bbc3), this completes the runtime-side of multi-worker
presence/claim. Container rebuild + redeploy happens in sub-step B.

Spec rev 4 §5 B2 (Codex P1-1). Plan rev 2 Phase 3 Tasks 3.1-3.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous feat commit (b7d2b84) called mkdir as root before gosu-drop.
With Phase 1 hardening (cap_drop ALL + only CHOWN/SETUID/SETGID,
no DAC_OVERRIDE), root inside the container cannot write to bind-
mount parents that are owned by agent (UID 1000). Result: container
in restart loop, no DB writes.

Fix: ensure_writable still chowns the PARENT dirs as root (works via
CAP_CHOWN). The per-instance subdirs are created via 'gosu agent
mkdir -p' so the agent user (which owns the parents) can do it
without needing CAP_DAC_OVERRIDE. Same fix applied to the
runs/jobs log subdirs, which sit inside AGENT_LOG_DIR.
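
The corrected order can be sketched roughly as follows. This is an illustration under stated assumptions, not the committed code: make_instance_dirs and the RUN_AS override are hypothetical, and the parent dirs are assumed to have been chowned to agent already (the job of ensure_writable, which is not reproduced here).

```shell
#!/bin/sh
# Hypothetical helper. RUN_AS defaults to "gosu agent" as in the commit
# message; the override exists only so the sketch can run without gosu.
make_instance_dirs() {
  run_as="${RUN_AS:-gosu agent}"
  for dir in "$@"; do
    # mkdir runs as the agent user, which owns the parent dir, so
    # CAP_DAC_OVERRIDE is not needed. Word splitting of $run_as is
    # intentional: it expands to a command plus its argument.
    $run_as mkdir -p "$dir" || return 1
  done
}
```

A call such as `make_instance_dirs "$AGENT_STATE_DIR" "$AGENT_LOG_DIR/runs" "$AGENT_LOG_DIR/jobs"` would cover the state dir and both log subdirs in one pass.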

Discovered during Phase 3 sub-B rebuild test on scrum4me-srv.
Rolled back to master without data loss; this commit retries
with the corrected order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Under cap_drop ALL (no DAC_OVERRIDE), the root parent shell cannot
open files in AGENT_LOG_DIR for writing — that dir is owned by agent
(UID 1000). The redirects `> health-server.log` and `>> repo-bootstrap.log`
were processed by the root shell BEFORE gosu spawned, so they failed
with Permission denied. Worker still ran (gosu's stdout falls through
to docker logs) but the dedicated log files were never written, and
repo-bootstrap's exit code propagated as non-zero from the redirect
failure rather than from the script itself.

Fix: wrap the gosu command in `sh -c` and do the redirect via `exec >>...`
inside the agent context. Now the file open happens AS agent, which
owns the dir.
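
A minimal sketch of that pattern, assuming a hypothetical wrapper named run_logged. The RUN_AS override is also an assumption, added so the sketch can be tried without gosu.

```shell
#!/bin/sh
# Failing form: the calling (root) shell opens the log file before gosu
# starts, which needs DAC_OVERRIDE on an agent-owned dir:
#   gosu agent some-command >> "$AGENT_LOG_DIR/some.log"

# Hypothetical wrapper: the inner shell opens the log, so the open
# happens after the privilege drop, as the user that owns the dir.
run_logged() {
  log_file="$1"
  shift
  run_as="${RUN_AS:-gosu agent}"
  # In the inner shell, $1 is the log path and the remaining
  # arguments are the command to execute.
  $run_as sh -c 'log="$1"; shift; exec >>"$log" 2>&1; exec "$@"' sh "$log_file" "$@"
}
```

Because the redirect can no longer fail in the outer shell, the exit status of run_logged is that of the wrapped command, or of the inner shell if the log cannot be opened there.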

Found during Phase 3 sub-B run after the mkdir fix (ae38192).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
janpeter merged commit 216e5c94dc into master 2026-05-27 10:26:21 +02:00
Reference
janpeter/scrum4me-docker!7