fix(rollback): clean worktree + sprint_task_executions on rollbackClaim #19
Problem
`rollbackClaim` in `src/tools/wait-for-job.ts` only reset the `claude_jobs` row to QUEUED. Nothing cleaned up:
- the worktree (`/home/agent/.scrum4me-agent-worktrees/<jobId>/`)
- the `sprint_task_executions` rows that `getFullJobContext` had just created for SPRINT jobs
- `lease_until` on `claude_jobs`

Result: one transient failure = permanent stuck job. Every retry hit `Worktree path already exists` or `Unique constraint failed on (sprint_job_id, task_id)`.

Live observed (2026-05-27 13:05 CEST)
SPRINT_IMPLEMENTATION job `cmpnyil26000pvi7rjjncv5sz`:
- `API Error: Overloaded` (Anthropic 529, infra capacity)
- `rollback claim ... reason=claude_exit_1_without_status_update`
- `Worktree path already exists`

This is not a one-off: every transient claude failure (network blip, OOM, rate limit) leads to the same cascade. On the single-worker setup before multi-worker it was less noticeable (no second replica to see the mess); on multi-worker it blocks production.
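To make the stale-row half of this cascade concrete, here is a minimal in-memory sketch. `ExecutionTable` and `attemptClaim` are hypothetical stand-ins, not code from the repository. The all-or-nothing behaviour of `createMany` on a duplicate key is an assumption of the sketch, chosen to mirror the `Unique constraint failed` error quoted above.

```typescript
// Hypothetical in-memory stand-in for the sprint_task_executions table.
// Rows are grouped by sprint_job_id; each Set holds that job's task_ids, so a
// (sprint_job_id, task_id) pair can exist at most once.
type Execution = { sprint_job_id: string; task_id: string };

class ExecutionTable {
  private byJob = new Map<string, Set<string>>();

  // All-or-nothing insert: rejects the whole batch if any pair already exists.
  createMany(rows: Execution[]): number {
    for (const row of rows) {
      if (this.byJob.get(row.sprint_job_id)?.has(row.task_id)) {
        throw new Error("Unique constraint failed on (sprint_job_id, task_id)");
      }
    }
    for (const row of rows) {
      const tasks = this.byJob.get(row.sprint_job_id) ?? new Set<string>();
      tasks.add(row.task_id);
      this.byJob.set(row.sprint_job_id, tasks);
    }
    return rows.length;
  }

  // Removes every row for one job and returns how many were deleted.
  deleteMany(sprintJobId: string): number {
    const count = this.byJob.get(sprintJobId)?.size ?? 0;
    this.byJob.delete(sprintJobId);
    return count;
  }
}

// One claim attempt: creating the execution rows either succeeds or gets stuck.
function attemptClaim(table: ExecutionTable, jobId: string, taskIds: string[]): "ok" | "stuck" {
  try {
    table.createMany(taskIds.map((task_id) => ({ sprint_job_id: jobId, task_id })));
    return "ok";
  } catch {
    return "stuck";
  }
}

const table = new ExecutionTable();
const firstClaim = attemptClaim(table, "job-1", ["task-a", "task-b"]);
// Transient failure here; a rollback that only requeues leaves the rows behind.
const retryWithStaleRows = attemptClaim(table, "job-1", ["task-a", "task-b"]);
// Deleting the job's rows during rollback clears the way for the next retry.
const deletedRows = table.deleteMany("job-1");
const retryAfterCleanup = attemptClaim(table, "job-1", ["task-a", "task-b"]);

console.log({ firstClaim, retryWithStaleRows, deletedRows, retryAfterCleanup });
```

In this sketch the retry fails for as long as the rows from the first claim remain, and succeeds again once they are deleted.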
Fix
`rollbackClaim` now does, in order:
1. `lease_until = NULL` as part of the reset to QUEUED (this was broken pre-fix)
2. `prisma.sprintTaskExecution.deleteMany({ where: { sprint_job_id }})`, so a fresh retry can `createMany` without a unique violation
3. `resolveRepoRoot(product_id)` plus the existing `removeWorktreeForJob({repoRoot, jobId})`. That helper runs `git worktree remove --force` (which cleans the bare-repo registration in `/var/cache/repos/<repo>.git/worktrees/`) and `git branch -D` for the unpushed branch

All cleanup steps are best-effort (try/catch, logged via `claimLog`, never throw). A failed cleanup must not block the rollback itself: `claude_jobs.status = QUEUED` takes priority so the queue does not stall.

Test plan after merge
- `update_mcp_worker` flow on scrum4me-srv (cache-busted rebuild)
- `docker exec <worker> pkill -9 -f claude` during execution
- `claude_jobs.status='QUEUED'`, `lease_until=NULL`, 0 rows in `sprint_task_executions WHERE sprint_job_id=<id>`
- `docker exec <worker> ls /home/agent/.scrum4me-agent-worktrees/`

Not in scope
- `run-agent.sh`: separate PR on `scrum4me-docker` (prevents the UNHEALTHY cascade at capacity-peak moments, even with this fix)

Rollback
Revert the commit and run the `update_mcp_worker` flow again → `rollbackClaim` falls back to pre-fix behaviour (UPDATE-only). No schema changes, no migrations, no data impact.

Pre-fix `rollbackClaim` only reset the claude_jobs row to QUEUED. The worktree at `/home/agent/.scrum4me-agent-worktrees/<jobId>` and (for SPRINT_IMPLEMENTATION) the just-created `sprint_task_executions` rows were left in place. Any retry then hit:
- `Error: Worktree path already exists. Call removeWorktreeForJob first.`
- `Unique constraint failed on (sprint_job_id, task_id)` from createMany

= permanent stuck job after ANY transient failure.

Live observed 2026-05-27 ~13:05 CEST: SPRINT job claimed, claude returned `API Error: Overloaded` (Anthropic 529 capacity), exit 1, rollback fired, then 5 retry attempts all failed on stale-state, both worker replicas marked UNHEALTHY → manual cleanup required.

Fix:
- Look up job's kind + product_id BEFORE the UPDATE
- After UPDATE (now also clears lease_until):
  - SPRINT_IMPLEMENTATION → deleteMany sprint_task_executions
  - Any kind → resolveRepoRoot + removeWorktreeForJob (already exists, does `git worktree remove --force` so bare-repo registration is cleaned too, plus branch -D for unpushed branches)
- Best-effort: cleanup errors logged via claimLog but never block the rollback itself
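The ordering and best-effort rule described above can be sketched roughly as follows. This is not the code from `src/tools/wait-for-job.ts`: `RollbackDeps`, `bestEffort`, `describeError` and `rollbackClaimSketch` are hypothetical names, and the dependencies are injected as in-memory fakes so the sketch runs without a database or git checkout. Guarding the lookup so that a failed lookup still lets the requeue run is an assumption of the sketch, not something the commit states.

```typescript
type JobKind = "SPRINT_IMPLEMENTATION" | "OTHER";
type JobInfo = { kind: JobKind; productId: string };

// Hypothetical dependency surface; each member stands in for a real call.
interface RollbackDeps {
  findJob(jobId: string): Promise<JobInfo | null>; // kind + product_id lookup
  requeue(jobId: string): Promise<void>; // status = QUEUED, lease_until = NULL
  deleteExecutions(jobId: string): Promise<void>; // deleteMany on sprint_task_executions
  removeWorktree(productId: string, jobId: string): Promise<void>; // resolveRepoRoot + removeWorktreeForJob
  log(message: string): void; // stands in for claimLog
}

function describeError(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

// Runs one cleanup step; logs and swallows any error so later steps still run.
async function bestEffort(
  label: string,
  log: (message: string) => void,
  step: () => Promise<void>,
): Promise<boolean> {
  try {
    await step();
    return true;
  } catch (err) {
    log(`${label} failed: ${describeError(err)}`);
    return false;
  }
}

// Returns the labels of the cleanup steps that failed.
async function rollbackClaimSketch(deps: RollbackDeps, jobId: string): Promise<string[]> {
  const failed: string[] = [];

  // Look up kind + product_id before the UPDATE (guarded: assumption of this sketch).
  let job: JobInfo | null = null;
  try {
    job = await deps.findJob(jobId);
  } catch (err) {
    deps.log(`lookup failed: ${describeError(err)}`);
  }

  // The requeue is the one step that is not best-effort: its errors propagate.
  await deps.requeue(jobId);
  if (job === null) return failed;

  const { kind, productId } = job;
  if (kind === "SPRINT_IMPLEMENTATION") {
    const ok = await bestEffort("deleteExecutions", deps.log, () => deps.deleteExecutions(jobId));
    if (!ok) failed.push("deleteExecutions");
  }
  const ok = await bestEffort("removeWorktree", deps.log, () => deps.removeWorktree(productId, jobId));
  if (!ok) failed.push("removeWorktree");
  return failed;
}

// Usage with in-memory fakes: the worktree removal fails, the rollback still completes.
const calls: string[] = [];
const logs: string[] = [];
const fakeDeps: RollbackDeps = {
  findJob: async () => ({ kind: "SPRINT_IMPLEMENTATION", productId: "product-1" }),
  requeue: async () => { calls.push("requeue"); },
  deleteExecutions: async () => { calls.push("deleteExecutions"); },
  removeWorktree: async () => { calls.push("removeWorktree"); throw new Error("worktree busy"); },
  log: (message) => { logs.push(message); },
};
const demo: Promise<string[]> = rollbackClaimSketch(fakeDeps, "job-1");
demo.then((failed) => console.log({ failed, calls, logs }));
```

Returning the list of failed steps instead of throwing keeps the requeue authoritative while still giving the caller something to log or alert on.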