fix(immich): scale people/duplicate sync (txn timeout + bind-param limits) #44

Merged
janpeter merged 1 commit from fix/immich-sync-scale into main 2026-06-06 02:49:30 +02:00

Symptom: the Immich sync buttons return only the generic error "Immich synchronisatie mislukt" ("Immich synchronisation failed"), even though config and CSRF are fine.

Two scale bugs:

  1. The whole job (getPeople, tens of thousands of per-person getPersonStatistics HTTP calls, and every upsert) ran inside one Prisma interactive transaction (5s default timeout) via runWithImmichSyncLock. At this library's size (~28k people fetched, 16,100 synced; 53,278 duplicate groups) it died with "A start cannot be executed on an expired transaction (5000 ms, 14858 ms passed)".
  2. Once that was fixed, deleting stale rows via deleteMany({ <id>: { notIn: [...] } }) exceeded the database bind-parameter limit at 53k+ ids.

Fix:

  • Split each sync into collect* (Immich HTTP, outside any transaction, with concurrency-limited person-statistics) and persist* (DB writes only). The advisory-lock transaction now wraps only the writes; its timeout is raised to match the write volume. HTTP latency no longer counts against it.
  • Replace the top-level notIn stale deletes with synced_at < runTimestamp (every upserted row is stamped with the run timestamp), which scales without a huge parameter list. Per-group asset cleanup stays scoped.
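
The collect/persist split described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `mapWithConcurrency`, `collectPeople`, `syncPeople`, the `runInTransaction` callback (standing in for runWithImmichSyncLock, whose real signature is not shown here), the row shapes, and the default concurrency of 8 are all assumptions made for the example.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  // Each runner claims the next unprocessed index until the list is exhausted.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Assumed shapes, for illustration only.
interface Person { id: string; name: string }
interface PersonWithStats extends Person { assetCount: number }

// Phase 1 (collect): HTTP only, no transaction is open while this runs.
async function collectPeople(
  getPeople: () => Promise<Person[]>,
  getPersonStatistics: (id: string) => Promise<{ assets: number }>,
  concurrency = 8, // assumed default
): Promise<PersonWithStats[]> {
  const people = await getPeople();
  return mapWithConcurrency(people, concurrency, async (person) => {
    const stats = await getPersonStatistics(person.id);
    return { ...person, assetCount: stats.assets };
  });
}

// Phase 2 (persist): only the writes run inside the locked transaction.
async function syncPeople(
  collect: () => Promise<PersonWithStats[]>,
  runInTransaction: (write: () => Promise<void>) => Promise<void>,
  upsert: (row: PersonWithStats) => Promise<void>,
): Promise<number> {
  const rows = await collect(); // slow network work finishes before the lock is taken
  await runInTransaction(async () => {
    for (const row of rows) {
      await upsert(row);
    }
  });
  return rows.length;
}
```

With this shape, the transaction timeout only has to cover the upsert loop, so it can be sized from the row count rather than from Immich's response times.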

Verified against live Immich: people 16,100 synced (~19s); duplicates 53,278 groups / 125,388 assets (~4m23s); npm run build, lint, and the rewritten Immich unit tests all green. The full suite is unchanged: only the pre-existing DB-integration tests fail, and none of the failures are Immich tests.
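
At 53,278 groups, a notIn list needs one bound parameter per id, while the timestamp sweep that replaces it needs one in total. The sketch below models that sweep against an in-memory `Map` so that it runs on its own. `SyncedRow`, `persistWithSweep`, and the `syncedAt` field name are assumptions for illustration, not the project's actual schema or code.

```typescript
// Assumed row shape: every row records the timestamp of the run that last wrote it.
interface SyncedRow { id: string; name: string; syncedAt: Date }

// In-memory stand-in for the table. Against a real database this would be the
// upserts followed by a single delete filtered on the timestamp column.
function persistWithSweep(
  table: Map<string, SyncedRow>,
  incoming: ReadonlyArray<{ id: string; name: string }>,
  runTimestamp: Date,
): number {
  // Stamp every upserted row with this run's timestamp.
  for (const row of incoming) {
    table.set(row.id, { id: row.id, name: row.name, syncedAt: runTimestamp });
  }
  // Sweep: anything not touched by this run still carries an older stamp.
  let deleted = 0;
  table.forEach((row, id) => {
    if (row.syncedAt.getTime() < runTimestamp.getTime()) {
      table.delete(id);
      deleted++;
    }
  });
  return deleted;
}
```

In Prisma terms the sweep would correspond to something like `deleteMany({ where: { syncedAt: { lt: runTimestamp } } })`, again assuming that field name. The approach relies on each run using a timestamp strictly later than the previous run's, and on the advisory lock keeping two runs from interleaving.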

Follow-up worth considering (not in this PR): the duplicates write still takes ~4m synchronously behind the button. That is fine functionally, but a background/job-queue trigger would be better UX at this scale.

fix(immich): scale people/duplicate sync beyond txn timeout + param limits
Some checks failed
CI / docker-build (pull_request) Failing after 3s
d9c8cad793
The sync ran the whole job (getPeople, tens of thousands of per-person
getPersonStatistics HTTP calls, and every upsert) inside ONE Prisma interactive
transaction (5s default), so at scale it died with 'A start cannot be executed
on an expired transaction'. A second limit then surfaced: deleting stale rows
via deleteMany({ <id>: { notIn: [...] } }) exceeds the DB bind-parameter limit
when Immich has tens of thousands of people / duplicate groups.

- Split each sync into collect* (Immich HTTP, OUTSIDE any transaction,
  concurrency-limited person statistics) and persist* (DB writes only). The
  advisory-lock transaction now wraps only the writes, with a raised timeout.
- Replace top-level notIn stale-row deletes with 'synced_at < runTimestamp'
  (every upserted row is stamped with the run's syncedAt) — scales without a
  huge parameter list. Per-group asset cleanup stays scoped (small).

Verified against live Immich: people 16,100 (~19s); duplicates 53,278 groups /
125,388 assets (~4m23s); build + lint + immich unit tests green.
Reference
janpeter/Media-Organizer!44