fix(wholesale-feed-sync): harden lifecycle recovery#2102
Conversation
|
Pushed one follow-up commit, While reviewing the lifecycle recovery patch, I found one remaining start/stop race: if Local verification:
|
|
Acknowledged — the Generated by Claude Code |
bokelley
left a comment
There was a problem hiding this comment.
Reviewed the lifecycle recovery changes and pushed one follow-up regression fix for the stop/start-in-flight bootstrap race. Local targeted validation passed, and all GitHub CI lanes except the still-running Argus code_review job are green.
There was a problem hiding this comment.
Almost there. One replay-safety regression to close before this ships — the bounded-retry path introduces a webhook hot-loop that the sibling bulk_change branch already knows how to avoid.
Blocker
recoverFromVersionMismatch throw escapes without markWebhookProcessed. src/lib/wholesale-feed-sync/sync.ts:387-389 (the throw new Error(...) after VERSION_MISMATCH_RECOVERY_ATTEMPTS exhaustion). Both call sites in applyWebhook use the same shape:
if (!(await this.recoverFromVersionMismatch(event, epoch))) return;
await this.markWebhookProcessed(dedupeKey, eventDedupeKey);
return;When the helper throws, control unwinds past markWebhookProcessed. The dedupe key is never recorded. The seller's webhook delivery system retries → re-enters applyWebhook → same throw. A single seller stuck on a stale wholesale_feed_version hot-loops webhook delivery and spams error events. Pre-PR this couldn't happen (the helper had no logical-condition throw path; only transport errors could escape it).
The fix is one floor up — the wholesale_feed.bulk_change branch in the same function already shows the right shape: wrap the recovery call in try { ... } catch (err) { await this.markWebhookProcessed(...); this.rememberLastWebhookEventId(event.event_id); throw err; }. Apply that envelope to both recoverFromVersionMismatch call sites and the contract is consistent. Alternative: have the helper return a sentinel and let applyWebhook decide whether to mark + throw, also fine.
Things I checked
- Changeset present,
patchis the right shape — pure internal hardening, no exported surface moved. - Atomic metadata commit (
bootstrapProducts/Signalsaccumulate intoFeedMetadata, commit viacommitProductMetadata/commitSignalMetadataonly after both feeds succeed) is spec-aligned.ad-tech-protocol-expertverdict:sound-with-caveats— the AdCP spec is silent on intra-bootstrap commit granularity but the buyer-side(cache_scope, wholesale_feed_version)cache-key contract only makes sense if version and mirror move together. - Witness-not-translator check:
mergeFeedMetadata(sync.ts:1107-1121) only overwrites when the seller emits the right primitive/enum; thecacheScope: 'public'field-init defaults (sync.ts:81, 84) are local cache keys, never echoed to the seller. No fabrication. start()epoch race the diff might suggest doesn't exist:startInner()'s body runs synchronously up to the firstawait, sothis.stop()(epoch++) commits before control returns tostart(), andthis.startPromiseEpoch = this.lifecycleEpochcaptures the post-stop value.- Atomic-commit test (
stop during a mixed bootstrap…) load-bearing — without the fix,productIndexwould hold[p2]after stop andif_wholesale_feed_versionwould advance toproducts-v2. The test catches both.
Follow-ups (non-blocking — file as issues)
- Backoff is performative.
VERSION_MISMATCH_RECOVERY_BACKOFF_MS = 5× linear attempt index = 15 ms total across all retries. Below any realistic seller-side replication window. Either document this as "fail-fast after 3 immediate retries" or raise to a real propagation budget (500/1000 ms) and consider jitter. The changeset prose says "bounded version-mismatch repair retries" — at 15 ms it's effectively three back-to-back conditional reads. bootstrap()catch swallows on stale epoch.if (!this.isLifecycleCurrent(epoch)) return false;beforesetState('error')/throwmeans arefresh()cancelled mid-fetch resolves with no signal. Adopters whoawait sync.refresh()and expect a throw on cancel see a silent no-op. Either note this in the changeset or surface a typedRefreshCancelleddiscriminator.applyWebhookdoesn't re-check epoch before tailmarkWebhookProcessed. Ifstop()is called during a recovery's awaits, the dedupe key still records under a stale lifecycle. Not catastrophic — the next bootstrap reconciles anyway — but "stop cancels in-flight work" is partially false for the dedupe side-effect.- Test
waitFor(predicate)budget. 100 × 5 ms = 500 ms total. Tight for noisy CI. The PR description itself notestest/lib/custom-headers.test.jsblowing past 60 s in batch but passing at 120 s standalone — that's existing isolation rot (likely undisposed HTTP keepalive agents) being papered over by--test-force-exit. Worth a separate issue with--trace-warnings. - Brittle assertion
calls.getProducts.length === 4in the version-mismatch test. Pin to>= VERSION_MISMATCH_RECOVERY_ATTEMPTS + 1or export the constant — counting bootstraps couples the test to the implementation, not the invariant. capabilityRefreshLoopbypassesstart(). Callsawait this.startInner()directly, leavingstartPromise/startPromiseEpochuntouched. A concurrent externalsync.start()will seestartPromise === nulland trigger a secondstartInner(). Wasted work, not incorrect.- Missing test:
stop()during therecoverFromVersionMismatchsleep. TheisLifecycleCurrent(epoch)guard at sync.ts:384 exists but is untested.
Close the blocker and I'll re-review.
…nc-lifecycle-2094
There was a problem hiding this comment.
Atomic version-token commit and lifecycle-epoch cancellation are the right shape — the pre-PR per-page advance of productWholesaleFeedVersion was a real soundness bug (advance the opaque token mid-enumeration, get correctly told unchanged: true next probe, never recover). Holding metadata in BootstrapFeedResult.metadata and committing late after both fetches + lifecycle check is the correct preservation of the (cache_scope, wholesale_feed_version) → enumerated set invariant. Witness-not-translator preserved; wire shape unchanged.
The bounded-retry recovery is where I want to push back before approving.
Main question — recoverFromVersionMismatch throws on stale/duplicate webhooks
src/lib/wholesale-feed-sync/sync.ts:377-390 throws when afterVersion === beforeVersion after 3 attempts. That condition is exactly "the seller agrees our cached version is current" — i.e. the SDK is at-or-ahead of webhook.previous_wholesale_feed_version. That is the normal stale/duplicate delivery case, not a fault.
Concretely, the new test at test/lib/wholesale-feed-sync.test.js:611-647 sets up: cache at v5, webhook claims previous: v4, version: v6, seller stays at v5, SDK throws. But that's "SDK polled ahead of the webhook" or "webhook redelivered after a newer one was applied." At-least-once delivery makes this routine.
Worse: applyWebhook doesn't catch the throw — it propagates without calling markWebhookProcessed (compare the adjacent bulk_change block at sync.ts:327-336, which catches, marks processed, then rethrows). So the seller's receiver responds with an error, the seller redelivers, the SDK throws again → poison-message loop until backoff caps drop the delivery. The UUID-v7 dedupe path at sync.ts:278 already routes obvious duplicates here, so this fires on the common case.
Two ways to flip to approve:
- Short-circuit when the SDK is ahead or in sync. If
webhook.previous_wholesale_feed_versiondoesn't matchcurrentVersionAND the SDK's cached version was set from a real wholesale fetch or a later webhook, treat it as stale:markWebhookProcessed+ return. Spec saysprevious_wholesale_feed_versionis advisory ("Receivers MAY use this to detect obvious gaps, but MUST NOT require it"), so throwing on it is a stricter contract than the spec requires. - Or: mirror the
bulk_changecatch pattern. Wrap the call site, mark processed on the exhaustion throw, emit anerrorevent, and return. Adopter receivers acknowledge instead of looping.
Things I checked
- Atomic commit point in
bootstrap(): post-fetch lifecycle re-check at the diff's L214, thencommitProductMetadata/commitSignalMetadataat L217 / L225. Correct. start()dedup with self-stop()insidestartInner(): traced.stop()is sync and runs before the firstawaitinsidestartInner, so the outerthis.startPromiseEpoch = this.lifecycleEpochreads the post-bump epoch. The "start after stop during an in-flight bootstrap" test attest/lib/wholesale-feed-sync.test.js:577-602exercises this and passes. Dedup is correct.- Single epoch check in
applyWebhookafterhasProcessedWebhook. Sufficient: subsequent mutations are synchronous (applyEvent,rememberWebhookVersion) and can't interleave with astop(). capabilityRefreshLoopswitching fromstop(); await start()tostartInner()directly — avoids a redundant epoch bump andstartPromiseround-trip. Sound.- Changeset is
patch, no public-API surface change (verifiedsrc/lib/wholesale-feed-sync/index.tsre-exports unchanged). custom-headers.test.jstimeout flagged in the PR body is pre-existing flake — no shared state withwholesale-feed-sync.ad-tech-protocol-expert: "sound-with-caveats — atomic-commit tightens an under-specified gap correctly; bounded-retry throws on what spec treats as advisory."code-reviewer: Major on the poison-message loop (citessync.ts:303-306, 322-325, 377-390), agrees on the fix shape.
Follow-ups (non-blocking, file as issues)
- Snapshot-stomp race on
commitProductMetadata/commitSignalMetadata. The metadata is snapshotted at the start ofbootstrapProducts. IfapplyWebhookadvancesproductWholesaleFeedVersionmid-bootstrap viarememberWebhookVersion, the late commit writes the older snapshot back. Race existed pre-PR (per-page writes were equally racy), but centralizing the commit makes the window more deterministic. Worth a comment documenting thatapplyWebhookshouldn't interleave withbootstrap, or a version-compare on commit. - Bootstrap cancellation only fires between pages.
bootstrapProducts/bootstrapSignalscheckisLifecycleCurrent(epoch)after eachclient.getProducts/getSignalsreturns. A multi-page wholesale fetch can fully drain before noticing astop(). Acceptable; worth a comment so it doesn't look like a missed check. - Changeset wording. Mention the new error class adopter receivers may now see:
WholesaleFeedSync: version mismatch recovery did not advance ... after 3 attempts. Receivers that catch allapplyWebhookrejections today will now surface this where they previously saw success.
Minor nits (non-blocking)
- Backoff base of 5ms.
VERSION_MISMATCH_RECOVERY_BACKOFF_MS = 5yields 5ms + 10ms before the final attempt. Reads like a test-fixture constant that leaked into production. If the intent is "absorb a tight seller-side write-vs-read race," that's fine; if the intent is "give the seller time to catch up," 5ms is too aggressive.
Holding for the throw-on-stale question. Drop a comment with intent or push the short-circuit and I'll re-review.
|
Addressed the Argus blocker in Change made:
Local verification:
|
|
Acknowledged — Generated by Claude Code |
There was a problem hiding this comment.
LGTM. Build-then-commit is the right shape — bootstrapProducts / bootstrapSignals returning a BootstrapFeedResult<T> and the outer bootstrap deferring commitProductMetadata / commitSignalMetadata until after the final epoch re-check eliminates the partial-advance bug class that motivated this stack.
Things I checked
- Epoch cancellation completeness. Read
src/lib/wholesale-feed-sync/sync.tsend to end and traced every async path. ThelifecycleEpoch++instop()is captured at the entry of every async path (start,startInner,bootstrap,bootstrap{Products,Signals},recoverFrom{VersionMismatch,BulkChange},probeLoop,capabilityRefreshLoop,resolveMode,applyWebhook,refresh) and re-checked at every await resume before any commit. Index/metadata writes only happen on the synchronous run from the lastisLifecycleCurrent(epoch)check, so V8's single-threaded execution closes the TOCTOU window. - Atomic commit.
bootstrap(sync.ts:443-486) holds bothproductResultandsignalResultuntil after the finalisLifecycleCurrent(epoch)check, then commits product metadata + index and signal metadata + index synchronously with no intervening await. A stop fired duringbootstrapSignalsproducescancelled: true, the outerbootstrapearly-returns false, andproductResult.metadatais never committed. The pre-PR partial-advance hole (products' version token mutated mid-flight while signals still in flight) is closed. startPromiseEpochrace. Walked the (start → external stop → start) sequence. The strict-equalityif (this.startPromise === promise)in the.finallycorrectly prevents a stale lifecycle from clearing a fresh registration.startPromiseEpoch === lifecycleEpochat the top ofstart()correctly forces a freshstartInnerafter a stop. Sound.setState('error')inbootstrapcatch. Guarded byif (!this.isLifecycleCurrent(epoch)) return falseBEFOREsetState('error')(sync.ts:487-491). Synchronous between check and call — no interleaving possible.stop()transition out of'bootstrapping'. Pre-PRstop()only transitioned from'syncing'; left'bootstrapping'sticky. Post-PR (sync.ts:200-209) handles both. Right fix.- Tests. Four lifecycle tests + the new "version mismatch recovery dedupes stale deliveries" test (
test/lib/wholesale-feed-sync.test.js). The "stop during mixed bootstrap" test (lines ~280-340) is the strongest witness —phase = 'verify'asserts the next conditional fetch sendsif_wholesale_feed_version: 'products-v1', proving the version token was NOT advanced when the refresh was stopped mid-flight. That's the bug. - Changeset.
.changeset/wholesale-feed-sync-lifecycle.mdispatch— correct. Internal-only stability fix, no public API change, no wire-shape change.
Follow-ups (non-blocking — file as issues)
- Silent ack on version-mismatch exhaustion (sync.ts:583-600). Both reviewers flagged this independently. Returning
trueafter 3 conditional-fetch attempts that don't advance the version token is the right default for spec-compliant sellers replaying a stale webhook — but it's also reachable when the seller's read replica genuinely lags behind its webhook bus, in which case the SDK silently drifts. Emit a typedversion_mismatch_unresolvedevent (or fireerrorHandlerwith a structured error) before the finalreturn trueso adopters can wire alerting without changing the ack behavior. The comment already names the assumption — the missing piece is making the divergence observable. VERSION_MISMATCH_RECOVERY_BACKOFF_MS = 5(sync.ts:35-36). 5ms + 10ms = 15ms total budget against three back-to-back conditional fetches. That's in-process-stub territory. Real seller-side replication lag through a CDN or regional read replica is 100ms-1s. Make bothVERSION_MISMATCH_RECOVERY_ATTEMPTSand the backoff configurable viaWholesaleFeedSyncConfig(mirroringprobeIntervalMs), and pick production-realistic defaults (~100ms base with jitter). As-is the exhaustion path above will fire against any seller that isn't an in-memory stub.capabilityRefreshLoopbypasses thestartPromisemutex (sync.ts:594-611). The mode-change branch callsawait this.startInner()directly, which means a user-initiatedsync.start()arriving mid-mode-flip will seestartPromise === nulland fire a second concurrentstartInner. Epoch checks prevent torn state — the olderstartInnerbails at its firstisLifecycleCurrentafter the innerstop()— but tworesolveMode/bootstrapround-trips fire concurrently for a short window. Wasteful, and observable as duplicatedmode_resolved/bootstrapemissions. Either factor out a helper that setsstartPromise/startPromiseEpochthe same waystart()does, or have the loop call throughstart().applyWebhookpost-recovery metadata writes are unprotected (sync.ts:316-325). Afterawait this.markWebhookProcessed(...)on the fast path,this._lastEventAt = new Date()/this._lastSyncedAt = new Date()/rememberLastWebhookEventIdrun with no lifecycle re-check. If a stop fires during themarkWebhookProcessedawait, these write under a stale epoch. Not catastrophic — no index mutation — but the only remaining "writes under stale epoch" surface on the fast path. Cheap to fix: add anisLifecycleCurrent(epoch)re-check before the metadata writes.refresh()racingstartInner'sbootstrap(sync.ts:235-238 + sync.ts:443). Adopter callsstart()(in-flight, doingbootstrap), then synchronously callsrefresh()from an event handler beforestart()resolves. Both bootstraps carry the same epoch and both pass internal checks. The later commit wins, butsetStateflips and_lastSyncedAtupdates non-monotonically. Mitigation: serialize via a separatebootstrapPromisemutex, or document the constraint in the JSDoc onrefresh()andstart().processedWebhookKeyspersistence acrossreset()(sync.ts:215-228, pre-existing).reset()doesn't clearprocessedWebhookKeys/processedWebhookEventKeys/lastWebhookEventId. With the new exhaustion-acks-the-replay path, a stale webhook seen in lifecycle 1 will be deduped in lifecycle 2 even after areset(). Likely intentional (sticky dedup is a feature) — add a comment confirming the design, or clear them inreset()if it isn't.- Test name overreach (
test/lib/wholesale-feed-sync.test.js:'capability refresh restarts through the internal lifecycle on mode changes'). The test asserts the happy-path post-condition (mode === 'auto-poll',getProducts >= 2) but doesn't witness the re-entrancy hazard the change is preventing. Either tighten it (interleave arefresh()call mid-flip and assert no torn state) or rename to match what it actually proves.
Minor nits (non-blocking)
- Empty-body response wipes the index (sync.ts:506, pre-existing).
if (!body) return { cancelled: false, unchanged: false, items: into (empty), metadata };— the outerbootstrapthen commits the emptyitemstoproductIndexbecauseunchangedis false. Pre-PR did the same — not a regression — but worth a separate ticket. The right shape is probablyunchanged: true(or a distinctnoBodysentinel) so the previous index survives a malformed seller response. bootstrapcatch swallows the original error on stale epoch (sync.ts:487-491). WhenisLifecycleCurrent(epoch)is false, the catch returns false without callingthis.errorHandler?.(error)or emitting. The error is lost. Probably correct (the lifecycle is dead, nobody is listening) but worth either a comment confirming the intent or a still-fire-the-handler tweak.- Wasted round-trip between
bootstrapProductsandbootstrapSignals(sync.ts:454-462). Ifstop()fires between the two awaits,bootstrapSignalsissues aget_signalsbefore its firstisLifecycleCurrentcheck rejects the result. Cheap mitigation: addif (!this.isLifecycleCurrent(epoch)) return false;between the two blocks, or check at the top of eachbootstrap{Products,Signals}before the first request.
Approving on the strength of the atomic-commit refactor plus the epoch-cancellation discipline. None of the follow-ups block this slice — they're either configurability (backoff), observability (silent ack), or post-merge tightening (mutex on bootstrap, lifecycle re-check before fast-path metadata writes). Ship it and file the silent-ack issue first — that's the one most likely to bite a real adopter.
Summary
Refs #2094.
This is the first, stability-only slice of the WholesaleFeedSync follow-up stack. It does not add public API surface.
What changed
stop()invalidates in-flight bootstraps and background loops before they can commit mirror state.Testing
Passed:
npm run build:libNODE_ENV=test node --test-timeout=60000 --test-force-exit --test test/lib/wholesale-feed-sync.test.jsnpm run typechecknpx prettier --check src/lib/wholesale-feed-sync/sync.ts test/lib/wholesale-feed-sync.test.js .changeset/wholesale-feed-sync-lifecycle.mdgit diff --checkNODE_ENV=test node --test-timeout=120000 --test-force-exit --test test/lib/custom-headers.test.jsAdditional broader check:
npm run test:lib:fastdid not complete under its default 60s per-file timeout becausetest/lib/custom-headers.test.jstimed out at 60000ms. The same file passes when run directly with a 120000ms timeout. The WholesaleFeedSync tests passed in the broader run before that timeout.