fix(wholesale-feed-sync): harden lifecycle recovery #2102

Merged
bokelley merged 4 commits into adcontextprotocol:main from sangilish:fix/wholesale-feed-sync-lifecycle-2094 on Jun 2, 2026
Conversation

@sangilish
Contributor

Summary

Refs #2094.

This is the first, stability-only slice of the WholesaleFeedSync follow-up stack. It does not add public API surface.

What changed

  • Added a lifecycle epoch guard so stop() invalidates in-flight bootstraps and background loops before they can commit mirror state.
  • Made bootstrap commits atomic for feed indexes and version/cache metadata, so stopped or failed mixed-feed refreshes do not leave partially advanced tokens.
  • Capability-refresh upgrades now restart through the internal lifecycle path.
  • Added bounded version-mismatch recovery retries with a clear failure when the affected feed version does not advance.
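
The epoch guard in the first bullet can be illustrated with a minimal sketch. This is not the code from sync.ts: the class `MiniSync`, the `fetchItems` callback, and the `index` field are hypothetical, and only the idea (stop() bumps an epoch, async work re-checks it before committing) comes from the description above.

```typescript
// Hypothetical, simplified model of an epoch-guarded bootstrap.
// Every name here is illustrative; none is taken from sync.ts.
type Fetcher = () => Promise<string[]>;

class MiniSync {
  private lifecycleEpoch = 0;
  index: string[] = [];

  constructor(private readonly fetchItems: Fetcher) {}

  private isLifecycleCurrent(epoch: number): boolean {
    return epoch === this.lifecycleEpoch;
  }

  // Invalidates every in-flight operation that captured an older epoch.
  stop(): void {
    this.lifecycleEpoch++;
  }

  // Resolves to true only if the fetched items were committed.
  async bootstrap(): Promise<boolean> {
    const epoch = this.lifecycleEpoch; // capture at entry
    const items = await this.fetchItems();
    if (!this.isLifecycleCurrent(epoch)) return false; // stopped mid-flight
    this.index = items; // synchronous commit right after the check
    return true;
  }
}
```

Because the check and the assignment run in the same synchronous stretch, a stop() call cannot land between them; it can only land while the fetch is awaited, where the next check catches it.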

Testing

Passed:

  • npm run build:lib
  • NODE_ENV=test node --test-timeout=60000 --test-force-exit --test test/lib/wholesale-feed-sync.test.js
  • npm run typecheck
  • npx prettier --check src/lib/wholesale-feed-sync/sync.ts test/lib/wholesale-feed-sync.test.js .changeset/wholesale-feed-sync-lifecycle.md
  • git diff --check
  • NODE_ENV=test node --test-timeout=120000 --test-force-exit --test test/lib/custom-headers.test.js

Additional broader check:

  • npm run test:lib:fast did not complete under its default 60s per-file timeout because test/lib/custom-headers.test.js timed out at 60000ms. The same file passes when run directly with a 120000ms timeout. The WholesaleFeedSync tests passed in the broader run before that timeout.

@sangilish sangilish marked this pull request as ready for review May 29, 2026 12:41
@bokelley
Contributor

bokelley commented Jun 2, 2026

Pushed one follow-up commit, fix(wholesale-feed-sync): restart after stopped bootstrap.

While reviewing the lifecycle recovery patch, I found one remaining start/stop race: if stop() invalidated an in-flight initial bootstrap, a subsequent start() could still reuse the retired startPromise instead of creating a fresh lifecycle. The new test covers start(); stop(); start() while the first bootstrap is blocked and verifies only the second lifecycle publishes indexes.
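
A rough sketch of the start(); stop(); start() scenario follows. It borrows the names startPromise, startPromiseEpoch, and lifecycleEpoch from the discussion, but the class `MiniLifecycle`, the `load` callback, and the `published` array are hypothetical, and the real start()/startInner() split in sync.ts may differ.

```typescript
// Hypothetical sketch: start() reuses an in-flight promise only when that
// promise was created in the current lifecycle epoch.
class MiniLifecycle {
  private lifecycleEpoch = 0;
  private startPromise: Promise<void> | null = null;
  private startPromiseEpoch = -1;
  private runs = 0;
  readonly published: string[] = [];

  // `load` stands in for the network bootstrap; it receives a run counter.
  constructor(private readonly load: (run: number) => Promise<string>) {}

  stop(): void {
    this.lifecycleEpoch++; // retires any in-flight start
  }

  start(): Promise<void> {
    if (this.startPromise && this.startPromiseEpoch === this.lifecycleEpoch) {
      return this.startPromise; // same lifecycle: dedupe
    }
    const epoch = this.lifecycleEpoch;
    const promise = this.startInner(epoch).finally(() => {
      // Strict equality: a retired lifecycle must not clear a newer registration.
      if (this.startPromise === promise) this.startPromise = null;
    });
    this.startPromise = promise;
    this.startPromiseEpoch = epoch;
    return promise;
  }

  private async startInner(epoch: number): Promise<void> {
    const result = await this.load(++this.runs);
    if (epoch !== this.lifecycleEpoch) return; // retired: publish nothing
    this.published.push(result);
  }
}
```

With the first load blocked, calling start(), stop(), start() yields two distinct promises, and only the second run publishes.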

Local verification:

  • npm run typecheck
  • npm run build:lib
  • NODE_ENV=test node --test-timeout=60000 --test-force-exit --test test/lib/wholesale-feed-sync.test.js
  • npx prettier --check src/lib/wholesale-feed-sync/sync.ts test/lib/wholesale-feed-sync.test.js .changeset/wholesale-feed-sync-lifecycle.md && git diff --check

@bokelley
Contributor

bokelley commented Jun 2, 2026

Acknowledged — the start(); stop(); start() race looks like the last gap in the lifecycle epoch guard. The new test pinning the second lifecycle as the only one that publishes indexes is exactly the right coverage shape for this pattern. No further action needed on this thread.


Generated by Claude Code

bokelley
bokelley previously approved these changes Jun 2, 2026
Contributor

@bokelley bokelley left a comment

Reviewed the lifecycle recovery changes and pushed one follow-up regression fix for the stop/start-in-flight bootstrap race. Local targeted validation passed, and all GitHub CI lanes except the still-running Argus code_review job are green.

@aao-ipr-bot aao-ipr-bot Bot left a comment

Almost there. One replay-safety regression to close before this ships — the bounded-retry path introduces a webhook hot-loop that the sibling bulk_change branch already knows how to avoid.

Blocker

recoverFromVersionMismatch throw escapes without markWebhookProcessed. src/lib/wholesale-feed-sync/sync.ts:387-389 (the throw new Error(...) after VERSION_MISMATCH_RECOVERY_ATTEMPTS exhaustion). Both call sites in applyWebhook use the same shape:

if (!(await this.recoverFromVersionMismatch(event, epoch))) return;
await this.markWebhookProcessed(dedupeKey, eventDedupeKey);
return;

When the helper throws, control unwinds past markWebhookProcessed. The dedupe key is never recorded. The seller's webhook delivery system retries → re-enters applyWebhook → same throw. A single seller stuck on a stale wholesale_feed_version hot-loops webhook delivery and spams error events. Pre-PR this couldn't happen (the helper had no logical-condition throw path; only transport errors could escape it).

The fix is one floor up — the wholesale_feed.bulk_change branch in the same function already shows the right shape: wrap the recovery call in try { ... } catch (err) { await this.markWebhookProcessed(...); this.rememberLastWebhookEventId(event.event_id); throw err; }. Apply that envelope to both recoverFromVersionMismatch call sites and the contract is consistent. Alternative: have the helper return a sentinel and let applyWebhook decide whether to mark + throw, also fine.
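
The "mark processed, then rethrow" envelope could look roughly like the sketch below. It is a simplified stand-in rather than the bulk_change branch itself: the `Receiver` class, the injected `recover` callback, and the in-memory Set are assumptions made for illustration.

```typescript
// Hypothetical sketch of the envelope: record the dedupe key before the
// error escapes, so a redelivered webhook does not re-run the recovery.
class Receiver {
  readonly processed = new Set<string>();
  recoverCalls = 0;

  // `recover` stands in for recoverFromVersionMismatch; false means cancelled.
  constructor(private readonly recover: () => Promise<boolean>) {}

  private async markWebhookProcessed(dedupeKey: string): Promise<void> {
    this.processed.add(dedupeKey);
  }

  async applyWebhook(dedupeKey: string): Promise<void> {
    if (this.processed.has(dedupeKey)) return; // redelivery is a no-op
    try {
      this.recoverCalls++;
      if (!(await this.recover())) return; // cancelled: leave the key unmarked
    } catch (err) {
      await this.markWebhookProcessed(dedupeKey);
      throw err; // still surface the failure to the caller
    }
    await this.markWebhookProcessed(dedupeKey);
  }
}
```

The first delivery still rejects, so the adopter sees the error once; the retry finds the key recorded and returns without probing again.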

Things I checked

  • Changeset present, patch is the right shape — pure internal hardening, no exported surface moved.
  • Atomic metadata commit (bootstrapProducts/Signals accumulate into FeedMetadata, commit via commitProductMetadata/commitSignalMetadata only after both feeds succeed) is spec-aligned. ad-tech-protocol-expert verdict: sound-with-caveats — the AdCP spec is silent on intra-bootstrap commit granularity but the buyer-side (cache_scope, wholesale_feed_version) cache-key contract only makes sense if version and mirror move together.
  • Witness-not-translator check: mergeFeedMetadata (sync.ts:1107-1121) only overwrites when the seller emits the right primitive/enum; the cacheScope: 'public' field-init defaults (sync.ts:81, 84) are local cache keys, never echoed to the seller. No fabrication.
  • start() epoch race the diff might suggest doesn't exist: startInner()'s body runs synchronously up to the first await, so this.stop() (epoch++) commits before control returns to start(), and this.startPromiseEpoch = this.lifecycleEpoch captures the post-stop value.
  • The atomic-commit test (stop during a mixed bootstrap…) is load-bearing: without the fix, productIndex would hold [p2] after stop and if_wholesale_feed_version would advance to products-v2. The test catches both.

Follow-ups (non-blocking — file as issues)

  • Backoff is performative. VERSION_MISMATCH_RECOVERY_BACKOFF_MS = 5, scaled linearly by attempt index, gives 5 ms + 10 ms = 15 ms total across all retries. That is below any realistic seller-side replication window. Either document this as "fail-fast after 3 immediate retries" or raise to a real propagation budget (500/1000 ms) and consider jitter. The changeset prose says "bounded version-mismatch repair retries", but at 15 ms it's effectively three back-to-back conditional reads.
  • bootstrap() catch swallows on stale epoch. if (!this.isLifecycleCurrent(epoch)) return false; before setState('error')/throw means a refresh() cancelled mid-fetch resolves with no signal. Adopters who await sync.refresh() and expect a throw on cancel see a silent no-op. Either note this in the changeset or surface a typed RefreshCancelled discriminator.
  • applyWebhook doesn't re-check epoch before tail markWebhookProcessed. If stop() is called during a recovery's awaits, the dedupe key still records under a stale lifecycle. Not catastrophic — the next bootstrap reconciles anyway — but "stop cancels in-flight work" is partially false for the dedupe side-effect.
  • Test waitFor(predicate) budget. 100 × 5 ms = 500 ms total. Tight for noisy CI. The PR description itself notes test/lib/custom-headers.test.js blowing past 60 s in batch but passing at 120 s standalone — that's existing isolation rot (likely undisposed HTTP keepalive agents) being papered over by --test-force-exit. Worth a separate issue with --trace-warnings.
  • Brittle assertion calls.getProducts.length === 4 in the version-mismatch test. Pin to >= VERSION_MISMATCH_RECOVERY_ATTEMPTS + 1 or export the constant — counting bootstraps couples the test to the implementation, not the invariant.
  • capabilityRefreshLoop bypasses start(). Calls await this.startInner() directly, leaving startPromise/startPromiseEpoch untouched. A concurrent external sync.start() will see startPromise === null and trigger a second startInner(). Wasted work, not incorrect.
  • Missing test: stop() during the recoverFromVersionMismatch sleep. The isLifecycleCurrent(epoch) guard at sync.ts:384 exists but is untested.
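
The backoff arithmetic in the first follow-up can be checked with a small helper. Both functions are hypothetical. The sketch assumes sleeps happen only between attempts (so three attempts means two sleeps) and that sleep k lasts base × k, which is one reading consistent with the 15 ms total cited above.

```typescript
// Hypothetical helper: total sleep time for linear backoff between attempts.
function linearBackoffBudgetMs(baseMs: number, attempts: number): number {
  let total = 0;
  for (let k = 1; k < attempts; k++) total += baseMs * k;
  return total;
}

// Hypothetical "full jitter" variant: uniform in [0, baseMs * attempt).
// `random` is injectable so the result can be made deterministic in tests.
function jitteredDelayMs(
  baseMs: number,
  attempt: number,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * baseMs * attempt);
}

linearBackoffBudgetMs(5, 3); // → 15
linearBackoffBudgetMs(500, 3); // → 1500
```

Under the same assumptions, a 500 ms base gives a 1.5 s budget, which is in the range the reviewer calls a real propagation budget.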

Close the blocker and I'll re-review.

@aao-ipr-bot aao-ipr-bot Bot left a comment

Atomic version-token commit and lifecycle-epoch cancellation are the right shape — the pre-PR per-page advance of productWholesaleFeedVersion was a real soundness bug (advance the opaque token mid-enumeration, get correctly told unchanged: true next probe, never recover). Holding metadata in BootstrapFeedResult.metadata and committing late after both fetches + lifecycle check is the correct preservation of the (cache_scope, wholesale_feed_version) → enumerated set invariant. Witness-not-translator preserved; wire shape unchanged.
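
As an illustration of the late-commit idea, here is a minimal sketch. `FeedResult` only loosely echoes BootstrapFeedResult; its fields, the `MirrorState` shape, and the `bootstrapBoth` function are assumptions, not the types in sync.ts.

```typescript
// Hypothetical sketch of build-then-commit across two feeds.
interface FeedResult {
  cancelled: boolean;
  items: string[];
  version: string;
}

interface MirrorState {
  productIndex: string[];
  productVersion: string;
  signalIndex: string[];
  signalVersion: string;
}

async function bootstrapBoth(
  state: MirrorState,
  fetchProducts: () => Promise<FeedResult>,
  fetchSignals: () => Promise<FeedResult>,
  isCurrent: () => boolean,
): Promise<boolean> {
  const products = await fetchProducts();
  if (products.cancelled || !isCurrent()) return false;
  const signals = await fetchSignals();
  if (signals.cancelled || !isCurrent()) return false;

  // No await below this line: indexes and version tokens move together.
  state.productIndex = products.items;
  state.productVersion = products.version;
  state.signalIndex = signals.items;
  state.signalVersion = signals.version;
  return true;
}
```

If the signal fetch is cancelled, the already-fetched product result is discarded, so the product version token never advances past the index it describes.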

The bounded-retry recovery is where I want to push back before approving.

Main question — recoverFromVersionMismatch throws on stale/duplicate webhooks

src/lib/wholesale-feed-sync/sync.ts:377-390 throws when afterVersion === beforeVersion after 3 attempts. That condition is exactly "the seller agrees our cached version is current" — i.e. the SDK is at-or-ahead of webhook.previous_wholesale_feed_version. That is the normal stale/duplicate delivery case, not a fault.

Concretely, the new test at test/lib/wholesale-feed-sync.test.js:611-647 sets up: cache at v5, webhook claims previous: v4, version: v6, seller stays at v5, SDK throws. But that's "SDK polled ahead of the webhook" or "webhook redelivered after a newer one was applied." At-least-once delivery makes this routine.

Worse: applyWebhook doesn't catch the throw. It propagates without calling markWebhookProcessed (compare the adjacent bulk_change block at sync.ts:327-336, which catches, marks processed, then rethrows). So the adopter's receiver responds with an error, the seller redelivers, and the SDK throws again: a poison-message loop until backoff caps drop the delivery. The UUID-v7 dedupe path at sync.ts:278 already routes obvious duplicates here, so this fires on the common case.

Two ways to flip to approve:

  1. Short-circuit when the SDK is ahead or in sync. If webhook.previous_wholesale_feed_version doesn't match currentVersion AND the SDK's cached version was set from a real wholesale fetch or a later webhook, treat it as stale: markWebhookProcessed + return. Spec says previous_wholesale_feed_version is advisory ("Receivers MAY use this to detect obvious gaps, but MUST NOT require it"), so throwing on it is a stricter contract than the spec requires.
  2. Or: mirror the bulk_change catch pattern. Wrap the call site, mark processed on the exhaustion throw, emit an error event, and return. Adopter receivers acknowledge instead of looping.

Things I checked

  • Atomic commit point in bootstrap(): post-fetch lifecycle re-check at the diff's L214, then commitProductMetadata/commitSignalMetadata at L217 / L225. Correct.
  • start() dedup with self-stop() inside startInner(): traced. stop() is sync and runs before the first await inside startInner, so the outer this.startPromiseEpoch = this.lifecycleEpoch reads the post-bump epoch. The "start after stop during an in-flight bootstrap" test at test/lib/wholesale-feed-sync.test.js:577-602 exercises this and passes. Dedup is correct.
  • Single epoch check in applyWebhook after hasProcessedWebhook. Sufficient: subsequent mutations are synchronous (applyEvent, rememberWebhookVersion) and can't interleave with a stop().
  • capabilityRefreshLoop switching from stop(); await start() to startInner() directly — avoids a redundant epoch bump and startPromise round-trip. Sound.
  • Changeset is patch, no public-API surface change (verified src/lib/wholesale-feed-sync/index.ts re-exports unchanged).
  • custom-headers.test.js timeout flagged in the PR body is pre-existing flake — no shared state with wholesale-feed-sync.
  • ad-tech-protocol-expert: "sound-with-caveats — atomic-commit tightens an under-specified gap correctly; bounded-retry throws on what spec treats as advisory."
  • code-reviewer: Major on the poison-message loop (cites sync.ts:303-306, 322-325, 377-390), agrees on the fix shape.

Follow-ups (non-blocking, file as issues)

  • Snapshot-stomp race on commitProductMetadata/commitSignalMetadata. The metadata is snapshotted at the start of bootstrapProducts. If applyWebhook advances productWholesaleFeedVersion mid-bootstrap via rememberWebhookVersion, the late commit writes the older snapshot back. Race existed pre-PR (per-page writes were equally racy), but centralizing the commit makes the window more deterministic. Worth a comment documenting that applyWebhook shouldn't interleave with bootstrap, or a version-compare on commit.
  • Bootstrap cancellation only fires between pages. bootstrapProducts/bootstrapSignals check isLifecycleCurrent(epoch) after each client.getProducts/getSignals returns. A multi-page wholesale fetch can fully drain before noticing a stop(). Acceptable; worth a comment so it doesn't look like a missed check.
  • Changeset wording. Mention the new error class adopter receivers may now see: WholesaleFeedSync: version mismatch recovery did not advance ... after 3 attempts. Receivers that catch all applyWebhook rejections today will now surface this where they previously saw success.

Minor nits (non-blocking)

  1. Backoff base of 5ms. VERSION_MISMATCH_RECOVERY_BACKOFF_MS = 5 yields 5ms + 10ms before the final attempt. Reads like a test-fixture constant that leaked into production. If the intent is "absorb a tight seller-side write-vs-read race," that's fine; if the intent is "give the seller time to catch up," 5ms is too aggressive.

Holding for the throw-on-stale question. Drop a comment with intent or push the short-circuit and I'll re-review.

@bokelley
Contributor

bokelley commented Jun 2, 2026

Addressed the Argus blocker in fix(wholesale-feed-sync): acknowledge stale version webhooks.

Change made:

  • recoverFromVersionMismatch now treats exhausted no-advance conditional reads as an acknowledged stale/duplicate delivery instead of throwing, so the existing applyWebhook call-site path marks the webhook processed and avoids seller retry poison loops.
  • Updated the stale mismatch regression to verify bounded repair probes occur once and the repeated delivery is deduped without another probe.
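
The behavior described in the first bullet might be sketched as follows. The function `recoverBounded`, its outcome labels, and the injected `probeVersion`, `isCurrent`, and `sleep` callbacks are all hypothetical; the defaults of 3 attempts and a 5 ms base are taken from the review discussion, not from the merged code.

```typescript
// Hypothetical sketch: bounded recovery that acknowledges a stale delivery
// instead of throwing when the version never advances.
type RecoveryOutcome = "advanced" | "stale-acknowledged" | "cancelled";

async function recoverBounded(
  beforeVersion: string,
  probeVersion: () => Promise<string>, // stands in for a conditional read
  isCurrent: () => boolean, // stands in for the lifecycle epoch check
  sleep: (ms: number) => Promise<void>,
  attempts = 3,
  backoffMs = 5,
): Promise<RecoveryOutcome> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    const afterVersion = await probeVersion();
    if (!isCurrent()) return "cancelled";
    if (afterVersion !== beforeVersion) return "advanced";
    if (attempt < attempts) {
      await sleep(backoffMs * attempt); // linear backoff between attempts
      if (!isCurrent()) return "cancelled";
    }
  }
  // No advance after every attempt: report a stale or duplicate delivery so
  // the caller can mark the webhook processed rather than fail it.
  return "stale-acknowledged";
}
```

A caller would mark the webhook processed for both "advanced" and "stale-acknowledged", and skip marking only for "cancelled".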

Local verification:

  • npm run typecheck
  • npm run build:lib
  • NODE_ENV=test node --test-timeout=60000 --test-force-exit --test test/lib/wholesale-feed-sync.test.js
  • pre-push validation passed

@bokelley
Contributor

bokelley commented Jun 2, 2026

Acknowledged — recoverFromVersionMismatch treating exhausted no-advance reads as acknowledged stale/duplicate delivery is the right call: it closes the retry poison loop without requiring the caller to distinguish a true version advance from a stale replay. The updated regression verifying a single bounded probe + dedup on repeated delivery gives good coverage of the invariant.


Generated by Claude Code

@aao-ipr-bot aao-ipr-bot Bot left a comment

LGTM. Build-then-commit is the right shape — bootstrapProducts / bootstrapSignals returning a BootstrapFeedResult<T> and the outer bootstrap deferring commitProductMetadata / commitSignalMetadata until after the final epoch re-check eliminates the partial-advance bug class that motivated this stack.

Things I checked

  • Epoch cancellation completeness. Read src/lib/wholesale-feed-sync/sync.ts end to end and traced every async path. The lifecycleEpoch++ in stop() is captured at the entry of every async path (start, startInner, bootstrap, bootstrap{Products,Signals}, recoverFrom{VersionMismatch,BulkChange}, probeLoop, capabilityRefreshLoop, resolveMode, applyWebhook, refresh) and re-checked at every await resume before any commit. Index/metadata writes only happen on the synchronous run from the last isLifecycleCurrent(epoch) check, so V8's single-threaded execution closes the TOCTOU window.
  • Atomic commit. bootstrap (sync.ts:443-486) holds both productResult and signalResult until after the final isLifecycleCurrent(epoch) check, then commits product metadata + index and signal metadata + index synchronously with no intervening await. A stop fired during bootstrapSignals produces cancelled: true, the outer bootstrap early-returns false, and productResult.metadata is never committed. The pre-PR partial-advance hole (products' version token mutated mid-flight while signals still in flight) is closed.
  • startPromiseEpoch race. Walked the (start → external stop → start) sequence. The strict-equality if (this.startPromise === promise) in the .finally correctly prevents a stale lifecycle from clearing a fresh registration. startPromiseEpoch === lifecycleEpoch at the top of start() correctly forces a fresh startInner after a stop. Sound.
  • setState('error') in bootstrap catch. Guarded by if (!this.isLifecycleCurrent(epoch)) return false BEFORE setState('error') (sync.ts:487-491). Synchronous between check and call — no interleaving possible.
  • stop() transition out of 'bootstrapping'. Pre-PR stop() only transitioned from 'syncing'; left 'bootstrapping' sticky. Post-PR (sync.ts:200-209) handles both. Right fix.
  • Tests. Four lifecycle tests + the new "version mismatch recovery dedupes stale deliveries" test (test/lib/wholesale-feed-sync.test.js). The "stop during mixed bootstrap" test (lines ~280-340) is the strongest witness — phase = 'verify' asserts the next conditional fetch sends if_wholesale_feed_version: 'products-v1', proving the version token was NOT advanced when the refresh was stopped mid-flight. That's the bug.
  • Changeset. .changeset/wholesale-feed-sync-lifecycle.md is patch — correct. Internal-only stability fix, no public API change, no wire-shape change.

Follow-ups (non-blocking — file as issues)

  • Silent ack on version-mismatch exhaustion (sync.ts:583-600). Both reviewers flagged this independently. Returning true after 3 conditional-fetch attempts that don't advance the version token is the right default for spec-compliant sellers replaying a stale webhook — but it's also reachable when the seller's read replica genuinely lags behind its webhook bus, in which case the SDK silently drifts. Emit a typed version_mismatch_unresolved event (or fire errorHandler with a structured error) before the final return true so adopters can wire alerting without changing the ack behavior. The comment already names the assumption — the missing piece is making the divergence observable.
  • VERSION_MISMATCH_RECOVERY_BACKOFF_MS = 5 (sync.ts:35-36). 5ms + 10ms = 15ms total budget against three back-to-back conditional fetches. That's in-process-stub territory. Real seller-side replication lag through a CDN or regional read replica is 100ms-1s. Make both VERSION_MISMATCH_RECOVERY_ATTEMPTS and the backoff configurable via WholesaleFeedSyncConfig (mirroring probeIntervalMs), and pick production-realistic defaults (~100ms base with jitter). As-is the exhaustion path above will fire against any seller that isn't an in-memory stub.
  • capabilityRefreshLoop bypasses the startPromise mutex (sync.ts:594-611). The mode-change branch calls await this.startInner() directly, which means a user-initiated sync.start() arriving mid-mode-flip will see startPromise === null and fire a second concurrent startInner. Epoch checks prevent torn state — the older startInner bails at its first isLifecycleCurrent after the inner stop() — but two resolveMode/bootstrap round-trips fire concurrently for a short window. Wasteful, and observable as duplicated mode_resolved / bootstrap emissions. Either factor out a helper that sets startPromise/startPromiseEpoch the same way start() does, or have the loop call through start().
  • applyWebhook post-recovery metadata writes are unprotected (sync.ts:316-325). After await this.markWebhookProcessed(...) on the fast path, this._lastEventAt = new Date() / this._lastSyncedAt = new Date() / rememberLastWebhookEventId run with no lifecycle re-check. If a stop fires during the markWebhookProcessed await, these write under a stale epoch. Not catastrophic — no index mutation — but the only remaining "writes under stale epoch" surface on the fast path. Cheap to fix: add an isLifecycleCurrent(epoch) re-check before the metadata writes.
  • refresh() racing startInner's bootstrap (sync.ts:235-238 + sync.ts:443). Adopter calls start() (in-flight, doing bootstrap), then synchronously calls refresh() from an event handler before start() resolves. Both bootstraps carry the same epoch and both pass internal checks. The later commit wins, but setState flips and _lastSyncedAt updates non-monotonically. Mitigation: serialize via a separate bootstrapPromise mutex, or document the constraint in the JSDoc on refresh() and start().
  • processedWebhookKeys persistence across reset() (sync.ts:215-228, pre-existing). reset() doesn't clear processedWebhookKeys / processedWebhookEventKeys / lastWebhookEventId. With the new exhaustion-acks-the-replay path, a stale webhook seen in lifecycle 1 will be deduped in lifecycle 2 even after a reset(). Likely intentional (sticky dedup is a feature) — add a comment confirming the design, or clear them in reset() if it isn't.
  • Test name overreach (test/lib/wholesale-feed-sync.test.js:'capability refresh restarts through the internal lifecycle on mode changes'). The test asserts the happy-path post-condition (mode === 'auto-poll', getProducts >= 2) but doesn't witness the re-entrancy hazard the change is preventing. Either tighten it (interleave a refresh() call mid-flip and assert no torn state) or rename to match what it actually proves.
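
For the refresh()-versus-start() follow-up, one possible shape for the suggested bootstrap mutex is a promise chain that runs tasks one at a time. `SerialQueue` is a hypothetical helper, not something in sync.ts, and whether serialization is preferable to documenting the constraint is left open by the review.

```typescript
// Hypothetical serializer: runs async tasks one at a time, in call order.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    // `tail` never rejects (see below), so `task` always runs after it settles.
    const next = this.tail.then(task);
    // Swallow rejections on the chain only; callers still see them via `next`.
    this.tail = next.catch(() => undefined);
    return next;
  }
}
```

Routing both the initial bootstrap and refresh() through one such queue would make the second bootstrap wait for the first, so state transitions and timestamps would be written in call order.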

Minor nits (non-blocking)

  1. Empty-body response wipes the index (sync.ts:506, pre-existing). if (!body) return { cancelled: false, unchanged: false, items: into (empty), metadata }; — the outer bootstrap then commits the empty items to productIndex because unchanged is false. Pre-PR did the same — not a regression — but worth a separate ticket. The right shape is probably unchanged: true (or a distinct noBody sentinel) so the previous index survives a malformed seller response.
  2. bootstrap catch swallows the original error on stale epoch (sync.ts:487-491). When isLifecycleCurrent(epoch) is false, the catch returns false without calling this.errorHandler?.(error) or emitting. The error is lost. Probably correct (the lifecycle is dead, nobody is listening) but worth either a comment confirming the intent or a still-fire-the-handler tweak.
  3. Wasted round-trip between bootstrapProducts and bootstrapSignals (sync.ts:454-462). If stop() fires between the two awaits, bootstrapSignals issues a get_signals before its first isLifecycleCurrent check rejects the result. Cheap mitigation: add if (!this.isLifecycleCurrent(epoch)) return false; between the two blocks, or check at the top of each bootstrap{Products,Signals} before the first request.

Approving on the strength of the atomic-commit refactor plus the epoch-cancellation discipline. None of the follow-ups block this slice — they're either configurability (backoff), observability (silent ack), or post-merge tightening (mutex on bootstrap, lifecycle re-check before fast-path metadata writes). Ship it and file the silent-ack issue first — that's the one most likely to bite a real adopter.

@bokelley bokelley merged commit 6453cee into adcontextprotocol:main Jun 2, 2026
30 checks passed
2 participants