fix(sandbox): reclaim orphaned prewarm lock via pid-liveness check#441
Open
arimxyer wants to merge 1 commit into
Open
fix(sandbox): reclaim orphaned prewarm lock via pid-liveness check#441arimxyer wants to merge 1 commit into
arimxyer wants to merge 1 commit into
Conversation
Contributor
|
@arimxyer is attempting to deploy a commit to the Vercel Team on Vercel. A member of the Team first needs to authorize it. |
Signed-off-by: Ari Mayer <ari111097@gmail.com>
94142b2 to
af25d31
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The bug
The cross-process sandbox-template prewarm lock (
withSandboxTemplatePrewarmLock/waitForSandboxTemplatePrewarmLockinsrc/execution/sandbox/template-prewarm-lock.ts) is a filesystemmkdirlock released only in afinally. If the lock-holding process is killed (SIGINT/SIGKILL, OOM, crash) thefinallynever runs and the lock directory is orphaned. The next process that needs that template then blocks — emittingwaiting for sandbox template prewarm to finish (Ns elapsed)with zero sandbox activity.Two defects compounded this:
owner.jsonrecords the holderpid, but liveness was never checked — a dead owner was indistinguishable from a live one for up to the full stale window.LOCK_TIMEOUT_MS(15 min) was shorter thanSTALE_LOCK_MS(30 min) — a fresh orphan (recent mtime) made a waiter throw at 15 min before the 30-min stale-removal could heal it.The fix
1. Tri-state PID-liveness. When waiting on a held lock, read
owner.jsonand classify the recorded holder asdead/alive/unknown, probing withprocess.kill(pid, 0)(ESRCH→ dead; success orEPERM→ alive):dead→ reclaim the lock immediately rather than waiting out the stale window.alive→ respect the lock and keep polling; the mtime-stale reclaim is not applied (a legitimate prewarm may run longer than the stale window, and reclaiming it would let two prewarms of the same template race). Overall waiting is still bounded byLOCK_TIMEOUT_MS.unknown→ fall back to the existing mtime-stale window.Same-host safety. A pid check is only valid when the holder is on the same host.
owner.jsondid not previously record a hostname, so this PR adds ahostnamefield at acquire time and only treats liveness asdead/alivewhen the recorded hostname matchesos.hostname(). The prewarm always runs on the same host as the waiter for the local/docker/microsandbox backends, so same-host pid-liveness is sound (stated in a code comment). Anything we can't verify — missing/unreadableowner.json, missing or mismatched hostname, non-positive pid — isunknownand falls back to mtime, so older locks written without a hostname remain fully backward-compatible. (The one residual false-positive — a live hostname-less local holder running past the stale window — only applies to locks predating the upgrade and is transient.)2. Timeout/stale asymmetry.
STALE_LOCK_MSis lowered to 10 min, strictly belowLOCK_TIMEOUT_MS(15 min), with a documented invariant. Because a waiter only ever waits on a lock that already exists (so the lock's mtime is at-or-before the waiter's start),STALE_LOCK_MS < LOCK_TIMEOUT_MSguarantees a freshunknownorphan self-heals via the mtime path before any waiter throws. This window is now correctly scoped to theunknown-liveness case only — same-host dead holders are reclaimed immediately by the pid check, and same-host live holders are never reclaimed by mtime.The change is surgical and backward-compatible; no unrelated code was reformatted.
Test
Added
packages/eve/src/execution/sandbox/template-prewarm-lock.integration.test.ts(vitest). It covers:process.pid, mtime backdated 20 min > the 10-min stale window; waiter stays pending) — this guards the live-holder regression where a long-running prewarm's lock would otherwise be yanked;The dead pid is produced by spawning a child,
SIGKILL-ing it, and awaiting itsexitbefore using its pid. Hostname usesos.hostname().Note on test tier: the lock is inherently filesystem-and-process based, so the test cannot run in eve's hermetic Tier-0 unit tier (the unit guard blocks real
fs/promises). It is therefore an*.integration.test.ts, matching where the repo already puts real-mkdtemptests.Validation
pnpm --filter eve exec vitest run --config vitest.integration.config.ts src/execution/sandbox/template-prewarm-lock.integration.test.ts→ 3 passed.pnpm --filter eve run typecheck→ clean.Closes #432
🤖 Generated with Claude Code