
fix(sandbox): reclaim orphaned prewarm lock via pid-liveness check #441

Open
arimxyer wants to merge 1 commit into vercel:main from arimxyer:fix/sandbox-prewarm-lock-pid-liveness


arimxyer commented Jun 30, 2026


The bug

The cross-process sandbox-template prewarm lock (withSandboxTemplatePrewarmLock / waitForSandboxTemplatePrewarmLock in src/execution/sandbox/template-prewarm-lock.ts) is a filesystem mkdir lock that is released only in a finally block. If the lock-holding process is killed (SIGINT/SIGKILL, OOM, crash), the finally block never runs and the lock directory is orphaned. The next process that needs that template then blocks, emitting "waiting for sandbox template prewarm to finish (Ns elapsed)" with zero sandbox activity.
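
To make the failure mode concrete, here is a minimal sketch of a mkdir-based lock released in a finally block. It is not the repository's implementation: withDirLockSync is a hypothetical, synchronous stand-in for the real async wrapper, and the owner.json contents are reduced to a pid.

```typescript
import { existsSync, mkdirSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: a synchronous stand-in for the real async lock wrapper.
function withDirLockSync<T>(lockDir: string, fn: () => T): T {
  // Without `recursive`, mkdir throws EEXIST when the directory already exists,
  // so creating the directory doubles as an atomic "acquire".
  mkdirSync(lockDir);
  try {
    writeFileSync(join(lockDir, "owner.json"), JSON.stringify({ pid: process.pid }));
    return fn();
  } finally {
    // Runs on return or throw. If the process is killed first, this never
    // executes and lockDir stays behind as an orphan.
    rmSync(lockDir, { recursive: true, force: true });
  }
}

const demoRoot = mkdtempSync(join(tmpdir(), "prewarm-lock-sketch-"));
const demoLock = join(demoRoot, "template.lock");
const heldDuringWork = withDirLockSync(demoLock, () => existsSync(demoLock));
console.log(heldDuringWork, existsSync(demoLock)); // true false
```

The release depends entirely on the holder reaching its finally block, which is why a killed holder leaves the directory in place for every later waiter.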

Two defects compounded this:

  1. owner.json records the holder pid, but liveness was never checked — a dead owner was indistinguishable from a live one for up to the full stale window.
  2. LOCK_TIMEOUT_MS (15 min) was shorter than STALE_LOCK_MS (30 min) — a fresh orphan (recent mtime) made a waiter throw at 15 min before the 30-min stale-removal could heal it.

The fix

1. Tri-state PID-liveness. When waiting on a held lock, read owner.json and classify the recorded holder as dead / alive / unknown, probing with process.kill(pid, 0) (ESRCH → dead; success or EPERM → alive):

  • dead → reclaim the lock immediately rather than waiting out the stale window.
  • alive → respect the lock and keep polling; the mtime-stale reclaim is not applied (a legitimate prewarm may run longer than the stale window, and reclaiming it would let two prewarms of the same template race). Overall waiting is still bounded by LOCK_TIMEOUT_MS.
  • unknown → fall back to the existing mtime-stale window.
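
The tri-state probe and the waiter's reaction to each state can be sketched as follows. The names probePid, actionFor, and the action labels are illustrative assumptions, not identifiers from the PR; only process.kill(pid, 0) and the ESRCH/EPERM mapping come from the description above.

```typescript
type Liveness = "dead" | "alive" | "unknown";
type WaiterAction = "reclaim-now" | "keep-polling" | "check-mtime";

// Hypothetical probe: maps the outcome of signal 0 onto the three states.
function probePid(pid: number): Liveness {
  // pid 0 and negative pids address process groups on POSIX, so never probe them.
  if (!Number.isInteger(pid) || pid <= 0) return "unknown";
  try {
    process.kill(pid, 0); // signal 0 only checks existence and permission
    return "alive";
  } catch (err) {
    const code = (err as { code?: string }).code;
    if (code === "ESRCH") return "dead"; // no such process
    if (code === "EPERM") return "alive"; // exists, but owned by another user
    return "unknown";
  }
}

// Hypothetical mapping from liveness to what the waiter does next.
function actionFor(liveness: Liveness): WaiterAction {
  switch (liveness) {
    case "dead":
      return "reclaim-now";
    case "alive":
      return "keep-polling";
    case "unknown":
      return "check-mtime";
  }
}

console.log(probePid(process.pid), actionFor(probePid(process.pid))); // alive keep-polling
```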

Same-host safety. A pid check is only valid when the holder is on the same host. owner.json did not previously record a hostname, so this PR adds a hostname field at acquire time and only treats liveness as dead/alive when the recorded hostname matches os.hostname(). The prewarm always runs on the same host as the waiter for the local/docker/microsandbox backends, so same-host pid-liveness is sound (stated in a code comment). Anything we can't verify — missing/unreadable owner.json, missing or mismatched hostname, non-positive pid — is unknown and falls back to mtime, so older locks written without a hostname remain fully backward-compatible. (The one residual false-positive — a live hostname-less local holder running past the stale window — only applies to locks predating the upgrade and is transient.)
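
A sketch of the same-host gate, assuming owner.json is a JSON object with pid and hostname fields (any other fields are ignored here). classifyOwner and its parameters are hypothetical; the pid probe is passed in so that the gating logic can be read on its own.

```typescript
type Liveness = "dead" | "alive" | "unknown";

// Assumed shape of owner.json; only these two fields are inspected.
interface OwnerRecord {
  pid?: unknown;
  hostname?: unknown;
}

// Hypothetical classifier: anything that cannot be verified is "unknown".
function classifyOwner(
  rawOwnerJson: string | undefined, // undefined models a missing or unreadable file
  localHostname: string,
  probe: (pid: number) => Liveness,
): Liveness {
  if (rawOwnerJson === undefined) return "unknown";
  let owner: OwnerRecord;
  try {
    const parsed: unknown = JSON.parse(rawOwnerJson);
    if (typeof parsed !== "object" || parsed === null) return "unknown";
    owner = parsed as OwnerRecord;
  } catch {
    return "unknown"; // corrupt or partially written file
  }
  // A pid is only meaningful on the host that recorded it.
  if (typeof owner.hostname !== "string" || owner.hostname !== localHostname) return "unknown";
  if (typeof owner.pid !== "number" || !Number.isInteger(owner.pid) || owner.pid <= 0) return "unknown";
  return probe(owner.pid);
}

const alwaysDead = (): Liveness => "dead";
console.log(classifyOwner(JSON.stringify({ pid: 4242, hostname: "box-a" }), "box-a", alwaysDead)); // dead
console.log(classifyOwner(JSON.stringify({ pid: 4242 }), "box-a", alwaysDead)); // unknown
```

The second call models a lock written before the hostname field existed: even though the probe would report the pid as dead, the record is not trusted and the mtime window applies.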

2. Timeout/stale asymmetry. STALE_LOCK_MS is lowered to 10 min, strictly below LOCK_TIMEOUT_MS (15 min), with a documented invariant. Because a waiter only ever waits on a lock that already exists (so the lock's mtime is at-or-before the waiter's start), STALE_LOCK_MS < LOCK_TIMEOUT_MS guarantees a fresh unknown orphan self-heals via the mtime path before any waiter throws. This window is now correctly scoped to the unknown-liveness case only — same-host dead holders are reclaimed immediately by the pid check, and same-host live holders are never reclaimed by mtime.
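
The invariant can be checked with a small worked example. The constant values are the ones quoted in this description; orphanHealsBeforeTimeout is a hypothetical helper, and the sketch ignores the polling interval, which in practice adds a small delay before the waiter notices the lock has gone stale.

```typescript
const MINUTE_MS = 60_000;
const STALE_LOCK_MS = 10 * MINUTE_MS; // new value
const LOCK_TIMEOUT_MS = 15 * MINUTE_MS; // unchanged
const OLD_STALE_LOCK_MS = 30 * MINUTE_MS; // value before this PR

// Hypothetical helper: does an unknown-liveness orphan become reclaimable
// before the waiter gives up?
function orphanHealsBeforeTimeout(
  lockMtimeMs: number,
  waiterStartMs: number,
  staleMs: number,
  timeoutMs: number,
): boolean {
  const staleAt = lockMtimeMs + staleMs; // first moment the mtime path may reclaim
  const timeoutAt = waiterStartMs + timeoutMs; // moment the waiter throws
  return staleAt < timeoutAt;
}

// Worst case for the waiter: the orphan was created at the instant it started waiting.
console.log(orphanHealsBeforeTimeout(0, 0, STALE_LOCK_MS, LOCK_TIMEOUT_MS)); // true
console.log(orphanHealsBeforeTimeout(0, 0, OLD_STALE_LOCK_MS, LOCK_TIMEOUT_MS)); // false
```

Because the lock's mtime can never be later than the waiter's start, the first case is the worst one, so the check holds for every older lock as well.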

The change is surgical and backward-compatible; no unrelated code was reformatted.

Test

Added packages/eve/src/execution/sandbox/template-prewarm-lock.integration.test.ts (vitest). It covers:

  • a fresh lock whose same-host owner pid is dead is reclaimed quickly (asserts the waiter returns in < 5 s and the lock dir is gone — it does not block to the stale window);
  • a same-host owner that is alive is respected even when its mtime is stale (owner = process.pid, mtime backdated 20 min > the 10-min stale window; waiter stays pending) — this guards the live-holder regression where a long-running prewarm's lock would otherwise be yanked;
  • backward-compat: a dead owner recorded without a hostname is not reclaimed via the pid path and falls back to the mtime window (waiter stays pending).
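
For the stale-mtime case, a test needs a lock directory that looks older than the stale window. One way to arrange that, shown here as a sketch rather than the test's actual code, is to backdate the directory with utimesSync. backdateMtime and isMtimeStale are hypothetical helpers.

```typescript
import { mkdirSync, mkdtempSync, rmSync, statSync, utimesSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const STALE_LOCK_MS = 10 * 60_000; // 10-minute stale window from this PR

// Hypothetical helper: push atime and mtime into the past by ageMs.
function backdateMtime(path: string, ageMs: number): void {
  const past = new Date(Date.now() - ageMs);
  utimesSync(path, past, past);
}

// Hypothetical helper: the mtime-based staleness check in isolation.
function isMtimeStale(path: string, staleMs: number, nowMs: number = Date.now()): boolean {
  return nowMs - statSync(path).mtimeMs > staleMs;
}

const root = mkdtempSync(join(tmpdir(), "prewarm-mtime-sketch-"));
const lockDir = join(root, "template.lock");
mkdirSync(lockDir);
const freshIsStale = isMtimeStale(lockDir, STALE_LOCK_MS);
backdateMtime(lockDir, 20 * 60_000); // 20 min, beyond the 10-min window
const backdatedIsStale = isMtimeStale(lockDir, STALE_LOCK_MS);
console.log(freshIsStale, backdatedIsStale); // false true
rmSync(root, { recursive: true, force: true });
```

Creating or deleting entries inside the directory updates its mtime again, so in a test like this owner.json would be written before the backdating step.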

The dead pid is produced by spawning a child, SIGKILL-ing it, and awaiting its exit before using its pid. Hostname uses os.hostname().
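
A sketch of how such a dead pid might be obtained. spawnDeadPid is a hypothetical helper and the child command is an assumption; the essential step is waiting for the exit event, after which the child has been reaped and its pid no longer refers to a process (barring pid reuse by the OS).

```typescript
import { spawn } from "node:child_process";

// Hypothetical helper: resolves with the pid of a child that has already exited.
function spawnDeadPid(): Promise<number> {
  return new Promise<number>((resolve, reject) => {
    // A child that would otherwise idle forever.
    const child = spawn(process.execPath, ["-e", "setInterval(() => {}, 1000)"], {
      stdio: "ignore",
    });
    child.once("error", reject); // spawn failure
    child.once("exit", () => {
      if (child.pid === undefined) reject(new Error("child never started"));
      else resolve(child.pid);
    });
    child.kill("SIGKILL");
  });
}
```

A test would then await spawnDeadPid() and write the returned pid, together with os.hostname(), into the lock's owner.json before starting the waiter.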

Note on test tier: the lock is inherently filesystem-and-process based, so the test cannot run in eve's hermetic Tier-0 unit tier (the unit guard blocks real fs/promises). It is therefore an *.integration.test.ts, matching where the repo already puts real-mkdtemp tests.

Validation

  • Test: pnpm --filter eve exec vitest run --config vitest.integration.config.ts src/execution/sandbox/template-prewarm-lock.integration.test.ts → 3 passed.
  • Typecheck: pnpm --filter eve run typecheck → clean.

Closes #432

🤖 Generated with Claude Code


vercel Bot commented Jun 30, 2026


@arimxyer is attempting to deploy a commit to the Vercel Team on Vercel.

A member of the Team first needs to authorize it.

Signed-off-by: Ari Mayer <ari111097@gmail.com>
arimxyer force-pushed the fix/sandbox-prewarm-lock-pid-liveness branch from 94142b2 to af25d31 on June 30, 2026 18:07
arimxyer marked this pull request as ready for review on June 30, 2026 18:40

Development

Successfully merging this pull request may close these issues.

Sandbox template prewarm lock orphaned by killed process blocks for up to 30 min (no pid-liveness check)
