perf(scan): parallel grep via worker pool with binary-transferable matches #807
Merged
…tches `scan.grepFiles` on 10k files drops from 322ms → 191ms p50 (−41%).

Follow-up to PR #791 (literal prefilter), #804 (simplify prefilter to file gate), #805 (walker hot-path rewrite). The prior PRs left the walker at 231ms and the overall pipeline at ~1200ms for uncapped scans — the gap was dominated by per-file async orchestration in `mapFilesConcurrent` rather than actual work. Prototypes showed that running the per-file work synchronously inside a small pool of workers (4 on this box) collapses the orchestration cost without changing the public API.

## Architecture

### `src/lib/scan/worker-pool.ts` (~300 LOC)

Lazy-initialized singleton pool (size `min(8, max(2, availableParallelism()))`). Feature-gated on `isWorkerSupported()` — the runtime must expose `Worker`, `Blob`, and `URL.createObjectURL`. On runtimes without Workers (e.g., Node dist without `--experimental-worker-threads-*` flags), `grepFilesInternal` falls back to the existing `mapFilesConcurrent` path. Runtime escape hatch: `SENTRY_SCAN_DISABLE_WORKERS=1`.

### `src/lib/scan/grep-worker-source.ts`

Worker source inlined as a JS string constant and loaded via `Blob` + `URL.createObjectURL`. **This is required for `bun build --compile` single-file binaries** — filesystem-based `new Worker(url)` hangs at runtime because the URL can't be resolved against the compiled binary's module map. Blob URLs sidestep that entirely. The worker is self-contained, uses `require("node:fs")` CommonJS-style, and has zero imports from local modules.

### `src/lib/scan/grep.ts` integration

`grepFilesInternal` dispatches to one of two sub-generators:

1. `grepViaWorkers` — producer/consumer streaming. The walker runs on the main thread; paths accumulate into batches of 200; each full batch dispatches round-robin to the least-loaded worker; the worker returns matches as `{ ints: Uint32Array, linePool: string }` via `postMessage` with `transfer: [ints.buffer]` (zero-copy). The main thread decodes into `GrepMatch[]` lazily and yields.
2. 
`grepViaAsyncMain` — the existing `mapFilesConcurrent` path, used when workers are unavailable or disabled.

### Binary match encoding

Structured clone of 215k `GrepMatch` objects costs ~200ms on this box. A transferable ArrayBuffer plus a shared `linePool` string drops that to ~2-3ms. Each match is 4 u32s: `[pathIdx, lineNum, lineOffset, lineLength]`. Line text is appended to a single accumulator string per batch; offsets into it are rebuilt into `GrepMatch.line` on the main thread via `linePool.slice(lineOffset, lineOffset + lineLength)`.

## Bugs caught + fixed during implementation

### FIFO queue race (message handler mismatch)

The initial design used `addEventListener` per dispatch. Because Web Workers deliver every `result` message to EVERY attached listener, multiple concurrent dispatches to the same worker would all fire together on the first result — resolving the wrong promise with the wrong batch's data. When a handler shifted state to mark "done", subsequent dispatches would find their handlers already removed.

**Fix:** a single `onmessage` per worker plus a per-worker FIFO `pending` queue of `{resolve, reject}` slots. Each `result` message shifts one slot. Workers process `postMessage` FIFO (both Bun and Node guarantee this inside a single worker), so FIFO-order resolution is sound.

### Consumer wake race (lost notifications)

`wakeConsumer()` fired immediately if `notify` was set, else did nothing. When a batch settled before the consumer entered its `await new Promise(resolve => notify = resolve)` block, the notification was dropped and the consumer hung forever.

**Fix:** added a `notifyPending` flag. `wakeConsumer()` sets it when no consumer is waiting; the consumer checks the flag before awaiting, and the await's executor also checks it to close the window between check and assignment.

### `worker.unref()` deadlock

The initial design called `.unref()` on each worker for clean CLI exit.
But when the main thread's only event-loop work was waiting on a worker result, unref'd workers didn't process messages on idle ticks, deadlocking the pipeline. The symptom was visible in the bench — after the walker finished and all in-flight batches except 2 returned, the last 2 never completed.

**Fix:** removed `.unref()`. The CLI calls `process.exit()` at the end, so worker cleanup isn't needed for process termination. For library embeddings, callers can use `terminatePool()` explicitly.

## Perf (synthetic/large, 10k files, p50)

| Op | Before (PR #805) | After | Δ |
|---|---:|---:|---:|
| `scan.grepFiles` | 322ms | **191ms** | **−41%** |
| `detectAllDsns.cold` | 308ms | 297ms | −4% |
| `scanCodeForDsns` | 313ms | 293ms | −6% |
| `detectAllDsns.warm` | 27.0ms | 27.5ms | — |
| `scan.walk` | 231ms | 263ms | noise |

`scan.grepFiles` is the primary target. `scanCodeForDsns` doesn't use the worker pool (it uses the DSN scanner directly), so the ~20ms improvement there is walker variance.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in `src/lib/formatters/markdown.ts`, unrelated)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5640 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun test test/lib/scan/grep.test.ts` — 38 pass (including `AbortSignal fires mid-iteration`, which exercises signal propagation through the producer/consumer pipeline)
- [x] `bun test test/lib/dsn/code-scanner.test.ts` — 52 pass
- [x] `SENTRY_SCAN_DISABLE_WORKERS=1 bun run bench` — verifies the fallback path still works end-to-end (360ms p50, matches pre-change behavior)
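The binary match encoding described above is small enough to sketch in full. This is a hedged reconstruction (helper names are hypothetical; the real packing lives in the worker and the decoder in `grep.ts`), showing the 4-u32 layout and the shared `linePool` round trip:

```typescript
type GrepMatch = { pathIdx: number; lineNum: number; line: string };

// Pack matches into one transferable Uint32Array plus one accumulator string.
// Layout per match: [pathIdx, lineNum, lineOffset, lineLength].
function encodeMatches(matches: GrepMatch[]): { ints: Uint32Array; linePool: string } {
  const ints = new Uint32Array(matches.length * 4);
  let linePool = "";
  matches.forEach((m, i) => {
    ints[i * 4] = m.pathIdx;
    ints[i * 4 + 1] = m.lineNum;
    ints[i * 4 + 2] = linePool.length; // lineOffset into the pool
    ints[i * 4 + 3] = m.line.length;   // lineLength
    linePool += m.line;
  });
  return { ints, linePool };
}

// Main-thread side: rebuild GrepMatch.line via linePool.slice(...).
function decodeMatches(ints: Uint32Array, linePool: string): GrepMatch[] {
  const out: GrepMatch[] = [];
  for (let i = 0; i < ints.length; i += 4) {
    const lineOffset = ints[i + 2];
    const lineLength = ints[i + 3];
    out.push({
      pathIdx: ints[i],
      lineNum: ints[i + 1],
      line: linePool.slice(lineOffset, lineOffset + lineLength),
    });
  }
  return out;
}
```

Only `ints.buffer` needs to cross the thread boundary in the transfer list; the `linePool` string is cloned once per batch instead of once per match, which is where the structured-clone cost collapses.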
Codecov Results 📊 ✅ 138 passed | Total: 138 | Pass Rate: 100%

✨ No test changes detected — all tests are passing. ❌ Patch coverage is 77.89%. Project has 2016 uncovered lines.

Coverage diff vs base branch:

| | main | #PR | +/- |
|---|---:|---:|---:|
| Coverage | 95.63% | 95.07% | −0.56% |
| Files | 281 | 283 | +2 |
| Lines | 40462 | 40926 | +464 |
| Branches | 0 | 0 | — |
| Hits | 38694 | 38910 | +216 |
| Misses | 1768 | 2016 | +248 |
| Partials | 0 | 0 | — |

Generated by Codecov Action
… dispatch

## DSN scanner unification

The DSN scanner had its own `walkFiles + mapFilesConcurrent + Bun.file().text() + extractDsnsFromContent` pipeline that predated the worker pool from PR #807. It was doing the same fundamental work as `grepFiles` (regex over file content + collect matches) but via a separate path with more orchestration overhead. Rewrote `scanCodeForDsns` to use `grepFiles` directly:

- Pattern: `DSN_PATTERN` — grep's literal prefilter gates files by `http` (matches the old `HTTP_SCHEME_PROBE` fast path exactly).
- Walker options: `dsnScanOptions()` + `recordMtimes: true` + an `onDirectoryVisit` hook for cache-invalidation maps.
- Per-match post-filter on the main thread: comment-line check, host validation, dedup, mtime tracking, package path inference.
- Multi-DSN-per-line handling preserved via `extractDsnsOnLine`, which re-runs the regex on `match.line`.

`scanCodeForFirstDsn` deliberately does NOT route through the worker pool. It's the stopOnFirst fast path (first valid DSN wins, return early). Spawning a 4-worker pool for a workload that often completes after reading 1-2 files would add ~20ms of pool-init cost that dwarfs the work. Instead it uses a direct `walkFiles` loop with a sync-ish per-file read + `extractFirstDsnFromContent`, which already benefits from the `/http/i` literal fast path. This keeps `detectDsn.cold` at ~3.5ms instead of regressing to ~20ms.

## New grep options: `recordMtimes` + `onDirectoryVisit`

To support the DSN cache-invalidation maps, extended `GrepOptions`:

- `recordMtimes: boolean` — when true, each `GrepMatch` includes the source file's floored `mtimeMs` on an optional `mtime` field. The walker already exposes `entry.mtime` via its own `recordMtimes` passthrough; the main thread captures mtimes per-batch before dispatching, and the decoder attaches them indexed by `pathIdx`. The worker doesn't need to stat again (the main thread already did via `statSync`).
- `onDirectoryVisit: (absDir, mtimeMs) => void` — pass-through to the walker's per-directory hook. Used for directory-level cache invalidation ("has a new file been added to this dir?").

Both are opt-in; when unused, `GrepMatch.mtime` is absent (not 0 — this distinguishes "not asked" from "asked and happens to be 0").

## Cursor Bugbot: dead-worker dispatch deadlock

Cursor found a real correctness issue in the worker pool from #807:

> After a worker `error` event, the error handler rejects all
> pending items and resets `pw.inflight` to 0, but the dead worker
> remains in the pool. The least-loaded selection in `dispatch`
> will favor this worker (it appears to have 0 `inflight`), push
> new pending slots, and call `postMessage` — but the dead worker
> can't respond. Those dispatch promises hang forever, so
> `inflightCount` in `grepViaWorkers` never reaches 0, deadlocking
> the consumer's exit condition at
> `walkerDone && inflightCount === 0`.

Added `alive: boolean` to `PooledWorker`, flipped to false on the first `error` event before rejecting pending dispatches. The least-loaded picker now skips dead workers entirely and falls back to `Promise.reject(...)` when every worker is dead. This prevents the reported deadlock and ensures a pool-wide failure surfaces as an error instead of hanging.

## Perf (synthetic/large, 10k files, p50)

| Op | Before unification | After unification | Δ |
|---|---:|---:|---:|
| `scanCodeForDsns` | 293ms | **234ms** | **−20%** |
| `detectAllDsns.cold` | 297ms | **250ms** | **−16%** |
| `detectDsn.cold` | 4.98ms | **3.45ms** | **−31%** |
| `scanCodeForFirstDsn` | 2.07ms | 1.88ms | — |
| `scan.grepFiles` | 191ms | 170ms | − |
| `detectAllDsns.warm` | 27.5ms | 27.2ms | — |

The full-scan paths (`scanCodeForDsns`, `detectAllDsns.cold`) are ~20% faster because they now share the worker-pool infrastructure from PR #807.
The `stopOnFirst` paths (`scanCodeForFirstDsn`, `detectDsn.cold`) are essentially unchanged (or slightly better) because they now use a tighter direct-walker loop.

## Correctness

- `extractDsnsFromContent` is still exported for `src/lib/dsn/detector.ts::isDsnStillPresent` (single-file cache verification) — unchanged.
- `extractFirstDsnFromContent` is still exported for `scanCodeForFirstDsn`'s per-file fast path — unchanged.
- All DSN tests pass, including the tricky "deduplicates same DSN from multiple files" test, which required tracking `fileHadValidDsn` separately from seen-set insertion to preserve the pre-refactor behavior of recording an mtime for every file that contributed a validated DSN (even if the DSN was a duplicate).
- `sourceMtimes` / `dirMtimes` shapes are unchanged — the cache schema in `src/lib/db/dsn-cache.ts` doesn't need to change.
- Minor observability drift: the per-file `log.debug("Cannot read file: …")` no longer fires on individual read errors in the full-scan path (the worker swallows them). The `scanCodeForFirstDsn` path still logs top-level errors. Accepted as a minor regression.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing markdown warning)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5640 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun test test/lib/dsn/code-scanner.test.ts` — 150 pass (including the "deduplicates same DSN from multiple files" regression test that caught the initial buggy implementation)
- [x] `bun test test/lib/scan/grep.test.ts` — 38 pass
- [x] `bun run bench --size large --runs 5 --warmup 2` — `scanCodeForDsns` 234ms p50 (−20% from 293ms)
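The multi-DSN-per-line step described above (re-running the pattern over `match.line`) can be sketched as follows. The pattern here is a simplified stand-in for the real `DSN_PATTERN`, and the helper name mirrors the description rather than the actual source:

```typescript
// Simplified stand-in for DSN_PATTERN: scheme://key@host/projectId.
// Global flag so matchAll can find every DSN on one line.
const SIMPLE_DSN_PATTERN = /https?:\/\/\w+@[\w.-]+\/\d+/g;

// Because grep emits one GrepMatch per matching *line*, a line carrying
// several DSNs needs a second pass to recover all of them.
function extractDsnsOnLine(line: string): string[] {
  return Array.from(line.matchAll(SIMPLE_DSN_PATTERN), (m) => m[0]);
}
```

`String.prototype.matchAll` clones the regex internally, so the shared global pattern carries no `lastIndex` state between lines.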
Replaces `src/lib/scan/grep-worker-source.ts` (a TS module
exporting the worker body as a string template literal) with a
real `src/lib/scan/grep-worker.js` file that's imported via
`with { type: "text" }`. The runtime behavior is unchanged — Bun
and the esbuild plugin both resolve the attribute to the file
contents as a string — but now the worker is a first-class file:
lintable, syntax-checked, formattable, searchable.
## Why
`grep-worker-source.ts` was functional but awkward. The string
inside the template literal wasn't visible to Biome's lint rules,
IDE navigation, or any other tooling that cares about real files.
We caught several lint issues once the body became a real `.js`
file (`noIncrementDecrement`, `useBlockStatements`, `useTemplate`)
— real issues that had been silently skipped.
The blob-URL + raw-text-import approach is the only form that
works in all three of our target environments:
- **`bun run` (dev):** Bun handles `with { type: "text" }` natively.
- **`bun build --compile` (single-file CLI):** the esbuild plugin
inlines the file contents as a string at bundle time; the
compiled binary has no runtime file resolution.
- **`bun bundle` (npm library `dist/index.cjs`):** same esbuild
plugin, same inlined string. When Node consumers import the
library, Workers are feature-gated on `isWorkerSupported()` and
fall back to the async path.
I explored Bun's documented alternatives (`new Worker(new URL(...))`
or `new Worker("./path.ts")`, with the worker as an additional
entrypoint to `bun build --compile`), but every form the docs
suggest either (a) fails at runtime in compiled binaries because
`import.meta.url` resolves to the binary path and URL resolution
doesn't hit the bundled entrypoint, or (b) works but requires the
binary to run with CWD equal to the `bun build` project root,
which is brittle. The Blob-URL + text-import approach has no such
fragility.
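For reference, the `isWorkerSupported()` gate mentioned earlier can be sketched as a plain capability probe. This is a hedged reconstruction, not the actual `worker-pool.ts` code; the structural casts avoid depending on DOM type definitions:

```typescript
// Probe the three globals the blob-URL worker path needs. On Node without
// worker support this returns false and the caller falls back to the
// async mapFilesConcurrent path.
function isWorkerSupported(): boolean {
  const g = globalThis as Record<string, unknown>;
  const url = g.URL as { createObjectURL?: unknown } | undefined;
  return (
    typeof g.Worker === "function" &&
    typeof g.Blob === "function" &&
    typeof url?.createObjectURL === "function"
  );
}
```

When the gate passes, each worker is built from the inlined source via `new Worker(URL.createObjectURL(new Blob([source])))`, which is why all three globals are probed together.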
## Changes
- **New:** `src/lib/scan/grep-worker.js` — real plain-JS file
containing the worker body (moved verbatim from the old string
template; no logic changes). Pass-through of all existing
features: literal gate, line-counter, multiline mode, maxLineLength
truncation, maxMatchesPerFile, transferable-buffer match packing.
- **New:** `script/text-import-plugin.ts` — tiny esbuild plugin
(~20 LOC) that polyfills Bun's `with { type: "text" }` attribute
by reading the file and emitting it as a default-export string.
Shared by `script/build.ts` (CLI binary) and `script/bundle.ts`
(npm library).
- **Modified:** `src/lib/scan/worker-pool.ts` imports the worker
body as `import grepWorkerSource from "./grep-worker.js" with { type: "text" }`.
TypeScript's type system doesn't model the attribute, so we cast
through `unknown` to a string. The runtime Blob constructor
would throw if given a non-string, so the cast is guarded.
- **Modified:** `script/build.ts`, `script/bundle.ts` — add the
plugin to their esbuild plugin arrays.
- **Deleted:** `src/lib/scan/grep-worker-source.ts`.
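The text-import plugin is small enough to sketch in full. This is a hypothetical reconstruction (the filter, the `toTextModule` helper, and the structural option types are assumptions; the real plugin consumes esbuild's own types), showing the shape of an onLoad hook that polyfills `with { type: "text" }`:

```typescript
import { readFileSync } from "node:fs";

// Turn file contents into a module whose default export is that string,
// mirroring Bun's native `with { type: "text" }` behavior.
function toTextModule(text: string): string {
  return `export default ${JSON.stringify(text)};`;
}

const textImportPlugin = {
  name: "text-import",
  setup(build: {
    onLoad: (
      opts: { filter: RegExp },
      cb: (args: { path: string; with?: Record<string, string> }) =>
        | { contents: string; loader: "js" }
        | undefined,
    ) => void;
  }) {
    build.onLoad({ filter: /.*/ }, (args) => {
      // Leave every non-text import to esbuild's default loaders.
      if (args.with?.type !== "text") return undefined;
      return {
        contents: toTextModule(readFileSync(args.path, "utf8")),
        loader: "js",
      };
    });
  },
};
```

The types are written structurally so the sketch doesn't depend on an installed esbuild; the real plugin would import `Plugin` from `esbuild` and rely on its import-attribute support surfacing `args.with`.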
## Test plan
- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing markdown warning).
Biome now lints `grep-worker.js`; several issues surfaced
during the refactor (`noIncrementDecrement`, `useBlockStatements`,
`useTemplate`) were all fixed in the same commit.
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5640 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `SENTRY_CLIENT_ID=placeholder bun run bundle` — library
bundle builds successfully, worker source is embedded in
`dist/index.cjs` as expected.
- [x] `bun run bench --size large --runs 3 --warmup 1` — perf
unchanged (`scan.grepFiles` 161ms, `scanCodeForDsns` 241ms).
Pre-merge self-review found a silent correctness regression in the
unified DSN pipeline from the previous commit: `grepFiles` defaults
to `maxLineLength: 2000`, which truncates `match.line` with a `…`
suffix. The DSN scanner then re-runs `DSN_PATTERN` on that
truncated line to recover all DSNs on the line — but the pattern
ends with `/\d+`, so a DSN near the end of a long line loses its
trailing digits to `…` and fails the regex silently.
Reproduced pre-fix:

```ts
import { writeFileSync } from "node:fs";
import { join } from "node:path";

const filler = "x".repeat(1990);
const dsn = "https://abc@o111222.ingest.us.sentry.io/7654321";
const line = `${filler} const DSN = "${dsn}";`;
writeFileSync(join(tmpDir, "minified.js"), line);

const result = await scanCodeForDsns(tmpDir);
// → result.dsns.length === 0 ❌ (should be 1)
```
Realistic triggers: minified JS bundles (commonly have lines >2000
chars), source-map-embedded config blobs, base64-encoded envs.
Within the walker's 256KB `MAX_FILE_SIZE` cap, a single line
carrying a DSN past column 2000 is plausible.
Fix: pass `maxLineLength: Number.POSITIVE_INFINITY` from the DSN
scanner. Memory is bounded by `MAX_FILE_SIZE` (per-file) and the
`seen` dedup map (per-scan), so there's no blow-up risk.
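The failure mode is easy to demonstrate in isolation. Both names below are illustrative stand-ins, not the real implementations: `SIMPLE_DSN_PATTERN` approximates `DSN_PATTERN` (it likewise ends in `/\d+`), and `truncateLine` mimics the grep cap:

```typescript
// Simplified DSN pattern; the trailing \d+ is what the "…" suffix clobbers.
const SIMPLE_DSN_PATTERN = /https?:\/\/\w+@[\w.-]+\/\d+/;

// Mimic grep's maxLineLength behavior: cap and append "…".
function truncateLine(line: string, maxLineLength: number): string {
  return line.length > maxLineLength
    ? `${line.slice(0, maxLineLength)}…`
    : line;
}

const dsn = "https://abc@o111222.ingest.us.sentry.io/7654321";
const longLine = `${"x".repeat(1990)} const DSN = "${dsn}";`;

// Old default cap of 2000: the DSN past column 2000 is lost to "…",
// so the second regex pass silently finds nothing.
const truncated = truncateLine(longLine, 2000);

// The fix: Number.POSITIVE_INFINITY keeps the line intact.
const intact = truncateLine(longLine, Number.POSITIVE_INFINITY);
```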
Also fixes a stale docstring at the top of `worker-pool.ts` that
still described the `.unref()` design from an earlier iteration.
The current code explicitly does NOT call `.unref()` (it caused a
deadlock) and the old docstring contradicted the implementation.
## Test plan
- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] Reproduction above → 0 DSNs (pre-fix) → 1 DSN (post-fix)
- [x] New regression test in `test/lib/dsn/code-scanner.test.ts`:
"finds DSNs on long single-line files (>2000 chars, common in
minified JS)"
- [x] `bun test --timeout 15000 test/lib test/commands test/types`
— **5641 pass, 0 fail** (+1 new regression test)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — no perf
regression (`scanCodeForDsns` 238ms, `scan.grepFiles` 163ms)
PR #807 originally removed `.unref()` from each worker at spawn to avoid a mid-pipeline deadlock where unref'd workers weren't processing messages on idle ticks. But keeping workers ref'd broke the CLI exit path: `project view` (and any command that transits `detectAllDsns` → `scanCodeForDsns` → `grepFiles`) would emit its output and then hang indefinitely because ref'd-but-idle workers kept the event loop alive.

E2E caught this:

(fail) sentry project view > requires org and project [15001.15ms]
^ this test timed out after 15000ms.
Expected: 1
Received: 143 (SIGTERM)

Bisected cleanly:

- `d2305ba1` (workers-only commit): exit=1 on `project view` — clean.
- `6ea26f3d` (DSN unification): exit=124 — hang.

Unification made `scanCodeForDsns` route through `grepFiles`, which creates the worker pool. Before unification, the DSN path went through `walkFiles + mapFilesConcurrent` directly and never touched the pool, so `project view` never saw workers.

## Fix: ref/unref balancing per dispatch

Workers are now `.unref()`'d at spawn (an idle pool doesn't hold the loop open), then `.ref()`'d in `dispatch()` before a request is posted, and `.unref()`'d again when the dispatch settles — via the `message` `result` handler, the `error` handler's mass-reject path, the readiness-failure path, and `terminate()`. Each of these paths has a matched unref for every ref, so the ref count stays balanced.

This gives us both properties we need:

- **Clean CLI exit**: when all dispatches settle, workers are idle + unref'd, the event loop drains, and the process exits naturally. No `beforeExit` handler or explicit `terminatePool()` required on the happy path.
- **No mid-pipeline deadlock**: ref'd workers during active work means their `message` events fire on the main thread's next tick, not starved by Bun's idle-tick bypass of unref'd handles.
Added `refWorker` / `unrefWorker` helper functions for readability, and to guard against runtimes where `Worker` doesn't expose the ref/unref methods (Node's DOM Worker shim, hypothetically).

## Manual verification

$ cd /tmp/empty-dir
$ time sentry project view
Error: Could not auto-detect organization and project.
...
real 0m0.040s  # was 15s timeout before fix
exit=1

Ran 10 consecutive invocations — all exit cleanly in <50ms.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — workers still function end-to-end; no perf regression (`scan.grepFiles` 162ms, `scanCodeForDsns` 238ms)
- [x] Rebuilt binary, ran `project view` from empty dir 10× — all exit cleanly in <50ms. Pre-fix: 5/5 hit the 15s timeout.
- [x] `auth status`, `project view` with DSN, etc. — all exit cleanly.
…ted count
Two Cursor Bugbot findings caught on the latest force-push.
## 1. `filesCollected` counts matches, not unique files (medium)
`grepFiles` emits one `GrepMatch` per matching line, not per file.
The scanner was incrementing `filesCollected` on every match, so a
file with 3 DSN-matching lines inflated the counter (and the two
`dsn.files_collected` / `dsn.files_scanned` telemetry attributes it
backed) by 3×.
Fix: use `filesSeenForMtime.size` instead. That set was already
tracking unique file paths for mtime recording, so no new variable
is needed, and the count is semantically accurate: "files with at
least one validated DSN" — matching rg's `--files-with-matches`
semantics.
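The counting fix reduces to deduplicating on path, which can be sketched as (hypothetical helper name; the real code reuses the existing `filesSeenForMtime` set rather than rebuilding one):

```typescript
// grep emits one GrepMatch per matching *line*, so a per-file counter
// must dedup on path — the rg --files-with-matches semantics.
function countFilesWithMatches(matches: { path: string }[]): number {
  const unique = new Set<string>();
  for (const m of matches) {
    unique.add(m.path);
  }
  return unique.size;
}
```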
## 2. Silent data loss when all workers die (medium)
The `dispatchBatch` error handler was swallowing dispatch rejections
(worker errors, "all workers dead"). If every batch failed, the
consumer loop would see zero results and exit normally, and callers
like the DSN scanner would get an empty result indistinguishable
from "no DSNs found." The DSN cache layer would then persist a
false-negative empty result, leaving the scan broken until the 24h
TTL expires.
Fix: track `dispatchedBatches` / `failedBatches` counters in
`grepViaWorkers`. When the consumer loop exits (via the `finally`
block that already re-raises `producerError`), throw if every
dispatched batch failed. A single failed batch is still acceptable
(per-file errors are swallowed by design inside the worker), but a
pipeline-wide failure now surfaces instead of masquerading as
"empty result."
## Test plan
- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — no perf
regression (`scan.grepFiles` 159ms, `scanCodeForDsns` 232ms)
- [x] Rebuilt binary + `project view` from empty dir — exits cleanly
(exit 1) in <50ms. 3 consecutive runs all clean.
Not adding a regression test for finding #2 — the "all workers die"
path requires crashing a real worker, which is hard to reproduce
deterministically in a unit test. The change is a defensive guard
against an edge case that only shows up when the worker runtime
itself is broken (e.g., missing `node:fs`).
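The guard from finding #2 can be sketched as a small counter pair. This is a hedged reconstruction (the real counters are locals in `grepViaWorkers`, not a class):

```typescript
class BatchFailureTracker {
  dispatchedBatches = 0;
  failedBatches = 0;

  // Wrap a batch dispatch. Individual batch failures are tolerated
  // (per-file errors are swallowed by design inside the worker) but counted.
  track<T>(dispatch: Promise<T>): Promise<T | undefined> {
    this.dispatchedBatches += 1;
    return dispatch.catch(() => {
      this.failedBatches += 1;
      return undefined;
    });
  }

  // Called when the consumer loop exits: a pipeline-wide failure must
  // surface as an error instead of masquerading as an empty result.
  assertNotTotalFailure(): void {
    if (this.dispatchedBatches > 0 && this.failedBatches === this.dispatchedBatches) {
      throw new Error("grep worker pipeline: every dispatched batch failed");
    }
  }
}
```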
Cursor Bugbot caught a real race in the previous commit's per-dispatch
ref/unref pattern.
`Worker.ref()` and `Worker.unref()` toggle a single boolean state —
they are NOT reference-counted. Calling `unref()` releases the
event-loop hold entirely, regardless of how many prior `ref()`
calls were made.
Previous commit: every `dispatch()` called `refWorker`, every
completion called `unrefWorker`. When a single worker had 2+
concurrent dispatches in flight (normal with round-robin batching),
the first completion's `unref()` call unrefed the worker even
though other dispatches were still running. If the walker had also
finished by that point, the event loop would exit with pending
worker results dropped silently.
Fix: track inflight count per worker (already exists as
`pw.inflight`) and only unref when it drops to zero. Specifically:
- **`message` result handler:** decrement inflight; if it hit zero,
unref. Other dispatches still pending → keep ref'd.
- **`error` handler:** drain all pending rejects, then unref ONCE
at the end (worker is gone either way).
- **readiness-failure path in `dispatch()`:** decrement inflight;
if it hit zero, unref (matches `message` handler logic).
- **`terminate()`:** same as error handler — reject all pending,
then unref once.
The spawn-time unref stays (inflight starts at 0, worker is idle).
The per-dispatch `ref()` is safe to call unconditionally (idempotent
no-op when already ref'd), so no change there.
Updated the ref/unref strategy docstring to call out the boolean
semantics explicitly, so the next person to touch this code
doesn't assume reference-counting.
## Test plan
- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — workers
still parallelize correctly (`scan.grepFiles` 160ms,
`scanCodeForDsns` 237ms). If the unref race were being hit,
workers would silently starve and numbers would regress to
fallback-path territory (~320ms).
- [x] Rebuilt binary, ran `project view` from empty dir 5× — all
exit cleanly in <50ms.
Seer caught a telemetry inconsistency from the unification refactor.
`scanCodeForDsns` was reporting `dsn.files_scanned = filesSeenForMtime.size`
— count of unique files with validated DSNs. `scanCodeForFirstDsn`
was reporting `dsn.files_scanned = filesScanned` — count of every
file walked. Same attribute name, two different meanings depending
on which path executed. Makes the telemetry unreliable for scanner
performance monitoring.
Fix: switch `scanCodeForDsns` from `grepFiles` (streaming) to
`collectGrep` (buffered), which returns `stats.filesRead` — the
count of files the grep pipeline actually read and tested. That's
equivalent to walker-yielded files (after the walker's own
extension/gitignore filters), matching what `scanCodeForFirstDsn`
reports.
DSN workloads buffer <10 matches typically (validated DSNs from a
source tree), so the streaming-to-buffered switch has no measurable
perf impact. The `filesSeenForMtime` set stays — it still dedups
mtime writes across multi-DSN-per-file (grep emits one match per
line; without the dedup, a file with 3 DSN-containing lines would
trigger 3 mtime writes for the same path).
Updated the comment to reflect the new role (mtime-write dedup,
not telemetry) and pointed the docstrings at `collectGrep` instead
of `grepFiles`.
## Test plan
- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — no perf
regression (`scan.grepFiles` 160ms — this bench op uses
`collectGrep` so it was already in the buffered path;
`scanCodeForDsns` reruns are in the same range as before).
- [x] Rebuilt binary, ran `project view` from empty dir — all
exit cleanly in <50ms.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit dff0be1.
Two latent LOW-severity findings caught on the final CI round.
## 1. Seer: failure-check ran before in-flight dispatches settled
In the early-exit path (`AbortError`, `maxResults = 0`), the
`finally` block's all-batches-failed check could race with
in-flight dispatch `.catch()` handlers. Sequence:
1. Consumer emits enough matches, exits loop.
2. `finally` runs, awaits `producer` (which just dispatches).
3. `finally` checks `failedBatches === dispatchedBatches`.
4. In-flight dispatch `.catch()` fires LATER, bumps
`failedBatches`.
Step 3 reads stale state. If every batch would ultimately fail,
we'd silently return an empty result instead of surfacing the
pipeline failure. Seer rated this LOW because only `maxResults: 0`
triggers the early exit before any dispatch can settle, and no
production caller uses that value — but it's a real bug either
way and the fix is cheap.
Fix: track dispatch promises and `await Promise.allSettled()` on
them in the `finally` block before the failure check. `allSettled`
rather than `all` because individual rejections are already
counted via the dispatch's `.catch()`; we just need them to
complete.
## 2. Cursor: readiness-failure used `pop()` not identity-based removal
The readiness-failure handler in `dispatch()` was calling
`chosen.pending.pop()` to drop "this dispatch's slot" — but with
multiple concurrent dispatches queued to the same worker, `pop()`
returns whichever slot happened to be at the END of the queue,
not the one belonging to the failing dispatch. Handlers for
sibling dispatches would execute in whatever order `ready`'s
rejection chain fired, and `pop()` would reject them in reverse
queue order.
Currently unreachable because `chosen.ready` never rejects in the
current pool impl (the ready promise has no failure path), but
latent — any future addition of a startup timeout or init failure
would expose the bug.
Fix: hold a direct reference to `ourSlot` in each dispatch's
closure. The readiness-failure handler finds its slot via
`indexOf(ourSlot)` and splices that specific entry, preserving
FIFO semantics for sibling dispatches.
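The identity-based removal reduces to an `indexOf` + `splice` on the dispatch's own slot (sketch with hypothetical shapes; the real slots hold `{resolve, reject}` rather than an id):

```typescript
type Slot = { id: number };

// Each dispatch closes over its own slot. On a readiness failure it
// removes exactly that entry — never pop(), which would take an
// arbitrary sibling's slot off the end of the queue.
function removeOwnSlot(pending: Slot[], ourSlot: Slot): boolean {
  const idx = pending.indexOf(ourSlot);
  if (idx === -1) {
    return false; // already settled by the message handler
  }
  pending.splice(idx, 1);
  return true;
}
```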
## Test plan
- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] `bun test --timeout 15000 test/lib test/commands test/types`
— **5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — no perf
regression (`scan.grepFiles` 160ms)
- [x] Rebuilt binary, ran `project view` from empty dir 3×
— all exit cleanly.
BYK added a commit that referenced this pull request on Apr 22, 2026
## Summary

Follow-up to #791. Replaces the init-wizard's `rg → git grep → fs walk` fallback chain in `src/lib/init/tools/grep.ts` + `glob.ts` with thin adapters over the pure-TS `collectGrep` / `collectGlob` helpers from `src/lib/scan/`. The Mastra wire contract is preserved byte-identical on all existing fields. Two optional fields (`caseInsensitive`, `multiline`) are added to `GrepSearch` for future-proofing — no current server invocation sets them.

## What changed

### Adapters (net −392 LOC of production code)

| File | Before | After |
|---|---:|---:|
| `src/lib/init/tools/grep.ts` | 337 LOC | 114 LOC |
| `src/lib/init/tools/glob.ts` | 147 LOC | 73 LOC |
| `src/lib/init/tools/search-utils.ts` | 146 LOC | **deleted** |
| `src/lib/init/types.ts` | — | +17 LOC (two optional fields) |

The adapters now just:

1. Sandbox the user-supplied `search.path` via `safePath`.
2. Forward each search/pattern to `collectGrep` / `collectGlob` with the wire-level constants (`maxResults`, `maxLineLength`) plumbed through.
3. Strip `absolutePath` from each `GrepMatch` — the Mastra wire has never included it.
4. Catch `ValidationError` from bad regex so a single bad pattern surfaces as an empty per-search row rather than aborting the whole payload.

### New optional `GrepSearch` fields

```ts
export type GrepSearch = {
  pattern: string;
  path?: string;
  include?: string;
  caseInsensitive?: boolean; // NEW
  multiline?: boolean; // NEW
};
```

No current Mastra server invocation sets these. Adding them now means the server can start sending them without a CLI update. The underlying scan engine natively supports both.

### Tests

- **Deleted 3 obsolete tests** that shadowed `rg` in `PATH` to force the fallback chain. With the pure-TS adapter there's no fallback chain to exercise; the tests had become tautological.
- **Deleted scaffolding**: `writeExecutable`, `setPath`, `helperBinDir`, `savedPath` — only used by the deleted tests.
- **Kept 3 pre-existing tests** unchanged: include filters, multi-pattern glob, sandbox rejection.
- **Added 3 new adapter-specific tests**:
  - `grep result matches MUST NOT include absolutePath` — pins the strip behavior so `absolutePath` never leaks to the Mastra agent.
  - `grep bad regex yields empty matches without crashing the payload` — documents the `ValidationError` catch contract.
  - `grep caseInsensitive flag enables case-insensitive matching` — end-to-end coverage for the new wire field.

## Behavior changes (intentional, only affects users without `rg`/`git`)

Before this PR, the init-wizard fs-walk fallback was naive: no `.gitignore` handling, a narrow skip list, no binary detection. Users with `rg` or `git` installed never took this path. After this PR, **every user gets rg-like behavior** via the pure-TS scanner:

- **Nested `.gitignore`** honored (cumulative semantics, matching git + rg).
- **Wider skip-dir list** — scan excludes `.next`, `target`, `vendor`, `coverage`, `.cache`, `.turbo` in addition to the old skip set.
- **Binary files filtered** — scan runs an 8 KB NUL-byte sniff before grep-ing.

Users with `rg` installed see no behavior change relative to the rg-path of the old adapter.

## Benchmarks

Measured after rebasing onto `main` (includes #791, #804, #805, #807 — the full grep/worker-pool stack).

- **Fixture:** 10k files (~80 MB), 3 monorepo packages, mix of text + binary.
- **Config:** 5 runs after 2 warmup, pre-warmed page cache.
- **Machine:** linux/x64, 4 CPUs, Bun 1.3.11. ripgrep 14.1.0.
- **Harness:** `/tmp/bench-init-tools.ts` — invokes the adapter via the real `grep(GrepPayload)` entry point Mastra uses.
### Adapter vs rg — single search

| Pattern | rg | NEW uncapped | NEW cap=100 | uncapped vs rg | cap vs rg |
|---|---:|---:|---:|---:|---:|
| `import.*from` (215k matches) | 284 ms | 769 ms | **28.4 ms** | 2.70× slower | **10× FASTER** |
| `function\s+\w+` (216k) | 306 ms | 782 ms | **28.2 ms** | 2.56× slower | **11× FASTER** |
| `SENTRY_DSN` (677) | 81 ms | 496 ms | 79.3 ms | 6.10× slower | parity |
| no matches | 74 ms | 486 ms | 492 ms | 6.6× slower | 6.7× slower |

The init-wizard workload is always capped at 100. On dense-match patterns (the common case — Mastra greps for common code patterns like imports and function signatures) the worker pool's literal prefilter + concurrent fan-out + early exit hit 100 matches in ~28 ms, **10× faster than rg**. On rare/no-match patterns we're walker-bound at ~500 ms — still fine for the init wizard.

### Adapter multi-search (realistic Mastra workload)

Payload: 3 patterns (`import.*from`, `SENTRY_DSN`, `function\s+\w+`) in parallel via `Promise.all`, `maxResultsPerSearch: 100`.

| Metric | Value |
|---|---:|
| p50 end-to-end | **109 ms** |

All three searches share the same worker pool (dispatched concurrently).

### Scan-harness numbers (context)

From `bun run bench --size large`, showing that the new adapter is routed through the worker-backed grep pipeline:

| Op | p50 |
|---|---:|
| `scan.grepFiles` | 166 ms |
| `scanCodeForDsns` | 232 ms |
| `detectDsn.cold` | 4.78 ms |

## Why ship this

Init-wizard grep has one caller: the Mastra server, which always sends `maxResults: 100`. For that workload the NEW adapter is **10× faster than rg** on common dense patterns and sidesteps the subprocess-dependency problem:

- Users without `rg` no longer fall back to a naive walk that ignores gitignore and scans binaries.
- No spawning, no stderr draining, no exit-code decoding, no parsing of pipe-delimited output.
- `GrepSearch` gains `caseInsensitive` + `multiline` — passthrough to the scan engine's native support.
The correctness tax (nested `.gitignore` respected, wider skip list, binaries filtered) is paid once per scan regardless of match density. On the init wizard's capped workload it's noise; on exhaustive scans it's ~500 ms on 10k files vs rg's ~300 ms.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in `src/lib/formatters/markdown.ts:281`, not touched by this PR)
- [x] `bun test test/lib/init/ test/lib/scan/` — **352 pass, 0 fail**
- [x] Rebased onto latest main (includes #791, #804, #805, #807)
- [x] `bun run /tmp/bench-init-tools.ts` — numbers in the tables above

## What this PR does NOT change

- `GrepPayload` / `GlobPayload` — unchanged structurally; only `GrepSearch` gains optional fields
- `src/lib/init/tools/registry.ts` — tool dispatch unchanged
- `src/lib/init/tools/shared.ts` — `safePath` + `validateToolSandbox` unchanged
- Response shape — `{ ok: true, data: { results: [{pattern, matches, truncated}] } }` identical

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BYK added a commit that referenced this pull request on Apr 22, 2026.
## Summary

Cleanup pass over the `scan/` and `dsn/` modules after the grep + worker-pool stack (#791, #804, #805, #807, #797) landed. Removed comment bloat accumulated across the 6+ review cycles those PRs went through — redundant bug-history narration, repeated explanations of `ref/unref` boolean semantics, "pre-PR-N" references, and other scars that wouldn't survive a fresh-eyes read.

**Net −708 LOC across 12 files. No behavior changes.**

## Per-file reductions

| File | Before | After | Δ |
|---|---:|---:|---:|
| `src/lib/scan/worker-pool.ts` | 466 | 312 | −33% |
| `src/lib/scan/grep.ts` | 985 | 712 | −28% |
| `src/lib/dsn/code-scanner.ts` | 541 | 377 | −30% |
| `src/lib/scan/grep-worker.js` | 153 | 114 | −25% |
| `src/lib/dsn/scan-options.ts` | 70 | 52 | −26% |
| `src/lib/init/tools/grep.ts` | 122 | 98 | −20% |
| `src/lib/init/tools/glob.ts` | 74 | 59 | −20% |

Plus minor trims in `types.ts`, `walker.ts`, `path-utils.ts`, `scan/glob.ts`, and `script/text-import-plugin.ts`.

## What was removed

- Redundant explanations of `Worker.ref()` / `.unref()` boolean semantics (stated 3× in `worker-pool.ts` — kept once, on the primary ref/unref helper pair).
- Multi-paragraph "earlier iteration did X, deadlocked, so we now do Y" histories — kept in the git log where they belong.
- "pre-PR-3" / "pre-refactor" / "previous version" narration that explained how code used to look before the current session's work.
- File-header docstrings that repeated what the module structure and exports already told you.
- Rationale comments for `biome-ignore`s that were already justified by adjacent context.

## What was kept

- Every `biome-ignore` comment (all still justified).
- Every "why" comment tied to a real gotcha (e.g., the length-change warning on case-insensitive literal prefilters, the `/g` flag cloning rationale, the pipeline-failure detector explanation).
- Every JSDoc on exported functions and types.
- All inline comments that explain non-obvious constraints (sandbox enforcement, cache-contract stability, etc.).

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing markdown warning)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3` — no perf regression (`scan.grepFiles` 167 ms, `scanCodeForDsns` 232 ms — matching pre-trim numbers)
- [x] `bun run build --single` — binary builds and exits cleanly on `sentry project view` from an empty dir (3 consecutive runs, all exit=1)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

## Summary

Parallel grep via a worker pool with binary-transferable matches. The DSN scanner is unified onto the grep pipeline so it gets the same parallelism for free.
## Perf (synthetic/large, 10k files, p50)
(Benchmark table not captured in this extract; ops measured: `scan.grepFiles`, `scanCodeForDsns`, `detectAllDsns.cold`, `detectDsn.cold`, `detectAllDsns.warm`.)

`scan.grepFiles` is the primary target (the bench op using `dsnScanOptions()`). `scanCodeForDsns` benefits because the DSN scanner was unified onto the grep pipeline during this PR (commit 2) and now routes through the same worker pool rather than its pre-existing `walkFiles + mapFilesConcurrent` path.

Cumulative journey through the session series (#791, #804, #805, this PR):
`scan.grepFiles` went from ~1300 ms → 160 ms (~8× faster, an ~88% reduction).

## Full-scan comparison vs rg (for context)
All measurements on the 10k-file synthetic fixture:
(Per-pattern table not captured in this extract; patterns benchmarked: `import.*from`, `SENTRY_DSN`, `function\s+\w+`.)

We're still 2–7× slower than rg on uncapped full scans (walker-bound, not worker-bound). On capped workloads — the init-wizard workload, which always caps at 100 — we're 14–17× faster than rg because the literal prefilter + early exit kicks in quickly.
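The capped-workload win comes from ceasing to consume the walker once the cap is hit. A minimal sketch of that early-exit shape (illustrative only; the real pipeline stops dispatching new batches and discards in-flight results):

```typescript
// Early exit at maxResults: stop pulling from the (lazy) walker entirely
// once the cap is reached. Hypothetical helper, not the real grep.ts code.
function* capped<T>(source: Iterable<T>, maxResults: number): Generator<T> {
  let emitted = 0;
  for (const item of source) {
    if (emitted >= maxResults) return; // never touches the rest of the tree
    emitted++;
    yield item;
  }
}
```

Because the walker is itself lazy, returning early means most of the tree is never stat'd or read on dense-match patterns.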
## Architecture
### Worker pool — `src/lib/scan/worker-pool.ts`

Lazy-initialized singleton. Pool size `min(8, max(2, availableParallelism()))`. Feature-gated on `isWorkerSupported()` — the runtime must expose `Worker`, `Blob`, and `URL.createObjectURL`. On runtimes without Workers, `grepFilesInternal` falls back to the existing `mapFilesConcurrent` path. Runtime escape hatch: `SENTRY_SCAN_DISABLE_WORKERS=1`.

Ref/unref lifecycle: workers are `.unref()`'d at spawn so an idle pool doesn't hold the event loop open. Each `dispatch()` calls `ref()` before posting work; the `message` result handler calls `unref()` when the worker's `inflight` count drops back to zero. `Worker.ref()` / `.unref()` are idempotent booleans (NOT reference-counted) — hence the inflight-zero guard at every unref site.
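A minimal model of that balancing, with hypothetical names (`FakeWorker` mimics the idempotent boolean semantics of `Worker.ref()`/`.unref()`; this is not the real `worker-pool.ts` API):

```typescript
// Model of per-dispatch ref/unref balancing with the inflight-zero guard.
class FakeWorker {
  refed = false; // idempotent flag, like the real Worker — NOT a counter
  ref() { this.refed = true; }
  unref() { this.refed = false; }
}

class PoolSlot {
  inflight = 0;
  constructor(readonly worker: FakeWorker) {
    worker.unref(); // an idle pool must not hold the event loop open
  }
  dispatchStart() {
    this.inflight++;
    this.worker.ref(); // hold the loop while work is pending
  }
  dispatchDone() {
    this.inflight--;
    // A bare unref here would release the hold even with sibling dispatches
    // still in flight, because unref is not counted. Guard on inflight zero.
    if (this.inflight === 0) this.worker.unref();
  }
}
```

With the guard, the event-loop hold is released exactly when the worker goes idle, which is what lets the compiled CLI exit cleanly.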
### Worker source — `src/lib/scan/grep-worker.js` + `script/text-import-plugin.ts`

The worker body lives in a real `.js` file so it's lintable, syntax-checked, and formattable (catching lint issues the original inline-string approach silently hid). `worker-pool.ts` imports it via `with { type: "text" }` — natively handled by Bun at runtime, and polyfilled by a ~20-LOC esbuild plugin for the build-time path. The string content is then fed to a `Blob` + `URL.createObjectURL` to spawn workers.
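A sketch of the gate + spawn path, under stated assumptions: `workerSource` here is a stub standing in for the text-imported worker body, and the helper names are illustrative, not the real module's exports:

```typescript
// Feature gate + Blob-URL spawn sketch. globalThis lookups keep the sketch
// runnable on runtimes where Worker is absent (e.g., plain Node).
const workerSource = `self.onmessage = () => {};`; // stub worker body

function isWorkerSupported(): boolean {
  const g = globalThis as any;
  return (
    typeof g.Worker === "function" &&
    typeof g.Blob === "function" &&
    typeof g.URL?.createObjectURL === "function"
  );
}

function spawnGrepWorker(): unknown {
  const g = globalThis as any;
  const disabled = g.process?.env?.SENTRY_SCAN_DISABLE_WORKERS === "1";
  if (disabled || !isWorkerSupported()) {
    return null; // caller falls back to the mapFilesConcurrent path
  }
  // A Blob URL avoids filesystem URL resolution, which breaks inside
  // `bun build --compile` single-file binaries.
  const url = g.URL.createObjectURL(
    new g.Blob([workerSource], { type: "application/javascript" }),
  );
  return new g.Worker(url);
}
```

The escape hatch and the capability check share one return path, so every degraded environment lands on the same fallback.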
I investigated Bun's documented alternative (`new Worker("./path.ts")` with the worker as a compile entrypoint), but all three forms from the Bun docs either (a) hang in compiled binaries because `import.meta.url` resolves to the binary path and URL resolution doesn't hit the bundled entrypoint, or (b) work but require the binary to run with CWD equal to the `bun build` project root — brittle for a CLI users run from arbitrary dirs. Blob URL + text import works identically in `bun run`, `bun test`, and compiled binaries.
### Pipeline — `src/lib/scan/grep.ts`

`grepFilesInternal` dispatches to one of two sub-generators:

1. `grepViaWorkers` — producer/consumer streaming. The walker runs on the main thread; paths accumulate into batches of 200; each full batch dispatches round-robin to the least-loaded worker; the worker returns matches as `{ ints: Uint32Array, linePool: string }` via `postMessage` with `transfer: [ints.buffer]` (zero-copy). The main thread decodes and yields.
2. `grepViaAsyncMain` — the existing `mapFilesConcurrent` path, used when workers are unavailable or explicitly disabled.
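The batching and least-loaded dispatch in `grepViaWorkers` can be modeled with pure logic like this (illustrative names; the real producer consumes the walker asynchronously and the `alive` flag comes from the dead-worker fix described below):

```typescript
// Pure-logic sketch of path batching + least-loaded worker selection.
const BATCH_SIZE = 200;

type Slot = { inflight: number; alive: boolean };

function* intoBatches(paths: Iterable<string>): Generator<string[]> {
  let batch: string[] = [];
  for (const p of paths) {
    batch.push(p);
    if (batch.length === BATCH_SIZE) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the partial tail batch
}

function pickLeastLoaded(slots: Slot[]): number {
  let best = -1;
  for (let i = 0; i < slots.length; i++) {
    if (!slots[i].alive) continue; // dead workers never receive work
    if (best === -1 || slots[i].inflight < slots[best].inflight) best = i;
  }
  if (best === -1) throw new Error("all workers dead");
  return best;
}
```

Batches rather than single files keep the per-`postMessage` overhead amortized over 200 reads.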
### Binary match encoding

Structured clone of 215k `GrepMatch` objects costs ~200 ms. A transferable ArrayBuffer + shared `linePool` string drops that to ~2-3 ms. Each match is 4 u32s: `[pathIdx, lineNum, lineOffset, lineLength]`. Line text is appended to a single accumulator string per batch; offsets into it are rebuilt into `GrepMatch.line` on the main thread.
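A sketch of the 4-u32 encode/decode round trip, assuming the `[pathIdx, lineNum, lineOffset, lineLength]` layout described above (field and helper names are illustrative, not the real worker protocol):

```typescript
// Encode matches into one transferable Uint32Array + one shared line pool.
type GrepMatch = { path: string; lineNumber: number; line: string };

function encodeBatch(matches: GrepMatch[], paths: string[]) {
  const ints = new Uint32Array(matches.length * 4);
  let linePool = "";
  matches.forEach((m, i) => {
    ints[i * 4] = paths.indexOf(m.path); // pathIdx into the batch's path list
    ints[i * 4 + 1] = m.lineNumber;
    ints[i * 4 + 2] = linePool.length;   // offset into the shared pool
    ints[i * 4 + 3] = m.line.length;
    linePool += m.line;                  // single accumulator string per batch
  });
  return { ints, linePool }; // post with transfer: [ints.buffer] for zero-copy
}

function decodeBatch(ints: Uint32Array, linePool: string, paths: string[]): GrepMatch[] {
  const out: GrepMatch[] = [];
  for (let i = 0; i < ints.length; i += 4) {
    out.push({
      path: paths[ints[i]],
      lineNumber: ints[i + 1],
      line: linePool.slice(ints[i + 2], ints[i + 2] + ints[i + 3]),
    });
  }
  return out;
}
```

Transferring `ints.buffer` moves ownership instead of copying; only the flat `linePool` string crosses via structured clone, which is cheap for a single string.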
### DSN scanner unification

The DSN scanner previously had its own `walkFiles + mapFilesConcurrent + Bun.file().text() + extractDsnsFromContent` pipeline that predated the worker pool. Now it routes through `collectGrep` with `DSN_PATTERN`, the same worker pool, and post-filters matches on the main thread for comments / host validation / dedup.

- `GrepOptions` gained `recordMtimes?: boolean` and `onDirectoryVisit?` pass-through to the walker.
- `GrepMatch` gained an optional `mtime?: number` (populated only when `recordMtimes: true`).
- `collectGrep` returns `{ matches, stats }`, and the DSN scanner uses `stats.filesRead` for telemetry — the same semantic as `scanCodeForFirstDsn`'s walker-entry count.
- `scanCodeForFirstDsn` deliberately does NOT use the worker pool — spawning a pool for a single-DSN scan adds ~20 ms of init cost that dwarfs the work. It uses a direct `walkFiles` loop with `extractFirstDsnFromContent`.
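The main-thread post-filter can be sketched as follows. The `DSN_PATTERN` regex here is a simplified stand-in (not the real one), and the comment heuristic is illustrative; the real scanner also validates hosts and tracks mtimes:

```typescript
// Post-filter grep output on the main thread: skip commented-out lines,
// extract DSN-shaped strings, dedup across files.
const DSN_PATTERN = /https:\/\/[0-9a-f]+@[\w.-]+\/\d+/g;

function postFilterDsns(matchedLines: string[]): string[] {
  const seen = new Set<string>();
  for (const line of matchedLines) {
    if (/^\s*(\/\/|#|\*)/.test(line)) continue; // skip commented-out DSNs
    for (const dsn of line.match(DSN_PATTERN) ?? []) {
      seen.add(dsn); // dedup the same DSN appearing in multiple files
    }
  }
  return [...seen];
}
```

Keeping validation and dedup on the main thread lets the workers stay generic grep executors with no DSN-specific logic.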
## Bugs caught + fixed during implementation

Over the session + pre-merge review + 6 bot review rounds, these non-trivial issues were found and fixed:
- **FIFO message-handler race** — `addEventListener` per dispatch caused result messages to fire all pending handlers, resolving the wrong dispatch's promise. Fix: a single `onmessage` per worker + a per-worker FIFO `pending` queue shifted on each result.
- **Consumer wake race** — a `wakeConsumer()` fired before the consumer set `notify` would lose the wakeup. Fix: a latched `notifyPending` flag that bridges the window between check and assignment.
- **`worker.unref()` deadlock** — unconditionally unref'd workers didn't process messages on idle ticks, hanging the pipeline when the main thread was only awaiting worker results. Fix: ref/unref balancing per dispatch (unref only when `inflight === 0`).
- **Dead-worker dispatch hang** — after a worker `error` event, the least-loaded picker still routed work to it because `inflight` had been reset to 0. Fix: an `alive: boolean` flag checked in dispatch; skip dead workers, throw when all are dead.
- **Boolean ref/unref mishandling** — `.ref()` / `.unref()` are idempotent booleans, not reference-counted. Per-completion unrefs prematurely released the event-loop hold while sibling dispatches were still in flight. Fix: inflight-zero guards at every unref site.
- **CLI exit hang in compiled binary** — removing `.unref()` entirely (from an earlier attempt to fix the deadlock) broke clean exit because ref'd-but-idle workers held the loop open. Caught by E2E on `sentry project view`. Fixed by the ref/unref balancing above.
- **DSN line truncation silently dropped DSNs past column ~2000** — `grepFiles` truncates `match.line` at `maxLineLength` (default 2000) with `…`; the DSN scanner's per-line re-match would fail when a DSN's trailing `\d+` got replaced. Realistic trigger: minified JS bundles. Fix: pass `maxLineLength: Number.POSITIVE_INFINITY` from the DSN scanner.
- **`filesCollected` counted matches, not files** — `grepFiles` emits one match per line, so a file with 3 DSN-matching lines inflated the counter 3×. Fix: use `stats.filesRead` (the walker-yielded file count) from `collectGrep` instead.
- **Silent pipeline-wide worker failure** — the error handler swallowed dispatch rejections; if every batch failed, the consumer saw zero results and exited normally, leaving callers (notably the DSN cache layer) with false-negative empty results. Fix: track `dispatchedBatches` / `failedBatches`; throw in `finally` if all dispatches failed.
- **DSN telemetry semantic mismatch** — `dsn.files_scanned` meant "files with validated DSNs" in `scanCodeForDsns` but "all walker-yielded files" in `scanCodeForFirstDsn`. Unified to walker-yielded files in both paths via `stats.filesRead`.
- **Latent readiness-failure FIFO violation** — `pending.pop()` in the readiness-failure handler removed the wrong dispatch's slot when multiple dispatches were queued to the same worker. Fix: hold an `ourSlot` reference per dispatch; identity-based removal via `indexOf` + `splice`.
- **Latent maxResults=0 race** — the failure check ran before in-flight dispatches settled. Fix: `await Promise.allSettled(dispatchPromises)` before the check.
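The FIFO message-handler fix and the readiness-failure fix share one mechanism, the per-worker FIFO `pending` queue with identity-based removal, which a small model makes concrete (illustrative structure, not the real `worker-pool.ts`):

```typescript
// Model of a per-worker channel: one message handler, FIFO pending queue,
// identity-based slot removal on readiness failure.
type Slot<T> = { resolve: (v: T) => void; reject: (e: Error) => void };

class WorkerChannel<T> {
  private pending: Slot<T>[] = [];

  dispatch(post: () => void): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const slot: Slot<T> = { resolve, reject };
      this.pending.push(slot);
      try {
        post();
      } catch (err) {
        // Readiness failure: remove OUR slot by identity, not pending.pop(),
        // which could evict a sibling dispatch's slot.
        const i = this.pending.indexOf(slot);
        if (i !== -1) this.pending.splice(i, 1);
        reject(err as Error);
      }
    });
  }

  // Single message handler per worker: results arrive in dispatch order, so
  // each result settles exactly the oldest pending slot.
  onMessage(result: T): void {
    this.pending.shift()?.resolve(result);
  }
}
```

One handler plus a shifted queue makes result-to-dispatch pairing a structural invariant instead of a property of listener registration order.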
## Correctness

- `grepFiles` / `collectGrep` return the same shape; the new options on `GrepOptions` are all optional.
- `grepViaWorkers` polls `signal.aborted` on every iteration of the consumer emit loop and throws `DOMException("AbortError")` — matches the fallback path.
- `maxResults` and `stopOnFirst` stop dispatching new batches; in-flight batches finish but their results are discarded. Some wasted CPU on early exit — acceptable because defaults are typically loose.
- Per-file read errors are skipped (as in `mapFilesConcurrent`); worker-level errors propagate via `alive: false` + pending-queue drain.
- `code-scanner.test.ts` passes (151 tests), including `deduplicates same DSN from multiple files` (a subtle `fileHadValidDsn` mtime-tracking invariant) and a new long-minified-line regression test.
- Behavior delta: no `log.debug("Cannot read file: ...")` per file read error — workers swallow them silently. `scanCodeForFirstDsn` still logs top-level errors.
## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing markdown warning)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — 5641 pass, 0 fail
- [x] `bun test test/isolated` — 138 pass
- [x] `bun test test/lib/scan/grep.test.ts` — 38 pass (includes `AbortSignal fires mid-iteration`)
- [x] `bun test test/lib/dsn/code-scanner.test.ts` — 151 pass (includes the long-minified-line regression test)
- [x] `SENTRY_SCAN_DISABLE_WORKERS=1 bun run bench` — fallback path still works end-to-end
- [x] `bun run bench --size large --runs 3 --warmup 1` — numbers in the table above
- [x] `bun run build --single` — compiled binary builds and runs cleanly
- [x] `SENTRY_CLIENT_ID=placeholder bun run bundle` — npm library bundle builds; worker source correctly inlined via the esbuild plugin
- [x] E2E `sentry project view > requires org and project` passes — was the CI failure that caught the CLI-exit hang
## Commits (squashed)

- `d2305ba1` — perf(scan): parallel grep via worker pool with binary-transferable matches
- `6ea26f3d` — perf(scan,dsn): unify DSN scanner with grep pipeline; fix dead-worker dispatch
- `47a5ee8d` — refactor(scan): worker body as a real .js file (not an inline string)
- `b25a2e48` — fix(dsn): recover DSNs past column 2000 on long minified lines
- `bdcb3a25` — fix(scan): balance ref/unref per dispatch so CLI exits cleanly
- `872708ee` — fix(scan,dsn): surface worker-pipeline failures + correct filesCollected count
- `fa3dc621` — fix(scan): respect boolean semantics of Worker.ref()/.unref()
- `dff0be1d` — fix(dsn): unify dsn.files_scanned semantics across both scan paths
- `4e0b6b55` — fix(scan): close remaining latent worker-pool races

🤖 Generated with Claude Code
Co-authored-by: Claude Opus 4.7 (1M context) noreply@anthropic.com