
perf(scan): parallel grep via worker pool with binary-transferable matches#807

Merged
BYK merged 9 commits into main from byk/perf/grep-workers
Apr 22, 2026

Conversation


BYK (Member) commented Apr 21, 2026

Summary

Parallel grep via a worker pool with binary-transferable matches. DSN scanner unified onto the grep pipeline so it gets the same parallelism for free.

Perf (synthetic/large, 10k files, p50)

| Op | Before (#805) | After (this PR) | Δ |
|---|---:|---:|---:|
| scan.grepFiles | 322ms | 160ms | −50% |
| scanCodeForDsns | 313ms | 237ms | −24% |
| detectAllDsns.cold | 308ms | 251ms | −18% |
| detectDsn.cold | 4.98ms | 4.60ms | — |
| detectAllDsns.warm | 27.0ms | 27.2ms | — |

scan.grepFiles is the primary target (bench op using dsnScanOptions()). scanCodeForDsns benefits because the DSN scanner was unified onto the grep pipeline during this PR (commit 2) and now routes through the same worker pool rather than its pre-existing walkFiles + mapFilesConcurrent path.

Cumulative journey through the session series (#791, #804, #805, this PR): scan.grepFiles went from ~1300ms → 160ms (8× faster, ~88% reduction).

Full-scan comparison vs rg (for context)

All measurements on 10k-file synthetic fixture:

| Pattern | rg | NEW (default opts) | NEW (cap@100) |
|---|---:|---:|---:|
| import.*from | 337ms | 775ms | 24ms |
| SENTRY_DSN | 83ms | 559ms | 82ms |
| function\s+\w+ | 333ms | 785ms | 20ms |

We're still 2-7× slower than rg on uncapped full-scans (walker-bound, not worker-bound). On capped workloads — the init-wizard workload, which always caps at 100 — we're up to 14-17× faster than rg because the literal prefilter + early-exit kicks in quickly.

Architecture

Worker pool — src/lib/scan/worker-pool.ts

Lazy-initialized singleton. Pool size min(8, max(2, availableParallelism())). Feature-gated on isWorkerSupported() — runtime must expose Worker, Blob, and URL.createObjectURL. On runtimes without Workers, grepFilesInternal falls back to the existing mapFilesConcurrent path. Runtime escape hatch: SENTRY_SCAN_DISABLE_WORKERS=1.

Ref/unref lifecycle: Workers are .unref()'d at spawn so an idle pool doesn't hold the event loop open. Each dispatch() calls ref() before posting work; the message result handler calls unref() when the worker's inflight count drops back to zero. This gives us:

  • Clean CLI exit: once all dispatches settle, workers are idle + unref'd, the loop drains, process exits naturally.
  • No mid-pipeline deadlock: ref'd workers during active work keep message events flowing (idle unref'd workers get starved by Bun's idle-tick bypass).

Worker.ref() / .unref() are idempotent booleans (NOT reference-counted) — hence the inflight-zero guard at every unref site.
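A minimal sketch of that lifecycle (illustrative names — FakeWorker stands in for the runtime Worker; this is not the actual worker-pool.ts code):

```typescript
// Sketch of the ref/unref balancing (illustrative; not worker-pool.ts).
// FakeWorker mimics the runtime Worker's semantics: ref()/unref() flip an
// idempotent boolean — they do NOT maintain a reference count.
class FakeWorker {
  refed = false;
  ref(): void { this.refed = true; }
  unref(): void { this.refed = false; }
}

interface PooledWorker {
  worker: FakeWorker;
  inflight: number; // dispatches posted but not yet settled
}

function dispatch(pw: PooledWorker): void {
  pw.worker.ref(); // hold the event loop open while work is in flight
  pw.inflight += 1;
}

function onSettled(pw: PooledWorker): void {
  pw.inflight -= 1;
  // Unconditionally unref'ing here would drop the event-loop hold while
  // sibling dispatches are still pending — hence the inflight-zero guard.
  if (pw.inflight === 0) pw.worker.unref();
}
```

With two overlapping dispatches, the first completion leaves the worker ref'd; only the second (inflight back to zero) unrefs it, so an idle pool never holds the loop open.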

Worker source — src/lib/scan/grep-worker.js + script/text-import-plugin.ts

The worker body lives in a real .js file so it's lintable, syntax-checked, and formattable (catching lint issues the original inline-string approach silently hid). worker-pool.ts imports it via with { type: "text" } — natively handled by Bun at runtime, and polyfilled by a ~20-LOC esbuild plugin for the build-time path. The string content is then fed to a Blob + URL.createObjectURL to spawn workers.

I investigated Bun's documented alternative (new Worker("./path.ts") with the worker as a compile entrypoint) but all three forms from the Bun docs either (a) hang in compiled binaries because import.meta.url resolves to the binary path and URL resolution doesn't hit the bundled entrypoint, or (b) work but require the binary to run with CWD equal to the bun build project root — brittle for a CLI users run from arbitrary dirs. Blob URL + text-import works identically in bun run, bun test, and compiled binaries.

Pipeline — src/lib/scan/grep.ts

grepFilesInternal dispatches to one of two sub-generators:

  1. grepViaWorkers — producer/consumer streaming. Walker runs on main thread; paths accumulate into batches of 200; each full batch dispatches round-robin to the least-loaded worker; worker returns matches as { ints: Uint32Array, linePool: string } via postMessage with transfer: [ints.buffer] (zero-copy). Main thread decodes and yields.

  2. grepViaAsyncMain — the existing mapFilesConcurrent path, used when workers are unavailable or explicitly disabled.

Binary match encoding

Structured clone of 215k GrepMatch objects costs ~200ms. Transferable ArrayBuffer + shared linePool string drops that to ~2-3ms. Each match is 4 u32s: [pathIdx, lineNum, lineOffset, lineLength]. Line text is appended to a single accumulator string per batch; offsets into it are rebuilt into GrepMatch.line on the main thread.

DSN scanner unification

The DSN scanner previously had its own walkFiles + mapFilesConcurrent + Bun.file().text() + extractDsnsFromContent pipeline that predated the worker pool. Now it routes through collectGrep with DSN_PATTERN, the same worker pool, and post-filters matches on the main thread for comments / host validation / dedup.

  • GrepOptions gained recordMtimes?: boolean and onDirectoryVisit? pass-through to the walker.
  • GrepMatch gained optional mtime?: number (populated only when recordMtimes: true).
  • collectGrep returns { matches, stats } and the DSN scanner uses stats.filesRead for telemetry — same semantic as scanCodeForFirstDsn's walker-entry count.
  • scanCodeForFirstDsn deliberately does NOT use the worker pool — spawning a pool for a single-DSN scan adds ~20ms of init cost that dwarfs the work. It uses a direct walkFiles loop with extractFirstDsnFromContent.

Bugs caught + fixed during implementation

Over the session + pre-merge review + 6 bot review rounds, these non-trivial issues were found and fixed:

  1. FIFO message-handler race — addEventListener per dispatch caused result messages to fire all pending handlers, resolving the wrong dispatch's promise. Fix: single onmessage per worker + per-worker FIFO pending queue shifted on each result.

  2. Consumer wake race — a wakeConsumer() fired before the consumer set notify would lose the wakeup. Fix: latched notifyPending flag that bridges the window between check and assignment.

  3. worker.unref() deadlock — unconditionally unref'd workers didn't process messages on idle ticks, hanging the pipeline when the main thread was only awaiting worker results. Fix: ref/unref balancing per dispatch (unref only when inflight === 0).

  4. Dead-worker dispatch hang — after a worker error event, the least-loaded picker still routed work to it because inflight had been reset to 0. Fix: alive: boolean flag checked in dispatch; skip dead workers, throw when all are dead.

  5. Boolean ref/unref mishandling — .ref() / .unref() are idempotent booleans, not reference-counted. Per-completion unrefs prematurely released the event-loop hold while sibling dispatches were still in flight. Fix: inflight-zero guards at every unref site.

  6. CLI exit hang in compiled binary — removing .unref() entirely (from an earlier attempt to fix the deadlock) broke clean exit because ref'd-but-idle workers held the loop open. Caught by E2E on sentry project view. Fixed by the ref/unref balancing above.

  7. DSN line truncation silently dropped DSNs past column ~2000 — grepFiles truncates match.line at maxLineLength (default 2000) with a `…` suffix; the DSN scanner's per-line re-match would fail when a DSN's trailing \d+ got replaced. Realistic trigger: minified JS bundles. Fix: pass maxLineLength: Number.POSITIVE_INFINITY from the DSN scanner.

  8. filesCollected counted matches, not files — grepFiles emits one match per line, so a file with 3 DSN-matching lines inflated the counter 3×. Fix: use stats.filesRead (walker-yielded file count) from collectGrep instead.

  9. Silent pipeline-wide worker failure — error handler swallowed dispatch rejections; if every batch failed, the consumer saw zero results and exited normally, leaving callers (notably the DSN cache layer) with false-negative empty results. Fix: track dispatchedBatches / failedBatches; throw in finally if all dispatches failed.

  10. DSN telemetry semantic mismatch — dsn.files_scanned meant "files with validated DSNs" in scanCodeForDsns but "all walker-yielded files" in scanCodeForFirstDsn. Unified to walker-yielded files in both paths via stats.filesRead.

  11. Latent readiness-failure FIFO violation — pending.pop() in the readiness-failure handler removed the wrong dispatch's slot when multiple dispatches were queued to the same worker. Fix: hold ourSlot reference per dispatch, identity-based removal via indexOf + splice.

  12. Latent maxResults=0 race — failure-check ran before in-flight dispatches settled. Fix: await Promise.allSettled(dispatchPromises) before the check.
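Bugs 9 and 12 combine into one small pattern — count dispatches, settle everything, then check. A sketch under assumed names (runPipeline and dispatchBatch are hypothetical, not the actual grep.ts internals):

```typescript
// Sketch of bugs 9 + 12 combined (illustrative names): per-batch errors
// are swallowed, but the pipeline refuses to return a silent empty result
// when every dispatch failed — and the check only runs after ALL in-flight
// dispatches have settled.
async function runPipeline<T>(
  batches: T[][],
  dispatchBatch: (batch: T[]) => Promise<T[]>, // hypothetical worker call
): Promise<T[]> {
  const results: T[] = [];
  const dispatchPromises: Promise<void>[] = [];
  let dispatchedBatches = 0;
  let failedBatches = 0;

  for (const batch of batches) {
    dispatchedBatches += 1;
    dispatchPromises.push(
      dispatchBatch(batch)
        .then((r) => { results.push(...r); })
        .catch(() => { failedBatches += 1; }), // per-batch errors swallowed
    );
  }

  // Settle everything BEFORE the failure check (the maxResults=0 race was
  // checking while dispatches were still in flight)...
  await Promise.allSettled(dispatchPromises);

  // ...then surface a pool-wide failure instead of a false-negative
  // empty result.
  if (dispatchedBatches > 0 && failedBatches === dispatchedBatches) {
    throw new Error("all worker dispatches failed");
  }
  return results;
}
```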

Correctness

  • Public API unchanged. grepFiles / collectGrep return the same shape; new options on GrepOptions are all optional.
  • AbortSignal propagation. grepViaWorkers polls signal.aborted on every iteration of the consumer emit loop and throws DOMException("AbortError") — matches the fallback path.
  • Early exit. maxResults and stopOnFirst stop dispatching new batches; in-flight batches finish but their results are discarded. Some wasted CPU on early exit — acceptable because defaults are typically loose.
  • Error handling. Per-file read errors in the worker are silently swallowed (same as mapFilesConcurrent); worker-level errors propagate via alive: false + pending-queue drain.
  • DSN semantic equivalence. Every pre-refactor test in code-scanner.test.ts passes (151 tests), including "deduplicates same DSN from multiple files" (a subtle fileHadValidDsn mtime-tracking invariant) and a new long-minified-line regression test.
  • Observability drift (minor). The full-scan path no longer emits log.debug("Cannot read file: ...") per file read error — workers swallow them silently. scanCodeForFirstDsn still logs top-level errors.

Test plan

  • bunx tsc --noEmit — clean
  • bun run lint — clean (1 pre-existing markdown warning)
  • bun test --timeout 15000 test/lib test/commands test/types — 5641 pass, 0 fail
  • bun test test/isolated — 138 pass
  • bun test test/lib/scan/grep.test.ts — 38 pass (includes AbortSignal fires mid-iteration)
  • bun test test/lib/dsn/code-scanner.test.ts — 151 pass (includes long-minified-line regression test)
  • SENTRY_SCAN_DISABLE_WORKERS=1 bun run bench — fallback path still works end-to-end
  • bun run bench --size large --runs 3 --warmup 1 — numbers in table above
  • bun run build --single — compiled binary builds and runs cleanly
  • SENTRY_CLIENT_ID=placeholder bun run bundle — npm library bundle builds; worker source correctly inlined via esbuild plugin
  • E2E test sentry project view > requires org and project passes — was the CI failure that caught the CLI-exit hang

Commits (squashed)

  1. d2305ba1 — perf(scan): parallel grep via worker pool with binary-transferable matches
  2. 6ea26f3d — perf(scan,dsn): unify DSN scanner with grep pipeline; fix dead-worker dispatch
  3. 47a5ee8d — refactor(scan): worker body as a real .js file (not an inline string)
  4. b25a2e48 — fix(dsn): recover DSNs past column 2000 on long minified lines
  5. bdcb3a25 — fix(scan): balance ref/unref per dispatch so CLI exits cleanly
  6. 872708ee — fix(scan,dsn): surface worker-pipeline failures + correct filesCollected count
  7. fa3dc621 — fix(scan): respect boolean semantics of Worker.ref()/.unref()
  8. dff0be1d — fix(dsn): unify dsn.files_scanned semantics across both scan paths
  9. 4e0b6b55 — fix(scan): close remaining latent worker-pool races

🤖 Generated with Claude Code

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tches

`scan.grepFiles` on 10k files drops from 322ms → 191ms p50 (−41%).
Follow-up to PR #791 (literal prefilter), #804 (simplify prefilter
to file gate), #805 (walker hot-path rewrite).

The prior PRs left the walker at 231ms and the overall pipeline at
~1200ms for uncapped scans — the gap was dominated by per-file
async orchestration in `mapFilesConcurrent` rather than actual work.
Prototypes showed that running the per-file work synchronously
inside a small pool of workers (4 on this box) collapses the
orchestration cost without changing the public API.

## Architecture

### `src/lib/scan/worker-pool.ts` (~300 LOC)

Lazy-initialized singleton pool (size `min(8, max(2, availableParallelism()))`).
Feature-gated on `isWorkerSupported()` — runtime must expose
`Worker`, `Blob`, and `URL.createObjectURL`. On runtimes without
Workers (e.g., Node dist without `--experimental-worker-threads-*`
flags), `grepFilesInternal` falls back to the existing
`mapFilesConcurrent` path. Runtime escape hatch:
`SENTRY_SCAN_DISABLE_WORKERS=1`.

### `src/lib/scan/grep-worker-source.ts`

Worker source inlined as a JS string constant and loaded via
`Blob` + `URL.createObjectURL`. **This is required for
`bun build --compile` single-file binaries** — filesystem-based
`new Worker(url)` hangs at runtime because the URL can't be
resolved against the compiled binary's module map. Blob URLs
sidestep that entirely. The worker is self-contained, uses
`require("node:fs")` CommonJS-style, and has zero imports from
local modules.

### `src/lib/scan/grep.ts` integration

`grepFilesInternal` dispatches to one of two sub-generators:

1. `grepViaWorkers` — producer/consumer streaming. Walker runs on
   main thread; paths accumulate into batches of 200; each full
   batch dispatches round-robin to the least-loaded worker; worker
   returns matches as `{ ints: Uint32Array, linePool: string }` via
   `postMessage` with `transfer: [ints.buffer]` (zero-copy). Main
   thread decodes into `GrepMatch[]` lazily and yields.

2. `grepViaAsyncMain` — the existing `mapFilesConcurrent` path,
   used when workers are unavailable or disabled.

### Binary match encoding

Structured clone of 215k `GrepMatch` objects costs ~200ms on this
box. Transferable ArrayBuffer + shared `linePool` string drops that
to ~2-3ms. Each match is 4 u32s: `[pathIdx, lineNum, lineOffset,
lineLength]`. Line text is appended to a single accumulator string
per batch; offsets into it are rebuilt into `GrepMatch.line` on the
main thread via `linePool.slice(lineOffset, lineOffset + lineLength)`.
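The packing/unpacking can be sketched as follows — a simplified stand-in for the real worker/decoder, assuming `paths` is the batch's path list:

```typescript
// Simplified stand-in for the worker-side packing and main-thread decode.
// Each match occupies 4 u32 slots; line text lives in one shared string.
interface GrepMatch { path: string; lineNum: number; line: string; }

function encodeMatches(matches: GrepMatch[], paths: string[]): { ints: Uint32Array; linePool: string } {
  const ints = new Uint32Array(matches.length * 4);
  let linePool = "";
  matches.forEach((m, i) => {
    ints[i * 4 + 0] = paths.indexOf(m.path); // pathIdx
    ints[i * 4 + 1] = m.lineNum;
    ints[i * 4 + 2] = linePool.length;       // lineOffset into the pool
    ints[i * 4 + 3] = m.line.length;         // lineLength
    linePool += m.line;
  });
  return { ints, linePool }; // ints.buffer would be transferred, not cloned
}

function decodeMatches(ints: Uint32Array, linePool: string, paths: string[]): GrepMatch[] {
  const out: GrepMatch[] = [];
  for (let i = 0; i < ints.length; i += 4) {
    out.push({
      path: paths[ints[i]],
      lineNum: ints[i + 1],
      line: linePool.slice(ints[i + 2], ints[i + 2] + ints[i + 3]),
    });
  }
  return out;
}
```

Offsets and lengths are in UTF-16 code units (the units `String.prototype.length` and `.slice` use), so the round-trip is exact even for non-ASCII lines.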

## Bugs caught + fixed during implementation

### FIFO queue race (message handler mismatch)

Initial design used `addEventListener` per dispatch. Because Web
Workers deliver every `result` message to EVERY attached listener,
multiple concurrent dispatches to the same worker would all fire
together on the first result — resolving the wrong promise with
the wrong batch's data. When a handler shifted state to mark "done",
subsequent dispatches would find their handlers already removed.

**Fix:** single `onmessage` per worker, per-worker FIFO `pending`
queue of `{resolve, reject}` slots. Each `result` message shifts
one slot. Workers process `postMessage` FIFO (both Bun and Node
guarantee this inside a single worker), so FIFO-order resolution is
sound.
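A sketch of that shape (WorkerChannel and its members are illustrative, not the real pool internals); it also shows why a failure path must remove its own slot by identity rather than `pending.pop()`:

```typescript
// Sketch of the single-handler FIFO fix. One onmessage handler per worker
// shifts exactly one pending slot per result; a failure path removes ITS
// OWN slot by identity, never pending.pop().
interface Slot { resolve: (r: unknown) => void; reject: (e: Error) => void; }

class WorkerChannel {
  private pending: Slot[] = [];

  constructor(private post: (req: unknown) => void) {}

  dispatch(req: unknown): Promise<unknown> {
    return new Promise((resolve, reject) => {
      const ourSlot: Slot = { resolve, reject };
      this.pending.push(ourSlot);
      try {
        this.post(req);
      } catch (err) {
        // Identity-based removal: pop() could evict a sibling's slot.
        const idx = this.pending.indexOf(ourSlot);
        if (idx !== -1) this.pending.splice(idx, 1);
        reject(err as Error);
      }
    });
  }

  // Single onmessage handler: the worker answers FIFO, so shift one slot.
  onResult(result: unknown): void {
    const slot = this.pending.shift();
    slot?.resolve(result);
  }
}
```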

### Consumer wake race (lost notifications)

`wakeConsumer()` fired immediately if `notify` was set, else did
nothing. When a batch settled before the consumer entered its
`await new Promise(resolve => notify = resolve)` block, the
notification was dropped and the consumer hung forever.

**Fix:** added `notifyPending` flag. `wakeConsumer()` sets it when
no consumer is waiting; consumer checks the flag before awaiting,
and the await's executor also checks it to close the window
between check and assignment.
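The latch can be sketched in isolation (illustrative names; not the actual grep.ts code):

```typescript
// Sketch of the latched wakeup. Without notifyPending, a wake that fires
// before the consumer installs `notify` is lost and the consumer awaits
// forever.
let notify: (() => void) | undefined;
let notifyPending = false;

function wakeConsumer(): void {
  if (notify) {
    const n = notify;
    notify = undefined;
    n();
  } else {
    notifyPending = true; // latch the wakeup for a consumer not yet waiting
  }
}

function waitForWork(): Promise<void> {
  if (notifyPending) {
    notifyPending = false; // consume the latched wakeup without sleeping
    return Promise.resolve();
  }
  return new Promise((resolve) => { notify = resolve; });
}
```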

### `worker.unref()` deadlock

Initial design called `.unref()` on each worker for clean CLI
exit. But when the main thread's only event-loop work was waiting
on a worker result, unref'd workers didn't process messages on
idle ticks, deadlocking the pipeline. The symptom was visible in
the bench — after the walker finished and all in-flight batches
except 2 returned, the last 2 never completed.

**Fix:** removed `.unref()`. The CLI calls `process.exit()` at the
end so worker cleanup isn't needed for process termination. For
library embeddings, callers can use `terminatePool()` explicitly.

## Perf (synthetic/large, 10k files, p50)

| Op | Before (PR #805) | After | Δ |
|---|---:|---:|---:|
| `scan.grepFiles` | 322ms | **191ms** | **−41%** |
| `detectAllDsns.cold` | 308ms | 297ms | −4% |
| `scanCodeForDsns` | 313ms | 293ms | −6% |
| `detectAllDsns.warm` | 27.0ms | 27.5ms | — |
| `scan.walk` | 231ms | 263ms | noise |

`scan.grepFiles` is the primary target. `scanCodeForDsns` doesn't
use the worker pool (it uses the DSN scanner directly), so the
~20ms improvement there is walker variance.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in
      `src/lib/formatters/markdown.ts` unrelated)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
      **5640 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun test test/lib/scan/grep.test.ts` — 38 pass (including
      `AbortSignal fires mid-iteration` which exercises the signal
      propagation through the producer/consumer pipeline)
- [x] `bun test test/lib/dsn/code-scanner.test.ts` — 52 pass
- [x] `SENTRY_SCAN_DISABLE_WORKERS=1 bun run bench` — verifies the
      fallback path still works end-to-end (360ms p50, matches
      pre-change behavior)

github-actions Bot commented Apr 21, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://cli.sentry.dev/_preview/pr-807/

Built to branch gh-pages at 2026-04-22 08:16 UTC.
Preview will be ready when the GitHub Pages deployment is complete.


github-actions Bot commented Apr 21, 2026

Codecov Results 📊

138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms

✨ No test changes detected

All tests are passing successfully.

❌ Patch coverage is 77.89%. Project has 2016 uncovered lines.
❌ Project coverage is 95.07%. Comparing base (base) to head (head).

Files with missing lines (3)
| File | Patch % | Missing lines |
|---|---:|---:|
| src/lib/scan/worker-pool.ts | 66.17% | 68 |
| src/lib/scan/grep.ts | 84.15% | 29 |
| src/lib/dsn/code-scanner.ts | 89.90% | 10 |
Coverage diff
@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
- Coverage    95.63%    95.07%    -0.56%
==========================================
  Files          281       283        +2
  Lines        40462     40926      +464
  Branches         0         0         —
==========================================
+ Hits         38694     38910      +216
- Misses        1768      2016      +248
- Partials         0         0         —

Generated by Codecov Action

… dispatch

## DSN scanner unification

The DSN scanner had its own `walkFiles + mapFilesConcurrent +
Bun.file().text() + extractDsnsFromContent` pipeline that predated
the worker pool from PR #807. It was doing the same fundamental
work as `grepFiles` (regex over file content + collect matches) but
via a separate path with more orchestration overhead.

Rewrote `scanCodeForDsns` to use `grepFiles` directly:

- Pattern: `DSN_PATTERN` — grep's literal prefilter gates files by
  `http` (matches the old `HTTP_SCHEME_PROBE` fast path exactly).
- Walker options: `dsnScanOptions()` + `recordMtimes: true` +
  `onDirectoryVisit` hook for cache-invalidation maps.
- Per-match post-filter on the main thread: comment-line check,
  host validation, dedup, mtime tracking, package path inference.
- Multi-DSN-per-line handling preserved via `extractDsnsOnLine`
  which re-runs the regex on `match.line`.

`scanCodeForFirstDsn` deliberately does NOT route through the worker
pool. It's the stopOnFirst fast path (first valid DSN wins, return
early). Spawning a 4-worker pool for a workload that often completes
after reading 1-2 files would add ~20ms of pool-init cost that
dwarfs the work. Instead it uses a direct `walkFiles` loop with
sync-ish per-file read + `extractFirstDsnFromContent`, which already
benefits from the `/http/i` literal fast path. This keeps
`detectDsn.cold` at ~3.5ms instead of regressing to ~20ms.
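The shape of that fast path, with stand-ins for `walkFiles` / `extractFirstDsnFromContent` (assumed signatures, not the real ones):

```typescript
// Illustrative shape of the stopOnFirst fast path: a direct walk with an
// early return, no pool init. `walk`, `read`, and `extractFirst` are
// stand-ins for the real walkFiles / file read / extractFirstDsnFromContent.
async function scanForFirst(
  walk: () => AsyncIterable<string>,            // yields file paths
  read: (path: string) => Promise<string>,
  extractFirst: (content: string) => string | undefined,
): Promise<string | undefined> {
  for await (const path of walk()) {
    let content: string;
    try {
      content = await read(path);
    } catch {
      continue; // unreadable files are skipped, matching the scanner
    }
    const dsn = extractFirst(content);
    if (dsn !== undefined) return dsn; // first valid hit wins; no pool spin-up
  }
  return undefined;
}
```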

## New grep options: `recordMtimes` + `onDirectoryVisit`

To support the DSN cache-invalidation maps, extended `GrepOptions`:

- `recordMtimes: boolean` — when true, each `GrepMatch` includes
  the source file's floored `mtimeMs` on an optional `mtime` field.
  Walker already exposes `entry.mtime` via its own `recordMtimes`
  passthrough; the main thread captures mtimes per-batch before
  dispatching and the decoder attaches them indexed by `pathIdx`.
  Worker doesn't need to stat again (the main thread already did
  via `statSync`).
- `onDirectoryVisit: (absDir, mtimeMs) => void` — pass-through to
  the walker's per-directory hook. Used for directory-level cache
  invalidation ("has a new file been added to this dir?").

Both are opt-in; when unused, `GrepMatch.mtime` is absent (not 0 —
distinguishes "not asked" from "asked and happens to be 0").
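How the mtime join might look on the decode side (illustrative names; not the actual decoder):

```typescript
// Sketch of attaching mtimes without re-stat'ing in the worker: the main
// thread captures mtimes per batch, and the decoder joins them by pathIdx.
function attachMtimes(
  matches: { pathIdx: number; mtime?: number }[],
  batchMtimes: number[] | undefined, // indexed by pathIdx; undefined when recordMtimes is off
): void {
  if (!batchMtimes) return; // opt-in: leave `mtime` absent, not 0
  for (const m of matches) {
    m.mtime = batchMtimes[m.pathIdx];
  }
}
```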

## Cursor Bugbot: dead-worker dispatch deadlock

Cursor found a real correctness issue in the worker pool from #807:

> After a worker `error` event, the error handler rejects all
> pending items and resets `pw.inflight` to 0, but the dead worker
> remains in the pool. The least-loaded selection in `dispatch`
> will favor this worker (it appears to have 0 `inflight`), push
> new pending slots, and call `postMessage` — but the dead worker
> can't respond. Those dispatch promises hang forever, so
> `inflightCount` in `grepViaWorkers` never reaches 0, deadlocking
> the consumer's exit condition at
> `walkerDone && inflightCount === 0`.

Added `alive: boolean` to `PooledWorker`, flipped to false on the
first `error` event before rejecting pending dispatches. The
least-loaded picker now skips dead workers entirely and falls back
to `Promise.reject(...)` when every worker is dead. Prevents the
reported deadlock and ensures a pool-wide failure surfaces as an
error instead of hanging.
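The guarded picker reduces to a few lines (sketch with illustrative types):

```typescript
// Sketch of the least-loaded picker with the `alive` guard: a dead worker
// with inflight reset to 0 must never win the selection.
interface PW { inflight: number; alive: boolean; }

function pickWorker(pool: PW[]): PW {
  let best: PW | undefined;
  for (const pw of pool) {
    if (!pw.alive) continue; // skip dead workers even at inflight 0
    if (!best || pw.inflight < best.inflight) best = pw;
  }
  if (!best) throw new Error("all grep workers are dead");
  return best;
}
```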

## Perf (synthetic/large, 10k files, p50)

| Op | Before unification | After unification | Δ |
|---|---:|---:|---:|
| `scanCodeForDsns` | 293ms | **234ms** | **−20%** |
| `detectAllDsns.cold` | 297ms | **250ms** | **−16%** |
| `detectDsn.cold` | 4.98ms | **3.45ms** | **−31%** |
| `scanCodeForFirstDsn` | 2.07ms | 1.88ms | — |
| `scan.grepFiles` | 191ms | 170ms | − |
| `detectAllDsns.warm` | 27.5ms | 27.2ms | — |

The full-scan paths (`scanCodeForDsns`, `detectAllDsns.cold`) are
20% faster because they now share the worker-pool infrastructure
from PR #807. The `stopOnFirst` paths (`scanCodeForFirstDsn`,
`detectDsn.cold`) are essentially unchanged (or slightly better)
because they now use a tighter direct-walker loop.

## Correctness

- `extractDsnsFromContent` still exported for
  `src/lib/dsn/detector.ts::isDsnStillPresent` (single-file cache
  verification) — unchanged.
- `extractFirstDsnFromContent` still exported for
  `scanCodeForFirstDsn`'s per-file fast path — unchanged.
- All DSN tests pass, including the tricky "deduplicates same DSN
  from multiple files" test which required tracking
  `fileHadValidDsn` separately from seen-set insertion to preserve
  the pre-refactor behavior of recording mtime for every file that
  contributed a validated DSN (even if the DSN was a duplicate).
- `sourceMtimes` / `dirMtimes` shapes unchanged — the cache schema
  in `src/lib/db/dsn-cache.ts` doesn't need to change.
- Minor observability drift: per-file `log.debug("Cannot read file: …")`
  no longer fires on individual read errors in the full-scan path
  (worker swallows them). The `scanCodeForFirstDsn` path still logs
  top-level errors. Accepted as minor regression.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing markdown warning)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
      **5640 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun test test/lib/dsn/code-scanner.test.ts` — 150 pass
      (including the "deduplicates same DSN from multiple files"
      regression test that caught the initial buggy implementation)
- [x] `bun test test/lib/scan/grep.test.ts` — 38 pass
- [x] `bun run bench --size large --runs 5 --warmup 2` —
      `scanCodeForDsns` 234ms p50 (−20% from 293ms)
BYK added 2 commits April 21, 2026 23:24
Replaces `src/lib/scan/grep-worker-source.ts` (a TS module
exporting the worker body as a string template literal) with a
real `src/lib/scan/grep-worker.js` file that's imported via
`with { type: "text" }`. The runtime behavior is unchanged — Bun
and the esbuild plugin both resolve the attribute to the file
contents as a string — but now the worker is a first-class file:
lintable, syntax-checked, formattable, searchable.

## Why

`grep-worker-source.ts` was functional but awkward. The string
inside the template literal wasn't visible to Biome's lint rules,
IDE navigation, or any other tooling that cares about real files.
We caught several lint issues once the body became a real `.js`
file (`noIncrementDecrement`, `useBlockStatements`, `useTemplate`)
— real issues that had been silently skipped.

The blob-URL + raw-text-import approach is the only form that
works in all three of our target environments:

- **`bun run` (dev):** Bun handles `with { type: "text" }` natively.
- **`bun build --compile` (single-file CLI):** the esbuild plugin
  inlines the file contents as a string at bundle time; the
  compiled binary has no runtime file resolution.
- **`bun bundle` (npm library `dist/index.cjs`):** same esbuild
  plugin, same inlined string. When Node consumers import the
  library, Workers are feature-gated on `isWorkerSupported()` and
  fall back to the async path.

I explored Bun's documented alternative (`new Worker(new URL(...))`
or `new Worker("./path.ts")` with the worker as an additional
entrypoint to `bun build --compile`) but all three forms the docs
suggest either (a) fail at runtime in compiled binaries because
`import.meta.url` resolves to the binary path and URL resolution
doesn't hit the bundled entrypoint, or (b) work but require the
binary to run with CWD equal to the `bun build` project root,
which is brittle. The Blob-URL + text-import approach has no such
fragility.

## Changes

- **New:** `src/lib/scan/grep-worker.js` — real plain-JS file
  containing the worker body (moved verbatim from the old string
  template; no logic changes). Pass-through of all existing
  features: literal gate, line-counter, multiline mode, maxLineLength
  truncation, maxMatchesPerFile, transferable-buffer match packing.
- **New:** `script/text-import-plugin.ts` — tiny esbuild plugin
  (~20 LOC) that polyfills Bun's `with { type: "text" }` attribute
  by reading the file and emitting it as a default-export string.
  Shared by `script/build.ts` (CLI binary) and `script/bundle.ts`
  (npm library).
- **Modified:** `src/lib/scan/worker-pool.ts` imports the worker
  body as `import grepWorkerSource from "./grep-worker.js" with { type: "text" }`.
  TypeScript's type system doesn't model the attribute, so we cast
  through `unknown` to a string. The runtime Blob constructor
  would throw if given a non-string, so the cast is guarded.
- **Modified:** `script/build.ts`, `script/bundle.ts` — add the
  plugin to their esbuild plugin arrays.
- **Deleted:** `src/lib/scan/grep-worker-source.ts`.
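For reference, a plugin of this shape could look like the following — an assumption-laden sketch, not the repo's actual script/text-import-plugin.ts. It assumes an esbuild version that exposes import attributes as `args.with` on resolve callbacks, and leans on esbuild's built-in "text" loader (the minimal structural types let the sketch stand alone without esbuild installed):

```typescript
// Sketch of a `with { type: "text" }` polyfill plugin for esbuild.
// Imports carrying the attribute are claimed into a private namespace,
// then loaded with esbuild's built-in "text" loader, which emits the file
// contents as a default-export string.
import { readFile } from "node:fs/promises";
import path from "node:path";

// Minimal structural types so the sketch is self-contained.
interface ResolveArgs { path: string; importer: string; with?: Record<string, string>; }
interface LoadArgs { path: string; }
interface Build {
  onResolve(opts: { filter: RegExp }, cb: (args: ResolveArgs) => unknown): void;
  onLoad(opts: { filter: RegExp; namespace: string }, cb: (args: LoadArgs) => unknown): void;
}

const textImportPlugin = {
  name: "text-import",
  setup(build: Build) {
    build.onResolve({ filter: /.*/ }, (args) => {
      if (args.with?.type !== "text") return undefined; // not ours — pass through
      return {
        path: path.resolve(path.dirname(args.importer), args.path),
        namespace: "text-import",
      };
    });
    build.onLoad({ filter: /.*/, namespace: "text-import" }, async (args) => ({
      contents: await readFile(args.path, "utf8"),
      loader: "text", // emits `export default "<file contents>"`
    }));
  },
};
```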

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing markdown warning).
      Biome now lints `grep-worker.js`; several issues surfaced
      during the refactor (`noIncrementDecrement`, `useBlockStatements`,
      `useTemplate`) were all fixed in the same commit.
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
      **5640 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `SENTRY_CLIENT_ID=placeholder bun run bundle` — library
      bundle builds successfully, worker source is embedded in
      `dist/index.cjs` as expected.
- [x] `bun run bench --size large --runs 3 --warmup 1` — perf
      unchanged (`scan.grepFiles` 161ms, `scanCodeForDsns` 241ms).
Pre-merge self-review found a silent correctness regression in the
unified DSN pipeline from the previous commit: `grepFiles` defaults
to `maxLineLength: 2000`, which truncates `match.line` with a `…`
suffix. The DSN scanner then re-runs `DSN_PATTERN` on that
truncated line to recover all DSNs on the line — but the pattern
ends with `/\d+`, so a DSN near the end of a long line loses its
trailing digits to `…` and fails the regex silently.

Reproduced pre-fix:

    const filler = "x".repeat(1990);
    const dsn = "https://abc@o111222.ingest.us.sentry.io/7654321";
    const line = `${filler} const DSN = "${dsn}";`;
    writeFileSync(join(tmpDir, "minified.js"), line);

    const result = await scanCodeForDsns(tmpDir);
    // → `result.dsns.length === 0` ❌ (should be 1)

Realistic triggers: minified JS bundles (commonly have lines >2000
chars), source-map-embedded config blobs, base64-encoded envs.
Within the walker's 256KB `MAX_FILE_SIZE` cap, a single line
carrying a DSN past column 2000 is plausible.

Fix: pass `maxLineLength: Number.POSITIVE_INFINITY` from the DSN
scanner. Memory is bounded by `MAX_FILE_SIZE` (per-file) and the
`seen` dedup map (per-scan), so there's no blow-up risk.

Also fixes a stale docstring at the top of `worker-pool.ts` that
still described the `.unref()` design from an earlier iteration.
The current code explicitly does NOT call `.unref()` (it caused a
deadlock) and the old docstring contradicted the implementation.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] Reproduction above → 0 DSNs (pre-fix) → 1 DSN (post-fix)
- [x] New regression test in `test/lib/dsn/code-scanner.test.ts`:
      "finds DSNs on long single-line files (>2000 chars, common in
      minified JS)"
- [x] `bun test --timeout 15000 test/lib test/commands test/types`
      — **5641 pass, 0 fail** (+1 new regression test)
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — no perf
      regression (`scanCodeForDsns` 238ms, `scan.grepFiles` 163ms)
PR #807 originally removed `.unref()` from each worker at spawn to
avoid a mid-pipeline deadlock where unref'd workers weren't
processing messages on idle ticks. But keeping workers ref'd broke
the CLI exit path: `project view` (and any command that transits
`detectAllDsns` → `scanCodeForDsns` → `grepFiles`) would emit its
output and then hang indefinitely because ref'd-but-idle workers
kept the event loop alive. E2E caught this:

    (fail) sentry project view > requires org and project [15001.15ms]
      ^ this test timed out after 15000ms.
    Expected: 1
    Received: 143 (SIGTERM)

Bisected cleanly:

- `d2305ba1` (workers-only commit): exit=1 on `project view` — clean.
- `6ea26f3d` (DSN unification): exit=124 — hang.

Unification made `scanCodeForDsns` route through `grepFiles`, which
creates the worker pool. Before unification, the DSN path went
through `walkFiles + mapFilesConcurrent` directly and never touched
the pool, so `project view` never saw workers.

## Fix: ref/unref balancing per dispatch

Workers are now `.unref()`'d at spawn (idle pool doesn't hold the
loop open), then `.ref()`'d in `dispatch()` before a request is
posted, and `.unref()`'d again when the dispatch settles — via the
`message` `result` handler, the `error` handler's mass-reject path,
the readiness-failure path, and `terminate()`. Each of these paths
has a matched unref for every ref so the ref count stays balanced.

This gives us both properties we need:

- **Clean CLI exit**: when all dispatches settle, workers are idle +
  unref'd, the event loop drains, and the process exits naturally.
  No `beforeExit` handler or explicit `terminatePool()` required on
  the happy path.

- **No mid-pipeline deadlock**: ref'd workers during active work
  means their `message` events fire on the main thread's next tick,
  not starved by Bun's idle-tick bypass of unref'd handles.

Added `refWorker` / `unrefWorker` helper functions for readability
and to guard against runtimes where `Worker` doesn't expose the
ref/unref methods (Node's DOM Worker shim, hypothetically).
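A minimal sketch of what those guard helpers look like (shapes assumed; the real code lives in `src/lib/scan/worker-pool.ts`):

```typescript
// Sketch: feature-test ref/unref before calling, since some runtimes
// expose a Worker without these methods (e.g. a DOM-style shim).
type RefCapable = { ref?: () => void; unref?: () => void };

function refWorker(worker: RefCapable): void {
  if (typeof worker.ref === "function") worker.ref();
}

function unrefWorker(worker: RefCapable): void {
  if (typeof worker.unref === "function") worker.unref();
}
```

Both helpers are safe no-ops when the method is absent, so callers never need to branch on runtime.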

## Manual verification

    $ cd /tmp/empty-dir
    $ time sentry project view
    Error: Could not auto-detect organization and project.
    ...
    real    0m0.040s   # was 15s timeout before fix
    exit=1

Ran 10 consecutive invocations — all exit cleanly in <50ms.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
      **5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — workers
      still function end-to-end; no perf regression
      (`scan.grepFiles` 162ms, `scanCodeForDsns` 238ms)
- [x] Rebuilt binary, ran `project view` from empty dir 10×
      — all exit cleanly in <50ms. Pre-fix: 5/5 hit 15s timeout.
- [x] `auth status`, `project view` with DSN, etc. — all exit
      cleanly.
Comment thread src/lib/scan/worker-pool.ts Outdated
…ted count

Two Cursor Bugbot findings caught on the latest force-push.

## 1. `filesCollected` counts matches, not unique files (medium)

`grepFiles` emits one `GrepMatch` per matching line, not per file.
The scanner was incrementing `filesCollected` on every match, so a
file with 3 DSN-matching lines inflated the counter (and the two
`dsn.files_collected` / `dsn.files_scanned` telemetry attributes it
backed) by 3×.

Fix: use `filesSeenForMtime.size` instead. That set was already
tracking unique file paths for mtime recording, so no new state is
needed, and the count is semantically accurate: "files with at least
one validated DSN", matching rg's `--files-with-matches` semantics.
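The distinction can be sketched with a tiny example (the `GrepMatch` shape here is assumed, only the path field matters):

```typescript
// grep emits one match per matching LINE, so a file with three
// DSN-containing lines yields three matches. Counting unique paths
// via a Set gives the files-with-matches number instead.
type GrepMatch = { path: string; line: number };

function countFilesWithMatches(matches: GrepMatch[]): number {
  const seen = new Set<string>();
  for (const m of matches) seen.add(m.path);
  return seen.size; // analogous to `rg --files-with-matches | wc -l`
}
```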

## 2. Silent data loss when all workers die (medium)

The `dispatchBatch` error handler was swallowing dispatch rejections
(worker errors, "all workers dead"). If every batch failed, the
consumer loop would see zero results and exit normally, and callers
like the DSN scanner would get an empty result indistinguishable
from "no DSNs found." The DSN cache layer would then persist a
false-negative empty result, leaving the scan broken until the 24h
TTL expires.

Fix: track `dispatchedBatches` / `failedBatches` counters in
`grepViaWorkers`. When the consumer loop exits (via the `finally`
block that already re-raises `producerError`), throw if every
dispatched batch failed. A single failed batch is still acceptable
(per-file errors are swallowed by design inside the worker), but a
pipeline-wide failure now surfaces instead of masquerading as
"empty result."
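The guard itself is simple; a sketch with the counter names the PR describes (the function name is illustrative):

```typescript
// Sketch: tolerate partial failure (per-file errors are swallowed by
// design), but surface a pipeline-wide failure instead of letting it
// masquerade as an empty result.
function checkPipelineFailure(dispatched: number, failed: number): void {
  if (dispatched > 0 && failed === dispatched) {
    throw new Error(
      `grep pipeline failed: all ${dispatched} batches rejected`,
    );
  }
}
```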

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
      **5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — no perf
      regression (`scan.grepFiles` 159ms, `scanCodeForDsns` 232ms)
- [x] Rebuilt binary + `project view` from empty dir — exits cleanly
      (exit 1) in <50ms. 3 consecutive runs all clean.

Not adding a regression test for finding #2 — the "all workers die"
path requires crashing a real worker, which is hard to reproduce
deterministically in a unit test. The change is a defensive guard
against an edge case that only shows up when the worker runtime
itself is broken (e.g., missing `node:fs`).
Comment thread src/lib/scan/worker-pool.ts
Cursor Bugbot caught a real race in the previous commit's per-dispatch
ref/unref pattern.

`Worker.ref()` and `Worker.unref()` toggle an idempotent boolean
flag — they are NOT reference-counted. Calling `unref()` releases the
event-loop hold entirely regardless of how many prior `ref()` calls
were made.

Previous commit: every `dispatch()` called `refWorker`, every
completion called `unrefWorker`. When a single worker had 2+
concurrent dispatches in flight (normal with round-robin batching),
the first completion's `unref()` call unrefed the worker even
though other dispatches were still running. If the walker had also
finished by that point, the event loop would exit with pending
worker results dropped silently.

Fix: track inflight count per worker (already exists as
`pw.inflight`) and only unref when it drops to zero. Specifically:

- **`message` result handler:** decrement inflight; if it hit zero,
  unref. Other dispatches still pending → keep ref'd.
- **`error` handler:** drain all pending rejects, then unref ONCE
  at the end (worker is gone either way).
- **readiness-failure path in `dispatch()`:** decrement inflight;
  if it hit zero, unref (matches `message` handler logic).
- **`terminate()`:** same as error handler — reject all pending,
  then unref once.

The spawn-time unref stays (inflight starts at 0, worker is idle).
The per-dispatch `ref()` is safe to call unconditionally (idempotent
no-op when already ref'd), so no change there.

Updated the ref/unref strategy docstring to call out the boolean
semantics explicitly, so the next person to touch this code
doesn't assume reference-counting.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
      **5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — workers
      still parallelize correctly (`scan.grepFiles` 160ms,
      `scanCodeForDsns` 237ms). If the unref race were being hit,
      workers would silently starve and numbers would regress to
      fallback-path territory (~320ms).
- [x] Rebuilt binary, ran `project view` from empty dir 5× — all
      exit cleanly in <50ms.
Comment thread src/lib/dsn/code-scanner.ts
Seer caught a telemetry inconsistency from the unification refactor.

`scanCodeForDsns` was reporting `dsn.files_scanned = filesSeenForMtime.size`
— count of unique files with validated DSNs. `scanCodeForFirstDsn`
was reporting `dsn.files_scanned = filesScanned` — count of every
file walked. Same attribute name, two different meanings depending
on which path executed. Makes the telemetry unreliable for scanner
performance monitoring.

Fix: switch `scanCodeForDsns` from `grepFiles` (streaming) to
`collectGrep` (buffered), which returns `stats.filesRead` — the
count of files the grep pipeline actually read and tested. That's
equivalent to walker-yielded files (after the walker's own
extension/gitignore filters), matching what `scanCodeForFirstDsn`
reports.

DSN workloads buffer <10 matches typically (validated DSNs from a
source tree), so the streaming-to-buffered switch has no measurable
perf impact. The `filesSeenForMtime` set stays — it still dedups
mtime writes across multi-DSN-per-file (grep emits one match per
line; without the dedup, a file with 3 DSN-containing lines would
trigger 3 mtime writes for the same path).

Updated the comment to reflect the new role (mtime-write dedup,
not telemetry) and pointed the docstrings at `collectGrep` instead
of `grepFiles`.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
      **5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — no perf
      regression (`scan.grepFiles` 160ms — this bench op uses
      `collectGrep` so it was already in the buffered path;
      `scanCodeForDsns` reruns are in the same range as before).
- [x] Rebuilt binary, ran `project view` from empty dir — all
      exit cleanly in <50ms.
Comment thread src/lib/scan/grep.ts
Contributor

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit dff0be1.

Comment thread src/lib/scan/worker-pool.ts
Two latent LOW-severity findings caught on the final CI round.

## 1. Seer: failure-check ran before in-flight dispatches settled

In the early-exit path (`AbortError`, `maxResults = 0`), the
`finally` block's all-batches-failed check could race with
in-flight dispatch `.catch()` handlers. Sequence:

  1. Consumer emits enough matches, exits loop.
  2. `finally` runs, awaits `producer` (which just dispatches).
  3. `finally` checks `failedBatches === dispatchedBatches`.
  4. In-flight dispatch `.catch()` fires LATER, bumps
     `failedBatches`.

Step 3 reads stale state. If every batch would ultimately fail,
we'd silently return an empty result instead of surfacing the
pipeline failure. Seer rated this LOW because only `maxResults: 0`
triggers the early exit before any dispatch can settle, and no
production caller uses that value — but it's a real bug either
way and the fix is cheap.

Fix: track dispatch promises and `await Promise.allSettled()` on
them in the `finally` block before the failure check. `allSettled`
rather than `all` because individual rejections are already
counted via the dispatch's `.catch()`; we just need them to
complete.
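A sketch of the settle-before-check ordering (names are illustrative; the real check lives in `grepViaWorkers`'s `finally` block):

```typescript
// Await allSettled so every dispatch's .catch() has already bumped the
// failure counter before the check reads it. Promise.all would
// short-circuit on the first rejection and skip that guarantee.
async function finishScan(
  dispatchPromises: Promise<unknown>[],
  counters: { dispatched: number; failed: number },
): Promise<void> {
  await Promise.allSettled(dispatchPromises);
  if (counters.dispatched > 0 && counters.failed === counters.dispatched) {
    throw new Error("grep pipeline failed: every batch rejected");
  }
}
```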

## 2. Cursor: readiness-failure used `pop()` not identity-based removal

The readiness-failure handler in `dispatch()` was calling
`chosen.pending.pop()` to drop "this dispatch's slot" — but with
multiple concurrent dispatches queued to the same worker, `pop()`
returns whichever slot happened to be at the END of the queue,
not the one belonging to the failing dispatch. Handlers for
sibling dispatches would execute in whatever order `ready`'s
rejection chain fired, and `pop()` would reject them in reverse
queue order.

Currently unreachable because `chosen.ready` never rejects in the
current pool impl (the ready promise has no failure path), but
latent — any future addition of a startup timeout or init failure
would expose the bug.

Fix: hold a direct reference to `ourSlot` in each dispatch's
closure. The readiness-failure handler finds its slot via
`indexOf(ourSlot)` and splices that specific entry, preserving
FIFO semantics for sibling dispatches.
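The identity-based removal reduces to a few lines (shapes assumed):

```typescript
// pop() drops whichever slot happens to be LAST in the queue; splicing
// the dispatch's own slot by identity leaves sibling dispatches in
// FIFO order.
type Slot = { reject: (err: Error) => void };

function removeOwnSlot(pending: Slot[], ourSlot: Slot): void {
  const i = pending.indexOf(ourSlot);
  if (i !== -1) pending.splice(i, 1);
}
```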

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean
- [x] `bun test --timeout 15000 test/lib test/commands test/types`
      — **5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3 --warmup 1` — no perf
      regression (`scan.grepFiles` 160ms)
- [x] Rebuilt binary, ran `project view` from empty dir 3×
      — all exit cleanly.
@BYK BYK merged commit 35d4d2d into main Apr 22, 2026
26 checks passed
@BYK BYK deleted the byk/perf/grep-workers branch April 22, 2026 08:34
BYK added a commit that referenced this pull request Apr 22, 2026
## Summary

Follow-up to #791. Replaces the init-wizard's `rg → git grep → fs walk`
fallback chain in `src/lib/init/tools/grep.ts` + `glob.ts` with thin
adapters over the pure-TS `collectGrep` / `collectGlob` helpers from
`src/lib/scan/`.

The Mastra wire contract is preserved byte-identical on all existing
fields. Two optional fields (`caseInsensitive`, `multiline`) are added
to `GrepSearch` for future-proofing — no current server invocation sets
them.

## What changed

### Adapters (net −392 LOC of production code)

| File | Before | After |
|---|---:|---:|
| `src/lib/init/tools/grep.ts` | 337 LOC | 114 LOC |
| `src/lib/init/tools/glob.ts` | 147 LOC | 73 LOC |
| `src/lib/init/tools/search-utils.ts` | 146 LOC | **deleted** |
| `src/lib/init/types.ts` | — | +17 LOC (two optional fields) |

The adapters now just:
1. Sandbox the user-supplied `search.path` via `safePath`.
2. Forward each search/pattern to `collectGrep` / `collectGlob` with the
wire-level constants (`maxResults`, `maxLineLength`) plumbed through.
3. Strip `absolutePath` from each `GrepMatch` — the Mastra wire has
never included it.
4. Catch `ValidationError` from bad regex so a single bad pattern
surfaces as an empty per-search row rather than aborting the whole
payload.
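Steps 3 and 4 can be sketched as pure functions (shapes and names are illustrative; the real adapter delegates to `collectGrep` and `safePath` from `src/lib/scan/`):

```typescript
// absolutePath is stripped before matches cross the Mastra wire, and a
// bad regex collapses to an empty per-search row instead of aborting
// the whole payload.
type RawMatch = { path: string; absolutePath: string; line: number; text: string };
type WireMatch = { path: string; line: number; text: string };

function toWireMatches(matches: RawMatch[]): WireMatch[] {
  return matches.map((m) => ({ path: m.path, line: m.line, text: m.text }));
}

function safeCompile(pattern: string): RegExp | null {
  try {
    return new RegExp(pattern);
  } catch {
    return null; // bad pattern: caller emits an empty row, payload survives
  }
}
```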

### New optional `GrepSearch` fields

```ts
export type GrepSearch = {
  pattern: string;
  path?: string;
  include?: string;
  caseInsensitive?: boolean;  // NEW
  multiline?: boolean;         // NEW
};
```

No current Mastra server invocation sets these. Adding them now means
the server can start sending them without a CLI update. The underlying
scan engine natively supports both.

### Tests

- **Deleted 3 obsolete tests** that shadowed `rg` in `PATH` to force the
fallback chain. With the pure-TS adapter there's no fallback chain to
exercise; the tests had become tautological.
- **Deleted scaffolding**: `writeExecutable`, `setPath`, `helperBinDir`,
`savedPath` — only used by the deleted tests.
- **Kept 3 pre-existing tests** unchanged: include filters,
multi-pattern glob, sandbox rejection.
- **Added 3 new adapter-specific tests**:
- `grep result matches MUST NOT include absolutePath` — pins the strip
behavior so `absolutePath` never leaks to the Mastra agent.
- `grep bad regex yields empty matches without crashing the payload` —
documents the `ValidationError` catch contract.
- `grep caseInsensitive flag enables case-insensitive matching` —
end-to-end coverage for the new wire field.

## Behavior changes (intentional, only affects users without `rg`/`git`)

Before this PR, the init-wizard fs-walk fallback was naive: no
`.gitignore` handling, narrow skip list, no binary detection. Users with
`rg` or `git` installed never took this path. After this PR, **every
user gets rg-like behavior** via the pure-TS scanner:

- **Nested `.gitignore`** honored (cumulative semantics, matching git +
rg).
- **Wider skip-dir list** — scan excludes `.next`, `target`, `vendor`,
`coverage`, `.cache`, `.turbo` in addition to the old skip set.
- **Binary files filtered** — scan runs an 8 KB NUL-byte sniff before
grep-ing.

Users with `rg` installed see no behavior change relative to the rg-path
of the old adapter.
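The binary sniff is the classic NUL-byte heuristic; a sketch under the 8 KB window the PR describes (function name is illustrative):

```typescript
// A NUL byte in the first 8 KB marks the file binary -- the same
// heuristic git and ripgrep use to skip non-text files.
const SNIFF_BYTES = 8 * 1024;

function looksBinary(head: Uint8Array): boolean {
  const limit = Math.min(head.length, SNIFF_BYTES);
  for (let i = 0; i < limit; i++) {
    if (head[i] === 0) return true;
  }
  return false;
}
```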

## Benchmarks

Measured after rebasing onto `main` (includes #791, #804, #805, #807 —
the full grep/worker-pool stack).

**Fixture:** 10k files (~80 MB), 3 monorepo packages, mix of text +
binary.
**Config:** 5 runs after 2 warmup, pre-warmed page cache.
**Machine:** linux/x64, 4 CPUs, Bun 1.3.11. ripgrep 14.1.0.
**Harness:** `/tmp/bench-init-tools.ts` — invokes the adapter via the
real `grep(GrepPayload)` entry point Mastra uses.

### Adapter vs rg — single search

| Pattern | rg | NEW uncapped | NEW cap=100 | uncapped vs rg | cap vs rg |
|---|---:|---:|---:|---:|---:|
| `import.*from` (215k matches) | 284 ms | 769 ms | **28.4 ms** | 2.70× slower | **10× FASTER** |
| `function\s+\w+` (216k) | 306 ms | 782 ms | **28.2 ms** | 2.56× slower | **11× FASTER** |
| `SENTRY_DSN` (677) | 81 ms | 496 ms | 79.3 ms | 6.10× slower | parity |
| no matches | 74 ms | 486 ms | 492 ms | 6.6× slower | 6.7× slower |

The init-wizard workload is always capped at 100. On dense-match
patterns (the common case — Mastra greps for common code patterns like
imports and function signatures) the worker pool's literal-prefilter +
concurrent fan-out + early-exit hit 100 matches in ~28 ms, **10× faster
than rg**. On rare/no-match patterns we're walker-bound at ~500 ms —
still fine for init wizard.

### Adapter multi-search (realistic Mastra workload)

Payload: 3 patterns (`import.*from`, `SENTRY_DSN`, `function\s+\w+`) in
parallel via `Promise.all`, `maxResultsPerSearch: 100`.

| Metric | Value |
|---|---:|
| p50 end-to-end | **109 ms** |

All three searches share the same worker pool (dispatched concurrently).

### Scan-harness numbers (context)

From `bun run bench --size large`, showing that the new adapter is
routed through the worker-backed grep pipeline:

| Op | p50 |
|---|---:|
| `scan.grepFiles` | 166 ms |
| `scanCodeForDsns` | 232 ms |
| `detectDsn.cold` | 4.78 ms |

## Why ship this

Init wizard grep has one caller: the Mastra server, which always sends
`maxResults: 100`. For that workload the NEW adapter is **10× faster
than rg** on common dense patterns and sidesteps the
subprocess-dependency problem:

- Users without `rg` no longer fall back to a naive walk that ignores
gitignore and scans binaries.
- No spawning, no stderr draining, no exit-code decoding, no
parse-out-pipe-delimited-output.
- `GrepSearch` gains `caseInsensitive` + `multiline` — passthrough to
the scan engine's native support.

The correctness tax (nested `.gitignore` respected, wider skip list,
binaries filtered) is paid once per scan regardless of match density. On
the init wizard's capped workload it's noise; on exhaustive scans it's
~500 ms on 10k files vs rg's ~300 ms.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in
`src/lib/formatters/markdown.ts:281`, not touched by this PR)
- [x] `bun test test/lib/init/ test/lib/scan/` — **352 pass, 0 fail**
- [x] Rebased onto latest main (includes #791, #804, #805, #807)
- [x] `bun run /tmp/bench-init-tools.ts` — numbers in table above

## What this PR does NOT change

- `GrepPayload` / `GlobPayload` — unchanged structurally, only
`GrepSearch` gains optional fields
- `src/lib/init/tools/registry.ts` — tool dispatch unchanged
- `src/lib/init/tools/shared.ts` — `safePath` + `validateToolSandbox`
unchanged
- Response shape — `{ ok: true, data: { results: [{pattern, matches,
truncated}] } }` identical

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BYK added a commit that referenced this pull request Apr 22, 2026
## Summary

Cleanup pass over the `scan/` and `dsn/` modules after the grep +
worker-pool stack (#791, #804, #805, #807, #797) landed. Removed comment
bloat accumulated across the 6+ review cycles those PRs went through —
redundant bug-history narration, repeated explanations of `ref/unref`
boolean semantics, "pre-PR-N" references, and other scars that wouldn't
survive a fresh-eyes read.

**Net −708 LOC across 12 files. No behavior changes.**

## Per-file reductions

| File | Before | After | Δ |
|---|---:|---:|---:|
| `src/lib/scan/worker-pool.ts` | 466 | 312 | −33% |
| `src/lib/scan/grep.ts` | 985 | 712 | −28% |
| `src/lib/dsn/code-scanner.ts` | 541 | 377 | −30% |
| `src/lib/scan/grep-worker.js` | 153 | 114 | −25% |
| `src/lib/dsn/scan-options.ts` | 70 | 52 | −26% |
| `src/lib/init/tools/grep.ts` | 122 | 98 | −20% |
| `src/lib/init/tools/glob.ts` | 74 | 59 | −20% |

Plus minor trims in `types.ts`, `walker.ts`, `path-utils.ts`,
`scan/glob.ts`, and `script/text-import-plugin.ts`.

## What was removed

- Redundant explanations of `Worker.ref()` / `.unref()` boolean
semantics (stated 3× in `worker-pool.ts` — kept once, on the primary
ref/unref helper pair).
- Multi-paragraph "earlier iteration did X, deadlocked, so we now do Y"
histories — kept in git log where they belong.
- "pre-PR-3" / "pre-refactor" / "previous version" narration that
explained how code used to look before the current session's work.
- File-header docstrings that repeated what the module structure and
exports already told you.
- Rationale comments for `biome-ignore`s that were already justified by
adjacent context.

## What was kept

- Every `biome-ignore` comment (all still justified).
- Every "why" comment tied to a real gotcha (e.g., the length-change
warning on case-insensitive literal prefilters, the `/g` flag cloning
rationale, the pipeline-failure detector explanation).
- Every JSDoc on exported functions and types.
- All inline comments that explain non-obvious constraints (sandbox
enforcement, cache-contract stability, etc.).

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing markdown warning)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3` — no perf regression
(`scan.grepFiles` 167ms, `scanCodeForDsns` 232ms — matching pre-trim
numbers)
- [x] `bun run build --single` — binary builds and exits cleanly on
`sentry project view` from empty dir (3 consecutive runs, all exit=1)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>