refactor(init): migrate grep/glob tools to src/lib/scan/#797
Merged
Conversation
Contributor
Semver Impact of This PR🟢 Patch (bug fixes) 📋 Changelog PreviewThis is how your changes will appear in the changelog. New Features ✨
Bug Fixes 🐛
Documentation 📚
Internal Changes 🔧Init
Other
🤖 This preview updates automatically when you update the PR. |
Contributor
|
Contributor
Codecov Results 📊✅ 138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms 📊 Comparison with Base Branch
✨ No test changes detected All tests are passing successfully. ✅ Patch coverage is 100.00%. Project has 2023 uncovered lines. Coverage diff@@ Coverage Diff @@
## main #PR +/-##
==========================================
- Coverage 95.06% 95.02% -0.04%
==========================================
Files 283 282 -1
Lines 40924 40660 -264
Branches 0 0 —
==========================================
+ Hits 38904 38637 -267
- Misses 2020 2023 +3
- Partials 0 0 —Generated by Codecov Action |
betegon
approved these changes
Apr 21, 2026
Member
betegon
left a comment
There was a problem hiding this comment.
thanks for this. works great with init!
6 tasks
BYK
added a commit
that referenced
this pull request
Apr 21, 2026
## Summary Two regex-level optimizations to narrow the perf gap with ripgrep on our pure-TS `collectGrep`/`grepFiles`. Follow-up to PR #791 and #797. - **Literal prefilter** — ripgrep-style: extract a literal substring from the regex source (e.g., `import` from `import.*from`), scan the buffer with `indexOf` to locate candidate lines, only invoke the regex engine on lines that contain the literal. V8's `indexOf` is roughly SIMD-speed; skipping the regex engine on non-candidate lines is where most of the win comes from. - **Lazy line counting** — swapped `charCodeAt`-walk for `indexOf("\n", cursor)` hops. 2-5× faster on the line-counting sub-loop because V8 implements `indexOf` in C++ without per-iteration JS interop. ## Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core) | Op | Before | After | Δ | |---|---:|---:|---:| | `scan.grepFiles` (DSN pattern) | 370 ms | **318 ms** | **−14%** | | `detectAllDsns.cold` | 363 ms | **313 ms** | **−14%** | | `detectDsn.cold` | 7.73 ms | **5.61 ms** | **−27%** | | `scanCodeForFirstDsn` | 2.91 ms | **2.13 ms** | **−27%** | | `scanCodeForDsns` | 342 ms | 333 ms | −3% (noise-equivalent) | | `import.*from` uncapped (bench) | 1489 ms | **1178 ms** | **−21%** | The DSN workloads improve because `DSN_PATTERN` extracts `http` as its literal — most source files don't contain `http` at all, so the prefilter short-circuits before the regex runs. No regressions on any benchmark. Pure-literal patterns (e.g., `SENTRY_DSN`, `NONEXISTENT_TOKEN_XYZ`) continue through the whole-buffer path unchanged. ## What changed ### New file: `src/lib/scan/literal-extract.ts` (~300 LOC) Conservative literal extractor. Walks a regex source looking for the longest contiguous run of literal bytes that every match must contain. Bails out safely on top-level alternation, character classes, groups, lookarounds, quantifiers, and escape classes. Handles escaped metacharacters intelligently: `Sentry\.init` yields `Sentry.init` (extracted via literal `\.` → `.`), while `\bfoo\b` yields `foo` (escape `\b` is an anchor, not a literal `b`). Exports: - `extractInnerLiteral(source, flags)` — returns the literal, or null if no safe extraction possible. Honors `/i` by lowercasing. - `isPureLiteral(source, flags)` — true when the pattern IS a bare literal with no metacharacters. Used by the grep pipeline to route pure-literals to the whole-buffer path (V8's regex engine is hyper-optimized for pure-literal patterns; the prefilter adds overhead without benefit there). ### Modified: `src/lib/scan/grep.ts` (~240 LOC changes) Three-way dispatch in `readAndGrep` based on the extracted literal: 1. **`grepByLiteralPrefilter`** (new) — regex with extractable literal + `multiline: true`. Uses `indexOf(literal)` to find candidate lines, runs the regex engine only on those. This is the main perf win. 2. **`grepByWholeBuffer`** — existing path, used for: - Pure-literal patterns (V8 handles them optimally) - Patterns with no extractable literal (complex regex, top-level alternation) - `multiline: false` mode (the fast path requires per-line semantics) Also: replaced the `charCodeAt`-walk that counted newlines char-by-char with an `indexOf("\n", cursor)` hop loop. Extracted `buildMatch(ctx, bounds)` as a shared helper to bundle the match-construction arguments. ### Tests added - `test/lib/scan/literal-extract.test.ts` — **39 tests** covering the extractor's rules (escape handling, quantifier drop, alternation bail, case-insensitive, minimum length). - `test/lib/scan/grep.test.ts` — **7 new tests** for the prefilter fast path: correctness vs whole-buffer, escaped-literal extraction, case-insensitive flag, zero-literal-hit short-circuit, routing of pure literals to whole-buffer, and alternation routing. ## Why this approach From the ripgrep research (attached to PR #791): rg's central perf trick is extracting a literal from each regex and prefiltering with SIMD memchr. V8 doesn't expose SIMD directly but its `String.prototype.indexOf` is compiled to a tight byte-level loop with internal SIMD on x64 — functionally equivalent for our use case. Three of the five techniques in the Loggly regex-perf guide were evaluated: - **Character classes over `.*`** — `DSN_PATTERN` already uses `[a-z0-9]+`, no change needed. - **Alternation order** — `DSN_PATTERN`'s `(?:\.[a-z]+|:[0-9]+)` is already correctly ordered (`.` more common than `:` in DSN hosts); swapping regressed perf by noise. - **Anchors/word boundaries** — adding `\b` to `DSN_PATTERN` *regressed* perf 2.8× on our workload. V8's existing fast character-mismatch rejection on the first byte outperforms the boundary check overhead. The remaining gap with rg is now primarily orchestration overhead (async/await, `mapFilesConcurrent`, walker correctness features) rather than regex speed. A worker-pool exploration may follow. ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning in `src/lib/formatters/markdown.ts` unrelated to this PR) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5610 pass, 0 fail** (+58 new) - [x] `bun test test/isolated` — 138 pass, 0 fail - [x] `bun run bench --size large --runs 5` — all scan ops at or below previous baseline - [x] Manually verified semantic parity: `collectGrep` returns identical `GrepMatch[]` on prefilter vs whole-buffer paths for patterns where the prefilter fires 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the rg → git grep → fs walk fallback chain in the init-wizard grep and glob tools with direct calls to the pure-TS `collectGrep` / `collectGlob` helpers from `src/lib/scan/` (shipped in PR #791). The Mastra wire contract is preserved verbatim on all existing fields. ## Changes - **`src/lib/init/tools/grep.ts`**: 299 → 114 LOC. Drops `rgGrepSearch`, `gitGrepSearch`, `fsGrepSearch`, `parseRgGrepOutput`, `parseGrepOutput`, `findRegexMatches`, `readSearchableFile`, `truncateMatchLine`, `limitMatches`, `compilePattern`. Replaces with a thin adapter that forwards each search to `collectGrep`, strips the `absolutePath` field (not part of the wire contract), and catches `ValidationError` from bad regexes so a single bad pattern doesn't abort the whole payload. - **`src/lib/init/tools/glob.ts`**: 145 → 73 LOC. Drops `rgGlobSearch`, `gitLsFiles`, `fsGlobSearch`. Replaces with a thin adapter that calls `collectGlob` per pattern (preserving per-pattern attribution, which `collectGlob`'s unioning would lose). - **`src/lib/init/tools/search-utils.ts`**: DELETED (146 LOC). Zero callers after the grep/glob rewrites. - **`src/lib/init/types.ts::GrepSearch`**: two optional fields added: `caseInsensitive?: boolean` and `multiline?: boolean`. No current Mastra server invocation sets them; both default to what the scan engine defaults to (case-sensitive, line-boundary anchoring — rg semantics). Future-proofing. - **`src/lib/scan/glob.ts`**: updated a stale doc comment that referenced the deleted `search-utils.ts::matchGlob`. ## Test changes - Drop 3 obsolete subprocess-fallback tests that shadowed `rg` in `PATH` to force the fallback chain. With the pure-TS adapter there's no fallback chain to exercise and the tests had become tautological. - Drop the `writeExecutable`, `setPath`, `helperBinDir`, `savedPath` scaffolding those tests depended on. - Keep 3 pre-existing wire-behavior/sandbox tests unchanged. - Add 3 adapter-specific tests: - `grep result matches MUST NOT include absolutePath` — pins the adapter's strip behavior (scan returns it; Mastra must not see it). - `grep bad regex yields empty matches without crashing the payload` — documents the `ValidationError` catch contract. - `grep caseInsensitive flag enables case-insensitive matching` — end-to-end coverage of the new wire field. ## Behavior changes (intentional, from scan module defaults) Three behavior shifts land for the pre-PR-791 fs-walk fallback (users without `rg` or `git` installed): - **Nested `.gitignore` now honored.** Old fs-walk fallback ignored gitignore entirely; scan respects cumulative gitignore semantics (matches what `rg` / git grep already did). - **Wider skip-dir list.** Scan skips `.next`, `target`, `vendor`, `coverage`, `.cache`, `.turbo` in addition to the old skip set — matches rg's built-in skips. - **Binary files filtered.** Scan runs an 8 KB NUL-byte sniff before emitting a file to grep; binary matches (e.g. inside a `.png`) no longer appear. Again, matches `rg`'s default. Users with `rg` installed see zero change — they never took the fallback path anyway. Users without rg/git get rg-like behavior instead of the old naive fs walk. ## Net LOC Before: 299 + 145 + 146 = 590 LOC across three files. After: 114 + 73 + 0 (+17 on types.ts) = 204 LOC across two files. **Net: −386 LOC of production code.** ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing warning in `markdown.ts`) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5507 pass, 0 fail** (+1 net: −3 obsolete, +4 new adapter tests, +absolutePath regression) - [x] `bun test test/isolated` — 138 pass - [x] Manual: `bun test test/lib/init/tools/` — 30 pass, wire contract preserved end-to-end via `executeTool` Follow-up to PR #791.
04d7527 to
1d98bc2
Compare
6 tasks
BYK
added a commit
that referenced
this pull request
Apr 22, 2026
## Summary Cleanup pass over the `scan/` and `dsn/` modules after the grep + worker-pool stack (#791, #804, #805, #807, #797) landed. Removed comment bloat accumulated across the 6+ review cycles those PRs went through — redundant bug-history narration, repeated explanations of `ref/unref` boolean semantics, "pre-PR-N" references, and other scars that wouldn't survive a fresh-eyes read. **Net −708 LOC across 12 files. No behavior changes.** ## Per-file reductions | File | Before | After | Δ | |---|---:|---:|---:| | `src/lib/scan/worker-pool.ts` | 466 | 312 | −33% | | `src/lib/scan/grep.ts` | 985 | 712 | −28% | | `src/lib/dsn/code-scanner.ts` | 541 | 377 | −30% | | `src/lib/scan/grep-worker.js` | 153 | 114 | −25% | | `src/lib/dsn/scan-options.ts` | 70 | 52 | −26% | | `src/lib/init/tools/grep.ts` | 122 | 98 | −20% | | `src/lib/init/tools/glob.ts` | 74 | 59 | −20% | Plus minor trims in `types.ts`, `walker.ts`, `path-utils.ts`, `scan/glob.ts`, and `script/text-import-plugin.ts`. ## What was removed - Redundant explanations of `Worker.ref()` / `.unref()` boolean semantics (stated 3× in `worker-pool.ts` — kept once, on the primary ref/unref helper pair). - Multi-paragraph "earlier iteration did X, deadlocked, so we now do Y" histories — kept in git log where they belong. - "pre-PR-3" / "pre-refactor" / "previous version" narration that explained how code used to look before the current session's work. - File-header docstrings that repeated what the module structure and exports already told you. - Rationale comments for `biome-ignore`s that were already justified by adjacent context. ## What was kept - Every `biome-ignore` comment (all still justified). - Every "why" comment tied to a real gotcha (e.g., the length-change warning on case-insensitive literal prefilters, the `/g` flag cloning rationale, the pipeline-failure detector explanation). - Every JSDoc on exported functions and types. - All inline comments that explain non-obvious constraints (sandbox enforcement, cache-contract stability, etc.). ## Test plan - [x] `bunx tsc --noEmit` — clean - [x] `bun run lint` — clean (1 pre-existing markdown warning) - [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5641 pass, 0 fail** - [x] `bun test test/isolated` — 138 pass - [x] `bun run bench --size large --runs 3` — no perf regression (`scan.grepFiles` 167ms, `scanCodeForDsns` 232ms — matching pre-trim numbers) - [x] `bun run build --single` — binary builds and exits cleanly on `sentry project view` from empty dir (3 consecutive runs, all exit=1) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #791. Replaces the init-wizard's
rg → git grep → fs walkfallback chain insrc/lib/init/tools/grep.ts+glob.tswith thin adapters over the pure-TScollectGrep/collectGlobhelpers fromsrc/lib/scan/.The Mastra wire contract is preserved byte-identical on all existing fields. Two optional fields (
caseInsensitive,multiline) are added toGrepSearchfor future-proofing — no current server invocation sets them.What changed
Adapters (net −392 LOC of production code)
src/lib/init/tools/grep.tssrc/lib/init/tools/glob.tssrc/lib/init/tools/search-utils.tssrc/lib/init/types.tsThe adapters now just:
search.pathviasafePath.collectGrep/collectGlobwith the wire-level constants (maxResults,maxLineLength) plumbed through.absolutePathfrom eachGrepMatch— the Mastra wire has never included it.ValidationErrorfrom bad regex so a single bad pattern surfaces as an empty per-search row rather than aborting the whole payload.New optional
GrepSearchfieldsNo current Mastra server invocation sets these. Adding them now means the server can start sending them without a CLI update. The underlying scan engine natively supports both.
Tests
rginPATHto force the fallback chain. With the pure-TS adapter there's no fallback chain to exercise; the tests had become tautological.writeExecutable,setPath,helperBinDir,savedPath— only used by the deleted tests.grep result matches MUST NOT include absolutePath— pins the strip behavior soabsolutePathnever leaks to the Mastra agent.grep bad regex yields empty matches without crashing the payload— documents theValidationErrorcatch contract.grep caseInsensitive flag enables case-insensitive matching— end-to-end coverage for the new wire field.Behavior changes (intentional, only affects users without
rg/git)Before this PR, the init-wizard fs-walk fallback was naive: no
.gitignorehandling, narrow skip list, no binary detection. Users withrgorgitinstalled never took this path. After this PR, every user gets rg-like behavior via the pure-TS scanner:.gitignorehonored (cumulative semantics, matching git + rg)..next,target,vendor,coverage,.cache,.turboin addition to the old skip set.Users with
rginstalled see no behavior change relative to the rg-path of the old adapter.Benchmarks
Measured after rebasing onto
main(includes #791, #804, #805, #807 — the full grep/worker-pool stack).Fixture: 10k files (~80 MB), 3 monorepo packages, mix of text + binary.
Config: 5 runs after 2 warmup, pre-warmed page cache.
Machine: linux/x64, 4 CPUs, Bun 1.3.11. ripgrep 14.1.0.
Harness:
/tmp/bench-init-tools.ts— invokes the adapter via the realgrep(GrepPayload)entry point Mastra uses.Adapter vs rg — single search
import.*from(215k matches)function\s+\w+(216k)SENTRY_DSN(677)The init-wizard workload is always capped at 100. On dense-match patterns (the common case — Mastra greps for common code patterns like imports and function signatures) the worker pool's literal-prefilter + concurrent fan-out + early-exit hit 100 matches in ~28 ms, 10× faster than rg. On rare/no-match patterns we're walker-bound at ~500 ms — still fine for init wizard.
Adapter multi-search (realistic Mastra workload)
Payload: 3 patterns (
import.*from,SENTRY_DSN,function\s+\w+) in parallel viaPromise.all,maxResultsPerSearch: 100.All three searches share the same worker pool (dispatched concurrently).
Scan-harness numbers (context)
From
bun run bench --size large, showing that the new adapter is routed through the worker-backed grep pipeline:scan.grepFilesscanCodeForDsnsdetectDsn.coldWhy ship this
Init wizard grep has one caller: the Mastra server, which always sends
maxResults: 100. For that workload the NEW adapter is 10× faster than rg on common dense patterns and sidesteps the subprocess-dependency problem:rgno longer fall back to a naive walk that ignores gitignore and scans binaries.GrepSearchgainscaseInsensitive+multiline— passthrough to the scan engine's native support.The correctness tax (nested
.gitignorerespected, wider skip list, binaries filtered) is paid once per scan regardless of match density. On the init wizard's capped workload it's noise; on exhaustive scans it's ~500 ms on 10k files vs rg's ~300 ms.Test plan
bunx tsc --noEmit— cleanbun run lint— clean (1 pre-existing warning insrc/lib/formatters/markdown.ts:281, not touched by this PR)bun test test/lib/init/ test/lib/scan/— 352 pass, 0 failbun run /tmp/bench-init-tools.ts— numbers in table aboveWhat this PR does NOT change
GrepPayload/GlobPayload— unchanged structurally, onlyGrepSearchgains optional fieldssrc/lib/init/tools/registry.ts— tool dispatch unchangedsrc/lib/init/tools/shared.ts—safePath+validateToolSandboxunchanged{ ok: true, data: { results: [{pattern, matches, truncated}] } }identical🤖 Generated with Claude Code
Co-authored-by: Claude Opus 4.7 (1M context) noreply@anthropic.com