Skip to content

refactor(init): migrate grep/glob tools to src/lib/scan/#797

Merged
BYK merged 1 commit intomainfrom
byk/feat/ripgrep-init-wizard
Apr 22, 2026
Merged

refactor(init): migrate grep/glob tools to src/lib/scan/#797
BYK merged 1 commit intomainfrom
byk/feat/ripgrep-init-wizard

Conversation

@BYK
Copy link
Copy Markdown
Member

@BYK BYK commented Apr 21, 2026

Summary

Follow-up to #791. Replaces the init-wizard's rg → git grep → fs walk fallback chain in src/lib/init/tools/grep.ts + glob.ts with thin adapters over the pure-TS collectGrep / collectGlob helpers from src/lib/scan/.

The Mastra wire contract is preserved byte-identical on all existing fields. Two optional fields (caseInsensitive, multiline) are added to GrepSearch for future-proofing — no current server invocation sets them.

What changed

Adapters (net −392 LOC of production code)

File Before After
src/lib/init/tools/grep.ts 337 LOC 114 LOC
src/lib/init/tools/glob.ts 147 LOC 73 LOC
src/lib/init/tools/search-utils.ts 146 LOC deleted
src/lib/init/types.ts +17 LOC (two optional fields)

The adapters now just:

  1. Sandbox the user-supplied search.path via safePath.
  2. Forward each search/pattern to collectGrep / collectGlob with the wire-level constants (maxResults, maxLineLength) plumbed through.
  3. Strip absolutePath from each GrepMatch — the Mastra wire has never included it.
  4. Catch ValidationError from bad regex so a single bad pattern surfaces as an empty per-search row rather than aborting the whole payload.

New optional GrepSearch fields

export type GrepSearch = {
  pattern: string;
  path?: string;
  include?: string;
  caseInsensitive?: boolean;  // NEW
  multiline?: boolean;         // NEW
};

No current Mastra server invocation sets these. Adding them now means the server can start sending them without a CLI update. The underlying scan engine natively supports both.

Tests

  • Deleted 3 obsolete tests that shadowed rg in PATH to force the fallback chain. With the pure-TS adapter there's no fallback chain to exercise; the tests had become tautological.
  • Deleted scaffolding: writeExecutable, setPath, helperBinDir, savedPath — only used by the deleted tests.
  • Kept 3 pre-existing tests unchanged: include filters, multi-pattern glob, sandbox rejection.
  • Added 3 new adapter-specific tests:
    • grep result matches MUST NOT include absolutePath — pins the strip behavior so absolutePath never leaks to the Mastra agent.
    • grep bad regex yields empty matches without crashing the payload — documents the ValidationError catch contract.
    • grep caseInsensitive flag enables case-insensitive matching — end-to-end coverage for the new wire field.

Behavior changes (intentional, only affects users without rg/git)

Before this PR, the init-wizard fs-walk fallback was naive: no .gitignore handling, narrow skip list, no binary detection. Users with rg or git installed never took this path. After this PR, every user gets rg-like behavior via the pure-TS scanner:

  • Nested .gitignore honored (cumulative semantics, matching git + rg).
  • Wider skip-dir list — scan excludes .next, target, vendor, coverage, .cache, .turbo in addition to the old skip set.
  • Binary files filtered — scan runs an 8 KB NUL-byte sniff before grep-ing.

Users with rg installed see no behavior change relative to the rg-path of the old adapter.

Benchmarks

Measured after rebasing onto main (includes #791, #804, #805, #807 — the full grep/worker-pool stack).

Fixture: 10k files (~80 MB), 3 monorepo packages, mix of text + binary.
Config: 5 runs after 2 warmup, pre-warmed page cache.
Machine: linux/x64, 4 CPUs, Bun 1.3.11. ripgrep 14.1.0.
Harness: /tmp/bench-init-tools.ts — invokes the adapter via the real grep(GrepPayload) entry point Mastra uses.

Adapter vs rg — single search

Pattern rg NEW uncapped NEW cap=100 uncapped vs rg cap vs rg
import.*from (215k matches) 284 ms 769 ms 28.4 ms 2.70× slower 10× FASTER
function\s+\w+ (216k) 306 ms 782 ms 28.2 ms 2.56× slower 11× FASTER
SENTRY_DSN (677) 81 ms 496 ms 79.3 ms 6.10× slower parity
no matches 74 ms 486 ms 492 ms 6.6× slower 6.7× slower

The init-wizard workload is always capped at 100. On dense-match patterns (the common case — Mastra greps for common code patterns like imports and function signatures) the worker pool's literal-prefilter + concurrent fan-out + early-exit hit 100 matches in ~28 ms, 10× faster than rg. On rare/no-match patterns we're walker-bound at ~500 ms — still fine for init wizard.

Adapter multi-search (realistic Mastra workload)

Payload: 3 patterns (import.*from, SENTRY_DSN, function\s+\w+) in parallel via Promise.all, maxResultsPerSearch: 100.

Metric Value
p50 end-to-end 109 ms

All three searches share the same worker pool (dispatched concurrently).

Scan-harness numbers (context)

From bun run bench --size large, showing that the new adapter is routed through the worker-backed grep pipeline:

Op p50
scan.grepFiles 166 ms
scanCodeForDsns 232 ms
detectDsn.cold 4.78 ms

Why ship this

Init wizard grep has one caller: the Mastra server, which always sends maxResults: 100. For that workload the NEW adapter is 10× faster than rg on common dense patterns and sidesteps the subprocess-dependency problem:

  • Users without rg no longer fall back to a naive walk that ignores gitignore and scans binaries.
  • No spawning, no stderr draining, no exit-code decoding, no parse-out-pipe-delimited-output.
  • GrepSearch gains caseInsensitive + multiline — passthrough to the scan engine's native support.

The correctness tax (nested .gitignore respected, wider skip list, binaries filtered) is paid once per scan regardless of match density. On the init wizard's capped workload it's noise; on exhaustive scans it's ~500 ms on 10k files vs rg's ~300 ms.

Test plan

What this PR does NOT change

  • GrepPayload / GlobPayload — unchanged structurally, only GrepSearch gains optional fields
  • src/lib/init/tools/registry.ts — tool dispatch unchanged
  • src/lib/init/tools/shared.tssafePath + validateToolSandbox unchanged
  • Response shape — { ok: true, data: { results: [{pattern, matches, truncated}] } } identical

🤖 Generated with Claude Code

Co-authored-by: Claude Opus 4.7 (1M context) noreply@anthropic.com

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

Semver Impact of This PR

🟢 Patch (bug fixes)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).


New Features ✨

Bug Fixes 🐛

Documentation 📚

  • Fix auth token precedence, update stale architecture tree, and documentation audit report by cursor in #783

Internal Changes 🔧

Init

  • Migrate grep/glob tools to src/lib/scan/ by BYK in #797
  • Trim deprecated --features help entries by MathurAditya724 in #781

Other

  • (issue) Skip redundant API lookups via project+issue-org caches by BYK in #794
  • Regenerate docs by github-actions[bot] in 58a84035

🤖 This preview updates automatically when you update the PR.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://cli.sentry.dev/_preview/pr-797/

Built to branch gh-pages at 2026-04-22 08:56 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@BYK BYK requested a review from betegon April 21, 2026 11:48
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

Codecov Results 📊

138 passed | Total: 138 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

Metric Change
Total Tests
Passed Tests
Failed Tests
Skipped Tests

✨ No test changes detected

All tests are passing successfully.

✅ Patch coverage is 100.00%. Project has 2023 uncovered lines.
❌ Project coverage is 95.02%. Comparing base (base) to head (head).

Coverage diff
@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
- Coverage    95.06%    95.02%    -0.04%
==========================================
  Files          283       282        -1
  Lines        40924     40660      -264
  Branches         0         0         —
==========================================
+ Hits         38904     38637      -267
- Misses        2020      2023        +3
- Partials         0         0         —

Generated by Codecov Action

Copy link
Copy Markdown
Member

@betegon betegon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this. works great with init!

BYK added a commit that referenced this pull request Apr 21, 2026
## Summary

Two regex-level optimizations to narrow the perf gap with ripgrep on our
pure-TS `collectGrep`/`grepFiles`. Follow-up to PR #791 and #797.

- **Literal prefilter** — ripgrep-style: extract a literal substring
from the regex source (e.g., `import` from `import.*from`), scan the
buffer with `indexOf` to locate candidate lines, only invoke the regex
engine on lines that contain the literal. V8's `indexOf` is roughly
SIMD-speed; skipping the regex engine on non-candidate lines is where
most of the win comes from.
- **Lazy line counting** — swapped `charCodeAt`-walk for `indexOf("\n",
cursor)` hops. 2-5× faster on the line-counting sub-loop because V8
implements `indexOf` in C++ without per-iteration JS interop.

## Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core)

| Op | Before | After | Δ |
|---|---:|---:|---:|
| `scan.grepFiles` (DSN pattern) | 370 ms | **318 ms** | **−14%** |
| `detectAllDsns.cold` | 363 ms | **313 ms** | **−14%** |
| `detectDsn.cold` | 7.73 ms | **5.61 ms** | **−27%** |
| `scanCodeForFirstDsn` | 2.91 ms | **2.13 ms** | **−27%** |
| `scanCodeForDsns` | 342 ms | 333 ms | −3% (noise-equivalent) |
| `import.*from` uncapped (bench) | 1489 ms | **1178 ms** | **−21%** |

The DSN workloads improve because `DSN_PATTERN` extracts `http` as its
literal — most source files don't contain `http` at all, so the
prefilter short-circuits before the regex runs.

No regressions on any benchmark. Pure-literal patterns (e.g.,
`SENTRY_DSN`, `NONEXISTENT_TOKEN_XYZ`) continue through the whole-buffer
path unchanged.

## What changed

### New file: `src/lib/scan/literal-extract.ts` (~300 LOC)

Conservative literal extractor. Walks a regex source looking for the
longest contiguous run of literal bytes that every match must contain.
Bails out safely on top-level alternation, character classes, groups,
lookarounds, quantifiers, and escape classes.

Handles escaped metacharacters intelligently: `Sentry\.init` yields
`Sentry.init` (extracted via literal `\.` → `.`), while `\bfoo\b` yields
`foo` (escape `\b` is an anchor, not a literal `b`).

Exports:
- `extractInnerLiteral(source, flags)` — returns the literal, or null if
no safe extraction possible. Honors `/i` by lowercasing.
- `isPureLiteral(source, flags)` — true when the pattern IS a bare
literal with no metacharacters. Used by the grep pipeline to route
pure-literals to the whole-buffer path (V8's regex engine is
hyper-optimized for pure-literal patterns; the prefilter adds overhead
without benefit there).

### Modified: `src/lib/scan/grep.ts` (~240 LOC changes)

Three-way dispatch in `readAndGrep` based on the extracted literal:

1. **`grepByLiteralPrefilter`** (new) — regex with extractable literal +
`multiline: true`. Uses `indexOf(literal)` to find candidate lines, runs
the regex engine only on those. This is the main perf win.
2. **`grepByWholeBuffer`** — existing path, used for:
   - Pure-literal patterns (V8 handles them optimally)
- Patterns with no extractable literal (complex regex, top-level
alternation)
   - `multiline: false` mode (the fast path requires per-line semantics)

Also: replaced the `charCodeAt`-walk that counted newlines char-by-char
with an `indexOf("\n", cursor)` hop loop. Extracted `buildMatch(ctx,
bounds)` as a shared helper to bundle the match-construction arguments.

### Tests added

- `test/lib/scan/literal-extract.test.ts` — **39 tests** covering the
extractor's rules (escape handling, quantifier drop, alternation bail,
case-insensitive, minimum length).
- `test/lib/scan/grep.test.ts` — **7 new tests** for the prefilter fast
path: correctness vs whole-buffer, escaped-literal extraction,
case-insensitive flag, zero-literal-hit short-circuit, routing of pure
literals to whole-buffer, and alternation routing.

## Why this approach

From the ripgrep research (attached to PR #791): rg's central perf trick
is extracting a literal from each regex and prefiltering with SIMD
memchr. V8 doesn't expose SIMD directly but its
`String.prototype.indexOf` is compiled to a tight byte-level loop with
internal SIMD on x64 — functionally equivalent for our use case.

Three of the five techniques in the Loggly regex-perf guide were
evaluated:
- **Character classes over `.*`** — `DSN_PATTERN` already uses
`[a-z0-9]+`, no change needed.
- **Alternation order** — `DSN_PATTERN`'s `(?:\.[a-z]+|:[0-9]+)` is
already correctly ordered (`.` more common than `:` in DSN hosts);
swapping regressed perf by noise.
- **Anchors/word boundaries** — adding `\b` to `DSN_PATTERN` *regressed*
perf 2.8× on our workload. V8's existing fast character-mismatch
rejection on the first byte outperforms the boundary check overhead.

The remaining gap with rg is now primarily orchestration overhead
(async/await, `mapFilesConcurrent`, walker correctness features) rather
than regex speed. A worker-pool exploration may follow.

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in
`src/lib/formatters/markdown.ts` unrelated to this PR)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5610 pass, 0 fail** (+58 new)
- [x] `bun test test/isolated` — 138 pass, 0 fail
- [x] `bun run bench --size large --runs 5` — all scan ops at or below
previous baseline
- [x] Manually verified semantic parity: `collectGrep` returns identical
`GrepMatch[]` on prefilter vs whole-buffer paths for patterns where the
prefilter fires

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the rg → git grep → fs walk fallback chain in the init-wizard
grep and glob tools with direct calls to the pure-TS `collectGrep` /
`collectGlob` helpers from `src/lib/scan/` (shipped in PR #791). The
Mastra wire contract is preserved verbatim on all existing fields.

## Changes

- **`src/lib/init/tools/grep.ts`**: 299 → 114 LOC. Drops
  `rgGrepSearch`, `gitGrepSearch`, `fsGrepSearch`, `parseRgGrepOutput`,
  `parseGrepOutput`, `findRegexMatches`, `readSearchableFile`,
  `truncateMatchLine`, `limitMatches`, `compilePattern`. Replaces with
  a thin adapter that forwards each search to `collectGrep`, strips
  the `absolutePath` field (not part of the wire contract), and
  catches `ValidationError` from bad regexes so a single bad pattern
  doesn't abort the whole payload.

- **`src/lib/init/tools/glob.ts`**: 145 → 73 LOC. Drops
  `rgGlobSearch`, `gitLsFiles`, `fsGlobSearch`. Replaces with a thin
  adapter that calls `collectGlob` per pattern (preserving
  per-pattern attribution, which `collectGlob`'s unioning would
  lose).

- **`src/lib/init/tools/search-utils.ts`**: DELETED (146 LOC). Zero
  callers after the grep/glob rewrites.

- **`src/lib/init/types.ts::GrepSearch`**: two optional fields added:
  `caseInsensitive?: boolean` and `multiline?: boolean`. No current
  Mastra server invocation sets them; both default to what the scan
  engine defaults to (case-sensitive, line-boundary anchoring — rg
  semantics). Future-proofing.

- **`src/lib/scan/glob.ts`**: updated a stale doc comment that
  referenced the deleted `search-utils.ts::matchGlob`.

## Test changes

- Drop 3 obsolete subprocess-fallback tests that shadowed `rg` in
  `PATH` to force the fallback chain. With the pure-TS adapter there's
  no fallback chain to exercise and the tests had become tautological.
- Drop the `writeExecutable`, `setPath`, `helperBinDir`, `savedPath`
  scaffolding those tests depended on.
- Keep 3 pre-existing wire-behavior/sandbox tests unchanged.
- Add 3 adapter-specific tests:
  - `grep result matches MUST NOT include absolutePath` — pins the
    adapter's strip behavior (scan returns it; Mastra must not see
    it).
  - `grep bad regex yields empty matches without crashing the payload`
    — documents the `ValidationError` catch contract.
  - `grep caseInsensitive flag enables case-insensitive matching` —
    end-to-end coverage of the new wire field.

## Behavior changes (intentional, from scan module defaults)

Three behavior shifts land for the pre-PR-791 fs-walk fallback
(users without `rg` or `git` installed):

- **Nested `.gitignore` now honored.** Old fs-walk fallback ignored
  gitignore entirely; scan respects cumulative gitignore semantics
  (matches what `rg` / git grep already did).
- **Wider skip-dir list.** Scan skips `.next`, `target`, `vendor`,
  `coverage`, `.cache`, `.turbo` in addition to the old skip set —
  matches rg's built-in skips.
- **Binary files filtered.** Scan runs an 8 KB NUL-byte sniff before
  emitting a file to grep; binary matches (e.g. inside a `.png`) no
  longer appear. Again, matches `rg`'s default.

Users with `rg` installed see zero change — they never took the
fallback path anyway. Users without rg/git get rg-like behavior
instead of the old naive fs walk.

## Net LOC

Before: 299 + 145 + 146 = 590 LOC across three files.
After: 114 + 73 + 0 (+17 on types.ts) = 204 LOC across two files.
**Net: −386 LOC of production code.**

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in `markdown.ts`)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` — **5507 pass, 0 fail** (+1 net: −3 obsolete, +4 new adapter tests, +absolutePath regression)
- [x] `bun test test/isolated` — 138 pass
- [x] Manual: `bun test test/lib/init/tools/` — 30 pass, wire contract preserved end-to-end via `executeTool`

Follow-up to PR #791.
@BYK BYK force-pushed the byk/feat/ripgrep-init-wizard branch from 04d7527 to 1d98bc2 Compare April 22, 2026 08:55
@BYK BYK merged commit a8bceda into main Apr 22, 2026
26 checks passed
@BYK BYK deleted the byk/feat/ripgrep-init-wizard branch April 22, 2026 09:08
BYK added a commit that referenced this pull request Apr 22, 2026
## Summary

Cleanup pass over the `scan/` and `dsn/` modules after the grep +
worker-pool stack (#791, #804, #805, #807, #797) landed. Removed comment
bloat accumulated across the 6+ review cycles those PRs went through —
redundant bug-history narration, repeated explanations of `ref/unref`
boolean semantics, "pre-PR-N" references, and other scars that wouldn't
survive a fresh-eyes read.

**Net −708 LOC across 12 files. No behavior changes.**

## Per-file reductions

| File | Before | After | Δ |
|---|---:|---:|---:|
| `src/lib/scan/worker-pool.ts` | 466 | 312 | −33% |
| `src/lib/scan/grep.ts` | 985 | 712 | −28% |
| `src/lib/dsn/code-scanner.ts` | 541 | 377 | −30% |
| `src/lib/scan/grep-worker.js` | 153 | 114 | −25% |
| `src/lib/dsn/scan-options.ts` | 70 | 52 | −26% |
| `src/lib/init/tools/grep.ts` | 122 | 98 | −20% |
| `src/lib/init/tools/glob.ts` | 74 | 59 | −20% |

Plus minor trims in `types.ts`, `walker.ts`, `path-utils.ts`,
`scan/glob.ts`, and `script/text-import-plugin.ts`.

## What was removed

- Redundant explanations of `Worker.ref()` / `.unref()` boolean
semantics (stated 3× in `worker-pool.ts` — kept once, on the primary
ref/unref helper pair).
- Multi-paragraph "earlier iteration did X, deadlocked, so we now do Y"
histories — kept in git log where they belong.
- "pre-PR-3" / "pre-refactor" / "previous version" narration that
explained how code used to look before the current session's work.
- File-header docstrings that repeated what the module structure and
exports already told you.
- Rationale comments for `biome-ignore`s that were already justified by
adjacent context.

## What was kept

- Every `biome-ignore` comment (all still justified).
- Every "why" comment tied to a real gotcha (e.g., the length-change
warning on case-insensitive literal prefilters, the `/g` flag cloning
rationale, the pipeline-failure detector explanation).
- Every JSDoc on exported functions and types.
- All inline comments that explain non-obvious constraints (sandbox
enforcement, cache-contract stability, etc.).

## Test plan

- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing markdown warning)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5641 pass, 0 fail**
- [x] `bun test test/isolated` — 138 pass
- [x] `bun run bench --size large --runs 3` — no perf regression
(`scan.grepFiles` 167ms, `scanCodeForDsns` 232ms — matching pre-trim
numbers)
- [x] `bun run build --single` — binary builds and exits cleanly on
`sentry project view` from empty dir (3 consecutive runs, all exit=1)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants