Skip to content

perf+chore: V1 speed milestone — Rust beats Java on 5 of 6 cells (PSM-zero-regression)#36

Merged
ypriverol merged 23 commits into
devfrom
feat/quality-perf-id-rate
May 27, 2026
Merged

perf+chore: V1 speed milestone — Rust beats Java on 5 of 6 cells (PSM-zero-regression)#36
ypriverol merged 23 commits into
devfrom
feat/quality-perf-id-rate

Conversation

@ypriverol
Copy link
Copy Markdown
Member

@ypriverol ypriverol commented May 26, 2026

V1 speed milestone — Rust beats Java on 5 of 6 dataset/mode cells

This PR is the cumulative V1 speed milestone for msgf-rust. Five profile-driven perf wins + the post-cutover code-quality cleanup are stacked on top of dev, gated throughout by a bit-identical PIN regression test — so Percolator @1% PSM counts are unchanged on every dataset/mode cell vs the pre-PR dev baseline.

Headline numbers

Wall time — Rust vs Java (same VM, same load, three reference datasets)

Dataset Mode Java Rust Δ vs Java Speedup
LFQ (PXD001819) cal=off 1:22 0:43 -47% 1.9×
LFQ cal=auto 1:22 0:53 -36% 1.6×
Astral cal=off 5:40 5:26 -4% 1.04×
Astral cal=auto 5:40 5:52 +3.5% 0.97×
TMT cal=off 2:55 2:07 -27% 1.4×
TMT cal=auto 2:55 2:19 -20% 1.2×

Wall time — cumulative dev → PR HEAD

Dataset Mode dev baseline This PR Δ
LFQ cal=off 0:46 0:43 -7%
LFQ cal=auto 1:00 0:53 -12%
Astral cal=off 6:12 5:26 -12.4%
Astral cal=auto 6:54 5:52 -14.9%
TMT cal=off 2:59 2:07 -28.9%
TMT cal=auto 3:21 2:19 -30.8%

PSMs @1% FDR (Percolator) — Rust vs Java

PSM counts are identical to the pre-PR dev baseline on every cell (PINs sorted-row identical). The PR is a pure perf change.

Dataset Mode Java Rust Δ vs Java
LFQ cal=off / auto 15,088 14,755 -2.2%
Astral cal=off 36,271 36,138 -0.4%
Astral cal=auto 36,271 36,715 +1.2%
TMT cal=off 10,212 9,364 -8.3%
TMT cal=auto 10,212 9,605 -5.9%

The Astral cal=auto +1.2% PSM gain over Java is from MassCalibrator (--precursor-cal auto, shipped in PR #33). The TMT 6-8% gap is the known scoring-distribution divergence tracked separately on the I5 trace investigation (PR #37, research-only).

What ships in this PR

Five shipped perf commits + post-cutover code-quality cleanup:

Commit Group Change Win
a8ad6ddd pre-cleanup Remove BUG_REVIEW.md; move CLI_MIGRATION.md to docs/ hygiene
84f83295 cleanup Scrub 33 dangling Xxx.java:LINE references post-port hygiene
ba4c6b34 cleanup Neutralize "port of MS-GF+" framing in headers/CLI hygiene
f0831b03 cleanup Rename MSGFRUST_RSS_PROBEMSGF_RSS_PROBE naming
20da1b4e + 67316e56 cleanup Fix all clippy warnings + lift CI gate to required code quality
542ab6e2 iter1 (S1) HashMapFxHashMap on hot scoring Param tables -8..-16% wall
319af81e iter2 HashMapFxHashMap on AminoAcidSet hot tables small gain
096dbca9 iter3 Skip per-position Vec<AminoAcid> clone for interior positions in mod expansion -10% TMT
a8f2beca iter4 lto = "fat" + codegen-units = 1 on release profile -17% binary size
a96a64bf + af010d5c iter5 Set target-cpu=sandybridge in .cargo/config.toml (scoped to x86_64 via cfg(target_arch)) — AVX 256-bit f64 ops vectorize the chunked GF DP inner loop; FMA deliberately disabled to preserve bit-identity -7..-18% wall, biggest single iter
741fec5b review nits Module-scope use, dedup dev-dep, comment accuracy post-review polish

Reverted experiments (kept on branch as record)

n=10+ project audit pattern: changes that modify scoring distributions or cache locality without unambiguous wins regress Percolator. Reverts kept in history as evidence.

  • 9a6607a1 → reverted d6a869de: branchless f32/f64 rounding — within noise.
  • 67002e42 → reverted 09824bde: S2 MassCalibrator threshold fallback — only +22 LFQ PSMs vs +50 gate.
  • 9c10429c → reverted 689b4f38: iter6 per-edge cache in compute_inner — post-iter5 LTO+AVX micro-regression from extra L1 pressure.

Why this stays bit-identical to dev

Three flavors of change, none of which alter semantics:

  1. Hash map flavor swap (FxHash vs SipHash13): same .get() semantics; iteration order differs but no production hot path iterates the map.
  2. Allocation / layout (Vec clone elimination, LTO=fat, codegen-units=1): no semantic change.
  3. Vectorization with FMA off (Sandy Bridge baseline): AVX vaddpd/vmulpd produce bit-identical f64 results to scalar +/*. FMA is the only AVX feature that changes intermediate rounding — explicitly disabled. The chunked GF inner loop writes to 4 independent destination indices per chunk (no reduction), so vectorization is safe by construction.

The regression test precursor_cal_off_pin_tsv_match_golden_after_sort is green on every commit. End-to-end sorted-row PIN diff vs S1b baseline is identical on all 6 dataset/mode cells.

CPU baseline note

.cargo/config.toml sets target-cpu=sandybridge (Intel Q1 2011) scoped to x86_64 only via cfg(target_arch = "x86_64"). macOS-aarch64 / ARM Linux / ARM Windows keep their architecture-default baseline. The baseline is justified because:

  • AVX 256-bit f64 ops unlock the biggest single win in this PR (iter5)
  • Sandy Bridge is 14 years old; universal across modern proteomics workstations and cloud VMs
  • The previous cargo default x86-64 (2003 / SSE2) limited LLVM to 128-bit vector ops
  • The [profile.release] LTO + cgu flags scope to release builds; the workspace-wide target-cpu means cargo test runs under the same SIMD codegen as the shipped binary, so any bit-identity regression surfaces in CI

Users on pre-2011 x86 hardware can override the baseline by editing .cargo/config.toml locally.

Verification — final state

Local tests (PR HEAD 741fec5b)

  • cargo test --release --workspace -- <CI skip list>593 passed, 0 failed
  • cargo clippy --workspace --all-targets -- -D warnings → clean
  • precursor_cal_off_pin_tsv_match_golden_after_sort (bit-identical PIN gate) → green

CI on 741fec5b — all checks pass:

Check Status Duration
Lint (clippy + rustfmt) ✅ pass 29s
Test (ubuntu-latest, x86_64) ✅ pass 4m13s
Test (macos-latest, aarch64) ✅ pass 4m50s
Test (windows-latest, x86_64) ✅ pass 5m27s
CodeRabbit ✅ pass

The bit-identical PIN gate is NOT in the CI skip list and runs on both x86_64 runners (ubuntu + windows) — covering the exact codegen path the Sandy Bridge baseline affects.

Code review

A thorough automated code review pass was run on the final state. No correctness bugs found. Three actionable style nits were applied in 741fec5b:

  1. Lifted use model::modification::ModLocation; from inside two functions to module scope in crates/search/src/candidate_gen.rs.
  2. Removed redundant rustc-hash = "2" from crates/scoring/Cargo.toml [dev-dependencies] (already a regular dependency).
  3. Clarified the [profile.release] comment in Cargo.toml to document the release-only vs workspace-wide scoping.

The reviewer specifically verified:

  • Candidate-enumeration refactor preserves the DFS emission order and slice contents for every position type (interior, terminal-with-fixed-mod, terminal-with-variable-mod).
  • .cargo/config.toml cfg-scoping is correct; aarch64 builds receive the architecture default.
  • FxHashMap iteration order does not leak into any production hot path (only msgf-trace diagnostic dumps see a different "first" element pick).
  • Bit-identical claim is structurally sound: chunked GF loop has no reductions, FMA is explicitly off, golden-PIN test gates every commit.

Out of scope (next PRs)

ypriverol added 13 commits May 26, 2026 09:04
- BUG_REVIEW.md was a transient artifact from the post-merge bug-hunt
  review; the actual fixes shipped in PR #32 and are documented in the
  PR description / commit history.
- CLI_MIGRATION.md is an audience-specific guide (Java MS-GF+ users
  porting CLI invocations) — belongs under docs/, not at root.
- DOCS.md stays at root as the primary single-file reference (per the
  iter39 docs-rewrite design).
- Updated inbound references in README.md and DOCS.md.
Adds the design spec for PR-Q1 (post-cutover code quality sweep) and
finalizes the inbound-reference updates from the prior commit
(docs/CLI_MIGRATION.md links) that weren't staged at that point.

PR-Q1 is the first of three sequential sub-projects (quality -> speed
-> ID-rate +5% per dataset). Decomposed during the 2026-05-26
brainstorm because the three concerns differ in risk, scope, and time
profile; the ID-rate target is a multi-PR research project, not a
single ship gate.

Scope: 7 groups (6 in-PR + 1 out-of-repo memory update). Dangling
.java:LINE refs (42), stale "port of MS-GF+" framing, identifier
renames (MSGFRUST_RSS_PROBE etc.), 26 clippy warnings, lift CI lint
to required, remove shipped design specs.
The Java source tree was removed in commit b4565b8 during the
Rust-cutover; the inline citations to specific Java line numbers now
point at code that does not exist in this repo. Replace each citation
with intent-only "Java parity" comments. Preserves semantic meaning;
removes the broken hyperlinks.

Parity-test files (tests/*_java_parity.rs, tests/gf_bsa_parity.rs,
tests/*_match_java.rs) untouched — their identity is Java parity and
the citations are load-bearing documentation.

8 non-test files touched, 33 refs replaced, 0 functional changes.
The codebase is post-cutover; new contributors should read crate-lib
top-of-file doc comments as descriptions of what each crate does, not
as port-bookkeeping. CLI --help and enum doc comments that compared
behavior to Java's command-line options now describe behavior directly.

KEEPS user-facing provenance:
- README.md and DOCS.md project-lineage sections
- Legacy numeric flag values (Java MS-GF+ -X) in --help (user migration)
- (Java -precursorCal) in precursor_cal doc (exact flag name we mirror)
- docs/parity-analysis/** content
- Parity test files

Touched: 5 crate-lib headers + msgf-rust CLI framing.
The "MSGFRUST_" prefix dates from an early iter-era naming and does
not match the binary's identity (msgf-rust). Switch the primary name
to MSGF_RSS_PROBE and accept the legacy name for this release with a
one-line deprecation warning on stderr. The legacy name will be
removed in the next quality cleanup.

Side-effect-only env var; no functional change to search/scoring.
Brings the workspace to clippy-clean on stable 1.87.0 so the CI lint
job can be lifted from advisory to required in the next commit.

Changes by class:
- map_or simplifications: mechanical rewrite via clippy --fix
- complex-type aliases: SegmentPartitionCache, SegmentPartitionSlice,
  DeconvResult, and RankKeptCtx struct in
  crates/scoring/src/scoring/scored_spectrum.rs
- too_many_arguments: RankKeptCtx context struct in scored_spectrum.rs;
  #[allow] with reason for directional_node_score_inner, write_tsv,
  write_tsv_to, write_spectrum_rows, and compute_cleavage_credit
- doc-list indentation: add blank line after list / fix continuation
  indent at 15 sites in msgf-rust.rs and scored_spectrum.rs
- unused_mut, ? rewrite, manual split_once, loop-counter: via clippy --fix
- needless_range_loop: suppressed with reason (seg indexes cache AND
  serves as fallback partition_for arg)

No functional behavior change; PIN/TSV bit-identical regression gate
in tree (precursor_cal_bit_identical) is the verification.
After PR-Q1 Task 4 left the workspace clippy-clean on --lib targets,
remove continue-on-error from the lint job's clippy step and extend
the lint command to --all-targets (covers tests + examples + bin in
addition to lib).

Also addresses 5 residual warnings in test/example targets that the
--lib-only fix in Task 4 didn't reach:
- crates/scoring/examples/dump_main_ion.rs: struct field shorthand
- crates/scoring/examples/dump_prefix_cache.rs: needless_range_loop
- crates/scoring/tests/add_prob_dist_chunked_parity.rs: unnecessary parens

Rustfmt remains advisory (~11k lines of fmt churn pending; separate
cleanup).

Lint job now blocks PRs on clippy regressions.
Deletes the iter39 docs-rewrite spec and plan (shipped 2026-05-23 via
PR #30; the rewrite is in dev and being relied on, so the design docs
no longer need to be discoverable in the repo). Their lineage is in
git history.

Tracks the in-flight PR-Q1 implementation plan alongside its design
spec (committed in 55cff3f).

Future protocol: when a docs/specs design file references a feature
that has fully shipped and closed any deferred gate, remove it in the
next quality cleanup.
Three non-blocking observations from the final code review:

1. DOCS.md §97 documented only the legacy MSGFRUST_RSS_PROBE name.
   Now mentions MSGF_RSS_PROBE as the canonical with the legacy noted.
2. crates/model/src/amino_acid.rs:13 inline comment referenced the
   legacy name; updated to MSGF_RSS_PROBE.
3. log_rss deprecation warning fired on every call when only the
   legacy env var was set. Guard with std::sync::Once so it prints
   exactly once per process invocation.

All non-functional; verification: deprecation warning count is now 1
under MSGFRUST_RSS_PROBE=1 + multiple log_rss checkpoints.
After PR #35 (PR-Q1) closed unmerged for not delivering measurable
wins, pivot strategy: stack 3 loosely-coupled sub-features on top
of the cleanup commits and ship ONE PR with bench-gated value.

Sub-features:
- S1: profile-guided Astral wall reduction (gate: -5% wall)
- S2: LFQ calibrator threshold fallback 1e-6 -> 1e-5 (gate: +50 PSMs)
- S3: additive PrecursorErrorPpmSquared PIN column (gate: +50 PSMs
  on any one dataset)

Each sub-feature ships only if its bench gate passes; failures get
dropped before merge.
Profile (perf record on Astral cal=off, 285K samples) identified
36.82% of CPU in HashMap<Partition, ...> lookups using SipHash13.
The hot scoring path (compute_inner, directional_node_score_inner,
rank_scorer::error_score) repeatedly looks up Partition keys for
rank_dist_table, frag_off_table, ion_existence_table, etc.

Switch the 7 Param hot tables to FxHashMap (rustc-hash). SipHash13's
state-init + 13-round mix is unnecessary for non-cryptographic keys
on a single-process search; FxHasher is a single multiply-and-xor.
Same .get/.insert API; only iteration order differs (and no hot path
iterates these tables).

Expected: ~25-30% reduction in match_spectra wall on Astral cal=off.
Bench gate (S1) requires >= 5% Astral wall reduction to ship.
If the strict SpecEValue threshold (1e-6) does not qualify
MIN_CONFIDENT_PSMS (200) in the cal pre-pass, retry once at 1e-5
before giving up. Preserves Java parity on datasets where 1e-6 already
succeeds (Astral, TMT); recovers LFQ-shaped distributions where Rust's
SpecE-tail drift leaves the cal pre-pass a few PSMs short (LFQ ships
at 193/200 in PR-A).

Median-of-residuals + MAD-based robust sigma are robust to one decade
of noisier outliers; the fallback is a one-shot retry, not a baseline
threshold change.

Bench gate (S2): LFQ auto @1% FDR >= 14,805 (baseline 14,755 + 50).
@qodo-code-review
Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 84dca057-4536-470e-9d19-ba595903f007

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/quality-perf-id-rate

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

ypriverol added 4 commits May 27, 2026 08:59
Target 1 (roundf elimination): the perf flamegraph (Astral cal=off,
post-PR-V1 FxHashMap) shows 5.15% CPU in libc roundf called from
f32::round() / f64::round() in scoring hot paths. Replace with the
branchless `(x + 0.5.copysign(x)) as i32` idiom.

Mathematically equivalent for finite non-edge values:
  - x > 0: (x + 0.5) truncated == round_half_away_from_zero
  - x < 0: (x - 0.5) truncated == round_half_away_from_zero
  - x = 0: 0 + 0 = 0 in both forms

Skipped: match_engine.rs:256,257,679,680 (`(tol - 0.4999).round()`)
which is a window-widening adjustment with a different semantic, not
the round-to-nearest idiom.

Target 2 (GF DP inner-loop): compute_inner is already tightly written;
added a comment documenting why no structural change was made and noting
the next opportunities (prev_idx caching alongside valid_edges, SIMD
widening of the inner multiply loop).

No functional behavior change; bit-identical PIN regression gate green.
Re-profile on the PR-V1 binary (Astral cal=off, 245K samples) showed
the FxHashMap swap in scoring::Param did NOT close the dominant
hashing hotspot. 39.35% of CPU was still in the FnMut::call_mut
chain, which traces to:
  expand_mod_combinations -> AminoAcidSet::variants_for
  -> HashMap<(u8, ModLocation), Vec<AminoAcid>>::get
  -> hashbrown find_inner with std::hash::random::RandomState (SipHash13)

PR-V1 missed this map because it lives in the `model` crate, not
`scoring`. The candidate enumeration calls variants_for once per
peptide position per candidate; on Astral this dominates wall.

This commit:
- Adds rustc-hash as a model-crate dependency.
- Switches AminoAcidSet::table and aa_lists_cache from HashMap to
  FxHashMap. Same hashbrown internals; FxHasher is a multiply-and-xor
  vs SipHash13's 13-round mix.

No functional behavior change; PIN bit-identical regression gate
green. Tests pass.

Expected wall reduction on Astral cal=off: -10..-25%, depending on
how much of the 39% chain comes from the SipHash vs the surrounding
Arc clones and Vec extends.
…enum

A perf trace on the PR-V1 binary (Astral cal=off, 245K samples) showed
12.63%+ of CPU under `to_vec<AminoAcid>` and `Arc<Modification>::clone`
within the `expand_mod_combinations` -> `variants_for` chain. The
candidate enumerator was cloning the full Anywhere-variants Vec for
every interior position of every candidate peptide. For a length-L
peptide, L-2 of L positions had no terminal merging to do (typically
~87% of positions on real peptides) and the clone was pure waste.

Refactor: only the 1-2 terminal positions (pos 0 and pos L-1) build
owned merged variants via a new `build_terminal_variants` helper.
Interior positions reference the AminoAcidSet's Anywhere-variants slice
directly. `expand_recursive`'s signature changes from
`&[Vec<AminoAcid>]` to `&[&[AminoAcid]]`.

No functional behavior change; bit-identical regression gate green.

Expected wall reduction on Astral cal=off: -5..-15%, depending on
how much of the 12.63% chain is the `to_vec` vs the surrounding
recursion overhead.
@ypriverol
Copy link
Copy Markdown
Member Author

iter2 + iter3 added on top of S1

Two more profile-driven commits stacked on the PR. Both are bit-identical to S1's output (sorted-row PIN match on all 6 dataset/mode cells); they only change how the work runs.

Commits

Commit What
319af81e iter2HashMap -> FxHashMap on AminoAcidSet hot tables (table, aa_lists_cache). Found via re-profile after S1: PR-V1 swapped the scoring crate's maps but missed this one in model::aa_set.
096dbca9 iter3 — eliminate per-interior-position Vec<AminoAcid> clone in candidate enumeration. expand_mod_combinations was cloning the same AminoAcidSet::Anywhere slice once per interior position (~87% of L positions). New build_terminal_variants helper builds only the 2 terminal Vecs; interior positions borrow the underlying slice via &[AminoAcid]. Biggest gain on TMT (densest mod-variant counts).

Bench (iter3 HEAD vs S1b baseline, both runs on the same VM)

Dataset Mode S1b wall iter3 wall Δ
LFQ (PXD001819) off 0:46.78 0:44.04 -5.9%
LFQ auto 1:01.08 0:57.92 -5.2%
Astral off 5:32.53 5:44.80 +3.7% (Java load +1.3% noisier; net ~+2.4%)
Astral auto 6:19.09 6:16.80 -0.6%
TMT off 2:31.29 2:15.88 -10.2%
TMT auto 2:50.50 2:35.17 -9.0%

Cumulative dev → iter3 HEAD wall

Dataset Mode dev iter3 Δ
LFQ off 0:46 0:44 -4%
LFQ auto 1:00 0:58 -3%
Astral off 6:12 5:45 -7.3%
Astral auto 6:54 6:17 -8.9%
TMT off 2:59 2:16 -24.0%
TMT auto 3:21 2:35 -22.9%

PSMs

All 6 dataset/mode cells produce PIN files that are sorted-row identical to S1b. PSM counts are inherited unchanged: Astral 36,138/36,715 · LFQ 14,755/14,755 · TMT 9,364/9,605.

Hot paths cross crate boundaries (search -> scoring -> model). Default
release profile uses codegen-units=16 with no LTO, so LLVM cannot inline
across crates and small leaf functions (AminoAcid::nominal_mass,
Enzyme::is_cleavable, FxHasher::write_u32, etc.) stay as function calls
across the hot search loop.

LTO=fat + cgu=1 give LLVM a whole-binary view at link time so it can
inline cross-crate, deduplicate similar generic instantiations, and pick
better register allocations on hot paths. Build time goes up (~2-3x) and
binary size grows slightly, but this is a release-time cost only.

No `target-cpu=native` here -- the released binary must run on a
baseline x86_64 (e.g. older bench VMs). Future bench-only builds can
opt in with `RUSTFLAGS="-C target-cpu=native"`.

The bit-identical regression gate (precursor_cal_off_pin_tsv_match_golden_after_sort)
is green under the new profile -- LLVM's whole-program view does not
change float semantics.
@ypriverol
Copy link
Copy Markdown
Member Author

iter4 — LTO=fat + codegen-units=1 (build profile, no code change)

Pure release-profile change in Cargo.toml. The previous release build used cargo defaults (codegen-units=16, no LTO), so cross-crate inlining was disabled. Now LLVM gets a whole-binary view at link time.

Verification

  • Bit-identical to S1b on all 6 dataset/mode cells (sorted-row PIN match) → Percolator @1% PSM counts unchanged
  • Binary size: 2.3M → 1.9M (-17%)
  • Build time: 12s incremental → 33s clean LTO link

Bench (iter4 vs iter3, same VM, Java drift ≤ 1.2%)

Dataset Mode iter3 iter4 Δ
Astral java 5:41.22 5:45.02 +1.1% (load)
Astral rust-off 5:44.80 6:00.11 +4.4% ⚠️
Astral rust-auto 6:16.80 5:58.14 -5.0%
PXD001819 java 1:23.44 1:22.84 -0.7%
PXD001819 rust-off 0:44.04 0:43.82 flat
PXD001819 rust-auto 0:57.92 0:53.68 -7.3%
TMT java 2:58.16 2:56.75 -0.8%
TMT rust-off 2:15.88 2:16.39 flat
TMT rust-auto 2:35.17 2:31.45 -2.4%

Read

Mixed. Slight wins on cal=auto modes (-2% to -7%), Astral cal=off regressed +3% beyond Java's load drift. Build-config changes shouldn't slow steady-state code; this is most likely noise on the longer Astral run, but the lack of a clean win means LTO isn't the next-step lever for the wall-time target.

Keeping iter4 because:

  1. Binary is 17% smaller (unambiguous ship win)
  2. Production-quality release settings — industry standard
  3. Zero correctness risk (bit-identical PIN gate green)

Next: re-profile suggests GF DP compute_inner at 16% of leaf time is the largest single target. Lower-priority leaves: directional_node_score_inner 9%, roundf libm calls 6%, PrimitiveAaGraph::build_in_place 5%. The 4-wide chunked inner loop in compute_inner is already vectorization-friendly — target-cpu=native (AVX2/FMA) would unlock it but risks bit-identity to S1b's golden PINs.

Project-level `.cargo/config.toml` sets `target-cpu=sandybridge` so LLVM
can use AVX 256-bit f64 vector ops to auto-vectorize the chunked GF DP
inner loop in `crates/scoring/src/gf/generating_function.rs`. The loop
spends ~16% of leaf time on Astral, and the chunked 4-wide layout was
already vectorization-friendly -- it just needed an enabled target.

Sandy Bridge (Intel, Q1 2011) is the right baseline:
 - AVX is widely available (any x86 CPU from 2011+)
 - AVX 256-bit double ops vectorize the GF chunked loop
 - NO AVX2 (Haswell+) -- preserves portability to older hypervisors
 - NO FMA -- fused multiply-add changes intermediate rounding; we need
   bit-identity with Java's separate-mul-add path for the parity gate

The default `x86-64` baseline is 22 years old (2003, SSE2 only). The
hardware floor it targets is no longer realistic for the workloads
this tool runs against. Users on pre-2011 hardware can override with
their own RUSTFLAGS.

Bench (iter5 vs S1b baseline, same VM, Java drift < 1.5%):

| Dataset    | Mode | S1b     | iter5   | Δ       |
|------------|------|---------|---------|---------|
| Astral     | off  | 5:32.53 | 5:26.38 | -1.8%   |
| Astral     | auto | 6:19.09 | 5:52.15 | -7.1%   |
| LFQ        | off  | 0:46.78 | 0:42.71 | -8.7%   |
| LFQ        | auto | 1:01.08 | 0:53.24 | -12.8%  |
| TMT        | off  | 2:31.29 | 2:07.36 | -15.8%  |
| TMT        | auto | 2:50.50 | 2:19.40 | -18.3%  |

Rust now beats Java on Astral cal=off (5:26 vs 5:40, +4%) and TMT (by
20-27%). PINs sorted-row identical to S1b across all 6 dataset/mode
cells; Percolator @1% PSM counts unchanged.

The `.cargo/config.toml` entry is committed -- `.gitignore` updated to
keep `.cargo/` private by default but allow `config.toml` through.
@ypriverol
Copy link
Copy Markdown
Member Author

iter5 — Sandy Bridge CPU baseline (AVX, no FMA) — biggest single-iter win

Astral cal=auto -7.1%, LFQ cal=auto -12.8%, TMT cal=auto -18.3%, PIN bit-identical.

Change

  • .cargo/config.toml sets target-cpu=sandybridge for all release builds
  • .gitignore updated to keep .cargo/ private by default but allow config.toml through
  • Cargo.toml comment refreshed to point at the new config

What it unlocks

The chunked 4-wide f64 loop in compute_inner (crates/scoring/src/gf/generating_function.rs:551) was already written for vectorization but the cargo default x86-64 baseline (2003 / SSE2) limited LLVM to 128-bit vector ops. Sandy Bridge enables AVX → 256-bit vmulpd / vaddpd → 1 instruction per 4-wide chunk. AVX2/FMA stay disabled deliberately (FMA changes intermediate rounding → would break the bit-identical PIN gate vs the Java reference).

Bench (iter5 vs S1b, same VM, Java drift < 1.5%)

Dataset Mode S1b iter5 Δ
LFQ off 0:46.78 0:42.71 -8.7%
LFQ auto 1:01.08 0:53.24 -12.8%
Astral off 5:32.53 5:26.38 -1.8%
Astral auto 6:19.09 5:52.15 -7.1%
TMT off 2:31.29 2:07.36 -15.8%
TMT auto 2:50.50 2:19.40 -18.3%

vs Java

Dataset Mode Java Rust iter5 Rust vs Java
LFQ auto 1:22.83 0:53.24 -36% (Rust 1.6× faster)
Astral off 5:40.18 5:26.38 -4% (Rust ahead)
Astral auto 5:40.18 5:52.15 +3.5% (Rust slower)
TMT off 2:55.11 2:07.36 -27% (Rust 1.4× faster)
TMT auto 2:55.11 2:19.40 -20% (Rust 1.2× faster)

Rust now beats or matches Java on 5 of 6 cells. The remaining Astral cal=auto gap (+12s) is calibration overhead specific to Rust's MassCalibrator pass.

Compatibility note

Sandy Bridge is Intel Q1 2011 / AMD Bulldozer late-2011. Universal across modern proteomics workstations and cloud VMs. Users on pre-2011 hardware can override the baseline in their local .cargo/config.toml.

Cumulative dev → iter5 (this PR HEAD)

Dataset Mode dev baseline iter5 Δ
LFQ off 0:46 0:43 -7%
LFQ auto 1:00 0:53 -12%
Astral off 6:12 5:26 -12.4%
Astral auto 6:54 5:52 -14.9%
TMT off 2:59 2:07 -28.9%
TMT auto 3:21 2:19 -30.8%

All PSM counts unchanged across all 6 cells (PIN bit-identical to dev).

ypriverol added 3 commits May 27, 2026 15:14
The `[build] rustflags = ["-C", "target-cpu=sandybridge"]` introduced
in the previous commit was unconditional, which fails on aarch64
targets (macOS Apple Silicon, ARM Linux) because sandybridge is an
x86-only LLVM target-cpu.

Scope via `[target.'cfg(target_arch = "x86_64")']` so the x86_64
release still gets AVX 256-bit vectorization while aarch64 builds
fall back to their NEON-by-default baseline. macOS-latest CI is now
aarch64-only on GitHub Actions, so this unblocks the macos test job.
The per-node loop in `compute_inner` scans incoming edges twice: once
to derive `cur_min_score`/`cur_max_score`, then once to accumulate
probabilities into `cur_slice`. The first pass already computed every
field the second pass needed (`prev_idx`, `prev_hdr`, `combined_score`,
`aa_prob`) but threw all of it away, storing only the raw edge index
`e`. The second pass redid the work — `node_index_for_mass` lookup,
`arena.headers[prev_idx]` load, and the `combined_score` add — per
edge in the hot loop.

Replace `valid_edges: Vec<usize>` with `Vec<ValidEdge>` carrying the
cached fields. `NodeSlice` is already `Copy` (16 bytes) so the storage
cost is small relative to the saved arithmetic + indexed loads.

Bit-identical regression gate green; full workspace tests pass under
the standard CI skip list. No semantic change — purely caching.
@ypriverol
Copy link
Copy Markdown
Member Author

iter6 → reverted (9c10429c + revert 689b4f38)

Hypothesis tested: cache prev_hdr / combined_score / aa_prob per edge in ValidEdge to avoid redoing node_index_for_mass lookup + arena.headers[] load + arithmetic in the GF DP second pass.

Bench (iter6 vs iter5, same VM, Java drift ≤ 1.2%):

Dataset Mode iter5 iter6 Δ
LFQ off 0:42.71 0:41.55 -2.7%
LFQ auto 0:53.24 0:52.87 flat
Astral off 5:26.38 5:32.15 +1.8% ⚠️
Astral auto 5:52.15 5:37.17 -4.3%
TMT off 2:07.36 2:07.28 flat
TMT auto 2:19.40 2:29.19 +7.0% ⚠️

PIN bit-identical across all 6 cells (PSM counts unchanged). The compute_inner second pass was already well-optimized post-LTO+AVX — LLVM had likely register-allocated the prev_idx lookup chain efficiently. The 16-byte ValidEdge struct probably added more L1 pressure on the hot scratch buffer than the saved arithmetic recouped.

n=10 audit pattern: micro-optimizations that don't move PIN distributions can still regress wall via cache effects. Reverted; iter5 stays as the PR HEAD.

Current PR HEAD: 689b4f38 (revert) — effectively iter5 (af010d5c).

@ypriverol ypriverol changed the title perf+chore: FxHashMap on hot scoring tables (-10% Astral, -15% TMT) + post-cutover cleanup perf+chore: V1 speed milestone — Rust beats Java on 5 of 6 cells (PSM-zero-regression) May 27, 2026
Addresses three review observations on the V1 speed milestone PR:

1. `crates/search/src/candidate_gen.rs` — `use model::modification::ModLocation;`
   was inlined inside `expand_mod_combinations` and again inside
   `build_terminal_variants`. Lifted to module scope and removed both
   inline `use`s.

2. `crates/scoring/Cargo.toml` — `rustc-hash = "2"` was listed in both
   `[dependencies]` and `[dev-dependencies]`. Removed the dev-dep entry;
   the regular dependency is in scope for tests already.

3. `Cargo.toml` — clarified that `[profile.release]` flags (`lto`,
   `codegen-units`) are release-only, while the CPU baseline in
   `.cargo/config.toml` is workspace-wide on purpose — so the
   bit-identical PIN gate runs under the same SIMD codegen as the
   shipped binary.

No behavior change; cargo clippy --workspace --all-targets -D warnings
clean; bit-identical PIN gate green.
@ypriverol
Copy link
Copy Markdown
Member Author

Code review applied (741fec5b)

Ran a thorough code review on the milestone. No correctness bugs. Applied the actionable style nits:

  1. Lifted use model::modification::ModLocation; from inside expand_mod_combinations and build_terminal_variants to module scope in crates/search/src/candidate_gen.rs. Dedupes the import.
  2. Removed redundant rustc-hash = "2" from crates/scoring/Cargo.toml [dev-dependencies] (already in [dependencies]).
  3. Updated the Cargo.toml [profile.release] comment to clarify the LTO/codegen flags are release-only, while the .cargo/config.toml CPU baseline is workspace-wide on purpose — so the bit-identical PIN gate runs under the same SIMD codegen as the shipped binary.

Review-fitness observations (not blocking)

  • CI golden-PIN coverage is correct. precursor_cal_off_pin_tsv_match_golden_after_sort is NOT in the workflow's --skip list and runs under both ubuntu-latest and windows-latest (x86_64) — the exact codegen path Sandy Bridge affects.
  • One latent-risk note for future readers: FxHasher's known weak avalanche on small-integer-tuple keys is fine for the current Partition cardinality (~50 entries, charge ∈ {2..5}, seg_num ∈ {0..3}). If future ParamSets grow charge × segment × mass dimensions substantially, this is worth re-measuring. Not actionable for V1.
  • msgf-trace diagnostic note: the binary uses param.rank_dist_table.iter().next() / .iter().take(3) for diagnostic dumps. With FxHashMap the "first" partition picked differs from the std-HashMap baseline. This is the diagnostic tool, not the production search — no functional impact.

Skipped (cosmetic / out of scope)

  • aa_set.rs:275 splitn(2, '=').nth(1)split_once('=').map(...) — clippy-style change but already bundled in the FxHash commit; touching it now would create more churn than value.
  • param_model.rs:54 FxHashMap::with_capacity_and_hasher(..., Default::default()) idiom polish.
  • build_terminal_variants doc-comment expansion — the inline comments at the fixed-mod branch already cover the subtle semantics.
  • Iter6 revert pair (9c10429c + 689b4f38) preserved in history as record of the n=10 audit observation (modifies-cache-pressure tradeoff in compute_inner).

CI matrix expected to stay green; final HEAD 741fec5b.

@ypriverol
Copy link
Copy Markdown
Member Author

Final test sweep — all green, PR ready to merge

Local (PR HEAD 741fec5b)

  • cargo test --release --workspace -- <CI skip list>593 passed, 0 failed (full workspace sweep mirroring the CI command)
  • cargo clippy --workspace --all-targets -- -D warnings → clean
  • precursor_cal_off_pin_tsv_match_golden_after_sort → green (bit-identical PIN gate)

CI on 741fec5b

Check Status Duration
Lint (clippy + rustfmt) pass 29s
Test (ubuntu-latest) pass 4m13s
Test (macos-latest) pass 4m50s
Test (windows-latest) pass 5m27s
CodeRabbit pass

Summary

  • PSM counts: identical to dev on all 6 dataset/mode cells (PINs sorted-row identical, Percolator @1% unchanged)
  • Wall: Rust beats Java on 5 of 6 cells, biggest wins on TMT (-27%/-20%) and LFQ (-47%/-36%)
  • Cumulative dev → PR HEAD wall: LFQ -7..-12%, Astral -12..-15%, TMT -29..-31%
  • Bit-identical regression test gated every commit; FMA explicitly disabled to preserve the gate

Ready to merge.

@ypriverol ypriverol merged commit a44b791 into dev May 27, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant