docs: refresh README bench table vs upstream MSGFPlus v2024.03.26 by ypriverol · Pull Request #39 · bigbio/msgf-rust

ypriverol · 2026-05-27T21:01:13Z

Summary

Replaces the previous README benchmark table (which compared against an internal bigbio Java build) with fresh measurements against the canonical upstream MSGFPlus v2024.03.26 release, run with msgf-rust master (--precursor-cal auto). Both ran on the same 8-thread VM and went through Percolator 3.7.1 for 1% FDR PSM counts.

Headline

Dataset	Java PSMs @1%	msgf-rust PSMs @1%	Δ PSMs	Java wall	msgf-rust wall	Speedup
Astral DDA	33,425	36,715	+3,290 (+9.8%)	2:20:42	6:28	21.8×
PXD001819	14,974	14,755	-219 (-1.5%)	8:46	0:54	9.7×
TMT	10,115	9,605	-510 (-5.0%)	1:11:00	2:33	27.9×

Methodology notes added to README

8-thread Intel Xeon Gold 6238 VM (AVX-only, no AVX2/FMA exposed by hypervisor)
Java JAR: stock upstream MSGFPlus v2024.03.26 (no -precursorCal flag — that's a bigbio addition, not in upstream)
msgf-rust: master branch, release build with target-cpu=sandybridge (AVX, FMA disabled to preserve bit-identity)
Java mzid → PIN via msgf2pin from the Percolator 3.6.5 container (single-arg mode). The 3.7.1 container's msgf2pin has a known parser crash on these mzid files — documented inline for reproducibility.
Percolator 3.7.1 with --seed 42 --only-psms for both Java and Rust pins (apples-to-apples).

Why this matters

The previous README headline was "Beats Java MS-GF+ on all three benchmark datasets at 1% FDR while running 14-330% faster." Against the actual upstream Java, the picture is more nuanced and more favorable:

Speed is a much bigger gap than previously documented — 9.7-27.9× rather than the 14% / 3.3× / 2% in the old table. This is because the old table compared against an internally optimized bigbio Java build, not the upstream release a user would actually download.
PSM counts are within ±10% of Java on all three datasets, with Astral now showing a +9.8% advantage. TMT trails by 5% (tracked separately as the I5 score_psm divergence — see docs/parity-analysis/notes/2026-05-26-score-psm-trace-findings.md).

Test plan

Render README on GitHub and verify the table + collapsible methodology section format correctly
No CI impact (docs-only change)

Summary by CodeRabbit

Documentation
- Refreshed project positioning and updated benchmark claims for the Rust MS-GF+ port
- Revised "Why msgf-rust?" section with new datasets, performance results, and Speedup column
- Added collapsible "Bench methodology" section with comprehensive hardware specifications, baseline parameters, conversion tooling details, and reproducibility guidance
- Clarified calibration options and run conditions used in benchmarks

Replaces the previous benchmark table (which compared against a bigbio internal Java build) with fresh measurements against the canonical upstream MSGFPlus v2024.03.26 release. Same 8-thread VM, same 3 reference datasets, all at 1% FDR via Percolator 3.7.1. msgf-rust uses --precursor-cal auto. Java mzid is converted to PIN via msgf2pin from the older Percolator 3.6.5 container (single-arg mode) because the 3.7.1 msgf2pin has a parser crash on this mzid output. Headline numbers (Java -> msgf-rust): - PXD001819: 14,974 -> 14,755 PSMs (-1.5%); 8:46 -> 0:54 (9.7x faster) - Astral: 33,425 -> 36,715 PSMs (+9.8%); 2:20:42 -> 6:28 (21.8x faster) - TMT: 10,115 -> 9,605 PSMs (-5.0%); 1:11:00 -> 2:33 (27.9x faster) Added a collapsible "Bench methodology" section documenting hardware, flags, msgf2pin version choice, and reproducibility paths so the numbers can be re-derived from scratch.

qodo-code-review · 2026-05-27T21:01:19Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

coderabbitai · 2026-05-27T21:01:33Z

📝 Walkthrough

Walkthrough

This PR updates the README.md to refresh the project's positioning and benchmark documentation for the Rust MS-GF+ port. The introductory tagline and speed claims are revised, benchmark tables are updated with new datasets and a Speedup column, run-condition details are expanded, and a new collapsible methodology section documents hardware, baseline parameters, and reproducibility information.

Changes

Documentation Updates

Layer / File(s)	Summary
README benchmark documentation refresh `README.md`	The introductory tagline revises stated inputs/outputs and benchmark speed factors; the "Why msgf-rust?" section presents updated datasets with a `Speedup` column, Java baseline configuration (`--precursor-cal auto`), and notes on remaining feature divergences; a new collapsible "Bench methodology" section documents benchmark hardware, run parameters, conversion tooling, wall-time measurement, and reproducibility references.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Poem

📄 A rabbit hops through README rows,
Where benchmarks bloom and speedup shows,
New tables dance, methodology clear,
Documentation refreshed, crystal and dear! 🐰✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects the main change: updating the README's benchmark table to compare against the upstream MSGFPlus v2024.03.26 release rather than the internal baseline.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch docs/readme-bench-update-v2024

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@README.md`:
- Line 34: Update README.md to avoid pointing to environment-private paths:
replace the two absolute VM-only paths `/srv/data/msgf-bench/finalize2_v2024.sh`
and `/srv/data/msgf-bench/run_percolator_docker.sh` with either
repository-relative paths (e.g., `benchmark/ci/...`) that contain public copies
of those scripts or add a clear note that these are internal-only and provide
equivalent public scripts or instructions; ensure the line mentioning
"Reproducibility" is edited so readers outside the bench VM can reproduce steps
or are directed to the internal-only notice and the public alternatives.
- Line 7: Update the bold headline line starting "**A Rust port of MS-GF+** —
takes mzML/MGF spectra + FASTA in, produces Percolator-ready `.pin` out. Matches
or beats Java MS-GF+ PSM counts at 1% FDR while running **10-28× faster**." to a
softened claim that aligns with the table (e.g., "comparable to Java MS-GF+
(within -5% to +9.8% depending on dataset) while running **10-28× faster**" or
"near parity to Java MS-GF+ (up to +9.8% depending on dataset) while running
**10-28× faster**"); keep the performance multiplier unchanged and ensure the
new wording references the dataset-dependent range shown in the results table.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 58855d7b-9192-499e-b12c-f57d2929be10

📥 Commits

Reviewing files that changed from the base of the PR and between 5711ddd and 086166a.

📒 Files selected for processing (1)

README.md

coderabbitai · 2026-05-27T21:04:20Z

 [![License: UCSD-Noncommercial](https://img.shields.io/badge/license-UCSD--Noncommercial-blue)](LICENSE)

-> **A Rust port of MS-GF+** — takes mzML/MGF spectra + FASTA in, produces Percolator-ready `.pin` out. Beats Java MS-GF+ on all three benchmark datasets at 1% FDR while running 14-330% faster.
+> **A Rust port of MS-GF+** — takes mzML/MGF spectra + FASTA in, produces Percolator-ready `.pin` out. Matches or beats Java MS-GF+ PSM counts at 1% FDR while running **10-28× faster**.


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align the headline claim with the table results.

“Matches or beats Java ... PSM counts” conflicts with the published table (two datasets are below Java, including -5.0%). Please soften this to reflect “near parity to +9.8% depending on dataset” (or similar).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` at line 7, Update the bold headline line starting "**A Rust port of MS-GF+** — takes mzML/MGF spectra + FASTA in, produces Percolator-ready `.pin` out. Matches or beats Java MS-GF+ PSM counts at 1% FDR while running **10-28× faster**." to a softened claim that aligns with the table (e.g., "comparable to Java MS-GF+ (within -5% to +9.8% depending on dataset) while running **10-28× faster**" or "near parity to Java MS-GF+ (up to +9.8% depending on dataset) while running **10-28× faster**"); keep the performance multiplier unchanged and ensure the new wording references the dataset-dependent range shown in the results table.

coderabbitai · 2026-05-27T21:04:20Z

+- **Java → PIN:** `msgf2pin` from the percolator `3.6.5--h6351f2a_0` container (single-arg mode for concatenated-TDA mzid; the `3.7.1` container's msgf2pin has a known parser crash on this mzid output).
+- **Percolator:** `percolator 3.7.1` in `quay.io/biocontainers/percolator:3.7.1--h3b5f4bd_2` with `--seed 42 --only-psms`. Same parser script for both Java and Rust PINs.
+- **Wall time:** `/usr/bin/time -v` "Elapsed (wall clock) time" — does not include Percolator stage.
+- **Reproducibility:** scripts at `/srv/data/msgf-bench/finalize2_v2024.sh` and `/srv/data/msgf-bench/run_percolator_docker.sh` on the bench VM.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make reproducibility references accessible outside the bench VM.

The listed /srv/data/... script paths are environment-private, so readers cannot reproduce from this README alone. Prefer repo paths (e.g., benchmark/ci/...) or explicitly mark these as internal-only and add public equivalents.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` at line 34, Update README.md to avoid pointing to environment-private paths: replace the two absolute VM-only paths `/srv/data/msgf-bench/finalize2_v2024.sh` and `/srv/data/msgf-bench/run_percolator_docker.sh` with either repository-relative paths (e.g., `benchmark/ci/...`) that contain public copies of those scripts or add a clear note that these are internal-only and provide equivalent public scripts or instructions; ensure the line mentioning "Reproducibility" is edited so readers outside the bench VM can reproduce steps or are directed to the internal-only notice and the public alternatives.

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

ypriverol merged commit 613a404 into master May 28, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: refresh README bench table vs upstream MSGFPlus v2024.03.26#39

docs: refresh README bench table vs upstream MSGFPlus v2024.03.26#39
ypriverol merged 1 commit into
masterfrom
docs/readme-bench-update-v2024

ypriverol commented May 27, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

qodo-code-review Bot commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 27, 2026

Uh oh!

coderabbitai Bot May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ypriverol commented May 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Headline

Methodology notes added to README

Why this matters

Test plan

Summary by CodeRabbit

Uh oh!

qodo-code-review Bot commented May 27, 2026

Qodo reviews are paused for this user.

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ypriverol commented May 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading