docs: refresh README bench table vs upstream MSGFPlus v2024.03.26#39
Conversation
Replaces the previous benchmark table (which compared against a bigbio internal Java build) with fresh measurements against the canonical upstream MSGFPlus v2024.03.26 release. Same 8-thread VM, same 3 reference datasets, all at 1% FDR via Percolator 3.7.1. msgf-rust uses --precursor-cal auto. Java mzid is converted to PIN via msgf2pin from the older Percolator 3.6.5 container (single-arg mode) because the 3.7.1 msgf2pin has a parser crash on this mzid output. Headline numbers (Java -> msgf-rust): - PXD001819: 14,974 -> 14,755 PSMs (-1.5%); 8:46 -> 0:54 (9.7x faster) - Astral: 33,425 -> 36,715 PSMs (+9.8%); 2:20:42 -> 6:28 (21.8x faster) - TMT: 10,115 -> 9,605 PSMs (-5.0%); 1:11:00 -> 2:33 (27.9x faster) Added a collapsible "Bench methodology" section documenting hardware, flags, msgf2pin version choice, and reproducibility paths so the numbers can be re-derived from scratch.
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
📝 WalkthroughWalkthroughThis PR updates the README.md to refresh the project's positioning and benchmark documentation for the Rust MS-GF+ port. The introductory tagline and speed claims are revised, benchmark tables are updated with new datasets and a Speedup column, run-condition details are expanded, and a new collapsible methodology section documents hardware, baseline parameters, and reproducibility information. ChangesDocumentation Updates
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~5 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@README.md`:
- Line 34: Update README.md to avoid pointing to environment-private paths:
replace the two absolute VM-only paths `/srv/data/msgf-bench/finalize2_v2024.sh`
and `/srv/data/msgf-bench/run_percolator_docker.sh` with either
repository-relative paths (e.g., `benchmark/ci/...`) that contain public copies
of those scripts or add a clear note that these are internal-only and provide
equivalent public scripts or instructions; ensure the line mentioning
"Reproducibility" is edited so readers outside the bench VM can reproduce steps
or are directed to the internal-only notice and the public alternatives.
- Line 7: Update the bold headline line starting "**A Rust port of MS-GF+** —
takes mzML/MGF spectra + FASTA in, produces Percolator-ready `.pin` out. Matches
or beats Java MS-GF+ PSM counts at 1% FDR while running **10-28× faster**." to a
softened claim that aligns with the table (e.g., "comparable to Java MS-GF+
(within -5% to +9.8% depending on dataset) while running **10-28× faster**" or
"near parity to Java MS-GF+ (up to +9.8% depending on dataset) while running
**10-28× faster**"); keep the performance multiplier unchanged and ensure the
new wording references the dataset-dependent range shown in the results table.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| [](LICENSE) | ||
|
|
||
| > **A Rust port of MS-GF+** — takes mzML/MGF spectra + FASTA in, produces Percolator-ready `.pin` out. Beats Java MS-GF+ on all three benchmark datasets at 1% FDR while running 14-330% faster. | ||
| > **A Rust port of MS-GF+** — takes mzML/MGF spectra + FASTA in, produces Percolator-ready `.pin` out. Matches or beats Java MS-GF+ PSM counts at 1% FDR while running **10-28× faster**. |
There was a problem hiding this comment.
Align the headline claim with the table results.
“Matches or beats Java ... PSM counts” conflicts with the published table (two datasets are below Java, including -5.0%). Please soften this to reflect “near parity to +9.8% depending on dataset” (or similar).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` at line 7, Update the bold headline line starting "**A Rust port
of MS-GF+** — takes mzML/MGF spectra + FASTA in, produces Percolator-ready
`.pin` out. Matches or beats Java MS-GF+ PSM counts at 1% FDR while running
**10-28× faster**." to a softened claim that aligns with the table (e.g.,
"comparable to Java MS-GF+ (within -5% to +9.8% depending on dataset) while
running **10-28× faster**" or "near parity to Java MS-GF+ (up to +9.8% depending
on dataset) while running **10-28× faster**"); keep the performance multiplier
unchanged and ensure the new wording references the dataset-dependent range
shown in the results table.
| - **Java → PIN:** `msgf2pin` from the percolator `3.6.5--h6351f2a_0` container (single-arg mode for concatenated-TDA mzid; the `3.7.1` container's msgf2pin has a known parser crash on this mzid output). | ||
| - **Percolator:** `percolator 3.7.1` in `quay.io/biocontainers/percolator:3.7.1--h3b5f4bd_2` with `--seed 42 --only-psms`. Same parser script for both Java and Rust PINs. | ||
| - **Wall time:** `/usr/bin/time -v` "Elapsed (wall clock) time" — does not include Percolator stage. | ||
| - **Reproducibility:** scripts at `/srv/data/msgf-bench/finalize2_v2024.sh` and `/srv/data/msgf-bench/run_percolator_docker.sh` on the bench VM. |
There was a problem hiding this comment.
Make reproducibility references accessible outside the bench VM.
The listed /srv/data/... script paths are environment-private, so readers cannot reproduce from this README alone. Prefer repo paths (e.g., benchmark/ci/...) or explicitly mark these as internal-only and add public equivalents.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` at line 34, Update README.md to avoid pointing to
environment-private paths: replace the two absolute VM-only paths
`/srv/data/msgf-bench/finalize2_v2024.sh` and
`/srv/data/msgf-bench/run_percolator_docker.sh` with either repository-relative
paths (e.g., `benchmark/ci/...`) that contain public copies of those scripts or
add a clear note that these are internal-only and provide equivalent public
scripts or instructions; ensure the line mentioning "Reproducibility" is edited
so readers outside the bench VM can reproduce steps or are directed to the
internal-only notice and the public alternatives.
Summary
Replaces the previous README benchmark table (which compared against an internal bigbio Java build) with fresh measurements against the canonical upstream MSGFPlus v2024.03.26 release, run with msgf-rust master (
--precursor-cal auto). Both ran on the same 8-thread VM and went through Percolator 3.7.1 for 1% FDR PSM counts.Headline
Methodology notes added to README
-precursorCalflag — that's a bigbio addition, not in upstream)target-cpu=sandybridge(AVX, FMA disabled to preserve bit-identity)msgf2pinfrom the Percolator 3.6.5 container (single-arg mode). The 3.7.1 container's msgf2pin has a known parser crash on these mzid files — documented inline for reproducibility.--seed 42 --only-psmsfor both Java and Rust pins (apples-to-apples).Why this matters
The previous README headline was "Beats Java MS-GF+ on all three benchmark datasets at 1% FDR while running 14-330% faster." Against the actual upstream Java, the picture is more nuanced and more favorable:
docs/parity-analysis/notes/2026-05-26-score-psm-trace-findings.md).Test plan
Summary by CodeRabbit