Skip to content

docs: refresh README bench table vs upstream MSGFPlus v2024.03.26#39

Merged
ypriverol merged 1 commit into
masterfrom
docs/readme-bench-update-v2024
May 28, 2026
Merged

docs: refresh README bench table vs upstream MSGFPlus v2024.03.26#39
ypriverol merged 1 commit into
masterfrom
docs/readme-bench-update-v2024

Conversation

@ypriverol
Copy link
Copy Markdown
Member

@ypriverol ypriverol commented May 27, 2026

Summary

Replaces the previous README benchmark table (which compared against an internal bigbio Java build) with fresh measurements against the canonical upstream MSGFPlus v2024.03.26 release, run with msgf-rust master (--precursor-cal auto). Both ran on the same 8-thread VM and went through Percolator 3.7.1 for 1% FDR PSM counts.

Headline

Dataset Java PSMs @1% msgf-rust PSMs @1% Δ PSMs Java wall msgf-rust wall Speedup
Astral DDA 33,425 36,715 +3,290 (+9.8%) 2:20:42 6:28 21.8×
PXD001819 14,974 14,755 -219 (-1.5%) 8:46 0:54 9.7×
TMT 10,115 9,605 -510 (-5.0%) 1:11:00 2:33 27.9×

Methodology notes added to README

  • 8-thread Intel Xeon Gold 6238 VM (AVX-only, no AVX2/FMA exposed by hypervisor)
  • Java JAR: stock upstream MSGFPlus v2024.03.26 (no -precursorCal flag — that's a bigbio addition, not in upstream)
  • msgf-rust: master branch, release build with target-cpu=sandybridge (AVX, FMA disabled to preserve bit-identity)
  • Java mzid → PIN via msgf2pin from the Percolator 3.6.5 container (single-arg mode). The 3.7.1 container's msgf2pin has a known parser crash on these mzid files — documented inline for reproducibility.
  • Percolator 3.7.1 with --seed 42 --only-psms for both Java and Rust pins (apples-to-apples).

Why this matters

The previous README headline was "Beats Java MS-GF+ on all three benchmark datasets at 1% FDR while running 14-330% faster." Against the actual upstream Java, the picture is more nuanced and more favorable:

  • Speed is a much bigger gap than previously documented — 9.7-27.9× rather than the 14% / 3.3× / 2% in the old table. This is because the old table compared against an internally optimized bigbio Java build, not the upstream release a user would actually download.
  • PSM counts are within ±10% of Java on all three datasets, with Astral now showing a +9.8% advantage. TMT trails by 5% (tracked separately as the I5 score_psm divergence — see docs/parity-analysis/notes/2026-05-26-score-psm-trace-findings.md).

Test plan

  • Render README on GitHub and verify the table + collapsible methodology section format correctly
  • No CI impact (docs-only change)

Summary by CodeRabbit

  • Documentation
    • Refreshed project positioning and updated benchmark claims for the Rust MS-GF+ port
    • Revised "Why msgf-rust?" section with new datasets, performance results, and Speedup column
    • Added collapsible "Bench methodology" section with comprehensive hardware specifications, baseline parameters, conversion tooling details, and reproducibility guidance
    • Clarified calibration options and run conditions used in benchmarks

Review Change Stack

Replaces the previous benchmark table (which compared against a bigbio
internal Java build) with fresh measurements against the canonical
upstream MSGFPlus v2024.03.26 release.

Same 8-thread VM, same 3 reference datasets, all at 1% FDR via
Percolator 3.7.1. msgf-rust uses --precursor-cal auto. Java mzid is
converted to PIN via msgf2pin from the older Percolator 3.6.5 container
(single-arg mode) because the 3.7.1 msgf2pin has a parser crash on this
mzid output.

Headline numbers (Java -> msgf-rust):
 - PXD001819: 14,974 -> 14,755 PSMs (-1.5%); 8:46 -> 0:54 (9.7x faster)
 - Astral:    33,425 -> 36,715 PSMs (+9.8%); 2:20:42 -> 6:28 (21.8x faster)
 - TMT:       10,115 -> 9,605 PSMs (-5.0%); 1:11:00 -> 2:33 (27.9x faster)

Added a collapsible "Bench methodology" section documenting hardware,
flags, msgf2pin version choice, and reproducibility paths so the
numbers can be re-derived from scratch.
@qodo-code-review
Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 27, 2026

📝 Walkthrough

Walkthrough

This PR updates the README.md to refresh the project's positioning and benchmark documentation for the Rust MS-GF+ port. The introductory tagline and speed claims are revised, benchmark tables are updated with new datasets and a Speedup column, run-condition details are expanded, and a new collapsible methodology section documents hardware, baseline parameters, and reproducibility information.

Changes

Documentation Updates

Layer / File(s) Summary
README benchmark documentation refresh
README.md
The introductory tagline revises stated inputs/outputs and benchmark speed factors; the "Why msgf-rust?" section presents updated datasets with a Speedup column, Java baseline configuration (--precursor-cal auto), and notes on remaining feature divergences; a new collapsible "Bench methodology" section documents benchmark hardware, run parameters, conversion tooling, wall-time measurement, and reproducibility references.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Poem

📄 A rabbit hops through README rows,
Where benchmarks bloom and speedup shows,
New tables dance, methodology clear,
Documentation refreshed, crystal and dear! 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main change: updating the README's benchmark table to compare against the upstream MSGFPlus v2024.03.26 release rather than the internal baseline.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch docs/readme-bench-update-v2024

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@README.md`:
- Line 34: Update README.md to avoid pointing to environment-private paths:
replace the two absolute VM-only paths `/srv/data/msgf-bench/finalize2_v2024.sh`
and `/srv/data/msgf-bench/run_percolator_docker.sh` with either
repository-relative paths (e.g., `benchmark/ci/...`) that contain public copies
of those scripts or add a clear note that these are internal-only and provide
equivalent public scripts or instructions; ensure the line mentioning
"Reproducibility" is edited so readers outside the bench VM can reproduce steps
or are directed to the internal-only notice and the public alternatives.
- Line 7: Update the bold headline line starting "**A Rust port of MS-GF+** —
takes mzML/MGF spectra + FASTA in, produces Percolator-ready `.pin` out. Matches
or beats Java MS-GF+ PSM counts at 1% FDR while running **10-28× faster**." to a
softened claim that aligns with the table (e.g., "comparable to Java MS-GF+
(within -5% to +9.8% depending on dataset) while running **10-28× faster**" or
"near parity to Java MS-GF+ (up to +9.8% depending on dataset) while running
**10-28× faster**"); keep the performance multiplier unchanged and ensure the
new wording references the dataset-dependent range shown in the results table.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 58855d7b-9192-499e-b12c-f57d2929be10

📥 Commits

Reviewing files that changed from the base of the PR and between 5711ddd and 086166a.

📒 Files selected for processing (1)
  • README.md

Comment thread README.md
[![License: UCSD-Noncommercial](https://img.shields.io/badge/license-UCSD--Noncommercial-blue)](LICENSE)

> **A Rust port of MS-GF+** — takes mzML/MGF spectra + FASTA in, produces Percolator-ready `.pin` out. Beats Java MS-GF+ on all three benchmark datasets at 1% FDR while running 14-330% faster.
> **A Rust port of MS-GF+** — takes mzML/MGF spectra + FASTA in, produces Percolator-ready `.pin` out. Matches or beats Java MS-GF+ PSM counts at 1% FDR while running **10-28× faster**.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align the headline claim with the table results.

“Matches or beats Java ... PSM counts” conflicts with the published table (two datasets are below Java, including -5.0%). Please soften this to reflect “near parity to +9.8% depending on dataset” (or similar).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 7, Update the bold headline line starting "**A Rust port
of MS-GF+** — takes mzML/MGF spectra + FASTA in, produces Percolator-ready
`.pin` out. Matches or beats Java MS-GF+ PSM counts at 1% FDR while running
**10-28× faster**." to a softened claim that aligns with the table (e.g.,
"comparable to Java MS-GF+ (within -5% to +9.8% depending on dataset) while
running **10-28× faster**" or "near parity to Java MS-GF+ (up to +9.8% depending
on dataset) while running **10-28× faster**"); keep the performance multiplier
unchanged and ensure the new wording references the dataset-dependent range
shown in the results table.

Comment thread README.md
- **Java → PIN:** `msgf2pin` from the percolator `3.6.5--h6351f2a_0` container (single-arg mode for concatenated-TDA mzid; the `3.7.1` container's msgf2pin has a known parser crash on this mzid output).
- **Percolator:** `percolator 3.7.1` in `quay.io/biocontainers/percolator:3.7.1--h3b5f4bd_2` with `--seed 42 --only-psms`. Same parser script for both Java and Rust PINs.
- **Wall time:** `/usr/bin/time -v` "Elapsed (wall clock) time" — does not include Percolator stage.
- **Reproducibility:** scripts at `/srv/data/msgf-bench/finalize2_v2024.sh` and `/srv/data/msgf-bench/run_percolator_docker.sh` on the bench VM.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make reproducibility references accessible outside the bench VM.

The listed /srv/data/... script paths are environment-private, so readers cannot reproduce from this README alone. Prefer repo paths (e.g., benchmark/ci/...) or explicitly mark these as internal-only and add public equivalents.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` at line 34, Update README.md to avoid pointing to
environment-private paths: replace the two absolute VM-only paths
`/srv/data/msgf-bench/finalize2_v2024.sh` and
`/srv/data/msgf-bench/run_percolator_docker.sh` with either repository-relative
paths (e.g., `benchmark/ci/...`) that contain public copies of those scripts or
add a clear note that these are internal-only and provide equivalent public
scripts or instructions; ensure the line mentioning "Reproducibility" is edited
so readers outside the bench VM can reproduce steps or are directed to the
internal-only notice and the public alternatives.

@ypriverol ypriverol merged commit 613a404 into master May 28, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant