Skip to content

Bug fixes, paired tests, k:1 matching, SMD diagnostics, BH-FDR, vectorized PERMANOVA, and interface parity#23

Open
l1joseph wants to merge 12 commits into
gibsramen:mainfrom
l1joseph:fix-modernize-pandas2
Open

Bug fixes, paired tests, k:1 matching, SMD diagnostics, BH-FDR, vectorized PERMANOVA, and interface parity#23
l1joseph wants to merge 12 commits into
gibsramen:mainfrom
l1joseph:fix-modernize-pandas2

Conversation

@l1joseph

Copy link
Copy Markdown

Hi @gibsramen, this PR has a few changes that include bug fixes, performance improvements, and new features built on top of qupid v0.1.0. @lucaspatel and I are aiming to submit Qupid for publication in the upcomining months. I'm more than happy to share the manuscript draft. All changes preserve the existing public API. 106/106 tests pass (pytest qupid/tests/ qupid/q2/tests/).

Bug fixes + modernization

  • Fix controls property crashing on empty CaseMatchOneToMany
  • Fix match_by_multiple dropping cases silently instead of respecting on_failure
  • Fix sample_size computed from casematches[0] instead of per-matching
  • Fix to_series() crashing on empty CaseMatchOneToOne
  • Fix to_dataframe() silent NaN with strict=False (now warns)
  • Validate on_failure in match_by_multiple before any loop work
  • Reject negative tolerances in both Python API and QIIME 2 CLI
  • Add iterations < 1 guard to create_matched_pairs
  • @total_ordering on CaseMatchOneToOne for <=/>= support
  • PEP 585/604 type annotation modernization throughout

Performance

  • _fast_permanova: replaces scikit-bio's Python permutation loop with two batched numpy matrix multiplies + vectorized rng.permuted~6–7× faster on typical cohort sizes (90×90 DM, 999 permutations). Pseudo-F agrees with scikit-bio to <10⁻¹². Also fixes a ValueError when joblib loky workers receive read-only mmap arrays from skbio's Cython .filter().
  • CaseMatchOneToOne.__hash__: drop unnecessary sort before frozenset.
  • CLI/Q2 statistical assessment commands now default to n_jobs=-1.

New features

  • create_matched_groups(n_controls): k:1 matching so each case gets N distinct controls via sequential Hopcroft-Karp rounds; no double-assignment within an iteration.
  • bulk_univariate_test: paired tests-- paired-t / ttest-rel and wilcoxon / wilcoxon-sr that align values by matched pairing order.
  • compute_covariate_balance: standardized mean difference (Cohen's d) pre vs. post matching, for numeric/boolean covariates.
  • bulk_permanova / bulk_univariate_test: correct=True appends a Benjamini-Hochberg FDR q_value column.

Interface parity with CLI Q2

  • CLI: match-groups, assess-univariate, assess-multivariate, balance commands.
  • QIIME 2: test selection and FDR flag on both existing visualizers; new assess-covariate-balance visualizer (Love plot).

Version

qupid/__init__.py bumped from 0.1.0 --> 0.2.0.

PRs

If you would prefer to review these incrementally, the commits are organized so I can rebase into separate PRs in any of these splits:

  1. Bug fixes + perf (_fast_permanova, hash fix, n_jobs)
  2. k:1 matching + BH-FDR
  3. Paired tests (paired-t, wilcoxon)
  4. SMD diagnostics + CLI/Q2 interface

Just let me know and I will open the split PRs.

Testing

  • pytest qupid/tests/ qupid/q2/tests/ -q → 106 passed.
  • Manual CLI test for the 4 new subcommands.
  • Q2 plugin loads (qiime qupid --help lists new actions).
  • Verified _fast_permanova pseudo-F agreement against scikit-bio on real Bray-Curtis (n=500: 4.72e-14; n=1,126: 6.06e-13) and unweighted UniFrac (n=500: 2.81e-13) matrices.

l1joseph added 12 commits May 29, 2026 14:58
- A1/A5: Replace reduce() with set().union(*...) in controls property to handle empty maps
- A2: Normalize on_failure to lowercase before comparisons in match_by_single and shuffle
- A3/A10: Rewrite tolerance parsing in Q2 methods with per-token error messages
- A6: Sort (case, control) pairs before comparing in __lt__/__gt__ for deterministic ordering
- A7: Include missing case IDs in 'not matched to a control' warning message
- A8: Add rtol=0 to np.isclose() to prevent false positives on large values
- A9: Replace deprecated Series.iteritems() with .items()
- sample_size fix: compute per-iteration case count instead of using first iteration only
- Add basic _resolve_annotation() to decorators to handle string annotations (PEP 563)
- Update test_type_incompat to use pure-string Series (pandas 2.0 changed is_string_dtype)
- Update test_no_more_controls_no_strict to match new warning message format
- TestBugFixes: cover on_failure case-insensitivity (A2), empty intersection
  handling (A4), controls-on-empty-map (A5), sort-stable comparison (A6)
- test_casematch_utils.py: cover _match_continuous rtol=0 behavior (A8)
- test_stats.py: cover per-iteration sample_size computation
- q2/tests: add test_malformed_tolerance for A10 error messages (3 cases)
- match_by_multiple: delete cases with empty intersections instead of
  carrying empty sets forward (A4 completion) and warn when on_failure='warn'
- q2/_methods.py: align error message wording with test expectations
….toml

- Add 'from __future__ import annotations' to all source modules for lazy
  evaluation (enables forward refs and | union syntax on Python 3.9+)
- Replace typing.Dict/List/Set with builtin dict/list/set generics
- Replace Union[X, Y] with X | Y pipe syntax
- Remove now-redundant Dict/List/Set/Union from typing imports
- Drop string forward refs where __future__ annotations makes them unnecessary
- Add pyproject.toml (build system metadata, replaces setup.cfg boilerplate)
- Update .github/workflows, Makefile, README, pytest.ini for new tooling
- F1: _load() return annotation updated to dict[str, set] (landed in C refactor)
- F2: guard to_series() against empty case_control_map (returns empty Series)
- F3: warn in to_dataframe() when strict=False produces NaN across matchings
- F4: validate on_failure in match_by_multiple before any other logic
- F5: reject negative tolerances in match_by_single and Q2 _methods.py
- F6: remove dead all_matches = set() overwritten by Parallel() assignment
- F7: raise ValueError for iterations < 1 in create_matched_pairs
- F8: add qupid/cli/__init__.py so find_packages includes the CLI subpackage
- F9: graft config/ in MANIFEST.in so coverage ini files ship in sdist
- F10: handle PEP 604 unions and typing.Union in _resolve_annotation()
- F11: decorate CaseMatchOneToOne with @total_ordering for __le__/__ge__
- F12: pre-filter focus to surviving cases in match_by_multiple to eliminate
  duplicate 'no matches' warnings and unnecessary computation
- Add TestCodeReviewFindings class and Q2 negative-tolerance test covering F2-F5/F7/F11/F12
D1: add paired-t and wilcoxon-sr options to bulk_univariate_test;
    _single_univariate_test aligns values by matched pairing when paired=True
D2: add compute_covariate_balance returning SMD pre/post matching
    (Cohen's d, numeric/bool only); exported from top-level __init__
D3: add create_matched_groups(n_controls) on CaseMatchOneToMany; each
    iteration assigns N distinct controls per case via sequential HK rounds,
    removing chosen controls between rounds; returns list[CaseMatchOneToMany]
D4: add correct=bool to bulk_univariate_test and bulk_permanova;
    BH-FDR q_value column appended when correct=True
Refactor: _TEST_REGISTRY dict replaces if/elif dispatch in bulk_univariate_test;
    _hk_round extracted from _get_cm_one_to_one; _pairs() deduplicates
    CaseMatchOneToOne ordering/hash; misc simplifications from /simplify pass
Perf:
- CaseMatchOneToOne.__hash__: drop redundant sort before frozenset
  (frozenset is order-independent; O(n log n) wasted work per dedup call)

CLI (qupid/cli/cli.py):
- Refactor shuffle to share _SHUFFLE_OPTIONS, _load_focus_background,
  _build_cats_and_tols helpers with new commands
- match-groups: k:1 matching -> long-format TSV (iteration, case, control)
- assess-univariate: full test selection + BH-FDR flag
- assess-multivariate: PERMANOVA + BH-FDR flag
- balance: SMD pre/post covariate balance report

Q2 plugin:
- assess_matches_univariate: add test selection + correct=Bool params
- assess_matches_multivariate: add correct=Bool param
- assess_covariate_balance: new visualizer rendering a Love plot of
  |SMD| pre/post matching, averaged across all collection iterations
- _write_index_html helper deduplicates all three visualizer HTML blocks
- _descriptions.py: add TEST, CORRECT, N_CONTROLS, CATEGORIES, VALUES,
  DISTANCE_MATRIX, PERMUTATIONS string constants
_fast_permanova computes all 999 permutations as two batched matrix
multiplies using the centered Gram matrix G = -0.5 * H * D^2 * H.
Permutation labels are generated in one vectorized rng.permuted call
(no Python loop). Achieves ~6-7x speedup on a 90x90 distance matrix
(45 cases + 45 controls) vs the scikit-bio implementation.

_single_permanova now extracts the submatrix via numpy fancy indexing
instead of skbio's Cython .filter(), which rejects read-only mmap arrays
that joblib loky workers receive for large distance matrices.

Pseudo-F values agree with skbio to <1e-13 (floating-point precision).
New public API surface added since 0.1.0:
- create_matched_groups (k:1 matching) on CaseMatchOneToMany
- compute_covariate_balance (SMD diagnostic)
- paired-t / wilcoxon-sr options on bulk_univariate_test
- correct=bool (BH-FDR) on bulk_permanova and bulk_univariate_test
- _fast_permanova vectorized PERMANOVA
- 4 new CLI commands and a Q2 covariate-balance visualizer

All additive; no breaking changes to the 0.1.0 API.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant