Bug fixes, paired tests, k:1 matching, SMD diagnostics, BH-FDR, vectorized PERMANOVA, and interface parity#23
Open
l1joseph wants to merge 12 commits into
Open
Conversation
- A1/A5: Replace reduce() with set().union(*...) in controls property to handle empty maps - A2: Normalize on_failure to lowercase before comparisons in match_by_single and shuffle - A3/A10: Rewrite tolerance parsing in Q2 methods with per-token error messages - A6: Sort (case, control) pairs before comparing in __lt__/__gt__ for deterministic ordering - A7: Include missing case IDs in 'not matched to a control' warning message - A8: Add rtol=0 to np.isclose() to prevent false positives on large values - A9: Replace deprecated Series.iteritems() with .items() - sample_size fix: compute per-iteration case count instead of using first iteration only - Add basic _resolve_annotation() to decorators to handle string annotations (PEP 563) - Update test_type_incompat to use pure-string Series (pandas 2.0 changed is_string_dtype) - Update test_no_more_controls_no_strict to match new warning message format
- TestBugFixes: cover on_failure case-insensitivity (A2), empty intersection handling (A4), controls-on-empty-map (A5), sort-stable comparison (A6) - test_casematch_utils.py: cover _match_continuous rtol=0 behavior (A8) - test_stats.py: cover per-iteration sample_size computation - q2/tests: add test_malformed_tolerance for A10 error messages (3 cases) - match_by_multiple: delete cases with empty intersections instead of carrying empty sets forward (A4 completion) and warn when on_failure='warn' - q2/_methods.py: align error message wording with test expectations
….toml - Add 'from __future__ import annotations' to all source modules for lazy evaluation (enables forward refs and | union syntax on Python 3.9+) - Replace typing.Dict/List/Set with builtin dict/list/set generics - Replace Union[X, Y] with X | Y pipe syntax - Remove now-redundant Dict/List/Set/Union from typing imports - Drop string forward refs where __future__ annotations makes them unnecessary - Add pyproject.toml (build system metadata, replaces setup.cfg boilerplate) - Update .github/workflows, Makefile, README, pytest.ini for new tooling
- F1: _load() return annotation updated to dict[str, set] (landed in C refactor) - F2: guard to_series() against empty case_control_map (returns empty Series) - F3: warn in to_dataframe() when strict=False produces NaN across matchings - F4: validate on_failure in match_by_multiple before any other logic - F5: reject negative tolerances in match_by_single and Q2 _methods.py - F6: remove dead all_matches = set() overwritten by Parallel() assignment - F7: raise ValueError for iterations < 1 in create_matched_pairs - F8: add qupid/cli/__init__.py so find_packages includes the CLI subpackage - F9: graft config/ in MANIFEST.in so coverage ini files ship in sdist - F10: handle PEP 604 unions and typing.Union in _resolve_annotation() - F11: decorate CaseMatchOneToOne with @total_ordering for __le__/__ge__ - F12: pre-filter focus to surviving cases in match_by_multiple to eliminate duplicate 'no matches' warnings and unnecessary computation - Add TestCodeReviewFindings class and Q2 negative-tolerance test covering F2-F5/F7/F11/F12
D1: add paired-t and wilcoxon-sr options to bulk_univariate_test;
_single_univariate_test aligns values by matched pairing when paired=True
D2: add compute_covariate_balance returning SMD pre/post matching
(Cohen's d, numeric/bool only); exported from top-level __init__
D3: add create_matched_groups(n_controls) on CaseMatchOneToMany; each
iteration assigns N distinct controls per case via sequential HK rounds,
removing chosen controls between rounds; returns list[CaseMatchOneToMany]
D4: add correct=bool to bulk_univariate_test and bulk_permanova;
BH-FDR q_value column appended when correct=True
Refactor: _TEST_REGISTRY dict replaces if/elif dispatch in bulk_univariate_test;
_hk_round extracted from _get_cm_one_to_one; _pairs() deduplicates
CaseMatchOneToOne ordering/hash; misc simplifications from /simplify pass
Perf: - CaseMatchOneToOne.__hash__: drop redundant sort before frozenset (frozenset is order-independent; O(n log n) wasted work per dedup call) CLI (qupid/cli/cli.py): - Refactor shuffle to share _SHUFFLE_OPTIONS, _load_focus_background, _build_cats_and_tols helpers with new commands - match-groups: k:1 matching -> long-format TSV (iteration, case, control) - assess-univariate: full test selection + BH-FDR flag - assess-multivariate: PERMANOVA + BH-FDR flag - balance: SMD pre/post covariate balance report Q2 plugin: - assess_matches_univariate: add test selection + correct=Bool params - assess_matches_multivariate: add correct=Bool param - assess_covariate_balance: new visualizer rendering a Love plot of |SMD| pre/post matching, averaged across all collection iterations - _write_index_html helper deduplicates all three visualizer HTML blocks - _descriptions.py: add TEST, CORRECT, N_CONTROLS, CATEGORIES, VALUES, DISTANCE_MATRIX, PERMUTATIONS string constants
_fast_permanova computes all 999 permutations as two batched matrix multiplies using the centered Gram matrix G = -0.5 * H * D^2 * H. Permutation labels are generated in one vectorized rng.permuted call (no Python loop). Achieves ~6-7x speedup on a 90x90 distance matrix (45 cases + 45 controls) vs the scikit-bio implementation. _single_permanova now extracts the submatrix via numpy fancy indexing instead of skbio's Cython .filter(), which rejects read-only mmap arrays that joblib loky workers receive for large distance matrices. Pseudo-F values agree with skbio to <1e-13 (floating-point precision).
New public API surface added since 0.1.0: - create_matched_groups (k:1 matching) on CaseMatchOneToMany - compute_covariate_balance (SMD diagnostic) - paired-t / wilcoxon-sr options on bulk_univariate_test - correct=bool (BH-FDR) on bulk_permanova and bulk_univariate_test - _fast_permanova vectorized PERMANOVA - 4 new CLI commands and a Q2 covariate-balance visualizer All additive; no breaking changes to the 0.1.0 API.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hi @gibsramen, this PR has a few changes that include bug fixes, performance improvements, and new features built on top of qupid v0.1.0. @lucaspatel and I are aiming to submit Qupid for publication in the upcomining months. I'm more than happy to share the manuscript draft. All changes preserve the existing public API. 106/106 tests pass (
pytest qupid/tests/ qupid/q2/tests/).Bug fixes + modernization
controlsproperty crashing on emptyCaseMatchOneToManymatch_by_multipledropping cases silently instead of respectingon_failuresample_sizecomputed fromcasematches[0]instead of per-matchingto_series()crashing on emptyCaseMatchOneToOneto_dataframe()silent NaN withstrict=False(now warns)on_failureinmatch_by_multiplebefore any loop workiterations < 1guard tocreate_matched_pairs@total_orderingonCaseMatchOneToOnefor<=/>=supportPerformance
_fast_permanova: replaces scikit-bio's Python permutation loop with two batched numpy matrix multiplies + vectorizedrng.permuted— ~6–7× faster on typical cohort sizes (90×90 DM, 999 permutations). Pseudo-F agrees with scikit-bio to <10⁻¹². Also fixes aValueErrorwhen joblib loky workers receive read-only mmap arrays from skbio's Cython.filter().CaseMatchOneToOne.__hash__: drop unnecessary sort beforefrozenset.n_jobs=-1.New features
create_matched_groups(n_controls): k:1 matching so each case gets N distinct controls via sequential Hopcroft-Karp rounds; no double-assignment within an iteration.bulk_univariate_test: paired tests--paired-t/ttest-relandwilcoxon/wilcoxon-srthat align values by matched pairing order.compute_covariate_balance: standardized mean difference (Cohen's d) pre vs. post matching, for numeric/boolean covariates.bulk_permanova/bulk_univariate_test:correct=Trueappends a Benjamini-Hochberg FDRq_valuecolumn.Interface parity with CLI Q2
match-groups,assess-univariate,assess-multivariate,balancecommands.assess-covariate-balancevisualizer (Love plot).Version
qupid/__init__.pybumped from0.1.0-->0.2.0.PRs
If you would prefer to review these incrementally, the commits are organized so I can rebase into separate PRs in any of these splits:
_fast_permanova, hash fix,n_jobs)paired-t,wilcoxon)Just let me know and I will open the split PRs.
Testing
pytest qupid/tests/ qupid/q2/tests/ -q→ 106 passed.qiime qupid --helplists new actions)._fast_permanovapseudo-F agreement against scikit-bio on real Bray-Curtis (n=500: 4.72e-14; n=1,126: 6.06e-13) and unweighted UniFrac (n=500: 2.81e-13) matrices.