Add batch querying support by CoolJosh0221 · Pull Request #198 · ntucllab/libact

CoolJosh0221 · 2026-06-30T14:41:18Z

Summary

Add QueryStrategy.make_query_batch(batch_size) with default score-based top-k selection.
Add Dataset.update_batch(entry_ids, labels) while preserving existing per-entry observer callback behavior.
Add batch-aware overrides for RandomSampling, CoreSet, and EpsilonUncertaintySampling.
Add DiversityWeightedMeta for diversity-aware batch selection over existing query strategies.
Document the batch API and add an end-to-end batch querying example.
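The default top-k selection can be pictured with a small standalone sketch. This is not libact's code: `top_k_batch` is a hypothetical helper, and it assumes scores where higher means more preferred. It follows the validation and stable tie-breaking rules stated in the commit message.

```python
import numpy as np


def top_k_batch(entry_ids, scores, batch_size):
    """Return the batch_size highest-scoring entry ids, most preferred first.

    Hypothetical helper for illustration; assumes higher score = more preferred.
    """
    if not isinstance(batch_size, (int, np.integer)):
        raise TypeError("batch_size must be an integer")
    entry_ids = np.asarray(entry_ids)
    scores = np.asarray(scores, dtype=float)
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if entry_ids.size == 0:
        raise ValueError("the unlabeled pool is empty")
    if batch_size > entry_ids.size:
        # No silent clamp: asking for more than the pool holds is an error.
        raise ValueError("batch_size exceeds the number of unlabeled entries")
    # Stable sort on negated scores gives descending order while keeping
    # tied entries in their original pool order.
    order = np.argsort(-scores, kind="stable")
    return entry_ids[order[:batch_size]]


batch = top_k_batch([10, 11, 12, 13], [0.2, 0.9, 0.9, 0.5], batch_size=3)
```

Entries 11 and 12 tie on score, so the stable sort keeps them in pool order ahead of entry 13.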

Notes

make_query() keeps returning a single entry id; existing public single-query behavior is unchanged.
Dataset.update_batch() validates length, dimensionality, and duplicate entry ids before applying per-entry updates.
ALBL and VarianceReduction explicitly reject make_query_batch() because their existing semantics are not directly batch-compatible.
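The validate-first, then per-entry update pattern described for Dataset.update_batch() might look roughly like the sketch below. `TinyDataset` is a hypothetical stand-in, not libact's Dataset, and the dimensionality check is omitted for brevity.

```python
class TinyDataset:
    """Hypothetical stand-in for a label pool; not libact's Dataset."""

    def __init__(self, n_entries):
        self.labels = [None] * n_entries
        self._observers = []

    def on_update(self, callback):
        # Register a per-entry observer called as callback(entry_id, label).
        self._observers.append(callback)

    def update(self, entry_id, label):
        self.labels[entry_id] = label
        for callback in self._observers:
            callback(entry_id, label)

    def update_batch(self, entry_ids, labels):
        entry_ids, labels = list(entry_ids), list(labels)
        # Validate everything before touching any label, so a bad batch
        # leaves the dataset unchanged.
        if len(entry_ids) != len(labels):
            raise ValueError("entry_ids and labels must have the same length")
        if len(set(entry_ids)) != len(entry_ids):
            raise ValueError("duplicate entry ids in batch")
        # Reuse the single-entry path so observers see the same stream of
        # (entry_id, label) notifications as sequential update() calls.
        for entry_id, label in zip(entry_ids, labels):
            self.update(entry_id, label)


seen = []
dataset = TinyDataset(4)
dataset.on_update(lambda entry_id, label: seen.append((entry_id, label)))
dataset.update_batch([2, 0], [1, 0])
```

Routing the batch through `update()` keeps observer-dependent strategies unaware that batching happened at all.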

Tests

python -m unittest -v
- 217 tests passed
MPLBACKEND=Agg MPLCONFIGDIR=/tmp/libact-mpl python examples/batch_query_plot.py
- sequential uncertainty sampling: 120 training rounds for 120 labels
- top-k batch uncertainty sampling: 12 training rounds for 120 labels
- diversity batch uncertainty sampling: 12 training rounds for 120 labels

…WeightedMeta

Selecting N samples previously cost N training rounds, since make_query() returns one entry id per call. This adds a batch path built on the standardized _get_scores() contract (ntucllab#197), while keeping make_query()'s single-int public contract and the per-entry (entry_id, label) observer callback untouched.

- QueryStrategy.make_query_batch(batch_size): default = stable descending top-k of _get_scores(); returns np.ndarray of distinct entry ids, most preferred first. TypeError for non-integer batch_size; ValueError for batch_size < 1, an empty pool, or batch_size > n_unlabeled (no silent clamp). Ties break deterministically (stable sort), so make_query_batch(1) may differ from make_query() only at ties.
- Semantic overrides where top-k is unfaithful:
  - RandomSampling: uniform sampling without replacement.
  - CoreSet: true iterative k-center greedy (Sener & Savarese 2018) with a running min-distance vector; honors metric and transformer.
  - EpsilonUncertaintySampling: Binomial(batch_size, epsilon) exploration picks drawn from the complement of the top-uncertainty picks, so the batch is always exactly batch_size distinct ids.
  - ALBL and VarianceReduction: explicit NotImplementedError (inherently sequential / no per-sample scoring).
- DiversityWeightedMeta: wraps any score-based strategy so batches are not just top-k with near-duplicate redundancy. Greedy utility (1 - lmbda) * s_norm + lmbda * d_norm, where s_norm is a monotone min-max of the base scores (rank-faithful: never re-interprets score semantics, so confidence-flavored scores like HintSVM's are handled by construction) and d is the min distance to already-selected batch members. First pick = base argmax. Optional candidate_pool_size cap.
- Dataset.update_batch(entry_ids, labels): validates lengths, rejects duplicate ids, empty input is a no-op; applies labels through the existing per-entry update() path so observers (ALBL, QUIRE, QBC) see exactly the same incremental notification stream as sequential calls.
- CoreSet._get_scores: cdist -> sklearn pairwise_distances, fixing a latent crash on sparse feature matrices (identical dense results).
- Tests: 81 new (batch contract across all strategies, override semantics, diversity guarantee with a near-duplicate control fixture, rank-faithfulness with negative/adversarial-magnitude mocks, sparse inputs, update_batch equivalence with sequential updates for QUIRE/QBC/ALBL observers, error paths). Full suite: 217 passing.
- Docs: Sphinx entry, README batch-querying section, and examples/batch_query_plot.py (12 vs 120 training rounds for the same 120-label budget on the diabetes dataset).
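The greedy utility described for DiversityWeightedMeta can be illustrated with a short standalone sketch. It makes several assumptions that the commit message does not confirm: Euclidean distance, min-max normalization of the distance term as well, and no candidate_pool_size cap. `diversity_batch` and `min_max` are hypothetical names, not libact functions. The running min-distance vector is the same bookkeeping idea mentioned for the k-center greedy override.

```python
import numpy as np


def min_max(values):
    """Monotone rescaling to [0, 1]; constant input maps to all zeros."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span


def diversity_batch(X, scores, batch_size, lmbda=0.5):
    """Greedy pick maximizing (1 - lmbda) * s_norm + lmbda * d_norm.

    Illustrative sketch only; returns row indices into X.
    """
    X = np.asarray(X, dtype=float)
    s_norm = min_max(scores)
    first = int(np.argmax(s_norm))  # first pick = base argmax
    selected = [first]
    # Running min distance from each candidate to the selected set.
    min_dist = np.linalg.norm(X - X[first], axis=1)
    while len(selected) < batch_size:
        utility = (1 - lmbda) * s_norm + lmbda * min_max(min_dist)
        utility[selected] = -np.inf  # never pick the same entry twice
        pick = int(np.argmax(utility))
        selected.append(pick)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[pick], axis=1))
    return np.array(selected)


# Rows 0 and 1 are near-duplicates with the two highest scores.
X = [[0.0, 0.0], [0.01, 0.0], [5.0, 5.0], [10.0, 10.0]]
scores = [1.0, 0.99, 0.5, 0.4]
diverse = diversity_batch(X, scores, batch_size=2, lmbda=0.7)
plain = diversity_batch(X, scores, batch_size=2, lmbda=0.0)
```

With lmbda=0.0 the utility reduces to the normalized base score, so the sketch returns the plain top-k pair of near-duplicates; with lmbda=0.7 the second pick moves to the far-away row instead.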

scikit-learn deprecated `multi_class` in 1.5 and removed it in 1.7, so `LogisticRegression(..., multi_class="ovr")` now raises TypeError. On top of that, `solver="liblinear"` no longer performs one-vs-rest for multiclass data (n_classes >= 3): it raises and directs callers to OneVsRestClassifier. Together these broke 24 tests under scikit-learn 1.8.

All changes are behavior-preserving:

- Drop `multi_class="ovr"` wherever it was paired with `solver="liblinear"`; liblinear only ever did one-vs-rest, so removing it is a no-op.
- Drop `'multi_class': 'multinomial'` from the default logreg_param of MaximumLossReductionMaximalConfidence; multinomial is now the default for the newton-cg solver.
- EER iris tests: wrap in SklearnProbaAdapter(OneVsRestClassifier(LogisticRegression(solver="liblinear"))) to retain one-vs-rest. Verified to reproduce the exact recorded query sequences, so no assertion values changed.
- LogisticRegression / SklearnAdapter delegation tests: use the default solver with max_iter=1000 on both sides (the tests only assert wrapper-vs-sklearn equality, so the solver is incidental; max_iter avoids a ConvergenceWarning on unscaled iris).
- Update the CostSensitiveReferencePairEncoding docstring example.

Pre-existing issue independent of the batch-querying work; fails on master too. 217 passed (was 24 failed, 193 passed).
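For reference, the explicit one-vs-rest wrapping used in the EER iris tests can be reproduced with scikit-learn alone. This sketch leaves out libact's SklearnProbaAdapter and only shows the scikit-learn side, assuming scikit-learn is installed.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# No multi_class argument: OneVsRestClassifier fits one binary liblinear
# model per class, so liblinear never sees a multiclass problem.
ovr_model = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
ovr_model.fit(X, y)

# One probability column per iris class.
proba = ovr_model.predict_proba(X)
```

Because the wrapper owns the multiclass decomposition, the same code should run on scikit-learn versions both before and after the `multi_class` removal.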

CoolJosh0221 added 2 commits June 30, 2026 21:15

CoolJosh0221 wants to merge 2 commits into ntucllab:master from CoolJosh0221:batch-mode-upstream
