added benchmarks for blocking tree optimization by LOGANBLUE1 · Pull Request #1332 · zinggAI/zingg

LOGANBLUE1 · 2026-06-29T03:28:41Z

Command to run benchmarks: mvn integration-test -Pbenchmark -pl spark/core -am -DskipTests
Output file path: zingg/spark/core/target/benchmark-results.json

In this optimization we are trying to reduce the expensive calls to spark by storing the values calculated once, but this will take a lot of memory.

Optimizations that i tried -

Reuse a probe Canopy instead of allocating per candidate
getNodeFromCurrent runs for every (field × function) candidate — with 10 fields × N functions that's potentially 50+ new Canopy + copyTo calls per getBestNode invocation, all immediately discarded. Only the winner needs a real object.

Fix: allocate one probe Canopy at the start of getBestNode, set function and context on it per iteration (both are just field assignments from copyTo), then countEliminationsWithValues and estimateCanopies run on it. Only when a new best is found, materialize a real Canopy with getNodeFromCurrent. This cuts N allocations per call down to 1.

Pass precomputed values into buildDupeRemaining
buildDupeRemaining calls function.apply(r, context.fieldName) per row — full row field extraction — even though getBestNode already has preVals1/preVals2 for that exact field. The problem is those arrays are local to the field loop and buildDupeRemaining is called after the loop ends, when the winning field's preVals are out of scope.

Fix: when a new best is found inside the loop, snapshot bestPreVals1 and bestPreVals2 alongside it. Pass them to buildDupeRemaining so it can call function.applyToValue(bestPreVals1[i]) instead of function.apply(r, fieldName) — same savings as countEliminationsWithValues already gets.

Fuse estimateCanopies + getCanopies
estimateCanopies iterates all of training, hashes each row, builds a HashSet → returns count, discards the set. Then getCanopies iterates all of training again, hashes each row a second time, builds a ListMap<hash, rows>.

For the winner these are two full passes over training computing identical hash values. If estimateCanopies instead built the full ListMap and cached it on the canopy, then getCanopies could return the cached result directly — saving one training scan per tree node expansion. The only cost is slightly higher memory for the winning candidate while getBestNode finishes (the non-winners still build only a HashSet).

Adaptive field ordering
The early-exit at least == 0 (and the > limit in countEliminationsWithValues) only pays off if a strong candidate appears early. With fields evaluated in config order, you might process 7 weak fields before hitting the best one.

Fix: maintain a Map<String, Long> avgElimByField updated after each getBestNode call. Sort adjustedFieldOfInterestList by descending average elimination count before the loop. This costs one sort per getBestNode call but pays back by making least drop faster, which causes more early exits in the tail of the candidate list. No change to correctness, low implementation risk.

Combined impact: (1) + (2) together eliminate the biggest source of garbage in the hot loop. (3) saves a full training scan per tree node. (4) amplifies the early-exit that's already in place. All four are single-threaded improvements — they'd stack on top of any parallelization you add later.

added benchmarks

49037c9

LOGANBLUE1 closed this Jun 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

added benchmarks for blocking tree optimization#1332

added benchmarks for blocking tree optimization#1332
LOGANBLUE1 wants to merge 1 commit into
zinggAI:mainfrom
LOGANBLUE1:blockingTree-perf

LOGANBLUE1 commented Jun 29, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

LOGANBLUE1 commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

LOGANBLUE1 commented Jun 29, 2026 •

edited

Loading