added benchmarks for blocking tree optimization#1332

Closed
LOGANBLUE1 wants to merge 1 commit into
zinggAI:mainfrom
LOGANBLUE1:blockingTree-perf

@LOGANBLUE1 LOGANBLUE1 commented Jun 29, 2026

Contributor

Command to run benchmarks: `mvn integration-test -Pbenchmark -pl spark/core -am -DskipTests`
Output file path: `zingg/spark/core/target/benchmark-results.json`

This optimization tries to reduce expensive calls to Spark by storing values once they have been calculated and reusing them. The trade-off is that the stored values take a lot of memory.

Optimizations that I tried:

  1. Reuse a probe Canopy instead of allocating per candidate
    getNodeFromCurrent runs for every (field × function) candidate — with 10 fields × N functions that's potentially 50+ new Canopy + copyTo calls per getBestNode invocation, all immediately discarded. Only the winner needs a real object.

Fix: allocate one probe Canopy at the start of getBestNode and set function and context on it per iteration (both are just field assignments from copyTo); countEliminationsWithValues and estimateCanopies then run on the probe. Only when a new best is found, materialize a real Canopy with getNodeFromCurrent. This cuts the allocations per call from one per candidate down to one probe plus one for each new best.
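A minimal sketch of the probe-reuse idea, for illustration only. `Candidate`, `countEliminations`, and the `allocations` counter are hypothetical stand-ins and not Zingg's actual Canopy API; the sketch assumes the best candidate is the one that eliminates the fewest duplicate pairs.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the getBestNode loop; not Zingg's real classes.
class ProbeReuseSketch {
    static int allocations = 0; // counts how many Candidate objects get created

    static final class Candidate {
        int fieldIndex;
        Function<String, String> function;
        Candidate() { allocations++; }
    }

    // Counts pairs whose two sides hash differently under the candidate.
    static long countEliminations(Candidate c, String[][] left, String[][] right) {
        long eliminated = 0;
        for (int i = 0; i < left.length; i++) {
            String a = c.function.apply(left[i][c.fieldIndex]);
            String b = c.function.apply(right[i][c.fieldIndex]);
            if (!a.equals(b)) eliminated++;
        }
        return eliminated;
    }

    static Candidate getBestNode(int fieldCount, List<Function<String, String>> functions,
                                 String[][] left, String[][] right) {
        Candidate probe = new Candidate(); // single reusable probe
        Candidate best = null;
        long least = Long.MAX_VALUE;
        for (int f = 0; f < fieldCount; f++) {
            for (Function<String, String> fn : functions) {
                // Re-point the probe instead of allocating a new object.
                probe.fieldIndex = f;
                probe.function = fn;
                long eliminated = countEliminations(probe, left, right);
                if (eliminated < least) {
                    least = eliminated;
                    // Only a new best is materialized as a real object.
                    best = new Candidate();
                    best.fieldIndex = f;
                    best.function = fn;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[][] left = {{"Alice", "NY"}, {"Bob", "LA"}};
        String[][] right = {{"alice", "NY"}, {"Rob", "LA"}};
        List<Function<String, String>> functions =
                List.of(s -> s.toLowerCase(), s -> s.substring(0, 1));
        Candidate best = getBestNode(2, functions, left, right);
        System.out.println("best field=" + best.fieldIndex + ", allocations=" + allocations);
    }
}
```

The number of allocations in this sketch depends on how often the running best improves, so the saving grows with the number of candidates that lose.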

  2. Pass precomputed values into buildDupeRemaining
    buildDupeRemaining calls function.apply(r, context.fieldName) per row — full row field extraction — even though getBestNode already has preVals1/preVals2 for that exact field. The problem is those arrays are local to the field loop and buildDupeRemaining is called after the loop ends, when the winning field's preVals are out of scope.

Fix: when a new best is found inside the loop, snapshot bestPreVals1 and bestPreVals2 alongside it. Pass them to buildDupeRemaining so it can call function.applyToValue(bestPreVals1[i]) instead of function.apply(r, fieldName) — same savings as countEliminationsWithValues already gets.
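The difference between the two call shapes can be sketched roughly as follows. All names here (`extract`, `buildDupeRemainingBaseline`, `buildDupeRemainingFromValues`, the `extractions` counter) are hypothetical, and the sketch assumes that "remaining" means the pairs whose two sides still hash to the same value; the real Zingg signatures may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration; rows are modelled as simple maps.
class PrecomputedValuesSketch {
    static int extractions = 0; // counts per-row field lookups

    static String extract(Map<String, String> row, String fieldName) {
        extractions++;
        return row.get(fieldName);
    }

    // Baseline shape: re-extracts the field from every row.
    static List<Integer> buildDupeRemainingBaseline(List<Map<String, String>> left,
            List<Map<String, String>> right, String fieldName, Function<String, String> fn) {
        List<Integer> remaining = new ArrayList<>();
        for (int i = 0; i < left.size(); i++) {
            String a = fn.apply(extract(left.get(i), fieldName));
            String b = fn.apply(extract(right.get(i), fieldName));
            if (a.equals(b)) remaining.add(i);
        }
        return remaining;
    }

    // Proposed shape: works on the values snapshotted for the winning field.
    static List<Integer> buildDupeRemainingFromValues(String[] bestPreVals1,
            String[] bestPreVals2, Function<String, String> fn) {
        List<Integer> remaining = new ArrayList<>();
        for (int i = 0; i < bestPreVals1.length; i++) {
            if (fn.apply(bestPreVals1[i]).equals(fn.apply(bestPreVals2[i]))) remaining.add(i);
        }
        return remaining;
    }

    // Done once per field inside the candidate loop.
    static String[] precompute(List<Map<String, String>> rows, String fieldName) {
        String[] vals = new String[rows.size()];
        for (int i = 0; i < vals.length; i++) vals[i] = extract(rows.get(i), fieldName);
        return vals;
    }

    public static void main(String[] args) {
        List<Map<String, String>> left = List.of(Map.of("name", "Alice"), Map.of("name", "Bob"));
        List<Map<String, String>> right = List.of(Map.of("name", "alice"), Map.of("name", "Rob"));
        Function<String, String> lower = s -> s.toLowerCase();
        String[] bestPreVals1 = precompute(left, "name");
        String[] bestPreVals2 = precompute(right, "name");
        int before = extractions;
        List<Integer> remaining = buildDupeRemainingFromValues(bestPreVals1, bestPreVals2, lower);
        System.out.println(remaining + ", extra extractions=" + (extractions - before));
    }
}
```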

  3. Fuse estimateCanopies + getCanopies
    estimateCanopies iterates all of training, hashes each row, builds a HashSet → returns count, discards the set. Then getCanopies iterates all of training again, hashes each row a second time, builds a ListMap<hash, rows>.

For the winner these are two full passes over training computing identical hash values. If estimateCanopies instead built the full ListMap and cached it on the canopy, then getCanopies could return the cached result directly — saving one training scan per tree node expansion. The only cost is slightly higher memory for the winning candidate while getBestNode finishes (the non-winners still build only a HashSet).
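One way the fused version could look, as a rough sketch. `FusedCanopySketch`, `groupByHash`, and `trainingScans` are hypothetical names, rows are plain strings, and a `LinkedHashMap` stands in for the ListMap. For simplicity the sketch caches the grouping on whichever canopy `estimateCanopies` is called on; limiting the full map to the winner, as described above, is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of caching the grouping built during estimation.
class FusedCanopySketch {
    int trainingScans = 0; // counts full passes over the training rows
    private final List<String> training;
    private final Function<String, Object> hash;
    private Map<Object, List<String>> cachedGroups;

    FusedCanopySketch(List<String> training, Function<String, Object> hash) {
        this.training = training;
        this.hash = hash;
    }

    // One pass: hash each row once and group rows by hash value.
    private Map<Object, List<String>> groupByHash() {
        trainingScans++;
        Map<Object, List<String>> groups = new LinkedHashMap<>();
        for (String row : training) {
            groups.computeIfAbsent(hash.apply(row), k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    // Returns the number of distinct hashes and keeps the grouping around.
    long estimateCanopies() {
        cachedGroups = groupByHash();
        return cachedGroups.size();
    }

    // Reuses the cached grouping when available instead of rescanning.
    Map<Object, List<String>> getCanopies() {
        if (cachedGroups == null) cachedGroups = groupByHash();
        return cachedGroups;
    }

    public static void main(String[] args) {
        FusedCanopySketch canopy = new FusedCanopySketch(
                List.of("apple", "avocado", "banana"), s -> s.substring(0, 1));
        long estimate = canopy.estimateCanopies();
        Map<Object, List<String>> groups = canopy.getCanopies();
        System.out.println(estimate + " canopies, groups=" + groups
                + ", scans=" + canopy.trainingScans);
    }
}
```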

  4. Adaptive field ordering
    The early exit at `least == 0` (and the `> limit` check in countEliminationsWithValues) only pays off if a strong candidate appears early. With fields evaluated in config order, you might process 7 weak fields before hitting the best one.

Fix: maintain a Map<String, Long> avgElimByField that is updated after each getBestNode call. Sort adjustedFieldOfInterestList by descending average elimination count before the loop. This costs one sort per getBestNode call but pays back by making `least` drop faster, which causes more early exits in the tail of the candidate list. No change to correctness, low implementation risk.
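A small sketch of how the bookkeeping and sort might be wired up. `AdaptiveFieldOrderSketch`, `record`, `average`, and `ordered` are hypothetical names. The sketch keeps a running total and a call count per field rather than a single average map, and it assumes fields with no history get an average of 0.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of ordering fields by observed elimination counts.
class AdaptiveFieldOrderSketch {
    private final Map<String, Long> totalElimByField = new HashMap<>();
    private final Map<String, Long> callsByField = new HashMap<>();

    // Called after each getBestNode-style evaluation of a field.
    void record(String field, long eliminated) {
        totalElimByField.merge(field, eliminated, Long::sum);
        callsByField.merge(field, 1L, Long::sum);
    }

    // Integer average; unseen fields default to 0 (an assumption of this sketch).
    long average(String field) {
        Long calls = callsByField.get(field);
        return calls == null ? 0L : totalElimByField.get(field) / calls;
    }

    // Returns a sorted copy, highest average elimination count first.
    List<String> ordered(List<String> fields) {
        Comparator<String> byAverage = Comparator.comparingLong(this::average);
        List<String> sorted = new ArrayList<>(fields);
        sorted.sort(byAverage.reversed()); // stable sort keeps config order on ties
        return sorted;
    }

    public static void main(String[] args) {
        AdaptiveFieldOrderSketch stats = new AdaptiveFieldOrderSketch();
        stats.record("name", 10);
        stats.record("name", 20);
        stats.record("city", 2);
        stats.record("zip", 40);
        System.out.println(stats.ordered(List.of("name", "city", "zip", "email")));
    }
}
```

The sort direction follows the description above. If it turns out that lower elimination counts are what tighten `least`, dropping `.reversed()` switches the order to ascending.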

Combined impact: (1) + (2) together eliminate the biggest source of garbage in the hot loop. (3) saves a full training scan per tree node. (4) amplifies the early-exit that's already in place. All four are single-threaded improvements — they'd stack on top of any parallelization you add later.

@LOGANBLUE1 LOGANBLUE1 closed this Jun 30, 2026