added benchmarks for blocking tree optimization#1332
Closed
LOGANBLUE1 wants to merge 1 commit into
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Command to run benchmarks: mvn integration-test -Pbenchmark -pl spark/core -am -DskipTests
Output file path: zingg/spark/core/target/benchmark-results.json
In this optimization we are trying to reduce the expensive calls to spark by storing the values calculated once, but this will take a lot of memory.
Optimizations that i tried -
getNodeFromCurrent runs for every (field × function) candidate — with 10 fields × N functions that's potentially 50+ new Canopy + copyTo calls per getBestNode invocation, all immediately discarded. Only the winner needs a real object.
Fix: allocate one probe Canopy at the start of getBestNode, set function and context on it per iteration (both are just field assignments from copyTo), then countEliminationsWithValues and estimateCanopies run on it. Only when a new best is found, materialize a real Canopy with getNodeFromCurrent. This cuts N allocations per call down to 1.
buildDupeRemaining calls function.apply(r, context.fieldName) per row — full row field extraction — even though getBestNode already has preVals1/preVals2 for that exact field. The problem is those arrays are local to the field loop and buildDupeRemaining is called after the loop ends, when the winning field's preVals are out of scope.
Fix: when a new best is found inside the loop, snapshot bestPreVals1 and bestPreVals2 alongside it. Pass them to buildDupeRemaining so it can call function.applyToValue(bestPreVals1[i]) instead of function.apply(r, fieldName) — same savings as countEliminationsWithValues already gets.
estimateCanopies iterates all of training, hashes each row, builds a HashSet → returns count, discards the set. Then getCanopies iterates all of training again, hashes each row a second time, builds a ListMap<hash, rows>.
For the winner these are two full passes over training computing identical hash values. If estimateCanopies instead built the full ListMap and cached it on the canopy, then getCanopies could return the cached result directly — saving one training scan per tree node expansion. The only cost is slightly higher memory for the winning candidate while getBestNode finishes (the non-winners still build only a HashSet).
The early-exit at least == 0 (and the > limit in countEliminationsWithValues) only pays off if a strong candidate appears early. With fields evaluated in config order, you might process 7 weak fields before hitting the best one.
Fix: maintain a Map<String, Long> avgElimByField updated after each getBestNode call. Sort adjustedFieldOfInterestList by descending average elimination count before the loop. This costs one sort per getBestNode call but pays back by making least drop faster, which causes more early exits in the tail of the candidate list. No change to correctness, low implementation risk.
Combined impact: (1) + (2) together eliminate the biggest source of garbage in the hot loop. (3) saves a full training scan per tree node. (4) amplifies the early-exit that's already in place. All four are single-threaded improvements — they'd stack on top of any parallelization you add later.