feat: route additional scalar expressions through codegen dispatcher#4538
Open
andygrove wants to merge 8 commits into
Open
feat: route additional scalar expressions through codegen dispatcher#4538andygrove wants to merge 8 commits into
andygrove wants to merge 8 commits into
Conversation
…[skip ci] Add JVM codegen dispatch support for Spark scalar expressions that Comet did not previously support natively. Each routes through CometCodegenDispatch, so Spark's own doGenCode runs inside the Comet pipeline and behavior matches Spark exactly, with a clean fallback to Spark when the dispatcher is disabled. Expressions added: - math: hypot, nanvl, bround, conv, log1p, pmod, width_bucket, positive - string: levenshtein, elt, find_in_set, format_number, format_string, overlay, soundex, locate, unbase64, to_char, to_number - array: sequence Each expression has a SQL file test that asserts both answer parity and native execution (checkSparkAnswerAndOperator).
…fault [skip ci] Implements the dispatcher change from issue apache#4506. When a native expression reports Incompatible and the user has not opted into the native divergence via spark.comet.expr.allowIncompatible, the dispatcher now prefers routing the expression through the JVM codegen dispatcher (Spark's own doGenCode runs inside the Comet pipeline) instead of falling back to Spark for the whole projection. It falls back only when the dispatcher cannot handle the expression, when the dispatcher is disabled, or when the expression opts out. A new CometExpressionSerde.allowIncompatCodegenDispatch hook (default true) lets specific serdes opt out. The json/regexp expressions covered by separate open work (regexp_replace, split, get_json_object, from_json, to_csv) and map_from_entries (the dispatcher does not yet evaluate map-typed outputs correctly) opt out so this generic routing does not pre-empt or miscompile them. Tests: - from_unixtime now executes natively via dispatch (test updated from expect_fallback to native coverage). - hour/minute/second gain TimestampNTZ coverage, exercising the conditional Incompatible branch that now stays native.
…cher Map-typed outputs are now evaluated correctly by the JVM codegen dispatcher, so CometMapFromEntries no longer needs to opt out of allowIncompatCodegenDispatch. The BinaryType key/value cases now execute natively instead of falling back to Spark; the SQL file test asserts native execution and answer parity for those cases.
map_concat was previously dropped because the codegen dispatcher emitted a wrong map key. Map-typed outputs are now evaluated correctly, so register MapConcat as a CometCodegenDispatch serde and add a SQL file test asserting native execution and answer parity (column inputs, empty and NULL maps, multi-map literals, and integer-keyed maps).
Flip the expressions implemented in this PR from Planned to Supported: math (nanvl, bround, conv, hypot, log1p, pmod), string (elt, find_in_set, format_number, format_string, levenshtein, locate, overlay, position, printf, soundex, to_char, to_number, to_varchar, unbase64), array (sequence), and map (map_concat). Also drop the stale BinaryType fallback note on map_from_entries, which now executes natively via the codegen dispatcher.
…e exprs The codegen dispatcher now keeps Incompatible expressions native instead of falling back to Spark. Update the trunc/date_trunc (non-literal format), array_reverse (binary array), and map_from_entries (binary key/value) tests to assert native execution rather than a fallback reason.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Part of #4506.
Rationale for this change
Comet has a JVM codegen dispatcher (
CometCodegenDispatch/CometScalaUDF.emitJvmCodegenDispatch) that runs a Spark expression's owndoGenCodeinside the Comet native pipeline. This keeps a query native even when there is no Rust implementation (or the Rust implementation diverges from Spark), while guaranteeing behavior matches Spark exactly across supported Spark versions. When the dispatcher is disabled the operator falls back to Spark cleanly.This PR uses that mechanism in two ways:
Incompatiblenative path no longer forces the whole projection back to Spark. Instead the divergent expression is routed through the codegen dispatcher and evaluated correctly inside Comet, withspark.comet.expr.allowIncompatible=trueleft purely as a perf knob for users who accept the native-path divergence.What changes are included in this PR?
New scalar expressions (registered as
CometCodegenDispatchserdes inQueryPlanSerde):hypot,nanvl,bround,conv,log1p,pmod,width_bucket,positivelevenshtein,elt,find_in_set,format_number,format_string,overlay,soundex,locate,unbase64,to_char,to_numbersequencemap_concatDispatcher change for
Incompatibleexpressions (#4506):QueryPlanSerdenow routes anIncompatibleexpression through the codegen dispatcher (whenallowIncompatible=false) before falling back to Spark.CometExpressionSerde.allowIncompatCodegenDispatchhook (defaulttrue) lets a serde opt out.Incompatibleexpression. For examplefrom_unixtimenow executes natively, and theTimestampNTZbranches ofhour/minute/secondnow stay native.Docs:
docs/source/user-guide/latest/expressions.mdmarks the newly dispatched expressions (andmap_concat) as supported.Scope notes:
regexp_replace,split,get_json_object,from_json,to_csvopt out of the generic routing.map_concatis dispatched and theIncompatibleBinaryTypekey/value case ofmap_from_entriesnow executes natively instead of falling back.RuntimeReplaceableexpressions (rewritten before serde),CodegenFallbackexpressions (e.g. xpath), higher-order functions, interval/null output types, and folded-at-plan-time expressions (current_*).try_to_number(throws instead of returning NULL on invalid input) andencode(lowers toStaticInvokeso the class is never seen).Incompatibleexpressions that areCodegenFallback(timezone conversions,parse_url) are not dispatchable and continue to fall back; they need native fixes rather than dispatch.collect_setis an aggregate and uses a different serialization path, so it is out of scope here.from_unixtime, NTZhour/minute/second) do not, so no golden-file changes are expected.How are these changes tested?
Each expression has a SQL file test under
spark/src/test/resources/sql-tests/expressions/run byCometSqlFileTestSuite. Thequerymode usescheckSparkAnswerAndOperator, which asserts both answer parity with Spark and that the expression executed natively (a fallback fails the test). This includesmap_concatand the now-nativeBinaryTypecases ofmap_from_entries. New and updated fixtures pass against Spark 3.5, with no new regressions in the suite.