feat: route additional scalar expressions through codegen dispatcher#4538

Open
andygrove wants to merge 8 commits into
apache:mainfrom
andygrove:codegen-dispatch-scalar-exprs

@andygrove andygrove commented May 30, 2026

Which issue does this PR close?

Part of #4506.

Rationale for this change

Comet has a JVM codegen dispatcher (CometCodegenDispatch / CometScalaUDF.emitJvmCodegenDispatch) that runs a Spark expression's own doGenCode inside the Comet native pipeline. This keeps a query native even when there is no Rust implementation (or the Rust implementation diverges from Spark), while guaranteeing that behavior matches Spark exactly across supported Spark versions. When the dispatcher is disabled, the operator falls back to Spark cleanly.

This PR uses that mechanism in two ways:

  1. Add native (dispatched) support for scalar expressions that previously had no Comet support and forced a fallback to Spark.
  2. Implement the dispatcher change from #4506 ([EPIC] Provide JVM/codegen-dispatch implementations for Incompatible expressions so they never fall back by default): an Incompatible native path no longer forces the whole projection back to Spark. Instead, the divergent expression is routed through the codegen dispatcher and evaluated correctly inside Comet. spark.comet.expr.allowIncompatible=true remains purely a performance knob for users who accept the native-path divergence.

What changes are included in this PR?

New scalar expressions (registered as CometCodegenDispatch serdes in QueryPlanSerde):

  • math: hypot, nanvl, bround, conv, log1p, pmod, width_bucket, positive
  • string: levenshtein, elt, find_in_set, format_number, format_string, overlay, soundex, locate, unbase64, to_char, to_number
  • array: sequence
  • map: map_concat

Dispatcher change for Incompatible expressions (#4506):

  • QueryPlanSerde now routes an Incompatible expression through the codegen dispatcher (when allowIncompatible=false) before falling back to Spark.
  • New CometExpressionSerde.allowIncompatCodegenDispatch hook (default true) lets a serde opt out.
  • This generically benefits any dispatchable Incompatible expression. For example from_unixtime now executes natively, and the TimestampNTZ branches of hour / minute / second now stay native.
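
The routing described above can be modelled as a small decision function. This is a minimal sketch in Python, not Comet's Scala code: `Support`, `Route`, `ExprInfo`, and `choose_route` are hypothetical names, and the only behavior taken from this PR is that an Incompatible expression is dispatched unless the user opted into the native path, the dispatcher is disabled, the expression is not dispatchable, or its serde opts out.

```python
from dataclasses import dataclass
from enum import Enum


class Support(Enum):
    COMPATIBLE = "compatible"
    INCOMPATIBLE = "incompatible"


class Route(Enum):
    NATIVE = "native"                   # native (Rust) implementation
    CODEGEN_DISPATCH = "dispatch"       # Spark's doGenCode inside the Comet pipeline
    SPARK_FALLBACK = "fallback"         # projection falls back to Spark


@dataclass(frozen=True)
class ExprInfo:
    # Hypothetical stand-in for what a serde knows about an expression.
    support: Support
    dispatchable: bool = True             # e.g. False for CodegenFallback expressions
    allow_incompat_dispatch: bool = True  # models the allowIncompatCodegenDispatch hook


def choose_route(expr: ExprInfo, allow_incompatible: bool, dispatcher_enabled: bool) -> Route:
    """Pick an execution route for one expression (illustrative model only)."""
    if expr.support is Support.COMPATIBLE:
        return Route.NATIVE
    # Incompatible: the user may accept the native-path divergence for speed.
    if allow_incompatible:
        return Route.NATIVE
    # Otherwise prefer dispatch; fall back only when dispatch is unavailable.
    can_dispatch = (
        dispatcher_enabled and expr.dispatchable and expr.allow_incompat_dispatch
    )
    return Route.CODEGEN_DISPATCH if can_dispatch else Route.SPARK_FALLBACK
```

Under this model, an expression such as from_unixtime (Incompatible, dispatchable, no opt-out) maps to `Route.CODEGEN_DISPATCH` with the default settings, while a serde that opts out still falls back to Spark.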

Docs:

  • docs/source/user-guide/latest/expressions.md marks the newly dispatched expressions (and map_concat) as supported.

Scope notes:

  • JSON and regexp functions are excluded because separate PRs are open for them: regexp_replace, split, get_json_object, from_json, and to_csv opt out of the generic routing.
  • Map-typed outputs are now evaluated correctly by the codegen dispatcher, so map_concat is dispatched and the Incompatible BinaryType key/value case of map_from_entries now executes natively instead of falling back.
  • Excluded from the new-expression set: generators, RuntimeReplaceable expressions (rewritten before serde), CodegenFallback expressions (e.g. xpath), higher-order functions, interval/null output types, and folded-at-plan-time expressions (current_*).
  • Dropped after testing surfaced real issues: try_to_number (throws instead of returning NULL on invalid input) and encode (lowers to StaticInvoke so the class is never seen).
  • Incompatible expressions that are CodegenFallback (timezone conversions, parse_url) are not dispatchable and continue to fall back; they need native fixes rather than dispatch.
  • collect_set is an aggregate and uses a different serialization path, so it is out of scope here.
  • TPC-DS physical plans can change only if a flipped expression appears in those queries. The affected expressions (e.g. from_unixtime, NTZ hour/minute/second) do not appear there, so no golden-file changes are expected.

How are these changes tested?

Each expression has a SQL file test under spark/src/test/resources/sql-tests/expressions/ run by CometSqlFileTestSuite. The query mode uses checkSparkAnswerAndOperator, which asserts both answer parity with Spark and that the expression executed natively (a fallback fails the test). This includes map_concat and the now-native BinaryType cases of map_from_entries. New and updated fixtures pass against Spark 3.5, with no new regressions in the suite.
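
The two assertions made by the query mode can be illustrated with a toy model. This is a sketch under stated assumptions, not the real checkSparkAnswerAndOperator: the function name, the order-insensitive row comparison, and the rule that a native operator's name starts with "Comet" are all simplifications chosen for illustration.

```python
def check_answer_and_operator(spark_rows, comet_rows, comet_plan_operators):
    """Toy model: fail on an answer mismatch or on any non-native operator."""
    # Answer parity: compare rows as multisets (assumption: order is ignored).
    if sorted(spark_rows, key=repr) != sorted(comet_rows, key=repr):
        raise AssertionError("Comet answer differs from Spark answer")
    # Native execution: any operator that is not a Comet operator is a fallback.
    # (Assumption: operators are plain names and native ones start with "Comet".)
    fallbacks = [op for op in comet_plan_operators if not op.startswith("Comet")]
    if fallbacks:
        raise AssertionError(f"fell back to Spark for: {fallbacks}")


# Matching answers and a fully native plan pass silently.
check_answer_and_operator(
    spark_rows=[(1,), (None,)],
    comet_rows=[(None,), (1,)],
    comet_plan_operators=["CometProject", "CometScan"],
)
```

The point of the second check is that a test cannot pass merely because Spark produced the right answer after a silent fallback.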

andygrove added 2 commits May 30, 2026 11:12
…[skip ci]

Add JVM codegen dispatch support for Spark scalar expressions that Comet did
not previously support natively. Each routes through CometCodegenDispatch, so
Spark's own doGenCode runs inside the Comet pipeline and behavior matches Spark
exactly, with a clean fallback to Spark when the dispatcher is disabled.

Expressions added:
- math: hypot, nanvl, bround, conv, log1p, pmod, width_bucket, positive
- string: levenshtein, elt, find_in_set, format_number, format_string, overlay,
  soundex, locate, unbase64, to_char, to_number
- array: sequence

Each expression has a SQL file test that asserts both answer parity and native
execution (checkSparkAnswerAndOperator).

…fault [skip ci]

Implements the dispatcher change from issue apache#4506. When a native expression
reports Incompatible and the user has not opted into the native divergence via
spark.comet.expr.allowIncompatible, the dispatcher now prefers routing the
expression through the JVM codegen dispatcher (Spark's own doGenCode runs inside
the Comet pipeline) instead of falling back to Spark for the whole projection.
It falls back only when the dispatcher cannot handle the expression, when the
dispatcher is disabled, or when the expression opts out.

A new CometExpressionSerde.allowIncompatCodegenDispatch hook (default true) lets
specific serdes opt out. The json/regexp expressions covered by separate open
work (regexp_replace, split, get_json_object, from_json, to_csv) and
map_from_entries (the dispatcher does not yet evaluate map-typed outputs
correctly) opt out so this generic routing does not pre-empt or miscompile them.

Tests:
- from_unixtime now executes natively via dispatch (test updated from
  expect_fallback to native coverage).
- hour/minute/second gain TimestampNTZ coverage, exercising the conditional
  Incompatible branch that now stays native.

andygrove added 5 commits June 2, 2026 13:40
…cher

Map-typed outputs are now evaluated correctly by the JVM codegen
dispatcher, so CometMapFromEntries no longer needs to opt out of
allowIncompatCodegenDispatch. The BinaryType key/value cases now
execute natively instead of falling back to Spark; the SQL file test
asserts native execution and answer parity for those cases.

map_concat was previously dropped because the codegen dispatcher
emitted a wrong map key. Map-typed outputs are now evaluated correctly,
so register MapConcat as a CometCodegenDispatch serde and add a SQL file
test asserting native execution and answer parity (column inputs, empty
and NULL maps, multi-map literals, and integer-keyed maps).

Flip the expressions implemented in this PR from Planned to Supported:
math (nanvl, bround, conv, hypot, log1p, pmod), string (elt, find_in_set,
format_number, format_string, levenshtein, locate, overlay, position,
printf, soundex, to_char, to_number, to_varchar, unbase64), array
(sequence), and map (map_concat). Also drop the stale BinaryType
fallback note on map_from_entries, which now executes natively via the
codegen dispatcher.

@andygrove andygrove marked this pull request as ready for review June 2, 2026 20:18

…e exprs

The codegen dispatcher now keeps Incompatible expressions native instead of
falling back to Spark. Update the trunc/date_trunc (non-literal format),
array_reverse (binary array), and map_from_entries (binary key/value) tests to
assert native execution rather than a fallback reason.