feat: support exact percentile aggregate natively by andygrove · Pull Request #4542 · apache/datafusion-comet

andygrove · 2026-05-30T18:32:01Z

Which issue does this PR close?

Part of #3190.

Rationale for this change

Comet had no native percentile aggregate, so percentile(...) (and the ANSI percentile_cont(...) WITHIN GROUP, which Spark rewrites to Percentile) always fell back to Spark. Codegen dispatch is not an option here: Percentile is a TypedImperativeAggregate, and the codegen dispatcher is a per-row scalar kernel that explicitly cannot run aggregates. So the only paths are native or fall back, and this PR wires it natively.

DataFusion's percentile_cont computes the percentile with index = p * (n - 1) and linear interpolation between the two closest ranks, which is the same algorithm Spark uses for its exact Percentile. So the common single-percentage form matches Spark.
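For illustration, the formula above can be sketched as a small standalone function. This is a minimal sketch of the `index = p * (n - 1)` interpolation only; the function name `exact_percentile` is hypothetical, and this is neither DataFusion's accumulator nor Spark's implementation.

```rust
/// Exact percentile with linear interpolation between the two closest ranks.
/// `values` may be unsorted; `p` must lie in [0, 1].
/// Returns `None` for empty input or an out-of-range `p`.
/// (Hypothetical helper for illustration, not code from either engine.)
fn exact_percentile(values: &[f64], p: f64) -> Option<f64> {
    if values.is_empty() || !(0.0..=1.0).contains(&p) {
        return None;
    }

    // Sort a copy so the caller's slice is left untouched.
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Fractional rank: index = p * (n - 1).
    let index = p * (sorted.len() - 1) as f64;
    let lower = index.floor() as usize;
    let upper = index.ceil() as usize;

    // The rank falls exactly on an element, so no interpolation is needed.
    if lower == upper {
        return Some(sorted[lower]);
    }

    // Linear interpolation between the two neighbouring elements.
    let fraction = index - lower as f64;
    Some(sorted[lower] + (sorted[upper] - sorted[lower]) * fraction)
}
```

With the input `[10.0, 20.0, 30.0, 40.0]` and `p = 0.5`, the index is `1.5`, so the result lies halfway between the second and third elements.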

What changes are included in this PR?

- proto: new Percentile AggExpr message (child, percentage, datatype).
- native planner (planner.rs): map AggExprStruct::Percentile to percentile_cont_udaf() with args [child, percentile].
- CometPercentile serde: Compatible for a single literal double percentage, default frequency, and numeric input. The child is cast to double so the native result is DoubleType, matching Spark.
- operators.adjustOutputForNativeState: map Percentile's TypedImperativeAggregate Binary partial buffer to the native List<Float64> state (ArrayType(DoubleType)), mirroring the existing CollectSet handling, so the partial/shuffle/final exchange schema is correct.

Out of scope (fall back to Spark): an array of percentages, a non-default frequency argument, and interval inputs. approx_percentile is deliberately not included (t-digest vs Spark's GK algorithm; tracked separately under #3189).
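The support rules described above amount to a small decision table. The following is a minimal sketch of that table, assuming a simplified model of the call's arguments; every type and function name here (`PercentageArg`, `InputType`, `Support`, `percentile_support`) is hypothetical and does not correspond to the actual CometPercentile serde, which lives on the Spark side.

```rust
/// Hypothetical model of the percentage argument of a `percentile(...)` call.
#[derive(Debug, Clone, PartialEq)]
enum PercentageArg {
    /// A single literal double, e.g. `percentile(x, 0.5)`.
    LiteralDouble(f64),
    /// An array of percentages, e.g. `percentile(x, array(0.25, 0.75))`.
    LiteralArray(Vec<f64>),
}

/// Hypothetical model of the input column type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InputType {
    Numeric,
    Interval,
}

/// Outcome of the support check.
#[derive(Debug, PartialEq)]
enum Support {
    /// Run natively.
    Native,
    /// Fall back to Spark, with the reason.
    FallBack(&'static str),
}

/// Decide whether a percentile call can run natively under the rules
/// stated in the description: single literal double percentage,
/// default frequency, numeric input.
fn percentile_support(
    input: InputType,
    percentage: &PercentageArg,
    default_frequency: bool,
) -> Support {
    if input == InputType::Interval {
        return Support::FallBack("interval input");
    }
    if !default_frequency {
        return Support::FallBack("non-default frequency argument");
    }
    match percentage {
        PercentageArg::LiteralArray(_) => Support::FallBack("array of percentages"),
        PercentageArg::LiteralDouble(_) => Support::Native,
    }
}
```

The order in which the three conditions are checked is arbitrary in this sketch; any one of them is enough to trigger the fallback.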

Known minor caveat: DataFusion quantizes the interpolation fraction to 6 decimal places, so a deeply-interpolated value could in principle differ from Spark in the last ULPs. The tested percentiles match exactly; if needed this can be revisited with a custom accumulator.
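To make the caveat concrete, here is a minimal sketch of what quantizing the interpolation fraction to 6 decimal places can do. It assumes the fraction is truncated to multiples of 1e-6; whether DataFusion truncates or rounds, and where exactly it applies the step, is an assumption here, and both function names are hypothetical.

```rust
/// Linear interpolation between two neighbouring values.
fn interpolate(lower: f64, upper: f64, fraction: f64) -> f64 {
    lower + (upper - lower) * fraction
}

/// Truncate a fraction in [0, 1) to 6 decimal places.
/// ASSUMPTION: truncation is used here for illustration; the real
/// quantization in DataFusion may round instead.
fn quantize_fraction(fraction: f64) -> f64 {
    const STEPS: f64 = 1_000_000.0;
    (fraction * STEPS).floor() / STEPS
}

/// Absolute difference between the unquantized and quantized results
/// for one pair of neighbouring values.
fn quantization_error(lower: f64, upper: f64, fraction: f64) -> f64 {
    let exact = interpolate(lower, upper, fraction);
    let quantized = interpolate(lower, upper, quantize_fraction(fraction));
    (exact - quantized).abs()
}
```

Fractions that are already multiples of 1e-6, such as 0.5 or 0.25, pass through unchanged, which is consistent with the tested percentiles matching exactly. For a fraction such as 1/3 the two results differ, and in this model the difference is bounded by 1e-6 times the gap between the two neighbouring values.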

How are these changes tested?

A SQL file test (expressions/aggregate/percentile.sql) run by CometSqlFileTestSuite covers global, grouped, integer-input, all-null-group, and exact and interpolated percentiles, asserting answer parity and native execution via checkSparkAnswerAndOperator. It also asserts that the array-of-percentages and frequency-argument forms fall back to Spark. The full SQL suite shows no new regressions.

Wire Spark's exact `Percentile` aggregate (and the `percentile_cont` ANSI form, which Spark rewrites to `Percentile`) to DataFusion's `percentile_cont` aggregate. DataFusion uses the same `index = p * (n - 1)` linear interpolation as Spark, so results match for the common single-percentage form.

- proto: add `Percentile` AggExpr message (child, percentage, datatype).
- native planner: map it to `percentile_cont_udaf()` with [child, percentile].
- CometPercentile serde: Compatible for a single literal double percentage, default frequency, and numeric input; the child is cast to double so the native result is DoubleType. Array-of-percentages, a non-default frequency argument, and interval inputs fall back to Spark.
- operators.adjustOutputForNativeState: map Percentile's TypedImperativeAggregate Binary partial buffer to the native List<Float64> state (ArrayType(DoubleType)), mirroring CollectSet, so the partial/shuffle/final exchange schema is correct.

Codegen dispatch is not applicable: aggregates (TypedImperativeAggregate) cannot run in the per-row scalar kernel, so native is the only path.

Tests: SQL file test covering global, grouped, integer-input, all-null, exact and interpolated percentiles, plus fallback assertions for the array and frequency forms. No new regressions in the SQL suite.


andygrove force-pushed the percentile-support branch from 312293f to 751131c Compare May 30, 2026 18:39

bench: add percentile cases to CometAggregateExpressionBenchmark [skip ci]

437ce2d

andygrove wants to merge 2 commits into apache:main from andygrove:percentile-support
