feat: support exact percentile aggregate natively#4542

Draft
andygrove wants to merge 2 commits into
apache:main from
andygrove:percentile-support

@andygrove
Member

Which issue does this PR close?

Part of #3190.

Rationale for this change

Comet had no native percentile aggregate, so percentile(...) (and the ANSI percentile_cont(...) WITHIN GROUP form, which Spark rewrites to Percentile) always fell back to Spark. Codegen dispatch is not an option here: Percentile is a TypedImperativeAggregate, and the codegen dispatcher is a per-row scalar kernel that explicitly cannot run aggregates. The only options are therefore native execution or fallback, and this PR adds the native path.

DataFusion's percentile_cont computes the percentile with index = p * (n - 1) and linear interpolation between the two closest ranks, which is the same algorithm as Spark's exact Percentile. The common single-percentage form therefore matches Spark.
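
For reviewers who want to check the interpolation by hand, here is a minimal Python sketch of the index = p * (n - 1) rule described above. It is not the DataFusion or Spark code; the function name `exact_percentile` and the null handling (nulls skipped, empty input returns None) are assumptions made for illustration.

```python
import math

def exact_percentile(values, p):
    """Exact percentile via index = p * (n - 1) with linear interpolation.

    Illustrative sketch only; not the DataFusion or Spark implementation.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    # Assumption: nulls are skipped and an all-null (empty) input yields None.
    data = sorted(float(v) for v in values if v is not None)
    if not data:
        return None
    index = p * (len(data) - 1)
    lower = math.floor(index)
    upper = math.ceil(index)
    if lower == upper:
        # The index lands exactly on a rank, so no interpolation is needed.
        return data[lower]
    fraction = index - lower
    # Interpolate linearly between the two closest ranks.
    return data[lower] + (data[upper] - data[lower]) * fraction
```

With four inputs 1, 2, 3, 4 and p = 0.5, the index is 1.5, so the result is interpolated halfway between the values at ranks 1 and 2.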

What changes are included in this PR?

  • proto: new Percentile AggExpr message (child, percentage, datatype).
  • native planner (planner.rs): map AggExprStruct::Percentile to percentile_cont_udaf() with args [child, percentile].
  • CometPercentile serde: Compatible for a single literal double percentage, default frequency, and numeric input. The child is cast to double so the native result is DoubleType, matching Spark.
  • operators.adjustOutputForNativeState: map Percentile's TypedImperativeAggregate Binary partial buffer to the native List<Float64> state (ArrayType(DoubleType)), mirroring the existing CollectSet handling, so the partial/shuffle/final exchange schema is correct.
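
To illustrate why the partial buffer is a list of doubles, the following Python sketch models the partial/shuffle/final flow under the assumption that the native state is simply the collected non-null inputs. The function names (`partial_state`, `merge_states`, `evaluate`) are hypothetical and do not correspond to Comet or DataFusion APIs.

```python
import math

def partial_state(rows):
    # Partial stage: keep every non-null input as a double.
    # This models the List<Float64> state that crosses the exchange.
    return [float(v) for v in rows if v is not None]

def merge_states(states):
    # Final stage, step 1: concatenate the per-partition lists.
    merged = []
    for state in states:
        merged.extend(state)
    return merged

def evaluate(state, p):
    # Final stage, step 2: sort and interpolate at index = p * (n - 1).
    if not state:
        return None
    data = sorted(state)
    index = p * (len(data) - 1)
    lower, upper = math.floor(index), math.ceil(index)
    fraction = index - lower
    return data[lower] + (data[upper] - data[lower]) * fraction

# Two partitions produce the same answer as one combined input.
partitions = [[1, None, 4], [2, 3]]
result = evaluate(merge_states([partial_state(p) for p in partitions]), 0.5)
```

Because an exact percentile needs every value, the state grows with the input, which is why the exchange schema must be an array type rather than a fixed-size binary buffer.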

Out of scope (fall back to Spark): an array of percentages, a non-default frequency argument, and interval inputs. approx_percentile is deliberately not included (t-digest vs Spark's GK algorithm; tracked separately under #3189).
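
The support rules above can be summarized as a small predicate. This is a hedged Python sketch of the decision, not the CometPercentile serde itself: `PercentileCall`, `runs_natively`, and the `NUMERIC_TYPES` set are hypothetical names and values chosen for illustration, and the default frequency of 1 is an assumption about Spark's Percentile.

```python
from dataclasses import dataclass
from typing import Any

# Illustrative set of numeric input types; the real serde may differ.
NUMERIC_TYPES = {"byte", "short", "int", "long", "float", "double"}

@dataclass
class PercentileCall:
    input_type: str                      # type of the child expression
    percentage: Any                      # a float, or a list for the array form
    percentage_is_literal: bool = True
    frequency: Any = 1                   # assumed Spark default frequency

def runs_natively(call: PercentileCall) -> bool:
    """Return True when the call fits the natively supported form."""
    if call.input_type not in NUMERIC_TYPES:
        return False  # e.g. interval inputs fall back to Spark
    if not call.percentage_is_literal:
        return False  # only literal percentages are supported
    if not isinstance(call.percentage, float):
        return False  # an array of percentages falls back to Spark
    if call.frequency != 1:
        return False  # a non-default frequency falls back to Spark
    return True
```

For example, `runs_natively(PercentileCall("int", 0.5))` is accepted, while a call with `percentage=[0.25, 0.75]` or `frequency=2` is rejected and would fall back.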

Known minor caveat: DataFusion quantizes the interpolation fraction to 6 decimal places, so a deeply-interpolated value could in principle differ from Spark in the last ULPs. The tested percentiles match exactly; if needed this can be revisited with a custom accumulator.
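
To give a feel for the size of this effect, here is a small Python sketch that compares an unquantized interpolation with one whose fraction is limited to 6 decimal places. It assumes the fraction is truncated rather than rounded, which is a guess about the behavior and not taken from the DataFusion source; `PRECISION` and `interpolate` are hypothetical names.

```python
import math

PRECISION = 1_000_000  # six decimal places (assumed quantization step)

def interpolate(lo, hi, fraction, quantize):
    # Assumption: quantization truncates the fraction to 6 decimal places.
    if quantize:
        fraction = math.floor(fraction * PRECISION) / PRECISION
    return lo + (hi - lo) * fraction

lo, hi, fraction = 10.0, 20.0, 1.0 / 3.0
exact = interpolate(lo, hi, fraction, quantize=False)
quantized = interpolate(lo, hi, fraction, quantize=True)

# Under this assumption the drift is bounded by (hi - lo) / PRECISION.
drift = abs(exact - quantized)
```

Fractions that are representable in 6 decimal places, such as 0.5 or 0.25, are unaffected under this model, which is consistent with the tested percentiles matching exactly.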

How are these changes tested?

A SQL file test (expressions/aggregate/percentile.sql) run by CometSqlFileTestSuite covers global, grouped, integer-input, all-null-group, and exact and interpolated percentiles, asserting answer parity and native execution via checkSparkAnswerAndOperator. It also asserts that the array-of-percentages and frequency-argument forms fall back to Spark. The full SQL suite shows no new regressions.

Wire Spark's exact `Percentile` aggregate (and the `percentile_cont` ANSI form,
which Spark rewrites to `Percentile`) to DataFusion's `percentile_cont`
aggregate. DataFusion uses the same `index = p * (n - 1)` linear interpolation
as Spark, so results match for the common single-percentage form.

- proto: add `Percentile` AggExpr message (child, percentage, datatype).
- native planner: map it to `percentile_cont_udaf()` with [child, percentile].
- CometPercentile serde: Compatible for a single literal double percentage,
  default frequency, and numeric input; the child is cast to double so the
  native result is DoubleType. Array-of-percentages, a non-default frequency
  argument, and interval inputs fall back to Spark.
- operators.adjustOutputForNativeState: map Percentile's TypedImperativeAggregate
  Binary partial buffer to the native List<Float64> state (ArrayType(DoubleType)),
  mirroring CollectSet, so the partial/shuffle/final exchange schema is correct.

Codegen dispatch is not applicable: aggregates (TypedImperativeAggregate) cannot
run in the per-row scalar kernel, so native is the only path.

Tests: SQL file test covering global, grouped, integer-input, all-null, exact and
interpolated percentiles, plus fallback assertions for the array and frequency
forms. No new regressions in the SQL suite.
@andygrove andygrove force-pushed the percentile-support branch from 312293f to 751131c Compare May 30, 2026 18:39