Add json_union_to_text UDF to flatten a JSON union to JSON text#114

Merged
adriangb merged 1 commit into
datafusion-contrib:mainfrom
pydantic:udf-json-union-to-text
Jun 2, 2026

@adriangb adriangb commented Jun 1, 2026

Collaborator

json_get returns a heterogeneous JSON union (JSON_UNION_DATA_TYPE). Consumers that can't represent an Arrow Union — notably the Parquet writer, whose arrow_to_parquet_schema panics on unions (unimplemented!("See ARROW-8817.")) — have no way to materialize such a column. None of the existing UDFs help: they all take (json_string, path…) and parse source text rather than consuming a union.

This adds json_union_to_text(json_union), which flattens the union (via the public JsonUnionEncoder) into canonical JSON text as Utf8View:

  • scalars render as true / 42 / 1.5
  • strings are JSON-quoted and escaped
  • array/object arms (already raw JSON text) pass through
  • a JSON null arm becomes SQL NULL

Signature::exact([JSON_UNION_DATA_TYPE]) constrains the input to the JSON union, so any other argument type is a planning error. Registered in register_all and exported under functions:: / udfs::, matching the other functions.
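
The flattening rules listed above can be sketched with the standard library alone. This is only an illustration, not the code in this PR: `JsonArm`, `json_quote`, and `union_to_text` are hypothetical names, the enum stands in for the Arrow union that `JsonUnionEncoder` actually handles, and the escaper is a simplified stand-in for a real JSON encoder.

```rust
/// Hypothetical stand-in for the arms of the JSON union. The real crate
/// stores these in an Arrow union array, not a Rust enum.
#[derive(Debug, Clone, PartialEq)]
enum JsonArm {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
    /// Already-serialized JSON text for an array.
    Array(String),
    /// Already-serialized JSON text for an object.
    Object(String),
}

/// Quote and escape `s` as a JSON string literal.
/// Simplified: control characters other than \n, \r, \t are written as
/// \u00XX, which is valid JSON but may differ from a library encoder's
/// choice of short escapes.
fn json_quote(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Flatten one union value to JSON text; `None` models SQL NULL.
/// Assumption: floats are finite (NaN and infinities have no JSON form,
/// and this sketch does not handle them).
fn union_to_text(arm: &JsonArm) -> Option<String> {
    match arm {
        JsonArm::Null => None,
        JsonArm::Bool(b) => Some(b.to_string()),
        JsonArm::Int(i) => Some(i.to_string()),
        JsonArm::Float(f) => Some(f.to_string()),
        JsonArm::Str(s) => Some(json_quote(s)),
        // Raw JSON text passes through unchanged; quoting it again would
        // turn the array or object into a JSON string.
        JsonArm::Array(raw) | JsonArm::Object(raw) => Some(raw.clone()),
    }
}
```

In this model, `union_to_text(&JsonArm::Object("{\"a\":1}".to_string()))` yields the text `{"a":1}` unchanged, while a `JsonArm::Str` holding the same characters would come back quoted and escaped.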

Tested: cargo test --lib json_union_to_text (covers every union arm incl. escaping), cargo clippy --all-targets (pedantic), cargo fmt --check.

Comment thread src/json_union_to_text.rs
return exec_err!("json_union_to_text argument is not the JSON union type");
};

let mut builder = StringViewBuilder::with_capacity(encoder.len());

Contributor

Can we copy what arrow-rs does and use serde_json for this? I worry that by hand-rolling our own encoder we will hit a number of edge cases and bugs. serde_json is already a dependency of datafusion/arrow:

https://github.com/apache/arrow-rs/blob/58.3.0/arrow-json/src/writer/encoder.rs

Collaborator Author

Done — switched the scalar arms to encode via serde_json::to_writer (added serde_json as a direct dep, since it's already in the tree via arrow). The array/object arms still pass their raw JSON text through verbatim, since those are already-serialized JSON substrings (re-encoding them through a JSON string encoder would double-quote them). Dropped the hand-rolled escaper entirely and extended the test to cover quote/newline/control-char escaping.

@cetra3 cetra3 left a comment

Contributor

Approved but I would like to use serde_json to do the encoding

`json_get` returns a heterogeneous JSON union (`JSON_UNION_DATA_TYPE`).
Consumers that can't represent an Arrow `Union` — notably the Parquet writer,
whose `arrow_to_parquet_schema` panics on unions ("See ARROW-8817.") — have no
way to materialize such a column. None of the existing UDFs help: they all take
`(json_string, path…)` and parse source text rather than consuming a union.

`json_union_to_text(json_union)` flattens the union (via `JsonUnionEncoder`)
into canonical JSON text as `Utf8View`: scalars render as `true` / `42` / `1.5`,
strings are JSON-quoted and escaped, array/object arms (already raw JSON text)
pass through, and a JSON `null` arm becomes SQL `NULL`. Its `Signature::exact`
constrains the input to the JSON union, so any other argument type is a planning
error. Registered in `register_all` and exported under `functions` / `udfs`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@adriangb adriangb force-pushed the udf-json-union-to-text branch from 97ae467 to 46eb0ed on June 2, 2026 10:58
@adriangb adriangb merged commit 5a2550f into datafusion-contrib:main Jun 2, 2026
7 checks passed

adriangb commented Jun 2, 2026

Collaborator Author

Also marked the output field as JSON: the UDF now implements return_field_from_args to return a Utf8View field tagged with the canonical Arrow JSON extension via json_field_metadata() (ARROW:extension:name = arrow.json), matching json_get_json/json_get_array. Added a test asserting the extension name on the output field.
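
As a rough model of the tagging described above, the sketch below attaches the extension name to a field's metadata map and checks for it. It uses only the standard library: `FieldSketch`, `json_text_field`, and `is_json_field` are hypothetical names standing in for the Arrow field type and the crate's `json_field_metadata()` helper, whose actual signatures are not shown in this thread.

```rust
use std::collections::HashMap;

/// Metadata key and value quoted in the comment above.
const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";
const JSON_EXTENSION_NAME: &str = "arrow.json";

/// Hypothetical, simplified stand-in for an Arrow field.
#[derive(Debug, Clone, PartialEq)]
struct FieldSketch {
    name: String,
    data_type: &'static str,
    nullable: bool,
    metadata: HashMap<String, String>,
}

/// Build a nullable Utf8View field tagged as JSON.
/// Nullable because a JSON null arm becomes SQL NULL in the output.
fn json_text_field(name: &str) -> FieldSketch {
    let mut metadata = HashMap::new();
    metadata.insert(
        EXTENSION_NAME_KEY.to_string(),
        JSON_EXTENSION_NAME.to_string(),
    );
    FieldSketch {
        name: name.to_string(),
        data_type: "Utf8View",
        nullable: true,
        metadata,
    }
}

/// True when the field carries the JSON extension name in its metadata.
fn is_json_field(field: &FieldSketch) -> bool {
    field.metadata.get(EXTENSION_NAME_KEY).map(String::as_str) == Some(JSON_EXTENSION_NAME)
}
```

A downstream consumer could use a check like `is_json_field` to decide whether a `Utf8View` column holds JSON text rather than arbitrary strings.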
