Add json_union_to_text UDF to flatten a JSON union to JSON text #114
        return exec_err!("json_union_to_text argument is not the JSON union type");
    };

    let mut builder = StringViewBuilder::with_capacity(encoder.len());
Can we copy what arrow-rs does and use serde_json for this? I worry that by hand-rolling our own encoder we are going to hit a number of edge cases and bugs. serde_json is already a dependency of datafusion/arrow:
https://github.com/apache/arrow-rs/blob/58.3.0/arrow-json/src/writer/encoder.rs
Done — switched the scalar arms to encode via serde_json::to_writer (added serde_json as a direct dep, since it's already in the tree via arrow). The array/object arms still pass their raw JSON text through verbatim, since those are already-serialized JSON substrings (re-encoding them through a JSON string encoder would double-quote them). Dropped the hand-rolled escaper entirely and extended the test to cover quote/newline/control-char escaping.
cetra3 left a comment
Approved but I would like to use serde_json to do the encoding
`json_get` returns a heterogeneous JSON union (`JSON_UNION_DATA_TYPE`).
Consumers that can't represent an Arrow `Union` — notably the Parquet writer,
whose `arrow_to_parquet_schema` panics on unions ("See ARROW-8817.") — have no
way to materialize such a column. None of the existing UDFs help: they all take
`(json_string, path…)` and parse source text rather than consuming a union.
`json_union_to_text(json_union)` flattens the union (via `JsonUnionEncoder`)
into canonical JSON text as `Utf8View`: scalars render as `true` / `42` / `1.5`,
strings are JSON-quoted and escaped, array/object arms (already raw JSON text)
pass through, and a JSON `null` arm becomes SQL `NULL`. Its `Signature::exact`
constrains the input to the JSON union, so any other argument type is a planning
error. Registered in `register_all` and exported under `functions` / `udfs`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
97ae467 to 46eb0ed
Also marked the output field as JSON: the UDF now implements
`json_get` returns a heterogeneous JSON union (`JSON_UNION_DATA_TYPE`). Consumers that can't represent an Arrow `Union`, notably the Parquet writer, whose `arrow_to_parquet_schema` panics on unions (`unimplemented!("See ARROW-8817.")`), have no way to materialize such a column. None of the existing UDFs help: they all take `(json_string, path…)` and parse source text rather than consuming a union.

This adds `json_union_to_text(json_union)`, which flattens the union (via the public `JsonUnionEncoder`) into canonical JSON text as `Utf8View`:

- scalars render as `true` / `42` / `1.5`
- strings are JSON-quoted and escaped
- array/object arms (already raw JSON text) pass through
- a JSON `null` arm becomes SQL `NULL`

`Signature::exact([JSON_UNION_DATA_TYPE])` constrains the input to the JSON union, so any other argument type is a planning error. Registered in `register_all` and exported under `functions::` / `udfs::`, matching the other functions.

Tested: `cargo test --lib json_union_to_text` (covers every union arm incl. escaping), `cargo clippy --all-targets` (pedantic), `cargo fmt --check`.
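
For readers following the thread, the per-arm behaviour described above can be sketched without Arrow or DataFusion. This is only an illustration under stated assumptions: `JsonArm`, `flatten`, and `quote_json_string` are hypothetical names, not the PR's code. The real UDF reads the arms through `JsonUnionEncoder` and, after review, encodes scalars with `serde_json::to_writer`; the small escaper here is a dependency-free stand-in for that step and may differ from serde_json in which escape forms it picks.

```rust
/// Hypothetical stand-in for one row of the JSON union. The real arms live in
/// an Arrow union array; this enum only mirrors the cases named in the PR.
#[derive(Debug, Clone, PartialEq)]
enum JsonArm {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
    /// Already-serialized JSON text for an array, e.g. `[1,2]`.
    Array(String),
    /// Already-serialized JSON text for an object, e.g. `{"a":1}`.
    Object(String),
}

/// Stand-in for JSON string encoding (the PR delegates this to serde_json).
/// Wraps `s` in double quotes and escapes quotes, backslashes, and control
/// characters below U+0020.
fn quote_json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Flattens one arm to JSON text; `None` models SQL NULL.
fn flatten(arm: &JsonArm) -> Option<String> {
    match arm {
        // A JSON null arm becomes SQL NULL rather than the text "null".
        JsonArm::Null => None,
        JsonArm::Bool(b) => Some(b.to_string()),
        JsonArm::Int(i) => Some(i.to_string()),
        // Assumes a finite value; NaN and infinity have no JSON representation.
        JsonArm::Float(f) => Some(f.to_string()),
        JsonArm::Str(s) => Some(quote_json_string(s)),
        // Raw JSON text passes through verbatim: running it through the
        // string encoder would turn `[1,2]` into the JSON string "[1,2]".
        JsonArm::Array(raw) | JsonArm::Object(raw) => Some(raw.clone()),
    }
}
```

Calling `flatten(&JsonArm::Array("[1,2]".to_string()))` returns the text unchanged, while `quote_json_string("[1,2]")` yields a quoted JSON string. That difference is why the array/object arms skip the string encoder.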