Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A major use case for Variant values in Parquet and Arrow is efficiently
processing JSON encoded data. Thus an important capability is being able to
efficiently read JSON encoded bytes into the Variant binary encoding described in
VariantEncoding.md. This ticket covers an API to parse one JSON value to
one Variant value. Other tickets will cover converting Variants to
JSON, converting between Arrow Utf8* arrays and Variant arrays, and
writing Variants to and reading them from Parquet.
Describe the solution you'd like
I would like an API to convert JSON encoded bytes to Variant encoded bytes.
Describe alternatives you've considered
I suggest an API like this:
```rust
// Provide location to write metadata, and value output
// (should be anything that implements `std::io::Write` or some trait)
let mut metadata_buffer = vec![];
let mut value_buffer = vec![];
// Input json encoded bytes
let json_data: &[u8] = ...;
// Call the new API
json_to_variant(&mut metadata_buffer, &mut value_buffer, json_data)?;
// metadata_buffer and value_buffer contain the variant information
```
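To make the proposed signature concrete, here is a minimal sketch of how `json_to_variant` could be generic over `std::io::Write`. It is not a real implementation: it only handles the JSON literals `null`, `true`, and `false`, and rejects everything else. The error type `JsonToVariantError` is hypothetical, and the byte values (an empty metadata dictionary and the primitive value headers) are my reading of VariantEncoding.md, so they should be checked against the spec.

```rust
use std::io::{self, Write};

/// Hypothetical error type for this sketch (not part of any existing crate).
#[derive(Debug)]
pub enum JsonToVariantError {
    Io(io::Error),
    Unsupported(String),
}

impl From<io::Error> for JsonToVariantError {
    fn from(e: io::Error) -> Self {
        JsonToVariantError::Io(e)
    }
}

/// Sketch of the proposed API: parse one JSON value and write the Variant
/// metadata and value bytes to any `std::io::Write` implementations.
///
/// Only `null`, `true`, and `false` are handled here.
pub fn json_to_variant<M: Write, V: Write>(
    metadata: &mut M,
    value: &mut V,
    json: &[u8],
) -> Result<(), JsonToVariantError> {
    let text = std::str::from_utf8(json)
        .map_err(|e| JsonToVariantError::Unsupported(e.to_string()))?;
    // Assumed primitive value header: (type_id << 2) | basic_type,
    // where basic_type is 0 for primitives.
    let header: u8 = match text.trim() {
        "null" => 0 << 2,
        "true" => 1 << 2,
        "false" => 2 << 2,
        other => return Err(JsonToVariantError::Unsupported(other.to_string())),
    };
    // Assumed empty dictionary: version 1, dictionary_size 0, one offset of 0.
    metadata.write_all(&[0x01, 0x00, 0x00])?;
    value.write_all(&[header])?;
    Ok(())
}
```

Because `Vec<u8>` implements `std::io::Write`, the two buffers from the example above can be passed directly, and a file or network writer would work the same way.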
Additional context
Considerations:
Reusing metadata across values?
I think it will be common for the same metadata to be used across many different
variant values (e.g. because the schema of the JSON documents is the same). Thus
we should probably permit reusing validated metadata somehow (rather than
requiring it to be recreated for each decoded JSON value).
One option would be to add a `json` function to the `VariantBuilder` envisioned in
the ticket linked above for reading Variant values.
```rust
// Location to write metadata
let mut metadata_buffer = vec![];
// Create a builder
let builder = VariantBuilder::new(&mut metadata_buffer);
// Location to write the output variant value
let mut value_buffer = vec![];
builder.json(&mut value_buffer, json_data)?;
// value_buffer contains the result of converting json_data to `Variant`
```
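The reason metadata reuse pays off is that, as I understand the encoding, the Variant metadata is essentially a dictionary of field names. A rough sketch of the shared state a builder could keep across many JSON documents is shown below. `FieldNameDictionary` and its methods are hypothetical names for illustration, and the sketch deliberately ignores how the dictionary is serialized to bytes.

```rust
use std::collections::HashMap;

/// Hypothetical field-name dictionary shared across many Variant values.
#[derive(Debug, Default)]
pub struct FieldNameDictionary {
    ids: HashMap<String, u32>,
    names: Vec<String>,
}

impl FieldNameDictionary {
    /// Returns the id for `name`, adding it to the dictionary on first use.
    pub fn intern(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.names.len() as u32;
        self.names.push(name.to_string());
        self.ids.insert(name.to_string(), id);
        id
    }

    /// Number of distinct field names seen so far.
    pub fn len(&self) -> usize {
        self.names.len()
    }

    /// True if no field name has been interned yet.
    pub fn is_empty(&self) -> bool {
        self.names.is_empty()
    }

    /// Looks up the field name for a previously returned id.
    pub fn name(&self, id: u32) -> Option<&str> {
        self.names.get(id as usize).map(String::as_str)
    }
}
```

When every document has the same fields, the dictionary stops growing after the first document, so later values only need to reference existing ids instead of rebuilding and revalidating the metadata.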
Support for "streaming" / a push API
As sketched above, this API would require the entire JSON value in a single
buffer. A potentially more efficient API might be a "push" API, similar to how
the Arrow JSON reader works, which would support smaller buffer sizes and lower
peak memory usage, as well as interleaving Variant parsing with IO fetches.
Perhaps something like:
```rust
let mut metadata_buffer = vec![];
let builder = VariantBuilder::new(&mut metadata_buffer);
let mut value_buffer = vec![];
let mut parser = builder.json_parser(&mut value_buffer)?;
// JSON data comes in from some source
while let Some(json_data) = source.next() {
    parser.push(json_data); // incrementally parses JSON
}
parser.finish(); // complete in-progress variant
// value_buffer contains the result of converting json_data to Variant
```
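One thing a push API has to handle is that chunk boundaries can fall anywhere, including inside a string or an escape sequence. The sketch below shows only that state tracking, under the assumption of a hypothetical `JsonPushBuffer` type. Unlike the parser proposed above, it simply buffers the bytes rather than emitting Variant output incrementally, so it illustrates the `push` / `finish` shape but not the memory savings.

```rust
/// Hypothetical push-style front end: accepts JSON in arbitrary chunks and
/// tracks whether the top-level value is structurally complete.
#[derive(Debug, Default)]
pub struct JsonPushBuffer {
    buffered: Vec<u8>,
    depth: usize,    // current nesting of objects and arrays
    in_string: bool, // currently inside a string literal
    escaped: bool,   // previous byte was a backslash inside a string
    malformed: bool, // saw a closing bracket with nothing open
}

impl JsonPushBuffer {
    /// Feeds the next chunk; state carries over between calls.
    pub fn push(&mut self, chunk: &[u8]) {
        for &byte in chunk {
            if self.in_string {
                if self.escaped {
                    self.escaped = false;
                } else if byte == b'\\' {
                    self.escaped = true;
                } else if byte == b'"' {
                    self.in_string = false;
                }
                continue;
            }
            match byte {
                b'"' => self.in_string = true,
                b'{' | b'[' => self.depth += 1,
                b'}' | b']' => match self.depth.checked_sub(1) {
                    Some(d) => self.depth = d,
                    None => self.malformed = true,
                },
                _ => {}
            }
        }
        self.buffered.extend_from_slice(chunk);
    }

    /// Returns the buffered JSON bytes if the value looks complete.
    pub fn finish(self) -> Result<Vec<u8>, String> {
        if self.malformed || self.in_string || self.depth != 0 {
            return Err("incomplete or unbalanced JSON value".to_string());
        }
        if self.buffered.iter().all(|b| b.is_ascii_whitespace()) {
            return Err("no JSON value was pushed".to_string());
        }
        Ok(self.buffered)
    }
}
```

A real implementation would replace the `buffered` field with calls that write Variant bytes as soon as each token is recognized. The depth and string-state bookkeeping would stay much the same.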