
Deduplicate and standardize deserialization logic for streams #13412

Merged · 4 commits into apache:main · Nov 16, 2024

Conversation

@alihan-synnada (Contributor)

Which issue does this PR close?

None

Rationale for this change

Part of #13411

This PR implements a common Decoder trait, the BatchDeserializer trait, and the DecoderDeserializer struct as described in the issue, along with CsvDecoder and JsonDecoder, since arrow-csv and arrow-json provide ready-made Decoders.

What changes are included in this PR?

Note: There are about 290 lines of new tests, so the change amounts to about 250 lines of actual code.

  • Add BatchDeserializer as a common interface (a sketch follows this list).
    • digest consumes the input in chunks
    • next attempts to deserialize the digested data and returns a DeserializerOutput, which is either a RecordBatch, RequiresMoreData, or InputExhausted
    • finish signals the end of the input stream
  • Add a Decoder trait
    • Mimics the Decoders of arrow-json and arrow-csv
  • Implement Decoder for CsvDecoder and JsonDecoder by forwarding methods
  • Add DecoderDeserializer and implement BatchDeserializer for formats that have a Decoder implementation
  • Add a deserialize_stream function to deduplicate the deserialization logic
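
For orientation, here is a minimal sketch of the interface described in the list above. The method names digest, next, and finish and the three DeserializerOutput variants come from this PR; the exact signatures, trait bounds, and error type are assumptions for illustration.

```rust
use std::fmt::Debug;

use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Possible outputs of a `BatchDeserializer`; the three variants are the
/// ones named in the PR summary above.
#[derive(Debug, PartialEq)]
pub enum DeserializerOutput {
    /// A fully deserialized batch is ready.
    RecordBatch(RecordBatch),
    /// More input must be digested before a batch can be produced.
    RequiresMoreData,
    /// `finish` was called and all buffered input has been drained.
    InputExhausted,
}

/// Common deserialization interface. Signatures are illustrative; the PR
/// defines the authoritative versions.
pub trait BatchDeserializer<T>: Send + Debug {
    /// Consume one chunk of input, buffering it for later decoding.
    fn digest(&mut self, message: T);

    /// Attempt to deserialize the digested data into the next batch.
    fn next(&mut self) -> Result<DeserializerOutput, ArrowError>;

    /// Signal that no further input will arrive.
    fn finish(&mut self);
}
```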

Are these changes tested?

Yes, the changes are covered by new tests added to the CSV and JSON modules.

Are there any user-facing changes?

No

The github-actions bot added the core (Core DataFusion crate) label on Nov 14, 2024.
@ozankabak (Contributor)

We discussed this with @alihan-synnada and it looks good to me, but it'd be great to get community review. cc @alamb

@ozankabak (Contributor) left a comment


Let's wait a little bit for more eyes on this, but I carefully went through and it seems like a good first step towards removing code duplication on the read side.

@alamb (Contributor) left a comment


Thank you @alihan-synnada and @ozankabak

I think this PR is really well documented and makes a lot of sense to me.

One thing I noticed is that the ticket talks about arrow and avro as well. Do you plan to update them in a follow-on PR?

Finally, the ticket also mentions parquet -- I think it will be hard to update the parquet reader (or any columnar file format) to use the Decoder trait. The parquet reader itself drives what IO to do (aka what byte ranges and when), unlike the more row-oriented formats.

@@ -168,6 +172,164 @@ pub enum FilePushdownSupport {
Supported,
}

/// Possible outputs of a [`BatchDeserializer`].
#[derive(Debug, PartialEq)]
pub enum DeserializerOutput {

👍 nice

pub(crate) trait Decoder: Send + Debug {
/// See [`arrow::json::reader::Decoder::decode`].
///
/// [`arrow::json::reader::Decoder::decode`]: ::arrow::json::reader::Decoder::decode

I double checked and https://docs.rs/arrow-json/53.2.0/arrow_json/reader/struct.Decoder.html seems to describe this interface well.
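
For readers without the full diff open, here is a hedged sketch of the forwarding pattern the PR applies to JsonDecoder (and, analogously, CsvDecoder). Only decode is named in the excerpt above; the flush method and the struct layout are assumptions modeled on arrow-json's public Decoder API.

```rust
use std::fmt;
use std::fmt::Debug;

use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Assumed shape of the crate-internal trait shown in the excerpt above;
/// only `decode` appears there, `flush` mirrors arrow-json's API.
pub(crate) trait Decoder: Send + Debug {
    /// See `arrow::json::reader::Decoder::decode`.
    fn decode(&mut self, buf: &[u8]) -> Result<usize, ArrowError>;

    /// See `arrow::json::reader::Decoder::flush`.
    fn flush(&mut self) -> Result<Option<RecordBatch>, ArrowError>;
}

/// Hypothetical wrapper forwarding to the ready-made arrow-json decoder.
pub(crate) struct JsonDecoder {
    inner: arrow::json::reader::Decoder,
}

impl Debug for JsonDecoder {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Manual impl in case the inner arrow decoder does not implement `Debug`.
        f.debug_struct("JsonDecoder").finish()
    }
}

impl Decoder for JsonDecoder {
    fn decode(&mut self, buf: &[u8]) -> Result<usize, ArrowError> {
        self.inner.decode(buf)
    }

    fn flush(&mut self) -> Result<Option<RecordBatch>, ArrowError> {
        self.inner.flush()
    }
}
```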

///
/// This struct is responsible for converting a stream of bytes, which represent
/// encoded data, into a stream of `RecordBatch` objects, following the specified
/// schema and formatting options.

it might be worth also mentioning here that this handles any buffering of the input that might be required to fulfill the decode interface (which might return RecordBatches before fully consuming the input)

It took me a while to figure out why this was required


Done - thank you for pointing it out
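
To make the buffering point concrete, here is a simplified, hypothetical sketch of how a DecoderDeserializer-style wrapper can queue digested chunks so that bytes the decoder has not yet consumed survive across calls to next. The field names and queue-based approach are assumptions, not the PR's exact implementation, and it relies on the arrow decoders' behavior of stopping mid-chunk only when a batch is ready to flush.

```rust
use std::collections::VecDeque;

use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use bytes::Bytes;

// Minimal local stand-ins matching the sketches earlier in this thread.
trait Decoder {
    fn decode(&mut self, buf: &[u8]) -> Result<usize, ArrowError>;
    fn flush(&mut self) -> Result<Option<RecordBatch>, ArrowError>;
}

enum DeserializerOutput {
    RecordBatch(RecordBatch),
    RequiresMoreData,
    InputExhausted,
}

/// Buffers digested chunks so that unconsumed bytes survive across calls
/// to `next` (hypothetical field names).
struct DecoderDeserializer<T: Decoder> {
    decoder: T,
    messages: VecDeque<Bytes>,
    finalized: bool,
}

impl<T: Decoder> DecoderDeserializer<T> {
    /// Queue a chunk; decoding is deferred until `next` is called.
    fn digest(&mut self, message: Bytes) {
        self.messages.push_back(message);
    }

    /// Mark the end of the input stream.
    fn finish(&mut self) {
        self.finalized = true;
    }

    /// Decode queued bytes until a batch is ready. If the decoder stops
    /// before consuming a whole chunk (because a batch filled up), the
    /// unconsumed tail stays at the front of the queue for the next call.
    fn next(&mut self) -> Result<DeserializerOutput, ArrowError> {
        while let Some(message) = self.messages.front_mut() {
            let consumed = self.decoder.decode(message)?;
            if consumed == message.len() {
                self.messages.pop_front();
            } else {
                // Keep the bytes the decoder did not consume.
                *message = message.slice(consumed..);
            }
            if let Some(batch) = self.decoder.flush()? {
                return Ok(DeserializerOutput::RecordBatch(batch));
            }
        }
        if self.finalized {
            // Drain whatever the decoder still holds after the last chunk.
            match self.decoder.flush()? {
                Some(batch) => Ok(DeserializerOutput::RecordBatch(batch)),
                None => Ok(DeserializerOutput::InputExhausted),
            }
        } else {
            Ok(DeserializerOutput::RequiresMoreData)
        }
    }
}
```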

@@ -651,36 +651,14 @@ impl FileOpener for CsvOpener {
Ok(futures::stream::iter(config.open(decoder)?).boxed())
}
GetResultPayload::Stream(s) => {
- let mut decoder = config.builder().build_decoder();
+ let decoder = config.builder().build_decoder();

That is certainly a lot nicer 😍
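
And for completeness, a hedged sketch of what the shared deserialize_stream driver might look like. The PR summary names the function, but the async-stream-based implementation below is an assumption; it shows how the digest/next/finish protocol turns a byte stream into a RecordBatch stream.

```rust
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use async_stream::stream; // assumed helper crate, used here for brevity
use bytes::Bytes;
use futures::stream::BoxStream;
use futures::{Stream, StreamExt};

// Minimal local stand-ins matching the sketches earlier in this thread.
trait BatchDeserializer<T>: Send {
    fn digest(&mut self, message: T);
    fn next(&mut self) -> Result<DeserializerOutput, ArrowError>;
    fn finish(&mut self);
}

enum DeserializerOutput {
    RecordBatch(RecordBatch),
    RequiresMoreData,
    InputExhausted,
}

/// Hypothetical shape of the shared driver: pull chunks from the input,
/// digest them, and yield each finished batch downstream.
fn deserialize_stream<'a>(
    mut input: impl Stream<Item = Result<Bytes, ArrowError>> + Send + Unpin + 'a,
    mut deserializer: impl BatchDeserializer<Bytes> + 'a,
) -> BoxStream<'a, Result<RecordBatch, ArrowError>> {
    stream! {
        loop {
            match deserializer.next() {
                // A batch is ready: emit it and keep going.
                Ok(DeserializerOutput::RecordBatch(batch)) => {
                    yield Ok(batch);
                }
                // All input digested and flushed: the stream ends.
                Ok(DeserializerOutput::InputExhausted) => break,
                // Ask upstream for more bytes, or signal end-of-input.
                Ok(DeserializerOutput::RequiresMoreData) => {
                    match input.next().await {
                        Some(Ok(bytes)) => deserializer.digest(bytes),
                        Some(Err(e)) => {
                            yield Err(e);
                            break;
                        }
                        None => deserializer.finish(),
                    }
                }
                Err(e) => {
                    yield Err(e);
                    break;
                }
            }
        }
    }
    .boxed()
}
```

This is why the CsvOpener branch above shrinks to building a decoder and handing it off: the looping, buffering, and end-of-stream handling now live in one place for every format with a Decoder implementation.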

@ozankabak (Contributor)

> One thing I noticed is that #13411 talks about arrow and avro as well. Do you plan to update them in a follow-on PR?

Yes, indeed. Not an immediate priority but we would like to tidy up the read side.

> Finally, the ticket also mentions parquet -- I think it will be hard to update the parquet reader (or any columnar file format) to use the Decoder trait. The parquet reader itself drives what IO to do (aka what byte ranges and when), unlike the more row-oriented formats.

I agree -- Parquet will probably stay separate for the time being.

@ozankabak merged commit 06db9ed into apache:main on Nov 16, 2024 · 25 checks passed