
[Kernel] Default Parquet reader implementation #1846

Closed
wants to merge 9 commits

Conversation

Collaborator

@vkorukanti vkorukanti commented Jun 20, 2023

Context

This PR is part of #1783.

Description

It implements a Parquet reader based on parquet-mr and produces the output as columnar batches, using implementations of the ColumnVector and ColumnarBatch interfaces.
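For orientation, a minimal sketch of how such a reader is meant to be consumed, assuming the ColumnVector and ColumnarBatch accessors from the Kernel interfaces; the driver loop itself is illustrative and not part of this PR:

import io.delta.kernel.data.ColumnVector;
import io.delta.kernel.data.ColumnarBatch;
import io.delta.kernel.utils.CloseableIterator;

// Illustrative consumer: the iterator of batches would come from the Parquet
// reader added in this PR; here we just print the first column (assumed LONG).
static void printFirstColumn(CloseableIterator<ColumnarBatch> batches) throws Exception {
    try (CloseableIterator<ColumnarBatch> it = batches) {
        while (it.hasNext()) {
            ColumnarBatch batch = it.next();
            ColumnVector col = batch.getColumnVector(0);
            for (int row = 0; row < batch.getSize(); row++) {
                if (col.isNullAt(row)) {
                    System.out.println("null");
                } else {
                    System.out.println(col.getLong(row));
                }
            }
        }
    }
}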

How was this patch tested?

UTs

Collaborator

@allisonport-db allisonport-db left a comment


Mostly left minor comments and a few questions. Two main items/questions

(1) I'd like to understand more about the need for the nullability array? It seems to me like it's only necessary for the GroupConverters and not the primitive ones. Also, the group converters need to be fixed for empty groups vs null values.

(2) We have multiple (primitive) converters for the same parquet type based on our datatypes. What's the reasoning for converting types in the converters vs the column vectors? We kind of convert between types in two places

  • Converters for types in the PrimitiveConverters
  • In the ColumnVector for BinaryColumnVector (string vs binary)

Essentially why not have converters only for the parquet types -- and convert to our data type in the column vectors as we already are doing in BinaryColumnVector?

return null;
}

private static Type pruneSubfields(Type type, DataType deltaDatatype)
Collaborator

Seems like this and pruneSchema can use the same code if we generalize it for GroupType?

Collaborator Author

Cleaned up a bit, but it isn't clean enough to merge them. MessageType and GroupType need two different inputs. For now keeping them separate; it only adds an extra 5 lines of code.

Collaborator

This only prunes the 2nd level of nesting right?

Collaborator

Can't we still factor out the shared code?

private static List<Type> pruneFields(GroupType type, StructType deltaDataType) {
    // prune fields, including nested pruning like in pruneSchema
    return deltaDataType.fields().stream()
        .map(column -> {
            Type subType = findSubFieldType(type, column);
            if (subType != null && column.getDataType() instanceof StructType) {
                return subType.asGroupType().withNewFields(
                    pruneFields(subType.asGroupType(), (StructType) column.getDataType()));
            } else {
                return subType;
            }
        })
        .filter(Objects::nonNull)
        .collect(Collectors.toList());
}

Collaborator Author

You need the MessageType for the pruneSchema.

Collaborator

MessageType extends GroupType though, right? So pruneSchema can just call a helper. Not a big deal though; can we either fix the pruning for nested fields or add a note that nested schema pruning only covers the 2nd level?
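Since MessageType does extend GroupType, here is a minimal sketch of what the shared helper could look like, assuming the pruneFields helper sketched above and parquet-mr's Types builder (names are illustrative):

import java.util.List;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;
import io.delta.kernel.types.StructType;

// The top-level schema is just a GroupType, so it can reuse the same
// field-pruning logic as nested groups and only rebuild the MessageType wrapper.
private static MessageType pruneSchema(MessageType fileSchema, StructType deltaSchema) {
    List<Type> prunedFields = pruneFields(fileSchema, deltaSchema);
    return Types.buildMessage()
        .addFields(prunedFields.toArray(new Type[0]))
        .named(fileSchema.getName());
}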

protected int currentRowIndex;
protected boolean[] nullability;

BasePrimitiveColumnConverter(int maxBatchSize)
Collaborator

Just for my understanding, is maxBatchSize more of a "suggested batch size" for initializing arrays, here and for the complex-type converters?

Collaborator Author

Yep, it is a suggested size. I started with maxBatchSize as the maximum possible size, but that became meaningless for the complex types. Renaming it to suggestedSize.

{
private final int size;
private final DataType dataType;
private final Optional<boolean[]> nullability;
Collaborator

Makes sense to ask this here or on any of the converters; but is this needed for any of the primitive types? Does checking the value array for a null value not suffice for the primitives? (assuming they are instantiated with null values in the beginning)

  • Why do we override isNullAt for DefaultBinaryVector?

Collaborator

@allisonport-db allisonport-db Jun 22, 2023

Nvm, Java primitives can't be null. Do we need to override isNullAt for DefaultBinaryVector though? Seems like it could be the same.

Collaborator Author

We need the nullability vector because the values are stored in Java primitive-type arrays, which can't hold nulls.

For DefaultBinaryVector, we basically rely on byte[][]; the first-level array can contain nulls.
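For illustration, a minimal sketch of the two isNullAt strategies being contrasted here (simplified; only the nullability handling is shown, and the default when no nullability array is supplied is exactly what the following comments debate):

import java.util.Optional;

// Primitive-backed vector: an int[] cannot hold null, so nulls are tracked in a
// separate, optional boolean array (the "nullability" vector from this PR).
class IntVectorSketch {
    private final int[] values;
    private final Optional<boolean[]> nullability;

    IntVectorSketch(int[] values, Optional<boolean[]> nullability) {
        this.values = values;
        this.nullability = nullability;
    }

    public boolean isNullAt(int rowId) {
        // Illustrative choice: treat a missing nullability array as "no nulls".
        return nullability.map(nulls -> nulls[rowId]).orElse(false);
    }
}

// Binary-backed vector: values live in a byte[][], so a null entry in the outer
// array already encodes a null value and no separate nullability array is needed.
class BinaryVectorSketch {
    private final byte[][] values;

    BinaryVectorSketch(byte[][] values) {
        this.values = values;
    }

    public boolean isNullAt(int rowId) {
        return values[rowId] == null;
    }
}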

* @return
*/
@Override
public boolean isNullAt(int rowId)
Collaborator

Again confused about the discrepancy for DefaultBinaryVector. Seems like !nullability.isPresent() is weird default behavior, and since DefaultBinaryVector is the only type where we don't provide a nullability array I wonder if it should be optional at all?

Collaborator

To clarify, I'm mainly wondering what's the expected behavior when no nullability array is provided? The default being all values are null in that scenario seems a little weird.

If we expect any child class that omits the nullability array to override this, can we throw an error here when it isn't present?

Collaborator Author

See the previous reply.

public void end()
{
int collectorIndexAtEnd = converter.currentEntryIndex;
this.nullability[currentRowIndex] = collectorIndexAtEnd == collectorIndexAtStart;
Collaborator

I think this is wrong for empty arrays (I tried it out and they're read as a null value). Same thing for all the other group converters.

If start/end is called at all shouldn't nullability[currentRowIndex] = false?

Collaborator Author

Nice catch. This could be a problem for empty arrays. Fixing it.

Collaborator Author

Debugging the empty-array case and the null-array case: unfortunately, Parquet calls GroupConverter.start and end in both cases. Looking at the parquet-mr provided implementation of the same for Avro conversion, they seem to have the same issue: if there is no element set call, the value is considered null.

I added a test case for now. Once this PR is merged, I will create an issue to check the behavior in other engines and whether it is going to be a problem.
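For context, a minimal sketch of the entry-count bookkeeping the quoted end() relies on (simplified; names are illustrative). Because Parquet invokes start()/end() for both an empty array and a null array, a row with zero collected elements ends up marked null either way, which is the ambiguity described above:

import org.apache.parquet.io.api.GroupConverter;

// Simplified LIST converter: it records how many elements the child converter
// produced between start() and end(); zero elements is treated as a null row.
abstract class ArrayConverterSketch extends GroupConverter {
    protected final boolean[] nullability;
    protected int currentRowIndex;
    protected int currentEntryIndex; // incremented by the element converter
    private int entryIndexAtStart;

    ArrayConverterSketch(int suggestedSize) {
        this.nullability = new boolean[suggestedSize];
    }

    @Override
    public void start() {
        entryIndexAtStart = currentEntryIndex;
    }

    @Override
    public void end() {
        // No element added between start() and end() => marked null, even though
        // Parquet also calls start()/end() for an empty (but non-null) array.
        nullability[currentRowIndex] = (currentEntryIndex == entryIndexAtStart);
        currentRowIndex++;
    }
}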

@vkorukanti
Collaborator Author

(1) I'd like to understand more about the need for the nullability array? It seems to me like it's only necessary for the GroupConverters and not the primitive ones. Also, the group converters need to be fixed for empty groups vs null values.

The nullability array is needed for primitive types because the values array is of a Java primitive type, which can't hold a null value. Also, this is optional, so in the future, if we know the values are not going to contain any nulls, we don't need to pass a nullability array.

(2) We have multiple (primitive) converters for the same parquet type based on our datatypes. What's the reasoning for converting types in the converters vs the column vectors? We kind of convert between types in two places

  • Converters for types in the PrimitiveConverters
  • In the ColumnVector for BinaryColumnVector (string vs binary)

Essentially why not have converters only for the parquet types -- and convert to our data type in the column vectors as we already are doing in BinaryColumnVector?

I think there is some duplicate code here. Will clean up.

@vkorukanti
Collaborator Author

vkorukanti commented Jun 22, 2023


I think the reason we have different converters is that we have different vectors with different access methods (getShort vs getInt). These vectors need different value array types; that's the reason. Let me know if you think we can remove any specific converter.

@allisonport-db
Collaborator


I guess it's just inconsistent since BinaryType and StringType share DefaultBinaryVector? But the other types that share the same parquet base type do not.

@vkorukanti
Collaborator Author


String and Binary vector contents are the same byte[][], but that's not the case for, for example, Short and Integer vectors (it's short[] vs int[]).
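For illustration, a minimal sketch of why the backing array drives a per-type converter (parquet-mr's PrimitiveConverter delivers INT32 values through addInt; the fields are simplified from the PR):

import java.util.Arrays;
import org.apache.parquet.io.api.PrimitiveConverter;

// Parquet has no 16-bit physical type, so Delta SHORT columns arrive as INT32.
// Because the backing array is short[] rather than int[], the narrowing happens
// once here in the converter instead of on every getShort() call.
class ShortColumnConverterSketch extends PrimitiveConverter {
    private final short[] values;
    private final boolean[] nullability;
    private int currentRowIndex;

    ShortColumnConverterSketch(int suggestedSize) {
        this.values = new short[suggestedSize];
        this.nullability = new boolean[suggestedSize];
        Arrays.fill(nullability, true); // rows stay null until a value arrives
    }

    @Override
    public void addInt(int value) {
        values[currentRowIndex] = (short) value;
        nullability[currentRowIndex] = false;
        currentRowIndex++;
    }
}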

}

@Override
public boolean hasNext()
Collaborator

This isn't idempotent, but I think I'll need to update this to add the row_index anyway, so I'm fine with adding a comment to fix it and addressing it in my PR.

Collaborator Author

You are right, this should be fixed. I will push a fix.
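As an aside, a common way to make hasNext() idempotent is to prepare the next element lazily and cache it until next() consumes it; a generic sketch (not the PR's code):

import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Optional;

// Generic sketch: hasNext() can be called any number of times without
// advancing the underlying reader, because the prepared element is cached.
abstract class IdempotentIterator<T> implements Iterator<T> {
    private Optional<T> next = Optional.empty();

    /** Reads the next element from the underlying source, or empty at end of data. */
    protected abstract Optional<T> readNext();

    @Override
    public boolean hasNext() {
        if (!next.isPresent()) {
            next = readNext();
        }
        return next.isPresent();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = next.get();
        next = Optional.empty();
        return result;
    }
}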


* @return
*/
public static Object getValueAsObject(ColumnVector vector, int rowId) {
// TODO: may be it is better to just provide a `getObject` on the `ColumnVector` to
Collaborator

Subclasses can access it right?

GroupType typeFromFile)
{
this.typeFromClient = typeFromClient;
final GroupType innerElementType = (GroupType) typeFromFile.getType("list");
Collaborator

We should note this down to address. I remember there being a lot of different variations of the list layout.
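For reference, the hard-coded "list" lookup above assumes the standard three-level LIST layout from the Parquet format spec; older writers emit two-level and other legacy variants, which is presumably the set of variations to note down. A small sketch of the standard layout, using parquet-mr's schema parser:

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// Standard three-level LIST layout: an outer field annotated LIST, a repeated
// middle group named "list", and an inner field named "element".
static MessageType standardListExample() {
    return MessageTypeParser.parseMessageType(
        "message example {\n"
      + "  optional group tags (LIST) {\n"
      + "    repeated group list {\n"
      + "      optional binary element (UTF8);\n"
      + "    }\n"
      + "  }\n"
      + "}");
}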

Comment on lines 159 to 160
protected int currentRowIndex;
protected boolean[] nullability;
Collaborator

I was thinking resetWorkingState() would be overridden to reset these, and called in implementations of getDataColumnVector. Not a big deal though; it just unifies code.

case "decimal": {
throw new UnsupportedOperationException("not yet implemented: " + name);
}
case "nested_struct": {
Collaborator

Test case for null for structs as well? (Or at least future todo?)

Collaborator Author

There are second-level structs which are null.

@allisonport-db
Collaborator


String and Binary vector contents are same byte[][], but not the same for for example: Short and Integer vectors (its short[] vs int[])

I guess my question then would be why they need different value vector types. What's the difference between doing the type conversion in, for example, getShort vs ShortColumnConverter.addInt? Checking for illegal access to a getter could be done in the same way as in DefaultBinaryVector. Mostly just asking questions for my own understanding.

@vkorukanti
Collaborator Author

vkorukanti commented Jun 22, 2023

Storage space, and an unnecessary type check on every call of getShort.
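For contrast, a sketch of the alternative being rejected here (illustrative only): keeping the raw Parquet INT32 values and converting in the getter costs a wider backing array plus a cast on every read.

// Alternative (not what the PR does): store the Parquet INT32 values as-is and
// narrow in the getter. This costs 4 bytes per value instead of 2, and the
// conversion is repeated on every getShort() call, which is the objection above.
class IntBackedShortVectorSketch {
    private final int[] values;

    IntBackedShortVectorSketch(int[] values) {
        this.values = values;
    }

    public short getShort(int rowId) {
        return (short) values[rowId];
    }
}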

@vkorukanti vkorukanti deleted the pr2-parquet branch September 14, 2023 11:55