[Kernel] Extended schema JSON serde to support collations #3628

Merged
Changes from 41 commits
Commits
51 commits
fb588c1
extended StringType to have CollationIdentifier
ilicmarkodb Aug 30, 2024
db4e7b2
reordered attributes
ilicmarkodb Aug 30, 2024
49059ff
changed PROVIDER_KERNEL to PROVIDER_SPARK
ilicmarkodb Aug 30, 2024
d9279ef
extended serialization and deserialization to support collation
ilicmarkodb Aug 30, 2024
916049e
style fix
ilicmarkodb Aug 30, 2024
54b59f4
style fix
ilicmarkodb Aug 30, 2024
9715cf5
style fix
ilicmarkodb Aug 30, 2024
55c5191
added CollationIdentifier equals
ilicmarkodb Aug 30, 2024
51162f0
style fix
ilicmarkodb Aug 30, 2024
712e081
style fix
ilicmarkodb Aug 30, 2024
36571ab
fix
ilicmarkodb Aug 31, 2024
86602c6
tests added for CollationIdentifier
ilicmarkodb Sep 2, 2024
76cdbd5
style fix
ilicmarkodb Sep 2, 2024
d8fc611
style fix
ilicmarkodb Sep 2, 2024
9c9684a
changed toString and fromString
ilicmarkodb Sep 2, 2024
c6bd336
changed CollationIdentifier
ilicmarkodb Sep 5, 2024
5e0e43e
changed CollationIdentifier
ilicmarkodb Sep 5, 2024
2d9465d
merged with extend_string_type_to_have_collation
ilicmarkodb Sep 9, 2024
20a1081
suggestions applied
ilicmarkodb Sep 9, 2024
6469ba1
suggestions applied
ilicmarkodb Sep 9, 2024
daa2f66
merged with extend_string_type_to_have_collation
ilicmarkodb Sep 9, 2024
37c3617
javadoc updated
ilicmarkodb Sep 9, 2024
9b2835f
merged with extend_string_type_to_have_collation
ilicmarkodb Sep 9, 2024
8e0fb82
temp
ilicmarkodb Sep 9, 2024
164edcc
temp
ilicmarkodb Sep 9, 2024
16113cd
parser and tests fixed
ilicmarkodb Sep 9, 2024
65ad43e
parser and tests fixed
ilicmarkodb Sep 9, 2024
dc3db16
temp commit
ilicmarkodb Sep 10, 2024
e107247
suggestions applied
ilicmarkodb Sep 10, 2024
2339d46
stringtype equals tests added
ilicmarkodb Sep 10, 2024
14b7327
stringtype equals updated
ilicmarkodb Sep 10, 2024
e914e6f
removed DEFAULT values
ilicmarkodb Sep 12, 2024
1feea71
since tag added
ilicmarkodb Sep 12, 2024
4c7d72f
merged with extend_string_type_to_have_collation
ilicmarkodb Sep 12, 2024
d95ebc0
changed CollationIdentifier constructor
ilicmarkodb Sep 16, 2024
a7e435b
java doc added
ilicmarkodb Sep 20, 2024
fa836a4
temp
ilicmarkodb Sep 24, 2024
eff0abf
suggestion applied
ilicmarkodb Sep 24, 2024
99ce5ae
test fixed
ilicmarkodb Sep 24, 2024
bd62e3d
style fix
ilicmarkodb Sep 24, 2024
908750d
merged with master
ilicmarkodb Sep 24, 2024
9b2001b
temp
ilicmarkodb Sep 25, 2024
52280ce
suggestions applied
ilicmarkodb Sep 25, 2024
6a45b46
style fix
ilicmarkodb Sep 25, 2024
555d49e
added fetchCollationMetadata method
ilicmarkodb Sep 25, 2024
d359ed4
moved fetchCollationMetadata to constructor
ilicmarkodb Sep 26, 2024
67854c9
style fix
ilicmarkodb Sep 26, 2024
c6f1c97
fix
ilicmarkodb Sep 26, 2024
c5f41b9
fix
ilicmarkodb Sep 26, 2024
7b8c844
Update StructField.java
vkorukanti Sep 26, 2024
ae0b189
minor change
vkorukanti Sep 26, 2024
@@ -89,7 +89,11 @@ public static String serializeDataType(DataType dataType) {
*/
public static StructType deserializeStructType(String structTypeJson) {
try {
DataType parsedType = parseDataType(OBJECT_MAPPER.reader().readTree(structTypeJson));
DataType parsedType =
parseDataType(
OBJECT_MAPPER.reader().readTree(structTypeJson),
"" /* fieldPath */,
new FieldMetadata.Builder().build() /* collationsMetadata */);
if (parsedType instanceof StructType) {
return (StructType) parsedType;
} else {
@@ -130,23 +134,61 @@ public static StructType deserializeStructType(String structTypeJson) {
* "nullable" : false,
* "metadata" : { }
* }
*
* // Collated string type field serialized as:
* {
* "name" : "s",
* "type" : "string",
* "nullable" : false,
* "metadata" : {
* "__COLLATIONS": { "s": "ICU.de_DE" }
* }
* }
*
* // Array with collated strings field serialized as:
* {
* "name" : "arr",
* "type" : {
* "type" : "array",
* "elementType" : "string",
* "containsNull" : false
* },
* "nullable" : false,
* "metadata" : {
* "__COLLATIONS": { "arr.element": "ICU.de_DE" }
* }
* }
* </pre>
*
* @param fieldPath Path from the nearest ancestor that is of the {@link StructField} type. For
* example, "c1.key.element" represents a path starting from the {@link StructField} named
* "c1". The next element, "key", indicates that "c1" stores a {@link MapType}. The final
* element, "element", shows that the key of the map is an {@link ArrayType}.
* @param collationsMetadata Metadata that maps the path of a {@link StringType} to its collation.
* Only maps non-UTF8_BINARY collated {@link StringType}. Collation metadata is stored in the
* nearest ancestor, which is the StructField. This is because StructField includes a metadata
* field, whereas Map and Array do not, making them unable to store this information. Paths
* are in the same form as `fieldPath`. <a
* href="https://github.com/delta-io/delta/blob/master/protocol_rfcs/collated-string-type.md#collation-identifiers">Docs</a>
*/
static DataType parseDataType(JsonNode json) {
Collaborator

Update the method docs to explain what the collationMap is.

static DataType parseDataType(JsonNode json, String fieldPath, FieldMetadata collationsMetadata) {
switch (json.getNodeType()) {
case STRING:
// simple types are stored as just a string
return nameToType(json.textValue());
return nameToType(json.textValue(), fieldPath, collationsMetadata);
case OBJECT:
// complex types (array, map, or struct are stored as JSON objects)
String type = getStringField(json, "type");
switch (type) {
case "struct":
assertValidTypeForCollations(fieldPath, "struct", collationsMetadata);
return parseStructType(json);
case "array":
return parseArrayType(json);
assertValidTypeForCollations(fieldPath, "array", collationsMetadata);
return parseArrayType(json, fieldPath, collationsMetadata);
case "map":
return parseMapType(json);
assertValidTypeForCollations(fieldPath, "map", collationsMetadata);
return parseMapType(json, fieldPath, collationsMetadata);
// No default case here; fall through to the following error when no match
}
default:
@@ -160,26 +202,32 @@ static DataType parseDataType(JsonNode json) {
* Parses an <a href="https://github.com/delta-io/delta/blob/master/PROTOCOL.md#array-type">array
* type </a>
*/
private static ArrayType parseArrayType(JsonNode json) {
private static ArrayType parseArrayType(
JsonNode json, String fieldPath, FieldMetadata collationsMetadata) {
checkArgument(
json.isObject() && json.size() == 3,
String.format("Expected JSON object with 3 fields for array data type but got:\n%s", json));
boolean containsNull = getBooleanField(json, "containsNull");
DataType dataType = parseDataType(getNonNullField(json, "elementType"));
DataType dataType =
parseDataType(
getNonNullField(json, "elementType"), fieldPath + ".element", collationsMetadata);
return new ArrayType(dataType, containsNull);
}

/**
* Parses an <a href="https://github.com/delta-io/delta/blob/master/PROTOCOL.md#map-type">map type
* </a>
*/
private static MapType parseMapType(JsonNode json) {
private static MapType parseMapType(
JsonNode json, String fieldPath, FieldMetadata collationsMetadata) {
checkArgument(
json.isObject() && json.size() == 4,
String.format("Expected JSON object with 4 fields for map data type but got:\n%s", json));
boolean valueContainsNull = getBooleanField(json, "valueContainsNull");
DataType keyType = parseDataType(getNonNullField(json, "keyType"));
DataType valueType = parseDataType(getNonNullField(json, "valueType"));
DataType keyType =
parseDataType(getNonNullField(json, "keyType"), fieldPath + ".key", collationsMetadata);
DataType valueType =
parseDataType(getNonNullField(json, "valueType"), fieldPath + ".value", collationsMetadata);
return new MapType(keyType, valueType, valueContainsNull);
}
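
To make the path handling in these parser changes concrete, here is a minimal, self-contained sketch of how a field path grows by `.element`, `.key`, and `.value` while descending, and how a collation map keyed by those paths is consulted at string leaves. It is not the Kernel implementation: `CollationPathSketch`, `resolve`, and the use of plain `Map`/`String` objects in place of Jackson's `JsonNode` and Kernel's `FieldMetadata` are assumptions made purely for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the parser; names and types here are not the Kernel API. */
final class CollationPathSketch {

  /**
   * Resolves a type node to a printable type name. A node is either a String (a primitive type
   * name) or a Map with "type" set to "array" or "map" plus the matching sub-nodes.
   */
  static String resolve(Object node, String fieldPath, Map<String, String> collations) {
    if (node instanceof String) {
      String name = (String) node;
      String collation = collations.get(fieldPath);
      if (collation == null) {
        return name; // no collation recorded for this path
      }
      if (!name.equals("string")) {
        // a collation entry may only point at a string leaf
        throw new IllegalArgumentException("Invalid data type for collations: \"" + name + "\"");
      }
      return "string(" + collation + ")";
    }

    Map<?, ?> obj = (Map<?, ?>) node;
    String type = String.valueOf(obj.get("type"));
    if (collations.containsKey(fieldPath)) {
      // complex types cannot carry a collation themselves
      throw new IllegalArgumentException("Invalid data type for collations: \"" + type + "\"");
    }
    switch (type) {
      case "array":
        return "array<"
            + resolve(obj.get("elementType"), fieldPath + ".element", collations)
            + ">";
      case "map":
        return "map<"
            + resolve(obj.get("keyType"), fieldPath + ".key", collations)
            + ","
            + resolve(obj.get("valueType"), fieldPath + ".value", collations)
            + ">";
      default:
        throw new IllegalArgumentException("Unsupported type in this sketch: " + type);
    }
  }

  public static void main(String[] args) {
    Map<String, Object> array = new HashMap<>();
    array.put("type", "array");
    array.put("elementType", "string");
    Map<String, String> collations = Collections.singletonMap("arr.element", "ICU.de_DE");
    System.out.println(resolve(array, "arr", collations)); // prints array<string(ICU.de_DE)>
  }
}
```

The sketch starts the path at the field name ("arr"), which matches how `parseStructField` passes `name` as the initial `fieldPath` in the diff.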

@@ -211,14 +259,25 @@ private static StructType parseStructType(JsonNode json) {
private static StructField parseStructField(JsonNode json) {
Preconditions.checkArgument(json.isObject(), "Expected JSON object for struct field");
String name = getStringField(json, "name");
DataType type = parseDataType(getNonNullField(json, "type"));
FieldMetadata metadata = parseFieldMetadata(json.get("metadata"), false);
DataType type =
parseDataType(
getNonNullField(json, "type"), name, getCollationsMetadata(json.get("metadata")));
boolean nullable = getBooleanField(json, "nullable");
FieldMetadata metadata = parseFieldMetadata(json.get("metadata"));
return new StructField(name, type, nullable, metadata);
}

/** Parses an {@link FieldMetadata}. */
private static FieldMetadata parseFieldMetadata(JsonNode json) {
return parseFieldMetadata(json, true);
}

/**
* Parses a {@link FieldMetadata}, optionally including collation metadata, depending on
* `includecollationsMetadata`.
*/
private static FieldMetadata parseFieldMetadata(
JsonNode json, boolean includecollationsMetadata) {
if (json == null || json.isNull()) {
return FieldMetadata.empty();
}
@@ -231,6 +290,10 @@ private static FieldMetadata parseFieldMetadata(JsonNode json) {
JsonNode value = entry.getValue();
String key = entry.getKey();

if (!includecollationsMetadata && key.equals(DataType.COLLATIONS_METADATA_KEY)) {
continue;
}

if (value.isNull()) {
builder.putNull(key);
} else if (value.isIntegralNumber()) { // covers both int and long
@@ -298,8 +361,13 @@ private static <T> List<T> buildList(JsonNode json, Function<JsonNode, T> access
private static Pattern FIXED_DECIMAL_PATTERN = Pattern.compile(FIXED_DECIMAL_REGEX);

/** Parses primitive string type names to a {@link DataType} */
private static DataType nameToType(String name) {
private static DataType nameToType(
String name, String fieldPath, FieldMetadata collationsMetadata) {
if (BasePrimitiveType.isPrimitiveType(name)) {
if (collationsMetadata.contains(fieldPath)) {
assertValidTypeForCollations(fieldPath, name, collationsMetadata);
return new StringType(collationsMetadata.getString(fieldPath));
}
return BasePrimitiveType.createPrimitive(name);
} else if (name.equals("decimal")) {
return DecimalType.USER_DEFAULT;
@@ -341,6 +409,22 @@ private static String getStringField(JsonNode rootNode, String fieldName) {
return node.textValue(); // double check this only works for string values! and isTextual()!
}

private static void assertValidTypeForCollations(
String fieldPath, String fieldType, FieldMetadata collationsMetadata) {
if (collationsMetadata.contains(fieldPath) && !fieldType.equals("string")) {
throw new IllegalArgumentException(
String.format("Invalid data type for collations: \"%s\"", fieldType));
}
}

/** Returns a metadata with a map of field path to collation name. */
private static FieldMetadata getCollationsMetadata(JsonNode fieldMetadata) {
if (fieldMetadata == null || !fieldMetadata.has(DataType.COLLATIONS_METADATA_KEY)) {
return new FieldMetadata.Builder().build();
}
return parseFieldMetadata(fieldMetadata.get(DataType.COLLATIONS_METADATA_KEY));
}
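
As a rough illustration of what the read path does with a field's `metadata` object, the sketch below splits it into two parts: the `__COLLATIONS` entry (used only to build collated string types) and everything else (kept as the field's own metadata). The class `CollationMetadataSplitSketch` and both method names are hypothetical, and plain `java.util.Map` stands in for `JsonNode` and `FieldMetadata`.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of splitting field metadata on read; not the Kernel API. */
final class CollationMetadataSplitSketch {
  static final String COLLATIONS_KEY = "__COLLATIONS";

  /** Returns the path-to-collation map, or an empty map when the key is absent. */
  static Map<String, String> collationsOf(Map<String, Object> metadata) {
    Map<String, String> result = new LinkedHashMap<>();
    if (metadata == null) {
      return result;
    }
    Object raw = metadata.get(COLLATIONS_KEY);
    if (raw instanceof Map) {
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) raw).entrySet()) {
        result.put(String.valueOf(entry.getKey()), String.valueOf(entry.getValue()));
      }
    }
    return result;
  }

  /** Returns a copy of the metadata without the collations entry; the input is not modified. */
  static Map<String, Object> withoutCollations(Map<String, Object> metadata) {
    Map<String, Object> result = new LinkedHashMap<>();
    if (metadata != null) {
      result.putAll(metadata);
      result.remove(COLLATIONS_KEY);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Object> metadata = new LinkedHashMap<>();
    metadata.put("comment", "user-facing note");
    metadata.put(COLLATIONS_KEY, Collections.singletonMap("s", "ICU.de_DE"));
    System.out.println(collationsOf(metadata)); // prints {s=ICU.de_DE}
    System.out.println(withoutCollations(metadata)); // prints {comment=user-facing note}
  }
}
```

Keeping the two parts separate is what lets the collation live in the `StringType` after parsing instead of being duplicated in the field metadata.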

private static boolean getBooleanField(JsonNode rootNode, String fieldName) {
JsonNode node = getNonNullField(rootNode, fieldName);
Preconditions.checkArgument(
@@ -414,7 +498,7 @@ private static void writeStructField(JsonGenerator gen, StructField field) throw
writeDataType(gen, field.getDataType());
gen.writeBooleanField("nullable", field.isNullable());
gen.writeFieldName("metadata");
writeFieldMetadata(gen, field.getMetadata());
writeFieldMetadata(gen, field.getSerializationMetadata());
gen.writeEndObject();
}

@@ -25,6 +25,7 @@
*/
@Evolving
public abstract class DataType {
public static final String COLLATIONS_METADATA_KEY = "__COLLATIONS";

/**
* Are the data types same? The metadata or column names could be different.
@@ -17,6 +17,9 @@
package io.delta.kernel.types;

import io.delta.kernel.annotation.Evolving;
import io.delta.kernel.internal.util.Tuple2;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/**
@@ -102,6 +105,47 @@ public String toString() {
"StructField(name=%s,type=%s,nullable=%s,metadata=%s)", name, dataType, nullable, metadata);
}

public FieldMetadata getSerializationMetadata() {
Collaborator

Why do we need this, and how is it different from getMetadata?

Collaborator

If I understand correctly, this captures the collations of nested fields and returns them in FieldMetadata. Why is this not already the case when this StructField is created?

Collaborator

@stefankandic is this how Spark does it? This doesn't seem clear. What is the difference between getMetadata and this method? I understand this one carries the additional metadata, but I can see it causing ambiguity for developers.

List<Tuple2<String, String>> nestedCollatedFields = getNestedCollatedFields(dataType, name);
if (nestedCollatedFields.isEmpty()) {
return metadata;
}

FieldMetadata.Builder metadataBuilder = new FieldMetadata.Builder();
for (Tuple2<String, String> nestedField : nestedCollatedFields) {
metadataBuilder.putString(nestedField._1, nestedField._2);
}
return new FieldMetadata.Builder()
.fromMetadata(metadata)
.putFieldMetadata(DataType.COLLATIONS_METADATA_KEY, metadataBuilder.build())
.build();
}

private List<Tuple2<String, String>> getNestedCollatedFields(DataType parent, String path) {
List<Tuple2<String, String>> nestedCollatedFields = new ArrayList<>();
if (parent instanceof StringType) {
StringType stringType = (StringType) parent;
if (!stringType
.getCollationIdentifier()
.equals(CollationIdentifier.fromString("SPARK.UTF8_BINARY"))) {
nestedCollatedFields.add(
new Tuple2<>(path, stringType.getCollationIdentifier().toStringWithoutVersion()));
}
} else if (parent instanceof MapType) {
nestedCollatedFields.addAll(
getNestedCollatedFields(((MapType) parent).getKeyType(), path + ".key"));
nestedCollatedFields.addAll(
getNestedCollatedFields(((MapType) parent).getValueType(), path + ".value"));
} else if (parent instanceof ArrayType) {
nestedCollatedFields.addAll(
getNestedCollatedFields(((ArrayType) parent).getElementType(), path + ".element"));
}
// We didn't check for StructType because we store the StringType's
Collaborator

I think we still need to go through the fields within the StructType and check if any of them contains a Map/Array type.

// collation information in the nearest ancestor StructField's metadata when serializing.
return nestedCollatedFields;
vkorukanti marked this conversation as resolved.
}
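
For readers following the review discussion, the write path can be sketched as two steps: walk the field's type to collect a path for every non-default collated string, then attach that map under `__COLLATIONS` next to the user metadata. The code below is only an illustration under stated assumptions: `NestedCollationSketch`, its tiny `Type` model (where a `null` collation stands for the default UTF8_BINARY), and `serializationMetadata` are hypothetical and do not use Kernel's `DataType`, `FieldMetadata`, or `Tuple2`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of collecting nested collations on write; not the Kernel API. */
final class NestedCollationSketch {

  /** Minimal type model: kind is "string", "array", "map", or any other type name. */
  static final class Type {
    final String kind;
    final String collation; // only meaningful for "string"; null means the default collation
    final List<Type> children; // array: [element]; map: [key, value]

    Type(String kind, String collation, Type... children) {
      this.kind = kind;
      this.collation = collation;
      this.children = Arrays.asList(children);
    }
  }

  /** Collects path -> collation for every collated string reachable through arrays and maps. */
  static Map<String, String> collect(Type type, String path) {
    Map<String, String> out = new LinkedHashMap<>();
    collectInto(type, path, out);
    return out;
  }

  private static void collectInto(Type type, String path, Map<String, String> out) {
    switch (type.kind) {
      case "string":
        if (type.collation != null) {
          out.put(path, type.collation);
        }
        break;
      case "array":
        collectInto(type.children.get(0), path + ".element", out);
        break;
      case "map":
        collectInto(type.children.get(0), path + ".key", out);
        collectInto(type.children.get(1), path + ".value", out);
        break;
      default:
        // Other types, including structs, are not descended into in this sketch.
        break;
    }
  }

  /** Returns the metadata unchanged if nothing is collated, else a copy with the extra key. */
  static Map<String, Object> serializationMetadata(
      Map<String, Object> metadata, Map<String, String> collations) {
    if (collations.isEmpty()) {
      return metadata;
    }
    Map<String, Object> result = new LinkedHashMap<>(metadata);
    result.put("__COLLATIONS", collations);
    return result;
  }

  public static void main(String[] args) {
    Type mapType =
        new Type(
            "map",
            null,
            new Type("string", "ICU.de_DE"),
            new Type("array", null, new Type("string", "ICU.en_US")));
    System.out.println(collect(mapType, "c1"));
  }
}
```

Like the method in the diff, the sketch stops at struct boundaries, on the assumption that each nested `StructField` records the collations of its own subtree. Whether structs need additional traversal is the question the review comment raises, and the sketch does not settle it.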

@Override
public boolean equals(Object o) {
if (this == o) {