Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] Extended schema JSON serde to support collations #3628

Merged

Conversation

ilicmarkodb
Copy link
Contributor

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Extended serialization and deserialization to support collations in metadata.

How was this patch tested?

Tests added to DataTypeJsonSerDe.java and StructTypeSuite.scala.

Does this PR introduce any user-facing changes?

No.

@ilicmarkodb
Copy link
Contributor Author

@vkorukanti can you please review?

@vkorukanti vkorukanti changed the title Extended serialization and deserialization to support collations in metadata. [Kernel] Extended schema JSON serde to support collations Sep 6, 2024
Copy link
Collaborator

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Add bit more comments on how the collation property is stored. You can look at the method docs in ColumnMapping (where similar nested level field ids are stored in metadata) for an example docs.


import java.util.Optional;

public class CollationIdentifier {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc, @since version and @evolving tag

import java.util.Optional;

public class CollationIdentifier {
public static final String PROVIDER_SPARK = "SPARK";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do these constant mean? add some comment?

Comment on lines 45 to 47
if (parts.length == 1) {
throw new IllegalArgumentException(
String.format("Invalid collation identifier: %s", identifier));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkArgument(parts.length != 1, String.format("Invalid collation identifier: %s", identifier));

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or switch(parts.length) {
case 2:
case 3:
default: throw error
}

return String.format("%s.%s", provider, name);
}

public static CollationIdentifier fromString(String identifier) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All public APIs are going to show up in API docs. Please add proper javadoc.

Comment on lines 34 to 36
this.provider = provider;
this.name = name;
this.version = version;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null checks. Objects.requireNonNull

@@ -16,6 +16,7 @@
package io.delta.kernel.types;

import io.delta.kernel.annotation.Evolving;
import io.delta.kernel.expressions.CollationIdentifier;

/**
* The data type representing {@code string} type values.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update the doc to include collation info.

@@ -130,6 +132,14 @@ class DataTypeJsonSerDeSuite extends AnyFunSuite {
}
}

test("parseDataType: types with collated strings") {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some negative tests which cause the parser to fail.


import org.scalatest.funsuite.AnyFunSuite

class StructTypeSuite extends AnyFunSuite {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. Why not just add the tests in DataTypeJsonSerDeSuite itself?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, you can roundtrip and test both serialize and deserialize works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right. thanks

@@ -89,7 +89,8 @@ public static String serializeDataType(DataType dataType) {
*/
public static StructType deserializeStructType(String structTypeJson) {
try {
DataType parsedType = parseDataType(OBJECT_MAPPER.reader().readTree(structTypeJson));
DataType parsedType =
parseDataType(OBJECT_MAPPER.reader().readTree(structTypeJson), "", new HashMap<>());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
parseDataType(OBJECT_MAPPER.reader().readTree(structTypeJson), "", new HashMap<>());
parseDataType(
OBJECT_MAPPER.reader().readTree(structTypeJson),
"" /* fieldPath */,
new HashMap<>() /* collationMap*/);

@@ -131,22 +132,29 @@ public static StructType deserializeStructType(String structTypeJson) {
* "metadata" : { }
* }
* </pre>
*
* @param fieldPath Path from the nearest ancestor that is of the {@link StructField} type.
* @param collationMap Maps the path of a {@link StringType} to its collation. Only maps non-UTF8_BINARY collated {@link StringType}.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mention why it is needed and how it used. Basically for the element types of map or array, have no fieldMetadata -> can't contain the collation -> Use the nearest structfield which has the field metadata to store the collation for map/array elements. Mention lookup key (path) format.

@@ -0,0 +1,43 @@
package io.delta.kernel.types
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add header? this didn't fail the CI job?


class StringTypeSuite extends AnyFunSuite {
test("check equals") {
Seq(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Seq(
// Testcase: (instance1, instance2, expected value for `instance1 == instance2`)
Seq(

*/
static DataType parseDataType(JsonNode json) {
static DataType parseDataType(
JsonNode json, String fieldPath, HashMap<String, String> collationMap) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just use Map in arguments def instead of HashMap. Same in other places.

Comment on lines 385 to 403
private static HashMap<String, String> getCollationsMap(JsonNode fieldMetadata) {
if (fieldMetadata == null || !fieldMetadata.has(DataType.COLLATIONS_METADATA_KEY)) {
return new HashMap<>();
}
HashMap<String, String> collationsMap = new HashMap<>();
FieldMetadata collationFieldMetadata =
parseFieldMetadata(fieldMetadata.get(DataType.COLLATIONS_METADATA_KEY));
for (Map.Entry<String, Object> collationField :
collationFieldMetadata.getEntries().entrySet()) {
String fieldPath = collationField.getKey();
Object collationName = collationField.getValue();
if (!(collationName instanceof String)) {
throw new IllegalArgumentException(
String.format("Invalid collation name: %s.", collationName));
}
collationsMap.put(fieldPath, (String) collationName);
}
return collationsMap;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a need for this. Basically instead of creating the Map, why not just use the fieldMetadata? And the fieldMetadata has a getString which already has type checks.

This will improve the code in this file as well. I think we do similarly in ColumnMapping.java for nested field ids.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right, thanks!

@@ -102,6 +105,47 @@ public String toString() {
"StructField(name=%s,type=%s,nullable=%s,metadata=%s)", name, dataType, nullable, metadata);
}

public FieldMetadata getSerializationMetadata() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this and how is it different from the getMetadata?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this is capturing the nested field collation types and returning in FieldMetadata. Why is this not already the case when this StructField is created?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefankandic is this how Spark does? This seems not clear. What is the difference between getMetadata vs this method? I understand this has the additional metadata, but for developers I see this causing ambiguity.

@ilicmarkodb ilicmarkodb force-pushed the extend_SerDe_to_support_collations branch from bfb2ee5 to 908750d Compare September 24, 2024 17:02
Comment on lines 34 to 36
DataTypeJsonSerDe.parseDataType(objectMapper.readTree(json),
"",
new FieldMetadata.Builder().build())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DataTypeJsonSerDe.parseDataType(objectMapper.readTree(json),
"",
new FieldMetadata.Builder().build())
DataTypeJsonSerDe.parseDataType(objectMapper.readTree(json),
"" /* fieldPath */,
new FieldMetadata.Builder().build() /* collation field metadata */)

Comment on lines +394 to +396
.add("b1", new ArrayType(new StringType("SPARK.UTF8_LCASE"), false))
.add("b2", new MapType(
new StringType("ICU.UNICODE_CI"), new StringType("SPARK.UTF8_LCASE"), true), false)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about the case where Map/array element is a struct which has a string column with collation.

),
(
structTypeJson(Seq(
structFieldJson("a1", structTypeJson(Seq(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add a comment just above the structFieldJson on what test this field is covering. Easy to see the tests and understand.

)
)

val SAMPLE_JSON_TO_TYPES_WITH_COLLATION_DIFFERING = Seq(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does DIFFERING mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idea of this test was to have just difference in StringType collation, i will make some better name

.add("b1", new StringType("ICU.UNICODE")), true)
.add("a2", new StructType()
// In json, this field has "SPARK.UTF8_LCASE" collation
.add("b1", new ArrayType(StringType.STRING, false))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need these negative tests? If I understand correctly, as long as the StructType.equals is implemented correctly, then this should work. May be just add one specific test where it is not matching?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, make sense since we have StringTypeSuite where we test this

@@ -102,6 +105,47 @@ public String toString() {
"StructField(name=%s,type=%s,nullable=%s,metadata=%s)", name, dataType, nullable, metadata);
}

public FieldMetadata getSerializationMetadata() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefankandic is this how Spark does? This seems not clear. What is the difference between getMetadata vs this method? I understand this has the additional metadata, but for developers I see this causing ambiguity.

nestedCollatedFields.addAll(
getNestedCollatedFields(((ArrayType) parent).getElementType(), path + ".element"));
}
// We didn't check for StructType because we store the StringType's
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to go through the fields within the StructType and check if any of them contains a Map/Array type.

@@ -79,9 +82,37 @@ public DataType getDataType() {

/** @return the metadata for this field */
public FieldMetadata getMetadata() {
fetchCollationMetadata();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if this can be handled simply by adding the code in the StructField constructor. It can go through the type and figure out if it needs collation data to be stored in its metadata. If yes, just add them there. Given StructField is immutable, we don't need to do dynamic computation of the collations for nested fields like here which is prone to bugs.

@vkorukanti vkorukanti merged commit b1e4a03 into delta-io:master Sep 27, 2024
16 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants