Jcipar/core 5087 changes to support iceberg #248

jcipar · 2024-07-02T18:30:14Z

https://redpandadata.atlassian.net/browse/CORE-5087

Adds some missing features to the Avro C++ implementation to support Iceberg development:

Custom metadata in Avro files.
map as a logical type for arrays.
element-id field for arrays.

My somewhat limited reading of the Java Iceberg implementation and the Java Avro implementation suggest that Java Iceberg is relying on custom properties (called custom attributes or custom fields in C++) for the "element-id". In Java Avro, custom properties can be any type, not just strings. In the future we may want to change the way custom properties/attributes work in C++ Avro, rather than having element-id be a special case for Iceberg.

More info about open questions related to this PR: https://docs.google.com/document/d/13Nh-01cTWU0cRoo-SqncqviG3oQBX8lp-Vt6oLWcqwo/edit

This change bumps our C++ version to 17 and disables specific warnings that cause build errors. I was hitting three build errors, possibly due to the newer compiler: - lang/c++/impl/Compiler.cc:304:18: error: possibly dangling reference to a temporary [-Werror=dangling-reference] - /usr/include/boost/math/tools/config.hpp:23:6: error: #warning "The minimum language standard to use Boost.Math will be C++14 starting in July 2023 (Boost 1.82 release)" [-Werror=cpp] - avro/lang/c++/include/avro/Reader.hh:89:17: error: ‘v.avro::ReaderImpl<avro::NullValidator>::readValue(double&)::<unnamed union>::d’ may be used uninitialized [-Werror=maybe-uninitialized] The last one only occurred during `./build.sh install`, not during `./build.sh test`

CLAassistant · 2024-07-02T18:30:20Z

All committers have signed the CLA.

lang/c++/impl/DataFile.cc

lang/c++/impl/NodeImpl.cc

andrwng · 2024-07-03T19:03:23Z

lang/c++/impl/Compiler.cc

+    std::shared_ptr<NodeArray> nodePointer = std::make_shared<NodeArray>(
+        asSingleAttribute(makeNode(it->second, st, ns)));
+    NodePtr node = NodePtr(nodePointer);
+    if (containsField(m, "element-id")) {
+        nodePointer->elementId_ = getStringField(e, m, "element-id");
+    }


I'm kind of confused about what element id is. In my head, element-id was a numeric label indicating the position in the array. Is that not the case? Looking at some other impl, it does look like it's an int, but it's possible that may be an internal id that indexes the actual string value.

https://github.com/Netflix/iceberg/blob/5684ac1b77851737ed316e7f9d302f902b3e1e67/core/src/main/java/com/netflix/iceberg/avro/AvroSchemaUtil.java#L215-L219

Do you have any pointers to more context about element-ids?

I tried googling it too, and couldn't find anything terribly useful. A lot of this stuff seems undocumented. I'm primarily working backwards from a schema generated by pyspark. I wrote a script to extract the schema from valid manifest and manifest list files. I tried removing things to see what's actually required. element-id is required here by both the Java and Python versions.

Since it's a value in the schema, my guess is that it's meant to identify the array element type, in the same way that field-id is used for record fields. So if you change the element type of an array without changing the name of an array, the system will be able to detect the schema evolution and parse the elements correctly.

You're right that it appears to be an integer in all of the examples. I made it a string because in the C++ version the field-id is required to be a string¹. It looks like in pyIceberg it has a type annotation as an int field, but pyspark is able to read files generated by this. It's possible we'll need to revisit this in the future and make it an int, or do a big overhaul of how custom attributes work in the C++ Avro implementation.

Footnotes

This is actually an incompatibility with other versions of Avro. The original Iceberg schema that was generated by pySpark had integer field IDs. I changed them to strings and it worked. ↩

To clarify about the custom attribute thing:

An alternative to adding element-id to the array class would be to add custom attributes for arrays. In the C++ Avro implementation custom attributes are only supported for record types. If we added them for arrays element ID could just be a normal custom attribute instead of a special case.

This doesn't solve the typing problem. In C++ Avro custom attributes are a map<string, string>. To support integer attributes we would need to make custom attributes more generic. Digging through the commit history, it looks like that's how it used to work, but support for non-string attributes was removed a few years ago.

Since it's a value in the schema, my guess is that it's meant to identify the array element type, in the same way that field-id is used for record fields. So if you change the element type of an array without changing the name of an array, the system will be able to detect the schema evolution and parse the elements correctly.

Ah yeah, these docs pretty much confirm that, under "Field IDs": https://iceberg.apache.org/spec/#avro. BTW have you come across key-id and value-id? Does the C++ library support them well already?

AFAICT these all may as well have been some column-id, not quite sure why they opted to use different names per type.

Any idea how difficult it'd be to tweak the field-id to be an int? I guess I don't feel strongly enough or know enough about how this may bite us later, but if it's easy, might as well get it out of the way now?

Re: custom attributes, maybe a quick-ish hack would be to revive that feature but with a map<string, int> that'd be specific to custom numeric attributes? I guess that'd be pretty hard to upstream, but maybe not a huge priority atm

I don't think that's the correct spec you linked. I think that's referring to the Iceberg schema, which is also written in JSON and looks a lot like an Avro schema.

Custom attributes weren't completely removed. In the original implementation they could be any type. That was quickly revised to be only strings.

In this commit custom attributes were introduced and the type for the value is json::Entity

In this commit, 2 weeks later, it was changed to string. It looks like the reasoning was to avoid

I don't think that's the correct spec you linked. I think that's referring to the Iceberg schema, which is also written in JSON and looks a lot like an Avro schema.

I'm okay with that being the case, but just noting that what I'm referring to is also apparently an Avro schema.

Iceberg struct, list, and map types identify nested types by ID. When writing data to Avro files, these IDs must be stored in the Avro schema to support ID-based column pruning.

I think this is saying that when writing data with the Avro format, the Avro schema needs to have element-id et al. Are you saying that this PR is required for all data formats (not just Avro) and is referring to the Iceberg schema? Might be helpful to include the sample schema so it's easier to point to exactly what you're referring to

Oh yeah, it looks like you're right:

Iceberg struct, list, and map types identify nested types by ID. When writing data to Avro files, these IDs must be stored in the Avro schema to support ID-based column pruning.

In all of the examples they give, the IDs are integers.

On the other hand, it looks like in the Java Avro implementation, custom attributes have to be strings: https://avro.apache.org/docs/1.6.2/api/csharp/classAvro_1_1Field.html#ac93ac61816cb1f110f6f84eb248f0643

WillemKauf · 2024-07-04T18:23:56Z

lang/c++/impl/Compiler.cc

+        asSingleAttribute(makeNode(it->second, st, ns)));
+    NodePtr node = NodePtr(nodePointer);
+    if (containsField(m, "element-id")) {
+        nodePointer->elementId_ = getStringField(e, m, "element-id");


This also ends up being two lookups to the map between containsField() and getStringField(). I guess avro isn't too concerned about these lookups though, since they use this sub-optimal pattern throughout their functions. Could be a nice extension to add a function std::optional<string> maybeGetStringField() or something of the like in the future.

Yeah, getStringField calls findField which throws an exception if the field isn't present.

The Iceberg spec requires custom metadata fields for Avro files. It expects information such as the table scheme, the partition information and the manifest type to be in the Manifest File's metadata. This change makes the `setMetadata` method of `DataFileWriterBase` public instead of private, and adds a similar public method to `DataFileWriter`.

Maps can be represented by either the built-in map type, or an array of key-value pairs with the logical type `map`. The Iceberg "column_sizes" field is an example of the latter. This change allows arrays to be annotated with the `map` logical type, which was previously not supported in C++ Avro.

`element-id` is an optional field used by Iceberg arrays and required by some Iceberg implmentations.

andrwng

I'm okay with merging this as is, but it seems like we'll have to revisit custom attributes when we get around to supporting Avro schemas for user data.

Just summarizing/double checking my understanding today, for future reference in case it's useful:

The primary requirement of Avro with Iceberg is that Iceberg metadata files are mostly written in Avro.
The Iceberg project has static (but evolving, release to release) schemas for its metadata. These are schemas that we'll codegen as a part of the build so we can generate these metadata objects in C++.
These static schemas we've been using as reference were generated with other projects/languages' Avro impls.
The element-id changes aren't strictly necessary to get us started writing Iceberg metadata, but will be when we come around to supporting Avro schemas for data.
The map changes are necessary because the Iceberg metadata includes a map.
The addition of custom attributes should subsume the element-id changes, provided callers will have a way to serialize the value strings as integers. It's currently unclear exactly where Iceberg does this, but it's apparent from the Java Iceberg implementation that they are casting a string to an int for these special "field id"-like fields.

Is that all correct, @jcipar ?

andrwng · 2024-07-06T05:23:07Z

lang/c++/CMakeLists.txt

@@ -21,7 +21,7 @@ cmake_minimum_required (VERSION 3.1)
 set (CMAKE_LEGACY_CYGWIN_WIN32 0)

 if (NOT DEFINED CMAKE_CXX_STANDARD)
-    set(CMAKE_CXX_STANDARD 11)
+    set(CMAKE_CXX_STANDARD 17)


apache/avro seems to have this. Wondering if it makes sense to rebase on the apache master branch

apache/avro seems to have this. Wondering if it makes sense to rebase on the apache master branch

we should probably wait until after we cut the release, unless we need stuff from from upstream avro. cc @BenPope maybe he has a sense of what's been going on upstream?

dotnwat

lgtm

dotnwat · 2024-07-06T14:38:39Z

lang/c++/CMakeLists.txt

@@ -21,7 +21,7 @@ cmake_minimum_required (VERSION 3.1)
 set (CMAKE_LEGACY_CYGWIN_WIN32 0)

 if (NOT DEFINED CMAKE_CXX_STANDARD)
-    set(CMAKE_CXX_STANDARD 11)
+    set(CMAKE_CXX_STANDARD 17)


This is fine, but I would expect this to be unnecessary for us since we set C++20. ah but you're probably working on this as an independent library.

dotnwat · 2024-07-06T14:39:29Z

lang/c++/impl/DataFile.cc

-                                       Codec codec) : filename_(filename),
+                                       Codec codec, const std::map<std::string, std::string> &customMetadata) : filename_(filename),
                                                      schema_(schema),
                                                      encoderPtr_(binaryEncoder()),
                                                      syncInterval_(syncInterval),


thanks! much easier to read

dotnwat · 2024-07-06T18:03:49Z

lang/c++/CMakeLists.txt

@@ -21,7 +21,7 @@ cmake_minimum_required (VERSION 3.1)
 set (CMAKE_LEGACY_CYGWIN_WIN32 0)

 if (NOT DEFINED CMAKE_CXX_STANDARD)
-    set(CMAKE_CXX_STANDARD 11)
+    set(CMAKE_CXX_STANDARD 17)


apache/avro seems to have this. Wondering if it makes sense to rebase on the apache master branch

we should probably wait until after we cut the release, unless we need stuff from from upstream avro. cc @BenPope maybe he has a sense of what's been going on upstream?

dotnwat · 2024-07-06T18:10:50Z

I'm okay with merging this as is, but it seems like we'll have to revisit custom attributes when we get around to supporting Avro schemas for user data.

@jcipar can you capture any todo items / follow-ups that come from this discussion into a jira ticket?

jcipar · 2024-07-08T14:41:07Z

I'm okay with merging this as is, but it seems like we'll have to revisit custom attributes when we get around to supporting Avro schemas for user data.

Just summarizing/double checking my understanding today, for future reference in case it's useful:

The primary requirement of Avro with Iceberg is that Iceberg metadata files are mostly written in Avro.

The Iceberg project has static (but evolving, release to release) schemas for its metadata. These are schemas that we'll codegen as a part of the build so we can generate these metadata objects in C++.

These static schemas we've been using as reference were generated with other projects/languages' Avro impls.

The element-id changes aren't strictly necessary to get us started writing Iceberg metadata, but will be when we come around to supporting Avro schemas for data.

The map changes are necessary because the Iceberg metadata includes a map.

The addition of custom attributes should subsume the element-id changes, provided callers will have a way to serialize the value strings as integers. It's currently unclear exactly where Iceberg does this, but it's apparent from the Java Iceberg implementation that they are casting a string to an int for these special "field id"-like fields.

Is that all correct, @jcipar ?

That's mostly correct. The only correction I have is this:

The element-id changes aren't strictly necessary to get us started writing Iceberg metadata, but will be when we come around to supporting Avro schemas for data.

Element-id is required by at least some tools that consume Iceberg metadata. E.g. PyIceberg won't read a table that doesn't have element-ids. So I'd say this is required to produce correct Iceberg.

jcipar · 2024-07-08T14:50:20Z

I'm okay with merging this as is, but it seems like we'll have to revisit custom attributes when we get around to supporting Avro schemas for user data.

@jcipar can you capture any todo items / follow-ups that come from this discussion into a jira ticket?

https://redpandadata.atlassian.net/browse/CORE-5538

github-actions bot added the C++ label Jul 2, 2024

jcipar force-pushed the jcipar/CORE-5087-changes-to-support-iceberg branch 3 times, most recently from a84ea6c to e22aaec Compare July 3, 2024 17:45

jcipar marked this pull request as ready for review July 3, 2024 17:47

jcipar requested review from BenPope, dotnwat and andrwng July 3, 2024 17:48

dotnwat reviewed Jul 3, 2024

View reviewed changes

lang/c++/impl/DataFile.cc Outdated Show resolved Hide resolved

lang/c++/impl/DataFile.cc Outdated Show resolved Hide resolved

lang/c++/impl/NodeImpl.cc Outdated Show resolved Hide resolved

andrwng reviewed Jul 3, 2024

View reviewed changes

WillemKauf reviewed Jul 4, 2024

View reviewed changes

jcipar added 3 commits July 5, 2024 13:51

Add element-id support for Avro arrays.

966d457

`element-id` is an optional field used by Iceberg arrays and required by some Iceberg implmentations.

jcipar force-pushed the jcipar/CORE-5087-changes-to-support-iceberg branch from e22aaec to 966d457 Compare July 5, 2024 17:53

andrwng approved these changes Jul 5, 2024

View reviewed changes

andrwng reviewed Jul 6, 2024

View reviewed changes

dotnwat approved these changes Jul 6, 2024

View reviewed changes

jcipar merged commit aa41aac into redpanda-data:release-1.11.1-redpanda Jul 8, 2024
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Jcipar/core 5087 changes to support iceberg #248

Jcipar/core 5087 changes to support iceberg #248

jcipar commented Jul 2, 2024 •

edited

Loading

CLAassistant commented Jul 2, 2024 •

edited

Loading

andrwng Jul 3, 2024

jcipar Jul 3, 2024

jcipar Jul 3, 2024

andrwng Jul 5, 2024

jcipar Jul 5, 2024 •

edited

Loading

andrwng Jul 5, 2024 •

edited

Loading

jcipar Jul 5, 2024

WillemKauf Jul 4, 2024

jcipar Jul 5, 2024

andrwng left a comment

andrwng Jul 6, 2024

dotnwat Jul 6, 2024

dotnwat left a comment

dotnwat Jul 6, 2024

dotnwat Jul 6, 2024

dotnwat Jul 6, 2024

dotnwat commented Jul 6, 2024

jcipar commented Jul 8, 2024

jcipar commented Jul 8, 2024

Jcipar/core 5087 changes to support iceberg #248

Jcipar/core 5087 changes to support iceberg #248

Conversation

jcipar commented Jul 2, 2024 • edited Loading

CLAassistant commented Jul 2, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Footnotes

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jcipar Jul 5, 2024 • edited Loading

Choose a reason for hiding this comment

andrwng Jul 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andrwng left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dotnwat left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dotnwat commented Jul 6, 2024

jcipar commented Jul 8, 2024

jcipar commented Jul 8, 2024

jcipar commented Jul 2, 2024 •

edited

Loading

CLAassistant commented Jul 2, 2024 •

edited

Loading

jcipar Jul 5, 2024 •

edited

Loading

andrwng Jul 5, 2024 •

edited

Loading