Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jcipar/core 5087 changes to support iceberg #248

Conversation

jcipar
Copy link

@jcipar jcipar commented Jul 2, 2024

https://redpandadata.atlassian.net/browse/CORE-5087

Adds some missing features to the Avro C++ implementation to support Iceberg development:

  • Custom metadata in Avro files.
  • map as a logical type for arrays.
  • element-id field for arrays.

My somewhat limited reading of the Java Iceberg implementation and the Java Avro implementation suggest that Java Iceberg is relying on custom properties (called custom attributes or custom fields in C++) for the "element-id". In Java Avro, custom properties can be any type, not just strings. In the future we may want to change the way custom properties/attributes work in C++ Avro, rather than having element-id be a special case for Iceberg.

More info about open questions related to this PR: https://docs.google.com/document/d/13Nh-01cTWU0cRoo-SqncqviG3oQBX8lp-Vt6oLWcqwo/edit

This change bumps our C++ version to 17 and disables specific warnings
that cause build errors.

I was hitting three build errors, possibly due to the newer compiler:
- lang/c++/impl/Compiler.cc:304:18: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
- /usr/include/boost/math/tools/config.hpp:23:6: error: #warning "The minimum language standard to use Boost.Math will be C++14 starting in July 2023 (Boost 1.82 release)" [-Werror=cpp]
- avro/lang/c++/include/avro/Reader.hh:89:17: error: ‘v.avro::ReaderImpl<avro::NullValidator>::readValue(double&)::<unnamed union>::d’ may be used uninitialized [-Werror=maybe-uninitialized]

The last one only occurred during `./build.sh install`, not during `./build.sh test`
@CLAassistant
Copy link

CLAassistant commented Jul 2, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the C++ label Jul 2, 2024
@jcipar jcipar force-pushed the jcipar/CORE-5087-changes-to-support-iceberg branch 3 times, most recently from a84ea6c to e22aaec Compare July 3, 2024 17:45
@jcipar jcipar marked this pull request as ready for review July 3, 2024 17:47
@jcipar jcipar requested review from BenPope, dotnwat and andrwng July 3, 2024 17:48
lang/c++/impl/DataFile.cc Outdated Show resolved Hide resolved
lang/c++/impl/DataFile.cc Outdated Show resolved Hide resolved
lang/c++/impl/NodeImpl.cc Outdated Show resolved Hide resolved
lang/c++/impl/NodeImpl.cc Outdated Show resolved Hide resolved
Comment on lines +426 to +431
std::shared_ptr<NodeArray> nodePointer = std::make_shared<NodeArray>(
asSingleAttribute(makeNode(it->second, st, ns)));
NodePtr node = NodePtr(nodePointer);
if (containsField(m, "element-id")) {
nodePointer->elementId_ = getStringField(e, m, "element-id");
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm kind of confused about what element id is. In my head, element-id was a numeric label indicating the position in the array. Is that not the case? Looking at some other impl, it does look like it's an int, but it's possible that may be an internal id that indexes the actual string value.

https://github.com/Netflix/iceberg/blob/5684ac1b77851737ed316e7f9d302f902b3e1e67/core/src/main/java/com/netflix/iceberg/avro/AvroSchemaUtil.java#L215-L219

Do you have any pointers to more context about element-ids?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried googling it too, and couldn't find anything terribly useful. A lot of this stuff seems undocumented. I'm primarily working backwards from a schema generated by pyspark. I wrote a script to extract the schema from valid manifest and manifest list files. I tried removing things to see what's actually required. element-id is required here by both the Java and Python versions.

Since it's a value in the schema, my guess is that it's meant to identify the array element type, in the same way that field-id is used for record fields. So if you change the element type of an array without changing the name of an array, the system will be able to detect the schema evolution and parse the elements correctly.

You're right that it appears to be an integer in all of the examples. I made it a string because in the C++ version the field-id is required to be a string1. It looks like in pyIceberg it has a type annotation as an int field, but pyspark is able to read files generated by this. It's possible we'll need to revisit this in the future and make it an int, or do a big overhaul of how custom attributes work in the C++ Avro implementation.

Footnotes

  1. This is actually an incompatibility with other versions of Avro. The original Iceberg schema that was generated by pySpark had integer field IDs. I changed them to strings and it worked.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify about the custom attribute thing:

An alternative to adding element-id to the array class would be to add custom attributes for arrays. In the C++ Avro implementation custom attributes are only supported for record types. If we added them for arrays element ID could just be a normal custom attribute instead of a special case.

This doesn't solve the typing problem. In C++ Avro custom attributes are a map<string, string>. To support integer attributes we would need to make custom attributes more generic. Digging through the commit history, it looks like that's how it used to work, but support for non-string attributes was removed a few years ago.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's a value in the schema, my guess is that it's meant to identify the array element type, in the same way that field-id is used for record fields. So if you change the element type of an array without changing the name of an array, the system will be able to detect the schema evolution and parse the elements correctly.

Ah yeah, these docs pretty much confirm that, under "Field IDs": https://iceberg.apache.org/spec/#avro. BTW have you come across key-id and value-id? Does the C++ library support them well already?

AFAICT these all may as well have been some column-id, not quite sure why they opted to use different names per type.

Any idea how difficult it'd be to tweak the field-id to be an int? I guess I don't feel strongly enough or know enough about how this may bite us later, but if it's easy, might as well get it out of the way now?

Re: custom attributes, maybe a quick-ish hack would be to revive that feature but with a map<string, int> that'd be specific to custom numeric attributes? I guess that'd be pretty hard to upstream, but maybe not a huge priority atm

Copy link
Author

@jcipar jcipar Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's the correct spec you linked. I think that's referring to the Iceberg schema, which is also written in JSON and looks a lot like an Avro schema.

Custom attributes weren't completely removed. In the original implementation they could be any type. That was quickly revised to be only strings.

  • In this commit custom attributes were introduced and the type for the value is json::Entity
  • In this commit, 2 weeks later, it was changed to string. It looks like the reasoning was to avoid

Copy link

@andrwng andrwng Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's the correct spec you linked. I think that's referring to the Iceberg schema, which is also written in JSON and looks a lot like an Avro schema.

I'm okay with that being the case, but just noting that what I'm referring to is also apparently an Avro schema.

Iceberg struct, list, and map types identify nested types by ID. When writing data to Avro files, these IDs must be stored in the Avro schema to support ID-based column pruning.

I think this is saying that when writing data with the Avro format, the Avro schema needs to have element-id et al. Are you saying that this PR is required for all data formats (not just Avro) and is referring to the Iceberg schema? Might be helpful to include the sample schema so it's easier to point to exactly what you're referring to

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, it looks like you're right:

Iceberg struct, list, and map types identify nested types by ID. When writing data to Avro files, these IDs must be stored in the Avro schema to support ID-based column pruning.

In all of the examples they give, the IDs are integers.

On the other hand, it looks like in the Java Avro implementation, custom attributes have to be strings: https://avro.apache.org/docs/1.6.2/api/csharp/classAvro_1_1Field.html#ac93ac61816cb1f110f6f84eb248f0643

asSingleAttribute(makeNode(it->second, st, ns)));
NodePtr node = NodePtr(nodePointer);
if (containsField(m, "element-id")) {
nodePointer->elementId_ = getStringField(e, m, "element-id");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also ends up being two lookups to the map between containsField() and getStringField(). I guess avro isn't too concerned about these lookups though, since they use this sub-optimal pattern throughout their functions. Could be a nice extension to add a function std::optional<string> maybeGetStringField() or something of the like in the future.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, getStringField calls findField which throws an exception if the field isn't present.

jcipar added 3 commits July 5, 2024 13:51
The Iceberg spec requires custom metadata fields for Avro files. It
expects information such as the table scheme, the partition information
and the manifest type to be in the Manifest File's metadata.

This change makes the `setMetadata` method of `DataFileWriterBase`
public instead of private, and adds a similar public method to
`DataFileWriter`.
Maps can be represented by either the built-in map type, or an array of
key-value pairs with the logical type `map`. The Iceberg "column_sizes"
field is an example of the latter.

This change allows arrays to be annotated with the `map` logical type,
which was previously not supported in C++ Avro.
`element-id` is an optional field used by Iceberg arrays and required by
some Iceberg implmentations.
@jcipar jcipar force-pushed the jcipar/CORE-5087-changes-to-support-iceberg branch from e22aaec to 966d457 Compare July 5, 2024 17:53
Copy link

@andrwng andrwng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with merging this as is, but it seems like we'll have to revisit custom attributes when we get around to supporting Avro schemas for user data.

Just summarizing/double checking my understanding today, for future reference in case it's useful:

  • The primary requirement of Avro with Iceberg is that Iceberg metadata files are mostly written in Avro.
  • The Iceberg project has static (but evolving, release to release) schemas for its metadata. These are schemas that we'll codegen as a part of the build so we can generate these metadata objects in C++.
  • These static schemas we've been using as reference were generated with other projects/languages' Avro impls.
  • The element-id changes aren't strictly necessary to get us started writing Iceberg metadata, but will be when we come around to supporting Avro schemas for data.
  • The map changes are necessary because the Iceberg metadata includes a map.
  • The addition of custom attributes should subsume the element-id changes, provided callers will have a way to serialize the value strings as integers. It's currently unclear exactly where Iceberg does this, but it's apparent from the Java Iceberg implementation that they are casting a string to an int for these special "field id"-like fields.

Is that all correct, @jcipar ?

@@ -21,7 +21,7 @@ cmake_minimum_required (VERSION 3.1)
set (CMAKE_LEGACY_CYGWIN_WIN32 0)

if (NOT DEFINED CMAKE_CXX_STANDARD)
set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD 17)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apache/avro seems to have this. Wondering if it makes sense to rebase on the apache master branch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apache/avro seems to have this. Wondering if it makes sense to rebase on the apache master branch

we should probably wait until after we cut the release, unless we need stuff from from upstream avro. cc @BenPope maybe he has a sense of what's been going on upstream?

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -21,7 +21,7 @@ cmake_minimum_required (VERSION 3.1)
set (CMAKE_LEGACY_CYGWIN_WIN32 0)

if (NOT DEFINED CMAKE_CXX_STANDARD)
set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD 17)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but I would expect this to be unnecessary for us since we set C++20. ah but you're probably working on this as an independent library.

Comment on lines -67 to 72
Codec codec) : filename_(filename),
Codec codec, const std::map<std::string, std::string> &customMetadata) : filename_(filename),
schema_(schema),
encoderPtr_(binaryEncoder()),
syncInterval_(syncInterval),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! much easier to read

@@ -21,7 +21,7 @@ cmake_minimum_required (VERSION 3.1)
set (CMAKE_LEGACY_CYGWIN_WIN32 0)

if (NOT DEFINED CMAKE_CXX_STANDARD)
set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD 17)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apache/avro seems to have this. Wondering if it makes sense to rebase on the apache master branch

we should probably wait until after we cut the release, unless we need stuff from from upstream avro. cc @BenPope maybe he has a sense of what's been going on upstream?

@dotnwat
Copy link
Member

dotnwat commented Jul 6, 2024

I'm okay with merging this as is, but it seems like we'll have to revisit custom attributes when we get around to supporting Avro schemas for user data.

@jcipar can you capture any todo items / follow-ups that come from this discussion into a jira ticket?

@jcipar
Copy link
Author

jcipar commented Jul 8, 2024

I'm okay with merging this as is, but it seems like we'll have to revisit custom attributes when we get around to supporting Avro schemas for user data.

Just summarizing/double checking my understanding today, for future reference in case it's useful:

  • The primary requirement of Avro with Iceberg is that Iceberg metadata files are mostly written in Avro.
  • The Iceberg project has static (but evolving, release to release) schemas for its metadata. These are schemas that we'll codegen as a part of the build so we can generate these metadata objects in C++.
  • These static schemas we've been using as reference were generated with other projects/languages' Avro impls.
  • The element-id changes aren't strictly necessary to get us started writing Iceberg metadata, but will be when we come around to supporting Avro schemas for data.
  • The map changes are necessary because the Iceberg metadata includes a map.
  • The addition of custom attributes should subsume the element-id changes, provided callers will have a way to serialize the value strings as integers. It's currently unclear exactly where Iceberg does this, but it's apparent from the Java Iceberg implementation that they are casting a string to an int for these special "field id"-like fields.

Is that all correct, @jcipar ?

That's mostly correct. The only correction I have is this:

  • The element-id changes aren't strictly necessary to get us started writing Iceberg metadata, but will be when we come around to supporting Avro schemas for data.

Element-id is required by at least some tools that consume Iceberg metadata. E.g. PyIceberg won't read a table that doesn't have element-ids. So I'd say this is required to produce correct Iceberg.

@jcipar
Copy link
Author

jcipar commented Jul 8, 2024

I'm okay with merging this as is, but it seems like we'll have to revisit custom attributes when we get around to supporting Avro schemas for user data.

@jcipar can you capture any todo items / follow-ups that come from this discussion into a jira ticket?

https://redpandadata.atlassian.net/browse/CORE-5538

@jcipar jcipar merged commit aa41aac into redpanda-data:release-1.11.1-redpanda Jul 8, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants