-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Jcipar/core 5087 changes to support iceberg #248
Jcipar/core 5087 changes to support iceberg #248
Conversation
This change bumps our C++ version to 17 and disables specific warnings that cause build errors. I was hitting three build errors, possibly due to the newer compiler: - lang/c++/impl/Compiler.cc:304:18: error: possibly dangling reference to a temporary [-Werror=dangling-reference] - /usr/include/boost/math/tools/config.hpp:23:6: error: #warning "The minimum language standard to use Boost.Math will be C++14 starting in July 2023 (Boost 1.82 release)" [-Werror=cpp] - avro/lang/c++/include/avro/Reader.hh:89:17: error: ‘v.avro::ReaderImpl<avro::NullValidator>::readValue(double&)::<unnamed union>::d’ may be used uninitialized [-Werror=maybe-uninitialized] The last one only occurred during `./build.sh install`, not during `./build.sh test`
a84ea6c
to
e22aaec
Compare
std::shared_ptr<NodeArray> nodePointer = std::make_shared<NodeArray>( | ||
asSingleAttribute(makeNode(it->second, st, ns))); | ||
NodePtr node = NodePtr(nodePointer); | ||
if (containsField(m, "element-id")) { | ||
nodePointer->elementId_ = getStringField(e, m, "element-id"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm kind of confused about what element id is. In my head, element-id was a numeric label indicating the position in the array. Is that not the case? Looking at some other impl, it does look like it's an int, but it's possible that may be an internal id that indexes the actual string value.
Do you have any pointers to more context about element-ids?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried googling it too, and couldn't find anything terribly useful. A lot of this stuff seems undocumented. I'm primarily working backwards from a schema generated by pyspark. I wrote a script to extract the schema from valid manifest and manifest list files. I tried removing things to see what's actually required. element-id
is required here by both the Java and Python versions.
Since it's a value in the schema, my guess is that it's meant to identify the array element type, in the same way that field-id
is used for record fields. So if you change the element type of an array without changing the name of an array, the system will be able to detect the schema evolution and parse the elements correctly.
You're right that it appears to be an integer in all of the examples. I made it a string because in the C++ version the field-id
is required to be a string1. It looks like in pyIceberg it has a type annotation as an int
field, but pyspark is able to read files generated by this. It's possible we'll need to revisit this in the future and make it an int, or do a big overhaul of how custom attributes work in the C++ Avro implementation.
Footnotes
-
This is actually an incompatibility with other versions of Avro. The original Iceberg schema that was generated by pySpark had integer field IDs. I changed them to strings and it worked. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify about the custom attribute thing:
An alternative to adding element-id
to the array class would be to add custom attributes for arrays. In the C++ Avro implementation custom attributes are only supported for record types. If we added them for arrays element ID could just be a normal custom attribute instead of a special case.
This doesn't solve the typing problem. In C++ Avro custom attributes are a map<string, string>
. To support integer attributes we would need to make custom attributes more generic. Digging through the commit history, it looks like that's how it used to work, but support for non-string attributes was removed a few years ago.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's a value in the schema, my guess is that it's meant to identify the array element type, in the same way that field-id is used for record fields. So if you change the element type of an array without changing the name of an array, the system will be able to detect the schema evolution and parse the elements correctly.
Ah yeah, these docs pretty much confirm that, under "Field IDs": https://iceberg.apache.org/spec/#avro. BTW have you come across key-id and value-id? Does the C++ library support them well already?
AFAICT these all may as well have been some column-id, not quite sure why they opted to use different names per type.
Any idea how difficult it'd be to tweak the field-id
to be an int? I guess I don't feel strongly enough or know enough about how this may bite us later, but if it's easy, might as well get it out of the way now?
Re: custom attributes, maybe a quick-ish hack would be to revive that feature but with a map<string, int>
that'd be specific to custom numeric attributes? I guess that'd be pretty hard to upstream, but maybe not a huge priority atm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's the correct spec you linked. I think that's referring to the Iceberg schema, which is also written in JSON and looks a lot like an Avro schema.
Custom attributes weren't completely removed. In the original implementation they could be any type. That was quickly revised to be only strings.
- In this commit custom attributes were introduced and the type for the value is json::Entity
- In this commit, 2 weeks later, it was changed to string. It looks like the reasoning was to avoid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's the correct spec you linked. I think that's referring to the Iceberg schema, which is also written in JSON and looks a lot like an Avro schema.
I'm okay with that being the case, but just noting that what I'm referring to is also apparently an Avro schema.
Iceberg struct, list, and map types identify nested types by ID. When writing data to Avro files, these IDs must be stored in the Avro schema to support ID-based column pruning.
I think this is saying that when writing data with the Avro format, the Avro schema needs to have element-id
et al. Are you saying that this PR is required for all data formats (not just Avro) and is referring to the Iceberg schema? Might be helpful to include the sample schema so it's easier to point to exactly what you're referring to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, it looks like you're right:
Iceberg struct, list, and map types identify nested types by ID. When writing data to Avro files, these IDs must be stored in the Avro schema to support ID-based column pruning.
In all of the examples they give, the IDs are integers.
On the other hand, it looks like in the Java Avro implementation, custom attributes have to be strings: https://avro.apache.org/docs/1.6.2/api/csharp/classAvro_1_1Field.html#ac93ac61816cb1f110f6f84eb248f0643
asSingleAttribute(makeNode(it->second, st, ns))); | ||
NodePtr node = NodePtr(nodePointer); | ||
if (containsField(m, "element-id")) { | ||
nodePointer->elementId_ = getStringField(e, m, "element-id"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also ends up being two lookups to the map between containsField()
and getStringField()
. I guess avro
isn't too concerned about these lookups though, since they use this sub-optimal pattern throughout their functions. Could be a nice extension to add a function std::optional<string> maybeGetStringField()
or something of the like in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, getStringField
calls findField
which throws an exception if the field isn't present.
The Iceberg spec requires custom metadata fields for Avro files. It expects information such as the table scheme, the partition information and the manifest type to be in the Manifest File's metadata. This change makes the `setMetadata` method of `DataFileWriterBase` public instead of private, and adds a similar public method to `DataFileWriter`.
Maps can be represented by either the built-in map type, or an array of key-value pairs with the logical type `map`. The Iceberg "column_sizes" field is an example of the latter. This change allows arrays to be annotated with the `map` logical type, which was previously not supported in C++ Avro.
`element-id` is an optional field used by Iceberg arrays and required by some Iceberg implmentations.
e22aaec
to
966d457
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with merging this as is, but it seems like we'll have to revisit custom attributes when we get around to supporting Avro schemas for user data.
Just summarizing/double checking my understanding today, for future reference in case it's useful:
- The primary requirement of Avro with Iceberg is that Iceberg metadata files are mostly written in Avro.
- The Iceberg project has static (but evolving, release to release) schemas for its metadata. These are schemas that we'll codegen as a part of the build so we can generate these metadata objects in C++.
- These static schemas we've been using as reference were generated with other projects/languages' Avro impls.
- The element-id changes aren't strictly necessary to get us started writing Iceberg metadata, but will be when we come around to supporting Avro schemas for data.
- The map changes are necessary because the Iceberg metadata includes a map.
- The addition of custom attributes should subsume the element-id changes, provided callers will have a way to serialize the value strings as integers. It's currently unclear exactly where Iceberg does this, but it's apparent from the Java Iceberg implementation that they are casting a string to an int for these special "field id"-like fields.
Is that all correct, @jcipar ?
@@ -21,7 +21,7 @@ cmake_minimum_required (VERSION 3.1) | |||
set (CMAKE_LEGACY_CYGWIN_WIN32 0) | |||
|
|||
if (NOT DEFINED CMAKE_CXX_STANDARD) | |||
set(CMAKE_CXX_STANDARD 11) | |||
set(CMAKE_CXX_STANDARD 17) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apache/avro seems to have this. Wondering if it makes sense to rebase on the apache master branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apache/avro seems to have this. Wondering if it makes sense to rebase on the apache master branch
we should probably wait until after we cut the release, unless we need stuff from from upstream avro. cc @BenPope maybe he has a sense of what's been going on upstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@@ -21,7 +21,7 @@ cmake_minimum_required (VERSION 3.1) | |||
set (CMAKE_LEGACY_CYGWIN_WIN32 0) | |||
|
|||
if (NOT DEFINED CMAKE_CXX_STANDARD) | |||
set(CMAKE_CXX_STANDARD 11) | |||
set(CMAKE_CXX_STANDARD 17) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but I would expect this to be unnecessary for us since we set C++20. ah but you're probably working on this as an independent library.
Codec codec) : filename_(filename), | ||
Codec codec, const std::map<std::string, std::string> &customMetadata) : filename_(filename), | ||
schema_(schema), | ||
encoderPtr_(binaryEncoder()), | ||
syncInterval_(syncInterval), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! much easier to read
@@ -21,7 +21,7 @@ cmake_minimum_required (VERSION 3.1) | |||
set (CMAKE_LEGACY_CYGWIN_WIN32 0) | |||
|
|||
if (NOT DEFINED CMAKE_CXX_STANDARD) | |||
set(CMAKE_CXX_STANDARD 11) | |||
set(CMAKE_CXX_STANDARD 17) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apache/avro seems to have this. Wondering if it makes sense to rebase on the apache master branch
we should probably wait until after we cut the release, unless we need stuff from from upstream avro. cc @BenPope maybe he has a sense of what's been going on upstream?
@jcipar can you capture any todo items / follow-ups that come from this discussion into a jira ticket? |
That's mostly correct. The only correction I have is this:
Element-id is required by at least some tools that consume Iceberg metadata. E.g. PyIceberg won't read a table that doesn't have element-ids. So I'd say this is required to produce correct Iceberg. |
|
https://redpandadata.atlassian.net/browse/CORE-5087
Adds some missing features to the Avro C++ implementation to support Iceberg development:
map
as a logical type for arrays.element-id
field for arrays.My somewhat limited reading of the Java Iceberg implementation and the Java Avro implementation suggest that Java Iceberg is relying on custom properties (called custom attributes or custom fields in C++) for the "element-id". In Java Avro, custom properties can be any type, not just strings. In the future we may want to change the way custom properties/attributes work in C++ Avro, rather than having element-id be a special case for Iceberg.
More info about open questions related to this PR: https://docs.google.com/document/d/13Nh-01cTWU0cRoo-SqncqviG3oQBX8lp-Vt6oLWcqwo/edit