Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add missing iceberg_* functions with parameters #275

Merged
merged 3 commits into from
Oct 14, 2024
Merged

Conversation

mkaruza
Copy link
Collaborator

@mkaruza mkaruza commented Oct 12, 2024

This also adds support for conversion to UBIGINT because one of the Iceberg functions return that type.

@JelteF JelteF added the enhancement New feature or request label Oct 12, 2024
RETURNS SETOF record LANGUAGE 'plpgsql' AS
$func$
BEGIN
RAISE EXCEPTION 'Function `iceberg_scan(TEXT)` only works with Duckdb execution.';
END;
$func$;

CREATE OR REPLACE FUNCTION iceberg_metadata(path text, allow_moved_paths BOOLEAN DEFAULT FALSE,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new functions should be added to the function_names array in BuildDuckdbOnlyFunctions.

skip_schema_inference BOOLEAN DEFAULT FALSE,
version TEXT DEFAULT 'version-hint.text',
version_name_format TEXT DEFAULT 'v%s%s.metadata.json,%s%s.metadata.json')
RETURNS SETOF record LANGUAGE 'plpgsql' AS
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two new functions have fixed result columns, so no need to use record here. We can return actual rows with known column names and types. Then people don't have to manually specify the columns using AS (...)

Comment on lines +100 to +108
CREATE OR REPLACE FUNCTION iceberg_scan(path text, allow_moved_paths BOOLEAN DEFAULT FALSE,
mode TEXT DEFAULT '',
metadata_compression_codec TEXT DEFAULT 'none',
skip_schema_inference BOOLEAN DEFAULT FALSE,
version TEXT DEFAULT 'version-hint.text',
version_name_format TEXT DEFAULT 'v%s%s.metadata.json,%s%s.metadata.json')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where did you find all these arguments? I only see allow_moved_paths in the docs here: https://duckdb.org/docs/extensions/iceberg.html#access-iceberg-metadata

It would be good to add a code comment with a link to the source for these arguments, so we have a reference in case we need to update them in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have checked source code from iceberg extension

@mkaruza mkaruza force-pushed the iceberg-functions branch 2 times, most recently from f916871 to e5d8e6e Compare October 13, 2024 14:13
@@ -571,6 +571,7 @@ GetPostgresDuckDBType(duckdb::LogicalType type) {
case duckdb::LogicalTypeId::INTEGER:
return INT4OID;
case duckdb::LogicalTypeId::BIGINT:
case duckdb::LogicalTypeId::UBIGINT:
Copy link
Collaborator

@JelteF JelteF Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct. All other unsigned ints use a type that is one size larger than their own type, which makes sense. The largest half of an unsigned int does not fit into a signed int of the same size.

We might want to start depending on pguint, to provide actual unsigned integer types: https://github.com/petere/pguint

This change definitely seems like a better fit for another PR though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ubigint is required as manifest_sequence_number field returns that from duckdb execution - so we don't match it and execution fails. i can add comment for this in code.

Copy link
Collaborator

@JelteF JelteF Oct 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's pretty annoying. Now I understand why you added it to this PR.

The main problem I have is that right now we fail hard for ubigint, which seems nicer than to silently convert too-high ubigints to negative numbers.

How about we just don't return that field for now? i.e. make the type:

CREATE TYPE iceberg_metadata_record AS (
  manifest_path TEXT,
-- manifest_sequence_number is not included, because it is a UBIGINT, which we don't support yet.
  manifest_content  TEXT,
  status TEXT,
  content TEXT,
  file_path TEXT
);

And then in another PR we can support UBIGINT, probably by converting it into a NUMERIC instead of a BIGINT.

Copy link
Collaborator

@JelteF JelteF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the UBIGINT change should be removed from this PR, but other than that this now LGTM.

@mkaruza mkaruza merged commit 78cf9f7 into main Oct 14, 2024
4 checks passed
@mkaruza mkaruza deleted the iceberg-functions branch October 14, 2024 11:06
@JelteF JelteF mentioned this pull request Oct 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants