Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use duckdb functions without having to specify the full type signature #531

Open
wants to merge 40 commits into
base: main
Choose a base branch
from

Conversation

JelteF
Copy link
Collaborator

@JelteF JelteF commented Jan 14, 2025

Currently when you use read_parquet/read_csv/etc they require specifying the full type signature of the function call. This is quite cumbersome, especially when using SELECT * FROM read_parquet(...) AS (...). Using SELECT * its main advantage is that you don't have to specify all the columns.

This changes that by instead allowing the following new syntax:

SELECT * FROM read_parquet('people.parquet') r WHERE r['age'] > 21;

This syntax is made possible by using two clever hacks to solve two different problems:

The first problem is that Postgres its parser needs to consider the query valid, which means it has to resolve all the types. Sadly, we cannot hook into the type resolution as an extension. Instead the types need to be resolved fully from information in the Postgres catalogs. To do so we create a new type for our extension: duckdb.unresolved_type. This type only exists for the Postgres parser and should never be used explicitly. We add catalog entries that allow this type be casted explicitly to any type that's supported by pg_duckdb. Using explicit casts always is quite annoying though. So we also define various operators and functions for this type, so that things like r['age'] > 21 are allowed.

The second problem is that the Postgres parser replaces the * in SELECT * with an expanded column list. So the query above will look as follows once we receive it (* replaced with r):

SELECT r FROM read_parquet('people.parquet') r WHERE r['age'] > 21;

To resolve this problem we let read_parquet return a duckdb.row, which we then replace with a * again before in our pgduckdb_get_querydef function. This might sound pretty simple but there's tricks that need to be done for subqueries that return both a duckdb.row type and some other columns.

TODO:

  • More comments about edge cases (see reminders for myself)
  • Add some more tests with more complex queries
  • Support all supported PG versions (not just PG17), hopefully by moving some of the code to the src/pgduckdb_ruleutils.cpp which is shared across versions. Afaik there's no technical problems to implement this feature for all PG versions that we support. The reason this is not done yet is that I'd like to wait with updating all the PG version specific ruleutils files until after the first round of review.
  • Support for subscripts, aka array/json indexing
  • BUG: Fix CREATE TABLE AS (broken by a recent commit)

@JelteF JelteF force-pushed the redesigned-read-parquet branch 6 times, most recently from 59d8a57 to 1992351 Compare January 14, 2025 15:45
src/pgduckdb_options.cpp Outdated Show resolved Hide resolved
src/pgduckdb_options.cpp Outdated Show resolved Hide resolved
src/vendor/pg_ruleutils_17.c Outdated Show resolved Hide resolved
src/vendor/pg_ruleutils_17.c Outdated Show resolved Hide resolved
src/vendor/pg_ruleutils_17.c Outdated Show resolved Hide resolved
src/pgduckdb_options.cpp Outdated Show resolved Hide resolved
@JelteF JelteF force-pushed the redesigned-read-parquet branch 2 times, most recently from 392c28a to afb8c6d Compare January 21, 2025 19:47
@JelteF JelteF marked this pull request as ready for review January 21, 2025 19:47
@JelteF JelteF requested review from mkaruza and Y-- January 21, 2025 19:47
@JelteF JelteF force-pushed the redesigned-read-parquet branch from afb8c6d to eb09c25 Compare January 21, 2025 19:51
Copy link
Collaborator

@Y-- Y-- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, very nice piece of work! Well tested from what I can tell, and obviously works as advertised :-)
Only minor comments/suggestions.

src/pgduckdb_ddl.cpp Outdated Show resolved Hide resolved
src/pgduckdb_ddl.cpp Outdated Show resolved Hide resolved
if (edata->elevel == ERROR && edata->sqlerrcode == ERRCODE_UNDEFINED_COLUMN && pgduckdb::IsExtensionRegistered()) {
/*
* XXX: It would be nice if we could check if the query contained any
* of the functions, but we don't have access to the query string here.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised we don't have access to it - debug_query_string is not reliable? Doesn't the PG log the statement when an error occurs? We should have the query then no? (not blocking of course, I'm happy to dig into it later)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, yeah that should probably work indeed. Doesn't seem the most important thing though, so I'll leave as is for now.

src/pgduckdb_options.cpp Outdated Show resolved Hide resolved
src/pgduckdb_options.cpp Outdated Show resolved Hide resolved
src/pgduckdb_options.cpp Outdated Show resolved Hide resolved
}

DECLARE_PG_FUNCTION(duckdb_unresolved_type_operator) {
elog(ERROR, "Unresolved duckdb types cannot be used by the Postgres executor");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a way to identify which operator caused it here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not easily I think, and it also doesn't really matter. This error should never really be shown in practice. It means we don't detect correctly that the query requires duckdb.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not specifically related to this PR (nor this file, but all the pg_ruleutils_*: I wonder if we could reduce code duplication for all these versions? It feels a bit clunky (and error-prone) to have to repeat the same code for each PG version we support :-)
Maybe we should have an open issue an gather options there?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that it would be very nice not to have this duplication. That's why I tried to put as much code as possible in pgduckdb_ruleutils.cpp, because that is shared across versions.

I think there's only really one other option: Have a shared pg_ruleutils.c file with a bunch of #if PG_VERSION_NUM > 150000 branches to handle differences in the AST structure between PG versions. I'm not convinced that's any less error prone though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could send Copilot on it and see how bad that version would look like (I don't think we should spend time right now on that in any case!)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'd be curious. I expect it's a lot, but maybe it's not as bad as I think.

@@ -2,26 +2,26 @@
-- duckdb.force_execution is turned off
SET duckdb.force_execution = false;
\set pwd `pwd`
select * from read_parquet(:'pwd' || '/data/unsigned_types.parquet') as (usmallint int);
select r['usmallint'] from read_parquet(:'pwd' || '/data/unsigned_types.parquet') r;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep a few tests with the old format? (or maybe you did and they don't show up in diff?)

Copy link
Collaborator Author

@JelteF JelteF Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old syntax does not work anymore: AS (...) is only allowed for functions that return record and they now return duckdb.row.

The only way (I can think of) to support the old syntax would be to introduce additional functions, like read_parquet_record & read_csv_record. I don't think that's worth the hassle for now. But we can reconsider if we get valid user feedback that the new syntax is worse in some cases.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's very fair. My only comment then would be to try our best to update the doc (and various blog posts we know of) so we don't get a ton of bug reports on this :-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point. I indeed realized I didn't update the docs yet. I'll spend some time on that, but I think that can wait for a follow up PR (as long as we do it before the release).

One thing that I did change now is add a hint to the error message when people use the old syntax.

test/regression/expected/temporary_tables.out Show resolved Hide resolved
@JelteF JelteF force-pushed the redesigned-read-parquet branch from eb09c25 to c890f35 Compare January 23, 2025 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants