-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use duckdb functions without having to specify the full type signature #531
base: main
Are you sure you want to change the base?
Conversation
59d8a57
to
1992351
Compare
392c28a
to
afb8c6d
Compare
afb8c6d
to
eb09c25
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, very nice piece of work! Well tested from what I can tell, and obviously works as advertised :-)
Only minor comments/suggestions.
src/pgduckdb_hooks.cpp
Outdated
if (edata->elevel == ERROR && edata->sqlerrcode == ERRCODE_UNDEFINED_COLUMN && pgduckdb::IsExtensionRegistered()) { | ||
/* | ||
* XXX: It would be nice if we could check if the query contained any | ||
* of the functions, but we don't have access to the query string here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised we don't have access to it - debug_query_string
is not reliable? Doesn't the PG log the statement when an error occurs? We should have the query then no? (not blocking of course, I'm happy to dig into it later)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, yeah that should probably work indeed. Doesn't seem the most important thing though, so I'll leave as is for now.
} | ||
|
||
DECLARE_PG_FUNCTION(duckdb_unresolved_type_operator) { | ||
elog(ERROR, "Unresolved duckdb types cannot be used by the Postgres executor"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a way to identify which operator caused it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not easily I think, and it also doesn't really matter. This error should never really be shown in practice. It means we don't detect correctly that the query requires duckdb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not specifically related to this PR (nor this file, but all the pg_ruleutils_*
: I wonder if we could reduce code duplication for all these versions? It feels a bit clunky (and error-prone) to have to repeat the same code for each PG version we support :-)
Maybe we should have an open issue an gather options there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that it would be very nice not to have this duplication. That's why I tried to put as much code as possible in pgduckdb_ruleutils.cpp
, because that is shared across versions.
I think there's only really one other option: Have a shared pg_ruleutils.c
file with a bunch of #if PG_VERSION_NUM > 150000
branches to handle differences in the AST structure between PG versions. I'm not convinced that's any less error prone though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could send Copilot on it and see how bad that version would look like (I don't think we should spend time right now on that in any case!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'd be curious. I expect it's a lot, but maybe it's not as bad as I think.
@@ -2,26 +2,26 @@ | |||
-- duckdb.force_execution is turned off | |||
SET duckdb.force_execution = false; | |||
\set pwd `pwd` | |||
select * from read_parquet(:'pwd' || '/data/unsigned_types.parquet') as (usmallint int); | |||
select r['usmallint'] from read_parquet(:'pwd' || '/data/unsigned_types.parquet') r; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we keep a few tests with the old format? (or maybe you did and they don't show up in diff?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old syntax does not work anymore: AS (...)
is only allowed for functions that return record
and they now return duckdb.row
.
The only way (I can think of) to support the old syntax would be to introduce additional functions, like read_parquet_record
& read_csv_record
. I don't think that's worth the hassle for now. But we can reconsider if we get valid user feedback that the new syntax is worse in some cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's very fair. My only comment then would be to try our best to update the doc (and various blog posts we know of) so we don't get a ton of bug reports on this :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point. I indeed realized I didn't update the docs yet. I'll spend some time on that, but I think that can wait for a follow up PR (as long as we do it before the release).
One thing that I did change now is add a hint to the error message when people use the old syntax.
This reverts commit a799f7e.
eb09c25
to
c890f35
Compare
Currently when you use
read_parquet
/read_csv
/etc they require specifying the full type signature of the function call. This is quite cumbersome, especially when usingSELECT * FROM read_parquet(...) AS (...)
. UsingSELECT *
its main advantage is that you don't have to specify all the columns.This changes that by instead allowing the following new syntax:
This syntax is made possible by using two clever hacks to solve two different problems:
The first problem is that Postgres its parser needs to consider the query valid, which means it has to resolve all the types. Sadly, we cannot hook into the type resolution as an extension. Instead the types need to be resolved fully from information in the Postgres catalogs. To do so we create a new type for our extension:
duckdb.unresolved_type
. This type only exists for the Postgres parser and should never be used explicitly. We add catalog entries that allow this type be casted explicitly to any type that's supported by pg_duckdb. Using explicit casts always is quite annoying though. So we also define various operators and functions for this type, so that things liker['age'] > 21
are allowed.The second problem is that the Postgres parser replaces the
*
inSELECT *
with an expanded column list. So the query above will look as follows once we receive it (*
replaced withr
):To resolve this problem we let
read_parquet
return aduckdb.row
, which we then replace with a*
again before in ourpgduckdb_get_querydef
function. This might sound pretty simple but there's tricks that need to be done for subqueries that return both aduckdb.row
type and some other columns.TODO:
src/pgduckdb_ruleutils.cpp
which is shared across versions. Afaik there's no technical problems to implement this feature for all PG versions that we support. The reason this is not done yet is that I'd like to wait with updating all the PG version specific ruleutils files until after the first round of review.