Use duckdb functions without having to specify the full type signature #531

JelteF · 2025-01-14T10:36:21Z

Currently when you use read_parquet/read_csv/etc they require specifying the full type signature of the function call. This is quite cumbersome, especially when using SELECT * FROM read_parquet(...) AS (...). Using SELECT * its main advantage is that you don't have to specify all the columns.

This changes that by instead allowing the following new syntax:

SELECT * FROM read_parquet('people.parquet') r WHERE r['age'] > 21;

This syntax is made possible by using two clever hacks to solve two different problems:

The first problem is that Postgres its parser needs to consider the query valid, which means it has to resolve all the types. Sadly, we cannot hook into the type resolution as an extension. Instead the types need to be resolved fully from information in the Postgres catalogs. To do so we create a new type for our extension: duckdb.unresolved_type. This type only exists for the Postgres parser and should never be used explicitly. We add catalog entries that allow this type be casted explicitly to any type that's supported by pg_duckdb. Using explicit casts always is quite annoying though. So we also define various operators and functions for this type, so that things like r['age'] > 21 are allowed.

The second problem is that the Postgres parser replaces the * in SELECT * with an expanded column list. So the query above will look as follows once we receive it (* replaced with r):

SELECT r FROM read_parquet('people.parquet') r WHERE r['age'] > 21;

To resolve this problem we let read_parquet return a duckdb.row, which we then replace with a * again before in our pgduckdb_get_querydef function. This might sound pretty simple but there's tricks that need to be done for subqueries that return both a duckdb.row type and some other columns.

TODO:

More comments about edge cases (see reminders for myself)
Add some more tests with more complex queries
Support all supported PG versions (not just PG17), hopefully by moving some of the code to the src/pgduckdb_ruleutils.cpp which is shared across versions. Afaik there's no technical problems to implement this feature for all PG versions that we support. The reason this is not done yet is that I'd like to wait with updating all the PG version specific ruleutils files until after the first round of review.
Support for subscripts, aka array/json indexing
BUG: Fix CREATE TABLE AS (broken by a recent commit)

src/pgduckdb_options.cpp

src/vendor/pg_ruleutils_17.c

src/pgduckdb_options.cpp

Y--

LGTM, very nice piece of work! Well tested from what I can tell, and obviously works as advertised :-)
Only minor comments/suggestions.

src/pgduckdb_ddl.cpp

Y-- · 2025-01-22T08:52:04Z

src/pgduckdb_hooks.cpp

+	if (edata->elevel == ERROR && edata->sqlerrcode == ERRCODE_UNDEFINED_COLUMN && pgduckdb::IsExtensionRegistered()) {
+		/*
+		 * XXX: It would be nice if we could check if the query contained any
+		 * of the functions, but we don't have access to the query string here.


I'm surprised we don't have access to it - debug_query_string is not reliable? Doesn't the PG log the statement when an error occurs? We should have the query then no? (not blocking of course, I'm happy to dig into it later)

hmm, yeah that should probably work indeed. Doesn't seem the most important thing though, so I'll leave as is for now.

src/pgduckdb_options.cpp

Y-- · 2025-01-22T09:47:17Z

src/pgduckdb_options.cpp

+}
+
+DECLARE_PG_FUNCTION(duckdb_unresolved_type_operator) {
+	elog(ERROR, "Unresolved duckdb types cannot be used by the Postgres executor");


Do we have a way to identify which operator caused it here?

Not easily I think, and it also doesn't really matter. This error should never really be shown in practice. It means we don't detect correctly that the query requires duckdb.

Y-- · 2025-01-22T09:58:22Z

src/vendor/pg_ruleutils_17.c

Not specifically related to this PR (nor this file, but all the pg_ruleutils_*: I wonder if we could reduce code duplication for all these versions? It feels a bit clunky (and error-prone) to have to repeat the same code for each PG version we support :-)
Maybe we should have an open issue an gather options there?

Agreed that it would be very nice not to have this duplication. That's why I tried to put as much code as possible in pgduckdb_ruleutils.cpp, because that is shared across versions.

I think there's only really one other option: Have a shared pg_ruleutils.c file with a bunch of #if PG_VERSION_NUM > 150000 branches to handle differences in the AST structure between PG versions. I'm not convinced that's any less error prone though.

I wonder if we could send Copilot on it and see how bad that version would look like (I don't think we should spend time right now on that in any case!)

Yeah I'd be curious. I expect it's a lot, but maybe it's not as bad as I think.

Y-- · 2025-01-22T09:59:18Z

test/regression/expected/duckdb_only_functions.out

@@ -2,26 +2,26 @@
 -- duckdb.force_execution is turned off
 SET duckdb.force_execution = false;
 \set pwd `pwd`
-select * from read_parquet(:'pwd' || '/data/unsigned_types.parquet') as (usmallint int);
+select r['usmallint'] from read_parquet(:'pwd' || '/data/unsigned_types.parquet') r;


Should we keep a few tests with the old format? (or maybe you did and they don't show up in diff?)

The old syntax does not work anymore: AS (...) is only allowed for functions that return record and they now return duckdb.row.

The only way (I can think of) to support the old syntax would be to introduce additional functions, like read_parquet_record & read_csv_record. I don't think that's worth the hassle for now. But we can reconsider if we get valid user feedback that the new syntax is worse in some cases.

I think that's very fair. My only comment then would be to try our best to update the doc (and various blog posts we know of) so we don't get a ton of bug reports on this :-)

Yeah, good point. I indeed realized I didn't update the docs yet. I'll spend some time on that, but I think that can wait for a follow up PR (as long as we do it before the release).

One thing that I did change now is add a hint to the error message when people use the old syntax.

test/regression/expected/temporary_tables.out

This reverts commit a799f7e.

…db_ruleutils.cpp

JelteF force-pushed the redesigned-read-parquet branch 6 times, most recently from 59d8a57 to 1992351 Compare January 14, 2025 15:45