Is it possible to query multiple Parquet files at once? (running one SQL query on many files in a folder) #6728

collimarco · 2023-06-20T11:20:56Z

I am getting started with Arrow DataFusion and looking at the examples:
https://arrow.apache.org/datafusion/user-guide/example-usage.html

I don't see any way to execute a SQL query on multiple files at the same time.

Is that possible?

Let's say that you have thousands of Parquet files already stored in a folder.

The schema is similar, but it is not identical for all the files. For example:

- some files may have additional columns or fewer columns
- rarely, a column may have a different type (e.g. a status column may be an integer in some files but a string in others)

Is it possible to use DataFusion to query all the files in a directory?

Or is it possible to give DataFusion a long list of files to query dynamically?

Ideally each query uses a different set of files (they are grouped in partitions), so it would be better to be able to execute the queries directly on a list of files, without having to perform too many intermediate steps.

Is this possible with DataFusion?

alamb · 2023-06-23T17:20:00Z

Collaborator

Yes, you can do this.

Here is an example via datafusion-cli:

# /data/99 has a bunch of parquet files with "compatible" schema:
$ ls /data/99 | head
03f0ada5-22ea-4121-99ac-77a61c74479c.parquet
041e28e6-6373-4e8b-873d-c5d6f612edc4.parquet
050cc247-686b-4167-8bdb-f3e42f1ba088.parquet
08243ac7-db62-4d19-83ac-b829d36568b6.parquet
0b309152-36ca-4d90-bdb1-edcf628dafb9.parquet
0bf87579-d9e3-457f-8048-0162394ba8b3.parquet
0ce16343-6850-4f03-800d-75f9810af87b.parquet
10ba4e59-0651-42ca-9d44-ee7939e1f36a.parquet
1125d145-e363-4dd0-8561-e10027cbdb76.parquet
117447dd-4af0-4866-b443-c3ce64d3dfdb.parquet

You can query them via datafusion-cli like this:

❯ select * from '/data/99' limit 10;
+------------+---------------------+----+-----+---------------------+-------------+-------------+-----------------+
| free       | host                | in | out | time                | total       | used        | used_percent    |
+------------+---------------------+----+-----+---------------------+-------------+-------------+-----------------+
| 1384906752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T22:05:20 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T22:05:30 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T22:05:40 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T22:05:50 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T22:06:00 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T22:06:10 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T22:06:20 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T22:06:30 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T22:06:40 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T22:06:50 | 12884901888 | 11499995136 | 89.251708984375 |
+------------+---------------------+----+-----+---------------------+-------------+-------------+-----------------+
10 rows in set. Query took 0.035 seconds.

You can also use the explicit CREATE EXTERNAL TABLE syntax:

❯ create external table t stored as parquet location '/Users/alamb/.influxdb_iox/object_store/1/6/99';
0 rows in set. Query took 0.009 seconds.
❯ select * from t limit 10;
+-----------+---------------------+----+-----+---------------------+-------------+-------------+-------------------+
| free      | host                | in | out | time                | total       | used        | used_percent      |
+-----------+---------------------+----+-----+---------------------+-------------+-------------+-------------------+
| 624427008 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T16:45:20 | 10737418240 | 10112991232 | 94.1845703125     |
| 827850752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T16:45:30 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T16:45:40 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T16:45:50 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T16:46:00 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T16:46:10 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T16:46:20 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T16:46:30 | 11811160064 | 10983309312 | 92.99094460227273 |
| 861405184 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T16:46:40 | 11811160064 | 10949754880 | 92.70685369318183 |
| 861405184 | MacBook-Pro-8.local | 0  | 0   | 2023-06-09T16:46:50 | 11811160064 | 10949754880 | 92.70685369318183 |
+-----------+---------------------+----+-----+---------------------+-------------+-------------+-------------------+

You can also do this explicitly using ListingTable

4 replies

collimarco Jun 23, 2023
Author

Thanks! Do all the files need to have the exact same schema? I ask because in the docs I read:

> You can also query directories of files with compatible schemas

Is there an official definition of "compatible"?

For example, what if some files have more or fewer columns, or a column has a different type (e.g. a status that was an int and then is a string in newer files)?

collimarco Jun 23, 2023
Author

When I add a Parquet file with a slightly different schema to the directory, I get this error:

Error during planning: table 'datafusion.public.parquet' not found

Is it possible to have more information about what exactly prevents the query execution? (i.e. what differences in the schema are compatible and what are not)

collimarco Jun 23, 2023
Author

I did some testing and it seems that adding/removing columns is not an issue.

However, having a column with the same name and a different type (e.g. integer in some files and string in other files) causes the error reported above. Is there any way to circumvent that error (like automatic casting to string in case of different types)?

alamb Jun 24, 2023
Collaborator

> I did some testing and it seems that adding/removing columns is not an issue.

Yes, that is correct.

> Is it possible to have more information about what exactly prevents the query execution? (i.e. what differences in the schema are compatible and what are not)

I think you have identified the issue -- which is that the same name is used with a different type.
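To make the rule observed in this thread concrete, here is a minimal Python sketch. It is not DataFusion's code: schemas are modeled as plain dicts of column name to type name, and `merge_schemas` and `SchemaConflict` are hypothetical names chosen for illustration. It only mirrors the behavior reported above: extra or missing columns merge fine, while a name reused with a different type is rejected.

```python
class SchemaConflict(Exception):
    """Raised when a column appears with two different types."""


def merge_schemas(schemas):
    """Union the columns of several per-file schemas.

    Each schema is a dict mapping column name -> type name (a simplification).
    Missing columns are tolerated; a name reused with a different type is not.
    """
    merged = {}
    for schema in schemas:
        for name, dtype in schema.items():
            # Record the first type seen for a column, then compare later ones.
            seen = merged.setdefault(name, dtype)
            if seen != dtype:
                raise SchemaConflict(f"column {name!r}: {seen} vs {dtype}")
    return merged


# Files that differ only by extra or missing columns merge cleanly.
files = [
    {"host": "string", "used": "int64"},
    {"host": "string", "used": "int64", "free": "int64"},
]
merged = merge_schemas(files)

# A column whose type changes between files is reported as a conflict.
try:
    merge_schemas(files + [{"host": "string", "status": "int64"}, {"status": "string"}])
    conflict = None
except SchemaConflict as exc:
    conflict = str(exc)
```

Running a check like this over the per-file schemas before querying would at least point at the offending column, which the "table not found" error above does not.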

> However, having a column with the same name and a different type (e.g. integer in some files and string in other files) causes the error reported above. Is there any way to circumvent that error (like automatic casting to string in case of different types)?

I am not sure -- the underlying code that is doing it is https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html

You may be able to explicitly specify the desired schema using file_schema on https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTableConfig.html
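As a rough illustration of how such an explicit schema could be chosen, the Python sketch below widens any column whose type differs between files to a string, which is the fallback asked about in the question. This is an assumption about one reasonable policy, not something DataFusion is confirmed to do: schemas are modeled as plain dicts, `unify_with_string_fallback` is a hypothetical helper, and whether the reader will actually cast file data to the resulting schema is not established in this thread.

```python
def unify_with_string_fallback(schemas):
    """Build one schema from many, widening conflicting columns to "string".

    Schemas are modeled as dicts of column name -> type name (a simplification).
    """
    unified = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if name not in unified:
                # First time this column is seen: keep its type.
                unified[name] = dtype
            elif unified[name] != dtype:
                # Same name, different type: fall back to string.
                unified[name] = "string"
    return unified


# Two files: "status" changes type, and "host" exists only in the second.
per_file = [
    {"time": "timestamp", "status": "int64"},
    {"time": "timestamp", "status": "string", "host": "string"},
]
explicit_schema = unify_with_string_fallback(per_file)
```

The resulting mapping is only a candidate for what to pass as the table schema; it would still need to be translated into an Arrow schema and tested against the real files.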
