Is it possible to query multiple Parquet files at once? (running one SQL query on many files in a folder) #6728
collimarco
started this conversation in
General
Replies: 1 comment 4 replies
-
Yes, you can do this. Here is an example via # /data/99 has a bunch of parquet files with "compatible" schema:
$ ls /data/99 | head
03f0ada5-22ea-4121-99ac-77a61c74479c.parquet
041e28e6-6373-4e8b-873d-c5d6f612edc4.parquet
050cc247-686b-4167-8bdb-f3e42f1ba088.parquet
08243ac7-db62-4d19-83ac-b829d36568b6.parquet
0b309152-36ca-4d90-bdb1-edcf628dafb9.parquet
0bf87579-d9e3-457f-8048-0162394ba8b3.parquet
0ce16343-6850-4f03-800d-75f9810af87b.parquet
10ba4e59-0651-42ca-9d44-ee7939e1f36a.parquet
1125d145-e363-4dd0-8561-e10027cbdb76.parquet
117447dd-4af0-4866-b443-c3ce64d3dfdb.parquet You can query them via datafusion-cli like this: ❯ select * from '/data/99' limit 10;
+------------+---------------------+----+-----+---------------------+-------------+-------------+-----------------+
| free | host | in | out | time | total | used | used_percent |
+------------+---------------------+----+-----+---------------------+-------------+-------------+-----------------+
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:05:20 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:05:30 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:05:40 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:05:50 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:06:00 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:06:10 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:06:20 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:06:30 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:06:40 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:06:50 | 12884901888 | 11499995136 | 89.251708984375 |
+------------+---------------------+----+-----+---------------------+-------------+-------------+-----------------+
10 rows in set. Query took 0.035 seconds. You can also use the explicit ❯ create external table t stored as parquet location '/Users/alamb/.influxdb_iox/object_store/1/6/99';
0 rows in set. Query took 0.009 seconds.
❯ select * from t limit 10;
+-----------+---------------------+----+-----+---------------------+-------------+-------------+-------------------+
| free | host | in | out | time | total | used | used_percent |
+-----------+---------------------+----+-----+---------------------+-------------+-------------+-------------------+
| 624427008 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:45:20 | 10737418240 | 10112991232 | 94.1845703125 |
| 827850752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:45:30 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:45:40 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:45:50 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:46:00 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:46:10 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:46:20 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:46:30 | 11811160064 | 10983309312 | 92.99094460227273 |
| 861405184 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:46:40 | 11811160064 | 10949754880 | 92.70685369318183 |
| 861405184 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:46:50 | 11811160064 | 10949754880 | 92.70685369318183 |
+-----------+---------------------+----+-----+---------------------+-------------+-------------+-------------------+ You can also do this explicitly using |
Beta Was this translation helpful? Give feedback.
4 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
I am getting started with Arrow Datafusion and looking at the examples:
https://arrow.apache.org/datafusion/user-guide/example-usage.html
I don't see any way to execute a SQL query on multiple files at the same time.
Is that possible?
Let's say that you have thousands of Parquet files already stored in a folder.
The schema is similar, but it is not identical for all the files. For example:
Is it possible to use Datafusion to query all the files in a directory?
Or it possible to give Datafusion a long list of files to query dynamically?
Ideally each query uses a different set of files (they are grouped in partitions), so it would be better to be able to execute the queries directly on a list of files, without having to perform too many intermediate steps.
Is this possible with Datafusion?
Beta Was this translation helpful? Give feedback.
All reactions