I have a large table stored on MinIO (an S3-compatible object store).
The table structure is fairly complex: it has a few nested fields, with a maximum nesting depth of around 5.
I use the Hive connector, and the table is stored in Parquet format.
Presto version 336, with 3 workers and 1 coordinator running in Docker Swarm.
The queries are:
SELECT string_field1 FROM table1
WHERE year='2020' AND month='6'
AND struct_field1.struct_field2.array_field1[1].struct_field3.string_field2 LIKE '%GITHUB%'
This query runs at around 50K rows/sec, with a data flow of around 200 MB/sec.
SELECT string_field1 FROM table1
WHERE year='2020' AND month='6'
AND element_at(struct_field1.struct_field2.array_field1, 1).struct_field3.string_field2 LIKE '%GITHUB%'
This query runs at around 100K rows/sec, with a data flow of around 200 MB/sec.
SELECT string_field1 FROM table1
WHERE year='2020' AND month='6'
AND any_match(struct_field1.struct_field2.array_field1, e -> e.struct_field3.string_field2 LIKE '%GITHUB%')
This query runs at around 1M rows/sec, with a data flow of around 200 MB/sec.
As expected, the first and second queries return the same results, and those results are a subset of the results of the third query.
table1 is partitioned by year, month, day, and hour, and the rows are spread almost evenly across the partitions.
The total number of rows in month='6' is around 60 million.
Judging by the data flow and the row scan rates, the first query reads more data than the second, and the second reads more data than the third.
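To make that comparison concrete, here is a rough back-of-the-envelope sketch in Python. It is only an estimate under assumptions that are not stated in the post: that the reported rates are sustained averages over the whole run, that each query scans all ~60 million rows of month='6', and that 1 GB = 1000 MB. The `estimate` helper and the labels are made up for illustration.

```python
# Back-of-the-envelope estimate; every input is an approximate figure from the post.
# Assumptions (not confirmed): rates are sustained averages, each query scans
# all rows of month='6', and decimal units are used (1 GB = 1000 MB).
TOTAL_ROWS = 60_000_000   # rows in month='6'
THROUGHPUT_MB_S = 200     # reported data flow, the same for all three queries

# Reported row rates (rows/sec) for each form of the predicate.
ROW_RATES = {
    "subscript [1]": 50_000,
    "element_at": 100_000,
    "any_match": 1_000_000,
}


def estimate(total_rows, rows_per_sec, mb_per_sec):
    """Return (runtime in seconds, total GB read, bytes read per row)."""
    seconds = total_rows / rows_per_sec
    total_gb = seconds * mb_per_sec / 1000
    bytes_per_row = mb_per_sec * 1_000_000 / rows_per_sec
    return seconds, total_gb, bytes_per_row


for name, rate in ROW_RATES.items():
    seconds, total_gb, bytes_per_row = estimate(TOTAL_ROWS, rate, THROUGHPUT_MB_S)
    print(f"{name}: {seconds:.0f} s, {total_gb:.0f} GB, {bytes_per_row:.0f} bytes/row")
```

Under those assumptions the three queries would read roughly 240 GB, 120 GB, and 12 GB respectively, which is a 20x difference between the first and the third for the same partition filter.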
This result goes against my intuition.
Can someone explain whether this is a bug or a lack of implementation?
(I reviewed the release notes for versions 337, 338, and 339, and none of the fixes listed there appears to be related to this issue.)
This discussion was converted from issue #4653 on September 03, 2024 19:41.