Array index operator less performant than any_match and element_at on Parquet files #23255

ofekby · 2020-07-31T09:44:19Z

I have a large table stored on Minio (an S3-compatible product).
The table structure is somewhat complex and has a few nested fields, with a maximum depth of around 5.
I use the Hive connector, and the table is stored in Parquet format.
Presto version 336, with 3 workers and 1 coordinator running in Docker Swarm.

The queries are:

SELECT string_field1 FROM table1
WHERE year='2020' AND month='6'
AND struct_field1.struct_field2.array_field1[1].struct_field3.string_field2 LIKE '%GITHUB%'

This query runs at around 50K rows/sec, with a data flow of around 200 MB/sec.

SELECT string_field1 FROM table1
WHERE year='2020' AND month='6'
AND element_at(struct_field1.struct_field2.array_field1, 1).struct_field3.string_field2 LIKE '%GITHUB%'

This query runs at around 100K rows/sec, with a data flow of around 200 MB/sec.

SELECT string_field1 FROM table1
WHERE year='2020' AND month='6'
AND any_match(struct_field1.struct_field2.array_field1, e -> e.struct_field3.string_field2 LIKE '%GITHUB%')

This query runs at around 1M rows/sec, with a data flow of around 200 MB/sec.

As expected, the first and second queries return the same results, and those results are a subset of the results of the third query.
table1 is partitioned by year, month, day, and hour, almost evenly.
The total number of rows in month='6' is around 60 million.
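
To make the subset relationship concrete, here is a minimal Python sketch of the three predicates. The sample rows are hypothetical, and the helper functions are simplified stand-ins (positive indices only), not Presto's implementation. The first and second queries are assumed to behave identically whenever the array has at least one element, so a single first-element check models both.

```python
# Hypothetical data: each row holds the string_field2 values found in array_field1.
rows = [
    ["GITHUB issue", "other"],  # first element matches
    ["other", "see GITHUB"],    # only a later element matches
    ["other"],                  # no element matches
    [],                         # empty array
]

def element_at(arr, index):
    """1-based lookup that yields None when the index is past the end of the array."""
    return arr[index - 1] if 1 <= index <= len(arr) else None

def like_github(value):
    """Stand-in for LIKE '%GITHUB%'; a missing (None) value never matches."""
    return value is not None and "GITHUB" in value

# Queries 1 and 2: test only the first array element.
first_element_hits = {i for i, arr in enumerate(rows) if like_github(element_at(arr, 1))}

# Query 3: test every array element.
any_match_hits = {i for i, arr in enumerate(rows) if any(like_github(v) for v in arr)}

print(first_element_hits, any_match_hits)
```

Every row selected by the first-element check is also selected by the any-element check, which is why the third query returns a superset.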

Based on the data flow and the row scan rate, I see that the first query reads more data than the second, and the second reads more data than the third.
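
The arithmetic behind that observation can be sketched as follows. It uses only the approximate figures reported above and assumes, for illustration, that each rate stays constant over the whole scan and that 1 MB is 10^6 bytes.

```python
# Approximate figures from the measurements above.
throughput_bytes_per_sec = 200 * 10**6  # around 200 MB/sec for all three queries
total_rows = 60_000_000                 # rows in month='6'

rows_per_sec = {
    "array_field1[1]": 50_000,
    "element_at(array_field1, 1)": 100_000,
    "any_match(array_field1, ...)": 1_000_000,
}

# Bytes read per row scanned = data rate / row rate.
bytes_per_row = {q: throughput_bytes_per_sec // r for q, r in rows_per_sec.items()}

# Rough total data read for the month, assuming the rates hold throughout.
est_total_gb = {q: b * total_rows / 10**9 for q, b in bytes_per_row.items()}

for q in rows_per_sec:
    print(f"{q}: {bytes_per_row[q]} bytes/row, ~{est_total_gb[q]:.0f} GB total")
```

Under these assumptions the first query reads roughly 4 KB per row, the second roughly 2 KB, and the third roughly 200 bytes, a 20x spread between the first and the third.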

The result goes against my intuition.
Can someone explain whether this is a bug or a missing implementation?
(I reviewed the release notes for versions 337, 338, and 339, and no fix there appears to be related to this issue.)
