How to improve Parquet reading performances? #7737

devoxi · 2023-10-04T08:50:46Z

Hello!

Over the last couple of days we've been experimenting with DataFusion (31.0) and comparing its performance with our existing ClickHouse setup. To do so, we exported a 5GB Parquet dataset and, after building some UDFs, managed to replicate some of our queries.

In the end we are running a single, fairly simple query over the same Parquet dataset, on the same Mac, with both DataFusion and ClickHouse. ClickHouse consistently answers in about 700ms, while DataFusion takes about 1.2s.

I've tried multiple settings, verified that our UDF was not the cause, and checked that ClickHouse was not serving results from a cache, but I couldn't make DataFusion any faster. According to the EXPLAIN ANALYZE output, the poor performance comes from the Parquet scan.

I have to confess that we are beginners in Rust and may have missed something, hence this message.
Here is the EXPLAIN ANALYZE output of our query, in case it helps:

Plan with Metrics | ProjectionExec: expr=[SUM(my_table.sign)@0 as tcount, SUM(my_udf(my_table.nested_field.array_column,List([custom_string])) * my_table.sign)@1 as _ccount_1], metrics=[output_rows=1, elapsed_compute=584ns]
     AggregateExec: mode=Final, gby=[], aggr=[SUM(my_table.sign)@0 as tcount, SUM(my_udf(my_table.nested_field.array_column,List([custom_string])) * my_table.sign)], metrics=[output_rows=1, elapsed_compute=48.459µs]
        CoalescePartitionsExec, metrics=[output_rows=20, elapsed_compute=5.708µs]
            AggregateExec: mode=Partial, gby=[], aggr=[SUM(my_table.sign)@0 as tcount, SUM(my_udf(my_table.nested_field.array_column,List([custom_string])) * my_table.sign)], metrics=[output_rows=20, elapsed_compute=4.769958246s]
                ProjectionExec: expr=[sign@0 as sign, nested_field.array_column@3 as nested_field.array_column], metrics=[output_rows=6185527, elapsed_compute=396.238µs]
                    CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=6185527, elapsed_compute=168.795215ms]
                        FilterExec: int_column@1 = 2 AND (CAST(string_column@2 AS Utf8) !~* (.*.|)(word1|word2|word3|word4).*), metrics=[output_rows=6185527, elapsed_compute=652.683132ms]
                            ParquetExec: file_groups={20 groups: [[path/to/parquet/my_dataset.parquet:0..262363404], [path/to/parquet/my_dataset.parquet:262363404..524726808], [path/to/parquet/my_dataset.parquet:524726808..787090212], [path/to/parquet/my_dataset.parquet:787090212..1049453616], [path/to/parquet/my_dataset.parquet:1049453616..1311817020], ...]}, projection=[sign, int_column, string_column, nested_field.array_column], predicate=int_column@5 = 2 AND (CAST(string_column@79 AS Utf8) !~* (.*.|)(word1|word2|word3|word4).*), pruning_predicate=int_column_min@0 <= 2 AND 2 <= int_column_max@1, metrics=[output_rows=6185527, elapsed_compute=20ns, page_index_rows_filtered=0, predicate_evaluation_errors=0, file_scan_errors=0, row_groups_pruned=13, num_predicate_creation_errors=0, file_open_errors=0, pushdown_rows_filtered=2880077, bytes_scanned=311061152, time_elapsed_processing=5.335386071s, pushdown_eval_time=781.00424ms, time_elapsed_scanning_until_data=1.472284126s, time_elapsed_opening=3.032638581s, time_elapsed_scanning_total=18.9360584s, page_index_eval_time=1.318µs]

We also noticed in some other queries that, when the data is split across more Parquet files, DataFusion falls much further behind ClickHouse than it does with a single Parquet file.

So is there anything we might have missed, some piece of general knowledge, that could explain this performance?
Thanks for your help!

tustvold · 2023-10-04T11:35:47Z

Collaborator

Some ideas:

- Run in release mode
- Enable SIMD instructions, see performance tips here
- Enable Parquet filter pushdown - https://docs.rs/datafusion/latest/datafusion/common/config/struct.ParquetOptions.html#structfield.pushdown_filters
- Rewrite the regex filter to be a cheaper InList or disjunction of LIKE expressions, to avoid expensive regex evaluation
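
To illustrate the last idea: an unanchored, case-insensitive match against `(.*.|)(word1|word2|word3|word4).*` succeeds exactly when the string contains one of the words, so the negated filter can be expressed as plain substring checks. The following is a minimal sketch of that logic in standard-library Rust, not DataFusion code. The function names are hypothetical, and it assumes the words are ASCII, so it lowercases with ASCII rules rather than the Unicode case folding a regex engine would apply. In SQL the same idea would be a conjunction of `NOT ILIKE '%word%'` predicates, assuming the words contain no LIKE wildcards.

```rust
/// Returns true when `haystack` contains any of `needles`, ignoring ASCII case.
/// `needles` are expected to be lowercase already.
fn contains_any_ignore_case(haystack: &str, needles: &[&str]) -> bool {
    let lowered = haystack.to_ascii_lowercase();
    needles.iter().any(|needle| lowered.contains(*needle))
}

/// Mirrors the filter `string_column !~* (.*.|)(word1|word2|word3|word4).*`:
/// a row is kept only when none of the words occurs anywhere in the value.
/// The word list uses the placeholders from the query plan.
fn keep_row(string_column: &str) -> bool {
    const WORDS: [&str; 4] = ["word1", "word2", "word3", "word4"];
    !contains_any_ignore_case(string_column, &WORDS)
}

fn main() {
    for value in ["plain text", "has Word2 inside", "WORD4", ""] {
        println!("{:?} -> keep = {}", value, keep_row(value));
    }
}
```

Whether this rewrite is actually faster inside DataFusion depends on the data and the version in use, so it is worth measuring with EXPLAIN ANALYZE before and after.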

tustvold Oct 4, 2023
Collaborator

That's interesting. The likely reason is that it reads the footer once for every file group, which means it is doing that 20 times. Normally that cost is outweighed by the additional parallelism, but it is possible that the Parquet file has been written in such a way that this parallelism isn't possible. Arrow-cpp had a bug for a very long time where it produced massive row groups, and DuckDB has an interesting approach to the spec 😅
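
The 20 file groups in the plan are byte ranges of the same file, each 262363404 bytes long. A rough sketch of how such a split could be computed, and why the number of footer reads follows `target_partitions`, is shown here. This is an illustration in standard-library Rust, not DataFusion's actual implementation: `split_into_ranges` is a hypothetical helper, the use of ceiling division is an assumption, and the file size is assumed to be exactly 20 × 262363404 bytes, which the plan does not confirm.

```rust
/// Splits `file_size` bytes into at most `target_partitions` contiguous byte
/// ranges of (almost) equal length, using ceiling division for the length.
fn split_into_ranges(file_size: u64, target_partitions: u64) -> Vec<(u64, u64)> {
    assert!(target_partitions > 0, "need at least one partition");
    let range_len = (file_size + target_partitions - 1) / target_partitions;
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < file_size {
        let end = (start + range_len).min(file_size);
        ranges.push((start, end));
        start = end;
    }
    ranges
}

fn main() {
    // Assumed file size: 20 ranges of 262_363_404 bytes, as seen in the plan.
    let file_size: u64 = 20 * 262_363_404;
    for target_partitions in [20u64, 4, 1] {
        let ranges = split_into_ranges(file_size, target_partitions);
        // Per the explanation above, each range reads the footer once.
        println!(
            "target_partitions={}: {} ranges (one footer read each), first range {:?}",
            target_partitions,
            ranges.len(),
            ranges[0]
        );
    }
}
```

With fewer partitions there are fewer footer reads but also less parallelism, which is the trade-off described above.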

Couple of questions:

- What did you use to write the Parquet file?
- How many columns does the Parquet file have?
- How many row groups does the file contain? This can be found with https://github.com/apache/arrow-rs/blob/master/parquet/src/bin/parquet-layout.rs

Couple of things to try:

- Disable repartition_file_scans: https://docs.rs/datafusion/latest/datafusion/config/struct.OptimizerOptions.html#structfield.repartition_file_scans
- Reduce the number of target_partitions: https://docs.rs/datafusion/latest/datafusion/config/struct.ExecutionOptions.html#structfield.target_partitions
- Rewrite the file with a smaller max_row_group_size: https://docs.rs/datafusion/latest/datafusion/common/config/struct.ParquetOptions.html#structfield.max_row_group_size

devoxi Oct 4, 2023
Author

To write the Parquet file I used ClickHouse (v23.3). (By the way, for the comparison with DataFusion I used ClickHouse v23.9.)
The file has 279 row groups according to parquet-layout. (The whole Parquet file is about 5GB and contains 250 columns and 5M rows.)

I'll try what you suggested and come back later with the results :) Thanks for your help!

devoxi Oct 4, 2023
Author

So, I rewrote my Parquet file with datafusion-cli using this command:
COPY 'my_file_v1.parquet' TO 'my_file_v2.parquet' (format parquet, single_file_output true, compression snappy);
The number of row groups went down to 10. This drastically improved the query time for the query with only the SUM(sign), which is now even faster than in ClickHouse.
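
One way to see why the row group count matters: a Parquet footer holds one metadata entry per column chunk, that is, per column per row group, and the footer is decoded once per file group. The arithmetic below is a back-of-the-envelope sketch using the numbers from this thread (250 columns, 279 versus 10 row groups, 20 file groups). The helper names are hypothetical, and it assumes the rewritten file is still scanned as 20 file groups, which was not stated.

```rust
/// Number of column-chunk metadata entries in a Parquet footer:
/// one per column per row group.
fn column_chunk_entries(row_groups: u64, columns: u64) -> u64 {
    row_groups * columns
}

/// Entries decoded in total when the whole footer is decoded once per file group.
fn entries_decoded(row_groups: u64, columns: u64, footer_reads: u64) -> u64 {
    column_chunk_entries(row_groups, columns) * footer_reads
}

fn main() {
    let columns = 250; // stated in the thread
    let footer_reads = 20; // one per file group in the plan (assumed unchanged)
    for row_groups in [279u64, 10] {
        println!(
            "{} row groups: {} entries per footer, {} decoded across {} reads",
            row_groups,
            column_chunk_entries(row_groups, columns),
            entries_decoded(row_groups, columns, footer_reads),
            footer_reads
        );
    }
}
```

This only counts metadata entries. It says nothing about the time each one costs, so it indicates the direction of the effect rather than predicting the speedup.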

However, I couldn't properly test my full query, as the rewritten Parquet file has an issue with my nested column that leads to this error:
Error: ArrowError(ExternalError(ArrowError("Parquet argument error: Parquet error: Invalid offset in sparse column chunk data: 145661441")))
I now understand the problems that can arise from differing Parquet implementations, and it's definitely something we'll take into account from now on 😅

I also managed to test my query including the regex, and on this one ClickHouse was still faster.
I also saw that you opened a draft PR to mitigate the issue when there are a lot of row groups, so I tried your branch. While it did change the shape of the flamegraph, the total execution time didn't really decrease. So there might be other issues, but they are probably not linked to Parquet.

Anyway, thanks a lot for your help, it was really appreciated!

Ted-Jiang Oct 7, 2023
Collaborator

> That's interesting. The likely reason is that it reads the footer once for every file group, which means it is doing that 20 times.

@tustvold could you please show me where the code is 🤣 It took me a long time to find it. Could we read the footer at the file level and pass the offset info down to the file group level 🤔

tustvold Oct 7, 2023
Collaborator

It's a consequence of the repartition file scans pass. #7739 is one option to rectify this, but I'm not sure it is a good idea.

alamb · 2023-10-04T14:04:59Z

Collaborator

I also think @Ted-Jiang added code (not yet released) in #7570 that caches Parquet data statistics. Maybe this could help the use case described in this PR as well. The use case was a little different (reusing the statistics within a session, rather than within a query).
