Memory usage during Parquet scan #5901
-
Hi DataFusion community! I'm relatively new to Arrow, DataFusion, and Rust, and I have a question about how Parquet records are streamed. I run an Arrow Flight service in front of DataFusion. When scanning a 50 GB table made up of physical Parquet files of roughly 256 MB to 1 GB each, I see memory usage of around 1-2 GB. My impression is that, as the Arrow batches are yielded to the outbound Flight stream, their memory is not released until the whole file has been scanned. Is this accurate? I ask because, as we look to support many concurrent requests, holding onto a whole file per request requires a lot of memory. When we reduced the Parquet file size to ~32 MB each, we could support many more requests, but I was wondering whether there is a DataFusion configuration option (or a possible feature request) that would keep memory usage low when streaming larger files. We are using version 19. Thanks!
-
You could possibly try reducing the number of rows per row group.
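If you control how the files are written, the row-group size can be set through the `parquet` crate's `WriterProperties` when writing with `ArrowWriter`. A minimal sketch, assuming the Rust `parquet` crate; the `10_000` value is purely illustrative and should be tuned against your schema and batch sizes:

```rust
use parquet::file::properties::WriterProperties;

// Smaller row groups mean the reader decodes and holds smaller units
// in memory at a time, at the cost of more per-group metadata.
// 10_000 rows per group is an illustrative value, not a recommendation.
let props = WriterProperties::builder()
    .set_max_row_group_size(10_000)
    .build();

// Pass `props` when constructing the writer, e.g.:
// let writer = parquet::arrow::ArrowWriter::try_new(file, schema, Some(props))?;
```

The trade-off is the same as with smaller files: more, smaller row groups reduce peak memory per scan but increase metadata overhead and can reduce compression efficiency.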