Description
Hi there,
I have a question regarding the `chunked=True` option in `awswrangler.s3.read_parquet()`.
I'm looking to load parquet files from S3 in the most memory-efficient way possible. Our data has a differing number of rows per parquet file, but the same number of columns (11). I'd like the results from `read_parquet()` to come back as one pandas DataFrame per parquet file, i.e. if the `filter_query` matches 10 parquet files, I receive 10 pandas DataFrames in return. Passing a fixed chunk size to `chunked` works if the number of rows is the same every time, but with our data the row count differs from file to file, so hard-coding a chunk size isn't feasible.
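For reference, a rough sketch of what I'm doing (the bucket path is a placeholder, and I've left out the `filter_query` logic mentioned above):

```python
import awswrangler as wr

# Placeholder path; the real call also applies the filter_query described above.
path = "s3://my-bucket/my-dataset/"

# With chunked=True I expected to get back one DataFrame per parquet file.
for df in wr.s3.read_parquet(path=path, dataset=True, chunked=True):
    print(df.shape)
```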
The documentation says:
> If `chunked=True`, a new DataFrame will be returned for each file in your path/dataset.
However, it also seems to be choosing an arbitrary chunk size instead (in my case, chunks of 65536 rows).
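For now I can get per-file DataFrames by listing the objects myself and reading each one individually, roughly like the sketch below (again with a placeholder path, and assuming `list_objects` accepts a `suffix` filter), but I'd prefer to rely on `chunked=True` if it's meant to behave as documented.

```python
import awswrangler as wr

path = "s3://my-bucket/my-dataset/"  # placeholder

# Reading each parquet object on its own guarantees one DataFrame per file.
for key in wr.s3.list_objects(path, suffix=".parquet"):
    df = wr.s3.read_parquet(path=key)
    print(key, df.shape)
```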
Is there something I'm missing here? Thanks very much for your help!