Description
Describe the bug
I was using wr.s3.download on a 2 GiB memory VM and noticed that downloading a 1006 MiB GZIP file from S3 allocates ~2295 MiB, both with and without the use_threads parameter. This was measured using this memory profiler.
As a result, my script fails with an OOM error on the 2 GiB machine with 2 CPUs. dmesg gives a slightly different memory estimate:
$ dmesg | tail -1
Out of memory: Killed process 10020 (python3) total-vm:2573584kB, anon-rss:1644684kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:3844kB oom_score_adj:0
It turns out that wr.s3.download by default uses botocore's s3.get_object and reads the whole response into memory:
aws-sdk-pandas/awswrangler/s3/_fs.py
Lines 65 to 75 in 7e83b89
Is it possible to read the botocore response in chunks in awswrangler to be more memory efficient?
For instance, using the following snippet I got my file without any issues on the same machine:
import boto3

s3 = boto3.client("s3")  # kwargs holds the Bucket/Key arguments for get_object
raw_stream = s3.get_object(**kwargs)["Body"]
with open("test_botocore_iter_chunks.gz", 'wb') as f:
    # read the body in 64 KiB chunks instead of loading it all at once
    for chunk in iter(lambda: raw_stream.read(64 * 1024), b''):
        f.write(chunk)
I also tried the wr.config.s3_block_size parameter, expecting it to chunk the response, but it does not help. After setting s3_block_size to less than the file size, you fall into this if condition:
aws-sdk-pandas/awswrangler/s3/_fs.py
Line 326 in 7e83b89
which still reads the whole response into memory.
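For reference, this is roughly what I tried; the S3 path, local file name, and the 8 MiB block size below are only illustrative placeholders:

import awswrangler as wr

# placeholders for the ~1 GiB GZIP object and the local target
path = "s3://my-bucket/my-file.gz"
local_file = "my-file.gz"

# example value only: 8 MiB, well below the object size
wr.config.s3_block_size = 8 * 1024 * 1024

# peak memory usage is the same as with the default block size
wr.s3.download(path=path, local_file=local_file)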
How to Reproduce
Use a memory profiler on wr.s3.download(path, local_file).
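A minimal sketch, assuming the memory_profiler package; the S3 path and local file name are placeholders:

import awswrangler as wr
from memory_profiler import memory_usage

# placeholders for the ~1 GiB GZIP object and the local target
path = "s3://my-bucket/my-file.gz"
local_file = "my-file.gz"

# sample memory usage while the download runs and report the peak (in MiB)
peak = max(memory_usage((wr.s3.download, (path, local_file))))
print(f"peak memory usage: {peak:.0f} MiB")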
Expected behavior
Please let me know if it is already possible to read the response in chunks.
Your project
No response
Screenshots
No response
OS
Linux
Python version
3.6.9 -- this is old, but I can double-check on newer versions
AWS SDK for pandas version
2.14.0
Additional context
No response