Replies: 2 comments
-
The code is fine, so it looks like something else is going on. Did you try running it without chunked?
-
I've tried iterating without the chunked parameter; it doesn't seem to work at all.
-
P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.
Hi,
I'm trying to use the awswrangler (version 2.2.0) read_parquet function with chunked, but without any success.
I have a Lambda on AWS that goes to an S3 location, gets a parquet file (with 1,000,00 records), and processes the records incrementally.
The Lambda is configured with 1024 MB of memory and a timeout of 10 minutes.
The Lambda runs until it hits the timeout, but it never seems to get to the records.
I've tried printing an index just to see whether the loop is entered, but there are no prints from inside the loop.
I've also tried using the next() function; it also reaches the timeout, again with no prints from inside the loop:
Increasing the memory and timeout is less relevant because eventually I'm expecting bigger parquet files, so I'm trying to understand what is wrong with this basic code for a small file like this.
Any help/suggestion would be appreciated.
Thanks
Eran