Using chunksize gives TypeError: 'TextFileReader' object does not support item assignment #106

Open
nigelcharman opened this issue Jul 2, 2024 · 4 comments

Comments

@nigelcharman

We've been using python-dwca-reader with no problems loading about 13k occurrences. We now need to scale it up to load about 3.25m occurrences.

Changing the code from:

        core_df = dwca.pd_read('occurrence.txt', parse_dates=True)

to:

        for chunk in dwca.pd_read('occurrence.txt', parse_dates=True, chunksize=10):
            ...

causes the error:

    ...
    for chunk in dwca.pd_read('occurrence.txt', parse_dates=True, chunksize=10):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/opt/asdf/installs/python/3.11.7/lib/python3.11/site-packages/dwca/read.py", line 209, in pd_read
    df[shorten_term(field['term'])] = field_default_value
    ~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'TextFileReader' object does not support item assignment

Looking at gbif-alert, I see that you're using enumerate(dwca) rather than reading it in chunks, so I'll give that a try.

@nigelcharman
Author

We're now using enumerate(dwca) so we're in no rush to have this corrected. I'll leave the issue open though in case other people come across it.

@niconoe
Member

niconoe commented Jul 8, 2024

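Note to self: this only happens when chunksize (and probably also the iterator parameter) is combined with a DwC-A that uses default values. With those parameters the underlying Pandas call returns a TextFileReader rather than a regular DataFrame, and a TextFileReader does not support the column assignment that pd_read uses to fill in default values.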
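
The behaviour described above can be reproduced with Pandas alone. The following is a minimal sketch, not the library's code: the CSV text, the `country` column, and the `Belgium` default are made-up stand-ins for a core file and a metafile default value. It shows that `pd.read_csv` returns a `DataFrame` without `chunksize` and a `TextFileReader` with it, and that the default can still be applied to each chunk by hand.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for a core data file (hypothetical data).
CSV_TEXT = "id,scientificName\n1,Apis mellifera\n2,Bombus terrestris\n3,Vespa crabro\n"

# Hypothetical default value, like one a DwC-A metafile might declare for a field.
DEFAULT_TERM = "country"
DEFAULT_VALUE = "Belgium"


def read_whole():
    """Without chunksize, read_csv returns a DataFrame, so column assignment works."""
    df = pd.read_csv(io.StringIO(CSV_TEXT))
    df[DEFAULT_TERM] = DEFAULT_VALUE
    return df


def assign_on_reader():
    """With chunksize, read_csv returns a TextFileReader; assigning to it fails."""
    reader = pd.read_csv(io.StringIO(CSV_TEXT), chunksize=2)
    try:
        reader[DEFAULT_TERM] = DEFAULT_VALUE
    except TypeError as exc:
        return type(reader).__name__, str(exc)
    finally:
        reader.close()
    return type(reader).__name__, None


def read_in_chunks():
    """Workaround sketch: apply the default to each chunk, which is a DataFrame."""
    chunks = []
    with pd.read_csv(io.StringIO(CSV_TEXT), chunksize=2) as reader:
        for chunk in reader:
            chunk[DEFAULT_TERM] = DEFAULT_VALUE
            chunks.append(chunk)
    return chunks
```

Applying the default per chunk keeps memory use bounded, but it only works if the caller knows the archive's default values, which is presumably why it is hard to do transparently inside `pd_read`.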

niconoe added a commit that referenced this issue Jul 8, 2024
… Pandas option and archives with default values (issue #106).

@niconoe
Member

niconoe commented Jul 8, 2024

After careful inspection I can't see any sane way to deal with this specific combination (pd_read returning TextFileReader objects because of its parameters and the DwC-A using default values).

I therefore decided to document the incompatibility and add a human-readable exception for that situation. This is also tested.
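
As an illustration of what such a guard could look like, here is a rough sketch. It is not the code from the referenced commit: the exception name `DefaultValuesNotSupported`, the helper `apply_defaults`, and the message text are all hypothetical, and the library's actual exception and wording may differ.

```python
import io

import pandas as pd


class DefaultValuesNotSupported(Exception):
    """Hypothetical exception name; the library's real exception may differ."""


def apply_defaults(result, defaults):
    """Add constant columns to a DataFrame, or fail with a clear message.

    `result` is whatever pd.read_csv returned; `defaults` maps column
    names to the constant value each column should hold.
    """
    if not isinstance(result, pd.DataFrame):
        # read_csv returned an iterator-like reader (chunksize / iterator=True).
        raise DefaultValuesNotSupported(
            "Default values cannot be applied when the reader returns "
            f"{type(result).__name__} (e.g. with chunksize or iterator=True)."
        )
    for column, value in defaults.items():
        result[column] = value
    return result
```

Checking for `DataFrame` rather than for `TextFileReader` keeps the guard independent of where Pandas defines its reader class.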

@nigelcharman
Author

Would it be worth adding a note to https://python-dwca-reader.readthedocs.io/en/latest/pandas_tutorial.html too? It was this documentation that led me to believe that this combination might be possible.
