Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Csvxlsx file reader #664

Merged
merged 15 commits into from
Jan 17, 2025
Merged

Csvxlsx file reader #664

merged 15 commits into from
Jan 17, 2025

Conversation

sallymatson
Copy link
Collaborator

@sallymatson sallymatson commented Jan 9, 2025

Description

Added a dataframe reader that uses pandas to handle csv and excel input. Wrote tests for the functionality of both. Added openpyxl to poetry as a dependency for reading excel files.

Fixes # (issue)

Type of change

  • New feature (non-breaking change which adds functionality)
  • Optimization (back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)

Key checklist

  • Make sure you've run the pre-commit checks: $ pre-commit run -a
  • All tests pass: $ poetry run pytest

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added that prove fix is effective or that feature works
  • Relevant documentation reviewed and updated

@sallymatson sallymatson linked an issue Jan 9, 2025 that may be closed by this pull request
@codecov-commenter
Copy link

codecov-commenter commented Jan 9, 2025

Codecov Report

Attention: Patch coverage is 90.47619% with 4 lines in your changes missing coverage. Please review.

Project coverage is 94.60%. Comparing base (047fe6e) to head (2bf1441).

Files with missing lines Patch % Lines
virtual_ecosystem/core/readers.py 90.47% 4 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #664      +/-   ##
===========================================
- Coverage    94.64%   94.60%   -0.04%     
===========================================
  Files           73       73              
  Lines         4759     4800      +41     
===========================================
+ Hits          4504     4541      +37     
- Misses         255      259       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sallymatson sallymatson requested a review from davidorme January 9, 2025 15:22
@sallymatson sallymatson marked this pull request as ready for review January 9, 2025 15:22
Copy link
Collaborator

@davidorme davidorme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good. I've requested a minor change on the failure modes.

One thing - I'm not 100% sure about how the try stack is a mix of CSV failure modes and Excel failure modes without it being clear which one is which. That seems harder to maintain. We could split them, but then you'd duplicate the FileNotFoundError, unless the code specifically tests for the file using Path().exists() first - but that's bad practice because the file could in theory vanish between the two calls.

Should we just have separate CSV and Excel reader functions? It's more verbose but feels cleaner. Thoughts?

Comment on lines +178 to +181
except Exception as err:
to_raise = Exception(f"Unidentified exception opening {file}: {err}")
LOGGER.critical(to_raise)
raise to_raise
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth explicitly adding BadZipError - it's by far the most likely thing to happen and we can effectively report it as "Does not appear to be an Excel file". The general Exception can then handle anything else. I'd add a comment on that exception to say that openpyxl has a whole bunch of assorted failure modes for corrupt/non-excel Zip files and we're not going to try and track down an exhaustive list.

You can then add an actual zip file of (e.g.) a simple text file to test that failure mode.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do!!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a corrupted excel file which triggers that error - I think that should cover that failure case?

@sallymatson
Copy link
Collaborator Author

This all looks good. I've requested a minor change on the failure modes.

One thing - I'm not 100% sure about how the try stack is a mix of CSV failure modes and Excel failure modes without it being clear which one is which. That seems harder to maintain. We could split them, but then you'd duplicate the FileNotFoundError, unless the code specifically tests for the file using Path().exists() first - but that's bad practice because the file could in theory vanish between the two calls.

Should we just have separate CSV and Excel reader functions? It's more verbose but feels cleaner. Thoughts?

Yeah, it already pained me a bit to have the duplication from load_from_netcdf because they are so similar (e.g. syntax for checking if the variable is in the dataset), but you're right that it would all be more explicit if they are separate. Especially since they read_csv and read_excel have relatively different behaviour. I can split them up :)

@sallymatson
Copy link
Collaborator Author

@davidorme I pushed some changes to split up the tests and now the builds are failing. It seems like it is a problem with the poetry version in ci.yml, but that doesn't make sense because I didn't change any files related to that. I have a feeling there is something silly that I've done, does it set of any immediate lightbulbs for you?

@davidorme
Copy link
Collaborator

@sallymatson No - that's not you. For reasons I don't understand, dependabot bumped the poetry-action version to 4.0.0 a few days ago (#652).

The only problem being 4.0.0 doesn't exist. I can't find any indication it ever did. No idea what happened.

Can you push a change to to ci.yml in this branch setting:

    - name: Install Poetry
      uses: abatilo/[email protected]
      with:
        poetry-version: 1.5.1

@davidorme
Copy link
Collaborator

Aha! It did exist briefly, and the maintainer pulled it. abatilo/actions-poetry#88

@davidorme
Copy link
Collaborator

@sallymatson Is this ready for re-review? If so, can you re-request review?

@sallymatson sallymatson requested a review from davidorme January 13, 2025 16:50
@sallymatson sallymatson requested review from davidorme and removed request for davidorme January 17, 2025 10:38
Copy link
Collaborator

@davidorme davidorme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I guess I'd also add something to the load_excel that says at present the data is explicitly only loaded from the first sheet. We could fix that, but not now?

@sallymatson sallymatson merged commit cc2e55c into develop Jan 17, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a CSV/XLSX file reader to core
3 participants