Add a CSV/XLSX file reader to core #656

Open
davidorme opened this issue Jan 7, 2025 · 0 comments · May be fixed by #664

Both the plants and animals models require users to provide cohort data. For plants, this means providing tuples of data:

(cell id, plant functional type, number of individuals, individual size)

There can be multiple entries per cell id and different numbers of cohorts per cell. The easiest and sanest format for this data is a simple data frame of those tuples, and the natural format for creating and maintaining that data is a CSV or XLSX file. Forcing users to convert this into NetCDF for input is not sensible.
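As a rough illustration of why a data frame fits this data, here is a minimal sketch of a cohort table read with pandas. The column names (`cell_id`, `pft`, `n_individuals`, `size`) and the values are made up for the example; the issue only fixes the content of the tuple, not the headers.

```python
import io

import pandas as pd

# Hypothetical column names and values: only the tuple layout
# (cell id, plant functional type, number of individuals, individual size)
# comes from the issue text.
csv_text = """cell_id,pft,n_individuals,size
0,broadleaf,100,0.15
0,conifer,20,0.40
1,broadleaf,50,0.22
2,broadleaf,10,0.80
2,conifer,5,0.35
2,shrub,200,0.05
"""

cohorts = pd.read_csv(io.StringIO(csv_text))

# Cells hold different numbers of cohorts, which a long-format table
# handles without any padding or ragged arrays.
cohorts_per_cell = cohorts.groupby("cell_id").size()
print(cohorts_per_cell.to_dict())
```

Each row is one cohort, so adding or removing a cohort is a one-line change in the file.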

So, we need to:

  • Add a CSV/XLSX loader.
  • This should use pandas, which is already a dependency of xarray and is designed explicitly to handle data frames, rather than the standard library csv module or any of the numpy structures.
  • I think we will need to explicitly add openpyxl to [tool.poetry.dependencies] to support reading XLSX format.
  • Test that it works!
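A minimal sketch of the reading step, assuming the loader picks the pandas reader from the file suffix. The helper name `read_table` is hypothetical and not part of the project. `pandas.read_excel` reads `.xlsx` files through the openpyxl engine, so that package has to be installed for the XLSX branch to work.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_table(file: Path) -> pd.DataFrame:
    """Read a CSV or XLSX file into a data frame, chosen by file suffix.

    Hypothetical helper for illustration only.
    """
    suffix = file.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(file)
    if suffix == ".xlsx":
        # pandas delegates .xlsx parsing to openpyxl, which must be installed.
        return pd.read_excel(file, engine="openpyxl")
    raise ValueError(f"Unsupported file type: {suffix}")


# Usage with a temporary CSV file (the XLSX branch is not exercised here).
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cohorts.csv"
    path.write_text("cell_id,n_individuals\n0,100\n1,50\n")
    table = read_table(path)

print(table)
```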

It should go in virtual_ecosystem.core.readers and I think the signature will look like:

@register_file_format_loader(file_types=(".csv", ".xlsx"))
def load_from_dataframe(file: Path, var_name: str) -> DataArray:
    """Loads a DataArray from a data frame format."""

The format registry should then automatically switch to using this loader for CSV and XLSX files.
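To make the dispatch idea concrete, here is a minimal sketch of a suffix-based registry. It is a stand-in written for this example, not the project's actual `register_file_format_loader` implementation: `LOADERS`, `register_loader`, `load` and the stub loaders are all hypothetical, and the stubs return strings rather than a `DataArray` so the block runs with the standard library alone.

```python
from pathlib import Path
from typing import Callable

# Hypothetical stand-in for the format registry: file suffix -> loader.
LOADERS: dict[str, Callable[[Path, str], str]] = {}


def register_loader(file_types: tuple[str, ...]):
    """Register the decorated function as the loader for each suffix."""

    def decorator(func):
        for file_type in file_types:
            LOADERS[file_type] = func
        return func

    return decorator


@register_loader(file_types=(".nc",))
def load_netcdf_stub(file: Path, var_name: str) -> str:
    # Stub: a real loader would open the file and return the variable.
    return f"netcdf:{var_name}"


@register_loader(file_types=(".csv", ".xlsx"))
def load_dataframe_stub(file: Path, var_name: str) -> str:
    # Stub: a real loader would read the table and return one column.
    return f"dataframe:{var_name}"


def load(file: Path, var_name: str) -> str:
    """Dispatch to whichever loader is registered for the file's suffix."""
    try:
        loader = LOADERS[file.suffix.lower()]
    except KeyError:
        raise ValueError(f"No loader registered for {file.suffix!r}") from None
    return loader(file, var_name)


print(load(Path("cohorts.csv"), "n_individuals"))
```

With this shape, supporting a new format only means registering one more function; the calling code never needs to know which formats exist.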

There is some ugliness here: because we don't have persistent file handles, the file will be reopened for every variable loaded from it, although the same is currently true for NetCDF. A better approach in future would be to open each file named in the data configuration once and load the tuple of variables that are claimed to live in that file, rather than independently opening the file specified for each variable.
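One way the "open each file once" idea might look, as a sketch only: invert the variable-to-file mapping, then read each file a single time and pull out every variable that claims to live in it. The variable names, file names and the `variable_sources` structure are invented for the example and do not reflect the real data configuration format; in-memory strings stand in for files on disk.

```python
import io
from collections import defaultdict

import pandas as pd

# Hypothetical data configuration: variable name -> source file.
variable_sources = {
    "plant_cohorts_n": "cohorts.csv",
    "plant_cohorts_size": "cohorts.csv",
    "elevation": "terrain.csv",
}

# In-memory stand-ins for files on disk, so the sketch needs no real I/O.
fake_files = {
    "cohorts.csv": "plant_cohorts_n,plant_cohorts_size\n100,0.15\n20,0.40\n",
    "terrain.csv": "elevation\n512.0\n498.5\n",
}

# Invert the mapping: each file lists the variables claimed to live in it.
variables_by_file = defaultdict(list)
for var_name, file_name in variable_sources.items():
    variables_by_file[file_name].append(var_name)

loaded = {}
read_count = 0
for file_name, var_names in variables_by_file.items():
    # One read per file, however many variables it provides.
    table = pd.read_csv(io.StringIO(fake_files[file_name]))
    read_count += 1
    missing = set(var_names) - set(table.columns)
    if missing:
        raise KeyError(f"{file_name} is missing variables: {sorted(missing)}")
    for var_name in var_names:
        loaded[var_name] = table[var_name]

print(read_count, sorted(loaded))
```

Three variables are loaded with two reads here; the per-variable approach would need three.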

@davidorme davidorme assigned davidorme and sallymatson and unassigned davidorme Jan 7, 2025
@sallymatson sallymatson linked a pull request Jan 9, 2025 that will close this issue