This repository has been archived by the owner on Sep 4, 2024. It is now read-only.

Automatically avoid having too many files in a directory when splitting occurrences #284

Open
cjgrady opened this issue May 11, 2022 · 3 comments
Labels
enhancement New feature or request user Tasks related to users and user experience

Comments

cjgrady commented May 11, 2022

Is your feature request related to a problem? Please describe.
Aimee suggests that it is too complicated for users to figure out how and why to configure the split_occurrences tool to group occurrence records so that no single directory ends up with too many files.

Describe the solution you'd like
The tool should detect when the number of files generated in a directory (assume one per species) passes a threshold, which may or may not be configurable. When that happens, it should chunk the key field to create a directory hierarchy that reduces the number of files in any single directory. Ideally, the resulting directories make sense to the user.
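A minimal sketch of the threshold check described above, assuming one output file per distinct key value. The function name `needs_chunking` and the default of 1000 files are hypothetical and are not taken from the split_occurrences tool.

```python
def needs_chunking(keys, max_files=1000):
    """Return True if writing one file per distinct key would exceed max_files.

    Hypothetical helper: assumes every distinct key value becomes exactly
    one file in a single flat directory.
    """
    return len(set(keys)) > max_files


species = ["Acer rubrum", "Acer saccharum", "Quercus alba", "Acer rubrum"]
print(needs_chunking(species, max_files=2))   # True: 3 distinct keys > 2
print(needs_chunking(species, max_files=10))  # False
```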

Describe alternatives you've considered

  1. Leave things as they are and explain in the documentation how to configure the tool for larger datasets.
  2. Always chunk the key field(s).
@cjgrady cjgrady self-assigned this May 11, 2022
@cjgrady cjgrady added enhancement New feature or request user Tasks related to users and user experience labels May 11, 2022
@cjgrady cjgrady added this to the Version-3.2 milestone May 11, 2022

cjgrady commented May 11, 2022

For species binomials, I think a clean structure would be something like {First letter of genus}/{Genus}/{Species name}.csv, or maybe the top level is the first two letters of the genus.
Acer rubrum -> A/Acer/Acer rubrum.csv or Ac/Acer/Acer rubrum.csv
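As a rough illustration of that layout, here is a sketch that builds the path from a binomial. The helper `binomial_path` and its `prefix_len` parameter are made up for this example; it assumes the genus is everything before the first space.

```python
from pathlib import PurePosixPath


def binomial_path(species_name: str, prefix_len: int = 1) -> PurePosixPath:
    """Build {genus prefix}/{Genus}/{Species name}.csv (hypothetical helper).

    Assumes the genus is the text before the first space in the name.
    """
    genus = species_name.split(" ", 1)[0]
    return PurePosixPath(genus[:prefix_len], genus, f"{species_name}.csv")


print(binomial_path("Acer rubrum"))     # A/Acer/Acer rubrum.csv
print(binomial_path("Acer rubrum", 2))  # Ac/Acer/Acer rubrum.csv
```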

cjgrady commented May 11, 2022

For something that doesn't have any stand-alone information, like a GBIF accepted taxon id, I don't see a better option than just chunking it by some number of characters.
123456789 -> 123/456/123456789.csv
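A sketch of that fixed-width chunking, assuming three characters per directory and two directory levels as in the example. `chunked_path`, `width`, and `levels` are hypothetical names.

```python
from pathlib import PurePosixPath


def chunked_path(key: str, width: int = 3, levels: int = 2) -> PurePosixPath:
    """Build a path whose directories are fixed-width chunks of the key.

    Hypothetical helper: the first `levels` chunks of `width` characters
    become directories; the file keeps the full key as its name.
    """
    dirs = [key[i * width:(i + 1) * width] for i in range(levels)]
    dirs = [d for d in dirs if d]  # drop empty chunks when the key is short
    return PurePosixPath(*dirs, f"{key}.csv")


print(chunked_path("123456789"))  # 123/456/123456789.csv
```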

cjgrady commented May 11, 2022

It would be easiest to default to chunking by some number of characters when needed, but is that acceptable when the field holds some human-discernible value? Is there a solution that works no matter what the data is and still retains helpful information?

Maybe we could do the following (moving to the next step as needed):

  1. Don't chunk (retains the most information) -> Acer rubrum.csv or 1234567890.csv
  2. Chunk by splitting on space (retains genus name so still clear, would not help numeric values) -> Acer/Acer rubrum.csv or 1234567890/1234567890.csv
  3. Chunk by space first, then by first letter (picks up fields without spaces) -> A/Acer/Acer rubrum.csv or 1/1234567890.csv
  4. Chunk by second (third, fourth, etc) character -> A/Ac/Acer/Acer rubrum.csv or 1/12/1234567890.csv
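The escalation above could be sketched roughly as follows. This is only one reading of the four steps, written to reproduce the example paths listed; `key_path`, `max_entries`, `choose_level`, and the `max_level` cap are all hypothetical names. It assumes that both files and subdirectories count toward the per-directory limit.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def key_path(key: str, level: int) -> PurePosixPath:
    """Path for a key at a chunking level (0 = flat, 1 = split on space,
    2 = first letter, 3+ = one more leading character per level)."""
    token = key.split(" ", 1)[0]
    if level == 0:
        dirs = []
    elif level == 1:
        dirs = [token]
    else:
        dirs = [token[:n] for n in range(1, level)]
        if token != key:  # keep the genus-style directory only if there was a space
            dirs.append(token)
    return PurePosixPath(*dirs, f"{key}.csv")


def max_entries(paths) -> int:
    """Largest number of direct children (files or directories) in any directory."""
    children = defaultdict(set)
    for path in paths:
        parts = path.parts
        for depth in range(len(parts)):
            children["/".join(parts[:depth])].add(parts[depth])
    return max((len(c) for c in children.values()), default=0)


def choose_level(keys, max_files: int, max_level: int = 6) -> int:
    """Return the lowest level whose layout stays within max_files per directory."""
    distinct = set(keys)
    for level in range(max_level + 1):
        if max_entries([key_path(k, level) for k in distinct]) <= max_files:
            return level
    return max_level  # give up escalating; the caller may want to warn here


keys = ["Acer rubrum", "Acer saccharum", "Abies alba", "Quercus alba"]
print(choose_level(keys, max_files=4))  # 0
print(choose_level(keys, max_files=2))  # 2
print(key_path("Acer rubrum", 3))       # A/Ac/Acer/Acer rubrum.csv
```

Choosing the lowest level that fits keeps as much human-readable information in the path as the data allows, which is the intent of moving to the next step only as needed.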

@cjgrady cjgrady removed this from the Version-3.2 milestone May 25, 2022