Automatically avoid having too many files in a directory when splitting occurrences #284

cjgrady · 2022-05-11T20:36:22Z

Is your feature request related to a problem? Please describe.
Aimee suggests that it is too complicated for a user to know how and why they should configure the split_occurrences tool to group their occurrence records so that there are not too many files in a directory.

Describe the solution you'd like
The tool should detect when the number of files generated in a directory (assume one per species) passes a threshold, configurable or not. When that threshold is reached, it should chunk the key field to create a directory hierarchy that reduces the number of files in any single directory. Ideally, the directories created make sense to the user.

Describe alternatives you've considered

Leave things as they are and explain in the documentation how the tool should be configured for larger datasets.
Always chunk the key field(s).

cjgrady · 2022-05-11T20:43:06Z

For species binomials, I think a clean structure would be something like {First letter of genus}/{Genus}/{Species name}.csv, or maybe the top level is the first two letters of the genus.
Acer rubrum -> A/Acer/Acer rubrum.csv or Ac/Acer/Acer rubrum.csv
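A rough sketch of that layout, to make the idea concrete. The function name `binomial_path` and the `prefix_len` parameter are made up for illustration and are not part of split_occurrences; it also assumes the genus is everything before the first space.

```python
from pathlib import PurePosixPath


def binomial_path(species_name: str, prefix_len: int = 1) -> PurePosixPath:
    """Build {genus prefix}/{Genus}/{Species name}.csv for a binomial.

    Hypothetical helper: assumes the genus is the text before the first space.
    """
    genus = species_name.split(" ", 1)[0]
    # The top-level directory is the first `prefix_len` letters of the genus.
    return PurePosixPath(genus[:prefix_len], genus, f"{species_name}.csv")


print(binomial_path("Acer rubrum"))                # A/Acer/Acer rubrum.csv
print(binomial_path("Acer rubrum", prefix_len=2))  # Ac/Acer/Acer rubrum.csv
```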

cjgrady · 2022-05-11T20:45:28Z

For something that doesn't have any stand-alone information, like a GBIF accepted taxon id, I don't see a better option than just chunking it by some number of characters.
123456789 -> 123/456/123456789.csv
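Something like the following could do the fixed-width chunking. This is only a sketch: `chunked_id_path`, `chunk_size`, and `levels` are hypothetical names, and it assumes the key is already a string.

```python
from pathlib import PurePosixPath


def chunked_id_path(key: str, chunk_size: int = 3, levels: int = 2) -> PurePosixPath:
    """Build a path whose directories are consecutive fixed-width slices of the key.

    Hypothetical helper: keys shorter than chunk_size * levels simply get
    fewer directory levels.
    """
    chunks = [key[i * chunk_size:(i + 1) * chunk_size] for i in range(levels)]
    # Drop empty slices so short keys do not produce empty path components.
    chunks = [chunk for chunk in chunks if chunk]
    return PurePosixPath(*chunks, f"{key}.csv")


print(chunked_id_path("123456789"))  # 123/456/123456789.csv
```

One thing to watch with this approach: IDs of different lengths end up at different depths, so zero-padding the key first might be worth considering.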

cjgrady · 2022-05-11T21:00:17Z

It would be easiest to default to chunking by some number of characters when needed, but is that acceptable when the field holds some human-discernible value? Is there a solution that works no matter what the data is and still retains helpful information?

Maybe we could do the following (moving to the next step as needed):

1. Don't chunk (retains the most information) -> Acer rubrum.csv or 1234567890.csv
2. Chunk by splitting on space (retains genus name so still clear, would not help numeric values) -> Acer/Acer rubrum.csv or 1234567890/1234567890.csv
3. Chunk by space first, then by first letter (picks up fields without spaces) -> A/Acer/Acer rubrum.csv or 1/1234567890.csv
4. Chunk by second (third, fourth, etc.) character -> A/Ac/Acer/Acer rubrum.csv or 1/12/1234567890.csv
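Here is a minimal sketch of how that escalation could be chosen automatically. Everything in it is an assumption for illustration: the names `dir_parts`, `max_entries_per_dir`, and `choose_level`, the `max_files` threshold, and the choice to count both files and subdirectories as entries. Level 0 means no chunking, level 1 splits on space, level 2 adds the first letter, and each level after that adds one more leading character.

```python
from collections import defaultdict


def dir_parts(key: str, level: int) -> list[str]:
    """Directory components for a key at a given escalation level."""
    if level == 0:
        return []
    first_token = key.split(" ", 1)[0]
    if level == 1:
        return [first_token]
    # Leading-character prefixes: "A" at level 2, "A", "Ac" at level 3, ...
    prefixes = [first_token[:n] for n in range(1, level)]
    # Keep the token directory only when the key actually had a space.
    return prefixes + ([first_token] if " " in key else [])


def max_entries_per_dir(keys: list[str], level: int) -> int:
    """Largest number of direct children (files or subdirectories) in any directory."""
    children = defaultdict(set)
    for key in keys:
        parent = ()
        for part in dir_parts(key, level) + [f"{key}.csv"]:
            children[parent].add(part)
            parent = parent + (part,)
    return max((len(names) for names in children.values()), default=0)


def choose_level(keys: list[str], max_files: int, max_level: int = 6) -> int:
    """Return the first level that keeps every directory at or under max_files."""
    for level in range(max_level + 1):
        if max_entries_per_dir(keys, level) <= max_files:
            return level
    return max_level  # Give up escalating; the caller can decide what to do.


species = ["Acer rubrum", "Acer saccharum", "Quercus alba", "Quercus rubra"]
print(choose_level(species, max_files=2))  # 1
```

This needs the full set of keys up front (or a first pass over the data), which may or may not fit how the tool streams records.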

cjgrady self-assigned this May 11, 2022

cjgrady added the enhancement (New feature or request) and user (Tasks related to users and user experience) labels May 11, 2022

cjgrady added this to the Version-3.2 milestone May 11, 2022

cjgrady removed this from the Version-3.2 milestone May 25, 2022
