Reduce number of unwanted very large downloads #1931

Open
MattBlissett opened this issue Apr 18, 2024 · 1 comment
MattBlissett commented Apr 18, 2024

Many of the very largest downloads (>100GB) are requested, but then never downloaded by the user. This is a significant waste of our resources and the user's time, and is probably frustrating for the user when they realize they cannot use GBIF data as they had hoped.

We have popup banners for large downloads and for downloads covering most or all occurrences, but since those were implemented, cookie banners have spread all over the web, so users are even less likely to read them. I suggest instead changing the download page itself, providing different options and in some cases removing the existing DWCA and Simple options.

Ideas (more were added later):

  • If the download is extremely large, e.g. more than half the total dataset, direct the user to the existing monthly downloads and the cloud-hosted monthly snapshots, along with encouragement to register a derived dataset later.
  • For filters matching 500 million or more rows, remove or disable the DWCA and Simple buttons, and instead provide the predicate for creating the download through the API (see the sketch after this list).
    • We already have a UI form for entering predicates and downloading through the UI, so this suggestion might not help much, as that form is quite easy to use and requires no technical skills.
  • In either case, advise the user to add additional filters, perhaps with a concrete suggestion: "You might add a filter for a taxon, location or date."
  • Always include information about creating a derived dataset if you do post-filtering.
  • Avoid large blocks of text, which are now common in web popups (cookies etc.), and format with some icons for Excel, R etc.
  • Do not mint DOIs unless the file is downloaded.
  • Force users to select between TEST_DOWNLOAD | STORE_COPY_FOR_CITATION
  • Allow users to get a random sample within a filter. The idea being that perhaps they mostly want to test that they can work with the data.
  • Stop serving live data; always use a snapshot that is at most one month old. All downloads then become simply a filter plus a snapshot version reference (publishers can test data in UAT).
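For the predicate suggestion above, a minimal sketch of what such an API request could look like. The username, password, notification address and filter values are placeholders, and the request body should be checked against the occurrence download API documentation:

```python
# Sketch: request an occurrence download from a predicate instead of the
# web UI's DWCA/Simple buttons. Placeholders must be replaced with real
# credentials and a real filter.
import requests

GBIF_API = "https://api.gbif.org/v1/occurrence/download/request"

request_body = {
    "creator": "your_gbif_username",            # placeholder username
    "notificationAddresses": ["you@example.org"],
    "sendNotification": True,
    "format": "SIMPLE_CSV",                     # or "DWCA"
    "predicate": {                              # the filter, expressed as a predicate
        "type": "and",
        "predicates": [
            {"type": "equals", "key": "COUNTRY", "value": "DK"},
            {"type": "equals", "key": "YEAR", "value": "2023"},
        ],
    },
}

response = requests.post(
    GBIF_API,
    json=request_body,
    auth=("your_gbif_username", "your_password"),  # downloads require authentication
)
response.raise_for_status()
print("Download key:", response.text)  # the API responds with a download key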
CecSve commented May 6, 2024

I think the following options are good suggestions:

> Force users to select between TEST_DOWNLOAD | STORE_COPY_FOR_CITATION

However, the difference between the two should somehow be made clear.

> Do not mint DOIs unless the file is downloaded. And delete the file if not downloaded within 6 months.

Although it is a good idea, I am not sure how it could be coupled with the above suggestion.
How about we define a threshold and make a policy that we do not store data above this threshold for more than XX days/months unless the user actively requests us to?

> If the download is extremely large, e.g. more than half the total dataset, direct the user to the existing monthly downloads and the cloud-hosted monthly snapshots, along with encouragement to register a derived dataset later.

> In either case, advise the user to add additional filters, perhaps with a concrete suggestion: "You might add a filter for a taxon, location or date."

Maybe we could again set a threshold, e.g. a minimum of 3-5 applied filters, and show this message otherwise?

> Always include information about creating a derived dataset if you do post-filtering.

Yes, this would be helpful from a helpdesk perspective - it is one of the more common questions that pop up, and it would be great if more users could be made aware of the option.

@ahahn-gbif and @timrobertson100 what do you think about the suggestions?
