isolated? error in occurrence downloads api: too big count of records per dataset (they seem to be duplicated) #3187

Open
abubelinha opened this issue Jan 8, 2021 · 5 comments

@abubelinha

When looking at the number of records downloaded last year from a certain dataset, I found some odd results which I don't know how to interpret.

As reported in issue #2385, it's difficult to link to a particular download event using the offset and limit parameters, because records are returned in a useless order (the oldest should come first, so that a given offset always points to the same record).
But I made this snapshot (content pasted below) so you can see the particular JSON record which reflects the problem:

The occurrence dataset 17a729a0-7ed1-11df-8c4a-0800200c9a66 had 30952 records (as reported by its IPT resource version 1.2) as of June 25th, 2020.
How is it possible that a particular download event on that date (0008123-200613084148143) reports 61904 records coming from this dataset?
The only explanation I can see is that those records are present twice within the same download event/file (61904 = 30952 x 2).
But how can this happen? Wouldn't it be a bug?

The same happens to dataset 10734a60-7ed1-11df-8c4a-0800200c9a66 (10409 records provided by IPT, but 20818 reported in the same download event).
So it looks like some other (all?) datasets included in that download file could also be duplicated (I have not checked this: it's too big for me to process, so I just looked at a couple of datasets).
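The arithmetic above can be checked mechanically. The following is only a minimal sketch: the function name `inflation_factor` is hypothetical, and the numbers are the two dataset counts quoted in this issue (IPT count versus the count reported for the download event).

```python
from typing import Optional


def inflation_factor(reported: int, expected: int) -> Optional[int]:
    """Return N if reported is exactly N times expected, else None.

    An exact integer multiple greater than 1 hints at whole-dataset
    duplication rather than a handful of stray extra records.
    """
    if expected <= 0:
        return None
    quotient, remainder = divmod(reported, expected)
    return quotient if remainder == 0 else None


# Counts quoted in this issue: (reported in download, published via IPT).
observed = {
    "17a729a0-7ed1-11df-8c4a-0800200c9a66": (61904, 30952),
    "10734a60-7ed1-11df-8c4a-0800200c9a66": (20818, 10409),
}

for dataset_key, (reported, expected) in observed.items():
    print(dataset_key, inflation_factor(reported, expected))
```

Both datasets come out as an exact factor of 2, which is what suggests duplication within the file rather than a genuine change in the dataset.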

{"offset":1076,"limit":1,"endOfRecords":false,"count":8657,
"results":[{"downloadKey":"0008123-200613084148143","datasetKey":"17a729a0-7ed1-11df-8c4a-0800200c9a66","datasetTitle":"SANT-Algae","datasetDOI":"10.15468/bxbkba","numberRecords":61904,
"download":{"key":"0008123-200613084148143","doi":"10.15468/dl.qe3j25","license":"http://creativecommons.org/licenses/by-nc/4.0/legalcode",
"request":{"predicate":{"type":"equals","key":"BASIS_OF_RECORD","value":"PRESERVED_SPECIMEN","matchCase":false},"sendNotification":true,"format":"SIMPLE_CSV"},"created":"2020-06-25T05:41:36.317+0000","modified":"2020-06-25T05:48:30.439+0000","eraseAfter":"2020-12-25T05:41:36.255+0000","status":"SUCCEEDED","downloadLink":"https://api.gbif.org/v1/occurrence/download/request/0008123-200613084148143.zip","size":18361726949,"totalRecords":175881988,"numberDatasets":11658}}]}

https://archive.is/m67ha
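Since a fixed offset does not reliably point to the same record, one workaround is to page through the results and match on `downloadKey` instead. This is a sketch under assumptions: `fetch_page` is a hypothetical callable supplied by the caller that returns one page of results shaped like the JSON above (`results`, `endOfRecords`); how it actually calls the API is left out.

```python
from typing import Callable, Optional


def find_download_record(
    fetch_page: Callable[[int, int], dict],
    download_key: str,
    page_size: int = 100,
) -> Optional[dict]:
    """Scan pages until a record with the given downloadKey is found.

    fetch_page(offset, limit) is assumed to return a dict with the
    'results' and 'endOfRecords' fields seen in the snapshot above.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        results = page.get("results", [])
        for record in results:
            if record.get("downloadKey") == download_key:
                return record
        # Stop when the API says there is nothing more, or a page is empty.
        if page.get("endOfRecords", True) or not results:
            return None
        offset += page_size
```

Matching on the key is slower than jumping to an offset, but the result does not depend on the order in which records are returned.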

The point is: does this particular case have an explanation, other than it being a buggy download event?
I am interested in this because it affects the interpretation of institutions' annual reports of GBIF data downloads.

  • If it is a bug, how frequent could it be?
  • If it is not a bug, what does it mean?

Thanks a lot in advance

@jlegind

jlegind commented Mar 2, 2021

SQL: SELECT dod.number_records FROM public.dataset_occurrence_download dod WHERE dod.download_key = '0008123-200613084148143' AND dod.dataset_key = '17a729a0-7ed1-11df-8c4a-0800200c9a66';

and

https://api.gbif.org/v1/occurrence/count?datasetKey=17a729a0-7ed1-11df-8c4a-0800200c9a66
confirms this.
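For anyone repeating this comparison, the count URL above can be built and queried from a script. This is a sketch, not GBIF's own tooling: the helper names are hypothetical, and it assumes the count endpoint returns a bare integer in the response body. The `fetch` parameter lets the HTTP call be replaced in tests.

```python
import json
from typing import Callable, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

COUNT_URL = "https://api.gbif.org/v1/occurrence/count"


def count_url(dataset_key: str) -> str:
    """Build the occurrence count URL for one dataset."""
    return f"{COUNT_URL}?{urlencode({'datasetKey': dataset_key})}"


def current_count(
    dataset_key: str,
    fetch: Optional[Callable[[str], str]] = None,
) -> int:
    """Return the current occurrence count for a dataset.

    Assumes the response body is a bare integer (an assumption here).
    """
    if fetch is None:
        def fetch(url: str) -> str:
            with urlopen(url) as response:
                return response.read().decode("utf-8")
    return int(json.loads(fetch(count_url(dataset_key))))


def is_inflated(number_records: int, dataset_count: int) -> bool:
    """True when a download claims more records than the dataset holds."""
    return number_records > dataset_count
```

Note that the count is the dataset's size today, so a dataset that has shrunk since the download was made would also be flagged.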

@jlegind

jlegind commented Mar 4, 2021

The issue around download record inflation must have been solved, since I made the exact same download on 03-03-2021 and the SANT-Algae contribution was the correct number.

@jlegind jlegind closed this as completed Aug 17, 2021
@abubelinha
Author

abubelinha commented Oct 28, 2021

@jlegind I don't think one correct download count is a reason to close the issue.

I hope most of the download event counts are correct. But that one was not.
And that error makes me question how many others are not correct.

If we don't know why it happened, we don't know whether it could be happening again and again while we keep making reports based on those counts.

In other words: unexplained accounting errors make our annual reports untrustworthy.

@ManonGros ManonGros reopened this Oct 29, 2021
@MattBlissett
Member

gbif/occurrence#28 is an issue from a while ago which was causing duplicated data in downloads. The download you give is newer; perhaps gbif/occurrence#267 is the same problem.

I suspect it happens during the nightly process (usually at some point between 05:30 and 06:00 UTC) which rebuilds the data table from which large downloads are created. Something with the locking mechanism isn't working correctly.
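The timestamps in the JSON record quoted earlier can be checked against that window. This is a small sketch with a hypothetical function name; the 05:30 to 06:00 UTC bounds are only the "usual" times mentioned above, so a match is consistent with the suspicion but does not prove it.

```python
from datetime import datetime, time, timezone


def in_nightly_window(
    timestamp: str,
    start: time = time(5, 30),
    end: time = time(6, 0),
) -> bool:
    """True if an API timestamp falls inside the given UTC time window.

    Expects the format used in the download record,
    e.g. '2020-06-25T05:41:36.317+0000'.
    """
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%f%z")
    utc_time = parsed.astimezone(timezone.utc).time()
    return start <= utc_time <= end


# 'created' and 'modified' values from the record in this issue.
print(in_nightly_window("2020-06-25T05:41:36.317+0000"))
print(in_nightly_window("2020-06-25T05:48:30.439+0000"))
```

Both the creation and the completion time of download 0008123-200613084148143 fall inside the window, so the download ran entirely during the period in which the table is usually rebuilt.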

I would be cautious in putting too much meaning in the precise number of records downloaded by users. Depending on their work, a user might make several downloads and only use one, or make a large download and immediately discard most of the records in local processing.

@abubelinha
Author

abubelinha commented Nov 7, 2021

> I would be cautious in putting too much meaning in the precise number of records downloaded by users. Depending on their work, a user might make several downloads and only use one, or make a large download and immediately discard most of the records in local processing.

Yes, of course. We know those numbers mean almost nothing, because what users do with downloads is far beyond our knowledge. Any report based on them has to make that clear.

But on the other hand, download counts are the only information we have for reporting on the usage of our GBIF data. Am I wrong?
So we should at least know whether those numbers are correct or not.

I mentioned usage reports because I already produce them each year.
But other possible use cases come to mind (like those in this comment), which would depend on download counts being accurate. So it's good to know before spending too much time developing things which could end up failing because of this.
