isolated? error in occurrence downloads api: too big count of records per dataset (they seem to be duplicated) #3187

abubelinha · 2021-01-08T00:09:50Z

When looking at the number of records downloaded last year from a certain dataset, I found some odd results which I don't know how to interpret.

As reported in issue #2385, it's difficult to link to a particular download event using the offset and limit parameters, because records are returned in a useless order (the oldest should come first, so that a given offset always points to the same record).
So I made this snapshot (content pasted below) so you can see the particular JSON record which shows the problem:

The occurrence dataset 17a729a0-7ed1-11df-8c4a-0800200c9a66 had 30952 records (as reported by its IPT resource version 1.2) as of June 25th 2020.
How is it possible that a particular download event on that date (0008123-200613084148143) reports 61904 records coming from this dataset?
The only explanation I can see is that those records are present twice within the same download event/file (61904 = 30952 x 2).
But how can this happen? Wouldn't it be a bug?

The same happens to dataset 10734a60-7ed1-11df-8c4a-0800200c9a66 (10409 records provided by IPT, but 20818 reported in the same download event).
So it looks like some other (all?) datasets included in that download file could also be duplicated (I have not checked this: it's too big for me to process, so I just looked at a couple of datasets).

{"offset":1076,"limit":1,"endOfRecords":false,"count":8657,
"results":[{"downloadKey":"0008123-200613084148143","datasetKey":"17a729a0-7ed1-11df-8c4a-0800200c9a66","datasetTitle":"SANT-Algae","datasetDOI":"10.15468/bxbkba","numberRecords":61904,
"download":{"key":"0008123-200613084148143","doi":"10.15468/dl.qe3j25","license":"http://creativecommons.org/licenses/by-nc/4.0/legalcode",
"request":{"predicate":{"type":"equals","key":"BASIS_OF_RECORD","value":"PRESERVED_SPECIMEN","matchCase":false},"sendNotification":true,"format":"SIMPLE_CSV"},"created":"2020-06-25T05:41:36.317+0000","modified":"2020-06-25T05:48:30.439+0000","eraseAfter":"2020-12-25T05:41:36.255+0000","status":"SUCCEEDED","downloadLink":"https://api.gbif.org/v1/occurrence/download/request/0008123-200613084148143.zip","size":18361726949,"totalRecords":175881988,"numberDatasets":11658}}]}

https://archive.is/m67ha
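
The doubling can be checked mechanically from the record above. This is only a minimal sketch: it assumes the IPT record counts quoted in this comment are the reference values, and the function name `inflation_factor` is made up for illustration.

```python
import json

# Trimmed copy of the record pasted above; only the fields used here.
record_json = """
{"downloadKey": "0008123-200613084148143",
 "datasetKey": "17a729a0-7ed1-11df-8c4a-0800200c9a66",
 "numberRecords": 61904}
"""

def inflation_factor(reported: int, expected: int) -> float:
    """Ratio of the records reported for a download to the dataset's own count."""
    if expected <= 0:
        raise ValueError("expected must be a positive record count")
    return reported / expected

record = json.loads(record_json)
ipt_count = 30952  # assumption: count taken from IPT resource version 1.2
factor = inflation_factor(record["numberRecords"], ipt_count)
print(factor)  # 2.0
```

The second dataset mentioned above gives the same ratio: `inflation_factor(20818, 10409)` is also 2.0.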

The point is: does this particular case have an explanation, other than it being a buggy download event?
I am interested in this because it affects the interpretation of institutions' annual reports of GBIF data downloads.

If it is a bug, how frequent could it be?
If it is not a bug, what does it mean?

Thanks a lot in advance


jlegind · 2021-03-02T15:15:00Z

SQL: SELECT dod.number_records FROM public.dataset_occurrence_download dod WHERE dod.download_key = '0008123-200613084148143' AND dod.dataset_key = '17a729a0-7ed1-11df-8c4a-0800200c9a66';

and

https://api.gbif.org/v1/occurrence/count?datasetKey=17a729a0-7ed1-11df-8c4a-0800200c9a66
confirms this.
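
For anyone who wants to repeat the comparison for other datasets, here is a rough sketch in Python. The count URL is the one given above; the helper names are hypothetical, and it is an assumption that the endpoint returns a bare integer in the response body. Note that the live count reflects the dataset today, not on the day of the download, so this is only a heuristic.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

COUNT_ENDPOINT = "https://api.gbif.org/v1/occurrence/count"

def count_url(dataset_key: str) -> str:
    """Build the occurrence count URL for one dataset."""
    return f"{COUNT_ENDPOINT}?{urlencode({'datasetKey': dataset_key})}"

def fetch_current_count(dataset_key: str) -> int:
    """Fetch the current occurrence count (network call, not run here)."""
    # Assumption: the response body is a bare integer.
    with urlopen(count_url(dataset_key), timeout=30) as resp:
        return int(resp.read().decode("utf-8").strip())

def is_inflated(download_records: int, current_count: int) -> bool:
    """A download reporting more records than the dataset holds is suspicious."""
    return download_records > current_count

print(count_url("17a729a0-7ed1-11df-8c4a-0800200c9a66"))
```

Usage would be something like `is_inflated(61904, fetch_current_count("17a729a0-7ed1-11df-8c4a-0800200c9a66"))`, with `61904` taken from the `number_records` column or the `numberRecords` field of the download record.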

jlegind · 2021-03-04T11:59:29Z

The issue around download record inflation must have been solved, since I made the exact same download on 03-03-2021 and the SANT-Algae contribution was the correct number.

abubelinha · 2021-10-28T20:52:38Z

@jlegind I don't think one correct download count is a reason to close the issue.

I hope most of the download event counts are correct. But that one was not.
And that error makes me question how many others are not correct.

If we don't know why it happened, we don't know whether it could be happening again and again while we keep on making reports based on those counts.

In other words: unexplained counting errors make our annual reports untrustworthy.

MattBlissett · 2021-10-29T14:07:26Z

gbif/occurrence#28 is an issue from a while ago which was causing duplicated data in downloads. The download you give is newer, perhaps gbif/occurrence#267 is the same problem.

I suspect it happens during the nightly (usually some point between 05:30-06:00 UTC) process which rebuilds our data table from which large downloads are created. Something with the locking mechanism isn't working correctly.
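
To see whether a given download falls inside that window, its `created` timestamp can be compared against it. A minimal sketch, assuming the window is exactly 05:30-06:00 UTC (it is only "usually" then) and using a hypothetical function name; the timestamp is the `created` value from the record pasted in the first comment.

```python
from datetime import datetime, time, timezone

# Assumed bounds of the nightly rebuild; the real window can vary.
WINDOW_START = time(5, 30)
WINDOW_END = time(6, 0)

def created_in_rebuild_window(created: str) -> bool:
    """True if an API-style timestamp falls within 05:30-06:00 UTC."""
    ts = datetime.strptime(created, "%Y-%m-%dT%H:%M:%S.%f%z")
    utc_time = ts.astimezone(timezone.utc).time()
    return WINDOW_START <= utc_time <= WINDOW_END

print(created_in_rebuild_window("2020-06-25T05:41:36.317+0000"))  # True
```

The download in question was created at 05:41:36 UTC, so it is consistent with this suspicion, though that alone does not prove the cause.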

I would be cautious in putting too much meaning in the precise number of records downloaded by users. Depending on their work, a user might make several downloads and only use one, or make a large download and immediately discard most of the records in local processing.

abubelinha · 2021-11-07T19:47:09Z

> I would be cautious in putting too much meaning in the precise number of records downloaded by users. Depending on their work, a user might make several downloads and only use one, or make a large download and immediately discard most of the records in local processing.

Yes, of course. We know those numbers mean almost nothing, because what users do with downloads is far beyond our knowledge. Any report based on them has to make that clear.

But on the other hand, download counts are the only information we have for reporting on the usage of our GBIF data. Am I wrong?
So we should at least know whether those numbers are correct or not.

And I mentioned usage reports because I already make them each year.
But other possible use cases come to mind (like those in this comment), which would depend on download counts being accurate. So it's good to know before spending too much time developing things which could end up failing because of this.

MortenHofft added the question label Jan 11, 2021

jlegind closed this as completed Aug 17, 2021

ManonGros reopened this Oct 29, 2021

CecSve mentioned this issue Dec 14, 2022

Occurrence count incorrect in dataset search TSV export #4468

Open
