isolated? error in occurrence downloads api: too big count of records per dataset (they seem to be duplicated) #3187
Comments
SQL: SELECT dod.number_records FROM public.dataset_occurrence_download dod WHERE dod.download_key = '0008123-200613084148143' AND dod.dataset_key = '17a729a0-7ed1-11df-8c4a-0800200c9a66'; and https://api.gbif.org/v1/occurrence/count?datasetKey=17a729a0-7ed1-11df-8c4a-0800200c9a66
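The comparison above can be sketched in Python. This is only an illustration: it assumes the count URL quoted in the comment returns a bare JSON number, and the helper names `fetch_current_count` and `surplus_records` are hypothetical. The live count reflects the dataset today, not on the day of the download, and the download was filtered by basis of record, so a non-zero surplus is a hint rather than proof.

```python
import json
from urllib.request import urlopen

# URL pattern taken from the comment above; the response is assumed to be a bare number.
COUNT_URL = "https://api.gbif.org/v1/occurrence/count?datasetKey={key}"


def fetch_current_count(dataset_key: str) -> int:
    """Fetch the current occurrence count for a dataset (network call)."""
    with urlopen(COUNT_URL.format(key=dataset_key), timeout=30) as resp:
        return int(json.loads(resp.read().decode("utf-8")))


def surplus_records(recorded: int, current: int) -> int:
    """Records attributed to the dataset by the download beyond its current size."""
    return recorded - current


# Example with the figures quoted in this issue (no network needed):
print(surplus_records(61904, 30952))
```

To check against the live index instead, pass `fetch_current_count("17a729a0-7ed1-11df-8c4a-0800200c9a66")` as the second argument.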
The download record inflation issue must have been solved, since I made the exact same download on 03-03-2021 and the SANT-Algae contribution was the correct number.
@jlegind I don't think one correct download count is a reason to close the issue. I hope most download event counts are correct, but that one was not. If we don't know why it happened, we don't know whether it could be happening again and again while we keep making reports based on those counts. In other words: unexplained accounting errors make our annual reports untrustworthy.
gbif/occurrence#28 is an issue from a while ago which was causing duplicated data in downloads. The download you give is newer, perhaps gbif/occurrence#267 is the same problem. I suspect it happens during the nightly (usually some point between 05:30-06:00 UTC) process which rebuilds our data table from which large downloads are created. Something with the locking mechanism isn't working correctly. I would be cautious in putting too much meaning in the precise number of records downloaded by users. Depending on their work, a user might make several downloads and only use one, or make a large download and immediately discard most of the records in local processing.
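As a rough way to test that suspicion against a given download, the sketch below checks whether a timestamp falls inside the 05:30-06:00 UTC window mentioned above. The function name `in_nightly_window` is hypothetical, and the window bounds are only the "usual" ones from this comment. The two timestamps are the `created` and `modified` values reported for download 0008123-200613084148143 in this issue. A hit is consistent with the rebuild theory but does not confirm it.

```python
from datetime import datetime, time, timezone

# Approximate nightly rebuild window quoted in the comment (an assumption, not a spec).
NIGHTLY_START = time(5, 30)
NIGHTLY_END = time(6, 0)


def in_nightly_window(timestamp: str) -> bool:
    """Return True if an API timestamp like 2020-06-25T05:41:36.317+0000 is in the window."""
    moment = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%f%z")
    utc_time = moment.astimezone(timezone.utc).time()
    return NIGHTLY_START <= utc_time <= NIGHTLY_END


print(in_nightly_window("2020-06-25T05:41:36.317+0000"))  # created  -> True
print(in_nightly_window("2020-06-25T05:48:30.439+0000"))  # modified -> True
```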
Yes, of course. We know those numbers mean almost nothing, because what users do with downloads is far beyond our knowledge. Any report based on them has to clarify that. On the other hand, download counts are the only information we have for reporting on usage of our GBIF data. Am I wrong? And I mentioned usage reports because I already make them each year.
When looking at the number of records downloaded last year from a certain dataset, I found some odd results which I don't know how to interpret.
As reported in issue #2385, it's difficult to link to a particular download event using the `offset` and `limit` parameters, because records are returned in a useless order (the oldest should come first, so that `offset` could always point to the same record). But I made this snapshot (content pasted below) so you can see the particular JSON record which reflects the problem:
The occurrence dataset 17a729a0-7ed1-11df-8c4a-0800200c9a66 had 30952 records (as reported by its IPT resource version 1.2) as of June 25th, 2020.
How is it possible that a particular download event on that date (0008123-200613084148143) reports 61904 records coming from this dataset?
The only explanation for that is those records being present twice within the same download event/file (61904 = 30952 x 2).
But how can this happen? Wouldn't it be a bug?
The same happens to dataset 10734a60-7ed1-11df-8c4a-0800200c9a66 (10409 records provided by IPT, but 20818 reported in the same download event).
So it looks like some other (all?) datasets included in that download file could also be duplicated (I have not checked this: it's too big for me to process, so I just looked at a couple of datasets).
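The doubling can be verified with a few lines of Python. This is a minimal sketch using only the figures quoted above; `inflation_factor` is a hypothetical helper name. A remainder of zero with a factor of 2 means the reported count is exactly twice the published count.

```python
from typing import Tuple


def inflation_factor(reported: int, expected: int) -> Tuple[int, int]:
    """Return (whole-number factor, remainder) of reported vs expected records."""
    return divmod(reported, expected)


# (reported in the download event, published by the IPT), as quoted in this issue
cases = {
    "17a729a0-7ed1-11df-8c4a-0800200c9a66": (61904, 30952),
    "10734a60-7ed1-11df-8c4a-0800200c9a66": (20818, 10409),
}

for dataset_key, (reported, expected) in cases.items():
    factor, remainder = inflation_factor(reported, expected)
    print(dataset_key, "factor:", factor, "remainder:", remainder)
```

Both datasets come out as factor 2 with remainder 0.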
{"offset":1076,"limit":1,"endOfRecords":false,"count":8657,
"results":[{"downloadKey":"0008123-200613084148143","datasetKey":"17a729a0-7ed1-11df-8c4a-0800200c9a66","datasetTitle":"SANT-Algae","datasetDOI":"10.15468/bxbkba","numberRecords":61904,
"download":{"key":"0008123-200613084148143","doi":"10.15468/dl.qe3j25","license":"http://creativecommons.org/licenses/by-nc/4.0/legalcode",
"request":{"predicate":{"type":"equals","key":"BASIS_OF_RECORD","value":"PRESERVED_SPECIMEN","matchCase":false},"sendNotification":true,"format":"SIMPLE_CSV"},"created":"2020-06-25T05:41:36.317+0000","modified":"2020-06-25T05:48:30.439+0000","eraseAfter":"2020-12-25T05:41:36.255+0000","status":"SUCCEEDED","downloadLink":"https://api.gbif.org/v1/occurrence/download/request/0008123-200613084148143.zip","size":18361726949,"totalRecords":175881988,"numberDatasets":11658}}]}
https://archive.is/m67ha
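Since an `offset` value does not reliably point to the same record, one workaround is to page through the usage records and match on `downloadKey`. The sketch below is hedged: `find_usage_record` and `make_fetcher` are hypothetical helpers, and the endpoint path in `make_fetcher` is an assumption about which URL produced the snapshot above. If the ordering changes between page requests, a record could still be missed.

```python
import json
from typing import Callable, Dict, Optional
from urllib.request import urlopen


def make_fetcher(dataset_key: str) -> Callable[[int, int], Dict]:
    """Build a page fetcher. The URL pattern is an assumption, not confirmed by this issue."""
    base = "https://api.gbif.org/v1/occurrence/download/dataset/" + dataset_key

    def fetch_page(offset: int, limit: int) -> Dict:
        url = "{}?offset={}&limit={}".format(base, offset, limit)
        with urlopen(url, timeout=30) as resp:
            return json.loads(resp.read().decode("utf-8"))

    return fetch_page


def find_usage_record(fetch_page: Callable[[int, int], Dict], download_key: str,
                      limit: int = 100, max_pages: int = 1000) -> Optional[Dict]:
    """Scan pages until a record with the given downloadKey is found, or pages run out."""
    offset = 0
    for _ in range(max_pages):
        page = fetch_page(offset, limit)
        results = page.get("results", [])
        for record in results:
            if record.get("downloadKey") == download_key:
                return record
        if page.get("endOfRecords", True) or not results:
            return None
        offset += limit
    return None
```

For example, `find_usage_record(make_fetcher("17a729a0-7ed1-11df-8c4a-0800200c9a66"), "0008123-200613084148143")` would return the record whose `numberRecords` field can then be compared with the dataset's published size.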
The point is: does this particular case have an explanation, other than it being a buggy download event?
I am interested in this because it affects the interpretation of institutions' annual reports of GBIF data downloads.
Thanks a lot in advance