Duplicate gbifIDs in BIONOMIA occurrence download #336

Closed
dshorthouse opened this issue Jan 13, 2024 · 9 comments

Comments

@dshorthouse
Contributor

dshorthouse commented Jan 13, 2024

A BIONOMIA download, https://doi.org/10.15468/dl.emvv7z (see https://github.com/gbif/occurrence/tree/master/occurrence-download/src/main/resources/download-workflow/bionomia), contains a heap of duplicate gbifIDs and I'm not sure how this was possible. I thought perhaps my logic in this relatively new request (for BIONOMIA), which sloughs off records with occurrenceStatus == ABSENT, was at fault, so I also did https://doi.org/10.15468/dl.b7hqhu. However, it too has a heap of duplicate gbifIDs, so I'm at a loss. Is there something odd happening in the production of these downloads that explains the duplicate records and that can be repaired at your end?

@dshorthouse
Contributor Author

Related? #267

@dshorthouse
Contributor Author

For reference, this one https://doi.org/10.15468/dl.8b63cr (using the same query as https://doi.org/10.15468/dl.emvv7z) has no duplicates.

@dshorthouse
Contributor Author

Much appreciated @MattBlissett @timrobertson100 if you've any insight. It's a blocker for a scheduled refresh at my end.

@MattBlissett
Member

In this case I don't think #267 is related. There is a discrepancy related to these datasets:

https://registry.gbif.org/dataset/6aeebd1a-c3ad-4bc5-bdfe-24de0e2e9052
https://registry.gbif.org/dataset/f534f8bf-b84c-4412-bfee-536c000528c6
https://registry.gbif.org/dataset/46810667-ac04-48a2-8cbe-3c54fc116e67
https://registry.gbif.org/dataset/c9853083-7c4e-4dc1-8a76-8c2ac0e05d58

It looks like a migration intended to keep identifiers stable has instead made a mess, and we now have the same identifier used for occurrences in different datasets, although it's essentially the same occurrence.

gbif/ingestion-management#858

I'll add some additional monitoring, with a daily check that SELECT COUNT(*) FROM occurrence returns the same as SELECT COUNT(*) FROM (SELECT gbifid FROM occurrence GROUP BY gbifid) q;.
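The check described above can be sketched as follows. This is only an illustration, assuming a toy in-memory SQLite table named `occurrence` with hypothetical `gbifid` and `datasetkey` columns and made-up rows; it is not the actual GBIF monitoring job. The helper `duplicate_gbifid_report` is a name invented for this example, and the extra `HAVING` query is an addition that lists which identifiers are duplicated.

```python
import sqlite3


def duplicate_gbifid_report(conn):
    """Compare the total row count with the number of distinct gbifid values."""
    total = conn.execute("SELECT COUNT(*) FROM occurrence").fetchone()[0]
    distinct = conn.execute(
        "SELECT COUNT(*) FROM (SELECT gbifid FROM occurrence GROUP BY gbifid) q"
    ).fetchone()[0]
    # Not part of the original check: list the identifiers that occur more than once.
    dupes = conn.execute(
        "SELECT gbifid, COUNT(*) FROM occurrence "
        "GROUP BY gbifid HAVING COUNT(*) > 1 ORDER BY gbifid"
    ).fetchall()
    return total, distinct, dupes


# Hypothetical toy data: gbifid 2 appears under two different datasets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrence (gbifid INTEGER, datasetkey TEXT)")
conn.executemany(
    "INSERT INTO occurrence VALUES (?, ?)",
    [(1, "dataset-a"), (2, "dataset-a"), (2, "dataset-b"), (3, "dataset-b")],
)

total, distinct, dupes = duplicate_gbifid_report(conn)
print(total, distinct, dupes)  # 4 3 [(2, 2)]
```

When the two counts differ, at least one gbifid is shared by more than one row, which is the situation reported in this issue.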

@ManonGros, could you work out what is supposed to have happened?

@dshorthouse
Contributor Author

For giggles @ManonGros @MattBlissett, I tried again https://doi.org/10.15468/dl.zcyyzs but still see duplicate gbifIDs.

@ManonGros

@MattBlissett The publishers sent a list of occurrenceIDs to be transferred to different datasets. I divided the list by destination dataset and ran the script. I am not sure exactly what happened, but I am happy to show you which files I used.

@ManonGros

@dshorthouse this should be fixed now. Could you let us know if this seems ok to you? Thanks!

@dshorthouse
Contributor Author

> @dshorthouse this should be fixed now. Could you let us know if this seems ok to you? Thanks!

Thanks for the work on this! I triggered a new download. I've also found a workaround: calling dropDuplicates("gbifID") in my Scala-based Spark SQL scripts. I'll first do a record count with and without that method, just as @MattBlissett does in the daily check mentioned earlier. Will let you know in a few hours what transpired.
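The with-and-without comparison can be illustrated in plain Python. This is a rough stand-in rather than the Spark scripts mentioned above: `drop_duplicates` is a hypothetical helper that keeps the first row seen per key, whereas Spark's dropDuplicates makes no promise about which of the duplicate rows survives. The rows and the `datasetKey` field are invented for the example.

```python
def drop_duplicates(rows, key):
    """Keep the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Hypothetical rows: gbifID 2 was exported under two datasets.
rows = [
    {"gbifID": 1, "datasetKey": "dataset-a"},
    {"gbifID": 2, "datasetKey": "dataset-a"},
    {"gbifID": 2, "datasetKey": "dataset-b"},
    {"gbifID": 3, "datasetKey": "dataset-b"},
]

deduped = drop_duplicates(rows, "gbifID")
removed = len(rows) - len(deduped)
print(len(rows), len(deduped), removed)  # 4 3 1
```

A non-zero difference between the two counts means the download still contains duplicate gbifIDs; zero means the deduplication step is a no-op and the upstream fix has held.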

@dshorthouse
Contributor Author

Good to go! Thanks for the fixes.
