Refactored metadata collection and fixes an issue with newly loaded datasets #7

Merged 19 commits on Jan 27, 2025

@Claptar Claptar commented Jan 17, 2025

Refactoring. I refactored the metadata collection script to make it more readable: I split the separate steps into functions and added many failure checks.

Issue fix. Newly loaded datasets appear to be submitted to SRA and GEO separately, so the ENA and SRA metadata files contain no information about GSM identifiers. To work around this, I added parsing of GEO's soft_family file to recover the GSM to SRS to BioSample relations.
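
For illustration only, here is a minimal sketch of how sample relations might be pulled out of a family SOFT file. It is not the code from this PR. It assumes the file marks each sample with a `^SAMPLE = GSM...` line and lists cross-references as `!Sample_relation = BioSample: <url>` and `!Sample_relation = SRA: <url>` lines; the function name `parse_soft_relations` and the regular expressions are hypothetical, and the SRA accession is returned as found, whatever its type.

```python
import re
from typing import Dict, Iterable

# Assumed line shapes in a GEO family SOFT file; adjust if the real file differs.
SAMPLE_RE = re.compile(r"^\^SAMPLE\s*=\s*(GSM\d+)")
RELATION_RE = re.compile(r"^!Sample_relation\s*=\s*(BioSample|SRA):\s*(\S+)")
# BioSample accessions (SAMN/SAMEA/SAMD) or SRA/ENA/DDBJ sample, experiment, run accessions.
ACCESSION_RE = re.compile(r"(SAM[NED][A-Z]?\d+|[SED]R[SXR]\d+)")


def parse_soft_relations(lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Map each GSM id to the accessions found in its relation lines (hypothetical helper)."""
    relations: Dict[str, Dict[str, str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        sample = SAMPLE_RE.match(line)
        if sample:
            current = sample.group(1)
            relations[current] = {}
            continue
        if line.startswith("^"):
            # Any other entity header ends the current sample block.
            current = None
            continue
        if current is None:
            continue
        relation = RELATION_RE.match(line)
        if relation:
            accession = ACCESSION_RE.search(relation.group(2))
            if accession:
                relations[current][relation.group(1)] = accession.group(1)
    return relations


# Made-up identifiers, used only to exercise the parser.
example_lines = [
    "^SERIES = GSE0000001",
    "!Series_title = demo",
    "^SAMPLE = GSM0000001",
    "!Sample_relation = BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN00000001",
    "!Sample_relation = SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX0000001",
]
example_relations = parse_soft_relations(example_lines)
```

A sample with no relation lines still appears in the result with an empty mapping, which makes missing links easy to flag in a later fail check.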

Tests. I've added a draft list of datasets that represent common problems with metadata collection. The list below will be extended:

https://claptar.notion.site/Trouble-shooting-datasets-160b0afc4069801ea8bcc412a971b22f?pvs=4

@Claptar Claptar self-assigned this Jan 21, 2025
@Claptar Claptar requested a review from apredeus January 21, 2025 13:38
@apredeus apredeus merged commit b80355c into main Jan 27, 2025
9 checks passed
@apredeus apredeus deleted the refactor_metadata_pulling branch January 27, 2025 12:57