exclude duplicate UniProt proteomes and relax protein count filter #251
Currently in kg-microbe-function, NCBITaxon:1898103 has 40 linked proteomes, and according to UniProt UP000235024 is the reference one:
D SELECT subject
The four files in the gdrive link cover the bacteria/archaea and reference/other proteome subsets. This is the better list to ingest because it will exclude duplicates. I can verify in these files that the reference proteome id for this NCBITaxon is in the bacteria reference tsv file:
grep UP000235024 *
So I think the easiest solution would be to take these four tsvs and use their proteome id column for lookup and filtering. If the proteome id in our download is found in the list, we ingest it; otherwise we reject it.
We could also try lowering the protein count threshold: > 100 and < 10000 proteins would be ideal, but we would need to check whether this would actually increase the size of the transform relative to what we had. We currently have 17147 proteomes with > 1000 proteins (but more are non-Reference and non-Other duplicates), and the total number across these 4 files is 50793, but most are small, with < 1000 proteins.
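A minimal sketch of the lookup-and-filter step proposed above, not the actual kg-microbe-function code. It assumes the four TSVs have a header row and that the proteome id column is named `Proteome Id`; that column name, the function names, and any file paths passed in are hypothetical and should be adjusted to the real downloads.

```python
import csv
from typing import Iterable, List, Set, Tuple


def load_allowed_proteome_ids(
    tsv_paths: Iterable[str], id_column: str = "Proteome Id"
) -> Set[str]:
    """Collect proteome ids from the given TSV files into one set.

    `id_column` is an assumed header name; change it to match the files.
    """
    allowed: Set[str] = set()
    for path in tsv_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            for row in reader:
                value = (row.get(id_column) or "").strip()
                if value:
                    allowed.add(value)
    return allowed


def filter_proteomes(
    downloaded_ids: Iterable[str], allowed: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split downloaded proteome ids into (ingest, reject) lists."""
    kept: List[str] = []
    rejected: List[str] = []
    for proteome_id in downloaded_ids:
        if proteome_id in allowed:
            kept.append(proteome_id)
        else:
            rejected.append(proteome_id)
    return kept, rejected
```

Building one set from all four files keeps each membership check constant-time, and returning the rejected ids as well makes it easy to log how many duplicates were dropped.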
These files were downloaded through the UniProt query & download links. This gives a smaller set of reference + other proteomes that does not include duplicated proteomes (according to the UI). The number is similar to what we are ingesting now. I checked this data and there are some very low and very high outlier protein counts. Our > 1000 protein filter was aggressive and excluded a lot of data; we applied it because we had too many proteomes, but most of those were duplicates.
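To check how a relaxed protein count filter would change what gets ingested, something like the following could be run against the per-proteome protein counts. This is only a sketch: the function name is hypothetical, the counts in the usage example are made up for illustration, and how the real counts are loaded is left out.

```python
from typing import Dict, List, Optional


def proteomes_in_count_range(
    protein_counts: Dict[str, int],
    lower_exclusive: int,
    upper_exclusive: Optional[int] = None,
) -> List[str]:
    """Return proteome ids whose protein count is strictly inside the range.

    With `upper_exclusive=None` only the lower bound is applied, which
    mirrors a plain "> N proteins" filter.
    """
    selected: List[str] = []
    for proteome_id, count in protein_counts.items():
        if count <= lower_exclusive:
            continue
        if upper_exclusive is not None and count >= upper_exclusive:
            continue
        selected.append(proteome_id)
    return selected


# Illustrative counts only; these ids and numbers are not real data.
example_counts = {"UP_A": 50, "UP_B": 450, "UP_C": 3200, "UP_D": 25000}

current = proteomes_in_count_range(example_counts, 1000)         # > 1000
relaxed = proteomes_in_count_range(example_counts, 100, 10000)   # > 100 and < 10000
```

Comparing `len(current)` with `len(relaxed)` on the real counts would show whether the relaxed range grows or shrinks the transform once duplicates are excluded.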
Todo:
https://drive.google.com/drive/folders/11n6dmvY9dY1Nsj1YQDfGbeSMWh0e6Ug2