exclude duplicate UniProt proteomes and relax protein count filter #251

realmarcin · 2024-09-18T04:22:11Z

I downloaded proteome lists through the UniProt query and download links. This gives a smaller set of reference + other proteomes that, according to the UI, does not include duplicated proteomes. The number is similar to what we are ingesting now. I checked this data and there are some very low and very high outlier protein counts. Our > 1000 protein filter was aggressive and excluded a lot of data; we applied it because we had too many proteomes, but most of those are duplicates.

Todo:

Use the proteome ids from the four files here:
https://drive.google.com/drive/folders/11n6dmvY9dY1Nsj1YQDfGbeSMWh0e6Ug2
Transform only these proteome ids, and relax the per-proteome protein count threshold to 100 proteins or a similar number, depending on the resulting proteome count and the transform time estimate.

realmarcin · 2024-09-19T22:07:03Z

Currently in kg-microbe-function, NCBITaxon:1898103 has 40 linked proteomes, and according to UniProt the reference one is UP000235024:

UP000235024

D SELECT subject
FROM edges
WHERE object = 'NCBITaxon:1898103' AND subject LIKE 'Proteomes:%';
┌───────────────────────┐
│ subject │
│ varchar │
├───────────────────────┤
│ Proteomes:UP000235024 │
│ Proteomes:UP000264384 │
│ Proteomes:UP000267091 │
│ Proteomes:UP000273204 │
│ Proteomes:UP000318886 │
│ Proteomes:UP000321170 │
│ Proteomes:UP000321575 │
│ Proteomes:UP000323901 │
│ Proteomes:UP000473006 │
│ Proteomes:UP000486387 │
│ Proteomes:UP000517624 │
│ Proteomes:UP000520548 │
│ Proteomes:UP000540725 │
│ Proteomes:UP000555766 │
│ Proteomes:UP000664379 │
│ Proteomes:UP000664772 │
│ Proteomes:UP000664804 │
│ Proteomes:UP000674046 │
│ Proteomes:UP000696334 │
│ Proteomes:UP000697603 │
│ Proteomes:UP000698183 │
│ Proteomes:UP000714254 │
│ Proteomes:UP000720119 │
│ Proteomes:UP000725274 │
│ Proteomes:UP000725850 │
│ Proteomes:UP000731274 │
│ Proteomes:UP000733931 │
│ Proteomes:UP000736971 │
│ Proteomes:UP000739626 │
│ Proteomes:UP000745925 │
│ Proteomes:UP000746928 │
│ Proteomes:UP000748029 │
│ Proteomes:UP000753692 │
│ Proteomes:UP000757370 │
│ Proteomes:UP000759488 │
│ Proteomes:UP000771717 │
│ Proteomes:UP000782948 │
│ Proteomes:UP000784857 │
│ Proteomes:UP000808344 │
│ Proteomes:UP000811369 │
├───────────────────────┤
│ 40 rows │
└───────────────────────┘

The four files in the gdrive link above cover the bacteria/archaea and reference/other proteome subsets. This is the better list to ingest because it excludes duplicates. I can verify that the reference proteome id for this NCBITaxon is in the bacteria reference tsv file:

grep UP000235024 *
proteomes_AND_superkingdom_Bacteria_AND_2024_09_18_reference.tsv:UP000235024 Rhodocyclaceae bacterium 1898103 2909 C:78.7%[S:78.0%,D:0.7%],F:0.5%,M:20.7%,n:569 Standard GCA_002863805.1

So I think the easiest solution would be to take these four tsvs and use their proteome id column for lookup and filtering. If the proteome id in our download is found in the list then we ingest it, otherwise we reject it.
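A minimal sketch of that lookup, not the actual kg-microbe transform code. It assumes the proteome id is the first tab-separated field of each row (as in the grep output above) and that ids look like `UP` followed by digits. The function names `proteome_ids_from_lines`, `load_allowed_proteome_ids` and `should_ingest` are hypothetical.

```python
import csv
import re

# Assumption: UniProt proteome ids look like "UP" followed by digits.
PROTEOME_ID = re.compile(r"^UP\d+$")


def proteome_ids_from_lines(lines):
    """Collect proteome ids from the first column of TSV lines."""
    allowed = set()
    for row in csv.reader(lines, delimiter="\t"):
        # Header rows and blank lines do not match the pattern and are skipped.
        if row and PROTEOME_ID.match(row[0].strip()):
            allowed.add(row[0].strip())
    return allowed


def load_allowed_proteome_ids(tsv_paths):
    """Union of the proteome ids found in all given TSV files."""
    allowed = set()
    for path in tsv_paths:
        with open(path, newline="") as handle:
            allowed |= proteome_ids_from_lines(handle)
    return allowed


def should_ingest(proteome_id, allowed):
    """Accept either 'UP000235024' or the prefixed 'Proteomes:UP000235024'."""
    return proteome_id.split(":")[-1] in allowed
```

The ids from all four files go into a single set, so the check per downloaded proteome is a constant-time membership test.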

We could also try lowering the protein count threshold: > 100 and < 10000 proteins would be ideal, but we would need to check whether this would actually increase the size of the transform relative to what we had. We currently have 17147 proteomes with > 1000 proteins (but most are non-Reference and non-Other duplicates). The total number across these 4 files is 50793, but most are small, i.e. < 1000 proteins.
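To estimate the effect of a new threshold before running the transform, something like the following could count how many listed proteomes fall inside the window. This is a rough sketch: it assumes the protein count is the fourth tab-separated field (index 3), as it appears in the grep output above, and `count_in_range` is a hypothetical helper name.

```python
import csv


def count_in_range(lines, min_proteins=100, max_proteins=10000, count_column=3):
    """Return (kept, total) for rows whose protein count is strictly inside the window."""
    kept = 0
    total = 0
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) <= count_column:
            continue  # blank or short line
        try:
            n_proteins = int(row[count_column])
        except ValueError:
            continue  # header row or non-numeric value
        total += 1
        # Strict bounds, matching "> 100 and < 10000" above.
        if min_proteins < n_proteins < max_proteins:
            kept += 1
    return kept, total
```

Passing an open file handle for each of the four tsvs and summing the `kept` values gives the proteome count to compare against the current 17147.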

@hrshdhgd
