earth-metabolome-initiative/npc-labeler

NPC-Labeler

Utility to run the NPC APIs to classify SMILES, plus preprocessed datasets.

Datasets

Using this utility, we have already labelled the SMILES from the following datasets, which we share on Zenodo.

| Dataset | Description | Labels | Total SMILES | Classified SMILES |
|---|---|---|---|---|
| GNPS Cleaning + MatchMS | Preprocessed MS/MS spectra from GNPS using MatchMS | Download from Zenodo | 54066 | 54059 |
| GNPS Cleaning | Preprocessed MS/MS spectra from GNPS | Download from Zenodo | 53362 | 53355 |
| GNPS-LIBRARY | Spectra from GNPS Library | Download from Zenodo | 5617 | 5581 |
| GNPS-SELLECKCHEM-FDA-PART1 | Spectra from GNPS | Download from Zenodo | 285 | 285 |
| GNPS-SELLECKCHEM-FDA-PART2 | Spectra from GNPS | Download from Zenodo | 536 | 536 |
| GNPS-PRESTWICKPHYTOCHEM | Spectra from GNPS | Download from Zenodo | 140 | 140 |
| GNPS-NIH-CLINICALCOLLECTION1 | Spectra from GNPS | Download from Zenodo | 323 | 323 |
| GNPS-NIH-CLINICALCOLLECTION2 | Spectra from GNPS | Download from Zenodo | 17 | 16 |
| GNPS-NIH-NATURALPRODUCTSLIBRARY | Spectra from GNPS | Download from Zenodo | 1255 | 1255 |
| GNPS-NIH-NATURALPRODUCTSLIBRARY_ROUND2_POSITIVE | Positive spectra from GNPS | Download from Zenodo | 3616 | 3616 |
| GNPS-NIH-NATURALPRODUCTSLIBRARY_ROUND2_NEGATIVE | Negative spectra from GNPS | Download from Zenodo | 1464 | 1464 |
| GNPS-NIH-SMALLMOLECULEPHARMACOLOGICALLYACTIVE | Spectra from GNPS | Download from Zenodo | 1385 | 1385 |
| GNPS-FAULKNERLEGACY | Spectra from GNPS | Download from Zenodo | 2 | 2 |
| GNPS-EMBL-MCF | Spectra from GNPS | Download from Zenodo | 331 | 331 |
| GNPS-COLLECTIONS-PESTICIDES-POSITIVE | Positive spectra from GNPS | Download from Zenodo | 171 | 171 |
| GNPS-COLLECTIONS-PESTICIDES-NEGATIVE | Negative spectra from GNPS | Download from Zenodo | 45 | 45 |
| MMV_POSITIVE | Positive spectra from MMV | Download from Zenodo | 110 | 110 |
| MMV_NEGATIVE | Negative spectra from MMV | Download from Zenodo | 47 | 47 |
| LDB_POSITIVE | Positive spectra from LDB | Download from Zenodo | 280 | 280 |
| LDB_NEGATIVE | Negative spectra from LDB | Download from Zenodo | 346 | 346 |
| GNPS-NIST14-MATCHES | Spectra from GNPS | Download from Zenodo | 1590 | 1589 |
| GNPS-COLLECTIONS-MISC | Spectra from GNPS | Download from Zenodo | 6 | 5 |
| GNPS-MSMLS | Spectra from GNPS | Download from Zenodo | 399 | 399 |
| PSU-MSMLS | Spectra from GNPS | Download from Zenodo | 367 | 367 |
| BILELIB19 | Spectra from GNPS | Download from Zenodo | 533 | 533 |
| DEREPLICATOR_IDENTIFIED_LIBRARY | Spectra from GNPS | Download from Zenodo | 379 | 379 |
| PNNL-LIPIDS-POSITIVE | Positive spectra from PNNL | Download from Zenodo | 1 | 1 |
| PNNL-LIPIDS-NEGATIVE | Negative spectra from PNNL | - | 0 | 0 |
| MIADB | Spectra from MIADB | Download from Zenodo | 421 | 417 |
| HCE-CELL-LYSATE-LIPIDS | Spectra from GNPS | Download from Zenodo | 92 | 92 |
| UM-NPDC | Spectra from GNPS | Download from Zenodo | 23 | 23 |
| GNPS-NUTRI-METAB-FEM-POS | Positive spectra from GNPS | Download from Zenodo | 259 | 259 |
| GNPS-NUTRI-METAB-FEM-NEG | Negative spectra from GNPS | Download from Zenodo | 197 | 197 |
| GNPS-SCIEX-LIBRARY | Spectra from GNPS | Download from Zenodo | 314 | 314 |
| GNPS-IOBA-NHC | Spectra from GNPS | Download from Zenodo | 142 | 141 |
| BERKELEY-LAB | Spectra from GNPS | Download from Zenodo | 4124 | 4124 |
| IQAMDB | Spectra from GNPS | Download from Zenodo | 322 | 320 |
| GNPS-SAM-SIK-KANG-LEGACY-LIBRARY | Spectra from GNPS | Download from Zenodo | 223 | 219 |
| GNPS-D2-AMINO-LIPID-LIBRARY | Spectra from GNPS | - | 0 | 0 |
| DRUGS-OF-ABUSE-LIBRARY | Spectra from GNPS | Download from Zenodo | 237 | 237 |
| ECG-ACYL-AMIDES-C4-C24-LIBRARY | Spectra from GNPS | Download from Zenodo | 1277 | 1277 |
| ECG-ACYL-ESTERS-C4-C24-LIBRARY | Spectra from GNPS | Download from Zenodo | 496 | 496 |
| LEAFBOT | Spectra from GNPS | Download from Zenodo | 299 | 299 |
| XANTHONES-DB | Spectra from GNPS | Download from Zenodo | 19 | 19 |
| TUEBINGEN-NATURAL-PRODUCT-COLLECTION | Spectra from GNPS | Download from Zenodo | 343 | 342 |
| NEO-MSMS | Spectra from GNPS | Download from Zenodo | 358 | 358 |
| CMMC-LIBRARY | Spectra from GNPS | Download from Zenodo | 3610 | 3610 |
| PHENOLICSDB | Spectra from GNPS | Download from Zenodo | 69 | 69 |
| DMIM-DRUG-METABOLITE-LIBRARY | Spectra from GNPS | Download from Zenodo | 1840 | 1840 |
| ELIXDB-LICHEN-DATABASE | Spectra from GNPS | Download from Zenodo | 529 | 527 |
| MSNLIB-POSITIVE | Positive spectra from MSNLIB | Download from Zenodo | 26571 | 26571 |
| MSNLIB-NEGATIVE | Negative spectra from MSNLIB | Download from Zenodo | 26571 | 26571 |
| GNPS-N-ACYL-LIPIDS-MASSQL | Spectra from GNPS | - | 0 | 0 |
| MCE-DRUG | Spectra from GNPS | Download from Zenodo | 2994 | 2994 |
| CMMC-FOOD-BIOMARKERS | Spectra from GNPS | Download from Zenodo | 182 | 182 |
| ECRFS_DB | Spectra from GNPS | Download from Zenodo | 102 | 102 |
| GNPS-IIMN-PROPOGATED | Spectra from GNPS | Download from Zenodo | 45 | 43 |
| GNPS-SUSPECTLIST | Spectra from GNPS | - | 0 | 0 |
| GNPS-BILE-ACID-MODIFICATIONS | Spectra from GNPS | Download from Zenodo | 66 | 66 |
| GNPS-DRUG-ANALOG | Spectra from GNPS | - | 0 | 0 |
| BMDMS-NP | Spectra from GNPS | Download from Zenodo | 2581 | 2581 |
| MASSBANK | Spectra from GNPS | Download from Zenodo | 9206 | 9108 |
| MASSBANKEU | Spectra from GNPS | Download from Zenodo | 692 | 691 |
| MONA | Spectra from GNPS | Download from Zenodo | 3151 | 3151 |
| HMDB | Spectra from GNPS | Download from Zenodo | 748 | 748 |
| CASMI | Spectra from GNPS | Download from Zenodo | 449 | 449 |
| SUMNER | Spectra from GNPS | Download from Zenodo | 261 | 259 |
| BIRMINGHAM-UHPLC-MS-POS | Positive spectra from Birmingham UHPLC-MS | Download from Zenodo | 547 | 547 |
| BIRMINGHAM-UHPLC-MS-NEG | Negative spectra from Birmingham UHPLC-MS | Download from Zenodo | 549 | 549 |
| ALL_GNPS_NO_PROPOGATED | Spectra from GNPS | Download from Zenodo | 75744 | 75587 |
| ALL_GNPS | Spectra from GNPS | Download from Zenodo | 75798 | 75640 |
| PubChem CID-SMILES | CID-SMILES from PubChem | In progress | 119031918 | 11000000 |
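
The last two columns of the table can be read as a classification rate per dataset. The following is a minimal sketch of that arithmetic; the `coverage` helper and the `rows` dictionary are illustrative names that are not part of this repository, and the figures are copied from a few rows of the table above.

```python
def coverage(total, classified):
    """Return the fraction of SMILES that were classified.

    Returns None for empty datasets, where the rate is undefined.
    """
    if total == 0:
        return None
    return classified / total


# (total SMILES, classified SMILES), copied from the table above.
rows = {
    "GNPS-LIBRARY": (5617, 5581),
    "MASSBANK": (9206, 9108),
    "ALL_GNPS": (75798, 75640),
    "PNNL-LIPIDS-NEGATIVE": (0, 0),
}

for name, (total, classified) in rows.items():
    rate = coverage(total, classified)
    if rate is None:
        print(f"{name}: no SMILES available")
    else:
        print(f"{name}: {rate:.2%} classified")
```

Empty datasets are reported separately rather than as 0%, since no SMILES were submitted for them in the first place.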

Dataset format

The datasets are stored as gzip-compressed JSON files. Each file contains a list with one entry per SMILES, in the following format:

[
    {
        "class_results": [
            "Cyclic peptides",
            "Microcystins"
        ],
        "superclass_results": [
            "Oligopeptides"
        ],
        "pathway_results": [
            "Amino acids and Peptides"
        ],
        "isglycoside": false,
        "smiles": "CC(C=CC1NC(=O)C(CCCN=C(N)N)NC(=O)C(C)C(C(=O)O)NC(=O)C(CC(C)C)=NC(=O)C(C)NC(=O)C(C)N(C)C(=O)CCC(C(=O)O)NC(=O)C1C)=CC(C)C(O)Cc1ccccc1"
    },
    {
        "class_results": [
            "Cyclic peptides",
            "Depsipeptides"
        ],
        "superclass_results": [
            "Oligopeptides"
        ],
        "pathway_results": [
            "Amino acids and Peptides",
            "Polyketides"
        ],
        "isglycoside": false,
        "smiles": "CC(=O)OC1c2nc(cs2)C(=O)OC(CCCC(C)(Cl)[37Cl])C(C)C(=O)OC(C(C)(C)O)c2nc(cs2)C(=O)OC1(C)C"
    }
]
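
To illustrate how such a file might be consumed, here is a minimal sketch that loads a labelled dataset using only the Python standard library and counts how often each pathway occurs. The helper names `load_labels` and `count_pathways` are hypothetical and not part of this repository; the field names are taken from the example above.

```python
import gzip
import json
from collections import Counter


def load_labels(path):
    """Load a gzip-compressed JSON file of NPC labels into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def count_pathways(records):
    """Count how often each pathway label occurs across all records."""
    counts = Counter()
    for record in records:
        # A record may carry zero, one, or several pathway labels.
        counts.update(record.get("pathway_results", []))
    return counts
```

For example, `count_pathways(load_labels("classified_matchms.json.gz")).most_common(5)` would list the five most frequent pathways, assuming a file with that name exists in the working directory.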

Usage

First, clone this repository:

git clone https://github.com/LucaCappelletti94/npc-labeler.git

Navigate into it and install the requirements:

cd npc-labeler
pip install -r requirements.txt

Then, you can run the labeler by providing the input file and the output file:

python3 labeler.py --input <input_file> --output <output_file>

For instance, suppose you want to classify the SMILES in the metadata of an MGF document and store the results in a file named classified_matchms.json.gz. You can do so by running:

python3 labeler.py --input matchms.mgf --output classified_matchms.json.gz

Similarly, for an SSV file:

python3 labeler.py --input CID-SMILES.ssv --output pubchem.json.gz
