NPC-Labeler

A utility that runs the NPC (NPClassifier) APIs to classify SMILES, together with datasets that have already been labelled with it.

Datasets

Using this utility, we have already labelled the SMILES from the following datasets, which we share on Zenodo.

Dataset	Description	Labels	Total SMILES	Classified SMILES
GNPS Cleaning + MatchMS	Preprocessed MS/MS spectra from GNPS using MatchMS	Download from Zenodo	54066	54059
GNPS Cleaning	Preprocessed MS/MS spectra from GNPS	Download from Zenodo	53362	53355
GNPS-LIBRARY	Spectra from GNPS Library	Download from Zenodo	5617	5581
GNPS-SELLECKCHEM-FDA-PART1	Spectra from GNPS	Download from Zenodo	285	285
GNPS-SELLECKCHEM-FDA-PART2	Spectra from GNPS	Download from Zenodo	536	536
GNPS-PRESTWICKPHYTOCHEM	Spectra from GNPS	Download from Zenodo	140	140
GNPS-NIH-CLINICALCOLLECTION1	Spectra from GNPS	Download from Zenodo	323	323
GNPS-NIH-CLINICALCOLLECTION2	Spectra from GNPS	Download from Zenodo	17	16
GNPS-NIH-NATURALPRODUCTSLIBRARY	Spectra from GNPS	Download from Zenodo	1255	1255
GNPS-NIH-NATURALPRODUCTSLIBRARY_ROUND2_POSITIVE	Positive spectra from GNPS	Download from Zenodo	3616	3616
GNPS-NIH-NATURALPRODUCTSLIBRARY_ROUND2_NEGATIVE	Negative spectra from GNPS	Download from Zenodo	1464	1464
GNPS-NIH-SMALLMOLECULEPHARMACOLOGICALLYACTIVE	Spectra from GNPS	Download from Zenodo	1385	1385
GNPS-FAULKNERLEGACY	Spectra from GNPS	Download from Zenodo	2	2
GNPS-EMBL-MCF	Spectra from GNPS	Download from Zenodo	331	331
GNPS-COLLECTIONS-PESTICIDES-POSITIVE	Positive spectra from GNPS	Download from Zenodo	171	171
GNPS-COLLECTIONS-PESTICIDES-NEGATIVE	Negative spectra from GNPS	Download from Zenodo	45	45
MMV_POSITIVE	Positive spectra from MMV	Download from Zenodo	110	110
MMV_NEGATIVE	Negative spectra from MMV	Download from Zenodo	47	47
LDB_POSITIVE	Positive spectra from LDB	Download from Zenodo	280	280
LDB_NEGATIVE	Negative spectra from LDB	Download from Zenodo	346	346
GNPS-NIST14-MATCHES	Spectra from GNPS	Download from Zenodo	1590	1589
GNPS-COLLECTIONS-MISC	Spectra from GNPS	Download from Zenodo	6	5
GNPS-MSMLS	Spectra from GNPS	Download from Zenodo	399	399
PSU-MSMLS	Spectra from GNPS	Download from Zenodo	367	367
BILELIB19	Spectra from GNPS	Download from Zenodo	533	533
DEREPLICATOR_IDENTIFIED_LIBRARY	Spectra from GNPS	Download from Zenodo	379	379
PNNL-LIPIDS-POSITIVE	Positive spectra from PNNL	Download from Zenodo	1	1
PNNL-LIPIDS-NEGATIVE	Negative spectra from PNNL	-	0	0
MIADB	Spectra from MIADB	Download from Zenodo	421	417
HCE-CELL-LYSATE-LIPIDS	Spectra from GNPS	Download from Zenodo	92	92
UM-NPDC	Spectra from GNPS	Download from Zenodo	23	23
GNPS-NUTRI-METAB-FEM-POS	Positive spectra from GNPS	Download from Zenodo	259	259
GNPS-NUTRI-METAB-FEM-NEG	Negative spectra from GNPS	Download from Zenodo	197	197
GNPS-SCIEX-LIBRARY	Spectra from GNPS	Download from Zenodo	314	314
GNPS-IOBA-NHC	Spectra from GNPS	Download from Zenodo	142	141
BERKELEY-LAB	Spectra from GNPS	Download from Zenodo	4124	4124
IQAMDB	Spectra from GNPS	Download from Zenodo	322	320
GNPS-SAM-SIK-KANG-LEGACY-LIBRARY	Spectra from GNPS	Download from Zenodo	223	219
GNPS-D2-AMINO-LIPID-LIBRARY	Spectra from GNPS	-	0	0
DRUGS-OF-ABUSE-LIBRARY	Spectra from GNPS	Download from Zenodo	237	237
ECG-ACYL-AMIDES-C4-C24-LIBRARY	Spectra from GNPS	Download from Zenodo	1277	1277
ECG-ACYL-ESTERS-C4-C24-LIBRARY	Spectra from GNPS	Download from Zenodo	496	496
LEAFBOT	Spectra from GNPS	Download from Zenodo	299	299
XANTHONES-DB	Spectra from GNPS	Download from Zenodo	19	19
TUEBINGEN-NATURAL-PRODUCT-COLLECTION	Spectra from GNPS	Download from Zenodo	343	342
NEO-MSMS	Spectra from GNPS	Download from Zenodo	358	358
CMMC-LIBRARY	Spectra from GNPS	Download from Zenodo	3610	3610
PHENOLICSDB	Spectra from GNPS	Download from Zenodo	69	69
DMIM-DRUG-METABOLITE-LIBRARY	Spectra from GNPS	Download from Zenodo	1840	1840
ELIXDB-LICHEN-DATABASE	Spectra from GNPS	Download from Zenodo	529	527
MSNLIB-POSITIVE	Positive spectra from MSNLIB	Download from Zenodo	26571	26571
MSNLIB-NEGATIVE	Negative spectra from MSNLIB	Download from Zenodo	26571	26571
GNPS-N-ACYL-LIPIDS-MASSQL	Spectra from GNPS	-	0	0
MCE-DRUG	Spectra from GNPS	Download from Zenodo	2994	2994
CMMC-FOOD-BIOMARKERS	Spectra from GNPS	Download from Zenodo	182	182
ECRFS_DB	Spectra from GNPS	Download from Zenodo	102	102
GNPS-IIMN-PROPOGATED	Spectra from GNPS	Download from Zenodo	45	43
GNPS-SUSPECTLIST	Spectra from GNPS	-	0	0
GNPS-BILE-ACID-MODIFICATIONS	Spectra from GNPS	Download from Zenodo	66	66
GNPS-DRUG-ANALOG	Spectra from GNPS	-	0	0
BMDMS-NP	Spectra from GNPS	Download from Zenodo	2581	2581
MASSBANK	Spectra from GNPS	Download from Zenodo	9206	9108
MASSBANKEU	Spectra from GNPS	Download from Zenodo	692	691
MONA	Spectra from GNPS	Download from Zenodo	3151	3151
HMDB	Spectra from GNPS	Download from Zenodo	748	748
CASMI	Spectra from GNPS	Download from Zenodo	449	449
SUMNER	Spectra from GNPS	Download from Zenodo	261	259
BIRMINGHAM-UHPLC-MS-POS	Positive spectra from Birmingham UHPLC-MS	Download from Zenodo	547	547
BIRMINGHAM-UHPLC-MS-NEG	Negative spectra from Birmingham UHPLC-MS	Download from Zenodo	549	549
ALL_GNPS_NO_PROPOGATED	Spectra from GNPS	Download from Zenodo	75744	75587
ALL_GNPS	Spectra from GNPS	Download from Zenodo	75798	75640
PubChem CID-SMILES	CID-SMILES from PubChem	In progress	119031918	11000000
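
The last two columns of the table can be read as a coverage figure: the share of SMILES in a dataset that received a classification. The sketch below computes that percentage for a few rows copied from the table. The `coverage` helper is a hypothetical name introduced here for illustration and is not part of the repository.

```python
def coverage(total: int, classified: int) -> float:
    """Return the percentage of SMILES that were classified.

    Datasets with no SMILES at all are reported as 0.0 to avoid
    dividing by zero.
    """
    if total == 0:
        return 0.0
    return 100.0 * classified / total


# (Total SMILES, Classified SMILES) pairs copied from the table above.
rows = {
    "GNPS-LIBRARY": (5617, 5581),
    "MASSBANK": (9206, 9108),
    "ALL_GNPS": (75798, 75640),
}

for name, (total, classified) in rows.items():
    print(f"{name}: {coverage(total, classified):.2f}% classified")
```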

Dataset format

Each dataset is stored as a gzip-compressed JSON file containing a list of records in the following format:

[
    {
        "class_results": [
            "Cyclic peptides",
            "Microcystins"
        ],
        "superclass_results": [
            "Oligopeptides"
        ],
        "pathway_results": [
            "Amino acids and Peptides"
        ],
        "isglycoside": false,
        "smiles": "CC(C=CC1NC(=O)C(CCCN=C(N)N)NC(=O)C(C)C(C(=O)O)NC(=O)C(CC(C)C)=NC(=O)C(C)NC(=O)C(C)N(C)C(=O)CCC(C(=O)O)NC(=O)C1C)=CC(C)C(O)Cc1ccccc1"
    },
    {
        "class_results": [
            "Cyclic peptides",
            "Depsipeptides"
        ],
        "superclass_results": [
            "Oligopeptides"
        ],
        "pathway_results": [
            "Amino acids and Peptides",
            "Polyketides"
        ],
        "isglycoside": false,
        "smiles": "CC(=O)OC1c2nc(cs2)C(=O)OC(CCCC(C)(Cl)[37Cl])C(C)C(=O)OC(C(C)(C)O)c2nc(cs2)C(=O)OC1(C)C"
    }
]
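
A file in this format can be read with the Python standard library alone. The following is a minimal sketch, assuming the file is a gzip-compressed JSON list of records shaped like the ones above. The function names `load_labels` and `count_pathways` are hypothetical and are not provided by this repository.

```python
import gzip
import json
from collections import Counter
from typing import Any


def load_labels(path: str) -> list[dict[str, Any]]:
    """Read a gzip-compressed JSON file holding a list of records."""
    # "rt" opens the compressed stream in text mode so json can parse it.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def count_pathways(records: list[dict[str, Any]]) -> Counter:
    """Count how often each pathway label occurs across the records."""
    counts: Counter = Counter()
    for record in records:
        # A record may carry several pathways, or none at all.
        counts.update(record.get("pathway_results", []))
    return counts
```

For example, `count_pathways(load_labels("classified_matchms.json.gz")).most_common(5)` would list the five most frequent pathways in a labelled file, where the file name is the output of the usage example in this README.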

Usage

First, clone this repository:

git clone https://github.com/LucaCappelletti94/npc-labeler.git

Move into the repository directory and install the requirements:

cd npc-labeler
pip install -r requirements.txt

Then, you can run the labeler by providing the input file and the output file:

python3 labeler.py --input <input_file> --output <output_file>

For instance, to classify the SMILES in the metadata of an MGF file and store the results in classified_matchms.json.gz, run:

python3 labeler.py --input matchms.mgf --output classified_matchms.json.gz

Similarly, for an SSV file:

python3 labeler.py --input CID-SMILES.ssv --output pubchem.json.gz
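
When several input files need labelling, the same command can be driven from a short script. The sketch below only relies on the `--input` and `--output` flags shown above. The helper names `build_command` and `label_all`, and the `classified_<name>.json.gz` output naming, are assumptions made for illustration and are not part of the repository.

```python
import subprocess
import sys
from pathlib import Path


def build_command(input_path: Path, output_dir: Path) -> list[str]:
    """Build the labeler.py invocation for one input file.

    The output name (classified_<stem>.json.gz) is an assumed convention.
    """
    output_path = output_dir / f"classified_{input_path.stem}.json.gz"
    return [
        sys.executable,
        "labeler.py",
        "--input",
        str(input_path),
        "--output",
        str(output_path),
    ]


def label_all(input_paths: list[Path], output_dir: Path) -> None:
    """Run the labeler once per input file, stopping at the first failure."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for input_path in input_paths:
        # check=True raises CalledProcessError if labeler.py exits non-zero.
        subprocess.run(build_command(input_path, output_dir), check=True)
```

Calling `label_all([Path("matchms.mgf"), Path("CID-SMILES.ssv")], Path("labels"))` from inside the cloned repository would run the two commands above one after the other and write the results into a `labels` directory.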
