Skip to content

LewisLabUCSD/GlyCompareCT

Repository files navigation

GlyCompareCT

GlyCompareCT is a Python-based command-line tool available at https://github.com/yuz682/GlyCompareCT. The command-line implementation wraps the existing python package (GlyCompare v1.1.3 https://github.com/LewisLabUCSD/GlyCompare) to increase accessibility by simplifying the user interface. A conda environment yml file is provided for stable installation. Executable files are also available on Zenodo (https://doi.org/10.5281/zenodo.6370789) for Windows (tested on Windows 10, Core i7), Linux (tested on 18.04.6 LTS and CentOS Linux 7 Core), and Mac OS with Intel chip (macOS 12.1, Core i7) and M1 chip.

Mandatory inputs include a glycan abundance table (absolute or relative abundance with rows/columns as samples/glycans; -a <path/to/abundance>) and a glycan annotation table (-v <path/to/annotation>); both in CSV format. GlyCompareCT decomposes glycans to substructures, calculates substructure abundance and identifies a minimal set of glycomotifs.

GlyCompareCT outputs the glycomotif abundance table. The glycomotif abundance table denotes the abundance of the glycomotifs extracted from input glycoprofiles. Rows represent glycomotifs written as <[S/L]i> where S or L denote the structural and linkage-specific references respectfully and i indicates the index in the local reference glycomotif vector (GlyCompareCT/reference if using the python script; glyCompareCT_exe_/reference if using executables). Note that local references will be amended to include previously un-indexed substructures; the github reference will be versioned by date and updated occasionally to integrate new substructures. Column names correspond to glycoprofile names, consistent with the input glycan abundance table.

Citation

Bao, Bokan, Benjamin P. Kellman, Austin WT Chiang, Yujie Zhang, James T. Sorrentino, Austin K. York, Mahmoud A. Mohammad, Morey W. Haymond, Lars Bode, and Nathan E. Lewis. "Correcting for sparsity and interdependence in glycomics by accounting for glycan biosynthesis." Nature communications 12, no. 1 (2021): 1-14. https://doi.org/10.1038/s41467-021-25183-5

Installation

First, please make sure you have conda installed. Version recommendation: conda 4.9.2 and later versions.

Please git clone the main branch to your target local directory.

# get the repo
git clone https://github.com/yuz682/GlyCompareCT.git
# enter the repo
cd GlyCompareCT

All dependencies required to run GlyCompareCT can be installed using environment.yml. A new conda environment is created with all dependencies installed. This step will take a while (10 - 15 minutes).

# Create the environment with all required dependencies installed.
conda env create -f environment.yml

Activate the new environment glycompareCT. Then the preprocessing is all done.

# Activate conda environment
conda activate glycompareCT

Executables

Executables for Window, MacIntel, and Linux can be downloaded from the release or zenodo. The binary file is glyCompareCT (or glyCompareCT.exe). To use more conveniently, you can export the path to PATH variable by

export PATH="<path>/<to>/<glyCompareCT>/<directory>":$PATH

then

source ~/.bashrc

User manual

Please refer to the GlyCompare wiki regarding input file format and more details about input parameters. Please ignore some inconsistent wording as the wiki was written for a web app.

Quick start

Retreive example data

git clone https://github.com/LewisLabUCSD/GlyCompare.git 

Glycopare decomposition of structural, linkage-specific HMO data with no normalization, 2 cores, integer substructure counting, epitope-based motif extraction

python glyCompareCT.py structure \
  -a GlyCompare/example_data/paper_hmo/source_data/abundance_table.csv \
  -v GlyCompare/example_data/paper_hmo/source_data/annotation.csv \
  -o output_hmo/ -p glycoCT -c 2 \

Glycopare decomposition of structural, linkage-specific HMO data with Probabilistic Quotient normalization, 2 cores, binary substructure counting, lactose-based motif extraction

python glyCompareCT.py structure \
  -a GlyCompare/example_data/paper_hmo/source_data/abundance_table.csv \
  -v GlyCompare/example_data/paper_hmo/source_data/annotation.csv \
  -o output_hmo/ -p glycoCT -n prob_quot \
  -m binary -c 2 -r lactose

Naive samples

Simple simulated samples can be retrieved from GlyCompareCT/Naive samples/. There are 4 pairs of test samples.

cd Naive\ samples/

python glyCompareCT.py structure \
  -a test1_abd.csv \
  -v test1_var.csv \
  -o test1 -p glycoCT -b \
  -m integer -c 2 

Inputs:

Outputs:

Table annotation

Annotation format will update the Glytoucan ID column in the previously generated motif annotation table or table with the same format.

python glyCompareCT.py annotate -n <ANNOTATION TABLE>

Structure data

python glyCompareCT.py structure -a <ABUNDANCE TABLE> -v <GLYCAN ANNOTATION> 
-o <OUTPUT_DIRECTORY> -p <GLYCAN_DATA_TYPE> [-n <NORMALIZATION_MODE>, 
-m <SUBSTRUCTURE_ABUNDANCE_MULTIPLIER>, -c <NUMBER_OF_CORES>, -r <ROOT>, 
-cr <CUSTOM_ROOT>, -d, -s, -b, -i]

Required arguments:

Parameter Description
-a, --abundance The file directory to the abundance table, in csv format
-v, --var_annot The file directory to the glycan annotation table, in csv format
-o, --output The directory to save the outputs, folder
-p, --syntax Glycan data type, choose from <'glycoCT', 'iupac_extended', 'linear_code', 'wurcs', 'glytoucan_id'>

Optional arguments:

Parameter Default Description
-e, --share 'private' Either run locally or register the output motif structures to Glytoucan. Choose from <'private', 'register'>.
'private': run GlyCompareCT locally without fetching glytoucan ID and register output motifs to Glytoucan.
'register': Fetch glytoucan ID to output motif annotation table and register any output motifs without glytoucan ID to Glytoucan. Needs to specify Glytoucan contributor ID and API_key.
-C, --Contributor_ID '' User's Glytoucan contributor ID. Can be retrieved at Glytoucan after signing up. Required in -e register mode.
-A, --API_key '' User's Glytoucan API key. Can be retrieved at Glytoucan after signing up. Required in -e register mode.
-s, --no_linkage None Add this parameter if the input glycans don't have linkage information. The default assumes linkage information inclusion.
-c, --core 1 The number of cores to use
-n, --norm 'none' Input glycans normalization within each glycoprofile, choose from <'none', 'min-max', 'prob-quot'>.
'none': no normalization;
'min-max': each element x is set to (x - min) / (max - min);
'prob-quot': A commonly seen normalization method in biological data described in Dieterle et al. 2006
-b, --no_sub_norm None Add this parameter to keep the absolute value of the substructure abundance. If not set, the substructure will be normalized by sum.
-m, --multiplier 'integer' Substructure abundance multiplier, choose from <'binary', 'integer'>.
'binary': 1 if the substructure exists in the glycan, 0 if not;
'integer': the occurrence of the substructure in the glycan.
-r, --root 'epitope' The root substructure of the substructure network, choose from <'epitope', 'N', 'O', 'lactose', 'custom'>.
"epitope": run every possible monosaccharide is a root;
'N': the root for N-glycan, GlcNAc;
'O': the root for O-glycan, GalNAc;
'lactose': set the root as lactose, Gal(b1-4)Glc;
'custom': set custom root. You need to write your custom root in glycoCT format to a txt file and specify the file directory in -cr.
-cr, --custom_root '' The file directory to the txt file containing the custom root in glycoCT format. Only specify this if -r is set to 'custom'.
-d, --heatmap None Add this parameter if you want to draw the cluster map based on the output motif abundance table.
-i, --ignore None Add this parameter if you want to ignore unrecognized glycan structures and proceed the rest.

Composition data

python glyCompareCT.py composition -a <ABUNDANCE TABLE> -v <GLYCAN ANNOTATION> 
-o <OUTPUT_DIRECTORY> [-n <NORMALIZATION_MODE>, -i]

Required arguments:

Parameter Description
-a, --abundance The file directory to the abundance table, in csv format
-v, --var_annot The file directory to the glycan annotation table, in csv format
-o, --output The directory to save the outputs, folder

Optional arguments:

Parameter Default Description
-n, --norm 'none' Input glycans normalization within each glycoprofile, choose from <'none', 'min-max', 'prob-quot'>.
'none': no normalization;
'min-max': each element x
-i, --ignore None Add this parameter if you want to ignore unrecognized glycan compositions and proceed the rest.