Analysis of protein domain architectures by combining the linguistic approach n-gram analysis with network theory.

NaegleLab/DANSy

Domain Architecture Network Syntax (DANSy)

DANSy applies the linguistic technique of n-gram analysis, combined with network theory, to protein domain architectures. It represents the proteome as a network that abstracts the functional connections between proteins, describing either proteome-wide relationships (base DANSy) or phenotype-specific changes from differential expression results (deDANSy).
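To make the n-gram idea concrete, here is a minimal sketch that treats each protein's domain architecture as a "sentence" of domains and lists every contiguous run of domains up to a maximum length. The function name `domain_ngrams`, the toy architectures, and the protein labels are hypothetical illustrations; this is not the DANSy implementation, and how DANSy defines network edges between n-grams is not covered here.

```python
from collections import defaultdict

def domain_ngrams(architecture, max_n):
    """Return every contiguous run of 1..max_n domains as a tuple."""
    ngrams = []
    for n in range(1, min(max_n, len(architecture)) + 1):
        for start in range(len(architecture) - n + 1):
            ngrams.append(tuple(architecture[start:start + n]))
    return ngrams

# Hypothetical toy architectures in N- to C-terminal order (not from the reference file).
toy_proteome = {
    "protein_A": ["SH3", "SH2", "Kinase"],
    "protein_B": ["SH2", "Kinase"],
}

# Map each n-gram to the set of proteins that contain it.
ngram_to_proteins = defaultdict(set)
for protein, architecture in toy_proteome.items():
    for ngram in domain_ngrams(architecture, max_n=3):
        ngram_to_proteins[ngram].add(protein)

# n-grams shared by more than one protein hint at a common "grammar".
shared = sorted(ng for ng, prots in ngram_to_proteins.items() if len(prots) > 1)
print(shared)
```

In this toy case the shared n-grams are the single domains SH2 and Kinase and the bigram SH2-Kinase.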

[Figure: DANSy overview. Overview of the general workflow.]

[Figure: deDANSy overview. Overview of the deDANSy workflow.]

How to cite: Please cite our bioRxiv paper, which contains further details and specific applications of the code provided here.

Documentation: Coming soon

Getting started

Create a virtual environment containing all the dependencies for the analysis using the following code in a terminal.

conda env create -f dansy.yml

Activate the environment with conda activate dansy before running specific scripts, or select the dansy kernel for Jupyter notebooks.

Proteome Reference File

DANSy relies on reference files generated by CoDIAC. We have provided a reference file, which was generated on May 12th, 2025, and will be the default file used for analysis.

If you wish to generate the most up-to-date reference file to use for analysis, you will need to take the following steps. First, download the SwissProt ID list from Gencode and place it in the main directory of your local copy of this repo. Then, go to the whole_proteome_reference.py file and change the reference file suffix variable to the current date. Finally, run the following code in a terminal to establish the environment that includes CoDIAC, which will query UniProt and InterPro for the domain architectures. (Note: This can take up to 2 hours after a fresh install, as it will also establish a pybiomart sqlite database.)

conda env create -f codiac.yml
conda activate codiac-env
python scripts/whole_proteome_reference.py
conda deactivate

The DANSy class

To run DANSy on a set of proteins of interest:

import ngramNets
import generateCompleteProteome

# Generate the reference proteome dataframe to save time.
ref_df, _ = generateCompleteProteome.import_proteome_files(reference_file_suffix = your_reference_file_suffix)

# Generating the DANSy object
my_dansy = ngramNets.dansy(ref=ref_df, protsOI = my_proteins, n = 10)

The DANSy object can then be used for downstream analysis and has several built-in methods to help. These include:

# To show the network
my_dansy.draw_network()

# To get a summary of the network and n-grams
my_dansy.summary()

# To get specific information on protein(s) of interest
my_dansy.retrieve_protein_info(prot = uniprot_ids_of_interest)

# To get information on the proteins containing a specific n-gram
my_dansy.retrieve_protein_info(ngram = ngram_of_interest)

Further, any analysis that can be done on a networkx Graph can be performed on the DANSy object by accessing its G attribute, as shown below:

import networkx as nx

nx.number_of_nodes(my_dansy.G)
nx.spring_layout(my_dansy.G)

The deDANSy class

A deDANSy object is a subclass of DANSy, so it requires many of the same inputs. However, you must also provide a dataset that contains expression data to designate individual n-grams as either up- or down-regulated. We provide a quick-start version of the methods below, but recommend checking the Tutorial (coming soon; in the meantime, visit DANSy_Applications) for a more in-depth walkthrough of the methods.

Step 1a. Generating a deDANSy object: RNA-sequencing dataset containing only ENSEMBL/Entrez Gene IDs for each gene.

Both the DANSy and deDANSy classes use UniProt IDs for their analysis, but deDANSy has built-in methods to convert ENSEMBL or Entrez gene IDs to UniProt IDs. This conversion relies on a table queried from a pybiomart Dataset that contains all the required database IDs. An example of generating the deDANSy object for these datasets:

from pybiomart import Dataset

# For ENSEMBL IDs
bm_dataset = Dataset(host = 'http://useast.ensembl.org', name='hsapiens_gene_ensembl',)
gene_ID_conv = bm_dataset.query(attributes=['ensembl_gene_id','external_gene_name','uniprotswissprot'])

# Assuming the ENSEMBL gene ids in your RNA-seq results are under a column labeled ensembl_gene_id

# Generating the deDANSy
my_dedansy = ngramNets.DEdansy(dataset=your_RNA_seq_results,
                               id_conv=gene_ID_conv,
                               conv_col = 'Gene stable ID',
                               data_ids = 'ensembl_gene_id',
                               uniprot_ref = ref_df)

The deDANSy object will then convert the IDs and use the converted IDs to build the n-gram networks for analysis. To get the ID conversions, use the id_conversion_dict attribute.

# To find out which UniProt IDs correspond to which inputted gene IDs
my_dedansy.id_conversion_dict
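As an illustration of what such a gene-to-UniProt conversion involves, the sketch below builds a mapping from a small, made-up conversion table. The row format, the IDs, and the helper `build_id_map` are all hypothetical; the actual structure of id_conversion_dict may differ.

```python
from collections import defaultdict

# Made-up (gene ID, UniProt ID) rows; an empty string stands for a gene
# without a Swiss-Prot entry in the conversion table.
conversion_rows = [
    ("GENE_TOY_1", "UNIPROT_A"),
    ("GENE_TOY_2", "UNIPROT_B"),
    ("GENE_TOY_2", "UNIPROT_C"),  # one gene may map to several proteins
    ("GENE_TOY_3", ""),
]

def build_id_map(rows):
    """Group UniProt IDs by gene ID, skipping rows with no UniProt ID."""
    id_map = defaultdict(list)
    for gene_id, uniprot_id in rows:
        if uniprot_id and uniprot_id not in id_map[gene_id]:
            id_map[gene_id].append(uniprot_id)
    return dict(id_map)

id_map = build_id_map(conversion_rows)

# Genes that could not be converted are worth reporting before analysis.
unmapped = sorted({gene for gene, _ in conversion_rows if gene not in id_map})
print(id_map)
print(unmapped)
```

Checking for one-to-many and unmapped IDs before building the networks helps explain why the number of proteins analyzed can differ from the number of genes in the input dataset.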

Step 1b. Generating a deDANSy object: RNA-sequencing/proteomics dataset containing a column with UniProt IDs

If your dataset already has a column containing UniProt IDs, then you can build the deDANSy object as follows:

# Generating the deDANSy
my_dedansy = ngramNets.DEdansy(dataset=your_dataset_OI,
                               data_ids = 'uniprot_id_column_name',
                               uniprot_ref = ref_df,
                               run_conversion = False)

Step 2: Defining DEGs

Currently, deDANSy assumes you will use a log2 fold change and a p-value cutoff to define differentially expressed genes (DEGs) or proteins. To apply specific fold-change and p-value cutoffs:

my_dedansy.calc_DEG_ngrams(data_cols = ['fold_change_column','p-value_column'],
                           alpha = pval_cutoff,
                           fc_thres = fold_change_cutoff)

By default, the calc_DEG_ngrams function sets alpha to 0.05 and fc_thres to 1, so you can omit these arguments if you want to use those cutoffs. If you do not want to apply one of the cutoffs, set it to 0. (Note: this assumes your fold-change values include both positive and negative values.)
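The thresholding logic can be pictured with the following standalone sketch. It is not the calc_DEG_ngrams implementation: the function `classify_degs`, the record format, and the exact comparison operators (strict inequalities on both cutoffs) are assumptions made for illustration only.

```python
def classify_degs(records, alpha=0.05, fc_thres=1.0):
    """Split (gene_id, log2_fold_change, p_value) records into up/down lists.

    Assumed rule: a gene is a DEG when p_value < alpha and the absolute
    log2 fold change exceeds fc_thres.
    """
    up, down = [], []
    for gene_id, log2_fc, p_value in records:
        if p_value >= alpha:
            continue  # not significant at the chosen alpha
        if log2_fc > fc_thres:
            up.append(gene_id)
        elif log2_fc < -fc_thres:
            down.append(gene_id)
    return up, down

# Made-up differential expression results.
toy_results = [
    ("gene_1", 2.3, 0.001),
    ("gene_2", -1.7, 0.010),
    ("gene_3", 0.4, 0.001),   # significant but small fold change
    ("gene_4", 3.0, 0.200),   # large fold change but not significant
]

up_default, down_default = classify_degs(toy_results)
up_no_fc, down_no_fc = classify_degs(toy_results, fc_thres=0)
print(up_default, down_default)
print(up_no_fc, down_no_fc)
```

With a fold-change threshold of 0, the sign of the fold change alone decides the direction, which is why the data need both positive and negative values for both groups to be populated.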

Not recommended alternative method: If you have predefined DEGs, a second method is available to set them, but we recommend it only for users who need more control over a DEG definition that relies on other data values. This method does not create attributes in the deDANSy object to trace back how DEGs were calculated.

my_dedansy.set_DEG_ngrams(up_DEGs = your_up_DEGs,
                          down_DEGs = your_down_DEGs)

Step 3: Generating deDANSy Separation and Enriched n-gram Neighborhood Distribution Statistics

Currently, this is done by running the deDANSy_calculate.py file from the command line. We are in the process of making this a built-in method of the deDANSy class. An example call:

conda activate dansy
python deDANSy_calculate.py path_to_datasetOI/datasetOI.csv output_path comparison_1 comparison_2 -mp 8 -sN 100 -fN 50
conda deactivate

This performs Steps 1a and 2 from above with multiprocessing enabled, using 8 processes, 100 subsampled/random networks, and 50 false positive rate (FPR) trials. It calculates scores for both comparison 1 and comparison 2 using the default fold-change and p-value cutoffs. Note: This process can take from 30+ minutes to multiple hours depending on the number of FPR trials, the number of networks used, and whether multiprocessing is enabled.
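Because each run can be long, it may be convenient to assemble the calls in a script. The sketch below only builds and prints the command shown above; the helper names `build_dedansy_command` and `run_commands` are hypothetical, and the mapping of -mp, -sN, and -fN to processes, networks, and FPR trials follows the description above. It assumes the dansy environment is already active.

```python
import subprocess

def build_dedansy_command(dataset_csv, output_dir, comparisons,
                          processes=8, networks=100, fpr_trials=50):
    """Assemble the argument list for one deDANSy_calculate.py run."""
    return [
        "python", "deDANSy_calculate.py",
        dataset_csv, output_dir, *comparisons,
        "-mp", str(processes),
        "-sN", str(networks),
        "-fN", str(fpr_trials),
    ]

def run_commands(commands, dry_run=True):
    """Print each command; launch it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)

example_command = build_dedansy_command(
    "path_to_datasetOI/datasetOI.csv", "output_path",
    ["comparison_1", "comparison_2"],
)
run_commands([example_command])  # dry run: prints the command only
```

Passing dry_run=False would launch the runs one after another and stop at the first failure.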

Step 4: Plotting and Generating Final deDANSy Scores

Once the statistics are generated from Step 3, we can then plot the data by running the following:

import pandas as pd
from enrichment_plotting_helpers import *

# Importing the results
raw_results = pd.read_csv(deDANSy_results_file)
res = format_results(raw_results)
plot_functional_score(res)

This will produce a bubble plot, where the size of each bubble is the score (related to Cohen's d effect size) and the color indicates whether it was significant (including FPR correction). As with Step 3, we are actively working on making this a built-in method of the deDANSy class.
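For readers unfamiliar with Cohen's d, the sketch below computes the textbook pooled-standard-deviation version for two samples. It is a generic illustration with made-up numbers; the exact deDANSy score is only described here as related to Cohen's d, so this is not necessarily the formula the package uses.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Made-up values standing in for a statistic measured in two conditions.
toy_condition = [2, 4, 6]
toy_background = [1, 3, 5]

effect = cohens_d(toy_condition, toy_background)
print(effect)
```

Here both samples have a variance of 4 and their means differ by 1, so the effect size is 1 / 2 = 0.5.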

Example applications

For applications of DANSy or deDANSy, please see our DANSy_Applications repo. There, you will find Jupyter notebooks on applications to the whole proteome, the convergence of grammar during the evolution of reversible post-translational modification systems, cancer fusion genes, and differential gene expression from RNA-sequencing results (for deDANSy specifically).
