Domain Architecture Network Syntax (DANSy)

DANSy combines n-gram analysis, a linguistic technique, with network theory and applies them to protein domain architectures. It represents the proteome as a network that abstracts the functional connections between proteins, and describes either proteome-wide features (base DANSy) or phenotype-specific changes from differential expression results (deDANSy).
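To make the idea concrete, here is a minimal sketch of n-gram extraction on domain architectures. It is not the DANSy implementation: it assumes each architecture is an ordered list of domain names and that two n-grams are linked when they occur in the same protein. The toy proteins, the function names, and the linking rule are all illustrative assumptions.

```python
from itertools import combinations


def domain_ngrams(architecture, max_n):
    """Return every contiguous run of 1..max_n domains, shortest runs first."""
    ngrams = []
    for n in range(1, min(max_n, len(architecture)) + 1):
        for start in range(len(architecture) - n + 1):
            ngrams.append(tuple(architecture[start:start + n]))
    return ngrams


def cooccurrence_edges(architectures, max_n):
    """Link two n-grams when at least one protein contains both (assumed rule)."""
    edges = set()
    for architecture in architectures.values():
        unique = sorted(set(domain_ngrams(architecture, max_n)))
        edges.update(combinations(unique, 2))
    return edges


# Toy architectures keyed by made-up protein names.
toy_architectures = {
    "protA": ["SH3", "SH2", "Kinase"],
    "protB": ["SH2", "Kinase"],
}

ngrams_a = domain_ngrams(toy_architectures["protA"], max_n=3)
edges = cooccurrence_edges(toy_architectures, max_n=3)
print(len(ngrams_a), "n-grams in protA;", len(edges), "edges in the toy network")
```

The n-grams become the nodes of the network, so proteins that share a domain combination end up connected through the same node.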

DANSy Overview

deDANSy Overview

How to cite: Please cite our bioRxiv paper, which contains further details and specific applications of the code provided here.

Documentation: Coming soon

Getting started

Create a virtual environment containing all the dependencies for the analysis using the following code in a terminal.

conda env create -f dansy.yml

Activate the environment with conda activate dansy before running scripts, or select the dansy kernel for Jupyter notebooks.

Proteome Reference File

DANSy relies on reference files generated by CoDIAC. We have provided a reference file, which was generated on May 12th, 2025, and will be the default file used for analysis.

If you wish to generate the most up-to-date reference file for your analysis, you will need to take the following steps. First, download the SwissProt ID list from Gencode and place it in the main directory of your local copy of this repo. Then, open the whole_proteome_reference.py file and change the reference file suffix variable to the current date. Finally, run the following code in a terminal to create the environment that includes CoDIAC, which will query UniProt and InterPro for the domain architectures. (Note: This can take up to 2 hours after a fresh install, as it will also establish a pybiomart sqlite database.)

conda env create -f codiac.yml
conda activate codiac-env
python scripts/whole_proteome_reference.py
conda deactivate

The DANSy class

To run DANSy on a set of proteins of interest:

import ngramNets
import generateCompleteProteome

# Generate the reference proteome dataframe to save time.
ref_df, _ = generateCompleteProteome.import_proteome_files(reference_file_suffix = your_reference_file_suffix)

# Generating the DANSy object
my_dansy = ngramNets.dansy(ref=ref_df, protsOI = my_proteins, n = 10)
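The protsOI argument above is given the placeholder my_proteins. As a minimal sketch, assuming it accepts a list of UniProt accession strings (an assumption; check the class docstring), the helper below cleans a raw list of IDs before it is passed in. The function name clean_uniprot_ids is hypothetical and not part of DANSy, and dropping isoform suffixes is also an assumption about what the analysis expects. The pattern is the accession format published by UniProt.

```python
import re

# Accession pattern as documented by UniProt (6- or 10-character accessions).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def clean_uniprot_ids(raw_ids):
    """Strip whitespace, drop isoform suffixes, and keep unique valid accessions."""
    cleaned = []
    seen = set()
    for raw in raw_ids:
        # "P12931-2" becomes "P12931" (assumes canonical accessions are wanted).
        accession = raw.strip().upper().split("-")[0]
        if UNIPROT_ACCESSION.fullmatch(accession) and accession not in seen:
            seen.add(accession)
            cleaned.append(accession)
    return cleaned


# Hypothetical input: EGFR, SRC, an SRC isoform, and a malformed entry.
my_proteins = clean_uniprot_ids(["P00533", " p12931 ", "P12931-2", "not_an_id"])
print(my_proteins)
```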

The DANSy object can then be used for downstream analysis and has several built-in methods to help. These include:

# To show the network
my_dansy.draw_network()

# To get a summary of the network and n-grams
my_dansy.summary()

# To get specific information on protein(s) of interest
my_dansy.retrieve_protein_info(prot = uniprot_ids_of_interest)

# To get information on the proteins containing a specific n-gram
my_dansy.retrieve_protein_info(ngram = ngram_of_interest)

Further, any analysis that can be done on a networkx Graph can be performed on the DANSy object through its G attribute, as shown below:

import networkx as nx

nx.number_of_nodes(my_dansy.G)
nx.spring_layout(my_dansy.G)
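As a further illustration of working with the G attribute, the sketch below runs two standard networkx analyses on a small hand-built graph that stands in for my_dansy.G. The node labels, including the "|" separator between domains, are made up for the example and do not reflect how DANSy names its nodes.

```python
import networkx as nx

# Stand-in for my_dansy.G: a small hand-built graph of domain n-grams.
G = nx.Graph()
G.add_edges_from([
    ("SH2", "SH2|Kinase"),
    ("Kinase", "SH2|Kinase"),
    ("PDZ", "PDZ|PDZ"),
])

# Connected components group n-grams that are reachable from one another.
components = sorted(nx.connected_components(G), key=len, reverse=True)

# Degree ranks n-grams by how many other n-grams they connect to.
degree_by_node = dict(G.degree())
most_connected = max(degree_by_node, key=degree_by_node.get)

print(len(components), "components; most connected node:", most_connected)
```

With a real DANSy object, replace G with my_dansy.G and keep the rest unchanged.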

The deDANSy class

A deDANSy object is a subclass of DANSy, so it requires many of the same inputs. However, you must also provide a dataset containing expression data, which is used to designate individual n-grams as either up- or down-regulated. We provide a quick-start version of the methods below, but recommend checking the Tutorial (coming soon; in the meantime, visit DANSy_Applications) for a more in-depth walkthrough.

Step 1a. Generating a deDANSy: RNA-sequencing dataset containing only ENSEMBL/Entrez Gene IDs for each gene.

Both the DANSy and deDANSy classes use UniProt IDs for their analysis, but deDANSy has built-in methods to convert ENSEMBL or Entrez gene IDs to UniProt IDs. The conversion relies on a table queried from a pybiomart Dataset that contains all the database IDs needed. An example of generating the deDANSy object for these datasets:

from pybiomart import Dataset

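# For ENSEMBL IDs
bm_dataset = Dataset(host='http://useast.ensembl.org', name='hsapiens_gene_ensembl')
# By default, query() labels the returned columns with display names
# (e.g. 'Gene stable ID'), which is why conv_col below uses that label.
gene_ID_conv = bm_dataset.query(attributes=['ensembl_gene_id','external_gene_name','uniprotswissprot'])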

# Assuming the ENSEMBL gene ids in your RNA-seq results are under a column labeled ensembl_gene_id

# Generating the deDANSy
my_dedansy = ngramNets.DEdansy(dataset=your_RNA_seq_results,
                               id_conv=gene_ID_conv,
                               conv_col = 'Gene stable ID',
                               data_ids = 'ensembl_gene_id',
                               uniprot_ref = ref_df)

The deDANSy object will then convert the IDs and use the resulting UniProt IDs to build the n-gram networks and run the analysis. To inspect the ID conversions, use the id_conversion_dict attribute.

# To find out which UniProt IDs correspond to which inputted gene IDs
my_dedansy.id_conversion_dict
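The exact structure of id_conversion_dict is not described here. To illustrate the kind of mapping involved, the sketch below builds a gene-ID-to-UniProt dictionary from conversion rows and lists the gene IDs that have no UniProt match. The function name, the row layout, and the placeholder IDs are all assumptions made for the example, not deDANSy internals.

```python
from collections import defaultdict


def build_conversion(rows):
    """Map each gene ID to its UniProt IDs, skipping rows with no UniProt ID."""
    conversion = defaultdict(list)
    for gene_id, uniprot_id in rows:
        if uniprot_id and uniprot_id not in conversion[gene_id]:
            conversion[gene_id].append(uniprot_id)
    return dict(conversion)


# Placeholder rows of (gene ID, UniProt ID); None marks a missing UniProt ID.
rows = [
    ("gene_1", "UP_A"),
    ("gene_1", "UP_B"),
    ("gene_2", None),
    ("gene_3", "UP_C"),
    ("gene_3", "UP_C"),
]

conversion = build_conversion(rows)
# Gene IDs that never received a UniProt ID would drop out of the analysis.
unmapped = sorted({gene_id for gene_id, _ in rows} - set(conversion))
print(conversion, unmapped)
```

Checking the unmapped IDs early shows how much of a dataset is lost in conversion before any networks are built.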

Step 1b. Generating a deDANSy: RNA-sequencing/proteomics dataset containing a column with UniProt IDs

If your dataset already has a column containing UniProt IDs, then you can build the deDANSy object as follows:

# Generating the deDANSy
my_dedansy = ngramNets.DEdansy(dataset=your_dataset_OI,
                               data_ids = 'uniprot_id_column_name',
                               uniprot_ref = ref_df,
                               run_conversion = False)

Step 2: Defining DEGs

Currently, deDANSy assumes you will use a log2 fold change cutoff and a p-value cutoff to define differentially expressed genes (DEGs) or proteins. To apply specific cutoffs:

my_dedansy.calc_DEG_ngrams(data_cols = ['fold_change_column','p-value_column'],
                           alpha = pval_cutoff,
                           fc_thres = fold_change_cutoff)

By default, the calc_DEG_ngrams function sets alpha to 0.05 and fc_thres to 1, so you can omit both arguments if those cutoffs suit your analysis. If you do not want to apply one of the cutoffs, set that cutoff to 0. (Note: this assumes your fold-change values include both positive and negative values.)
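To show how a fold-change cutoff and a p-value cutoff combine, here is a minimal sketch of one common way to label genes. It is not the code inside calc_DEG_ngrams: the defaults mirror the values stated above, but the function name, the labels, and whether each comparison is inclusive are assumptions.

```python
def classify_gene(log2_fc, p_value, alpha=0.05, fc_thres=1.0):
    """Label one gene as 'up', 'down', or 'ns' (not significant)."""
    if p_value > alpha:
        return "ns"
    if log2_fc >= fc_thres:
        return "up"
    if log2_fc <= -fc_thres:
        return "down"
    return "ns"


# Toy results: gene name -> (log2 fold change, p-value).
toy_results = {
    "geneA": (2.3, 0.001),   # large increase, significant
    "geneB": (-1.5, 0.02),   # large decrease, significant
    "geneC": (0.4, 0.001),   # significant but below the fold-change cutoff
    "geneD": (3.0, 0.2),     # large increase but not significant
}

labels = {gene: classify_gene(fc, p) for gene, (fc, p) in toy_results.items()}
print(labels)
```

The symmetric test on the fold change is why the input needs both positive and negative values: a dataset with only positive fold changes could never yield down-regulated genes.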

Alternative method (not recommended): If you have predefined DEGs, a second method is available to set them directly. We recommend it only for users who need more control over the DEG definition, for example when it relies on other data values. This method does not create attributes in the deDANSy object that trace back how the DEGs were calculated.

my_dedansy.set_DEG_ngrams(up_DEGs = your_up_DEGs,
                          down_DEGs = your_down_DEGs)

Step 3: Generating deDANSy Separation and Enriched n-gram Neighborhood Distribution Statistics

Currently, this is done by running the deDANSy_calculate.py file from the command line. We are in the process of making this a built-in method of the deDANSy class. An example call:

conda activate dansy
python deDANSy_calculate.py path_to_datasetOI/datasetOI.csv output_path comparison_1 comparison_2 -mp 8 -sN 100 -fN 50
conda deactivate

This would perform steps 1a and 2 from above with multiprocessing enabled, using 8 processes, 100 subsampled/random networks, and 50 false positive rate (FPR) trials. It would calculate scores for both comparison 1 and comparison 2 using the default fold-change and p-value cutoffs. Note: This process can take from 30+ minutes to multiple hours, depending on the number of FPR trials, the number of networks used, and whether multiprocessing is enabled.
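If you need to run the script for several datasets, it can help to assemble the command in Python. The sketch below only rebuilds the example call shown above. The helper name and its keyword arguments are hypothetical, and the mapping of -mp, -sN, and -fN to processes, networks, and FPR trials is inferred from the description rather than from the script's help text.

```python
import shlex


def build_dedansy_command(dataset, output_dir, comparisons,
                          processes=8, n_networks=100, n_fpr_trials=50):
    """Assemble the argument list for one deDANSy_calculate.py run."""
    return [
        "python", "deDANSy_calculate.py",
        str(dataset), str(output_dir), *comparisons,
        "-mp", str(processes),
        "-sN", str(n_networks),
        "-fN", str(n_fpr_trials),
    ]


command = build_dedansy_command(
    "path_to_datasetOI/datasetOI.csv", "output_path",
    ["comparison_1", "comparison_2"],
)
command_line = shlex.join(command)
print(command_line)

# To launch it from inside the activated dansy environment:
# import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when dataset paths contain spaces.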

Step 4: Plotting and Generating Final deDANSy Scores

Once the statistics are generated in Step 3, we can plot the data by running the following:

import pandas as pd
from enrichment_plotting_helpers import *

# Importing the results
raw_results = pd.read_csv(deDANSy_results_file)
res = format_results(raw_results)
plot_functional_score(res)

This will produce a bubble plot, where the size of each bubble is the score (related to Cohen's d effect size) and the color indicates whether it was significant (including FPR correction). As with Step 3, we are actively working on making this a built-in method of the deDANSy class.
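The score is described only as related to Cohen's d; how deDANSy derives it is not given here. For readers unfamiliar with the effect size itself, the sketch below computes the textbook Cohen's d with a pooled standard deviation on made-up numbers. The function and variable names are illustrative and not part of the deDANSy code.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (needs >= 2 values per group)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up values for two groups being compared.
observed = [5.0, 6.0, 7.0]
background = [2.0, 3.0, 4.0]

d = cohens_d(observed, background)
print(round(d, 3))
```

Because d expresses the difference between group means in units of their pooled standard deviation, larger bubbles correspond to stronger separation between the compared distributions.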

Example applications

For applications of DANSy or deDANSy, please see our DANSy_Applications repo. There, you will find Jupyter notebooks covering the whole proteome, the convergence of grammar during the evolution of reversible post-translational modification systems, cancer fusion genes, and differential gene expression from RNA-sequencing results (for deDANSy specifically).
