GitHub - e-south/deg2tfbs: Pipeline for differential expression analysis (DEGs), transcription factor mapping, and TFBS identification.

deg2tfbs

deg2tfbs is a pipeline to derive transcription factor binding sites (TFBSs) from comparative RNA-seq and proteomic data, focusing on E. coli but extendable to other organisms. The process identifies differentially expressed genes (DEGs) between experimental conditions, maps them to upstream transcription factors (TFs) using RegulonDB or EcoCyc databases, and then retrieves cognate transcription factor binding sites (TFBSs) if available.

Installation

Prerequisites

Conda or Miniconda
(Optional) Mamba for faster dependency resolution

Step 1. Clone the repository

git clone https://github.com/e-south/deg2tfbs.git
cd deg2tfbs

Step 2. Create the Conda environment

conda env create -f environment.yml

This command will read the environment.yml file and create a Conda environment named deg2tfbs with all the necessary packages installed.

Step 3. Activate the environment

conda activate deg2tfbs

Step 4. Install the Local deg2tfbs Package in Editable Mode

(deg2tfbs) cd deg2tfbs
(deg2tfbs) pip install -e .

This allows Python to recognize deg2tfbs as an installed package while still linking directly to the source files in the repository. Any changes made to the source code will be immediately available without requiring reinstallation.

Directory Layout

deg2tfbs/
├── README.md
├── pyproject.toml
├── environment.yml
├── LICENSE
└── src/
    └── deg2tfbs/
        ├── __init__.py
        ├── main.py                         # CLI entry point
        ├── configs/                        # User-defined configurations
        │   └── example.yaml                # Customize to process different DEGs and retrieve different TFBSs
        └── pipeline/
            ├── utils.py                    # Dictionary of paths to datasets (from dnadesign-data)
            ├── degfetcher/                 # Step 1: Isolate DEGs
            │   ├── __init__.py 
            │   ├── <dataset>_module.py     # Each omics dataset has its own respective module
            │   └── degbatch_<date>/        # Batch of DEGs retrieved from degfetcher in a given run
            │       ├── csvs                
            │       └── plots
            ├── tffetcher/                  # Step 2: Map DEGs to TFs
            │   ├── __init__.py    
            │   ├── tffetcher.py            # Coordinates TF retrieval for a given DEG batch
            │   ├── parsers/                
            │   │   ├── __init__.py
            │   │   ├── ecocyc_parser.py
            │   │   ├── regdb_parser.py     
            │   │   └── ...                 
            │   └── tfbatch_<date>/         
            │       └── deg2tf_mapping.csv  
            └── tfbsfetcher/                # Step 3: Map TFs to TFBSs
                ├── __init__.py    
                ├── tfbsfetcher.py          # Coordinates TFBS retrieval for a given set of TFs
                ├── parsers/                
                │   ├── __init__.py
                │   ├── ecocyc_tfbs_parser.py
                │   ├── regdb_tfbs_parser.py    
                │   └── ...                 
                └── tfbsbatch_<date>/

Pipeline Steps

degfetcher (Step 1: Isolate DEGs)

Loads comparative omics datasets from the dnadesign-dna repository.

Produces tidy CSV outputs, such as ceroni_upregulated_degs.csv, containing columns:

gene: DEG identifier.
source: The source dataset (e.g., "ceroni").
thresholds: Optional column specifying user-defined DEG thresholds from the config file.
comparison: The experimental context defining the target vs. reference condition.

For example:

gene	source	thresholds	comparison
groS	ceroni	2.5	pLys-M1_versus_pLys
gadY	ceroni	2.5	pLys-M1_versus_pLys
azuC	ceroni	2.5	pLys-M1_versus_pLys
yadL	ceroni	2.5	pPD864-LacZ_versus_pPD864
rclC	ceroni	2.5	pPD864-LacZ_versus_pPD864
rclR	ceroni	2.5	pPD864-LacZ_versus_pPD864

tffetcher (Step 2: Map DEGs to TFs)

Reads CSV outputs, saved in batches, from degfetcher.
Loads regulatory network-type datasets curated from a resource (e.g., from RegulonDB or EcoCyc).
Fetches TFs that reportedly regulate these DEGs (the “regulatees”).

Produces a tidy CSV output, deg2tf_mapping, containing columns:

gene - DEG identifier.
regulator: TFs reported to regulated target gene.
polarity: Reported "type" of regulation applied to gene, if available.
source: Record of which regulatory network dataset(s) were used.
is_global_regulator: Boolean indicating whether the regulator is classified as a global regulator by EcoCyc.
is_sigma_factor: Boolean indicating whether the regulator is classified as a sigma factor by EcoCyc.
deg_source: Indicates the source dataset(s) in which this gene appears, as processed by degfetcher.

For example:

gene	regulator	polarity	source	is_global_regulator	is_sigma_factor	DEG Source
aaea	crp	+	ecocyc_28_AND_regdb_13	yes	no	houser_up
aaea	aaer	+	ecocyc_28	no	no	houser_up
aaeb	crp	+	ecocyc_28_AND_regdb_13	yes	no	houser_up
abga	nac	-	ecocyc_28_AND_regdb_13	yes	no	ceroni_up
acca	accd	-	ecocyc_28	no	no	houser_down
acca	rpod	+	ecocyc_28	no	yes	houser_down
adia	adiy	+	ecocyc_28_AND_regdb_13	no	no	houser_up-lu_up

tfbsfetcher (Step 3: Map TFs to TFBSs)

Reads CSV outputs, saved in batches, from tffetcher.
Loads TFBS-type data curated from a resource (e.g. RegulonDB .txt or .csv).
For each TF identified in step 2, fetch the corresponding binding site(s).

Saves a final CSV output, tf2tfbs_mapping, containing:

Column	Description
tf	The transcription factor name (lowercase)
tfbs	The TF binding site sequence (normalized to uppercase letters)
gene	The DEG (or gene) associated with this TF mapping
deg_source	A hyphen-delimited string indicating the source DEG datasets. For example, `houser_down-schmidt_down-durfee_down-bie_up` shows that multiple DEG datasets contributed to this mapping.
polarity	The regulatory polarity (e.g., "-" for repression, "+" for activation)
tfbs_source	A string indicating which TFBS resource(s) supplied the binding site (e.g., "regdb" or "ecocyc")
is_sigma_factor	Boolean flag (e.g., "no") indicating whether the TF is a sigma factor
is_global_regulator	Boolean flag (e.g., "no") indicating whether the TF is a global regulator

Caution: In tfbsfetcher.py, there is a final TFBS deduplication check that is stringent and does not account for cases where binding sites differ by ±1 nucleotide at either end.

Custom DEG CSV Grouping

In addition to a default “process all available DEGs” approach, you can specify custom combinations of DEG CSVs using the deg_csv_groups key within the tffetcher configuration. This functionality lets you define exactly which subset(s) of DEG CSVs to process downstream. There are three ways to define a group:

Global:
An empty group (e.g. all: {}) processes all DEG CSV files in a degbatch subdirectory.
Filter-Based:
Define a filter pattern (e.g. all_up: { filter: "up" }) to select only those CSVs whose filenames contain the specified substring.

Explicit Combination:
Specify an explicit list of files. For example:

deg_csv_groups:
  heat_shock_up:
    files:
      - { file: "kim_upregulated_degs.csv", comparison: "42C_versus_control" }
      - { file: "zhang_upregulated_degs.csv", comparison: "sigma32-I54N_expression_versus_control" }

Running the Pipeline

Clone the dnadesign-data repository to access a curated set of comparative omics datasets. Placing it as a sibling directory to deg2tfbs enables degfetcher to generate custom DEG tables from these sources.

Update configs/mycustomparams.yaml with the desired I/O paths, batch IDs, and custom DEG CSV groups. For instance:

# deg2tfbs/configs/example.yaml
 pipeline:
     name: "default"

     stages:
         degfetcher:
             root_dir: "pipeline/degfetcher"
             batch_id: "degbatch_20250130"
             modules:
                 - ceroni
                 - mori

         tffetcher:
             root_dir: "pipeline/tffetcher"
             batch_id: "tfbatch_20250130"
             input:
                 deg_batch_id: "degbatch_20250130"
                 deg_csv_subdir: "csvs"
                 # Define custom groups:
                 deg_csv_groups:
                     all: {}  # Process all CSVs
                     all_up:
                         filter: "up"  # Process all CSVs with "up" in the name
                     heat_shock_up:
                         files:
                         - { file: "kim_upregulated_degs.csv", comparison: "42C_versus_control" }
                         - { file: "zhang_upregulated_degs.csv", comparison: "s32-I54N_vs_control" }
             sources:
                 regulatory_networks:
                     ecocyc:
                         path: "ecocyc_28_reg_network"
                         enabled: true
                         parser: "ecocyc_network_v28-5"
             params:
                 network_strategy: "union"
                 include_master_regulators: true
                 include_sigma_factors: true

         tfbsfetcher:
             root_dir: "pipeline/tfbsfetcher"
             batch_id: "tfbsbatch_20250130"
             input:
                 tf_batch_id: "tfbatch_20250130"  # This pointer is updated per group
             sources:
                 binding_sites:
                 ecocyc:
                     path: "ecocyc_28_tfbs_smart_table"
                     ecocyc_motifs: true
                 regdb:
                     path: "regulondb_13_tf_ri_set"
                     regulondb_pssm: false
             params:
                 pass

After configuring your mycustomparams.yaml, run the pipeline as follows:
```
cd deg2tfbs
python main.py # Make sure that the bottom of this module references your config file.
```
- degfetcher processes the datasets and outputs DEG CSVs.
- tffetcher then processes these CSVs in each custom group (using deg_csv_groups) to produce distinct deg2tf_mapping.csv files.
- tfbsfetcher uses the appropriate TF mapping to produce a final tf2tfbs_mapping.csv for each group.

Extendability

The degfetcher, tffetcher, and tfbsfetcher steps are designed to be extendable. You can add your own dataset ingestion modules for degfetcher or develop custom parser modules for tffetcher and tfbsfetcher.

To use your own custom input genes, create a CSV with required columns: 'gene', 'source', 'comparison'. Then place it in a "batch" subdirectory within degfetcher, and point to it in your custom config file. Alternatively, if you added a new gene table in a cloned dnadesign-data repository, update the utils.py dictionary in pipeline to ensure datasets can be found via DATA_FILES[...].

When creating a custom parser, make sure it conforms to one of the following interfaces:

For tffetcher parsers:
Implement: Dict[gene, Set[Tuple[TF, polarity]]])
```
parse(...) -> Dict[str, Set[Tuple[str, str]]]
```
This signature allows tffetcher.py to process new regulatory network data the same.
For tfbsfetcher parsers:
Implement: Dict[TF, Set[TFBS_1, TFBS_2, TFBS_3, ...]]
```
parse(...) -> Dict[str, Set[str]]
```
This signature allows tfbsfetcher.py to process new TFBS data the same.

Analysis of Pipeline Results

The analysis folder contains scripts that visualize the results generated by the pipeline. These visualizations help answer the question:

"Does fetching different sets of DEGs lead to different sets of transcription factors?"

Generating TF Rosters

The analysis scripts read the tf2tfbs_mapping.csv files produced by tfbsfetcher. For each of these files—which are derived from a specific set of DEGs—a binary TF roster is generated. Each roster is a vector whose length equals the total number of unique transcription factors identified by the pipeline. An option is provided to remove “intersections” (i.e., to exclude transcription factors that appear in two defined sets) before further analysis. You specify which directories to include by listing them under the comparisons key in your configuration file (e.g., configs/example.yaml):

analysis:
  comparisons:
    - [ "tfbsbatch_20250209_ciprofloxacin_up", "tfbsbatch_20250209_ciprofloxacin_down" ]
    - [ "tfbsbatch_20250209_ampicillin_up", "tfbsbatch_20250209_ampicillin_down" ]
    - ...
  # When true, transcription factors that appear in both comparison groups 
  # (i.e., the intersection) are removed from the roster.
  exclude_intersections: true

Each roster is saved as a tab-delimited file with columns such as:

regulator   group1   group2   intersection   group1_only   group2_only
acrr        1        0        0              1             0
ada         0        1        0              0             1
agar        1        0        0              1             0
aidb        0        1        0              0             1
allr        0        0        0              0             0
alls        0        0        0              0             0
arac        0        0        0              0             0
arca        1        1        1              0             0
argp        0        0        0              0             0

Visualization

The main.py script in analysis produces several plots that are saved in analysis/plots. One of two possible subdirectories are created based on the exclude_intersections flag in your configuration:

all_regs: When intersecting transcription factors are not removed.
intersects_removed: When intersecting transcription factors are removed.

A heatmap is generated to provide a global view of all the TF rosters (representing the presence or absence of transcription factors). Each roster is treated as a multidimensional vector and then clustered using the Leiden algorithm. For example:

This clustering approach helps to determine whether different sets of DEGs, enriched under distinct environmental conditions, are associated with similar transcription factors. For instance, the TF rosters identified for DEGs downregulated in the presence of kanamycin and ciprofloxacin are similar. In contrast, these rosters are distinct from the transcription factors associated with DEGs upregulated under nutrient limitation.

Beyond viewing all TF rosters together, the value counts of unique TF binding sites for each TF roster (derived from tf2tfbs_mapping.csv files) can also be visualized. For example:

Data Source

deg2tfbs is designed to reference data from dnadesign-data. Update the config file to point to the correct data paths.

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
images		images
src/deg2tfbs		src/deg2tfbs
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

deg2tfbs

Installation

Directory Layout

Pipeline Steps

Custom DEG CSV Grouping

Running the Pipeline

Extendability

Analysis of Pipeline Results

Generating TF Rosters

Visualization

Data Source

About

Releases

Packages

Languages

License

e-south/deg2tfbs

Folders and files

Latest commit

History

Repository files navigation

deg2tfbs

Installation

Directory Layout

Pipeline Steps

Custom DEG CSV Grouping

Running the Pipeline

Extendability

Analysis of Pipeline Results

Generating TF Rosters

Visualization

Data Source

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages