This repository contains a collection of Python modules and bioinformatic pipelines related to DNA sequence design.
```text
dnadesign/
├── README.md                     # High-level project documentation
└── src/
    └── dnadesign/
        ├── configs/
        │   └── example.yaml      # Global configuration for all pipelines
        ├── utils.py              # Shared utilities (e.g., config loading, common functions)
        ├── seqfetcher/           # Data ingestion modules (one per dataset)
        │   └── <dataset>_module.py
        ├── densegen/
        │   ├── main.py           # CLI entry point for densegen pipeline
        │   └── ...
        ├── sequences/
        │   ├── seqmanager.py     # Tool for validating and inspecting .pt files
        │   └── seqbatch_<name>/  # Each subdirectory contains:
        │       ├── <batch>.pt    # Torch file with a list-of-dicts (each dict represents a sequence)
        │       └── summary.yaml  # YAML summary of the batch (metadata, parameters, runtime)
        ├── evoinference/
        │   ├── main.py           # CLI entry point for evoinference pipeline
        │   └── ...
        └── clustering/
            ├── main.py           # CLI entry point for clustering pipeline
            └── ...
```
- seqfetcher is a data ingestion pipeline designed to reference a sibling directory, dnadesign-data, which includes bacterial promoter engineering datasets curated from primary literature, along with experimental datasets detailing other promoters and transcription factor binding sites derived from RegulonDB and EcoCyc.
- densegen is a DNA sequence design pipeline built on the integer linear programming framework from the dense-arrays package. It assembles batches of synthetic promoters with densely packed transcription factor binding sites. The pipeline references curated datasets from the deg2tfbs repository, subsampling dozens of binding sites for the solver while enforcing time limits to prevent stalling. (A hedged usage sketch of the underlying dense-arrays optimization follows this list.)
- sequences serves as the central storage location for nucleotide sequences within the project, organizing and updating outputs from seqfetcher, densegen, and evoinference into a standardized data structure. Subdirectories are prefixed with seqbatch or densebatch to indicate their source and contain both a `.yaml` file, which provides a batch summary, and a corresponding `.pt` file storing sequences and metadata. Each sequence file is structured as a list of dictionaries, following this format:

  ```python
  example_sequence_entry = [
      {
          "id": "90b4e54f-b5f9-48ef-882a-8763653ae826",
          "meta_date_accessed": "2025-02-19T12:01:30.602571",
          "meta_source": "deg2tfbs_all_DEG_sets",
          "sequence": "gtactgCTGCAAGATAGTGTGAATGACGTTCAATATAATGGCTGATCTTATTTCCAGGAAACCGTTGCCACA",
          "meta_type": "dense-array",
          "evo2_logits_mean_pooled": tensor([[[-10.3750, 10.3750, ..., 10.3750, 10.3750]]], dtype=torch.bfloat16),
          "evo2_logits_shape": [1, 512],
      },
      # Additional dictionary entries extend the list
  ]
  ```

  Note: To process custom sequences through downstream modules, format your data as a list of dictionaries matching the structure above and save it as a `.pt` file (a minimal save/load sketch follows this list).
- evoinference is a wrapper for Evo 2 (checkpoint: `evo2_7b`), a genomic foundation model for molecular-to-genome-scale modeling and design. This pipeline processes batches of `.pt` files from the sibling sequences directory, passing each sequence through Evo 2 and extracting tensors derived from output logits or intermediate-layer embeddings. The extracted data is then saved in place as additional keys within the original `.pt` file (the `.pt` sketch after this list illustrates the in-place update pattern).
- clustering utilizes Scanpy for cluster analysis on nucleotide sequences stored in the sibling sequences directory. By default, it uses the mean-pooled output logits of Evo 2 along the sequence dimension as input. The pipeline generates UMAP embeddings, applies Leiden clustering, and supports downstream analyses such as cluster composition and diversity assessment (see the Scanpy sketch below).
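Below is a minimal, hedged sketch of the kind of dense-array optimization that densegen drives. The `Optimizer` constructor arguments and the `optimal(solver=...)` call are assumptions about the dense-arrays API rather than verbatim usage, and the binding-site strings are toy values; consult the dense-arrays documentation for the actual interface. densegen itself adds dataset-driven subsampling and solver time limits on top of this.

```python
# Hedged sketch: the Optimizer constructor arguments and the optimal() call are
# assumptions about the dense-arrays API, not verbatim usage from densegen.
import random

import dense_arrays as da

# Toy library of transcription factor binding sites; densegen instead
# subsamples dozens of curated sites (e.g., from deg2tfbs) per solver run.
binding_sites = ["TTGACA", "TATAAT", "TGTGAGTTAGCTCACT", "AAAGTGTGACCTACTG"]
subsample = random.sample(binding_sites, k=3)

opt = da.Optimizer(library=subsample, sequence_length=60)  # assumed signature
best = opt.optimal(solver="GUROBI")                        # assumed solver hook
print(best)
```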
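As a concrete companion to the `.pt` format and the in-place update pattern described above, here is a minimal sketch using only `torch.save` and `torch.load`. The file name, metadata values, and the zero-filled pooled tensor are illustrative placeholders, not values produced by the pipelines.

```python
import uuid
from datetime import datetime

import torch

# Build a custom batch as a list of dictionaries matching the schema above.
batch = [
    {
        "id": str(uuid.uuid4()),
        "meta_date_accessed": datetime.now().isoformat(),
        "meta_source": "my_custom_source",   # illustrative value
        "meta_type": "custom",               # illustrative value
        "sequence": "GTACTGCTGCAAGATAGTGTGAATGACGTTCAATATAATGGCTGATCTT",
    }
]
torch.save(batch, "seqbatch_custom.pt")      # illustrative file name

# Downstream modules (e.g., evoinference) read the list back, attach new keys,
# and save the file in place.
entries = torch.load("seqbatch_custom.pt")
for entry in entries:
    pooled = torch.zeros(1, 512, dtype=torch.bfloat16)  # placeholder for real Evo 2 output
    entry["evo2_logits_mean_pooled"] = pooled
    entry["evo2_logits_shape"] = list(pooled.shape)
torch.save(entries, "seqbatch_custom.pt")
```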
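Likewise, a hedged sketch of the clustering step using standard Scanpy calls. It assumes the mean-pooled Evo 2 logits from a `.pt` batch have been stacked into a sequences-by-features matrix; the path below is illustrative, and the clustering pipeline's actual preprocessing and parameters may differ.

```python
import anndata as ad
import numpy as np
import scanpy as sc
import torch

# Stack mean-pooled Evo 2 logits into an (n_sequences, n_features) matrix.
entries = torch.load("sequences/seqbatch_example/seqbatch_example.pt")  # illustrative path
X = np.vstack(
    [e["evo2_logits_mean_pooled"].to(torch.float32).reshape(1, -1).numpy() for e in entries]
)

adata = ad.AnnData(X)
adata.obs["meta_type"] = [e.get("meta_type", "unknown") for e in entries]

sc.pp.neighbors(adata, use_rep="X")               # kNN graph on the raw embedding
sc.tl.umap(adata)                                 # UMAP embedding
sc.tl.leiden(adata)                               # Leiden clustering
sc.pl.umap(adata, color=["leiden", "meta_type"])  # inspect cluster composition
```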
The following local installation is appropriate for workflows that do not require heavy dense-array computations or Evo 2 inference.
- Create and Activate a Conda Environment

  ```bash
  conda create -n dnadesign_local python=3.11 -y
  conda activate dnadesign_local
  ```
- Install Dependencies

  ```bash
  conda install pytorch torchvision torchaudio scanpy=1.10.3 seaborn numpy pandas matplotlib pytest pyyaml leidenalg igraph openpyxl xlrd -c conda-forge -y
  ```
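  Optionally (not part of the original instructions), a quick import check confirms that the core dependencies resolve inside the new environment:

  ```python
  # Run inside the activated dnadesign_local environment.
  import scanpy
  import torch

  print("torch:", torch.__version__)
  print("scanpy:", scanpy.__version__)
  ```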
- Clone and Install the `dnadesign` Repository

  ```bash
  git clone https://github.com/e-south/dnadesign.git
  cd dnadesign
  pip install -e .   # Install the local dnadesign package in editable mode
  ```

  Installing in editable mode ensures that changes to the source files are immediately reflected without needing a reinstall.
- (Optional) Clone the `dense-arrays` Package

  The densegen workflow relies on the dense-arrays package. Install it as a sibling directory to `dnadesign`.

  ```bash
  git clone https://gitlab.com/dunloplab/dense-arrays.git
  cd dense-arrays
  pip install .
  ```
The following setup is designed for running more resource-intensive workflows on a shared computing cluster, such as solving dense arrays with Gurobi or performing inference with Evo 2. For Evo 2’s FP8 features, a GPU with compute capability 8.9 or higher is required.
Interactive Session Resource Request Example:
- densegen workflow:
  - Modules: miniconda gurobi
  - Cores: 16
  - GPUs: 0
- evoinference workflow:
  - Modules: cuda miniconda
  - Cores: 3
  - GPUs: 1
  - GPU Compute Capability: 8.9
  - Extra options: `-l mem_per_core=8G`

(Check your cluster documentation for submission details.)
- Set Up the CUDA Environment

  Evo 2’s GPU-accelerated components require NVIDIA’s CUDA toolkit. This step loads the necessary CUDA and GCC modules, verifies the presence of the CUDA compiler (nvcc), and exports environment variables so that both Python and build scripts can locate the CUDA installation. These settings are crucial for compiling CUDA extensions and ensuring compatibility with PyTorch.

  ```bash
  module load cuda/12.5    # Load the CUDA module appropriate for your cluster
  module load gcc/10.2.0   # Load a GCC version that is compatible with CUDA

  # Verify that nvcc is available:
  ls $CUDA_HOME/bin/nvcc   # This should display the path to the nvcc binary

  # Export CUDA environment variables:
  export CUDA_HOME=/share/pkg.8/cuda/12.5/install
  export CUDA_PATH=/share/pkg.8/cuda/12.5/install
  export CUDA_TOOLKIT_ROOT_DIR=/share/pkg.8/cuda/12.5/install
  export CUDA_BIN_PATH=/share/pkg.8/cuda/12.5/install/bin
  export PATH=$CUDA_BIN_PATH:$PATH
  export NVCC=$CUDA_BIN_PATH/nvcc

  # Optional: Check versions to verify correct module load and installation
  nvcc --version
  gcc --version
  ```
- Create and Activate the Conda Environment

  ```bash
  conda create -n dnadesign_cu126 python=3.11 -y
  conda activate dnadesign_cu126
  ```
- (Optional) Install Mamba and Upgrade Build Tools

  Mamba speeds up dependency resolution and installation. Upgrading the build tools (pip, setuptools, wheel) ensures compatibility and access to the latest features.

  ```bash
  conda install -c conda-forge mamba -y
  unset -f mamba   # Optional: unset the mamba shell function if conflicts occur
  mamba install pip -c conda-forge -y
  pip install --upgrade pip setuptools wheel
  ```
- Install PyTorch with CUDA Support

  Installing a PyTorch build targeting CUDA ensures that GPU acceleration is enabled for Evo 2’s computations. Here, we install a version built for CUDA 12.6, which is appropriate for GPUs with compute capability ≥ 8.9.

  ```bash
  pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
  ```

  Note: If your GPU does not support FP8 or you encounter compatibility issues, consider installing a build for an older CUDA release (e.g., cu118) and adjusting Evo 2’s configuration accordingly.
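  A quick check (not part of the original instructions) to confirm that this PyTorch build sees the GPU and reports the expected CUDA version and compute capability:

  ```python
  # Run inside the activated environment on a GPU node.
  import torch

  print("torch:", torch.__version__)
  print("built for CUDA:", torch.version.cuda)
  print("CUDA available:", torch.cuda.is_available())
  if torch.cuda.is_available():
      # Evo 2's FP8 features expect compute capability (8, 9) or higher.
      print("compute capability:", torch.cuda.get_device_capability())
  ```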
- Install Additional Packages via Mamba

  These scientific and plotting libraries are required by various subprojects within dnadesign.

  ```bash
  mamba install scanpy=1.10.3 seaborn numpy pandas matplotlib pytest pyyaml leidenalg igraph openpyxl xlrd -c conda-forge -y
  ```
- Install Evo 2

  Cloning with submodules ensures that all dependencies, including those in external repositories, are included.

  ```bash
  git clone --recurse-submodules git@github.com:ArcInstitute/evo2.git
  cd evo2
  ```
- Override CUDA Paths in the Makefile (if necessary)

  If Evo 2’s build system does not detect your CUDA installation correctly, update the Makefile in the vortex directory to use the correct paths:

  ```makefile
  # Change the ":=" to "?=" for these lines
  CUDA_PATH ?= /usr/local/cuda
  CUDA_HOME ?= $(CUDA_PATH)
  CUDACXX ?= $(CUDA_PATH)/bin/nvcc
  ```

  Switching these assignments to `?=` lets your exported environment variables take precedence over the defaults.
- Install Evo 2 in Editable Mode

  ```bash
  cd evo2
  pip install -e .
  ```
- (Optional) Build Additional Components

  Some Evo 2 features, such as custom CUDA extensions, require a build step. Running `make setup-full` compiles these extensions.

  ```bash
  cd vortex
  make setup-full \
      CUDA_PATH=/share/pkg.8/cuda/12.5/install \
      CUDACXX=/share/pkg.8/cuda/12.5/install/bin/nvcc \
      CUDA_HOME=/share/pkg.8/cuda/12.5/install
  cd ..
  ```
- Test the Evo 2 Installation

  Running the test script verifies that the installation was successful and that Evo 2 can access the necessary resources and configurations.

  ```bash
  cd evo2
  python ./test/test_evo2.py --model_name evo2_7b
  ```
- (Optional) Clone the `dense-arrays` Package

  The densegen workflow relies on the dense-arrays package. Install it as a sibling directory to `dnadesign`.

  ```bash
  git clone https://gitlab.com/dunloplab/dense-arrays.git
  cd dense-arrays
  pip install .
  ```
- Clone the dnadesign-data repository and place it as a sibling directory to dnadesign. This enables seqfetcher to generate custom lists of dictionaries from these sources.
- Update the `configs/example.yaml` file as desired and try running different pipelines.