From bf3add7c220a637c1c1fd52d62fa02324a20341b Mon Sep 17 00:00:00 2001
From: Kevin Kaichuang Yang
Date: Tue, 12 Sep 2023 10:19:03 -0400
Subject: [PATCH 1/2] Add model descriptions

---
 README.md | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 560e888..eed67ce 100644
--- a/README.md
+++ b/README.md
@@ -21,6 +21,7 @@ to demonstrate their power for controllable protein design. Below, we provide do
 - [Installation](#installation)
 - [Datasets](#datasets)
 - [Loading pretrained models](#loading-pretrained-models)
+- [Available models](#available-models)
 - [Unconditional generation](#unconditional-sequence-generation)
 - [Unconditional sequence generation](#unconditional-generation-with-evodiff-seq)
 - [Unconditional MSA generation](#unconditional-generation-with-evodiff-msa)
@@ -59,7 +60,7 @@ scripts, please download the following packages in addition to EvoDiff:
 
 We refer to the setup instructions outlined by the authors of those tools.
 
-## Datasets
+### Datasets
 
 We obtain sequences from the [Uniref50 dataset](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4375400/), which contains
 approximately 42 million protein sequences. The Multiple Sequence Alignments (MSAs) are from the [OpenFold dataset](https://www.biorxiv.org/content/10.1101/2022.11.20.517210v2),
 
 ```
 test_data = UniRefDataset('data/uniref50/', 'rtest', structure=False) # To access the test sequences
 ```
 
 The filenames for the train and validation OpenFold splits are saved in `data/valid_msas.csv` and `data/train_msas.csv`.
 
-## Loading pretrained models
+### Loading pretrained models
 
 To load a model:
 ```
 from evodiff.pretrained import OA_DM_38M
 
 checkpoint = OA_DM_38M()
 model, collater, tokenizer, scheme = checkpoint
 ```
 
@@ -103,6 +104,22 @@ It is also possible to load our LRAR baseline models:
 
 Note: if you want to download a `BLOSUM` model, you will first need to download [data/blosum62-special-MSA.mat](https://github.com/microsoft/evodiff/blob/main/data/blosum62-special-MSA.mat).
 
+## Available models
+
+We investigated two types of forward processes for diffusion over discrete data modalities to determine which would be most effective.
+In order-agnostic autoregressive diffusion [OADM](https://arxiv.org/abs/2110.02037), one amino acid is converted to a special mask token at each step in the forward process.
+After $T=L$ steps, where $L$ is the length of the sequence, the entire sequence is masked.
+We additionally designed discrete denoising diffusion probabilistic models [D3PM](https://arxiv.org/abs/2107.03006) for protein sequences.
+In EvoDiff-D3PM, the forward process corrupts sequences by sampling mutations according to a transition matrix, such that after $T$ steps the sequence is indistinguishable from a uniform sample over the amino acids.
+In the reverse process for both, a neural network model is trained to undo the previous corruption.
+The trained model can then generate new sequences starting from sequences of masked tokens (for EvoDiff-OADM) or of uniformly sampled amino acids (for EvoDiff-D3PM).
+We trained all EvoDiff sequence models on 42M sequences from UniRef50 using a dilated convolutional neural network architecture introduced in the [CARP](https://doi.org/10.1101/2022.05.19.492714) protein masked language model.
+We trained 38M-parameter and 640M-parameter versions for each forward corruption scheme and for left-to-right autoregressive (LRAR) decoding.
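+
+As a rough illustration, the two forward corruption processes can be sketched as follows. This is a minimal sketch, not the EvoDiff training code: the 20-letter alphabet, the `#` mask token, and the per-step stay probability are simplified stand-ins for the tokenizers and transition matrices used in practice.
+
+```
+import numpy as np
+
+AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
+MASK = "#"  # stand-in for the OADM mask token
+
+def oadm_forward(seq, rng):
+    """OADM forward process: mask one position per step in a random order,
+    so after T = len(seq) steps the entire sequence is masked."""
+    seq, trajectory = list(seq), []
+    for i in rng.permutation(len(seq)):
+        seq[i] = MASK
+        trajectory.append("".join(seq))
+    return trajectory
+
+def d3pm_forward(seq, T, rng, stay_prob=0.9):
+    """D3PM forward process with a uniform transition matrix: each step keeps
+    a residue with probability stay_prob and otherwise resamples it uniformly,
+    so x_T approaches a uniform sample over the amino acids."""
+    seq = list(seq)
+    for _ in range(T):
+        seq = [rng.choice(AMINO_ACIDS) if rng.random() > stay_prob else aa
+               for aa in seq]
+    return "".join(seq)
+
+rng = np.random.default_rng(0)
+print(oadm_forward("MKTAYIAKQR", rng)[-1])         # '##########' after T = L steps
+print(d3pm_forward("MKTAYIAKQR", T=100, rng=rng))  # close to a uniform sample
+```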
+
+To explicitly leverage evolutionary information, we designed and trained EvoDiff MSA models using the [MSA Transformer](https://proceedings.mlr.press/v139/rao21a.html) architecture on the [OpenFold](https://github.com/aqlaboratory/openfold) dataset.
+To do so, we subsampled MSAs to a length of 512 residues per sequence and a depth of 64 sequences, either by randomly sampling the sequences ("Random") or by greedily maximizing for sequence diversity ("Max").
+Within each subsampling strategy, we then trained EvoDiff MSA models with the OADM and D3PM corruption schemes.
+
 
 ## Unconditional sequence generation
 
 ### Unconditional generation with EvoDiff-Seq
 
@@ -136,10 +153,6 @@ To use this evaluation script, you must have the dependencies listed under the [
 
 ### Unconditional generation with EvoDiff-MSA
 
-To explicitly leverage evolutionary information, we design and train EvoDiff-MSA models using the MSA Transformer architecture
-on the OpenFold dataset. To do so, we subsample MSAs to a length of 512 residues per sequence and a depth of 64 sequences,
-either by randomly sampling the sequences (“Random”) or by greedily maximizing for sequence diversity (“Max”).
-
 It is possible to unconditionally generate an entire MSA, using the following script:
 ```
 python evodiff/generate-msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --n-sequences 256 --subsampling MaxHamming
 ```
@@ -328,4 +341,4 @@ This project may contain trademarks or logos for projects, products, or services
 trademarks or logos are subject to and must follow
 [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
 Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
-Any use of third party trademarks or logos is subject to those third-party's policies.
\ No newline at end of file
+Any use of third party trademarks or logos is subject to those third-party's policies.

From 2f976af83e868c92ed58352229a6d7fc39ea4b9d Mon Sep 17 00:00:00 2001
From: Kevin Kaichuang Yang
Date: Tue, 12 Sep 2023 10:20:45 -0400
Subject: [PATCH 2/2] Fix spacing

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 9fd5213..fa8dbfb 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,9 @@

+
 ### Description
+
 In this work, we introduce a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with
 the distinct conditioning capabilities of diffusion models for controllable protein generation in sequence space.
 EvoDiff generates high-fidelity, diverse, and structurally-plausible proteins that cover natural sequence and functional