EvoDiff is described in this [preprint](todo:link).
- [Installation](#installation)
- [Datasets](#datasets)
- [Loading pretrained models](#loading-pretrained-models)
- [Available models](#available-models)
- [Unconditional sequence generation](#unconditional-sequence-generation)
  - [Unconditional generation with EvoDiff-Seq](#unconditional-generation-with-evodiff-seq)
  - [Unconditional generation with EvoDiff-MSA](#unconditional-generation-with-evodiff-msa)
To run our analysis scripts, please download the following packages in addition to EvoDiff:

We refer to the setup instructions outlined by the authors of those tools.

### Datasets
We obtain sequences from the [Uniref50 dataset](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4375400/), which contains
approximately 42 million protein sequences.
The Multiple Sequence Alignments (MSAs) are from the [OpenFold dataset](https://www.biorxiv.org/content/10.1101/2022.11.20.517210v2).
To load a data split, for example:
```
from evodiff.data import UniRefDataset  # import path assumed

test_data = UniRefDataset('data/uniref50/', 'rtest', structure=False)  # to access the test sequences
```

The filenames for the validation and training OpenFold splits are saved in `data/valid_msas.csv` and `data/train_msas.csv`, respectively.
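
For example, a minimal sketch of reading those split files (hypothetical: each CSV is assumed to hold one MSA filename per row with no header):
```
import pandas as pd

# Assumed layout: a single, headerless column of MSA filenames.
train_msas = pd.read_csv('data/train_msas.csv', header=None)[0].tolist()
valid_msas = pd.read_csv('data/valid_msas.csv', header=None)[0].tolist()
print(len(train_msas), 'training MSAs;', len(valid_msas), 'validation MSAs')
```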

### Loading pretrained models
To load a model:
```
from evodiff.pretrained import OA_DM_38M
```
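
The loader returns a checkpoint bundle. As a minimal sketch of unpacking it (the four-name tuple layout below is an assumption about `evodiff.pretrained`, not a documented guarantee):
```
from evodiff.pretrained import OA_DM_38M

# Assumed layout: the loader returns a (model, collater, tokenizer, scheme)
# tuple with everything needed for inference.
checkpoint = OA_DM_38M()
model, collater, tokenizer, scheme = checkpoint
```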
It is also possible to load our LRAR baseline models:
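For example, a sketch with one of the LRAR checkpoints (the `LR_AR_38M` name is assumed to mirror the naming above; substitute whichever loader `evodiff.pretrained` exposes):
```
from evodiff.pretrained import LR_AR_38M  # model name assumed

checkpoint = LR_AR_38M()
model, collater, tokenizer, scheme = checkpoint  # same assumed tuple layout
```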

Note: if you want to download a `BLOSUM` model, you will first need to download [data/blosum62-special-MSA.mat](https://github.com/microsoft/evodiff/blob/main/data/blosum62-special-MSA.mat).

## Available models

We investigated two types of forward processes for diffusion over discrete data modalities to determine which would be most effective.
In order-agnostic autoregressive diffusion ([OADM](https://arxiv.org/abs/2110.02037)), one amino acid is converted to a special mask token at each step in the forward process.
After $T=L$ steps, where $L$ is the length of the sequence, the entire sequence is masked.
We additionally designed discrete denoising diffusion probabilistic models ([D3PM](https://arxiv.org/abs/2107.03006)) for protein sequences.
In EvoDiff-D3PM, the forward process corrupts sequences by sampling mutations according to a transition matrix, such that after $T$ steps the sequence is indistinguishable from a uniform sample over the amino acids.
In the reverse process for both, a neural network model is trained to undo the previous corruption.
The trained model can then generate new sequences starting from sequences of masked tokens or of uniformly-sampled amino acids for EvoDiff-OADM or EvoDiff-D3PM, respectively.
We trained all EvoDiff sequence models on 42M sequences from UniRef50 using a dilated convolutional neural network architecture introduced in the [CARP](https://doi.org/10.1101/2022.05.19.492714) protein masked language model.
We trained 38M-parameter and 640M-parameter versions for each forward corruption scheme and for left-to-right autoregressive (LRAR) decoding.
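
As a toy illustration of the two corruption schemes (a simplification for intuition only, not the training code; the alphabet and schedule are stand-ins):
```
import random

AAS = 'ACDEFGHIKLMNPQRSTVWY'
MASK = '#'

def oadm_forward(seq):
    # OADM: mask one position per step, in a random order; after
    # T = len(seq) steps the entire sequence is masked.
    seq = list(seq)
    for t, i in enumerate(random.sample(range(len(seq)), len(seq)), start=1):
        seq[i] = MASK
        yield t, ''.join(seq)

def d3pm_uniform_forward(seq, T=10, beta=0.3):
    # D3PM with a uniform transition matrix: at each step, every position
    # mutates to a uniformly random amino acid with probability beta, so
    # the sequence drifts toward a uniform sample over the alphabet.
    seq = list(seq)
    for t in range(1, T + 1):
        seq = [random.choice(AAS) if random.random() < beta else a for a in seq]
        yield t, ''.join(seq)

for t, s in oadm_forward('MKTAYIAKQR'):
    print(t, s)
```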

To explicitly leverage evolutionary information, we designed and trained EvoDiff-MSA models using the [MSA Transformer](https://proceedings.mlr.press/v139/rao21a.html) architecture on the [OpenFold](https://github.com/aqlaboratory/openfold) dataset.
To do so, we subsampled MSAs to a length of 512 residues per sequence and a depth of 64 sequences, either by randomly sampling the sequences ("Random") or by greedily maximizing for sequence diversity ("Max"). Within each subsampling strategy, we then trained EvoDiff-MSA models with the OADM and D3PM corruption schemes.
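
A toy sketch of the greedy "Max" strategy (simplified: sequences are assumed pre-aligned to equal length, with the query in row 0):
```
def hamming(a, b):
    # Hamming distance between two aligned, equal-length sequences.
    return sum(x != y for x, y in zip(a, b))

def max_hamming_subsample(msa, depth=64):
    # Greedily grow the subsample: always keep the query, then repeatedly
    # add the sequence whose minimum Hamming distance to the selected set
    # is largest, i.e. the most diverse remaining sequence.
    selected = [msa[0]]
    pool = list(msa[1:])
    while pool and len(selected) < depth:
        best = max(pool, key=lambda s: min(hamming(s, t) for t in selected))
        selected.append(best)
        pool.remove(best)
    return selected
```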


## Unconditional sequence generation

### Unconditional generation with EvoDiff-Seq
To use this evaluation script, you must have the dependencies listed under the [Installation](#installation) section.

### Unconditional generation with EvoDiff-MSA

It is possible to unconditionally generate an entire MSA using a script like the following (the exact length flag may differ; see `generate-msa.py --help`):
```
python evodiff/generate-msa.py --model-type msa_oa_dm_maxsub --batch-size 1 --n-sequences 64 --seq-length 256 --subsampling MaxHamming
```
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos are subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.
