Skip to content

Predicting protein inter-chain residue distances from sequences irrespective of paired MSA.

License

Notifications You must be signed in to change notification settings

NatureGeorge/ResNetPPI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

62 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ResNetPPI: Predicting protein inter-chain residue distances from sequences irrespective of paired MSA

[email protected] (ORCID: 0000-0002-2761-3291)

A lab rotation project. Under development from 2021.12 to 2022.1.

Summary

summary_fig

Advantages

  • accept variable input size
    • protein of any length
    • any number of homologous sequences
  • does not rely on paired MSA, thus it can predict cross-species protein interaction

Training

Data Resources

Input Features

  • Protein Sequence * 2
    • build MSA via HHblits if possible
    • onehot encoding of amino acid type (20+gap+X)*2 + (hydrophoblic+hydrophilic)*2

Fitting Targets

Fitting Targets (i.e. Inter-chain Cβ-Cβ Distance Map) Examples

NOTE: Cα for GLY

PDB Inter-chain Cβ-Cβ Distance Map pdb_id human chain virus chain len(human chain) len(virus chain)
3wwt A B 123 107
1im3 E H 275 95
6bvv A B 416 24
4rf1 B A 75 321
6e5x B A 13 127

Real-valued distances are discretely binned:

  • from 2Å to 20Å
  • with bin size of 0.5Å
  • 36 bins + 1 bin for $[20, +\infty)$

Network Architecture

  • ResNet1D
    • 1d residual block: ((Conv1d + BatchNorm1d) + ELU + (Conv1d + BatchNorm1d)) (+) ELU
    • 8 1d-blocks (64 channels)
  • ResNet2D
    • 2d residual block: ((Conv2d + BatchNorm2d) + ELU + (Conv2d + BatchNorm2d)) (+) ELU
    • 16 2d-blocks (96 channels), cycling through dilations $(1,2,4,8)$
  • (mini-)batch size: 1
  • Cross-entropy Loss

Model Design

  • for each protein sequence (of length $L$) search homologous sequences and input the MSA (of size $K$) if possible, otherwise input the single sequence
    • use original MSA to calculate the weight $w_k$ for each homologous sequence, $w_k=\frac{1}{\text{count}(\text{sequence with identity}\ge 0.8)}$
    • calculate $M_\text{eff}=\sum_{k}^{K}w_k$
  • MSA Encoding: perform onehot-encoding for each pairwise alignment ($2\times L_k$, consider both insertion and deletion: $\rightarrow 48\times L_k$)
    • onehot-encoding including 22+2 channels for the reference sequence, 22+2 channels for the homologous sequence
    • 22: 20 amino acid types + 1 gap + 1 unknown type
    • 2: 1 hydrophoblic + 1 hydrophilic
    • for the single sequence input, all the homologous related channels are filled with the reference sequence's corresponding values
    • hence we get ${48\times L_k, k\in K}$
  • MSA Embedding: for each encoded pairwise alignment, feed into the ResNet1D and get embedded pairwise alignment ($64\times L_k$)
    • hence we get ${64\times L_k, k\in K}$
    • omit the insertion region of the homologous sequences, thus we can get a $K\times 64 \times L$ tensor
      • $x_k\in R^{64\times L}$
      • $x_k(i) \in R^{64}$
  • Paired Evolution Aggregation
    • calculate one body term
      • $f_1(i)=\frac{1}{M_{\text{eff}{1}}}\sum{k}^{K_1}w_{1_k} x_{1_k}(i)$
      • $f_2(j)=\frac{1}{M_{\text{eff}{2}}}\sum{k}^{K_2}w_{2_k} x_{2_k}(j)$
    • apply max function
      • $A(i,c) = \max{x_k(i,c), k \in K}$, c: channel; $A \in R^{64\times L}$
    • calculate two body term
      • $s(i,j) = \frac{1}{\sqrt{M_{\text{eff}{1}}\cdot M{\text{eff}_{2}}}}[A_1(i)\otimes A_2(j)], s(i,j)\in R^{64\times 64}$
    • concatenate
      • $h(i,j) = \text{concat}(f_1(i), f_2(j), s(i,j)), h(i,j)\in R^{4224}$
  • Inter-chain Distance Estimation
    • feed the $R^{4224 \times L_1\times L_2}$ tensor into the ResNet2D
    • convert the ResNet2D outputs into the discrete distance distribution $37 \times L_1 \times L_2$ throught a (Conv2d+BatchNorm2d+ELU) layer

During Training

  • randomly sample 1000 homologous sequences if $K > 1000$
  • randomly crop the distance matrix into $64 \times 64$ shape
    • for those input sequences of length $\le 64$, keep their original length
  • loss function ignore those missing residues of PDB structure
    • definition of missing
      1. complete without any modeled atoms for a residue
      2. GLY without Cα atom
      3. non-GLY without Cβ atom and without any anchoring atoms (Cα, N, C) that can infer Cβ atom

Current Training and Validation Results

Dataset Preparation

  • training dataset (362)
  • validation dataset (100)

NOTE: the datasets have not been carefully curated, they just serve as demo inputs for training and validation. Thus the training and validation results are just a demonstration of the model's learning ability.

TODO

  • increase mini-batch size
  • multiple GPU training
  • improve archi

Hardware & Software Prerequisites

  • Hardware (tested)
    • GPU: NVIDIA Tesla T4 (16G)
    • CUDA Version: 11.2
  • Software
    • Python 3.8 or later in a conda environment
    • PyTorch and other Python packages: see requirements

Acknowledgment

I would like to thank Jinyuan Guo and Prof. Huaiqiu Zhu for their helpful discussions and provide me with devices to develop this project.

Reference Papers

base on these papers

License

Apache-2.0 License

About

Predicting protein inter-chain residue distances from sequences irrespective of paired MSA.

Resources

License

Stars

Watchers

Forks