Skip to content

Latest commit

 

History

History

alphafold_casp13

AlphaFold 1

This package provides an implementation of the contact prediction network used in AlphaFold 1, associated model weights and CASP13 dataset as used for CASP13 (2018) and published in Nature. This is completely different code from that used in AlphaFold 2 which was used in CASP14 (2020). You can find AlphaFold 2 at https://github.com/deepmind/alphafold.

Any publication that discloses findings arising from using this source code must cite Improved protein structure prediction using potentials from deep learning by Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W. R. Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David T. Jones, David Silver, Koray Kavukcuoglu, Demis Hassabis.

The paper abstract can be found on Nature's site 10.1038/s41586-019-1923-7 and the full text can be accessed directly at https://rdcu.be/b0mtx.

Setup

This code can't be used to predict structure of an arbitrary protein sequence. It can be used to predict structure only on the CASP13 dataset (links below). The feature generation code is tightly coupled to our internal infrastructure as well as external tools, hence we are unable to open-source it. We give guide as to the features used for those accustomed to computing them below. See also issue #18 for more details.

This code works on Linux, we don't support other operating systems.

Dependencies

You can set up Python virtual environment (you might need to install the python3-venv package first) with all needed dependencies inside the forked deepmind_research repository using:

python3 -m venv alphafold_venv
source alphafold_venv/bin/activate
pip install wheel
pip install -r alphafold_casp13/requirements.txt

Alternatively, you can just use the run_eval.sh script provided which will run these commands for you. See the section on running the system below for more details.

Data

While the code is licensed under the Apache 2.0 License, the AlphaFold weights and data are made available for non-commercial use only under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You can find details at: https://creativecommons.org/licenses/by-nc/4.0/legalcode

You can download the data from:

Input data

The dataset to reproduce AlphaFold's CASP13 results can be downloaded from http://bit.ly/alphafold-casp13-data. The dataset is in a single zip file called casp13_data.zip which has about 43.5 GB.

The zip file contains 1 directory for each CASP13 target and a LICENSE.txt file. Each target directory contains the following files:

  1. TARGET.tfrec file. This is a TFRecord file with serialized tf.train.Example protocol buffers that contain the features needed to run the model.
  2. contacts/TARGET.pickle file(s) with the predicted distogram. These pickles were pickled using Python 2, so to unpickle them in Python 3 you will need to set the encoding='latin1' optional argument for pickle.load().
  3. contacts/TARGET.rr file(s) with the contact map derived from the predicted distogram. The RR format is described on the CASP website.

Note that for T0999 the target was manually split based on hits in HHSearch into 5 sub-targets, hence there are 5 distograms (contacts/T0999s{1,2,3,4,5}.pickle) and 5 RR files (contacts/T0999s{1,2,3,4,5}.rr).

The contacts/ folder is not needed to run the model, these files are included only for convenience so that you don't need to run the inference for CASP13 targets to get the contact map.

Model checkpoints

The model checkpoints can be downloaded from http://bit.ly/alphafold-casp13-weights. The model checkpoints are in a zip file called alphafold_casp13_weights.zip which has about 210 MB.

The zip file contains:

  1. A directory 873731. This contains the weights for the distogram model.
  2. A directory 916425. This contains the weights for the background distogram model.
  3. A directory 941521. This contains the weights for the torsion model.
  4. LICENSE.txt. The model checkpoints have a non-commercial license which is defined in this file.

Each directory with model weights contains a number of different model configurations. Each model has a config file and associated weights. There is only one torsion model. Each model directory also contains a stats file that is used for feature normalization specific to that model.

Distogram prediction

Running the system

You can use the run_eval.sh script to run the entire Distogram prediction system. There are a few steps you need to start with:

  1. Download the input data as described above. Unpack the data in the directory with the code.
  2. Download the model checkpoints as described above. Unpack the data.
  3. In run_eval.sh set the following:
    • DISTOGRAM_MODEL to the path to the directory with the distogram model.
    • BACKGROUND_MODEL to the path to the directory with the background model.
    • TORSION_MODEL to the path to the directory with the torsion model.
    • TARGET to the name of the target.
    • TARGET_PATH to the path to the directory with the target input data.
    • OUTPUT_DIR is by default set to a new directory with a timestamp within your home directory.

Then run alphafold_casp13/run_eval.sh from the deepmind_research parent directory (you will get errors if you try running run_eval.sh directly from the alphafold_casp13 directory).

The contact prediction works in the following way:

  1. 4 replicas (by replica we mean a configuration file describing the network architecture and a snapshot with the network weights), each with slightly different model configuration, are launched to predict the distogram.
  2. 4 replicas, each with slightly different model configuration are launched to predict the background distogram.
  3. 1 replica is launched to predict the torsions.
  4. The predictions from the different replicas are averaged together using ensemble_contact_maps.py.
  5. The predictions for the 64 × 64, 128 × 128 and 256 × 256 distogram crops are pasted together using paste_contact_maps.py.

When running run_eval.sh the output has the following directory structure:

  • distogram/: Contains 4 subfolders, one for each replica. Each of these contain the predicted ASA, secondary structure and a pickle file with the distogram for each crop (see below for more details). It also contains an ensemble directory with the ensembled distograms.
  • background_distogram/: Contains 4 subfolders, one for each replica. Each of these contain a pickle file with the background distogram for each crop. It also contains an ensemble directory with the ensembled background distograms.
  • torsion/: Contains 1 subfolder as there was only a single replica. This folder contains contains the predicted ASA, secondary structure, backbone torsions and a pickle file with the distogram for each crop. It also contains an ensemble directory, which contains a copy of the predicted output as there is only a single replica in this case.
  • pasted/: Contains distograms obtained from the ensembled distograms by pasting. An RR contact map file is computed from this pasted distogram. This is the final distogram that was used in the subsequent AlphaFold folding pipeline in CASP13.

Distogram output format

The distogram is a Python pickle file with a dictionary containing the following fields:

  • min_range: The minimum range in Angstroms to consider in distograms.
  • max_range: The range in Angstroms to consider in distograms, see num_bins below for clarification. The upper end of the distogram is min_range + max_range.
  • num_bins: The number of bins in the distance histogram being predicted. We divide the interval from min_range to min_range + max_range into this many bins. The distograms were trained so that distances lower than min_range were counted in the lowest bin and distances higher than min_range + max_range were added to the final bin. The num_bins - 1 boundaries between bins are thus np.linspace(0, max_range, num_bins + 1, endpoint=True)[1:-1] + min_range.
  • sequence: The target sequence of amino acids of length L.
  • target: The name of the target.
  • domain: The name of the target including the domain name.
  • probs: The distogram as a Numpy array of shape [L, L, num_bins].

Data splits

We used a version of PDB downloaded on 2018-03-15. The train/test split can be found in the train_domains.txt and test_domains.txt files in this repository. The split is based on the CATH 2018-03-16 database.

Features

There is currently no plan to open source the feature generation code as it is tightly coupled to our internal infrastructure as well as external tools which we cannot open source.

Some features are needed only as placeholders to construct the model. These can be set to all zeros when running the inference. Such features are marked in the table below as not needed and you can just fill them with zeros when running inference.

The table below provides an overview of the features we used to make it possible to reconstruct our feature generation code. Some features that require more thorough explanation are explained in the section below the table. Note that NR stands for number of residues, i.e. the length of the amino acid sequence:

Name Needed TF DType Shape Description
aatype ✔️ float32 (NR, 21) One hot encoding of amino acid types. The mapping is ARNDCQEGHILKMFPSTWYVX -> range(21). See below.
alpha_mask int64 (NR, 1) Mask for alpha_positions.
alpha_positions float32 (NR, 3) (x, y, z) Carbon Alpha coordinates.
beta_mask int64 (NR, 1) Mask for beta_positions.
beta_positions float32 (NR, 3) (x, y, z) Carbon Beta coordinates.
between_segment_residues int64 (NR, 1) The number of between segment residues (BSR) at the next position. E.g. ABCXXD (XX is BSR) would be [0,0,2,0].
chain_name string (1) The chain name. E.g. 'A', 'B', ...
deletion_probability ✔️ float32 (NR, 1) The fraction of sequences that had a deletion at this position. See below.
domain_name string (1) The domain name.
gap_matrix ✔️ float32 (NR, NR, 1) Covariation signal from the gapped states, this gives an indication of the variance induced due to gapped states. See below.
hhblits_profile float32 (NR, 22) A profile (probability distribution over amino acid types) computed using HHBlits MSA. Encoding: 20 amino acids + 'X' + '-'.
hmm_profile ✔️ float32 (NR, 30) The HHBlits HHM profile (from the -ohhm HHBlits output file). Asterisks in the output are replaced by 0.0. See below.
key string (1) The unique id of the protein.
mutual_information float32 (NR, NR, 1) The average product corrected mutual information. See https://doi.org/10.1093/bioinformatics/btm604.
non_gapped_profile ✔️ float32 (NR, 21) A profile from amino acids only (discounting gaps). See below.
num_alignments ✔️ int64 (NR, 1) The number of HHBlits multiple sequence alignments. Has to be repeated NR times. See below.
num_effective_alignments float32 (1) The number of effective alignments (neff at 62 % sequence similarity).
phi_angles float32 (NR, 1) The phi angles.
phi_mask int64 (NR, 1) Mask for phi_angles.
profile float32 (NR, 21) A profile (probability distribution over amino acid types) computed using PSI-BLAST. Equivalent to the output of ChkParse.
profile_with_prior ✔️ float32 (NR, 22) A profile computed using HHBlits which takes into account priors and Blosum matrix. See equation 5 in https://doi.org/10.1093/nar/25.17.3389.
profile_with_prior_without_gaps ✔️ float32 (NR, 21) Same as profile_with_prior but without gaps included.
pseudo_bias ✔️ float32 (NR, 22) The bias computed in the MSA pseudolikelihood computation.
pseudo_frob ✔️ float32 (NR, NR, 1) Frobenius norm of pseudolikelihood (gaps not included). Similar to the output of CCMPred.
pseudolikelihood ✔️ float32 (NR, NR, 484) The weights computed in the MSA pseudolikelihood computation.
psi_angles float32 (NR, 1) The psi angles.
psi_mask int64 (NR, 1) Mask for psi_angles.
residue_index ✔️ int64 (NR, 1) Index of each residue giong from 0 to NR - 1. See below.
resolution float32 (1) The protein structure resolution.
reweighted_profile ✔️ float32 (NR, 22) Profile where sequences are reweighted to weight rarer sequences higher. See below.
sec_structure int64 (NR, 8) Secondary structure generated by DSSP and one-hot encoded by the mapping -HETSGBI -> range(8).
sec_structure_mask int64 (NR, 1) Mask for sec_structure_mask.
seq_length ✔️ int64 (NR, 1) The length of the amino acid sequence. Has to be repeated NR times. See below.
sequence ✔️ string (1) The amino acid sequence (1-letter amino acid encoding). See below.
solv_surf float32 (NR, 1) Relative solvent accessible area computed using DSSP and then normalized by amino acid maximum accessibility.
solv_surf_mask int64 (NR, 1) Mask for solv_surf.
superfamily string (1) The superfamily CATH code.

More details on needed features

aatype

One hot encoding of amino acid types. The following code converts an amino acid string into the one-hot encoding:

def sequence_to_onehot(sequence):
  """Maps the given sequence into a one-hot encoded matrix."""
  mapping = {aa: i for i, aa in enumerate('ARNDCQEGHILKMFPSTWYVX')}
  num_entries = max(mapping.values()) + 1
  one_hot_arr = np.zeros((len(sequence), num_entries), dtype=np.int32)

  for aa_index, aa_type in enumerate(sequence):
    aa_id = mapping[aa_type]
    one_hot_arr[aa_index, aa_id] = 1

  return one_hot_arr

deletion_probability

The fraction of sequences that had an insert state (denoted by a lowercase letter in the A3M format) at this position. We used the following code to compute it from the HHBlits MSA in the A3M format:

deletion_matrix = []
for msa_sequence in hhblits_a3m_sequences:
  deletion_vec = []
  deletion_count = 0
  for j in msa_sequence:
    if j.islower():
      deletion_count += 1
    else:
      deletion_vec.append(deletion_count)
      deletion_count = 0
  deletion_matrix.append(deletion_vec)

deletion_matrix = np.array(deletion_matrix)
deletion_matrix[deletion_matrix != 0] = 1.0
deletion_probability = deletion_matrix.sum(axis=0) / len(deletion_matrix)

gap_matrix

Covariation signal from the gapped states, this gives an indication of the variance induced due to gapped states. Example:

MSA = A A C D B D F J G B M A
      - - C D B D F J G B M A
      A A C D B - - J G B M A
gap_count = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]]
gap_matrix = np.matmul(gap_count.T, gap_count)

hmm_profile

The HHBlits HHM profile (from the -ohhm HHBlits output file). Asterisks in the output are replaced by 0.0. The following code parses the HHM file:

def extract_hmm_profile(hhm_file, sequence, asterisks_replace=0.0):
  """Extracts information from the hmm file and replaces asterisks."""
  profile_part = hhm_file.split('#')[-1]
  profile_part = profile_part.split('\n')
  whole_profile = [i.split() for i in profile_part]
  # This part strips away the header and the footer.
  whole_profile = whole_profile[5:-2]
  gap_profile = np.zeros((len(sequence), 10))
  aa_profile = np.zeros((len(sequence), 20))
  count_aa = 0
  count_gap = 0
  for line_values in whole_profile:
    if len(line_values) == 23:
      # The first and the last values in line_values are metadata, skip them.
      for j, t in enumerate(line_values[2:-1]):
        aa_profile[count_aa, j] = (
            2**(-float(t) / 1000.0) if t != '*' else asterisks_replace)
      count_aa += 1
    elif len(line_values) == 10:
      for j, t in enumerate(line_values):
        if j <= 6:
          gap_profile[count_gap, j] = (
              2**(-float(t) / 1000.0) if t != '*' else asterisks_replace)
        else:
          # Neff_M, Neff_I, and Neff_D are given in units of 0.001.
          gap_profile[count_gap, j] = float(t) / 1000.0
      count_gap += 1
    elif not line_values:
      pass
    else:
      raise ValueError('Wrong length of line %s hhm file. Expected 0, 10 or 23'
                       'got %d'%(line_values, len(line_values)))
  hmm_profile = np.hstack([aa_profile, gap_profile])
  assert len(hmm_profile) == len(sequence)
  return hmm_profile

non_gapped_profile

A profile from amino acids only (discounting gaps).

def non_gapped_profile(amino_acids):
  """Computes a profile from only amino acids and discounting gaps."""
  profile = np.zeros(21)
  for aa in amino_acids:
    if aa != 21:  # Ignore gaps.
      profile[aa] += 1.
  return profile / np.sum(profile)

num_alignments

The number of HHBlits multiple sequence alignments. Has to be repeated NR times. For example, if there are 10 alignments for a sequence of length 8, then num_alignments = [[10], [10], [10], [10], [10], [10], [10], [10]].

pseudo_frob

This feature collapses the 484 channels of pseudolikelihood into one by taking the Frobenius norm of the 484 channels and then subtracting the Average Product Correction of the computed Frobenius norm. The Frobenius norm does not take into account the 22nd gap state.

pseudolikelihood

Parameters of a Potts Model coupling the amino acid types of particular residues estimated by pseudolikelihood. See https://doi.org/10.1103/PhysRevE.87.012707 for more details.

residue_index

Index of each residue giong from 0 to NR - 1. For example, the sequence AACR has residue_index = [[0], [1], [2], [3]].

reweighted_profile

Profile where sequences are reweighted to weight rarer sequences higher. The sequence weights are calculated like this:

def sequence_weights(sequence_matrix):
  """Compute sequence reweighting to weight rarer sequences higher."""
  num_rows, num_res = sequence_matrix.shape
  cutoff = 0.62 * num_res
  weights = np.ones(num_rows, dtype=np.float32)
  for i in range(num_rows):
    for j in range(i + 1, num_rows):
      similarity = (sequence_matrix[i] == sequence_matrix[j]).sum()
      if similarity > cutoff:
        weights[i] += 1
        weights[j] += 1
  return 1.0 / weights

seq_length

The length of the amino acid sequence. Has to be repeated NR times. For example, the sequence AACR would have seq_length = [[4], [4], [4], [4]].

sequence

The amino acid sequence (1-letter amino acid encoding). For example, a protein with Alanine, Lysine, Arginine has sequence = 'AKR'.

Disclaimer

This is not an official Google product.