T032: Fix typos
dominiquesydow committed Nov 21, 2022
1 parent 3732157 commit 4e5da17
Showing 1 changed file with 9 additions and 6 deletions.
@@ -102,7 +102,7 @@
"source": [
"Proteochemometrics (PCM) models a biological endpoint (e.g. compound activity) via supervised ML algorithms based on a series of features derived from chemical compounds and target proteins. PCM is an extension of a more widespread bioactivity modeling technique, Quantitative Structure Activity Relationship (QSAR) modeling, which relies solely on chemical features and that was introduced on **Talktorial T007**. Explore that talktorial to know more about the basic principle of activity prediction using ML.\n",
"\n",
-"To successfully apply PCM modeling, we need a large dataset of molecule-protein pairs with known bioactivity values, a way of describing molecules and proteins, and a ML algorithm to train a model. Then, we can make predictions for new molecule-protein pairs.\n"
+"To successfully apply PCM modeling, we need a large dataset of molecule-protein pairs with known bioactivity values, a way of describing molecules and proteins, and an ML algorithm to train a model. Then, we can make predictions for new molecule-protein pairs.\n"
]
},
{
@@ -196,11 +196,11 @@
}
},
"source": [
-"As done for molecules, the proteins of interest need to be converted to a list of features or protein descriptors. Protein descriptors used in PCM applications are commonly based on the protein sequence and represent physicochemical characteristics of the amino acids that make up the sequence (e.g. Z-scales). Other protein descriptors represent topological (e.g. ST-scales) or electrostatic properties (e.g. MS-WHIM) of the protein sequence. Moreover, if structural information is available, protein descriptors can be derived from the 3D structure of the protein (e.g. sPairs) or the ligand-protein interaction in 3D (e.g. interaction fingerprints). Finally, with the widespread use of deep learning, protein embeddings can be obtained after parsing the protein sequence through the network (e.g. UniRep, AlphaFold embeddings). To read more about protein descriptors, check out these selection of articles ([*Brief. Bioinform.*,18, (2017)](https://pubmed.ncbi.nlm.nih.gov/26873661/), [*Int. J. Mol. Sci.*, 22, (2021)](https://pubmed.ncbi.nlm.nih.gov/34884688/), [*Comput. Struct. Biotechnol. J.*, 20, (2022)](https://pubmed.ncbi.nlm.nih.gov/35222841/)).\n",
+"As done for molecules, the proteins of interest need to be converted to a list of features or protein descriptors. Protein descriptors used in PCM applications are commonly based on the protein sequence and represent physicochemical characteristics of the amino acids that make up the sequence (e.g. Z-scales). Other protein descriptors represent topological (e.g. ST-scales) or electrostatic properties (e.g. MS-WHIM) of the protein sequence. Moreover, if structural information is available, protein descriptors can be derived from the 3D structure of the protein (e.g. sPairs) or the ligand-protein interaction in 3D (e.g. interaction fingerprints). Finally, with the widespread use of deep learning, protein embeddings can be obtained after parsing the protein sequence through the network (e.g. UniRep, AlphaFold embeddings). To read more about protein descriptors, check out this selection of articles ([*Brief. Bioinform.*,18, (2017)](https://pubmed.ncbi.nlm.nih.gov/26873661/), [*Int. J. Mol. Sci.*, 22, (2021)](https://pubmed.ncbi.nlm.nih.gov/34884688/), [*Comput. Struct. Biotechnol. J.*, 20, (2022)](https://pubmed.ncbi.nlm.nih.gov/35222841/)).\n",
"\n",
"For protein descriptors based on the protein sequence, an aspect to take into account is that for ML the length of the protein descriptor needs to be the same. However, most proteins do not have the same sequence length. To solve this issue, there are two main approaches:\n",
"\n",
-"* **Multiple sequence alignment (MSA)**: If the entire protein is to be included in the model, a MSA can be performed. The final descriptor has as many entries as the number of features per amino acid multiplied by the number of aligned positions. To account for gaps in the alignment, zeros are introduced in the descriptor. A MSA is a tool to identify common patterns between three or more biological sequences, usually DNA, RNA, or protein. One of the most common tools to perform MSA is Clustal Omega (or ClustalO), available as a [webtool](https://www.ebi.ac.uk/Tools/msa/clustalo/).\n",
+"* **Multiple sequence alignment (MSA)**: If the entire protein is to be included in the model, an MSA can be performed. The final descriptor has as many entries as the number of features per amino acid multiplied by the number of aligned positions. To account for gaps in the alignment, zeros are introduced in the descriptor. An MSA is a tool to identify common patterns between three or more biological sequences, usually DNA, RNA, or protein. One of the most common tools to perform MSA is Clustal Omega (or ClustalO), available as a [webtool](https://www.ebi.ac.uk/Tools/msa/clustalo/).\n",
"* **Binding pocket selection**: To avoid unnecessary features, a binding pocket of the same length can be selected for each protein. Normally, the binding pocket selection is preceded by a multiple sequence alignment and driven by known structural or mutagenesis data.\n",
"\n",
"Other options are available when proteins are not of the same family or do not share a binding pocket (see [*Drug Discov.* (2019), **32**, 89-98](https://www.sciencedirect.com/science/article/pii/S1740674920300111?via%3Dihub))\n",
@@ -433,7 +433,7 @@
"metadata": {},
"source": [
"**Note**: We will lateron use the ClustalO web service to align multiple sequences. In order to use the service, we need to provide an email address (see the [docs](https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/Clustal+Omega+Help+and+Documentation)).\n",
-"Please set your email address here; for the purpose of this template talktorial, we set the email to `None` and use pre-calculated data (see \"Practical\" section of this talktoria and [this discussion](https://github.com/volkamerlab/teachopencadd/discussions/283))."
+"Please set your email address here; for the purpose of this template talktorial, we set the email to `None` and use pre-calculated data (see \"Practical\" section of this talktorial and [this discussion](https://github.com/volkamerlab/teachopencadd/discussions/283))."
]
},
{
@@ -648,7 +648,10 @@
},
"pycharm": {
"name": "#%%\n"
-}
+},
+"tags": [
+"nbshpinx-thumbnail"
+]
},
"outputs": [
{
@@ -825,7 +828,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"In order to ensure protein descriptors are of the same length, we first need to align the target sequences. We do this by creating a MSA with the software Clustal Omega (ClustalO). To begin with, we extract the protein sequences from the target files in Papyrus. The sequences could also be obtained from UniProt, but this way we ensure we are always retrieving the canonical isoform sequence.\n",
+"In order to ensure protein descriptors are of the same length, we first need to align the target sequences. We do this by creating an MSA with the software Clustal Omega (ClustalO). To begin with, we extract the protein sequences from the target files in Papyrus. The sequences could also be obtained from UniProt, but this way we ensure we are always retrieving the canonical isoform sequence.\n",
"Since Papyrus also contains bioactivity data for different mutants and species, the main protein identifier (`target_id` variable) consists of the UniProt accession code and the mutant ('WT' for wild type). Even though we are interested in the wild type, to map our targets of interest we calculate a new variable called `accession` to be consistent with the rest of the talktorial."
]
},
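
The notebook text touched by this commit describes an alignment-based protein descriptor: one block of features per aligned position, with zeros filled in wherever the alignment has a gap. A minimal sketch of that idea follows. It is not the talktorial's own implementation: the feature table `TOY_FEATURES`, its values, the gap character `-`, and the function name `aligned_descriptor` are all assumptions made up for illustration, and real descriptors such as Z-scales use different values and more features per residue.

```python
# Hypothetical per-residue features, 2 values per amino acid.
# These numbers are invented for illustration; they are NOT real Z-scale values.
TOY_FEATURES = {
    "A": (0.1, -0.5),
    "C": (0.8, 0.3),
    "D": (-1.2, 0.7),
}
N_FEATURES = 2  # number of features per amino acid in the toy table
GAP = "-"       # assumed gap character in the aligned sequences


def aligned_descriptor(aligned_sequence):
    """Concatenate per-residue features over all aligned positions.

    Gap positions contribute N_FEATURES zeros, so every sequence taken from
    the same alignment yields a descriptor of identical length.
    """
    descriptor = []
    for residue in aligned_sequence:
        if residue == GAP:
            descriptor.extend([0.0] * N_FEATURES)
        else:
            descriptor.extend(TOY_FEATURES[residue])
    return descriptor


# Two toy sequences that were already aligned (4 aligned positions each).
alignment = {"protein_1": "AC-D", "protein_2": "ACDD"}
descriptors = {name: aligned_descriptor(seq) for name, seq in alignment.items()}
```

Both descriptors end up with 4 aligned positions times 2 features, i.e. 8 entries, which is the fixed length an ML algorithm needs; the third position of `protein_1` is the zero-filled gap.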