T032: Compound activity: Proteochemometrics #278

Open. Wants to merge 66 commits into base: master

Commits (66)
5514df0
Start branch
gorostiolam Oct 7, 2022
4d0136c
Start T029 talktorial on proteochemometrics (PCM)
gorostiolam Oct 11, 2022
4267bcc
Add environment file
gorostiolam Oct 11, 2022
d3efd75
Completed Theory draft
gorostiolam Oct 12, 2022
af8bdb0
Added first practical part to download, read and filter Papyrus dataset
gorostiolam Oct 13, 2022
11ff43f
Added figure with splitting methods
gorostiolam Oct 13, 2022
1d7f445
Add section to download clustalO (Win,Unix,Mac) and perform MSA
gorostiolam Oct 13, 2022
161a3bf
Add libraries for installing ClustalO and visualizing MSA
gorostiolam Oct 13, 2022
a460d8e
Update tutorial number to 032
gorostiolam Oct 17, 2022
45a60d1
Add PCM and QSAR training and validation and update tutorial number t…
gorostiolam Oct 17, 2022
74eee29
Change dependencies to use ClustalO REST API
gorostiolam Oct 18, 2022
1c122e5
Update code to use ClustalO REST API instead of binary download
gorostiolam Oct 18, 2022
054ccfe
Add ClustalO REST API client
gorostiolam Oct 18, 2022
73a6c11
Add modelling interpretation and discussion. Make modelling output neat.
gorostiolam Oct 18, 2022
11bee4e
Remove contribution cell and update practical list of contents
gorostiolam Oct 18, 2022
045b2de
Update README file with intro from talktorial
gorostiolam Oct 18, 2022
284d50e
Update data README file
gorostiolam Oct 18, 2022
a094049
Update figures README file
gorostiolam Oct 18, 2022
59b8d68
Update Papyrus workflow image
gorostiolam Oct 18, 2022
bed66ab
Update Papyrus workflow image
gorostiolam Oct 18, 2022
1616aeb
Proofread grammar
gorostiolam Oct 18, 2022
d5319f4
Grammar and code revision by Olivier
gorostiolam Oct 20, 2022
8b061b5
Theory revision based on Andrea's review.
gorostiolam Oct 26, 2022
494c8fe
Theory contents revision based on Andrea's review.
gorostiolam Oct 26, 2022
45cf83d
Run CI unit tests on T032 only (temporarily)
dominiquesydow Oct 27, 2022
101233d
Add T032 to docs
dominiquesydow Oct 27, 2022
8af34f8
Add pytest to tmp env
dominiquesydow Oct 27, 2022
fdd76f6
Update README
dominiquesydow Oct 27, 2022
a14321e
CI: Temporarily remove CLI test
dominiquesydow Oct 27, 2022
88acd01
Add output directory argument
gorostiolam Oct 27, 2022
34e813e
Add Dominique's comments
gorostiolam Oct 27, 2022
17326ca
Merge remote-tracking branch 'origin/mgg-032-compound_activity_proteo…
gorostiolam Oct 27, 2022
dea919b
Add more Dominique's comments
gorostiolam Oct 28, 2022
89e1b10
Move clustalo.py script to /scripts folder and create README
gorostiolam Oct 28, 2022
85506a9
Resize figures
gorostiolam Oct 28, 2022
e8951b0
CI: Revert back to test_env.yml and CLI tests
dominiquesydow Oct 28, 2022
3b0b60b
Env: Add T032 packages + tmp. comment other talktorial packages
dominiquesydow Oct 28, 2022
8aa6c67
Add latest Dominique comments
gorostiolam Oct 28, 2022
74fae96
Merge remote-tracking branch 'origin/mgg-032-compound_activity_proteo…
gorostiolam Oct 28, 2022
441fb2b
Run black-nb
gorostiolam Oct 28, 2022
a3ca4ee
Automatically generate README file (and correct accents in author's s…
gorostiolam Oct 28, 2022
982adc9
Improve aesthetics of outputs
gorostiolam Oct 28, 2022
5ee3305
Update code for new location of ClustalO client
gorostiolam Oct 28, 2022
e7566aa
Update ClustalO client check time
gorostiolam Oct 28, 2022
c84f5dd
Docs config: Set language to "en" (not None)
dominiquesydow Nov 7, 2022
feacb89
README: Fix broken conda-forge badge
dominiquesydow Nov 7, 2022
56d1398
Docs: Add T032 nblink file
dominiquesydow Nov 7, 2022
1717749
T032: Add pre-calculated alignments (ClustalO)
dominiquesydow Nov 7, 2022
c76c555
T032: Set email to None; more formatting
dominiquesydow Nov 7, 2022
e016da8
Update HTML formatting to Markdown
gorostiolam Nov 8, 2022
fadc2d6
Update conflicting formatting in regression evaluation metrics
gorostiolam Nov 8, 2022
778062d
T032: Move pip installs from env file to notebook itself
dominiquesydow Nov 21, 2022
13f7b5c
Satisfy black-nb
dominiquesydow Nov 21, 2022
5d52451
Regenerate READMEs
dominiquesydow Nov 21, 2022
3732157
T032: Rerun notebook & add NBVAL_CHECK_OUTPUT checks
dominiquesydow Nov 21, 2022
4e5da17
T032: Fix typos
dominiquesydow Nov 21, 2022
a6cd9df
T032: Remove thumbnail (talktorial has no pure png outputs we can use)
dominiquesydow Nov 21, 2022
5fd7aae
T032: Fix typo [skip ci]
dominiquesydow Nov 21, 2022
3c0eb9d
Env: Sync env with latest master
dominiquesydow Jan 2, 2023
a4ba93f
README: Sync with master README
dominiquesydow Jan 2, 2023
4ee706d
CI: Sync with master CI
dominiquesydow Jan 2, 2023
8769fd5
Merge branch 'master' into mgg-032-compound_activity_proteochemometrics
dominiquesydow Jan 2, 2023
265ae70
CI: Drop T032 under Windows
dominiquesydow Jan 2, 2023
4bf8bbc
CI: Add env list after T032-specific package installations (tmp)
dominiquesydow Jan 2, 2023
1c99cde
Merge pull request #327 from volkamerlab/master
jesperswillem Mar 16, 2023
63ea5c6
trigger CI
hamzaibrahim21 May 16, 2023
11 changes: 9 additions & 2 deletions .github/workflows/ci.yml
@@ -7,7 +7,6 @@ on:
pull_request:
branches:
- "master"
- "base-ci-env-fix"
schedule:
# Run a cron job once weekly on Monday
- cron: "0 3 * * 1"
@@ -75,6 +74,8 @@ jobs:

# Ignore T019 under Windows, see https://github.com/volkamerlab/teachopencadd/issues/313
PYTEST_IGNORE_T019="--ignore=teachopencadd/talktorials/T019_md_simulation/talktorial.ipynb"
# Ignore T032 under Windows, see
PYTEST_IGNORE_T032="--ignore=teachopencadd/talktorials/T032_compound_activity_proteochemometrics/talktorial.ipynb"

# Temporarily ignored notebooks, see https://github.com/volkamerlab/teachopencadd/issues/303
PYTEST_IGNORE_T008="--ignore=teachopencadd/talktorials/T008_query_pdb/talktorial.ipynb"
@@ -83,9 +84,15 @@ jobs:
# Temporarily ignore T019
pytest $PYTEST_ARGS teachopencadd/talktorials/ $PYTEST_IGNORE_T008 $PYTEST_IGNORE_T019
else
pytest $PYTEST_ARGS teachopencadd/talktorials/ $PYTEST_IGNORE_T008 $PYTEST_IGNORE_T019
pytest $PYTEST_ARGS teachopencadd/talktorials/ $PYTEST_IGNORE_T008 $PYTEST_IGNORE_T019 $PYTEST_IGNORE_T032
fi

- name: Environment Information (after T032 installation)
shell: bash -l {0}
run: |
conda info --all
conda list

format:
name: Black
runs-on: ubuntu-latest
10 changes: 9 additions & 1 deletion .gitignore
@@ -155,4 +155,12 @@ node_modules/
# Talktorial outputs
# T018
teachopencadd/talktorials/T018_automated_cadd_pipeline/data/Outputs
teachopencadd/talktorials/T018_automated_cadd_pipeline/data/PipelineInputData_Project2.csv
teachopencadd/talktorials/T018_automated_cadd_pipeline/data/PipelineInputData_Project2.csv

# Talktorial outputs
# T032
teachopencadd/talktorials/T032_compound_activity_proteochemometrics/data/papyrus
teachopencadd/talktorials/T032_compound_activity_proteochemometrics/data/sequences.fasta
teachopencadd/talktorials/T032_compound_activity_proteochemometrics/data/aligned*
# Keep this file for CI purposes (ClustalO w/o email)
!teachopencadd/talktorials/T032_compound_activity_proteochemometrics/data/aligned_sequences.aln-fasta.fasta
11 changes: 11 additions & 0 deletions devtools/test_env.yml
@@ -15,6 +15,7 @@ dependencies:
# https://github.com/volkamerlab/teachopencadd/issues/299
- numpy<1.24
- scikit-learn
- scipy
# API changed after v2.6, see https://github.com/volkamerlab/teachopencadd/issues/265
- tensorflow<=2.6
- seaborn
@@ -44,6 +45,7 @@ dependencies:
- tqdm
- lxml
- kissim
- mordred
## CI tests
- pytest
- pytest-xdist
@@ -67,3 +69,12 @@ dependencies:
- sphinxext-opengraph
# TeachOpenCADD itself
- ../

# T032
# The following pip packages are currently installed in the notebook itself because they are only used there, thereby avoiding the addition of more dependencies to our already quite large environment file.
# Follow this discussion on how we try to simplify our environment setup in the future: https://github.com/volkamerlab/teachopencadd/discussions/277
# - https://github.com/OlivierBeq/Papyrus-scripts/tarball/master
# - prodec
# - rich-msa
# Dependency for ClustalO webservice (also conda installable via -c bioconda)
# - xmltramp2
1 change: 1 addition & 0 deletions docs/all_talktorials.rst
@@ -34,3 +34,4 @@ This is the complete list of talktorials available for online reading. Take into
talktorials/T026_kinase_similarity_ifp.nblink
talktorials/T027_kinase_similarity_ligand_profile.nblink
talktorials/T028_kinase_similarity_compare_perspectives.nblink
talktorials/T032_compound_activity_proteochemometrics.nblink
1 change: 1 addition & 0 deletions docs/talktorials.rst
@@ -63,6 +63,7 @@ The basis for computer-aided drug discovery
talktorials/T013_query_pubchem.nblink
talktorials/T021_one_hot_encoding.nblink
talktorials/T022_ligand_based_screening_neural_network.nblink
talktorials/T032_compound_activity_proteochemometrics.nblink

Structural biology
------------------
@@ -0,0 +1 @@
{"path": "../../teachopencadd/talktorials/T032_compound_activity_proteochemometrics/talktorial.ipynb", "extra-media": ["../../teachopencadd/talktorials/T032_compound_activity_proteochemometrics/images"]}
@@ -0,0 +1,57 @@
# T032 · Compound activity: Proteochemometrics

**Note:** This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:

- Marina Gorostiola González, 2022, [Computational Drug Discovery](https://www.universiteitleiden.nl/en/science/drug-research/drug-discovery-and-safety/computational-drug-discovery), Drug Discovery & Safety, Leiden University (The Netherlands)
- Olivier J.M. Béquignon, 2022, Computational Drug Discovery, Drug Discovery & Safety, Leiden University (The Netherlands)
- Willem Jespers, 2022, Computational Drug Discovery, Drug Discovery & Safety, Leiden University (The Netherlands)


## Aim of this talktorial

While activity data are abundant for some protein targets, many underexplored proteins still have too little data for machine learning (ML) models to predict activity reliably. This issue can be partially addressed by leveraging similarities and differences between proteins. In this talktorial, we use proteochemometrics (PCM) modeling, which enriches activity models with protein data, to predict the activity of novel compounds against the four [adenosine receptor](https://journals.physiology.org/doi/full/10.1152/physrev.00049.2017) isoforms (A1, A2A, A2B, A3).
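
As a minimal sketch of this idea (not the talktorial's actual code), a PCM data point can be pictured as a compound descriptor vector concatenated with the descriptor vector of its target, so that a single model is trained across all receptors. The descriptor values, the activity values, and the `build_pcm_rows` helper are invented for illustration only.

```python
# Hypothetical toy descriptors; real values would come from descriptor calculators.
compound_descriptors = {
    "cmpd_1": [0.12, 3.40],
    "cmpd_2": [0.98, 1.75],
}
protein_descriptors = {
    "A1": [1.0, 0.0, 0.5],
    "A2A": [0.0, 1.0, 0.2],
}
# (compound, target, activity value) triples; the numbers are invented.
activities = [
    ("cmpd_1", "A1", 6.5),
    ("cmpd_1", "A2A", 7.1),
    ("cmpd_2", "A2A", 5.3),
]


def build_pcm_rows(activities, compound_descriptors, protein_descriptors):
    """Return (X, y): each X row is compound features followed by protein features."""
    X, y = [], []
    for compound_id, target_id, value in activities:
        X.append(compound_descriptors[compound_id] + protein_descriptors[target_id])
        y.append(value)
    return X, y


X, y = build_pcm_rows(activities, compound_descriptors, protein_descriptors)
print(len(X), "rows with", len(X[0]), "features each")
```

A QSAR model for a single target would use only the compound part of each row. The protein part is what lets one model share information between targets.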


### Contents in *Theory*
* Proteochemometrics (PCM) modeling
* Data preparation
    * Papyrus dataset
    * Molecule encoding: molecular descriptors
    * Protein encoding: protein descriptors
* Machine learning principles: regression
    * Data splitting methods
    * Regression evaluation metrics
    * ML algorithm: Random Forest
* Applications of PCM in drug discovery


### Contents in *Practical*

* Download Papyrus dataset
* Data preparation
    * Filter activity data for targets of interest
    * Align target sequences
    * Calculate protein descriptors
    * Calculate compound descriptors
* Proteochemometrics modeling
    * Helper functions
    * Preprocessing
    * Model training and validation
        * Random split PCM model
        * Random split QSAR models
        * Leave one target out split PCM model
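
The leave one target out split and the regression metrics listed above can be sketched in plain Python as follows. This is an illustrative sketch only: the function names are hypothetical and the target labels are toy data. The notebook itself may rely on library implementations instead.

```python
import math


def leave_one_target_out(targets):
    """Yield (held_out, train_idx, test_idx) once per distinct target label."""
    for held_out in sorted(set(targets)):
        train_idx = [i for i, t in enumerate(targets) if t != held_out]
        test_idx = [i for i, t in enumerate(targets) if t == held_out]
        yield held_out, train_idx, test_idx


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# One target label per data point (toy example).
targets = ["A1", "A1", "A2A", "A2B", "A3", "A3"]
for held_out, train_idx, test_idx in leave_one_target_out(targets):
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Unlike a random split, every test set here contains only data points of a target the model never saw during training, which probes how well a PCM model extrapolates to a new protein.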


### References

* Papyrus scripts [GitHub](https://github.com/OlivierBeq/Papyrus-scripts)
* Papyrus dataset preprint: [*ChemRxiv* (2021)](https://chemrxiv.org/engage/chemrxiv/article-details/617aa2467a002162403d71f0)
* Molecular descriptors (Mordred): [*J. Cheminf.* (2018), **10**](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y)
* Protein descriptors (ProDEC) [GitHub](https://github.com/OlivierBeq/ProDEC)
* Regression metrics [(scikit-learn)](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics)
* XGBoost [Documentation](https://xgboost.readthedocs.io/en/stable/index.html)
* Proteochemometrics review: [*Drug Discov. Today Technol.* (2019), **32**, 89-98](https://www.sciencedirect.com/science/article/pii/S1740674920300111?via%3Dihub)


@@ -0,0 +1,27 @@
name: teachopencadd_t032
channels:
  - conda-forge
  - defaults
dependencies:
  - python>=3.8
  - pip
  - jupyter
  - jupyterlab>=3
  - nglview>=3
  - pandas
  - numpy
  - biopython<=1.77
  - rdkit==2021.09.5
  - scikit-learn
  - scipy
  - seaborn
  # Dependencies for PCM and papyrus scripts
  - mordred
  - pytest
  - pip:
      - https://github.com/OlivierBeq/Papyrus-scripts/tarball/master
      - prodec
      - rich-msa
      # Dependency for ClustalO webservice (also conda installable via -c bioconda)
      - xmltramp2

@@ -0,0 +1,9 @@
# T032 · Compound activity: Proteochemometrics
## Data

This folder stores input and output data for the Jupyter notebook.

- `papyrus`: Directory with Papyrus bioactivity dataset downloads.
- `sequences.fasta`: Sequences of the targets of interest for PCM modelling, in FASTA format.
- `aligned_sequences.aln-fasta.fasta`: ClustalO multiple sequence alignment output, in FASTA format.
- `aligned_sequences.[...]`: Additional ClustalO output files, not needed for the talktorial.
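
As a hedged illustration of how an aligned FASTA file such as `aligned_sequences.aln-fasta.fasta` can be consumed, the sketch below parses FASTA text with the standard library and checks that all aligned sequences have the same length. The `read_fasta` helper and the toy alignment are hypothetical. The notebook may use a dedicated library for this step.

```python
def read_fasta(text):
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Toy alignment with wrapped lines; not the real adenosine receptor data.
toy_alignment = """>0 TARGET_A
--MLLE
TQD-
>1 TARGET_B
MPPSIS
A-QA
"""

aligned = read_fasta(toy_alignment)
lengths = {len(sequence) for sequence in aligned.values()}
# In a multiple sequence alignment, gaps pad every sequence to one common length.
assert len(lengths) == 1, "aligned sequences must share one length"
```

To apply it to the stored file, the text could be loaded with `pathlib.Path(...).read_text()` and passed to `read_fasta`. The exact path depends on where the notebook is run.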
@@ -0,0 +1,36 @@
>0 AA2BR_HUMAN Homo sapiens (Human) Membrane receptor->Family A G protein-coupled receptor->Small molecule receptor (family A GPCR
-----MLLETQDALYVALELVIAALSVAGNVLVCAAVGTANTLQTPTNYFLVSLAAADVA
VGLFAIPFAITISLGFCTDFYGCLFLACFVLVLTQSSIFSLLAVAVDRYLAICVPLRYKS
LVTGTRARGVIAVLWVLAFGIGLTPFLGWNSKDSATNNCTEPWDGTTNESCC---LVKCL
FENVVPMSYMVYFNFFGCVLPPLLIMLVIYIKIFLVACRQLQRTEL----MDHSRTTLQR
EIHAAKSLAMIVGIFALCWLPVHAVNCVTLFQPAQGKNKPKWAMNMAILLSHANSVVNPI
VYAYRNRDFRYTFHKIISRYLLCQADVKSGNGQ----------AGVQPALGVGL------
------------------------------------------------------------
------
>1 AA1R_HUMAN Homo sapiens (Human) Membrane receptor->Family A G protein-coupled receptor->Small molecule receptor (family A GPCR)
---MPPSISAFQAAYIGIEVLIALVSVPGNVLVIWAVKVNQALRDATFCFIVSLAVADVA
VGALVIPLAILINIGPQTYFHTCLMVACPVLILTQSSILALLAIAVDRYLRVKIPLRYKM
VVTPRRAAVAIAGCWILSFVVGLTPMFGWNNLSAVER----AWA---ANGSMGEPVIKCE
FEKVISMEYMVYFNFFVWVLPPLLLMVLIYLEVFYLIRKQLNKKVSAS--SGDPQKYYGK
ELKIAKSLALILFLFALSWLPLHILNCITLFCPSC--HKPSILTYIAIFLTHGNSAMNPI
VYAFRIQKFRVTFLKIWNDHFRCQPAPPIDEDLPEE------------------------
----------RPDD----------------------------------------------
------
>2 AA2AR_HUMAN Homo sapiens (Human) Membrane receptor->Family A G protein-coupled receptor->Small molecule receptor (family A GPCR
------MPIMGSSVYITVELAIAVLAILGNVLVCWAVWLNSNLQNVTNYFVVSLAAADIA
VGVLAIPFAITISTGFCAACHGCLFIACFVLVLTQSSIFSLLAIAIDRYIAIRIPLRYNG
LVTGTRAKGIIAICWVLSFAIGLTPMLGWNN-------CGQPKEGKNHSQGCGEGQVACL
FEDVVPMNYMVYFNFFACVLVPLLLMLGVYLRIFLAARRQLKQMESQPLPGERARSTLQK
EVHAAKSLAIIVGLFALCWLPLHIINCFTFFCPDC-SHAPLWLMYLAIVLSHTNSVVNPF
IYAYRIREFRQTFRKIIRSHVLRQQEPFKAAGTSARVLAAHGSDGEQVSLRLNGHPPGVW
ANGSAPHPERRPNGYALGLVSGGSAQESQGNTGLPDVELLSHELKGVCPEPPGLDDPLAQ
DGAGVS
>3 AA3R_HUMAN Homo sapiens (Human) Membrane receptor->Family A G protein-coupled receptor->Small molecule receptor (family A GPCR)
MPNNSTALSLANVTYITMEIFIGLCAIVGNVLVICVVKLNPSLQTTTFYFIVSLALADIA
VGVLVMPLAIVVSLGITIHFYSCLFMTCLLLIFTHASIMSLLAIAVDRYLRVKLTVRYKR
VTTHRRIWLALGLCWLVSFLVGLTPMFGWNMKLTSEYH-------------RNVTFLSCQ
FVSVMRMDYMVYFSFLTWIFIPLVVMCAIYLDIFYIIRNKLSLNLSN---SKETGAFYGR
EFKTAKSLFLVLFLFALSWLPLSIINCIIYFNG----EVPQLVLYMGILLSHANSMMNPI
VYAYKIKKFKETYLLILKACVVCHPSDSLDTSIEKNSE----------------------
------------------------------------------------------------
------
@@ -0,0 +1,8 @@
# T032 · Compound activity: Proteochemometrics

## Images

This folder stores images used in the Jupyter notebook.
- `PCM_model_text-01.png`
- `papyrus_workflow.png`
- `splitting_methods.png`
@@ -0,0 +1,6 @@
# T032 · Compound activity: Proteochemometrics
## Scripts

This folder stores scripts needed for the Jupyter notebook.

- `clustalo.py`: ClustalO REST API client.
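
For orientation, the sketch below shows roughly how a request to a ClustalO REST service can be assembled with the standard library. It is not the contents of `clustalo.py`. The base URL, the `/run`, `/status/{job_id}` and `/result/{job_id}/{result_type}` paths, and the `email`, `stype` and `sequence` parameters are assumptions based on the public EMBL-EBI job dispatcher service and should be checked against its documentation. All function names are hypothetical. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL of the EMBL-EBI Clustal Omega REST service.
BASE_URL = "https://www.ebi.ac.uk/Tools/services/rest/clustalo"


def build_run_request(fasta_text, email):
    """Build (but do not send) the POST request that would submit an alignment job."""
    payload = urlencode({"email": email, "stype": "protein", "sequence": fasta_text})
    return Request(f"{BASE_URL}/run", data=payload.encode("utf-8"), method="POST")


def status_url(job_id):
    """URL that would be polled to check on a submitted job."""
    return f"{BASE_URL}/status/{job_id}"


def result_url(job_id, result_type="aln-fasta"):
    """URL from which a finished alignment would be downloaded."""
    return f"{BASE_URL}/result/{job_id}/{result_type}"


request = build_run_request(">a\nMKV\n>b\nMRV\n", email="user@example.org")
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) and polling the status URL until the job has finished is left out on purpose, since it requires network access and a valid email address.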