Our goal is to build on the Graph Layer-wise Relevance Propagation method, which explains the predictions of a Graph Convolutional Neural Network, to decode pathological, node-level differences between Alzheimer's disease subjects and healthy controls.
Alzheimer’s disease (AD) is the most common form of dementia (60-70% of cases), mainly affecting the elderly (age >65), with an estimated annual cost of about $300 billion USD (“2020 Alzheimer’s Disease Facts and Figures,” 2020; Dementia, n.d.; Winston Wong, 2020).
There is no cure for AD, and in the past twenty years only two drugs (Aducanumab and Gantenerumab) have shown the potential for clinically meaningful results (Commissioner, 2021; Gantenerumab | ALZFORUM, n.d.; How Is Alzheimer’s Disease Treated?, n.d.; Ostrowitzki et al., 2017; Tolar et al., 2020). Exploration of additional biomarkers for this complex disease is therefore warranted and could potentially aid in the early detection of, or therapeutic intervention in, AD patients.
We wish to develop a multiplex machine learning (ML) approach to identify [gene]omics biomarkers in AD and mild cognitive impairment (MCI) compared to healthy controls (HC).
- Identify the best ML model for predicting AD or MCI versus HC
- Apply this model to a validation set to confirm its performance
- Combine multiple datasets to see if model performance improves
As for the deep learning model and relevance propagation method, we follow the GCN paper that applied this method in the cancer biology field, with slight changes such as:
- Expression Dataset from ROSMAP
- PPI network from HPRD, or test other suitable network
- Hyperparameter tuning
Datasets Used:
The ROSMAP data was obtained from the ROSMAP project, preprocessed, and uploaded to the data folder. The other datasets include
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE150693 (MCI-to-AD converters and non-converters, about 100 samples each) and
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63063 (AD, HC, MCI).
We are planning to build this whole pipeline into a Python script driven by a config file for easy installation and running. It would be as simple as providing the expression set file, the PPI network file, and the hyperparameters in a config file.
Installation simply requires fetching the source code. The following are required:
- Git
To fetch the source code, change into a directory of your choice and run:
git clone -b main \
https://github.com/u-brite/TeamADGuy
OS:
Works on all major operating systems.
Tools:
- Anaconda3
  - Tested with version: 2020.02
- Docker
  - As an alternative to Anaconda, use the provided Dockerfile:
docker build -t gnc src/Docker/
# See running containers
docker container ls
# See all containers
docker ps -a
# Stop the container
docker stop <container name>   # e.g. docker stop gnn
# Start the container
docker start <container name>  # e.g. docker start gnn
# For an interactive terminal session inside the container:
docker exec -it <container name> /bin/bash
cd ~   # e.g. docker exec -it metanets /bin/bash, then cd /root/
Change into the root directory and run the commands below to run the deep learning model:
# create conda environment. Needed only the first time.
conda env create --file configs/environment.yml
# if you need to update existing environment
conda env update --file configs/environment.yml
# activate conda environment
conda activate gcn
To run the deep learning model, first obtain the required expression dataset, with subjects as columns and genes as rows, plus a column named 'Probe' containing the gene names for later reference. A label file giving each subject's disease condition is also required for prediction. Finally, a network file is needed; we used the HPRD PPI, but you are free to use whichever PPI network suits your data best.
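The expected layout of the expression file (genes as rows, subjects as columns, plus a 'Probe' column) can be sanity-checked with a short pandas sketch; the gene and subject names below are purely illustrative, not taken from the actual ROSMAP files.

```python
import io
import pandas as pd

# Illustrative expression matrix: genes as rows, subjects as columns,
# plus a 'Probe' column holding the gene names.
csv_text = """Probe,subj1,subj2,subj3
APOE,2.31,1.87,2.05
APP,0.95,1.12,1.03
PSEN1,1.44,1.38,1.51
"""

expr = pd.read_csv(io.StringIO(csv_text))

# Basic sanity checks before feeding the file to the model.
assert "Probe" in expr.columns, "expression file must have a 'Probe' column"
assert expr["Probe"].is_unique, "duplicate gene names found"

genes = expr["Probe"].tolist()
subjects = [c for c in expr.columns if c != "Probe"]
print(len(genes), "genes,", len(subjects), "subjects")
```

The same checks apply to your own expression file: read it with `pd.read_csv(path)` and confirm the 'Probe' column exists and is unique before running the model.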
Rough test files are available inside the Test folder for reference.
Running the model requires completing the config file, whose headers are self-explanatory. The only input to the Python script is the config file itself.
python src/DeepLearningModel.py
The config file must contain all of the values below; the defaults shown can be used as a starting point for adjusting the model's hyperparameters.
input_files:
  path_to_feature_val: "x_rosmap_whole_gene_expression_downsampled.csv"
  path_to_feature_graph: "hprd_rosmap_whole_ppi.csv"
  path_to_labels: "y_rosmap_whole_gene_expression_downsampled.csv"
dl_params:
  epochs: 200
  batch_size: 100
  test_ratio: 0.20
  eval_freq: 40
  filter: chebyshev5
  brelu: b1relu
  pool: mpool1
  graph_cnn_filters: 16
  polynomial_ord: 8
  pooling_size: 2
  regularization: 0.0001
  dropout: 0.95
  learning_rate: 0.00095
  decay_rate: 0.9625
  momentum: 0.99
output_loc:
  res_dir: "output_directory/"
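A config in this shape could be loaded with PyYAML's `safe_load`; this loader is a sketch of how the pipeline might consume the file, not the project's actual code.

```python
import io
import yaml  # PyYAML

# Minimal excerpt of the sample config; key names mirror the README.
config_text = """
input_files:
  path_to_feature_val: "x_rosmap_whole_gene_expression_downsampled.csv"
  path_to_feature_graph: "hprd_rosmap_whole_ppi.csv"
  path_to_labels: "y_rosmap_whole_gene_expression_downsampled.csv"
dl_params:
  epochs: 200
  learning_rate: 0.00095
output_loc:
  res_dir: "output_directory/"
"""

config = yaml.safe_load(io.StringIO(config_text))

# Nested sections become plain dicts, so hyperparameters are easy to read off.
print(config["dl_params"]["epochs"])
print(config["output_loc"]["res_dir"])
```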
Output from this step includes:
output_directory/
├── prediction.csv
└── Relevences.csv - contains the weights and relevance scores of each gene for each subject
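A per-subject relevance table like Relevences.csv can be summarized by averaging relevance across subjects and ranking genes; the column layout and values below are hypothetical, since the real file's exact schema depends on the model run.

```python
import io
import pandas as pd

# Hypothetical excerpt of Relevences.csv: one row per subject,
# one column per gene, values are relevance scores from LRP.
csv_text = """subject,APOE,APP,PSEN1
S1,0.82,0.10,0.35
S2,0.77,0.05,0.41
"""

rel = pd.read_csv(io.StringIO(csv_text), index_col="subject")

# Average relevance over subjects, then rank genes by contribution.
top_genes = rel.mean(axis=0).sort_values(ascending=False)
print(top_genes.index.tolist())
```

Ranking by mean relevance is one simple aggregation; per-subject rankings or group-wise comparisons (AD vs. HC) are equally possible from the same table.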
To run the machine learning models, add both the X and Y datasets from the corresponding data folder on GitHub to the content folder in Google Colab, then run the respective code blocks within the file. The ML models targeted here were lasso and random forest for all three datasets, and the main features obtained were used for exploratory analysis.
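The lasso and random forest step can be sketched with scikit-learn on synthetic data; the synthetic matrix, the use of L1-penalized logistic regression as the "lasso" classifier, and all parameter values are assumptions about the notebooks' setup, not their actual code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an expression matrix (subjects x genes);
# the real notebooks load the X/Y CSVs from the data folder instead.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

# Lasso-style sparse model: L1-penalized logistic regression.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
lasso.fit(X_tr, y_tr)
selected = np.flatnonzero(lasso.coef_[0])  # features with nonzero weights

# Random forest for comparison, with feature importances for exploration.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_tr, y_tr)
importances = rf.feature_importances_

print(len(selected), "features selected by lasso")
print("lasso acc:", lasso.score(X_te, y_te), "rf acc:", rf.score(X_te, y_te))
```

The L1 penalty drives most coefficients to exactly zero, so the surviving features serve as the "main features" fed into exploratory analysis, alongside the forest's importance ranking.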
Pradeep Varathan | [email protected] | Team Leader.
Karen Bonilla | [email protected] | Member.
Mehmet Enes Inam | [email protected] | Member.
Karolina Willicott | [email protected] | Member.