About gene expression input to the model #9
Comments
Hi @sunset222, thanks for your interest in our work.
For RL optimization of G, we used GEPs publicly available from GDSC (Yang et al., 2012) and
CCLE (Barretina et al., 2012) databases. Since the RNA-Seq of these cancer cell line databases were
passed through the PVAE (pretrained on human samples from TCGA (Weinstein et al., 2013)), we
compared the standardized gene expression distributions for the selected genes across these databases
and found good agreement (compare Figure S4 in Supplementary Material S2), in alignment with
the reported consensus between transcriptomic data in CCLE and TCGA (Ghandi et al., 2019). To
train the critic (C), IC50 drug sensitivity data from GDSC and CCLE was utilized.
The reason we didn't use the pickled file in the PaccMann predictor is that the project grew organically: the PaccMann predictor existed first, and we conceived PaccMann RL afterwards.
Hope this helps; let me know if you need more help.
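For readers who want to repeat a comparison of standardized expression distributions on their own data, here is a minimal sketch. It is not the code behind Figure S4; the per-gene z-scoring, the hand-written two-sample Kolmogorov-Smirnov statistic, and the names `standardize`, `ks_statistic`, `dataset_a` and `dataset_b` are all assumptions chosen for illustration, and the arrays are synthetic.

```python
import numpy as np


def standardize(expr):
    """Z-score each gene (column) within one dataset."""
    mean = expr.mean(axis=0)
    std = expr.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant genes
    return (expr - mean) / std


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Synthetic stand-ins for two databases (rows: cell lines, columns: genes).
# Real data would be loaded from the respective sources instead.
rng = np.random.default_rng(0)
dataset_a = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
dataset_b = rng.normal(loc=8.0, scale=3.0, size=(150, 4))

z_a, z_b = standardize(dataset_a), standardize(dataset_b)
for gene in range(z_a.shape[1]):
    print(f"gene {gene}: KS = {ks_statistic(z_a[:, gene], z_b[:, gene]):.3f}")
```

A small statistic for a gene means its standardized distributions have a similar shape in both datasets; a value near 1 would point to a mismatch worth inspecting before passing the data through a shared encoder.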
Thank you for your fast and kind reply. I will use the pickle file to rebuild your framework. :)
Hi, thank you for your amazing work.
However, I am quite confused about the gene expression input to the models (both PaccMann and RL).
1. In PaccMann predictor
Your recent model uses RNA-seq data, but the dataset you uploaded (~400 cell lines) does not cover all cell lines in GDSC (~1000 cell lines). Some of the 2,128 selected genes are also missing.
Can you explain how the model handles the expression values of missing genes, and how it handles data points that are missing from the cell line - gene expression dictionary?
2. In PaccMann RL (Generator)
As you mention in the README, you used RNA-seq gene expression data for the whole framework.
But it seems that the input gene expression for conditional generation (the pickle file) was RMA-normalized gene expression.
I think so for the reasons mentioned above: the RMA data covers most of the cell lines (985) and contains the 2,128 selected genes.
You mention in the paper that the PVAE is trained on TCGA RNA-seq data, so I think there might be a discrepancy when you encode RMA gene expression with the PVAE encoder.
Can you explain the exact source of the pickle file (gdsc_transcriptomics_for_conditional_generation.pkl) and why you do not use that pickle file in the other part (the PaccMann predictor)?
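To make the coverage question concrete, here is a small sketch of how missing cell lines and missing genes could be listed. The nested-dictionary layout (cell line name to gene name to value), the function name `coverage_report`, and the toy names are assumptions for illustration only; the actual structure of the uploaded files is not stated in this thread.

```python
def coverage_report(expression, selected_genes, requested_cell_lines):
    """Report which cell lines and genes are absent from an expression dictionary.

    expression: dict mapping cell line name -> dict mapping gene name -> value
                (an assumed layout, not necessarily that of the real files).
    """
    missing_cell_lines = [c for c in requested_cell_lines if c not in expression]
    missing_genes = {}
    for cell_line in requested_cell_lines:
        if cell_line not in expression:
            continue  # already reported as a missing cell line
        absent = [g for g in selected_genes if g not in expression[cell_line]]
        if absent:
            missing_genes[cell_line] = absent
    return missing_cell_lines, missing_genes


# Toy data, purely illustrative.
toy_expression = {
    "line_1": {"GENE_A": 2.1, "GENE_B": 0.4, "GENE_C": 5.3},
    "line_2": {"GENE_A": 1.7, "GENE_C": 4.8},
}
missing_lines, missing_genes = coverage_report(
    toy_expression,
    selected_genes=["GENE_A", "GENE_B", "GENE_C"],
    requested_cell_lines=["line_1", "line_2", "line_3"],
)
print("cell lines without expression data:", missing_lines)
print("genes missing per cell line:", missing_genes)
```

Running the same kind of check against the real gene list and the GDSC cell line list would show exactly which entries a model has to impute, drop, or otherwise handle.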