Combining approaches is required for the successful application of machine learning to rare diseases
We have described multiple approaches for maximizing the success of ML applications in rare disease, but it is rarely sufficient to use any one of these techniques in isolation. Below, we highlight two examples from the rare disease domain that combine feature-representation-transfer, the use of prior data, and regularization.
Our first example involves a large dataset of acute myeloid leukemia (AML) patient samples with no drug response data and a small in vitro experiment with drug response data [@doi:10.1038/s41467-017-02465-5]. Training an ML model on the small in vitro dataset alone would face the curse of dimensionality, and the dataset's size prohibited representation learning. Instead, Dincer et al. trained a variational autoencoder (VAE, Box 3) on a reasonably large collection of AML patient samples drawn from 96 independent studies to learn meaningful representations, in an approach termed DeepProfile [@doi:10.1101/278739] (Figure [@fig:6]a). The representations or encodings learned by the VAE were then transferred to the small in vitro dataset, reducing its number of features from thousands to eight and improving the performance of the final LASSO linear regression model (Box 1b). In addition to improving performance, the encodings learned by the VAE captured more biological pathways than PCA-derived representations, possibly due to the constraints on the encodings imposed during training (Box 3). Similar results were observed for prediction of histopathology in another rare cancer dataset [@doi:10.1101/278739].
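To make this type of workflow concrete, the sketch below outlines a DeepProfile-style pipeline: train a VAE on a large unlabeled compendium, transfer its encoder to a small labeled dataset, and fit a LASSO model on the low-dimensional encodings. This is not the authors' implementation; the network architecture, hyperparameters, gene count, and input names such as `large_unlabeled_X`, `small_X`, and `small_y` are illustrative assumptions.

```python
# Minimal sketch of a DeepProfile-style workflow (not the authors' code):
# (1) learn a low-dimensional representation from a large unlabeled expression
#     compendium with a variational autoencoder (VAE),
# (2) transfer the frozen encoder to a small labeled dataset, and
# (3) fit a LASSO model on the resulting encodings.
# Dimensions, hyperparameters, and input names are illustrative assumptions.
import torch
import torch.nn as nn
from sklearn.linear_model import Lasso

N_GENES = 5000    # illustrative number of input genes
LATENT_DIM = 8    # DeepProfile reduced thousands of features to eight

class VAE(nn.Module):
    def __init__(self, n_genes=N_GENES, latent_dim=LATENT_DIM):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)       # mean of q(z | x)
        self.fc_logvar = nn.Linear(256, latent_dim)   # log-variance of q(z | x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, n_genes))

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction error plus KL divergence to a standard normal prior;
    # the KL term is the constraint placed on the encodings during training.
    recon_err = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_err + kl

def deepprofile_style(large_unlabeled_X, small_X, small_y, epochs=100):
    """Train the VAE on the large compendium, then transfer its encoder."""
    vae = VAE()
    opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
    X = torch.as_tensor(large_unlabeled_X, dtype=torch.float32)
    for _ in range(epochs):
        opt.zero_grad()
        recon, mu, logvar = vae(X)
        vae_loss(X, recon, mu, logvar).backward()
        opt.step()
    # Transfer step: encode the small labeled dataset with the frozen encoder.
    with torch.no_grad():
        small_mu, _ = vae.encode(torch.as_tensor(small_X, dtype=torch.float32))
    # Regularized linear model trained on only eight features per sample.
    lasso = Lasso(alpha=0.1).fit(small_mu.numpy(), small_y)
    return vae, lasso
```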
While DeepProfile focused on a single disease and tissue combination, some rare diseases affect multiple tissues that a researcher may want to study (e.g., Box 1d). Studying multiple tissues poses significant challenges, and a cross-tissue analysis may require comparing representations from multiple models. Models trained on a small number of samples may also learn representations that "lump together" multiple biological signals, reducing the interpretability of the results. To address these challenges, Taroni et al. trained a Pathway-Level Information ExtractoR (PLIER) model, a matrix factorization approach that takes prior knowledge in the form of gene sets or pathways [@doi:10.1038/s41592-019-0456-1], on a large generic collection of human transcriptomic data [@doi:10.1016/j.cels.2019.04.003]. PLIER uses constraints (regularization) that encourage the learned latent variables to align with a small number of input gene sets, making it well suited to rare disease data. The authors then transferred the representations, or latent variables, learned by the model to describe transcriptomic data from the unseen rare diseases antineutrophil cytoplasmic antibody (ANCA)-associated vasculitis (AAV) and medulloblastoma, in an approach termed MultiPLIER [@doi:10.1016/j.cels.2019.04.003] (Figure [@fig:6]b). MultiPLIER used one model to describe multiple datasets instead of reconciling output from multiple models, making it possible to identify commonalities among disease manifestations or affected tissues.
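The sketch below illustrates the transfer step of a MultiPLIER-style analysis. It is not the PLIER or MultiPLIER code itself; it assumes a gene-by-latent-variable loadings matrix `Z` has already been learned on a large compendium, and the ridge-penalized projection, the penalty value, and the matrix names are illustrative assumptions.

```python
# Minimal sketch of a MultiPLIER-style transfer step (not the authors' code).
# Assume a PLIER-like factorization Y ≈ Z B has already been learned on a large
# compendium, where the loadings Z (genes x latent variables) were encouraged to
# align with prior pathway gene sets. New, unseen rare disease datasets are then
# described with the same latent variables by projecting them onto Z.
# The ridge penalty and matrix names below are illustrative assumptions.
import numpy as np

def project_onto_latent_space(Z, Y_new, l2_penalty=1.0):
    """Estimate latent variable values B_new for new samples.

    Z      : (n_genes, n_latent) loadings learned on the large compendium
    Y_new  : (n_genes, n_samples) expression matrix from the rare disease study,
             row-aligned to the same genes used during training
    Returns
    -------
    B_new  : (n_latent, n_samples) latent variable expression for the new samples
    """
    # Ridge-regularized least squares: B_new = (Z'Z + lambda * I)^-1 Z' Y_new
    n_latent = Z.shape[1]
    gram = Z.T @ Z + l2_penalty * np.eye(n_latent)
    return np.linalg.solve(gram, Z.T @ Y_new)

# Because one set of loadings Z describes every dataset, latent variable values
# for, e.g., an AAV cohort and a medulloblastoma cohort are directly comparable:
# B_aav = project_onto_latent_space(Z, Y_aav)
# B_medulloblastoma = project_onto_latent_space(Z, Y_medulloblastoma)
```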
DeepProfile [@doi:10.1101/278739] and MultiPLIER [@doi:10.1016/j.cels.2019.04.003] exemplify modeling approaches that incorporate prior knowledge, thereby constraining the model space according to plausible or expected biology, or that share information across datasets. Both methods capitalize on the fact that similar biological processes are observed across different biological contexts and that the underlying models can effectively learn those shared processes.