## Reducing misinterpretation of model output with statistical techniques

ML is most likely to be applied successfully when certain conditions are met. First, the dataset should contain sufficient representation from each class so that the relevant variability within that class is captured. Second, the dataset should be complete: all samples should have measurements for all variables (i.e., the dataset is not “sparse”, missing data for some of the samples). Third, there should be no ambiguity about the labels of the samples in the dataset (i.e., no “label-noise”).

Rare disease datasets violate many of these assumptions. A small number of samples in a specific class fails to fully capture the variability of that class (e.g., only a few patients with a particular rare disease in a health records dataset), which can require special consideration during evaluation (Box 2). The data are also often sparse, and there may be abundant label-noise due to incomplete understanding of the disease. All of these factors contribute to a low signal-to-noise ratio in rare disease datasets. Applying ML to such data without addressing these shortcomings may produce models that generalize poorly or are hard to interpret.

Class imbalance in datasets can be addressed using decision tree-based ensemble learning methods, e.g., random forests [@doi:10.1007/s11634-019-00354-x] (Figure [@fig:3]a). Random forests use techniques based on sampling with replacement (bootstrapping) to form a consensus about the important predictive features identified by the decision trees (e.g., Box 1c) [@doi:10.1023/A:1010933404324; @doi:10.1186/1472-6947-13-134]. Additional approaches, like combining random forests with sampling without replacement, can generate confidence intervals for the model predictions (for applications like Box 1d) by mimicking real-world cases where most rare disease datasets are incomplete [@doi:10.3390/genes11020226]. Resampling approaches are most helpful in constructing confidence intervals for algorithms that generate the same outcome every time they are run (i.e., deterministic models). For decision trees that choose features at random when selecting a path to the outcome (i.e., non-deterministic models), resampling approaches can instead help estimate the reproducibility of the model.
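
As a concrete illustration, the sketch below uses bootstrap resampling to attach percentile confidence intervals to random forest predictions with scikit-learn. This is a minimal example on simulated imbalanced data: the dataset, sample sizes, and number of bootstrap replicates are illustrative assumptions, not details from the studies cited above.

```python
# Minimal sketch: bootstrap confidence intervals for random forest
# predictions on simulated imbalanced data (illustrative, not from
# any cited study).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Simulated cohort standing in for a rare disease dataset:
# 500 samples, ~5% of which belong to the rare class.
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

rng = np.random.default_rng(0)
n_boot = 100
probs = np.empty((n_boot, len(X_test)))

for b in range(n_boot):
    # Resample the training set with replacement (bootstrap) and refit
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    forest = RandomForestClassifier(n_estimators=100, random_state=b)
    forest.fit(X_train[idx], y_train[idx])
    probs[b] = forest.predict_proba(X_test)[:, 1]

# 95% percentile interval of the predicted rare-class probability for
# each test sample, reflecting sensitivity to the training sample drawn
lower, upper = np.percentile(probs, [2.5, 97.5], axis=0)
print(f"sample 0: P(rare class) in [{lower[0]:.2f}, {upper[0]:.2f}]")
```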

When decision tree-based ensemble methods fail for rare disease datasets, cascade learning is a viable alternative [@pmc:PMC6371307]. In cascade learning, multiple methods leveraging distinct statistical techniques are applied in sequence to identify stable patterns in the dataset [@doi:10.1109/CVPR.2001.990537; @doi:10.1007/978-3-540-75175-5_16; @doi:10.1109/icpr.2004.1334680]. For example, a cascade learning approach for identifying rare disease patients from electronic health record data (Box 1a) incorporated independent steps for feature extraction (word2vec [@arxiv:1301.3781]), preliminary prediction with ensembled decision trees, and prediction refinement using data similarity metrics [@pmc:PMC6371307]. Combining these three methods resulted in better overall prediction on a silver standard dataset than a model that used ensemble-based prediction alone. In addition to cascade learning, approaches that better represent rare classes using class re-balancing techniques, such as inverse sampling probability weighting [@doi:10.1186/s12911-021-01688-3], inverse class frequency weighting [@doi:10.1197/jamia.M3095], oversampling of rare classes [@doi:10.1613/jair.953], or uniformly random undersampling of the majority class [@doi:10.48550/arXiv.1608.06048], may also help minimize issues associated with class imbalance.
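
To make one of these re-balancing techniques concrete, the following minimal sketch applies inverse class frequency weighting with scikit-learn; the logistic regression model and the simulated 95:5 class imbalance are illustrative assumptions, not details of the cited studies.

```python
# Minimal sketch: inverse class frequency weighting on simulated data
# (the model choice and label counts are illustrative assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))    # illustrative feature matrix
y_train = np.array([0] * 95 + [1] * 5)  # 95:5 class imbalance

# Weight each class inversely to its frequency so that errors on the
# rare class count as heavily in the loss as errors on the common class
classes, counts = np.unique(y_train, return_counts=True)
weights = {c: len(y_train) / (len(classes) * n)
           for c, n in zip(classes, counts)}
print(weights)  # {0: ~0.53, 1: 10.0}

model = LogisticRegression(class_weight=weights).fit(X_train, y_train)
# scikit-learn computes these same weights with class_weight="balanced"
```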

The presence of label-noise and sparsity in the data can lead to overfitting and poor generalizability, meaning that the models show high prediction accuracy on the training data but low prediction accuracy on new evaluation data. Overfit models tend to rely on patterns that are unique to the training data, such as the clinical coding practices at a particular hospital, and fail to generalize to new data, such as data collected at different hospitals [@isbn:0262035618; @pmc:PMC8238368]. Regularization approaches can help mitigate this by adding constraints that discourage a model from fitting noise in the training data, protecting ML models from poor generalizability by reducing model complexity and shrinking the model feature space [@doi:10.1371/journal.pgen.1004754; @doi:10.1002/sim.6782] (Figure [@fig:3]a).

Examples of ML methods with regularization include ridge regression, LASSO regression, and elastic net regression [@doi:10.1111/j.1467-9868.2005.00503.x], among others. LASSO regularization helped select a small number of informative genes as features for models that used brain tissue gene expression to classify amyotrophic lateral sclerosis (ALS) patients and healthy controls with high accuracy, making the models more interpretable [@doi:10.1186/s10020-023-00603-y]. In the context of rare immune cell signature discovery, where a few genes or features are expected to distinguish between immune cell types, elastic net regression was able to exclude groups of uninformative genes by reducing their contributions to zero [@doi:10.1186/s12859-019-2994-z; @doi:10.1111/j.1467-9868.2005.00503.x]. In a study using a variational autoencoder (VAE) (see Box 3) for dimensionality reduction of gene expression data from acute myeloid leukemia (AML) samples, the Kullback–Leibler (KL) divergence between the distribution of the learned low-dimensional representation and a prior distribution provided the regularizing penalty for the model [@doi:10.1101/278739; @doi:10.48550/arXiv.1312.6114]. A study using a convolutional neural network (CNN) to identify tubers in MRI images from tuberous sclerosis patients (an application that can facilitate Box 1a) minimized overfitting with dropout regularization, which removes randomly chosen network nodes in each training iteration, effectively generating a simpler model at every pass [@doi:10.1371/journal.pone.0232376]. Thus, depending on the learning method used, regularization approaches should be considered when working with rare disease datasets.
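
As a minimal sketch of how an L1 (LASSO) penalty yields a small, interpretable feature set, the example below fits a sparse logistic regression to simulated expression data with scikit-learn; the data, penalty strength, and gene counts are illustrative assumptions, not values from the ALS study cited above.

```python
# Minimal sketch: L1-regularized feature selection on simulated
# expression data (all sizes and parameters are illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_genes = 60, 1000              # far more features than samples
X = rng.normal(size=(n_samples, n_genes))  # stand-in expression matrix
# Assume only the first 5 "genes" carry signal about case/control status
signal = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=n_samples)
y = (signal > 0).astype(int)

# The L1 penalty drives most coefficients exactly to zero, leaving a
# small, interpretable set of predictive genes (smaller C = stronger penalty)
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)
selected = np.flatnonzero(model.coef_[0])
print(f"{len(selected)} of {n_genes} genes retained:", selected)
```

The same scikit-learn API also offers an elastic net variant (`penalty="elasticnet"` with `solver="saga"` and an `l1_ratio` balancing the L1 and L2 terms), which corresponds to the elastic net regression described above.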