iLR is a method that selects small sets of informative genes distinguishing subtle differences between cell states (e.g., disease, treatment). It does so by iteratively selecting genes within a logistic regression framework coupled with penalized Pareto front optimization.
Depends:
- Python (>= 3.9.2)

Requirements:
- scanpy >= 1.8.2
- numpy >= 1.21.6
- pandas >= 1.5.3
- sklearn >= 1.5.3
This repo has large data files tracked by Git LFS. Make sure Git LFS is installed before running `git clone` or `git pull`:

```
git lfs install
```
Make sure `iLR.py` is in the same folder as your working script, then import it:

```python
from iLR import iLR
```
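If `iLR.py` is not in the working directory, one option is to add its folder to the Python path before importing; the path below is a placeholder, not part of this repo:

```python
import sys

# Placeholder path: replace with the folder that actually contains iLR.py.
sys.path.append("/path/to/iLR")
from iLR import iLR
```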
The parameters can be changed as described below. An AnnData object can be accepted directly as input: `obs_names` corresponds to cells and `var_names` corresponds to genes. Log-normalized expression and preselection of genes by the Wilcoxon test are recommended.
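If `adata.X` still holds raw counts, a minimal log-normalization sketch with scanpy (a standard `normalize_total`/`log1p` workflow; the preprocessing used in the paper may differ) is:

```python
import scanpy as sc

# Normalize counts per cell, then log1p-transform; the result is stored in adata.X.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
```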
Example of preparing data (from the paper https://www.science.org/doi/10.1126/science.aav8130, L4 cell type only):

```python
import numpy as np
import pandas as pd
import scanpy as sc
from sklearn.model_selection import ShuffleSplit

info = pd.DataFrame(columns=['train_ASD_number', 'train_control_number',
                             'num_genes_started', 'test_ASD_number', 'test_control_number'])

adata = L4  # AnnData object restricted to the L4 cell type

# Preselect genes by the Wilcoxon test (ASD vs. Control)
sc.tl.rank_genes_groups(adata, groupby="diagnosis", method='wilcoxon')
df = sc.get.rank_genes_groups_df(adata, group="ASD")
selected = df[df['pvals_adj'] < 0.05]['names'].tolist()
info.at[0, 'num_genes_started'] = len(selected)

# 70/30 train/test split of cells
X_all = adata.X
y_all = np.asarray(adata.obs["diagnosis"])
rs = ShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_index, test_index = next(rs.split(X_all, y_all))
train_adata = adata[train_index, :]
test_adata = adata[test_index, :]

info.at[0, 'train_ASD_number'] = np.sum(train_adata.obs['diagnosis'] == 'ASD')
info.at[0, 'train_control_number'] = np.sum(train_adata.obs['diagnosis'] == 'Control')
info.at[0, 'test_ASD_number'] = np.sum(test_adata.obs['diagnosis'] == 'ASD')
info.at[0, 'test_control_number'] = np.sum(test_adata.obs['diagnosis'] == 'Control')
info
```
# output
| train_ASD_number | train_control_number | num_genes_started | test_ASD_number | test_control_number |
|--------------------|------------------------|---------------------|-------------------|-----------------------|
| 2531 | 2250 | 2941 | 1057 | 993 |
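The call to `iLR()` further below uses `adata_train_filtered` and `adata_test_filtered`. A minimal sketch of how these could be built, assuming they are simply the train/test splits restricted to the Wilcoxon-preselected genes (the exact filtering used in the paper may differ):

```python
# Assumption: "pre-filtered" means subsetting the AnnData objects to the
# preselected genes from the Wilcoxon test above.
adata_train_filtered = train_adata[:, selected].copy()
adata_test_filtered = test_adata[:, selected].copy()
```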
Preprocessed demo data is available in the `test_data` folder.
`iLR` returns a dictionary with each penalty as key and a counter (dictionary) of selected genes as value. Its inputs are listed below.
- `train_data`: AnnData object `adata` with pre-filtered genes and log-normalized gene expression stored in `adata.X`
- `test_data`: AnnData object `adata` with pre-filtered genes and log-normalized expression stored in `adata.X`
- `observation`: name of the `obs` column of interest (e.g., treatments)
- `num_repeats`: number of repeats of iLR (default: 1)
- `ia`: L2 regularization parameter (default: 0.1)
- `per_remove`: proportion of genes removed at each iteration of iLR (default: 0.2)
- `e`: penalty of the Pareto front (list, default: [0])
- `min_num`: minimal number of genes desired in the final gene set (default: 10)
- `rs`: random seed of the logistic regression (default: 1)
- `plot`: whether to plot the gene number vs. classification AUC scatter plot (boolean, default: True)
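For reference, a call that spells out every documented argument with its default value (argument names taken from the list above) might look like:

```python
# All keyword arguments are set to the documented defaults.
counts, eval_table = iLR(adata_train_filtered, adata_test_filtered, 'diagnosis',
                         num_repeats=1, ia=0.1, per_remove=0.2,
                         e=[0], min_num=10, rs=1, plot=True)
```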
Run `iLR()`; it outputs an evaluation table containing AUCs and the gene sets selected under different Pareto front penalties.

```python
counts, eval_table = iLR(adata_train_filtered, adata_test_filtered, 'diagnosis',
                         ia=0.1, e=[0, 1, 2], min_num=10, plot=False)
eval_table
```
# output
| penalty | train_mean_cv_acc | train_mean_cv_auc | number_selected_genes | train_acc | train_auc | test_acc | test_auc |
|---------|-------------------|-------------------|------------------------|-----------|-----------|----------|----------|
| 0.0 | 0.911108 | 0.971766 | 314.0 | 0.942481 | 0.986558 | 0.829268 | 0.908691 |
| 1.0 | 0.816149 | 0.895648 | 64.0 | 0.822004 | 0.902639 | 0.782439 | 0.865368 |
| 2.0 | 0.755698 | 0.831388 | 25.0 | 0.758000 | 0.835404 | 0.744878 | 0.826975 |
The evaluation table contains:
- `penalty`: penalty of the Pareto front
- `train_mean_cv_acc/auc`: 5-fold cross-validation accuracy or AUC of the selected gene set on the training dataset
- `number_selected_genes`: number of genes selected
- `train_acc/auc`: accuracy or AUC on the training dataset
- `test_acc/auc`: accuracy or AUC on the testing dataset
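As a hypothetical follow-up (not part of the package), the evaluation table can be used to pick a penalty that trades off test AUC against gene set size; the sketch below assumes `eval_table` is a pandas DataFrame with the columns listed above:

```python
# Hypothetical selection rule: among penalties that yield at most 100 genes,
# keep the one with the highest test AUC.
small_sets = eval_table[eval_table['number_selected_genes'] <= 100]
best_penalty = small_sets.loc[small_sets['test_auc'].idxmax(), 'penalty']
```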
```python
counts[2]
```
# output
```
Counter({'FAM73B': 1,
         'GTF2H2': 1,
         'SPDYE2': 1,
         'PELI2': 1,
         'PEBP1': 1,
         'RP11-711K1.8': 1,
         'RP11-611E13.2': 1,
         'TRABD2A': 1,
         'CLTCL1': 1,
         'NDUFAB1': 1,
         'BCYRN1': 1,
         'FAM153A': 1,
         'FBLN7': 1,
         'TRIML2': 1,
         'OR2L13': 1,
         'HSPA1A': 1,
         'LINC01482': 1,
         'HNRNPA3P6': 1,
         'TNC': 1,
         'HSPB1': 1,
         'AC105402.4': 1,
         'MX2': 1,
         'RP11-159J3.1': 1,
         'NPAS4': 1,
         'ZNF208': 1})
```
If `num_repeats` > 1, the value in each counter shows how many times a gene was selected out of the number of repeats, and the evaluation table will contain results for each repeat.
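For example, a hypothetical post-processing step (not part of iLR itself) keeps only the genes selected in at least half of the repeats for a given penalty:

```python
# Assumes iLR was run with num_repeats=10 and that penalty 2 is a key of counts;
# keep genes selected in at least half of the repeats.
num_repeats = 10
stable_genes = [gene for gene, n in counts[2].items() if n >= num_repeats / 2]
```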