iLR is a method that selects small sets of informative genes distinguishing subtle differences between cell states (e.g., disease, treatment). It does so by iteratively selecting genes within a logistic regression framework coupled with penalized Pareto front optimization.
Depends:
- Python (>= 3.9.2)

Requirements:
- scanpy >= 1.8.2
- numpy >= 1.21.6
- pandas >= 1.5.3
- sklearn >= 1.5.3
This repo has large data files tracked by Git LFS. Make sure Git LFS is installed before running `git clone` or `git pull`:

```
git lfs install
```
Make sure `iLR.py` is in the same folder as your working script, then import it:

```python
from iLR import iLR
```
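If `iLR.py` is not in the working directory, one option is to add its folder to the Python path before importing; the path below is a placeholder, not part of this repo:

```python
import sys

# Placeholder path: replace with the folder that actually contains iLR.py.
sys.path.append("/path/to/iLR")
from iLR import iLR
```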
The parameters can be changed as described below. An AnnData object can be accepted directly as input: `obs_names` corresponds to cells and `var_names` corresponds to genes. Log-normalized expression and preselection of genes by the Wilcoxon test are recommended.
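If `adata.X` still holds raw counts, a minimal log-normalization sketch with scanpy (a standard `normalize_total`/`log1p` workflow; the preprocessing used in the paper may differ) is:

```python
import scanpy as sc

# Normalize counts per cell, then log1p-transform; the result is stored in adata.X.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
```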
Example of preparing data (from the paper https://www.science.org/doi/10.1126/science.aav8130, L4 cell type only):

```python
import numpy as np
import pandas as pd
import scanpy as sc
from sklearn.model_selection import ShuffleSplit

info = pd.DataFrame(columns=['train_ASD_number', 'train_control_number',
                             'num_genes_started', 'test_ASD_number', 'test_control_number'])

adata = L4  # AnnData object restricted to the L4 cell type

# Preselect genes by the Wilcoxon test (ASD vs. Control)
sc.tl.rank_genes_groups(adata, groupby="diagnosis", method='wilcoxon')
df = sc.get.rank_genes_groups_df(adata, group="ASD")
selected = df[df['pvals_adj'] < 0.05]['names'].tolist()
info.at[0, 'num_genes_started'] = len(selected)

# 70/30 train/test split of cells
X_all = adata.X
y_all = np.asarray(adata.obs["diagnosis"])
rs = ShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_index, test_index = next(rs.split(X_all, y_all))
train_adata = adata[train_index, :]
test_adata = adata[test_index, :]

info.at[0, 'train_ASD_number'] = np.sum(train_adata.obs['diagnosis'] == 'ASD')
info.at[0, 'train_control_number'] = np.sum(train_adata.obs['diagnosis'] == 'Control')
info.at[0, 'test_ASD_number'] = np.sum(test_adata.obs['diagnosis'] == 'ASD')
info.at[0, 'test_control_number'] = np.sum(test_adata.obs['diagnosis'] == 'Control')
info
```
# output
| train_ASD_number | train_control_number | num_genes_started | test_ASD_number | test_control_number |
|--------------------|------------------------|---------------------|-------------------|-----------------------|
| 2531 | 2250 | 2941 | 1057 | 993 |
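The call to `iLR()` further below uses `adata_train_filtered` and `adata_test_filtered`. A minimal sketch of how these could be built, assuming they are simply the train/test splits restricted to the Wilcoxon-preselected genes (the exact filtering used in the paper may differ):

```python
# Assumption: "pre-filtered" means subsetting the AnnData objects to the
# preselected genes from the Wilcoxon test above.
adata_train_filtered = train_adata[:, selected].copy()
adata_test_filtered = test_adata[:, selected].copy()
```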
Preprocessed demo data is available in the `test_data` folder.
`iLR` returns a dictionary with each penalty as key and a counter (dictionary) of selected genes as value. Its inputs are listed below.
- `train_data`: AnnData object `adata` with pre-filtered genes and log-normalized gene expression stored in `adata.X`
- `test_data`: AnnData object `adata` with pre-filtered genes and log-normalized expression stored in `adata.X`
- `observation`: name of the `obs` column of interest (e.g., treatments)
- `num_repeats`: number of repeats of iLR (default: 1)
- `ia`: L2 regularization parameter (default: 0.1)
- `per_remove`: proportion of genes removed at each iteration of iLR (default: 0.2)
- `e`: penalty of the Pareto front (list, default: [0])
- `min_num`: minimal number of genes desired in the final gene set (default: 10)
- `rs`: random seed of the logistic regression (default: 1)
- `plot`: whether to plot the gene number vs. classification AUC scatter plot (boolean, default: True)
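For reference, a call that spells out every documented argument with its default value (argument names taken from the list above) might look like:

```python
# All keyword arguments are set to the documented defaults.
counts, eval_table = iLR(adata_train_filtered, adata_test_filtered, 'diagnosis',
                         num_repeats=1, ia=0.1, per_remove=0.2,
                         e=[0], min_num=10, rs=1, plot=True)
```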
Run `iLR()`; it outputs an evaluation table containing AUCs and the gene sets selected under different Pareto front penalties.

```python
counts, eval_table = iLR(adata_train_filtered, adata_test_filtered, 'diagnosis',
                         ia=0.1, e=[0, 1, 2], min_num=10, plot=False)
eval_table
```
# output
| penalty | train_mean_cv_acc | train_mean_cv_auc | number_selected_genes | train_acc | train_auc | test_acc | test_auc |
|---------|-------------------|-------------------|------------------------|-----------|-----------|----------|----------|
| 0.0 | 0.911108 | 0.971766 | 314.0 | 0.942481 | 0.986558 | 0.829268 | 0.908691 |
| 1.0 | 0.816149 | 0.895648 | 64.0 | 0.822004 | 0.902639 | 0.782439 | 0.865368 |
| 2.0 | 0.755698 | 0.831388 | 25.0 | 0.758000 | 0.835404 | 0.744878 | 0.826975 |
The evaluation table contains:
- `penalty`: penalty of the Pareto front
- `train_mean_cv_acc/auc`: 5-fold cross-validation accuracy or AUC of the selected gene set on the training dataset
- `number_selected_genes`: number of genes selected
- `train_acc/auc`: accuracy or AUC on the training dataset
- `test_acc/auc`: accuracy or AUC on the testing dataset
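As a hypothetical follow-up (not part of the package), the evaluation table can be used to pick a penalty that trades off test AUC against gene set size; the sketch below assumes `eval_table` is a pandas DataFrame with the columns listed above:

```python
# Hypothetical selection rule: among penalties that yield at most 100 genes,
# keep the one with the highest test AUC.
small_sets = eval_table[eval_table['number_selected_genes'] <= 100]
best_penalty = small_sets.loc[small_sets['test_auc'].idxmax(), 'penalty']
```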
```python
counts[2]
```
# output
```
Counter({'FAM73B': 1,
         'GTF2H2': 1,
         'SPDYE2': 1,
         'PELI2': 1,
         'PEBP1': 1,
         'RP11-711K1.8': 1,
         'RP11-611E13.2': 1,
         'TRABD2A': 1,
         'CLTCL1': 1,
         'NDUFAB1': 1,
         'BCYRN1': 1,
         'FAM153A': 1,
         'FBLN7': 1,
         'TRIML2': 1,
         'OR2L13': 1,
         'HSPA1A': 1,
         'LINC01482': 1,
         'HNRNPA3P6': 1,
         'TNC': 1,
         'HSPB1': 1,
         'AC105402.4': 1,
         'MX2': 1,
         'RP11-159J3.1': 1,
         'NPAS4': 1,
         'ZNF208': 1})
```
If `num_repeats` > 1, the value in each counter shows how many times a gene was selected out of the number of repeats, and the evaluation table will contain results for each repeat.
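For example, a hypothetical post-processing step (not part of iLR itself) keeps only the genes selected in at least half of the repeats for a given penalty:

```python
# Assumes iLR was run with num_repeats=10 and that penalty 2 is a key of counts;
# keep genes selected in at least half of the repeats.
num_repeats = 10
stable_genes = [gene for gene, n in counts[2].items() if n >= num_repeats / 2]
```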