scGCO is a method for identifying genes with position-dependent differential expression patterns, also known as spatially variable genes, using the graph cuts algorithm. scGCO can analyze spatial transcriptomics data generated by diverse technologies, including but not limited to single-cell RNA sequencing and in situ FISH-based methods. scGCO also scales easily to millions of cells.
This repository contains the source code of scGCO and tutorials on running the program.
scGCO is implemented primarily as a Python 3 package and can be installed from the command line with:
pip install scgco
scGCO has been tested on Ubuntu Linux (18.04.1), Mac OS X (10.14.1), and Windows (Windows 7 Professional).
MIT Licence; see the LICENSE file.
For the list of authors, see the AUTHORS file.
For bugs, feedback, or help, please contact Peng Wang [email protected].
The following code demonstrates the typical data analysis workflow of scGCO.
The tutorial is also provided as a Jupyter Notebook in the Tutorial directory (scGCO_tutorial.ipynb).
The entire process should take only 1-2 minutes on a typical desktop computer with 8 cores.
The required input is the ST data format: a matrix of counts in which spot coordinates are the row names and gene names are the column names. By default, the matrix is read from a tab-separated file (.tsv).
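To make the expected layout concrete, here is a minimal sketch that builds a tiny matrix in this shape and splits the row names into coordinates. The gene names, the values, and the "XxY" row-name convention (coordinates joined by the letter x) are assumptions made for illustration. The sketch uses plain pandas, not scGCO's own reader.

```python
import io
import pandas as pd

# Hypothetical three-spot, three-gene matrix in the ST layout:
# row names encode spot coordinates, column names are genes.
tsv_text = (
    "\tGeneA\tGeneB\tGeneC\n"
    "16.92x9.01\t1\t0\t3\n"
    "16.94x11.07\t0\t2\t5\n"
    "17.97x10.11\t4\t1\t0\n"
)

# Tab-separated file with the first column used as the row index.
counts = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)

# Split each "XxY" row name into numeric x and y coordinates
# (the "x" separator is an assumption about the row-name format).
coords = counts.index.to_series().str.split("x", expand=True).astype(float)
coords.columns = ["x", "y"]

print(counts.shape)
print(coords.to_numpy())
```

In the tutorial itself, read_spatial_expression performs the loading step and returns the locations together with the count matrix.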
As an example, let's analyze spatially variable gene expression in the mouse olfactory bulb using a data set published in Ståhl et al. 2016.
import warnings
warnings.filterwarnings('ignore')
from scGCO import *
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
# read and preprocess the spatial expression data
j=11
ff = './README_file/Rep11_MOB_count_matrix-1.tsv'
locs,data=read_spatial_expression(ff,sep='\t',num_exp_genes=0.01, num_exp_spots=0.01, min_expression=1)
# normalize expression and use 1000 genes to test the algorithm
data_norm = normalize_count_cellranger(data)
data_norm = data_norm.iloc[:,0:1000]
print('Rep11_processing: {}'.format(data_norm.shape))
raw data dim: (262, 16218)
Rep11_processing: (259, 1000)
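The normalization step above is not described in detail in this README. As a rough illustration only, the sketch below assumes a library-size normalization in the style of Cell Ranger, in which each spot is scaled so that its total count equals the median total across spots. The function name median_library_size_normalize and the toy matrix are hypothetical, and normalize_count_cellranger may differ in its details.

```python
import numpy as np

def median_library_size_normalize(counts):
    """Scale each row (spot) so its total equals the median row total."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)
    target = np.median(totals)
    # Guard against empty spots to avoid division by zero.
    safe_totals = np.where(totals > 0, totals, 1.0)
    return counts * (target / safe_totals)[:, None]

# Hypothetical counts: 3 spots (rows) by 3 genes (columns).
toy = np.array([[1, 1, 2],    # total 4
                [2, 2, 4],    # total 8
                [3, 3, 6]])   # total 12
normalized = median_library_size_normalize(toy)

# After scaling, every spot has the same total (the median, 8).
print(normalized.sum(axis=1))
```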
Run the main scGCO function to identify genes with non-random spatial variability.
import time
# estimate smooth factor that minimizes the number of false positives
from sklearn.utils import shuffle
data_norm_rand = shuffle(data_norm)
start_ts = time.time()
unary_scale_factor=100
label_cost=10
algorithm='expansion'
smooth_factor = estimate_smooth_factor(locs, data_norm_rand,start_sf=20,fdr_cutoff=0.01)
print('Rep{}: '.format(j),smooth_factor)
# run the main algorithm to identify genes with non-random
# spatial patterns
# this should take less than a minute
result_df = identify_spatial_genes(locs, data_norm, smooth_factor )
end_ts = time.time()
print('seconds to run: ', end_ts-start_ts)
100%|██████████| 16/16 [00:03<00:00, 4.90it/s]
========iteration1
25 153
100%|██████████| 16/16 [00:03<00:00, 4.83it/s]
========iteration2
30 3
Rep11: 30
100%|██████████| 16/16 [00:03<00:00, 4.26it/s]
100%|██████████| 16/16 [00:03<00:00, 4.55it/s]
seconds to run: 14.518628597259521
write_result_to_csv(result_df,'./README_file/Rep{}_test_result_df.csv'.format(j))
result_df=read_result_to_dataframe('./README_file/Rep11_test_result_df.csv')
print(result_df.shape)
(993, 266)
Select genes with significantly non-random spatial patterns using a specific FDR cutoff.
fdr_cutoff=0.01
fdr_df=result_df.loc[result_df.fdr<fdr_cutoff,]
print(fdr_df.shape)
(181, 266)
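The filtering step above keeps genes whose fdr value falls below the cutoff. This README does not say how scGCO derives that column. As a generic illustration only, the sketch below applies the Benjamini-Hochberg procedure, one widely used way to turn p-values into FDR-adjusted values. The function name benjamini_hochberg and the toy p-values are assumptions for the example.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# Hypothetical per-gene p-values.
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
fdr = benjamini_hochberg(p_values)

# Keep genes below the chosen cutoff, as in the filtering step above.
significant = fdr < 0.05
print(fdr)
print(significant)
```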
Select genes with significantly conserved spatial patterns.
## select genes with significantly conserved spatial patterns
_,hamming_df=simulate_hamming(locs,data_norm,fdr_df,cutoff=0.0001)
hamming_cutoff=round(hamming_df,2)
print('Rep{} {}'.format(j,hamming_cutoff))
Rep11 0.76
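The cutoff printed above is used later as a similarity threshold between spatial patterns. How scGCO computes this similarity internally is not shown here. The sketch below assumes that each pattern is a vector of per-spot segment labels and that similarity is the fraction of spots with matching labels (one minus the normalized Hamming distance). The function name hamming_similarity and the toy patterns are hypothetical.

```python
import numpy as np

def hamming_similarity(labels_a, labels_b):
    """Fraction of positions at which two label vectors agree."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    if a.shape != b.shape:
        raise ValueError("label vectors must have the same shape")
    return float(np.mean(a == b))

# Hypothetical binary segmentations over eight spots.
pattern_1 = [1, 1, 1, 0, 0, 0, 0, 0]
pattern_2 = [1, 1, 0, 0, 0, 0, 0, 0]
pattern_3 = [0, 0, 0, 1, 1, 1, 1, 0]

sim_12 = hamming_similarity(pattern_1, pattern_2)
sim_13 = hamming_similarity(pattern_1, pattern_3)
print(sim_12, sim_13)
```

Under a cutoff of 0.76, pattern_1 and pattern_2 would count as similar in this toy setting, while pattern_1 and pattern_3 would not.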
To learn cell clusters automatically from the input data, set fixed_k=None.
If the number of clusters (K) is known from prior knowledge, set fixed_k=K.
fdr_opt=fdr_df.sort_values(by='fdr')
hamming_cutoff=0.76
seg_max=data_norm.shape[0]*0.6
pattern_df,tissue_mat,targe_df,hamming_df=identify_pattern_conserved_genes_iteration(
locs, data_norm, fdr_opt,
similarity_cutoff=hamming_cutoff,cluster_size=5,smooth_factor=10,
perplexity=30,fixed_k=5,seg_min=8,seg_max=seg_max)
print('Rep{}_step2{}'.format(j,pattern_df.shape))
5
[ 84. 84. 54. 126. 133. 72. 48. 32. 113. 66. 72. 38. 20. 31.
98.]
(97, 266)
================iteration1
k: 5
5
[ 78. 78. 123. 136. 72. 51. 66. 40. 105. 48. 36. 35. 31. 92.
65.]
(86, 266)
================iteration2
k: 5
5
[ 77. 77. 72. 50. 122. 137. 66. 37. 108. 48. 40. 29. 31. 93.
66.]
(86, 266)
Rep11_step2(86, 266)
## visualize the learned conserved spatial domains
print('Rep{}_tissue{}'.format(j,tissue_mat.shape))
image='./data/Raw_data/HE-MOB-breast-cancer/HE_Rep{}_MOB.jpg'.format(j)
colors=['green','blue']
title='Rep11_tissue_pattern'
plot_tissue_pattern(locs,data_norm,tissue_mat,image,colors,title,nrows=5,ncols=5)
plt.savefig('./README_file/Rep11_tissue_patterns.pdf')
plt.show()
Rep11_tissue(15, 259)
Identify genes with expression fold changes above a threshold.
This is the final set of spatially variable genes identified by scGCO.
fdr_cutoff=0.01
import time
start_ts=time.time()
new_result_df,zero_exp_genes=recalc_exp_diff(data_norm,result_df,fdr_cutoff=fdr_cutoff,cluster_k=3)
end_ts=time.time()
ts=(end_ts-start_ts)/60
print('Rep{} ts: {}'.format(j,ts))
exp_cutoff=estimate_exp_diff_cutoff(new_result_df,cutoff=fdr_cutoff*2,q=0.95)
exp_cutoff=round(exp_cutoff,2)
print(exp_cutoff)
Rep11 ts: 0.0015866676966349283
0.77
exp_cutoff=0.77
final_df=pattern_df[abs(pattern_df.exp_diff)>exp_cutoff]
final_df.shape
(55, 266)
Visualize some identified genes.
# visualize top genes
visualize_spatial_genes(final_df.iloc[0:10,], locs, data_norm,point_size=0.2)
# save top genes to pdf
multipage_pdf_visualize_spatial_genes(final_df.iloc[0:10,], locs, data_norm,point_size=0,
fileName='./README_file/top10_genes.pdf')
Perform t-SNE and visualize the clustering of identified genes.
# Do PCA + t-SNE to visualize the clustering patterns of identified genes
# Though only 1000 genes are used, the pattern should resemble Fig. 2b in the manuscript
fig,ax=plt.subplots()
tsne_proj=spatial_pca_tsne(data_norm,final_df.index,perplexity = 20)
title='MOB Rep {}'.format(j)
zz=visualize_tsne_density(tsne_proj,title=title,bins=200,threshold=0.01,ax=ax,fig=fig,
fileName='./README_file/Rep11_tsne.pdf')
Perform graph cuts for a single gene.
# You can also analyze one gene of interest
geneID='Apod' # Let's use Apod as an example
unary_scale_factor = 100 # scale factor for unary energy, default value works well
# set smooth factor to 20;
# use a larger smooth_factor to get fewer segments
# use a smaller smooth_factor to get more segments
smooth_factor=20
ff = './README_file/Rep11_MOB_count_matrix-1.tsv'
# read in spatial gene expression data
locs, data = read_spatial_expression(ff,sep='\t')
# normalize gene expression
data_norm = normalize_count_cellranger(data)
# select Apod's expression
exp = data_norm.loc[:,geneID]
# log transform
exp=np.log1p(exp).values
# create graph representation of spatial coordinates of cells
cellGraph = create_graph_with_weight(locs, exp)
# do graph cut
newLabels, gmm = cut_graph_general(cellGraph, exp, unary_scale_factor, smooth_factor)
# calculate p values
p, node, com = compute_p_CSR(locs,newLabels, gmm, exp, cellGraph)
# Visualize graph cut results
plot_voronoi_boundary(geneID, locs, exp, newLabels, min(p))
# save the graph cut results to pdf
pdf_voronoi_boundary(geneID, locs, exp, newLabels, min(p),
fileName='./README_file/{}.pdf'.format(geneID),
point_size=0)
raw data dim: (262, 16218)
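The comments in the single-gene example note that a larger smooth_factor yields fewer segments. The toy sketch below illustrates that trade-off on a 1-D chain of spots by exactly minimizing a two-label energy (a per-spot data cost plus smooth_factor for every label change between neighbours) with dynamic programming. This is not scGCO's implementation: the chain graph, the squared-error data cost, the fixed label means, and the names segment_chain and count_segments are all assumptions made for illustration.

```python
import numpy as np

def segment_chain(values, label_means, smooth_factor):
    """Minimize data cost + smooth_factor * (number of label changes) on a chain."""
    values = np.asarray(values, dtype=float)
    means = np.asarray(label_means, dtype=float)
    # unary[i, q]: cost of assigning label q to spot i.
    unary = (values[:, None] - means[None, :]) ** 2
    n, k = unary.shape
    cost = unary[0].copy()
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        # transition[p, q]: best cost ending in label p at i-1, then label q at i.
        transition = cost[:, None] + smooth_factor * (1 - np.eye(k))
        back[i] = transition.argmin(axis=0)
        cost = transition.min(axis=0) + unary[i]
    # Trace the optimal labelling back from the last spot.
    labels = np.empty(n, dtype=int)
    labels[-1] = cost.argmin()
    for i in range(n - 1, 0, -1):
        labels[i - 1] = back[i, labels[i]]
    return labels

def count_segments(labels):
    """Number of runs of identical consecutive labels."""
    return 1 + int(np.count_nonzero(np.diff(labels)))

# Hypothetical expression along a line of ten spots, with one noisy spot.
expression = [0.1, 0.2, 0.1, 0.9, 0.2, 0.1, 0.8, 0.9, 1.0, 0.9]
low = segment_chain(expression, label_means=[0.0, 1.0], smooth_factor=0.0)
high = segment_chain(expression, label_means=[0.0, 1.0], smooth_factor=0.5)

print(low, count_segments(low))
print(high, count_segments(high))
```

With no smoothing, the noisy spot forms its own segment. With smoothing, it is absorbed into its neighbourhood and only the main boundary remains.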
/ [root]
├── Analysis
│   ├── FIG2_a_b_c_d.ipynb : this notebook will reproduce main figure2a_2b_2c_2d
│   ├── FIG2_e_f.ipynb : this notebook will reproduce main figure2e_2f
│   ├── MouseOB
│   │   ├── gen_Suppl_Fig1.ipynb : this notebook will reproduce Suppl Figure1
│   │   ├── gen_Suppl_Fig2.ipynb : this notebook will reproduce Suppl Figure2
│   │   ├── gen_Suppl_Fig6.ipynb : this notebook will reproduce Suppl Figure6
│   │   └── ...
│   ├── Breast_Cancer
│   │   ├── gen_Layer2_Fig : this notebook will reproduce Figure2_e_f and Suppl Figure11a
│   │   ├── gen_Suppl_Fig10 : this notebook will reproduce Suppl Figure10
│   │   └── ...
│   └── MERFISH
│       └── gen_Suppl_Fig14 : this notebook will reproduce Suppl Figure14
├── Simulation
│   ├── Fig2g_Compare_memory_simulation_data.ipynb # compares the memory used by the three methods, shown in main Fig. 2g
│   ├── Fig2h_Compare_time_simulation_data.ipynb # compares the running speed of the three methods, shown in main Fig. 2h
│   └── Simulate_script
│       ├── scGCO_simulate_script.ipynb # tests scGCO running speed and memory use with small simulated data
│       ├── scGCO_simulate_1M_large.ipynb # tests scGCO running speed and memory use with 1M simulated cells
│       ├── scGCO_simulate_500K_large.ipynb # tests scGCO running speed and memory use with 500K simulated cells
│       └── spatialDE_simulate_script.ipynb # tests spatialDE running speed and memory use with simulated data
├── README.md
├── Table_Of_Contents.ipynb # Start here in an interactive session. Includes hyperlinks to individual analysis notebooks
├── data
│   ├── MouseOB
│   │   └── [scGCO, spatialDE, trendsceek, and DESeq2 results]
│   ├── Breast_Cancer
│   │   └── [scGCO, spatialDE, and trendsceek results]
│   ├── MERFISH
│   │   └── [scGCO, spatialDE, and trendsceek results]
│   ├── HighVariableGenes
│   │   └── [Seurat results for all datasets]
│   ├── Raw_data
│   │   └── [count data for all datasets]
│   └── Simulation_data
│       └── [simulation data]
└── figures
    ├── [all manuscript figures]
    └── ...
Several Jupyter Notebooks are provided in the Analysis directory to reproduce the figures of the paper.
Several Jupyter Notebooks are provided in the Simulation directory to reproduce the running time simulation results reported in the main text.
- scGCO_simulate_script.ipynb: benchmarks the running time of scGCO using simulated data with up to 100K cells.
This script should take about 10 minutes to finish on a typical 8-core computer.
The following two scripts simulate larger numbers of cells and take substantially more time to finish. The 1M simulation takes about 3 hours on a typical 8-core computer.
- scGCO_simulate_500K_large.ipynb: benchmarks the running time of scGCO using simulated data with 500K cells.
- scGCO_simulate_1M_large.ipynb: benchmarks the running time of scGCO using simulated data with 1M cells.
This script takes 1-2 hours to finish on a typical 8-core computer.
- Compare_memory_simulation_data.ipynb: generates Fig. 2g using precomputed data.
- Compare_time_simulation_data.ipynb: generates Fig. 2h using precomputed data.
- spatialDE_simulate_script.ipynb: benchmarks the running time of spatialDE using simulated data with up to 15K cells.
This script takes about 20 hours to finish on a typical 8-core computer.
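The notebooks listed above report running time and memory use. As a rough sketch of how such measurements can be taken with the Python standard library alone, the example below times a call with time.perf_counter and records peak Python-level allocations with tracemalloc. The benchmark helper and the toy_workload stand-in are hypothetical and are not taken from the notebooks. tracemalloc tracks only memory allocated through Python, so it may undercount memory used by native extensions.

```python
import time
import tracemalloc

def benchmark(func, *args, **kwargs):
    """Return (result, elapsed_seconds, peak_bytes) for one call of func."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Always record measurements and stop tracing, even on error.
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak

def toy_workload(n_cells):
    # Hypothetical stand-in for an analysis step: builds a list of squares.
    return sum([i * i for i in range(n_cells)])

# Measure the stand-in workload at increasing input sizes.
for n_cells in (1_000, 10_000, 100_000):
    _, seconds, peak_bytes = benchmark(toy_workload, n_cells)
    print(f"{n_cells:>7} cells: {seconds:.4f} s, peak {peak_bytes / 1e6:.2f} MB")
```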