WangPeng-Lab/scGCO-1.0.0

Single-cell Graph Cuts Optimization (scGCO)

Overview

scGCO is a method for identifying genes with position-dependent differential expression patterns, also known as spatially variable genes, using the graph cuts algorithm. scGCO can analyze spatial transcriptomics data generated by diverse technologies, including but not limited to single-cell RNA-sequencing and in situ FISH-based methods. scGCO also scales easily to millions of cells.

Repo Contents

This repository contains the source code of scGCO and tutorials on running the program.

Installation Guide

The primary implementation is a Python 3 package, which can be installed from the command line with:

 pip install scgco

scGCO has been tested on Ubuntu Linux (18.04.1), Mac OS X (10.14.1), and Windows (Windows 7 Professional).

License

MIT License, see LICENSE file.

Authors

See AUTHORS file.

Contact

For bugs, feedback or help please contact Peng Wang [email protected].

Example usage of scGCO

The following code demonstrates the typical data analysis workflow of scGCO.

The tutorial is also provided as a Jupyter Notebook in the Tutorial directory (scGCO_tutorial.ipynb).

The entire process should only take 1-2 minutes on a typical desktop computer with 8 cores.

Input Format

scGCO expects the ST data format: a matrix of counts in which the row names are spot coordinates and the column names are gene names. By default the matrix is read from a tab-separated file (.tsv).
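
As a rough illustration of this layout, the sketch below builds a tiny tab-separated matrix in memory and loads it with pandas. The gene names, counts, and the "XxY" form of the row names are assumptions made for the example; scGCO's own read_spatial_expression may parse the file differently.

```python
import io
import pandas as pd

# Hypothetical three-spot, two-gene matrix in the tab-separated layout described above.
# The "XxY" row-name convention is an assumption about the ST format.
tsv_text = (
    "\tGeneA\tGeneB\n"
    "16.92x9.015\t3\t0\n"
    "16.945x11.075\t1\t5\n"
    "17.0x10.118\t0\t2\n"
)

# The first column holds the spot coordinates and becomes the row index.
counts = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)

# Split each "XxY" row name into numeric x and y coordinates.
coords = counts.index.to_series().str.split("x", expand=True).astype(float)
coords.columns = ["x", "y"]

print(counts.shape)
print(coords.loc["16.92x9.015"].tolist())
```

Keeping the coordinates in a separate table that shares the count matrix's index makes it easy to check that every spot has both a location and an expression profile.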

As an example, let’s analyze spatially variable gene expression in the mouse olfactory bulb using a data set published in Ståhl et al. (2016).

import warnings
warnings.filterwarnings('ignore')


from scGCO import *
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
# read and preprocess the spatial expression data

j=11
ff = './README_file/Rep11_MOB_count_matrix-1.tsv'
locs,data=read_spatial_expression(ff,sep='\t',num_exp_genes=0.01, num_exp_spots=0.01, min_expression=1)

# normalize expression and use 1000 genes to test the algorithm
data_norm = normalize_count_cellranger(data)

data_norm = data_norm.iloc[:,0:1000]
print('Rep11_processing: {}'.format(data_norm.shape))
raw data dim: (262, 16218)
Rep11_processing: (259, 1000)
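
The name normalize_count_cellranger suggests a library-size normalization in the style of Cell Ranger, but its exact behaviour is not documented here. The sketch below is therefore only an assumption about the general idea: each spot is divided by its total count and rescaled to the median total across spots. The helper name normalize_by_library_size is hypothetical and not part of scGCO.

```python
import pandas as pd

def normalize_by_library_size(counts: pd.DataFrame) -> pd.DataFrame:
    """Scale each spot (row) so its total equals the median total across spots."""
    totals = counts.sum(axis=1)
    # Divide every row by its own total, then rescale to the median library size.
    # Spots with a total of zero would yield NaN and should be filtered beforehand.
    return counts.div(totals, axis=0) * totals.median()

# Toy matrix: two spots (rows) by two genes (columns).
toy = pd.DataFrame({"GeneA": [1, 2], "GeneB": [1, 6]}, index=["spot1", "spot2"])
toy_norm = normalize_by_library_size(toy)
print(toy_norm)
```

After this step every spot has the same total, so differences between spots reflect relative expression rather than sequencing depth.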

Run the main scGCO function to identify genes with non-random spatial variability.

import time
# estimate smooth factor that minimizes the number of false positives
from sklearn.utils import shuffle
data_norm_rand = shuffle(data_norm)
start_ts = time.time()
unary_scale_factor=100
label_cost=10
algorithm='expansion'
smooth_factor = estimate_smooth_factor(locs, data_norm_rand,start_sf=20,fdr_cutoff=0.01)
print('Rep{}: '.format(j),smooth_factor)

# run the main algorithm to identify genes with non-random 
# spatial patterns
# this should take less than a minute 
result_df = identify_spatial_genes(locs, data_norm, smooth_factor )
end_ts = time.time()
print('seconds to run: ', end_ts-start_ts)
100%|██████████| 16/16 [00:03<00:00,  4.90it/s]

========iteration1
25 153



100%|██████████| 16/16 [00:03<00:00,  4.83it/s]

========iteration2
30 3
Rep11:  30



100%|██████████| 16/16 [00:03<00:00,  4.26it/s]
100%|██████████| 16/16 [00:03<00:00,  4.55it/s]

seconds to run:  14.518628597259521
write_result_to_csv(result_df,'./README_file//Rep{}_test_result_df.csv'.format(j))
result_df=read_result_to_dataframe('./README_file/Rep11_test_result_df.csv')
print(result_df.shape)
(993, 266)
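
The smooth factor above is estimated on shuffled data, which appears to act as a null reference: once expression values are detached from the spot locations, any gene still called significant is a false positive. The toy sketch below, which does not use scGCO, illustrates why shuffling removes spatial structure. The grid, the left-to-right gradient, and the neighbor_correlation helper are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 20 x 20 grid of spots whose expression increases from left to right.
side = 20
xs, ys = np.meshgrid(np.arange(side), np.arange(side))
expr = xs.astype(float) + rng.normal(scale=0.1, size=xs.shape)

def neighbor_correlation(grid):
    """Pearson correlation between each spot and its right-hand neighbor."""
    left = grid[:, :-1].ravel()
    right = grid[:, 1:].ravel()
    return float(np.corrcoef(left, right)[0, 1])

# Permuting the values across spots keeps the expression distribution
# but detaches it from the spot locations.
shuffled = rng.permutation(expr.ravel()).reshape(expr.shape)

corr_original = neighbor_correlation(expr)
corr_shuffled = neighbor_correlation(shuffled)
print(corr_original, corr_shuffled)
```

Neighboring spots are strongly correlated in the original grid and close to uncorrelated after shuffling, which is the property a shuffled control relies on.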

Select genes with significant non-random spatial patterns using a specific FDR cutoff.

fdr_cutoff=0.01
fdr_df=result_df.loc[result_df.fdr<fdr_cutoff,]
print(fdr_df.shape)
(181, 266)
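
How the fdr column is computed is not described here. A common way to obtain such values is the Benjamini-Hochberg adjustment, sketched below as an assumption rather than as scGCO's actual procedure. The function name and the toy p-values are illustrative only.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

toy_p = [0.01, 0.04, 0.03, 0.005]
toy_fdr = benjamini_hochberg(toy_p)
print(toy_fdr)

# Keep only the entries below a chosen cutoff, mirroring the filter above.
selected = toy_fdr < 0.03
```

With adjusted values in hand, applying a cutoff such as 0.01 bounds the expected fraction of false discoveries among the selected genes.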

Select genes with significantly conserved spatial patterns.

## select genes with significantly conserved spatial patterns

_,hamming_df=simulate_hamming(locs,data_norm,fdr_df,cutoff=0.0001)
hamming_cutoff=round(hamming_df,2)
print('Rep{} {}'.format(j,hamming_cutoff))
    
Rep11 0.76
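
The names simulate_hamming and hamming_cutoff suggest that segmentation patterns are compared with a Hamming-type measure. The exact definition used by scGCO is not given here, so the following is a minimal sketch under that assumption: the similarity of two binary segmentations is the fraction of spots on which they agree, allowing for swapped labels. The function and the toy label vectors are hypothetical.

```python
def hamming_similarity(labels_a, labels_b):
    """Fraction of spots on which two binary segmentations agree.

    Segment labels are arbitrary, so the comparison is also made with one
    labeling flipped and the better of the two scores is returned.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label vectors must be non-empty and of equal length")
    n = len(labels_a)
    agree = sum(a == b for a, b in zip(labels_a, labels_b))
    return max(agree, n - agree) / n

gene_1 = [0, 0, 1, 1]
gene_2 = [1, 1, 0, 0]   # same partition with swapped labels
gene_3 = [0, 1, 1, 1]

print(hamming_similarity(gene_1, gene_2))
print(hamming_similarity(gene_1, gene_3))
```

Under this reading, a cutoff such as 0.76 would keep pairs of genes whose segmentations agree on at least 76% of spots.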


To learn cell clusters automatically from the input data, set fixed_k=None. If the number of clusters (K) is known from prior knowledge, set fixed_k=K.

fdr_opt=fdr_df.sort_values(by='fdr')
hamming_cutoff=0.76
seg_max=data_norm.shape[0]*0.6 

pattern_df,tissue_mat,targe_df,hamming_df=identify_pattern_conserved_genes_iteration(
    locs, data_norm, fdr_opt,
    similarity_cutoff=hamming_cutoff, cluster_size=5, smooth_factor=10,
    perplexity=30, fixed_k=5, seg_min=8, seg_max=seg_max)

print('Rep{}_step2{}'.format(j,pattern_df.shape))
5
[ 84.  84.  54. 126. 133.  72.  48.  32. 113.  66.  72.  38.  20.  31.
  98.]
(97, 266)
================iteration1
k: 5
5
[ 78.  78. 123. 136.  72.  51.  66.  40. 105.  48.  36.  35.  31.  92.
  65.]
(86, 266)
================iteration2
k: 5
5
[ 77.  77.  72.  50. 122. 137.  66.  37. 108.  48.  40.  29.  31.  93.
  66.]
(86, 266)
Rep11_step2(86, 266)
## visualize the learned conserved spatial domains
print('Rep{}_tissue{}'.format(j,tissue_mat.shape))
image='./data/Raw_data//HE-MOB-breast-cancer/HE_Rep{}_MOB.jpg'.format(j)
colors=['green','blue']
title='Rep11_tissue_pattern'
plot_tissue_pattern(locs,data_norm,tissue_mat,image,colors,title,nrows=5,ncols=5)
plt.savefig('./README_file/Rep11_tissue_patterns.pdf')
plt.show()
Rep11_tissue(15, 259)

Identify genes with expression fold changes above threshold.

This is the final set of spatially variable genes identified by scGCO.

fdr_cutoff=0.01
import time
start_ts=time.time()
new_result_df,zero_exp_genes=recalc_exp_diff(data_norm,result_df,fdr_cutoff=fdr_cutoff,cluster_k=3)
end_ts=time.time()
ts=(end_ts-start_ts)/60
print('Rep{} ts: {}'.format(j,ts))

exp_cutoff=estimate_exp_diff_cutoff(new_result_df,cutoff=fdr_cutoff*2,q=0.95)
exp_cutoff=round(exp_cutoff,2)
print(exp_cutoff)
Rep11 ts: 0.0015866676966349283
0.77
exp_cutoff=0.77
final_df=pattern_df[abs(pattern_df.exp_diff)>exp_cutoff]
final_df.shape
(55, 266)

Visualize some identified genes.

# visualize top genes
visualize_spatial_genes(final_df.iloc[0:10,], locs, data_norm,point_size=0.2)

# save top genes to pdf
multipage_pdf_visualize_spatial_genes(final_df.iloc[0:10,], locs, data_norm,point_size=0, 
                                      fileName='./README_file//top10_genes.pdf')

Perform t-SNE and visualize the clustering of identified genes.

# Do PCA + t-SNE to visualize the clustering patterns of identified genes
# Though only 1000 genes are used, the pattern should resemble Fig. 2b in the manuscript

fig,ax=plt.subplots()
tsne_proj=spatial_pca_tsne(data_norm,final_df.index,perplexity = 20)

title='MOB Rep {}'.format(j)
zz=visualize_tsne_density(tsne_proj,title=title,bins=200,threshold=0.01,ax=ax,fig=fig,
                          fileName='./README_file/Rep11_tsne.pdf')

Perform graph cuts for a single gene.

# You can also analyze one gene of interest

geneID='Apod' # Let's use Apod as an example
unary_scale_factor = 100 # scale factor for unary energy, default value works well

# set smooth factor to 20;
# use a larger smooth_factor to get fewer segments
# use a smaller smooth_factor to get more segments
smooth_factor=20 

ff = './README_file//Rep11_MOB_count_matrix-1.tsv' 
# read in spatial gene expression data
locs, data = read_spatial_expression(ff,sep='\t')

# normalize gene expression
data_norm = normalize_count_cellranger(data)

# select Apod's expression
exp =  data_norm.loc[:,geneID]

# log transform
exp = np.log1p(exp).values

# create graph representation of spatial coordinates of cells
cellGraph = create_graph_with_weight(locs, exp)

# do graph cut
newLabels, gmm = cut_graph_general(cellGraph, exp, unary_scale_factor, smooth_factor)
# calculate p values
p, node, com = compute_p_CSR(locs,newLabels, gmm, exp, cellGraph)

# Visualize graph cut results
plot_voronoi_boundary(geneID, locs, exp,  newLabels, min(p)) 

# save the graph cut results to pdf
pdf_voronoi_boundary(geneID, locs, exp, newLabels, min(p),
                     fileName='./README_file//{}.pdf'.format(geneID),
                    point_size=0)
raw data dim: (262, 16218)
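
Graph cuts methods typically minimise an energy made of a per-cell (unary) term and a pairwise smoothness term. Assuming scGCO follows this standard form, which is not spelled out here, the toy sketch below shows why a larger smooth_factor yields fewer segments. The potts_energy function, the unary costs, and the four-cell graph are invented for illustration and are not part of the scGCO API.

```python
def potts_energy(unary_costs, labels, edges, unary_scale_factor, smooth_factor):
    """Energy of a labeling: scaled unary costs plus a penalty per cut edge."""
    unary = sum(unary_costs[node][label] for node, label in enumerate(labels))
    cut_edges = sum(labels[u] != labels[v] for u, v in edges)
    return unary_scale_factor * unary + smooth_factor * cut_edges

# Four cells on a line; cell 2 weakly prefers label 1, the others prefer label 0.
# unary_costs[node][label] is the cost of giving `label` to `node`.
unary_costs = [(0.0, 1.0), (0.0, 1.0), (0.6, 0.4), (0.0, 1.0)]
edges = [(0, 1), (1, 2), (2, 3)]

one_segment = [0, 0, 0, 0]
split = [0, 0, 1, 0]

for smooth_factor in (0.05, 20):
    e_one = potts_energy(unary_costs, one_segment, edges, 1, smooth_factor)
    e_split = potts_energy(unary_costs, split, edges, 1, smooth_factor)
    print(smooth_factor, "prefers", "split" if e_split < e_one else "one segment")
```

With a small smoothness penalty the labeling that isolates cell 2 has the lower energy, while a large penalty makes the two cut edges too expensive and the single-segment labeling wins.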

Reproducibility

The container file system

/ [root]
│   ├── Analysis
│   │   ├── FIG2_a_b_c_d.ipynb : this notebook will reproduce main Figures 2a, 2b, 2c, and 2d
│   │   ├── FIG2_e_f.ipynb : this notebook will reproduce main Figures 2e and 2f
│   │   ├── MouseOB
│   │   │   ├── gen_Suppl_Fig1.ipynb : this notebook will reproduce Suppl Figure 1
│   │   │   ├── gen_Suppl_Fig2.ipynb : this notebook will reproduce Suppl Figure 2
│   │   │   ├── gen_Suppl_Fig6.ipynb : this notebook will reproduce Suppl Figure 6
│   │   │   └── ...
│   │   ├── Breast_Cancer
│   │   │   ├── gen_Layer2_Fig : this notebook will reproduce Figures 2e and 2f and Suppl Figure 11a
│   │   │   ├── gen_Suppl_Fig10 : this notebook will reproduce Suppl Figure 10
│   │   │   └── ...
│   │   └── MERFISH
│   │       └── gen_Suppl_Fig14 : this notebook will reproduce Suppl Figure 14
│   ├── Simulation
│   │   ├── Fig2g_Compare_memory_simulation_data.ipynb # compares the memory usage of the three methods, shown in main Fig. 2g
│   │   ├── Fig2h_Compare_time_simulation_data.ipynb # compares the running speed of the three methods, shown in main Fig. 2h
│   │   └── Simulate_script
│   │       ├── scGCO_simulate_script.ipynb # tests scGCO running speed and memory usage on small simulated data
│   │       ├── scGCO_simulate_1M_large.ipynb # tests scGCO running speed and memory usage on simulated data with millions of cells
│   │       ├── scGCO_simulate_500K_large.ipynb # tests scGCO running speed and memory usage on 500K simulated data
│   │       └── spatialDE_simulate_script.ipynb # tests spatialDE running speed and memory usage on simulated data
│   ├── README.md
│   └── Table_Of_Contents.ipynb # Start here in an interactive session. Includes hyperlinks to individual analysis notebooks
│
├── data
│   ├── MouseOB
│   │   └── [scGCO, spatialDE, trendsceek, and DESeq2 results]
│   ├── Breast_Cancer
│   │   └── [scGCO, spatialDE, and trendsceek results]
│   ├── MERFISH
│   │   └── [scGCO, spatialDE, and trendsceek results]
│   ├── HighVariableGenes
│   │   └── [Seurat results for all datasets]
│   ├── Raw_data
│   │   └── [count data for all datasets]
│   └── Simulation_data
│       └── [simulation data]
└── figures
    ├── [all manuscript figures]
    └── ...

Several Jupyter Notebooks are provided in the Analysis directory to reproduce figures of the paper.

Simulating small data sets

Several Jupyter Notebooks are provided in the Simulation directory to reproduce the running time simulation results reported in the main text.

This script should take about 10 minutes to finish on a typical 8-core computer.

Simulating large data sets

The following two scripts simulate larger numbers of cells and take substantially more time to finish. The 1M simulation takes about 3 hours on a typical 8-core computer.

This script takes 1-2 hours to finish on a typical 8-core computer.

Generate memory profiling plot (Fig. 2g)

Generate running time profiling plot (Fig. 2h)
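
For readers who want to collect similar measurements on their own data, the sketch below times a function call and records its peak traced memory using only the Python standard library. It is not the profiling code used in the notebooks. The profile_call helper and toy_workload are hypothetical, and tracemalloc only sees allocations made through Python's allocator, so memory used inside native libraries may be undercounted.

```python
import time
import tracemalloc

def profile_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Record the measurements and stop tracing even if func raises.
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak

def toy_workload(n):
    # Stand-in for an analysis step; allocates a list of n integers.
    return sum(list(range(n)))

result, elapsed, peak = profile_call(toy_workload, 100_000)
print(result, elapsed, peak)
```

Repeating such a call over increasing numbers of simulated cells gives the (size, time) and (size, memory) pairs needed for scaling plots.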

Simulating spatialDE

This script takes about 20 hours to finish on a typical 8-core computer.
