GitHub - WangPeng-Lab/scGCO-1.0.0: Single-cell Graph Cuts Optimization

Single-cell Graph Cuts Optimization

(scGCO)

Overview

scGCO is a method to identify genes demonstrating position-dependent differential expression patterns, also known as spatially variable genes, using the graph cuts algorithm. scGCO can analyze spatial transcriptomics data generated by diverse technologies, including but not limited to single-cell RNA-sequencing and in situ FISH-based methods. scGCO also scales easily to millions of cells.

Repo Contents

This repository contains the source code of scGCO and tutorials on running the program.

Installation Guide

The primary implementation is a Python 3 package, which can be installed from the command line with

 pip install scgco

scGCO has been tested on Ubuntu Linux (18.04.1), Mac OS X (10.14.1) and Windows (Windows 7 Professional).

License

MIT License, see LICENSE file.

Authors

See AUTHORS file.

Contact

For bugs, feedback or help please contact Peng Wang [email protected].

Example usage of scGCO

The following code demonstrates the typical data analysis flow of scGCO.

The tutorial is also provided as a Jupyter Notebook in the Tutorial directory (scGCO_tutorial.ipynb).

The entire process should only take 1-2 minutes on a typical desktop computer with 8 cores.

Input Format

The required input is the ST data format: a matrix of counts in which spot coordinates are the row names and gene names are the column names. The default file format is tab-separated (.tsv).
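As a rough illustration of this layout, the sketch below parses a tiny tab-separated matrix using only the Python standard library. The toy values, the gene names, the helper `parse_st_matrix`, and the `XxY` row-name convention are all assumptions made for the example. For real data, use scGCO's own `read_spatial_expression`.

```python
import csv
import io

# Toy ST-style matrix: rows are spots (named "XxY"), columns are genes.
# The values and gene names are made up for illustration.
toy_tsv = (
    "\tGeneA\tGeneB\tGeneC\n"
    "16.92x9.015\t3\t0\t1\n"
    "16.945x11.075\t0\t2\t5\n"
    "17.0x10.118\t1\t1\t0\n"
)

def parse_st_matrix(text):
    """Return (locs, genes, counts) from a tab-separated ST count matrix."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    genes = header[1:]  # the first header cell is the empty corner cell
    locs, counts = [], []
    for row in reader:
        x, y = row[0].split("x")  # assumed "XxY" row-name convention
        locs.append((float(x), float(y)))
        counts.append([int(v) for v in row[1:]])
    return locs, genes, counts

locs, genes, counts = parse_st_matrix(toy_tsv)
print(genes)
print(locs[0], counts[0])
```

Each entry of `locs` is one spot position and each row of `counts` holds that spot's counts in the order given by `genes`.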

As an example, let’s analyze spatially variable gene expression in Mouse Olfactory Bulb using a data set published in Ståhl et al 2016.

import warnings
warnings.filterwarnings('ignore')


from scGCO import *
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# read spatial expression data and processing 

j=11
ff = './README_file/Rep11_MOB_count_matrix-1.tsv'
locs,data=read_spatial_expression(ff,sep='\t',num_exp_genes=0.01, num_exp_spots=0.01, min_expression=1)

# normalize expression and use 1000 genes to test the algorithm
data_norm = normalize_count_cellranger(data)

data_norm = data_norm.iloc[:,0:1000]
print('Rep11_processing: {}'.format(data_norm.shape))

raw data dim: (262, 16218)
Rep11_processing: (259, 1000)
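For readers unfamiliar with depth normalization, here is a minimal sketch of one common scheme: scaling every spot so that its total count equals the median total count across spots. This is an assumption about the general idea only. The function `normalize_to_median_depth` is hypothetical and is not taken from the scGCO source, so `normalize_count_cellranger` may differ in its details.

```python
from statistics import median

def normalize_to_median_depth(counts):
    """Scale each spot (row) so its total equals the median row total."""
    totals = [sum(row) for row in counts]
    target = median(totals)
    # Rows with zero total are returned unchanged to avoid division by zero.
    return [
        [v * target / t for v in row] if t > 0 else list(row)
        for row, t in zip(counts, totals)
    ]

# Three toy spots with different sequencing depths (totals 4, 20 and 10).
toy_counts = [[2, 2, 0], [10, 5, 5], [1, 0, 9]]
toy_norm = normalize_to_median_depth(toy_counts)
print(toy_norm)
```

After scaling, every spot has the same total, so differences between spots reflect relative expression rather than depth.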

Run the main scGCO function to identify genes with non-random spatial variability.

import time
# estimate smooth factor that minimizes the number of false positives
from sklearn.utils import shuffle
data_norm_rand = shuffle(data_norm)
start_ts = time.time()
unary_scale_factor=100
label_cost=10
algorithm='expansion'
smooth_factor = estimate_smooth_factor(locs, data_norm_rand,start_sf=20,fdr_cutoff=0.01)
print('Rep{}: '.format(j),smooth_factor)

# run the main algorithm to identify genes with non-random 
# spatial patterns
# this should take less than a minute 
result_df = identify_spatial_genes(locs, data_norm, smooth_factor )
end_ts = time.time()
print('seconds to run: ', end_ts-start_ts)

100%|██████████| 16/16 [00:03<00:00,  4.90it/s]

========iteration1
25 153



100%|██████████| 16/16 [00:03<00:00,  4.83it/s]

========iteration2
30 3
Rep11:  30



100%|██████████| 16/16 [00:03<00:00,  4.26it/s]
100%|██████████| 16/16 [00:03<00:00,  4.55it/s]

seconds to run:  14.518628597259521

write_result_to_csv(result_df,'./README_file/Rep{}_test_result_df.csv'.format(j))

result_df=read_result_to_dataframe('./README_file/Rep11_test_result_df.csv')
print(result_df.shape)

(993, 266)

Select genes with significant non-random spatial patterns using a specific FDR cutoff.

fdr_cutoff=0.01
fdr_df=result_df.loc[result_df.fdr<fdr_cutoff,]
print(fdr_df.shape)

(181, 266)
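The `fdr` column is assumed here to hold p-values adjusted for multiple testing. As background, the sketch below implements the Benjamini-Hochberg procedure, one standard way to obtain such values, in plain Python. It is not taken from the scGCO source, and the p-values are invented for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

toy_p = [0.001, 0.008, 0.039, 0.041, 0.6]  # made-up p-values
toy_fdr = benjamini_hochberg(toy_p)
selected = [i for i, q in enumerate(toy_fdr) if q < 0.05]
print(toy_fdr)
print(selected)
```

Filtering on the adjusted values, as `result_df.fdr < fdr_cutoff` does above, bounds the expected fraction of false positives among the selected genes.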

Select genes with significantly conserved spatial patterns.

## select genes with significantly conserved spatial patterns

_,hamming_df=simulate_hamming(locs,data_norm,fdr_df,cutoff=0.0001)
hamming_cutoff=round(hamming_df,2)
print('Rep{} {}'.format(j,hamming_cutoff))

Rep11 0.76
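To give an intuition for what a similarity cutoff such as 0.76 means, the sketch below scores two binary segmentations by the fraction of spots on which they agree. The function `hamming_similarity` and its label-flip handling are assumptions for illustration. The measure computed inside `simulate_hamming` may be defined differently.

```python
def hamming_similarity(labels_a, labels_b):
    """Fraction of spots assigned the same label in two binary segmentations.

    Binary cuts have no canonical label order, so the score is the better of
    the direct comparison and the comparison with one side flipped.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("segmentations must cover the same spots")
    n = len(labels_a)
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return max(same, n - same) / n

seg_a = [0, 0, 1, 1, 1, 0, 0, 1]
seg_b = [0, 0, 1, 1, 0, 0, 0, 1]   # differs from seg_a at one spot
seg_c = [1, 1, 0, 0, 0, 1, 1, 0]   # seg_a with the two labels swapped

print(hamming_similarity(seg_a, seg_b))
print(hamming_similarity(seg_a, seg_c))
```

Under this definition, two patterns would count as conserved when their score is at or above the chosen cutoff.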

To automatically learn cell clusters from the input data, set fixed_k=None. If the number of clusters (K) is known from prior knowledge, set fixed_k=K.

fdr_opt=fdr_df.sort_values(by='fdr')
hamming_cutoff=0.76
seg_max=data_norm.shape[0]*0.6 

pattern_df,tissue_mat,targe_df,hamming_df=identify_pattern_conserved_genes_iteration(
    locs, data_norm, fdr_opt,
                            similarity_cutoff=hamming_cutoff,cluster_size=5,smooth_factor=10, 
                            perplexity=30,fixed_k=5,seg_min=8,seg_max=seg_max)

print('Rep{}_step2{}'.format(j,pattern_df.shape))

5
[ 84.  84.  54. 126. 133.  72.  48.  32. 113.  66.  72.  38.  20.  31.
  98.]
(97, 266)
================iteration1
k: 5
5
[ 78.  78. 123. 136.  72.  51.  66.  40. 105.  48.  36.  35.  31.  92.
  65.]
(86, 266)
================iteration2
k: 5
5
[ 77.  77.  72.  50. 122. 137.  66.  37. 108.  48.  40.  29.  31.  93.
  66.]
(86, 266)
Rep11_step2(86, 266)

## visualize the learned conserved spatial domains
print('Rep{}_tissue{}'.format(j,tissue_mat.shape))
image='./data/Raw_data//HE-MOB-breast-cancer/HE_Rep{}_MOB.jpg'.format(j)
colors=['green','blue']
title='Rep11_tissue_pattern'
plot_tissue_pattern(locs,data_norm,tissue_mat,image,colors,title,nrows=5,ncols=5)
plt.savefig('./README_file/Rep11_tissue_patterns.pdf')
plt.show()

Rep11_tissue(15, 259)

Identify genes with expression fold changes above threshold.

This is the final set of spatially variable genes identified by scGCO.

fdr_cutoff=0.01
import time
start_ts=time.time()
new_result_df,zero_exp_genes=recalc_exp_diff(data_norm,result_df,fdr_cutoff=fdr_cutoff,cluster_k=3)
end_ts=time.time()
ts=(end_ts-start_ts)/60
print('Rep{} ts: {}'.format(j,ts))

exp_cutoff=estimate_exp_diff_cutoff(new_result_df,cutoff=fdr_cutoff*2,q=0.95)
exp_cutoff=round(exp_cutoff,2)
print(exp_cutoff)

Rep11 ts: 0.0015866676966349283
0.77

exp_cutoff=0.77
final_df=pattern_df[abs(pattern_df.exp_diff)>exp_cutoff]
final_df.shape

(55, 266)

Visualize some identified genes.

# visualize top genes
visualize_spatial_genes(final_df.iloc[0:10,], locs, data_norm,point_size=0.2)

# save top genes to pdf
multipage_pdf_visualize_spatial_genes(final_df.iloc[0:10,], locs, data_norm,point_size=0, 
                                      fileName='./README_file/top10_genes.pdf')

Perform t-SNE and visualize the clustering of identified genes.

# Do PCA + t-SNE to visualize the clustering patterns of identified genes
# Though only 1000 genes are used, the pattern should resemble Fig. 2b in the manuscript

fig,ax=plt.subplots()
tsne_proj=spatial_pca_tsne(data_norm,final_df.index,perplexity = 20)

title='MOB Rep {}'.format(j)
zz=visualize_tsne_density(tsne_proj,title=title,bins=200,threshold=0.01,ax=ax,fig=fig,
                          fileName='./README_file/Rep11_tsne.pdf')

Perform graph cuts for a single gene.

# You can also analyze one gene of interest

geneID='Apod' # Let's use Apod as an example
unary_scale_factor = 100 # scale factor for unary energy, default value works well

# set smooth factor to 20; 
# use a bigger smooth_factor to get fewer segments
# use a smaller smooth_factor to get more segments
smooth_factor=20 

ff = './README_file/Rep11_MOB_count_matrix-1.tsv' 
# read in spatial gene expression data
locs, data = read_spatial_expression(ff,sep='\t')

# normalize gene expression
data_norm = normalize_count_cellranger(data)

# select Apod's expression
exp =  data_norm.loc[:,geneID]

# log transform (np is numpy, imported in the first code block)
exp=np.log1p(exp).values

# create graph representation of spatial coordinates of cells
cellGraph = create_graph_with_weight(locs, exp)

# do graph cut
newLabels, gmm = cut_graph_general(cellGraph, exp, unary_scale_factor, smooth_factor)
# calculate p values
p, node, com = compute_p_CSR(locs,newLabels, gmm, exp, cellGraph)

# Visualize graph cut results
plot_voronoi_boundary(geneID, locs, exp,  newLabels, min(p)) 

# save the graph cut results to pdf
pdf_voronoi_boundary(geneID, locs, exp, newLabels, min(p),
                     fileName='./README_file/{}.pdf'.format(geneID),
                     point_size=0)

raw data dim: (262, 16218)
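The effect of `smooth_factor` can be illustrated with a generic Potts-style energy, in which a labeling pays a data cost per spot plus a fixed penalty for every graph edge whose endpoints receive different labels. The sketch below is a toy under that assumption. The function `potts_energy`, the costs, and the five-spot graph are hypothetical, and the exact energy minimized by `cut_graph_general` is not reproduced here.

```python
def potts_energy(labels, unary, edges, smooth_factor):
    """Energy of a labeling: data term plus smooth_factor per cut edge."""
    data_term = sum(unary[i][lab] for i, lab in enumerate(labels))
    cut_edges = sum(1 for u, v in edges if labels[u] != labels[v])
    return data_term + smooth_factor * cut_edges

# Five spots on a line; unary[i][k] is the cost of giving spot i label k.
unary = [[0, 4], [0, 4], [3, 1], [0, 4], [0, 4]]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]

noisy = [0, 0, 1, 0, 0]    # follows the data exactly, cuts two edges
smooth = [0, 0, 0, 0, 0]   # a single segment, overrides the middle spot

for sf in (0.5, 5):
    print(sf,
          potts_energy(noisy, unary, edges, sf),
          potts_energy(smooth, unary, edges, sf))
```

With the small penalty the three-segment labeling has the lower energy, while with the large penalty the single-segment labeling wins, which matches the guidance that a bigger smooth factor yields fewer segments.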

Reproducibility

The container file system

/ [root]
├── Analysis
│   ├── FIG2_a_b_c_d.ipynb : this notebook will reproduce main Figures 2a-2d
│   ├── FIG2_e_f.ipynb : this notebook will reproduce main Figures 2e-2f
│   ├── MouseOB
│   │   ├── gen_Suppl_Fig1.ipynb : this notebook will reproduce Suppl. Figure 1
│   │   ├── gen_Suppl_Fig2.ipynb : this notebook will reproduce Suppl. Figure 2
│   │   ├── gen_Suppl_Fig6.ipynb : this notebook will reproduce Suppl. Figure 6
│   │   └── ...
│   ├── Breast_Cancer
│   │   ├── gen_Layer2_Fig : this notebook will reproduce Figures 2e-2f and Suppl. Figure 11a
│   │   ├── gen_Suppl_Fig10 : this notebook will reproduce Suppl. Figure 10
│   │   └── ...
│   └── MERFISH
│       └── gen_Suppl_Fig14 : this notebook will reproduce Suppl. Figure 14
├── Simulation
│   ├── Fig2g_Compare_memory_simulation_data.ipynb # compares the memory usage of the three methods, shown in main Fig. 2g
│   ├── Fig2h_Compare_time_simulation_data.ipynb # compares the running speed of the three methods, shown in main Fig. 2h
│   └── Simulate_script
│       ├── scGCO_simulate_script.ipynb # tests scGCO running speed and memory usage with small simulated data
│       ├── scGCO_simulate_1M_large.ipynb # tests scGCO running speed and memory usage with 1M-cell simulated data
│       ├── scGCO_simulate_500K_large.ipynb # tests scGCO running speed and memory usage with 500K-cell simulated data
│       └── spatialDE_simulate_script.ipynb # tests spatialDE running speed and memory usage with simulated data
├── README.md
├── Table_Of_Contents.ipynb # Start here in an interactive session. Includes hyperlinks to individual analysis notebooks
├── data
│   ├── MouseOB
│   │   └── [scGCO, spatialDE, trendsceek, DESeq2 results]
│   ├── Breast_Cancer
│   │   └── [scGCO, spatialDE, trendsceek results]
│   ├── MERFISH
│   │   └── [scGCO, spatialDE, trendsceek results]
│   ├── HighVariableGenes
│   │   └── [Seurat results for all datasets]
│   ├── Raw_data
│   │   └── [count data for all datasets]
│   └── Simulation_data
│       └── [simulation data]
└── figures
    ├── [all manuscript figures]
    └── ...

Several Jupyter Notebooks are provided in the Analysis directory to reproduce figures of the paper.

Simulating small data sets

Several Jupyter Notebooks are provided in the Simulation directory to reproduce the running time simulation results reported in the main text.

scGCO_simulate_script.ipynb: benchmarks the running time of scGCO using simulated data with cell numbers up to 100K.

This script should take about 10 minutes to finish on a typical 8-core computer.
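For orientation, a benchmark of this kind can be sketched with the standard library alone: generate synthetic spots, then record wall-clock time and peak memory around the call being measured. Everything in the block below is hypothetical, including `simulate_spots`, `placeholder_analysis` (a stand-in for the method under test), and the chosen sizes. The notebooks in the Simulation directory remain the reference for the reported numbers.

```python
import random
import time
import tracemalloc

def simulate_spots(n_cells, seed=0):
    """Return n_cells random (x, y, expression) tuples; purely synthetic."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random(), rng.expovariate(1.0))
            for _ in range(n_cells)]

def placeholder_analysis(spots):
    """Hypothetical stand-in for the method being benchmarked."""
    return sum(expr for _, _, expr in spots) / len(spots)

def benchmark(func, n_cells):
    """Measure wall-clock seconds and peak traced memory (bytes) for func."""
    spots = simulate_spots(n_cells)
    tracemalloc.start()
    start = time.perf_counter()
    func(spots)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"n_cells": n_cells, "seconds": elapsed, "peak_bytes": peak}

for n in (1_000, 10_000):
    print(benchmark(placeholder_analysis, n))
```

Note that `tracemalloc` only tracks memory allocated through Python's allocator, so memory used inside compiled extensions may need a process-level measurement instead.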

Simulating large data sets

The following two scripts simulate larger numbers of cells and take substantially more time to finish. The 1M simulation takes about 3 hours on a typical 8-core computer.

scGCO_simulate_500K_large.ipynb: benchmarks the running time of scGCO using simulated data with 500K cells.
scGCO_simulate_1M_large.ipynb: benchmarks the running time of scGCO using simulated data with 1M cells.

This script takes 1-2 hours to finish on a typical 8-core computer.

Generate memory profiling plot (Fig. 2g)

Compare_memory_simulation_data.ipynb: This notebook generates Fig. 2g using precomputed data.

Generate running time profiling plot (Fig. 2h)

Compare_time_simulation_data.ipynb: This notebook generates Fig. 2h using precomputed data.

Simulating spatialDE

spatialDE_simulate_script.ipynb: benchmarks the running time of spatialDE using simulated data with cell numbers up to 15K.

This script takes about 20 hours to finish on a typical 8-core computer.
