This repository contains the code to analyze the single cell multiome data of human lung as part of the manuscript "Context-aware single-cell multiome approach identified cell-type specific lung cancer susceptibility genes".
The data are deposited in the Gene Expression Omnibus (GEO) database under the accession GSE241468.
We used DropletQC (v.0.9) to remove “empty” droplets containing ambient RNA from the gene expression matrices. The resulting expression matrices were processed individually in R (v.4.1.3) using Seurat (v.4.0.6), with Gencode v.27 for gene identification. Filtered gene–barcode matrices were normalized with the ‘SCTransform’ function of Seurat, and the top 2,000 variable genes were identified. Gene expression matrices were scaled and centered using the ‘ScaleData’ function. See details in 01.QC.
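As a rough illustration of what "scaled and centered" means, the sketch below z-scores each gene across cells in plain Python. It is not the Seurat implementation: the function name `scale_and_center`, the upper clipping value, and the toy matrix are assumptions made for this example only.

```python
from statistics import mean, stdev


def scale_and_center(matrix, clip=10.0):
    """Z-score each gene (row) across cells (columns).

    `clip` caps large scaled values; it is an assumption of this sketch.
    """
    scaled = []
    for row in matrix:
        mu = mean(row)
        # Sample standard deviation; undefined for a single cell.
        sd = stdev(row) if len(row) > 1 else 0.0
        if sd == 0:
            # A gene with no variance carries no information after centering.
            scaled.append([0.0] * len(row))
            continue
        scaled.append([min((x - mu) / sd, clip) for x in row])
    return scaled


# Toy matrix: two genes measured in three cells (hypothetical values).
toy = [[1.0, 2.0, 3.0], [5.0, 5.0, 5.0]]
result = scale_and_center(toy)
```

After scaling, every gene with non-zero variance has mean 0 and unit standard deviation, so highly expressed genes do not dominate downstream PCA.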
The snATAC peak calling and annotation were performed following the Signac pipeline. Specifically, peaks were called using MACS2 with default parameters after combining the reads of all the cells in each cell type to determine the genomic regions enriched for Tn5 accessibility from snATAC fragments, resulting in 330,453 peaks in total. Peaks were then annotated according to distance to protein-coding genes using ChIPseeker. The gene activity score was calculated via the "GeneActivity" function of Signac. See details in 02.peak_calling.
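The idea of annotating peaks by distance to genes can be sketched as follows. This is a simplified stand-in, not ChIPseeker: it assigns each peak to the nearest transcription start site on the same chromosome, measured from the peak midpoint. The function name `nearest_gene`, the tuple layouts, and the coordinates are hypothetical.

```python
def nearest_gene(peak, genes):
    """Return (gene_name, distance) for the TSS closest to the peak midpoint.

    peak:  (chrom, start, end)
    genes: iterable of (chrom, tss, name)
    Returns None when no gene lies on the peak's chromosome.
    """
    chrom, start, end = peak
    midpoint = (start + end) // 2
    best = None
    for gene_chrom, tss, name in genes:
        if gene_chrom != chrom:
            continue
        distance = abs(tss - midpoint)
        if best is None or distance < best[1]:
            best = (name, distance)
    return best


# Hypothetical coordinates, for illustration only.
genes = [("chr1", 160, "GENE_A"), ("chr1", 1000, "GENE_B"), ("chr2", 150, "GENE_C")]
hit = nearest_gene(("chr1", 100, 200), genes)
```

A real annotation additionally distinguishes promoter, exonic, intronic, and intergenic peaks; the nearest-TSS distance is only the simplest part of that.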
Using the normalized gene expression data, we performed principal component analysis (PCA), computing and storing 50 PCs. A uniform manifold approximation and projection (UMAP)-based approach was applied to the expression matrices using the first 50 PCs and to the chromatin accessibility matrices using the 2nd through 50th PCs (the first PC was excluded because it is typically correlated with sequencing depth). Both expression and chromatin accessibility matrices were corrected for batch effects using Harmony. A Weighted Nearest Neighbor (WNN) method was applied to integrate the weighted combination of the RNA and ATAC-seq modalities. The ‘FindClusters’ function was applied for clustering, using the smart local moving (SLM) algorithm for modularity optimization at a resolution of 0.5. See details in 03.clustering.
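The component ranges described above (1 to 50 for expression, 2 to 50 for chromatin accessibility) can be illustrated with a small helper that takes 1-based, inclusive bounds. This is only a sketch of the selection step: the function name `select_components` and the toy embedding are assumptions, and the actual analysis is done in R.

```python
def select_components(embedding, first, last):
    """Keep components first..last (1-based, inclusive) for every cell.

    embedding: list of per-cell lists, one value per component.
    """
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    # Convert the 1-based inclusive range to a 0-based Python slice.
    return [cell[first - 1:last] for cell in embedding]


# Toy embedding: two cells, three components (hypothetical values).
toy_embedding = [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0]]
atac_like = select_components(toy_embedding, 2, 3)  # drop the first component
rna_like = select_components(toy_embedding, 1, 3)   # keep all components
```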
We performed co-embedding analysis between snRNA and snATAC modalities following the Seurat pipeline. See details in 04.co-embedding.
Smoking-responsive genes for each cell type were inferred by pseudobulk differential gene expression analysis using DESeq2 (v1.41.1). Gender was incorporated as a covariate in the model. See details in 05.DEG.
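The pseudobulk step can be sketched as summing raw counts over all cells that share a cell type and a sample, which gives one count profile per (cell type, sample) pair for the differential expression test. The sketch below covers only this aggregation, not the DESeq2 model. The function name `pseudobulk`, the dictionary layout, and all barcodes, samples, and counts are hypothetical.

```python
from collections import defaultdict


def pseudobulk(counts, cell_sample, cell_type):
    """Sum raw counts over cells sharing the same (cell type, sample).

    counts:      {cell_barcode: {gene: raw_count}}
    cell_sample: {cell_barcode: sample_id}
    cell_type:   {cell_barcode: cell_type_label}
    """
    aggregated = defaultdict(lambda: defaultdict(int))
    for cell, gene_counts in counts.items():
        key = (cell_type[cell], cell_sample[cell])
        for gene, n in gene_counts.items():
            aggregated[key][gene] += n
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(genes) for key, genes in aggregated.items()}


# Hypothetical toy data: three cells from two samples.
counts = {
    "cell1": {"GENE_A": 2, "GENE_B": 1},
    "cell2": {"GENE_A": 3},
    "cell3": {"GENE_A": 4, "GENE_B": 5},
}
cell_sample = {"cell1": "S1", "cell2": "S1", "cell3": "S2"}
cell_type = {"cell1": "AT2", "cell2": "AT2", "cell3": "AT2"}
bulk = pseudobulk(counts, cell_sample, cell_type)
```

Aggregating to the sample level makes the sample, rather than the individual cell, the unit of replication.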
We employed the "FindMarkers" function in Seurat to identify peaks of differential accessibility between the smoking and non-smoking states for each cell type. See details in 06.DAR.
CellChat (version 1.6.0) was used to infer ligand–receptor interactions based on scRNA-seq data. The intercellular communication analyses were performed separately using cells from ever-smokers and never-smokers for comparison at different levels. Then the integrated Human Lung Cell Atlas (HLCA) core dataset was used to validate the trend of MHC-I and MHC-II communication. See details in 07.cellchat.
We inferred the lung-cancer-associated score for each cell using the SCAVENGE pipeline, based on snATAC-seq and lung cancer GWAS data. See details in 08.SCAVENGE.
Allelic effects of predicted TF binding, cell-type-specific TF abundance assessment and TF footprinting
Variant effects on TF binding sites were predicted with the motifbreakR package. "Abundant TFs" were identified by the expression level of each TF and the percentage of cells expressing it. TF footprint analysis was performed for each allelic-binding TF using the ‘Footprint’ function in Signac, restricted to the peak regions. See details in 09.TF_analysis.
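One way to combine expression level and expression percentage into an "abundant TF" call is sketched below. The source does not state the cutoffs, so `min_mean` and `min_pct` are placeholder assumptions, as are the function name `abundant_tfs` and the toy values.

```python
def abundant_tfs(expr, min_mean=0.5, min_pct=0.1):
    """Return TFs passing both an expression-level and a percent-expressed cutoff.

    expr: {tf_name: [expression value per cell within one cell type]}
    min_mean, min_pct: placeholder thresholds (assumptions, not from the source).
    """
    selected = []
    for tf, values in expr.items():
        if not values:
            continue
        mean_expr = sum(values) / len(values)
        pct_expressing = sum(1 for v in values if v > 0) / len(values)
        if mean_expr >= min_mean and pct_expressing >= min_pct:
            selected.append(tf)
    return sorted(selected)


# Hypothetical expression values for three TFs in four cells.
expr = {
    "TF_HIGH": [1.0, 2.0, 0.0, 1.0],    # mean 1.0, expressed in 75% of cells
    "TF_RARE": [0.0, 0.0, 0.0, 4.0],    # mean 1.0, expressed in 25% of cells
    "TF_LOW": [0.1, 0.1, 0.1, 0.1],     # mean 0.1, expressed in 100% of cells
}
abundant = abundant_tfs(expr, min_mean=0.5, min_pct=0.5)
```

Requiring both criteria excludes TFs that are highly expressed in only a few cells as well as TFs that are broadly but weakly expressed.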
Allelic transcriptional activity of CCVs from the 15_5p15.33 locus was assessed as part of MPRA. See details in 10.MPRA.
Co-accessible modules of two or more cCREs were identified by Cicero with the Louvain community detection algorithm. For cCRE-gene correlation, we identified cCREs that may regulate a given gene by computing the correlation between gene expression and accessibility at nearby cCREs. See details in 11.cCRE_analysis.
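The cCRE-gene correlation step can be sketched as a correlation between one gene's expression vector and the accessibility vector of each nearby cCRE, measured over the same set of observations. The sketch assumes Pearson correlation and a hypothetical cutoff `min_r`; the source specifies neither the correlation type nor a threshold. The names `pearson` and `link_ccres` and the toy values are also illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def link_ccres(gene_expr, ccre_access, min_r=0.3):
    """Return {cCRE: r} for nearby cCREs whose accessibility tracks expression.

    min_r is a placeholder cutoff (an assumption of this sketch).
    """
    links = {}
    for ccre, access in ccre_access.items():
        r = pearson(gene_expr, access)
        if r >= min_r:
            links[ccre] = r
    return links


# Hypothetical values across three observations (e.g. cell groups).
gene_expr = [1.0, 2.0, 3.0]
ccre_access = {"cCRE_1": [2.0, 4.0, 6.0], "cCRE_2": [3.0, 2.0, 1.0]}
links = link_ccres(gene_expr, ccre_access)
```

Only positively correlated cCREs are kept here; whether negative correlations should also be reported is a design choice this sketch leaves open.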