This repository describes the bioinformatics component of the ERG transcription factor study to be published. The study has several distinct components, which are mirrored in the repository's directory structure. A few themes recur across them and are detailed here. Deviations in the individual workflows are captured in each subdirectory's README.
All downstream analyses require alignment of high-throughput sequencing data. We use the GRCh38 (hg38) assembly and the matching release 99 gene annotation from Ensembl. To retrieve these,
wget ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.chr.gtf.gz
wget ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
gunzip *.gz
mv Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa hg38.fa
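Before building any index, it can be worth confirming that the assembly and annotation decompressed cleanly and use the same sequence names, since a naming mismatch between FASTA and GTF tends to surface only later as empty count tables. The following is a minimal sketch and not part of the study's pipeline; the function names are hypothetical helpers introduced here for illustration.

```shell
# Hypothetical helpers (not in the repository) for sanity-checking the downloads.

# Count the sequences in a FASTA file (header lines begin with '>').
count_fasta_sequences() {
    grep -c '^>' "$1"
}

# List the distinct sequence names in a FASTA file (first word of each header).
fasta_seqnames() {
    grep '^>' "$1" | sed 's/^>//' | cut -d' ' -f1 | sort -u
}

# List the distinct sequence names (column 1) in a GTF file, skipping '#' comments.
gtf_seqnames() {
    grep -v '^#' "$1" | cut -f1 | sort -u
}
```

For example, `count_fasta_sequences hg38.fa` reports how many sequences the assembly contains, and comparing the output of `gtf_seqnames Homo_sapiens.GRCh38.99.chr.gtf` against `fasta_seqnames hg38.fa` shows whether every annotated sequence is present in the assembly.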
The general bioinformatics workflow for a bulk RNA-seq experiment is: RNA-seq --> recommended QC with FastQC --> STAR --> feature counting (Subread or HTSeq) --> DESeq2 or edgeR. In this study, we chose Subread and DESeq2. In all cases, we begin with FASTQ files, which may come from various sources (for more details, see the respective subdirectories).
We can generate QC metrics on our FASTQ files with FastQC:
mkdir fastqc_reports
fastqc *.fastq -t 6 -o fastqc_reports/
In certain cases, we use these QC metrics to decide on trimming of the FASTQ files, such as cutting out adapters or low-quality reads. Most of the FASTQ files in this study do not require additional processing, so we defer documentation of any trimming to the affected subdirectories.
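To decide which files might need trimming, one option is to scan the per-module verdicts that FastQC records in each report's `summary.txt` (available inside the report archive, or on disk when FastQC is run with `--extract`). The sketch below assumes the reports have been extracted; the function name `failed_modules` is a hypothetical helper, not something provided by FastQC or this repository.

```shell
# Hypothetical helper: list the FastQC modules flagged FAIL in a summary.txt file.
# Each line of summary.txt is "<PASS|WARN|FAIL><TAB><module name><TAB><file name>".
failed_modules() {
    awk -F'\t' '$1 == "FAIL" { print $3 "\t" $2 }' "$1"
}
```

Running `failed_modules` over every `fastqc_reports/*_fastqc/summary.txt` gives a quick list of file and module pairs to inspect, for instance a failing "Adapter Content" module as a hint that adapter trimming is needed.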
Next, we begin the alignment process using STAR. We used version 2.7.3a, which was the latest version available on Indiana University's HPC at the time. First, using the assembly/annotation files obtained at the beginning, we create an index of the genome.
mkdir indices
mkdir STAR
STAR --runThreadN 6 --runMode genomeGenerate --genomeFastaFiles hg38.fa --sjdbGTFfile Homo_sapiens.GRCh38.99.chr.gtf --genomeDir indices
With the index created, we can finally align our reads. Of course, this step requires the specific FASTQ files as input. Here, we document the general command:
# Align each FASTQ file; output goes to STAR/<file>_Aligned.sortedByCoord.out.bam
for file in *.fastq
do
    STAR --runThreadN 6 --genomeDir indices --readFilesIn "$file" --outFileNamePrefix "STAR/${file}_" --outSAMtype BAM SortedByCoordinate
done
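After alignment, STAR writes a `Log.final.out` file next to each output (here `STAR/<file>_Log.final.out`, given the prefix used above), and its "Uniquely mapped reads %" line is a convenient first check on alignment quality. The following is a minimal sketch for collecting that figure across samples; the function names are hypothetical helpers and the default directory `STAR` is taken from the command above.

```shell
# Hypothetical helper: print the uniquely mapped read percentage
# from a single STAR Log.final.out file.
unique_mapping_rate() {
    awk -F'|' '/Uniquely mapped reads %/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# Print one "<log file><TAB><rate>" line per STAR log in a directory (default: STAR).
summarize_star_logs() {
    local dir="${1:-STAR}"
    local log
    for log in "$dir"/*_Log.final.out; do
        [ -e "$log" ] || continue   # skip when the glob matches nothing
        printf '%s\t%s\n' "$(basename "$log")" "$(unique_mapping_rate "$log")"
    done
}
```

Calling `summarize_star_logs` with no argument from the working directory prints a two-column table that makes samples with unusually low mapping rates easy to spot.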
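The alignment output is a coordinate-sorted BAM file for each sample, and we pass these into featureCounts (part of the Subread package) to do the feature counting. From within the STAR directory:
featureCounts -T 6 -s 2 -a ../Homo_sapiens.GRCh38.99.chr.gtf -o counts.tsv *.fastq_Aligned.sortedByCoord.out.bam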
featureCounts has an important parameter, -s, which sets the strand-specific information (0 for unstranded, 1 for stranded, 2 for reversely stranded): "Since your RNA-seq will likely be stranded, which means it knows which of the two DNA strands were used as a template for the RNA, it is critical to put the correct strandedness when doing feature counting."
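When the library's strandedness is not known in advance, one common approach is to count a single sample under each -s setting and compare how many reads are assigned to features. featureCounts writes a `<output>.summary` file alongside its count table, and the sketch below sums the "Assigned" row of such a file. The function name `assigned_reads` and the `strand_test_*` file names are hypothetical, and the commented loop is an illustration rather than the procedure used in this study.

```shell
# Hypothetical helper: total reads in the "Assigned" row of a
# featureCounts .summary file, summed over all sample columns.
assigned_reads() {
    awk -F'\t' '$1 == "Assigned" {
        total = 0
        for (i = 2; i <= NF; i++) total += $i
        printf "%.0f\n", total
    }' "$1"
}

# Illustration only (requires featureCounts; <sample>.bam is a placeholder):
# for s in 0 1 2; do
#     featureCounts -T 6 -s "$s" -a ../Homo_sapiens.GRCh38.99.chr.gtf \
#         -o "strand_test_s${s}.tsv" <sample>.bam
#     printf -- '-s %s\t%s\n' "$s" "$(assigned_reads "strand_test_s${s}.tsv.summary")"
# done
```

If one of the two stranded settings assigns far more reads than the other, that setting matches the library; if they assign similar numbers, the library is likely unstranded.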
Finally, we do the differential gene expression analysis using the Bioconductor package DESeq2. All runs of this are conveniently bundled in R Markdown and HTML files placed in their corresponding subdirectories. Notably, the R files can be run as scripts to generate the files needed for the downstream analyses, in case the graphs generated by DESeq2 are not desired.
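DESeq2 works from a gene-by-sample matrix of raw counts, whereas the featureCounts table begins with a `#` comment line and carries five annotation columns (Chr, Start, End, Strand, Length) between the gene ID and the per-sample counts. As a minimal sketch, and not necessarily how the study's R Markdown files read the table, the hypothetical helper below reduces `counts.tsv` to the gene ID plus count columns.

```shell
# Hypothetical helper: strip the comment line and the annotation columns
# (Chr, Start, End, Strand, Length) from a featureCounts output table,
# leaving Geneid followed by one count column per sample.
counts_matrix() {
    grep -v '^#' "$1" | cut -f1,7-
}
```

For example, `counts_matrix counts.tsv > counts_matrix.tsv` produces a tab-separated file whose first column can be used as row names when loading it in R.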
In our study, we used ChIP-seq for various purposes. For more information, please see the corresponding subdirectories. Here, we outline the alignment process (using bowtie2) and the annotation of the corresponding peaks, the former being necessary for all the subsequent ChIP-seq analyses.
As in the RNA-seq workflow, we begin with FASTQ files and should QC them using FastQC. We refer to the above section and skip to the alignment process. In this alignment step, we use bowtie2, version 2.3.2. As before, we must first build our index.
mkdir index
bowtie2-build hg38.fa index/index
Then we do the alignment. The general command looks like
bowtie2 --no-unal --threads 3 -x index/index -U .fastq -S .sam
with the appropriate filenames filled in. To perform the downstream analysis, we want sorted BAM files. We accomplish this with samtools version 1.9.
samtools view -b .sam > .bam
samtools sort .bam -o _sorted.bam
samtools index _sorted.bam
In addition to the general commands provided above, scripts that run these steps with multiprocessing are included in a few of the subdirectories.
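To apply the alignment and sorting steps to many files, one simple option is to generate the four commands per FASTQ file and review them before running. This is a minimal dry-run sketch, not one of the repository's scripts: the function name `chip_align_commands` is hypothetical, and deriving the sample name by stripping the `.fastq` extension is an assumption about the file naming.

```shell
# Hypothetical helper: print (without running) the bowtie2 and samtools
# commands for one single-end FASTQ file. The sample name is assumed to be
# the file name with its .fastq extension removed.
chip_align_commands() {
    local fastq="$1"
    local sample="${fastq%.fastq}"
    echo "bowtie2 --no-unal --threads 3 -x index/index -U ${fastq} -S ${sample}.sam"
    echo "samtools view -b ${sample}.sam > ${sample}.bam"
    echo "samtools sort ${sample}.bam -o ${sample}_sorted.bam"
    echo "samtools index ${sample}_sorted.bam"
}
```

A loop such as `for f in *.fastq; do chip_align_commands "$f"; done > align_all.sh` writes every command to one file, which can be inspected and then executed with `bash align_all.sh`.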
Many of the analyses require information about the peaks. We call peaks using macs2, version 2.1.2.
mkdir peaks
macs2 callpeak -t <treated>_sorted.bam -c <control>_sorted.bam -f BAM -g hs -n ERG --outdir peaks
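With `-n ERG --outdir peaks`, macs2 writes its peak calls to `peaks/ERG_peaks.narrowPeak`, a tab-separated file whose ninth column holds the -log10 q-value of each peak. The sketch below shows one way to count the peaks and keep only those above a chosen significance cutoff; the function names are hypothetical helpers and any threshold passed to them is an illustrative choice, not a value used in this study.

```shell
# Hypothetical helper: number of peaks (lines) in a narrowPeak file.
count_peaks() {
    wc -l < "$1" | tr -d ' '
}

# Hypothetical helper: keep peaks whose -log10(q-value), stored in column 9,
# is at least the given threshold.
filter_peaks_by_qvalue() {
    awk -F'\t' -v min="$2" '($9 + 0) >= (min + 0)' "$1"
}
```

For instance, `filter_peaks_by_qvalue peaks/ERG_peaks.narrowPeak 2` retains peaks with a q-value of 0.01 or smaller.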
We can further annotate the peaks, which is accomplished with an R script annoPeaks.R, contained in the appropriate subdirectories.
The ChIP plots, such as heatmaps, were generated from the alignment files above using deepTools, version 3.4.3. The documentation is placed in the corresponding subdirectory.