ERG

This repository describes the bioinformatics component of the ERG transcription factor study to be published. The study has several distinct components, which are mirrored in the repository's directory structure. A few steps recur across these components and are detailed here. Deviations in the individual workflows are documented in each subdirectory's README.

All downstream analyses require alignment of high-throughput sequencing data. We use the GRCh38 (hg38) genome assembly and annotation from Ensembl, release 99. To retrieve these:

wget ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.chr.gtf.gz
wget ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
gunzip *.gz
mv Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa hg38.fa
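The annotation and the assembly should come from the same Ensembl release, so it can help to derive both URLs from a single release number. The following is a minimal sketch that mirrors the URL layout of the two wget commands above; the function name ensembl_urls is illustrative and not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the GTF and FASTA URLs for one Ensembl release,
# following the URL layout of the wget commands in this README.
ensembl_urls() {
    local release="$1"
    local base="ftp://ftp.ensembl.org/pub/release-${release}"
    echo "${base}/gtf/homo_sapiens/Homo_sapiens.GRCh38.${release}.chr.gtf.gz"
    echo "${base}/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz"
}

# Print the URLs for the release used in this study.
ensembl_urls 99
```

The output can be piped to wget -i - to download both files in one step.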

RNA-seq

The general bioinformatics workflow for a bulk RNA-seq experiment is: RNA-seq --> recommended QC with FastQC --> STAR --> feature counting (Subread or HTSeq) --> DESeq2 or edgeR. In this study, we chose Subread and DESeq2. In all cases, we begin with FASTQ files, which may come from various sources (for more details, see the respective subdirectories).

Preprocessing

We can generate QC metrics on our FASTQ files with FastQC:

mkdir fastqc_reports
fastqc -t 6 -o fastqc_reports/ *.fastq

In certain cases, we use these QC metrics to decide whether to trim the FASTQ files, for example by removing adapters or low-quality reads. Most of the FASTQ files in this study do not require additional processing, so any trimming is documented in the affected subdirectories.
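Before moving on, it can be useful to confirm that every FASTQ file actually received a report. The sketch below assumes FastQC's usual naming, where sample.fastq yields sample_fastqc.html in the output directory; the function name missing_fastqc_reports is hypothetical.

```shell
#!/usr/bin/env bash
# List FASTQ files in a directory that have no FastQC HTML report yet.
# Assumes the report is named <basename>_fastqc.html, where <basename> is the
# input file name without the .fastq extension.
missing_fastqc_reports() {
    local fastq_dir="$1" report_dir="$2" fq base
    for fq in "$fastq_dir"/*.fastq; do
        [ -e "$fq" ] || continue          # the glob matched no FASTQ files
        base="$(basename "$fq" .fastq)"
        if [ ! -f "$report_dir/${base}_fastqc.html" ]; then
            echo "$fq"
        fi
    done
}

# Check the current directory against the report directory used above.
missing_fastqc_reports . fastqc_reports
```

An empty output means every FASTQ file in the directory has a matching report.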

Aligning and Feature Count

Next, we begin the alignment process using STAR. We used version 2.7.3a, which was the latest version available on Indiana University's HPC at the time. First, using the assembly/annotation files obtained at the beginning, we create an index of the genome.

mkdir indices
mkdir STAR
STAR --runThreadN 6 --runMode genomeGenerate --genomeFastaFiles hg38.fa --sjdbGTFfile Homo_sapiens.GRCh38.99.chr.gtf --genomeDir indices
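The index command above relies on STAR's default --sjdbOverhang of 100, which the STAR manual describes as working well in most cases. The manual gives the ideal value as the maximum read length minus one. As a sketch, the maximum read length of a FASTQ file can be measured as follows; the function name max_read_length and the file name sample.fastq are illustrative only.

```shell
#!/usr/bin/env bash
# Print the maximum read length found in an uncompressed FASTQ file.
# In a 4-line-per-record FASTQ, the sequence is the 2nd line of each record.
max_read_length() {
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         END { print max + 0 }' "$1"
}

# Example with a hypothetical file name (left commented out):
# overhang=$(( $(max_read_length sample.fastq) - 1 ))
# STAR --runThreadN 6 --runMode genomeGenerate --genomeFastaFiles hg38.fa \
#      --sjdbGTFfile Homo_sapiens.GRCh38.99.chr.gtf --genomeDir indices \
#      --sjdbOverhang "$overhang"
```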

With the index created, we can finally align our reads. This step requires the specific FASTQ files as input. Here, we document the general command:

for file in *.fastq
do
    STAR --runThreadN 6 --genomeDir indices --readFilesIn "$file" --outFileNamePrefix "STAR/${file}_" --outSAMtype BAM SortedByCoordinate
done
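After the loop finishes, STAR leaves a summary log for each sample next to the alignment, named with the same prefix and ending in Log.final.out. A quick way to spot a failed or poorly mapped sample is to pull the uniquely mapped percentage from each log. This is a minimal sketch, assuming the logs sit in the STAR directory as produced by the loop above; the function name star_unique_rates is hypothetical.

```shell
#!/usr/bin/env bash
# Print "<log file name><TAB><uniquely mapped %>" for every STAR final log
# found in the given directory.
star_unique_rates() {
    local star_dir="$1" log
    for log in "$star_dir"/*Log.final.out; do
        [ -e "$log" ] || continue        # the glob matched no log files
        awk -F '|' -v name="$(basename "$log")" '
            /Uniquely mapped reads %/ {
                gsub(/[ \t]/, "", $2)    # strip padding around the value
                print name "\t" $2
            }' "$log"
    done
}

star_unique_rates STAR
```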

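Because of --outSAMtype BAM SortedByCoordinate, the alignment output is a coordinate-sorted BAM file for each sample, named with the given prefix followed by Aligned.sortedByCoord.out.bam. We pass these files to featureCounts, from the Subread package, to do the feature counting. The ../ path in the command below assumes it is run from inside the STAR directory.

featureCounts -T 6 -s 2 -a ../Homo_sapiens.GRCh38.99.chr.gtf -o counts.tsv *.fastq_Aligned.sortedByCoord.out.bam

featureCounts has an important parameter, -s, that sets the strand specificity: 0 for unstranded, 1 for stranded, and 2 for reversely stranded. RNA-seq libraries are likely to be stranded, meaning the protocol preserves which of the two DNA strands was used as the template for the RNA, so it is critical to set the correct strandedness when doing feature counting.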
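When the library protocol is uncertain, one common empirical check is to run featureCounts on a single alignment file three times, with -s 0, -s 1 and -s 2, and compare how many reads are assigned. For a reversely stranded library, -s 2 should assign far more reads than -s 1. The sketch below reads the Assigned row from the .summary file that featureCounts writes next to each output; the output names strand_s0.tsv, strand_s1.tsv and strand_s2.tsv and the function name assigned_reads are hypothetical.

```shell
#!/usr/bin/env bash
# Print the "Assigned" read count (first sample column) from a featureCounts
# .summary file, which is a tab-separated table with one row per status.
assigned_reads() {
    awk -F '\t' '$1 == "Assigned" { print $2 }' "$1"
}

# Report the Assigned count for each -s setting whose summary file exists.
# The file names strand_s<N>.tsv.summary are assumptions for this sketch.
for s in 0 1 2; do
    summary="strand_s${s}.tsv.summary"
    if [ -f "$summary" ]; then
        printf -- '-s %s\t%s\n' "$s" "$(assigned_reads "$summary")"
    fi
done
```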

Differential Expression Analysis

Finally, we perform the differential expression analysis using the Bioconductor package DESeq2. Each run is bundled as an R Markdown file together with its HTML output, placed in the corresponding subdirectory. The R files can also be run as scripts, which generates the files needed for the downstream analyses when the graphs produced alongside DESeq2 are not desired.
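DESeq2 expects a gene-by-sample matrix of raw counts, whereas the featureCounts output starts with a comment line and carries five annotation columns (Chr, Start, End, Strand, Length) between the gene ID and the counts. The R scripts in the subdirectories may already handle this; as a sketch, the reduction can also be done in the shell. The function name counts_matrix and the output name counts_matrix.tsv are illustrative.

```shell
#!/usr/bin/env bash
# Reduce featureCounts output to a gene-by-sample count matrix:
# drop the leading "#" comment line and keep column 1 (Geneid) plus the
# per-sample count columns, which start at column 7.
counts_matrix() {
    grep -v '^#' "$1" | cut -f 1,7-
}

# Convert counts.tsv if it is present in the current directory.
if [ -f counts.tsv ]; then
    counts_matrix counts.tsv > counts_matrix.tsv
fi
```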

ChIP-seq

In our study, we used ChIP-seq for various purposes. For more information, please see the corresponding subdirectories. Here, we outline the alignment process (using bowtie2) and the annotation of the resulting peaks. The alignment is necessary for all subsequent ChIP-seq analyses.

Aligning

As with RNA-seq, we begin with FASTQ files and should QC them using FastQC. We refer to the section above and skip to the alignment step, for which we use bowtie2, version 2.3.2. As before, we must first build our index.

mkdir index
bowtie2-build hg38.fa index/index

Then we do the alignment. The general command looks like

bowtie2 --no-unal --threads 3 -x index/index -U <sample>.fastq -S <sample>.sam

with the appropriate filenames filled in. To perform the downstream analysis, we want sorted BAM files. We accomplish this with samtools version 1.9.

samtools view -b <sample>.sam > <sample>.bam
samtools sort <sample>.bam -o <sample>_sorted.bam
samtools index <sample>_sorted.bam

In addition to the general commands above, a few of the subdirectories include scripts that run these steps with multiprocessing.
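As an illustration of how the alignment and sorting steps might be batched, the sketch below prints one pipeline per FASTQ file so that the list can be inspected before anything runs. It is not the repository's own script. It also differs from the commands above in one respect: bowtie2 output is piped straight into samtools sort, which skips the intermediate SAM and unsorted BAM files. The function name chip_align_commands is hypothetical, and file names are assumed to contain no spaces or quotes.

```shell
#!/usr/bin/env bash
# Print one align-sort-index pipeline per FASTQ file in a directory.
# Each output line is a complete shell command.
chip_align_commands() {
    local fastq_dir="$1" fq sample
    for fq in "$fastq_dir"/*.fastq; do
        [ -e "$fq" ] || continue         # the glob matched no FASTQ files
        sample="$(basename "$fq" .fastq)"
        printf 'bowtie2 --no-unal --threads 3 -x index/index -U %s | samtools sort -o %s_sorted.bam - && samtools index %s_sorted.bam\n' \
            "$fq" "$sample" "$sample"
    done
}

# Dry run: show the commands without executing them.
chip_align_commands .

# To execute two samples at a time (assumes bowtie2 and samtools are on PATH):
# chip_align_commands . | xargs -P 2 -I {} bash -c '{}'
```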

Peaks

Many of the analyses require information about the peaks. We call peaks using macs2, version 2.1.2.

mkdir peaks
macs2 callpeak -t <treated>_sorted.bam -c <control>_sorted.bam -f BAM -g hs -n ERG --outdir peaks
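With -n ERG and --outdir peaks, MACS2 writes its peak list to peaks/ERG_peaks.narrowPeak. In the narrowPeak format, column 9 holds the -log10 q-value of each peak, so a stricter cutoff can be explored after the fact without calling peaks again. The following is a minimal sketch; the function name count_peaks and the threshold of 2 (q <= 0.01) are illustrative choices, not values used in this study.

```shell
#!/usr/bin/env bash
# Count peaks in a narrowPeak file whose -log10(q-value), stored in column 9,
# is at least the given threshold.
count_peaks() {
    local narrowpeak="$1" min_neglog10_q="$2"
    awk -F '\t' -v t="$min_neglog10_q" \
        '$9 >= t { n++ } END { print n + 0 }' "$narrowpeak"
}

# Count peaks with q <= 0.01 if the MACS2 output is present.
if [ -f peaks/ERG_peaks.narrowPeak ]; then
    count_peaks peaks/ERG_peaks.narrowPeak 2
fi
```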

We can further annotate the peaks with the R script annoPeaks.R, which is contained in the appropriate subdirectories.

Plots

The ChIP-seq plots, such as heatmaps, were generated from the alignment files above using deepTools, version 3.4.3. The documentation is placed in the corresponding subdirectory.
