Skip to content

Lab 03: Mapping with STAR

Ryan edited this page Aug 15, 2023 · 6 revisions

Lab 3: Read Mapping

Trimmed Reads are Mapped onto the Reference Genome

SOFTWARE

STAR - Spliced Transcripts Alignment Publication

STAR Manual 2.7.0a

Mapping Reads with STAR

Aligning reads with STAR is a two-step process:

  • Create a Genome Index: Star creates an indexed genome using reference genome (Fasta Files) and annotations (GFF file). The indexes created are used in the second step of this process. For every new genome and annotation combination, new indexes needs to be creeated.
  • Aligning Reads: This step involves using genome indexes files in combination with RNA-seq sequences (FASTA or FASTQ). STAR will map reads to the genome and generate outputs with alignment file (SAM/BAM), mapping summary statistics, splice junctions, unmapped reads, signal tracks etc.

Running STAR

1. Make folder to store index files

spack load star
mkdir 03_mapping_star
cd 03_mapping_star
ln -s /pickett_shared/teaching/RNASeq_workshop/final_outputs/02_trimming/cut*.fastq.gz ./

mkdir Star_index
cd Star_index
ln -s /pickett_shared/teaching/RNASeq_workshop/raw_data/reference/* ./
cd ..

2. Generation of Index Files: Input files (Reference Genome, FASTA) and (Annotation, GFF)

(~7 minutes to run)

STAR \
--runMode genomeGenerate \
--genomeDir ./Star_index \
--genomeFastaFiles ./Star_index/Athaliana_447_TAIR10.fa \
--runThreadN 1 \
--sjdbGTFfile ./Star_index/Athaliana_447_Araport11.gene_exons.gff3 \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbOverhang 149

--runMode: genomeGenerate mode directs STAR to run genome indices generation job.

--genomeDir: /path/to/store/genome_indices

--genomeFastaFiles: /path/to/FASTA_file. Multiple reference sequences (chromosomes) are allowed for each fasta file.

--runThreadN: number of threads (depends on available cores on the server node.)

--sjdbGTFfile: /path/to/GTF_file (path to annotation file) STAR can be run without annotations but is not recommended.

In general, for --sjdbGTFfile files STAR only processes lines which have --sjdbGTFfeatureExon (=exon by default) in the 3rd field (column). The exons are assigned to the transcripts using parent-child relationship defined by the --sjdbGTFtagExonParentTranscript (=transcript id by default) GTF/GFF attribute.

--sjdbGTFtagExonParentTranscript: For GFF3 formatted annotations you need to use "Parent".

--sjdbOverhang specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as well as the ideal value.

3. Mapping Reads to Reference Genome

(~25 minutes to run)

STAR \
 --genomeDir ./Star_index \
 --readFilesIn ./cut_SRR17062759_1.fastq.gz ./cut_SRR17062759_2.fastq.gz \
 --runThreadN 1 \
 --readFilesCommand zcat \
 --outFileNamePrefix ./SRR17062759. \
 --limitBAMsortRAM 13905678955 \
 --outSAMtype BAM SortedByCoordinate

--genomeDir: /path/to/genome_indices_directory

--readFilesIn: /path/to/FASTQ_file (RNA-seq trimmed data). If the read files are compressed, use the --readFilesCommand UncompressionCommand option, where UncompressionCommand is the un-compression command that takes the file name as input parameter, and sends the uncompressed output to stdout. For example, for gzipped files (*.gz) use --readFilesCommand zcat OR --readFilesCommand gunzip -c. For bzip2-compressed files, use --readFilesCommand bunzip2 -c.

***Pair-end data will have two input sequence files and Single-end data will have one input file. (Both pari-end data file needs to be analyzed together).

--outFileNamePrefix: prefix for all output files. You will need to add a period at the end to separate your sample name from STAR's output file extensions.

--outSAMtype: output filetype (SAM default)

--runThreadN: number of threads / cores

Output Formats

The BAM alignment file is binary in nature and is not readable by humans. Convert the BAM file to SAM file (human readable form) by using SAMTOOLS:

samtools view -h SRR17062759.Aligned.sortedByCoord.out.bam > SRR17062759.Aligned.sortedByCoord.out.sam

Running STAR in a loop

Useful when multiple RNA-seq data needs to be run.

for R1 in /pickett_shared/teaching/RNASeq_workshop/final_outputs/02_trimming/cut*_1.fastq.gz; do
  R2=`sed 's/_1/_2/' <(echo $R1)`
  BASE=$( basename $R1 | sed 's/_1.fastq.gz//g')
  BASE=`sed 's/cut_//' <(echo $BASE)`

  STAR \
    --genomeDir ./Star_index \
    --readFilesIn $R1 $R2 \
    --runThreadN 2 \
    --readFilesCommand zcat \
    --outFileNamePrefix ${BASE}. \
    --limitBAMsortRAM 13905678955 \
    --outSAMtype BAM SortedByCoordinate
done

The BAM output can be viewed graphically using IGV (Integrative Genomics Viewer)

Clone this wiki locally