Syllabus
Group 1 (mRNA): Kristen, Shelby, Ben, Soroya, Arpan
Group 2 (lncRNA): Savannah, Michael, Thao, Dan, Graycen
Group 3 (TE): Alison, Devin, Tom, Kevin, Guilia
Lecture 1: Introduction and overview
Read Chapters 1-3 for Monday Jan 22
Bioinformatics Data Skills by Vince Buffalo
First we will install the following tools, and after that work through some command line exercises.
Mac:
- preferred: iTerm2
- alternate: Mac Terminal
PC:
- preferred: Windows Subsystem for Linux -- Ubuntu 18.04
- alternate: Git Bash
Mac:
- iTerm2 will work for this!
PC:
- preferred: putty
Mac:
PC:
- preferred: FileZilla
- Sign up for an account at github.com
- You can get lots of deals with the GitHub student pack
Lecture 2: Data Reproducibility in Science / Intro to Transcriptional regulation
Overview of Encode / ChIP and Transposons as missing regulatory regions
What is a promoter and transcription factor?
Select samples, click columns and add control, click on the table, and then download the .tsv.
Use ls, head, tail, cat, awk / grep to explore this metadata table.
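As a rough sketch of that exploration, the commands below run on a tiny stand-in table. The file name `samples.tsv`, the three column names, and the accession values are all made up for illustration; the real ENCODE download has many more columns and a different header.

```shell
# Build a tiny stand-in metadata table (hypothetical columns and accessions).
tmpdir=$(mktemp -d)
table="$tmpdir/samples.tsv"
printf 'accession\ttarget\tbiosample\n'  >  "$table"
printf 'ENCSR000AAA\tCTCF\tK562\n'       >> "$table"
printf 'ENCSR000BBB\tPOLR2A\tK562\n'     >> "$table"
printf 'ENCSR000CCC\tCTCF\tHepG2\n'      >> "$table"

ls -lh "$tmpdir"          # is the file there, and how big is it?
head -n 2 "$table"        # header plus the first record
tail -n 1 "$table"        # last record
cat "$table"              # fine for a small file; use head for large ones

# Number the header fields so you know which column is which.
head -n 1 "$table" | tr '\t' '\n' | cat -n

# awk: print accession and target for rows whose third column is K562.
k562_rows=$(awk 'BEGIN {FS = OFS = "\t"} NR > 1 && $3 == "K562" {print $1, $2}' "$table")
echo "$k562_rows"

# grep: count the lines that mention CTCF.
ctcf_count=$(grep -c 'CTCF' "$table")
echo "CTCF rows: $ctcf_count"
```

Numbering the header first is a convenient habit: once you know a column's position, `awk` can filter on it exactly, whereas `grep` matches the pattern anywhere on the line.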
Taking notes in Markdown
- In addition to the sample-level metadata you retrieved last time, download this table of file-level metadata from ENCODE.
- As a group, choose one transcription factor that you would like to analyze.
- We're going to subset and organize our metadata file to include just the files that you would like to download and the columns that will be useful to you, using `awk` and `grep`.
- We'll also make a file which contains the URLs to retrieve the fastq files from ENCODE.
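One way this subsetting could look is sketched below. The table layout (`file_accession`, `file_format`, `target`, `url`), the accessions, and the `example.org` URLs are assumptions for illustration only; check the header of the real ENCODE file-level table and adjust the column numbers.

```shell
# Stand-in file-level metadata (hypothetical columns, accessions, and URLs).
tmpdir=$(mktemp -d)
meta="$tmpdir/files.tsv"
printf 'file_accession\tfile_format\ttarget\turl\n'                           >  "$meta"
printf 'ENCFF000AAA\tfastq\tCTCF\thttps://example.org/ENCFF000AAA.fastq.gz\n'   >> "$meta"
printf 'ENCFF000BBB\tbam\tCTCF\thttps://example.org/ENCFF000BBB.bam\n'          >> "$meta"
printf 'ENCFF000CCC\tfastq\tPOLR2A\thttps://example.org/ENCFF000CCC.fastq.gz\n' >> "$meta"
printf 'ENCFF000DDD\tfastq\tCTCF\thttps://example.org/ENCFF000DDD.fastq.gz\n'   >> "$meta"

# Keep the header plus fastq rows for one TF, and only three useful columns.
awk 'BEGIN {FS = OFS = "\t"}
     NR == 1 || ($2 == "fastq" && $3 == "CTCF") {print $1, $3, $4}' \
    "$meta" > "$tmpdir/ctcf_fastq.tsv"

# Write the URL column (now column 3) to its own file, skipping the header.
awk -F'\t' 'NR > 1 {print $3}' "$tmpdir/ctcf_fastq.tsv" > "$tmpdir/urls.txt"

# Sanity check with grep: every line should look like a URL.
n_urls=$(grep -c '^https://' "$tmpdir/urls.txt")
echo "URLs to download: $n_urls"
cat "$tmpdir/urls.txt"
```

A file with one URL per line is exactly what `wget -i` expects, so the same file can drive the download step later.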
Read chapters 4-5
Lecture 3: Where does data live in Biology, how do we get it, and did we get the right file?
SFTP, SSH, SCP
wget -i file.txt
md5sum
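To illustrate the "did we get the right file?" check, here is a small sketch using GNU `md5sum` on a throwaway file. The file names are placeholders; for real data you would compare against the checksums published by the data provider rather than ones you computed yourself.

```shell
# Create a throwaway "download" and record its checksum (placeholder file names).
tmpdir=$(mktemp -d)
printf 'ACGTACGT\n' > "$tmpdir/reads.fastq"
md5sum "$tmpdir/reads.fastq" > "$tmpdir/checksums.md5"

# Verify: md5sum -c re-computes each checksum listed in the file and compares.
if md5sum -c "$tmpdir/checksums.md5" > /dev/null 2>&1; then
    first_check="ok"
else
    first_check="failed"
fi
echo "before corruption: $first_check"

# Simulate a corrupted or truncated transfer by changing the file contents.
printf 'ACGTACGA\n' > "$tmpdir/reads.fastq"
if md5sum -c "$tmpdir/checksums.md5" > /dev/null 2>&1; then
    second_check="ok"
else
    second_check="failed"
fi
echo "after corruption: $second_check"
```

Because `md5sum -c` returns a non-zero exit status on any mismatch, it can gate later steps in a script.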
Lecture 4: git & gitting GitHub
Class Exercises:
- Create a git repository and commit some changes
- Create one GitHub repository per group and commit your sample sheet script
We're going to create a file that matches the ChIP samples to their control samples. The format of this file is specified by the pipeline that we will be running.
THESE ARE THE REQUIRED COLUMNS FOR THE DESIGN FILE
group,replicate,fastq_1,fastq_2,antibody,control (**** fastq1 and fastq2 URLS ****)
Make a design file by Friday January 31 for your TF
- Hint: this may be easiest in Excel. Look up the file accession number for your TF. Then look for "paired with"; you will see a new file accession number -- that needs to be in your control column.
- If your "paired with" identifier is not in the sample sheet (Jan 22 lecture notes) -- then go to the ENCODE portal and find it :)
- Advanced exercise : Script this in bash (going to need a few greps & joins :)
- Meet at Space Sciences
Please take notes on the key rules and regulations: the dos and don'ts!
- Layout of class directories -- where will you be doing work?
- Get a local git repo -- set up ssh key for fiji-GitHub
- Moving files to and from fiji
/Users/<identikey>
Scratch: the wild west, no limits (within reason). This is where we will start doing analysis, set up git, etc.
/scratch/Users/<identikey>
/Shares/rinn_class/students/<identikey>
/Shares/rinn_class/data
Design File presentations
rsync
- SCREEN (screen -list / Ctrl-a then d to detach / screen -r)
- Get fastq's for your TF
- SLURM review (interactive & batch jobs)
- md5sum -c
- Go over class design file
Exchange design files to have a total of 3 TFs (e.g., collaborate with another group).
`cp` design files.
What happened? How can we solve this?
Discuss and catch up on what we have learned about unix and commands etc
Lecture 5: Flowing with Nextflow
Nextflow paper: Nextflow enables reproducible computational workflows
Read basic documentation and install nextflow in your path!
- design.csv
- nextflow.config
- run.sh
- blacklist
- fastq directory w/ fastqs downloaded
- checked by John or Michael
- run pipeline
sbatch run.sh
squeue -u X000
scancel jobid
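For orientation, a `run.sh` for SLURM might look roughly like the sketch below. The resource values are guesses, and it assumes the pipeline parameters (design file, genome, blacklist) are set in `nextflow.config`; use the script provided in class for the actual run. The snippet writes the script to a scratch directory and only syntax-checks it, so nothing is submitted.

```shell
# Write a sketch of run.sh, then syntax-check it without running anything.
workdir=$(mktemp -d)
cat > "$workdir/run.sh" <<'EOF'
#!/bin/bash
#SBATCH --job-name=chipseq        # name shown by squeue
#SBATCH --ntasks=1                # the Nextflow head process is lightweight
#SBATCH --mem=4G                  # assumed value; adjust to your cluster
#SBATCH --time=24:00:00           # assumed wall-clock limit
#SBATCH --output=nextflow.out     # stdout log to read afterwards
#SBATCH --error=nextflow.err      # stderr log

# Assumption: design.csv, genome, and blacklist are configured in nextflow.config.
nextflow run nf-core/chipseq -c nextflow.config -resume
EOF

if bash -n "$workdir/run.sh"; then
    syntax_ok="yes"
else
    syntax_ok="no"
fi
echo "run.sh syntax ok: $syntax_ok"

# On the cluster you would then use:
#   sbatch run.sh            # submit the job
#   squeue -u <identikey>    # list your queued and running jobs
#   scancel <jobid>          # cancel a job by its ID
```

The `-resume` flag lets Nextflow reuse cached results from completed steps, which helps when a job hits its time limit and is resubmitted.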
https://www.encodeproject.org/help/file-formats/
Read the Nextflow documentation and nextflow.out.
Homework: Google the programs used in nextflow.out
Fastqc
TrimGalore
BWAMem
SortBAM
MergeBAM
BigWig
MACSCallPeak
Peak QC
Let's cover some of the basic statistics used in the nf-core ChIP-seq pipeline. Probability distributions: Poisson, binomial, negative binomial. Scan statistics.
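As a small worked example of the Poisson case, the sketch below computes an upper-tail probability, P(X >= k) for X ~ Poisson(lambda), which is the kind of quantity a peak caller evaluates when asking whether a window has more reads than background would predict. The function name and the numbers (background rate 5, counts 5 and 20) are made up for illustration, and this is not the pipeline's own implementation.

```shell
# P(X >= k) for X ~ Poisson(lambda) = 1 - sum_{i=0}^{k-1} exp(-lambda) * lambda^i / i!
# (hypothetical helper; arguments: lambda, k)
poisson_upper_tail() {
    awk -v lambda="$1" -v k="$2" 'BEGIN {
        term = exp(-lambda)              # P(X = 0)
        cdf = 0
        for (i = 0; i < k; i++) {
            cdf += term                  # accumulate P(X = i)
            term *= lambda / (i + 1)     # step from P(X = i) to P(X = i + 1)
        }
        printf "%.6g\n", 1 - cdf
    }'
}

# With an expected background of 5 reads per window (made-up number):
p_background=$(poisson_upper_tail 5 5)    # seeing 5 or more reads is unremarkable
p_enriched=$(poisson_upper_tail 5 20)     # seeing 20 or more reads is very unlikely
echo "P(X >= 5  | lambda = 5) = $p_background"
echo "P(X >= 20 | lambda = 5) = $p_enriched"
```

Summing the terms iteratively avoids computing large factorials directly, which keeps the arithmetic stable for moderate counts.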
Recommended reading: Biometry Chapter 4
Class exercise: each group presents a statistical principle and how it is used in NF-CORE ChIPseq
Class UCSC Account: MyData > Sign in:
BCHM_5631
Pswd : will tell y'all in class
Lecture 7: DNA Binding Proteins from Structures to "Meta-analysis"
#Genome file
/Shares/rinn_class/data/genomes/human/gencode/v32/GRCh38.p13.genome.fa \
#Annotation file
/Shares/rinn_class/data/genomes/human/gencode/v32/gencode.v32.annotation.gtf \
/Shares/rinn_class/data/k562_chip/
- Your tracks
- Encode tracks (go to portal download bigwigs)
- Pre-baked tracks
Install x2go to use IGV on fiji.
Class exercise: download and view:
BigWig and BroadPeak files from this run versus one of yours
Are the results similar?
First: a quick tour of UCSC table browser
- How to load a RMSK track into IGV or R
Try loading a peak file into R
- they are just tab-separated tables and can be loaded with read.table(sep = "\t")
Each group presents three questions they would like to address based on the TE-DNA, TE-RBP, E-CLIP study designs.
Each person 3 questions.
Presentation outline:
- Introduce yourself and your research
- Present the question that you would like to pursue with the class dataset
- Discuss how you'd like to use lessons from this class in your own research
Lecture 9: Intro to R -- part II
- Continuation of R data types
- Introduction to ggplot2 and tidyverse
- Exercise -- plotting gene profiles
- Git from R
Good R tutorial:
https://www.youtube.com/watch?v=fDRa82lxzaU
Lecture 10: R for Genomics -- part I
Install the following packages in fiji-viz/RStudio
install.packages("BiocManager")
BiocManager::install("GenomicRanges")
BiocManager::install("rtracklayer")
- Review your solutions to the for loop/plotting exercise
- Introduce GRanges and findOverlaps
- Read in peak files, repeatMasker files, and find overlaps
Exercise: Make some plots to characterize the overlap of ChIP-seq peaks with TEs.
Can be as simple as plotting the number of overlaps of one particular TF with a class of TEs - OR - since you have data for all the TFs, you can plot each protein's peaks and where they fall in relation to the center of the repeat -- i.e. a metaplot heatmap or profile plot.
If you get stuck, ask your group members for help and if you're still stuck, ask in the general slack channel. We will go over your plots and code on Wednesday.
- 3 Groups of 5
Group 1 (mRNA): Kristen, Shelby, Ben, Soroya, Arpan
Group 2 (lncRNA): Savannah, Michael, Tao, Dan, Graycen
Group 3 (TE): Alison, Devon, Tom, Kevin, Guilia
- GRanges Gencode
- GRanges consensus.peak.file
- Intersect GRanges
- Go over TE intersection plots and problems
- Fix RMarkdown with Jon
- Introduction to RMarkdown and functions
- Git structure -- how teams will be committing to class repository
/scratch/Users/<identikey>
- Discussion: Clustering
Each person contributes commits to the README.md in each group. Submit a pull request to the master branch.
Write a function that will require peaks to be present in all replicates per TF. Then iterate over all TFs to create peak sets (GRanges objects) that consist of peaks present in all replicates. Write these peaks to one bed file per TF. Copy these peak files to your class directory /Shares/rinn_class/students/<identikey>. We will be reviewing these files on Monday.
Bonus: Write the function such that the number or percent of replicates required is adjustable.
Considerations: Do you want to merge the peak regions? What is the minimum overlap required? How do the results change when this parameter is varied? How many peaks do we lose by doing this approach?
- going remote as of Friday March 13.
- Browsing / spot checking consensus peaks in UCSC (session example "consensus_peaks" in UCSC class session list -- Randomly sampled peaks to check out)
- Peak files for each replicate
/Shares/rinn_class/data/ucsc_peaks
- Consensus peak files
/Shares/rinn_class/data/k562_chip/analysis/00_consensus_peaks/ucsc_peak_tracks
- BigWig file link bigWigs
Class Exercise:
Look through the profile plots and remake the plots for your favorite TF(s) or all of them.
Find two TFs that have different profile plots.
Find examples of their consensus peaks with bigWig replicates.
Present interesting aspects about these TFs from literature (NCBI Gene).
Prepare a presentation per group for Friday.
Slack a ppt or keynote to the general channel before class on Friday.
- Welcome to Zoooooom !
- Break out rooms for groups
- Slack and zoom / trello
- Presentations
Do intersects in class for your "biotype"
Class exercise (presentations Friday March 20): Find some interesting examples for your group (5 TFs).
Is there a trend with number of peaks and number of overlaps? How could we "shuffle" to understand if this is significant or happens by chance?
Which ones bind your biotype more than others? What is the most unique DNA binding protein for your group?
##### March 30: Functions, Features and Fun and git organization for analyses
[Paper to read on mRNA and lncRNA promoter properties](https://www.dropbox.com/s/ux3e7xzl9lsflxz/Mele_et_al.pdf?dl=0)
[Second paper to read on promoter properties](https://www.dropbox.com/s/m4832lsedpt826f/Genome%20Res.-2019-Mattioli-gr.242222.118.pdf?dl=0)
##### April 1: No class : APRIL-FOOLs <- Present interesting promoters that have many DNA binding protein events.
Clustering
##### April 3: Findings from clustering & paper figure presentations
- Present a figure and associated analysis/findings from each paper (Mele et al. & Mattioli et al.)
- Present findings from your clustering exercise:
- What groupings make sense?
- Are there different clustering groupings when you compare all promoters vs your subset?
##### April 6: Expression comparisons -- recapitulate the Mele et al. finding that promoters with more TFs have higher expression.
- Class exercise: are there promoters with lots of TFs that are not expressed? At what point would we say there are a lot of TFs bound :) ? Hint: histogram of co-occurrence matrix.
##### April 8:
- Walk through results (ghosts)
- Other questions to analyze? Distribute analyses.
- Prepare questions for Michael Snyder
##### April 10: Michael Snyder guest lecture/interview
##### April 13: Permutation test class exercise I
[Intuitive Statistics Lecture](https://www.dropbox.com/s/95iq9veg5e7qp1y/Permuation_false_discovery.pdf?dl=0)
Groups will work on `permutation_test_class.Rmd`
##### April 15: Permutation test class exercise II
##### April 17: Design manuscript outline
##### April 20: Work through making figures -- clean code and figures in .Rmd
##### April 22: Work through making figures -- clean code and figures in .Rmd
##### April 24: Figure from each group due in .Rmd
##### April 27: Finalize Figures and git
##### April 29: Sweep up the workshop !
Can we use data standards and reproducibility to write a paper on our findings? Let's set up the Paper-Pository on Git