CRISPR_Screen_Processing

A basic working example that you can follow to semi-automate the analysis of pooled CRISPR screen data in a uniform way, using MAGeCK and/or DrugZ for enrichment/depletion analysis. See ReadMe for more details. The testing table files are templates, not working examples.

Contents

  1. Main Folder

    • CRISPR.sh: bash script to process raw FASTQ files through cutadapt, bowtie 1.3, and MAGeCK (counting/RRA testing)
    • drugz.sh: bash script to process count table generated by MAGeCK using the drugz.py script
    • MAGeCK_ or DrugZ_ Tests_Table.txt: tab-delimited tables that inform the above bash scripts on how to conduct enrichment/depletion tests (Row from first to last: Output Name, Treatment Groups, Control Groups)
    • Both bash scripts were run using the default settings indicated at the top of the scripts. Except for the threads/cores option, all settings use the defaults of the respective packages.
  2. Raw_FASTQ: Original (compressed) FASTQ.gz files. Put your input files here, ideally with sensible replicate/sample labels.

    • Nat_Comm_BCBL1_Download.sh downloads some sample data (Brunello plasmid library input and dropout data for 3 replicates each of BCBL1 Cas9 clonal/pooled cell lines) from the Gottwein Lab publication "Gene essentiality landscape and druggable oncogenic dependencies in herpesviral primary effusion lymphoma" (Manzano et al., Nat Comm. 2018).

    • To experiment, try comparing the BCBL1 clonal and pooled cell lines to the plasmid input to observe dropout/essentiality using CRISPR.sh, then use drugz.sh to characterize differences between the clonal and pooled Cas9 cell lines; in theory, few genes should differ.

  3. Trim_FASTQ: Cutadapt output (generated after running CRISPR.sh)

  4. Libraries: sgRNA sequences/IDs and control sgRNA ID lists for MAGeCK/DrugZ

  5. Bowtie: Bowtie 1.3 index files and .bam alignments of trimmed reads to library (generated after running CRISPR.sh)

  6. MAGeCK: Output for MAGeCK (generated after running CRISPR.sh)

    • Count table is in main folder
    • Counts folder contains three subfolders
      • Logs
      • Other--contains median normalized versions of read count table and summary table
      • R_Output--contains scripts generated by MAGeCK for rough analysis/visualization in R/RStudio
    • Tests folder contains five subfolders
      • Logs
      • sgRNA_Results--contains tables of sgRNA-level output for each test
      • Gene_Results--contains tables of gene-level output for each test
      • R_Output--contains scripts generated by MAGeCK for rough analysis/visualization in R/RStudio
      • Figures--Figures generated by myself or MAGeCK's R Output (code/potentially copies of data in folder)
  7. DrugZ: Output for DrugZ (generated after running drugz.sh; run CRISPR.sh first)

    • Contains only a single output file (DrugZ statistical output)
    • Note: You will need to download drugz.py to run drugz.sh
    • Note: DrugZ requires equal numbers of replicates for control/treatment groups to run.
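Because DrugZ needs equal replicate numbers, it can help to check the DrugZ tests table before running drugz.sh. The sketch below assumes one test per row with three tab-separated fields (Output Name, Treatment Groups, Control Groups) and comma-separated sample labels within each group. That layout, the helper name `check_table`, and the sample labels are assumptions for illustration, not taken from the scripts.

```bash
#!/bin/bash
# check_table FILE: report rows whose treatment and control groups contain
# different numbers of comma-separated sample labels (assumed table layout).
check_table() {
    awk -F'\t' '
        BEGIN { bad = 0 }
        NF < 3 { next }                      # skip blank or incomplete rows
        {
            n_t = split($2, t, ",")          # number of treatment replicates
            n_c = split($3, c, ",")          # number of control replicates
            if (n_t != n_c) {
                printf "UNBALANCED: %s (%d treatment vs %d control)\n", $1, n_t, n_c
                bad = 1
            }
        }
        END { exit bad }                     # non-zero exit if any row is unbalanced
    ' "$1"
}

# Hypothetical demo table: the first row is balanced, the second is not.
demo_table=$(mktemp)
printf 'Clonal_vs_Pooled\tClonal_1,Clonal_2,Clonal_3\tPooled_1,Pooled_2,Pooled_3\n' > "$demo_table"
printf 'Bad_Test\tClonal_1,Clonal_2\tPooled_1\n' >> "$demo_table"

check_table "$demo_table" || echo "Fix the rows above before running drugz.sh"
```

The function exits non-zero when any row is unbalanced, so it can gate a later step with `check_table table.txt && bash drugz.sh`.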

CRISPR.sh command-line arguments

Currently this script supports a few optional arguments with default values based on the most common usage scenarios I've come across.

  • -p Cores/Threads argument passed to cutadapt, bowtie, and MAGeCK count (default = 1)
  • -n Project name/file prefix for this run, passed to MAGeCK for output prefixes. (defaults to the directory name that CRISPR.sh is found in)
  • -l CRISPR library to use. (defaults to Brunello)
    • FASTA files/control guide tables have already been included for libraries in use by the Gottwein Lab (Brunello, Human_GeCKOv2_A, Human_GeCKOv2_B, Human_GeCKOv2_Full, Human_SAM).
    • You can also download these files yourself. Files need to be in a tab-delimited format with columns of "sgRNA ID", "Sequence", and "Gene/Target". Often they will be comma-separated, so just convert the commas to tabs (\t).
    • You should also generate a control guide list suffixed with _controls.txt, as shown in Libraries/Controls for Brunello (note: I haven't added control guide lists for GeCKO/SAM yet)
    • If your library does not have control guides, or they are not included for some reason, you will need to modify the script at line 121 or provide a blank .txt file with your library name and the suffix _controls under Libraries/Controls.
  • -a single adapter sequence (5' or 3') to trim, passed to cutadapt. (currently defaults to "g cgaaacaccg" which should work for most LentiCRISPRv2-based libraries if sequenced in the forward sense direction relative to transcription)
    • (prefix desired sequence with g to indicate 5', a to indicate 3')
    • current workflow assumes that your libraries are prepared in such a way that only one end of the read needs to be trimmed and the other can be trimmed to length (20 bp for a sgRNA, see -t)
  • -t trimming length, passed to cutadapt for length trimming (defaults to 20 nt)
  • -m minimum trimmed read size, passed to cutadapt for min/max read filtering (defaults to 20)
  • -x Non-aligned direction, passed to Bowtie 1.3 as the alignment direction to ignore (defaults to rc, equivalent to --norc; can also take fw for --nofw)
  • -c Normalization method for MAGeCK (defaults to median)
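To make the option descriptions above concrete, here is a minimal sketch of how the adapter and direction values could translate into tool flags. The helper names `adapter_flag` and `strand_flag` are hypothetical, and CRISPR.sh may build its commands differently. The flags themselves are standard: cutadapt uses `-g`/`-a` for 5'/3' adapters, `-l` to shorten reads to a length, `-m` for minimum length and `-j` for cores, and bowtie uses `-p` for threads and `--norc`/`--nofw` to skip a strand.

```bash
#!/bin/bash
# adapter_flag "g cgaaacaccg" prints "-g cgaaacaccg"
# (leading g = 5' adapter, leading a = 3' adapter).
adapter_flag() {
    side=${1%% *}      # text before the first space: g or a
    seq=${1#* }        # text after the first space: the adapter sequence
    case "$side" in
        g|a) echo "-$side $seq" ;;
        *)   echo "adapter must start with 'g ' or 'a '" >&2; return 1 ;;
    esac
}

# strand_flag rc prints "--norc"; strand_flag fw prints "--nofw".
strand_flag() {
    case "$1" in
        rc) echo "--norc" ;;
        fw) echo "--nofw" ;;
        *)  echo "direction must be rc or fw" >&2; return 1 ;;
    esac
}

# Defaults as documented above; input/output file arguments are omitted.
threads=1; adapter="g cgaaacaccg"; trim_len=20; min_len=20; skip="rc"
echo "cutadapt -j $threads $(adapter_flag "$adapter") -l $trim_len -m $min_len"
echo "bowtie -p $threads $(strand_flag "$skip")"
```

With the documented defaults this prints `cutadapt -j 1 -g cgaaacaccg -l 20 -m 20` and `bowtie -p 1 --norc`.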

drugz.sh command-line arguments

drugz.sh supports the -l and -n arguments above, to indicate the library used and an output prefix.
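Both scripts find the library by the name given to -l, so a library that is not bundled has to be prepared first. The following is a rough sketch of the preparation described for -l under CRISPR.sh: converting a comma-separated library file to tabs and creating a blank control guide list. The library name `MyLibrary`, the guide rows, the `.txt` extension and the temporary working directory are all placeholders; check the bundled Brunello files for the exact naming the scripts expect.

```bash
#!/bin/bash
# All paths and the library name "MyLibrary" are hypothetical placeholders.
workdir=$(mktemp -d)
mkdir -p "$workdir/Libraries/Controls"

# Stand-in for a downloaded comma-separated library file
# (columns: sgRNA ID, Sequence, Gene/Target).
printf 'sg001,ACGTACGTACGTACGTACGT,GENE1\nsg002,TTGCATTGCATTGCATTGCA,GENE2\n' \
    > "$workdir/MyLibrary.csv"

# Convert commas to tabs (assumes no quoted fields that contain commas).
tr ',' '\t' < "$workdir/MyLibrary.csv" > "$workdir/Libraries/MyLibrary.txt"

# Blank control guide list for a library without control guides.
: > "$workdir/Libraries/Controls/MyLibrary_controls.txt"

cat "$workdir/Libraries/MyLibrary.txt"
```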

Dependencies/Other notes

CRISPR.sh relies on the following command-line tools:

  • Cutadapt (tested w/ ver 3.1)
  • Bowtie (tested w/ ver 1.3.0)
  • Samtools (tested w/ ver 1.9)
    • Samtools frequently encounters installation issues on many Linux distros (not sure about MacOS)
    • You can solve this by explicitly installing bzip2 ver 1.0.8
  • MAGeCK (tested w/ ver 0.5.9.4; note: MAGeCK is not Windows compatible)
  • I also recommend installing pigz for multi-core decompression support if you are sticking with compressed fastq.gz files

DrugZ relies on drugz.py and its dependencies (six, pandas, numpy, scipy). All of these except six are usually present in data science/bioinformatics Python environments.

To set up my environment, I used conda with the following command (the conda-forge and bioconda channels are needed):

conda create -n crisprenv -c conda-forge -c bioconda -c defaults mageck=0.5.9.4 cutadapt=3.1 bowtie=1.3.0 samtools=1.9 bzip2=1.0.8 six pandas scipy numpy
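After activating the environment, a quick check that the command-line tools are actually on the PATH can save a failed run. This is a minimal sketch: the helper name `check_tools` is hypothetical and is not part of CRISPR.sh.

```bash
#!/bin/bash
# check_tools TOOL...: print each tool that is not on PATH;
# return 1 if any tool is missing, 0 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names of the tools listed above (pigz is optional).
check_tools cutadapt bowtie samtools mageck pigz \
    || echo "Activate crisprenv or install the tools listed above."
```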

I've only tested this script using the bash and dash shells. It runs properly on bash, but encountered issues at lines 119-122 (mageck test/IFS while loop) on Ubuntu 20.04 using the default dash shell. The #!/bin/bash shebang should be preserved for that reason, as I cannot promise it will run on alternative shells, like zsh.
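For readers unfamiliar with the pattern, the following is an illustrative sketch of an IFS while loop that reads a tests table and prints one `mageck test` command per row. It is not the code at lines 119-122 of CRISPR.sh. It assumes one test per row with three tab-separated fields, and the table contents and the count table name `MyScreen.count.txt` are hypothetical. The commands are only echoed (a dry run), and the tab is built with `printf` so the loop does not rely on bash-only quoting.

```bash
#!/bin/bash
# require_bash: return 0 when running under bash, 1 otherwise.
require_bash() {
    [ -n "${BASH_VERSION:-}" ]
}

# print_tests TABLE COUNT_TABLE: echo one mageck test command per table row
# (assumed fields: Output Name, Treatment Groups, Control Groups).
print_tests() {
    tab=$(printf '\t')
    while IFS="$tab" read -r name treatment control || [ -n "$name" ]; do
        [ -z "$name" ] && continue           # skip blank lines
        echo "mageck test -k $2 -t $treatment -c $control -n $name"
    done < "$1"
}

require_bash || echo "Warning: not running under bash; behaviour is untested." >&2

# Hypothetical table with a single test row.
tests_table=$(mktemp)
printf 'Clonal_Dropout\tClonal_1,Clonal_2,Clonal_3\tPlasmid_1,Plasmid_2,Plasmid_3\n' > "$tests_table"

print_tests "$tests_table" MyScreen.count.txt
```

Printing the commands first makes it easy to confirm that each row is split into the intended name, treatment and control fields before any test is run.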