A basic working example that you can follow through to semi-automate the analysis of pooled CRISPR screen data in a uniform way using MAGeCK and/or DrugZ for enrichment/depletion analysis. See ReadMe for more details. Testing table files are templates, not working examples.


  1. Main Folder

    • bash script to process raw FASTQ files through cutadapt, bowtie 1.3, and MAGeCK (counting/RRA testing)
    • bash script to process count table generated by MAGeCK using the script
    • MAGeCK_ or DrugZ_ Tests_Table.txt: tab-delimited tables that inform the above bash scripts on how to conduct enrichment/depletion tests (Row from first to last: Output Name, Treatment Groups, Control Groups)
    • Both bash scripts were run with the default settings indicated at the top of each script. Apart from the threads/cores option, all other settings use the defaults of the respective packages.
  2. Raw_FASTQ: Original (compressed) FASTQ.gz files. Place your input files here (ideally with sensible replicate/sample labels!)

    • A script to download some sample data (Brunello plasmid library input and dropout data for 3x replicates each of BCBL1 Cas9 clonal/pooled cell lines) from the Gottwein Lab publication, "Gene essentiality landscape and druggable oncogenic dependencies in herpesviral primary effusion lymphoma" Manzano et al., Nat Comm. 2018 can be found in

    • To experiment, try comparing BCBL1 clonal & pooled cell lines to the plasmid input to observe dropout/essentiality using, then use to characterize differences between clonal/pooled Cas9 cell lines--in theory few genes should be differential.

  3. Trim_FASTQ: Cutadapt output (generated after running

  4. Libraries: sgRNA sequences/IDs and control sgRNA ID lists for MAGeCK/DrugZ

  5. Bowtie: Bowtie 1.3 index files and .bam alignments of trimmed reads to library (generated after running

  6. MAGeCK: Output for MAGeCK (generated after running

    • Count table is in main folder
    • Counts folder contains three subfolders
      • Logs
      • Other--contains median normalized versions of read count table and summary table
      • R_Output--contains scripts generated by MAGeCK for rough analysis/visualization in R/RStudio
    • Tests folder contains five subfolders
      • Logs
      • sgRNA_Results--contains tables of sgRNA-level output for each test
      • Gene_Results--contains tables of gene-level output for each test
      • R_Output--contains scripts generated by MAGeCK for rough analysis/visualization in R/RStudio
      • Figures--Figures generated by me or by MAGeCK's R output (code and possibly copies of the data are in the folder)
  7. DrugZ: Output for DrugZ (generated after running --> run first)

    • Contains only a single output file (DrugZ statistical output)
    • Note: You will need to download to run
    • Note: DrugZ requires equal numbers of replicates for control/treatment groups to run.

command-line arguments

Currently this script supports a few optional arguments with default values based on the most common usage scenarios I've come across.

  • -p Cores/Threads argument passed to cutadapt, bowtie, and MAGeCK count (default = 1)
  • -n Project name/file prefix for this run, passed to MAGeCK for output prefixes. (defaults to the directory name that is found in)
  • -l CRISPR library to use. (defaults to Brunello)
    • FASTA files/control guide tables have already been included for libraries in use by the Gottwein Lab (Brunello, Human_GeCKOv2_A, Human_GeCKOv2_B, Human_GeCKOv2_Full, Human_SAM).
    • You can also download these files yourself. Files need to be in a tab-delimited format with the columns "sgRNA ID", "Sequence", and "Gene/Target". Often they will be comma-separated, so just convert the commas to tabs (\t).
    • You should also generate a control guide list suffixed with _controls.txt, as shown in Libraries/Controls for Brunello (note: I haven't added control guide lists for GeCKO/SAM yet).
    • If your library does not have control guides/they are not included for some reason, you will need to modify the script at line 121 or provide a blank .txt file with your library name and the suffix _controls under Libraries/Controls.
  • -a single adapter sequence (5' or 3') to trim, passed to cutadapt. (currently defaults to "g cgaaacaccg" which should work for most LentiCRISPRv2-based libraries if sequenced in the forward sense direction relative to transcription)
    • (prefix desired sequence with g to indicate 5', a to indicate 3')
    • current workflow assumes that your libraries are prepared in such a way that only one end of the read needs to be trimmed and the other can be trimmed to length (20 bp for a sgRNA, see -t)
  • -t trimming length, passed to cutadapt for length trimming (defaults to 20 nt)
  • -m minimum trimmed read size, passed to cutadapt for min/max read filtering (defaults to 20)
  • -x Non-aligned direction, passed to Bowtie 1.3 as the alignment direction to ignore (defaults to rc, equivalent to --norc; can also take fw for --nofw)
  • -c Normalization method for MAGeCK (defaults to median)

command-line arguments

supports the -l and -n arguments above, to indicate the library used and an output prefix.
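The optional arguments and defaults listed above could be handled with `getopts` roughly as follows. This is a minimal sketch and not the script's actual code: the function name `parse_args`, all variable names, and the usage text are hypothetical, and the current directory is used only as a stand-in for the documented project-name default.

```shell
#!/bin/bash
# Hypothetical sketch: variable names and the usage text are illustrative only.

parse_args() {
    # Defaults taken from the option list above.
    CORES=1
    PREFIX=$(basename "$PWD")   # assumption: current directory stands in for the documented default
    LIBRARY="Brunello"
    ADAPTER="g cgaaacaccg"
    TRIM_LENGTH=20
    MIN_LENGTH=20
    SKIP_STRAND="rc"
    NORM_METHOD="median"

    OPTIND=1   # reset so the function can be called more than once
    while getopts "p:n:l:a:t:m:x:c:" opt; do
        case "$opt" in
            p) CORES=$OPTARG ;;
            n) PREFIX=$OPTARG ;;
            l) LIBRARY=$OPTARG ;;
            a) ADAPTER=$OPTARG ;;
            t) TRIM_LENGTH=$OPTARG ;;
            m) MIN_LENGTH=$OPTARG ;;
            x) SKIP_STRAND=$OPTARG ;;
            c) NORM_METHOD=$OPTARG ;;
            *) echo "Usage: [-p cores] [-n name] [-l library] [-a adapter] [-t length] [-m min] [-x rc|fw] [-c norm]" >&2
               return 1 ;;
        esac
    done

    # Translate the -x value into the matching Bowtie 1 flag.
    case "$SKIP_STRAND" in
        rc) BOWTIE_STRAND_FLAG="--norc" ;;
        fw) BOWTIE_STRAND_FLAG="--nofw" ;;
        *)  echo "-x must be 'rc' or 'fw'" >&2; return 1 ;;
    esac
}

parse_args -p 8 -l Human_GeCKOv2_A -x fw
echo "cores=$CORES library=$LIBRARY bowtie_flag=$BOWTIE_STRAND_FLAG norm=$NORM_METHOD"
# prints: cores=8 library=Human_GeCKOv2_A bowtie_flag=--nofw norm=median
```

Setting every default before the `getopts` loop means any option left off the command line simply keeps its documented value.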

Dependencies/Other notes

relies on the following command-line tools:

  • Cutadapt (tested w/ ver
  • Bowtie (tested w/ v1.3.0)
  • Samtools (tested w/ ver 1.9)
    • Samtools frequently encounters installation issues on many Linux distros (not sure about MacOS)
    • You can solve this by explicitly installing bzip2 ver 1.0.8
  • MAGeCK (tested w/ ver 1.3.0; note: MAGeCK is not Windows compatible)
  • I also recommend installing pigz for multi-core decompression support if you are sticking with compressed fastq.gz files
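Before a long run, it can help to confirm that the tools listed above are on your PATH. The following is a minimal sketch, not part of the repository's scripts: the helper name `missing_tools` is hypothetical, and the executable names (`cutadapt`, `bowtie`, `samtools`, `mageck`, `pigz`) are the usual ones for these packages but are assumed here.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every tool in "$@" that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Required tools (executable names are assumptions).
missing=$(missing_tools cutadapt bowtie samtools mageck)
if [ -n "$missing" ]; then
    # $missing is intentionally unquoted so printf repeats once per tool name.
    printf 'Missing required tool: %s\n' $missing >&2
fi

# pigz is only a recommendation, so just mention it when absent.
if ! command -v pigz >/dev/null 2>&1; then
    echo "Note: pigz not found (optional, recommended for multi-core decompression)" >&2
fi
```

The sketch only reports what is missing and never exits, so it can be pasted at the top of a session without interrupting it.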

DrugZ relies on and its dependencies (six, pandas, numpy, scipy) -- most of these except for six are usually present in most data science/bioinformatics python environments.

To set up my environment, I used conda with the following command (the conda-forge and bioconda channels are needed):

    conda create -n crisprenv -c conda-forge -c bioconda -c defaults mageck= cutadapt=3.1 bowtie=1.3.0 samtools=1.9 bzip2=1.0.8 six pandas scipy numpy

I've only tested this script using the bash and dash shells. It runs properly under bash, but ran into issues between lines 119-122 (the mageck test/IFS while loop) on Ubuntu 20.04 when run with the default dash shell. The #!/bin/bash shebang should be preserved for that reason; I cannot promise that the script will run under other shells, such as zsh.
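For readers unfamiliar with the pattern, an IFS while loop over a tab-delimited tests table might look roughly like the sketch below. It is not the code from the script. It assumes each line of the table holds three tab-separated fields (output name, treatment groups, control groups) and that sample labels within a group are comma-separated, as `mageck test` expects for `-t` and `-c`. The function names, sample labels, and `counts.txt` are hypothetical, and the commands are only printed, not run.

```shell
#!/bin/bash
# Hypothetical sketch of reading a tests table; nothing here runs MAGeCK.

# Count comma-separated labels in a group string, e.g. "a,b,c" -> 3.
count_samples() {
    if [ -z "$1" ]; then
        echo 0
        return
    fi
    commas=$(printf '%s' "$1" | tr -cd ',')
    echo $(( ${#commas} + 1 ))
}

# Print one dry-run "mageck test" command per table line.
# Assumed layout: name<TAB>treatment labels<TAB>control labels
list_tests() {
    table=$1
    counts=$2
    tab=$(printf '\t')
    while IFS="$tab" read -r name treatment control || [ -n "$name" ]; do
        [ -z "$name" ] && continue   # skip blank lines
        if [ "$(count_samples "$treatment")" -ne "$(count_samples "$control")" ]; then
            echo "Warning: $name has unequal replicate numbers (a problem for DrugZ)" >&2
        fi
        echo "mageck test -k $counts -t $treatment -c $control -n $name"
    done < "$table"
}

# Demo with made-up sample labels.
demo_table=$(mktemp)
printf 'Clonal_vs_Plasmid\tclonal_1,clonal_2,clonal_3\tplasmid_1,plasmid_2,plasmid_3\n' > "$demo_table"
printf 'Pooled_vs_Plasmid\tpooled_1,pooled_2\tplasmid_1,plasmid_2,plasmid_3\n' >> "$demo_table"
list_tests "$demo_table" counts.txt
rm -f "$demo_table"
```

The `|| [ -n "$name" ]` guard lets the loop process a final line that lacks a trailing newline, which is common in hand-edited table files.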