Merge pull request #198 from broadinstitute/merge_dev_to_main
Merge dev to main
atancoder authored Mar 4, 2024
2 parents 25e7de9 + 1b45c6b commit 813a182
Showing 52 changed files with 14,869 additions and 11,803 deletions.
6 changes: 4 additions & 2 deletions config/config.yaml
@@ -5,7 +5,7 @@
biosamplesTable: "config/config_biosamples_chr22.tsv" # replace with your own config/config_biosamples.tsv

### OUTPUT DATA
-predictions_results_dir: "results"
+results_dir: "results/"


### REFERENCE FILES
@@ -16,6 +16,7 @@ ref:
genes: "reference/hg38/CollapsedGeneBounds.hg38.bed"
genome_tss: "reference/hg38/CollapsedGeneBounds.hg38.TSS500bp.bed"
qnorm: "reference/EnhancersQNormRef.K562.txt"
+abc_thresholds: "reference/abc_thresholds.tsv"

### RULE SPECIFIC PARAMS
params_macs:
@@ -37,10 +38,11 @@ params_predict:
flags: "--scale_hic_using_powerlaw"
hic_gamma: 1.024238616787792 # avg hic gamma
hic_scale: 5.9594510043736655 # avg hic scale
+hic_pseudocount_distance: 5000 # powerlaw at this distance is added to the contact for all predictions

params_filter_predictions:
score_column: 'ABC.Score'
-threshold: .02
+threshold: null # null => Automatic determination based on input
include_self_promoter: True
only_expressed_genes: False
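
Note on ``hic_pseudocount_distance``: the config comment says the powerlaw value at this distance is added to the contact for every prediction. The sketch below illustrates that idea only. The functional form ``log(contact) = scale - gamma * log(distance)`` and both function names are assumptions made for illustration; this diff does not show how the pipeline itself computes contact.

```python
import math

# Values copied from params_predict in config/config.yaml.
HIC_GAMMA = 1.024238616787792
HIC_SCALE = 5.9594510043736655
HIC_PSEUDOCOUNT_DISTANCE = 5000  # base pairs


def powerlaw_contact(distance_bp, gamma=HIC_GAMMA, scale=HIC_SCALE):
    """Hypothetical helper. Assumed form: log(contact) = scale - gamma * log(distance)."""
    # Clamp to 1 bp so log() is defined for zero distance (an assumption of this sketch).
    return math.exp(scale - gamma * math.log(max(distance_bp, 1)))


def contact_with_pseudocount(raw_contact, pseudocount_distance=HIC_PSEUDOCOUNT_DISTANCE):
    """Hypothetical helper: add the powerlaw value at the pseudocount distance."""
    return raw_contact + powerlaw_contact(pseudocount_distance)
```

Under these assumptions the pseudocount is a constant offset, so an enhancer-gene pair with zero observed contact still receives a small non-zero contact value.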

20 changes: 12 additions & 8 deletions docs/tables/perf_summary.csv
@@ -1,9 +1,13 @@
Activity,Contact,AUPRC,Precision @ 70% recall,Threshold @ 70% recall
-DNase-seq x H3K27ac,K562 Hi-C,0.61,0.56,0.023631
-DNase-seq x H3K27ac,avg. Hi-C,0.59,0.51,0.01609
-DNase-seq x H3K27ac,Powerlaw,0.58,0.48,0.016455
-DNase-seq,K562 Hi-C,0.6,0.52,0.024674
-DNase-seq,Powerlaw,0.56,0.44,0.01587
-ATAC-seq x H3K27ac,K562 Hi-C,0.56,0.5,0.024511
-ATAC-seq x H3K27ac,avg. Hi-C,0.54,0.46,0.016279
-ATAC-seq x H3K27ac,Powerlaw,0.53,0.45,0.016684
+DNase-seq x H3K27ac,K562 Hi-C,0.62,0.56,0.027
+DNase-seq,K562 Hi-C,0.6,0.51,0.024
+DNase-seq x H3K27ac,avg. Hi-C,0.59,0.51,0.016
+DNase-seq x H3K27ac,Powerlaw,0.58,0.48,0.017
+DNase-seq,avg. Hi-C,0.57,0.48,0.016
+DNase-seq,Powerlaw,0.56,0.44,0.016
+ATAC-seq x H3K27ac,K562 Hi-C,0.57,0.5,0.025
+ATAC-seq x H3K27ac,avg. Hi-C,0.55,0.44,0.016
+ATAC-seq x H3K27ac,Powerlaw,0.54,0.44,0.016
+ATAC-seq,K562 Hi-C,0.52,0.43,0.021
+ATAC-seq,avg. Hi-C,0.5,0.34,0.012
+ATAC-seq,Powerlaw,0.44,0.37,0.013
4 changes: 4 additions & 0 deletions docs/usage/getting_started.rst
@@ -46,6 +46,10 @@ You can see what commands snakemake will run
To read about each step, check out :ref:`ABC-methods`

The predictions will be stored in the ``{ABC_DIR}/results/{biosample_name}/Predictions`` folder.
``EnhancerPredictions_threshold_.*.tsv`` contains the predicted ABC E-G links that meet the ABC Score threshold.
``EnhancerPredictionsAllPutative.tsv.gz`` contains all (unthresholded) E-G links with the ABC Score.

To sanity check your output from ABC, you can check out the QC metrics in the ``{ABC_DIR}/results/{biosample_name}/Metrics`` folder.
For comparison, you can find the QC plots for our K562 run `here <https://drive.google.com/file/d/1fyd7ONKDgP646fOIafJhXcXnAk_6LCi1/view?usp=sharing>`_.
The metrics include plots such as the number of enhancers per gene and the number of enhancer-gene links per chromosome.
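
Note on locating the outputs described in this hunk: a minimal sketch, assuming the directory layout ``{ABC_DIR}/results/{biosample_name}/Predictions`` and the two file name patterns quoted in the docs text. The function name ``find_prediction_files`` is hypothetical and is not part of the ABC codebase.

```python
import re
from pathlib import Path

# File name pattern quoted in the docs text; the threshold value is part of the name.
THRESHOLDED = re.compile(r"EnhancerPredictions_threshold_.*\.tsv")


def find_prediction_files(abc_dir, biosample_name):
    """Hypothetical helper: collect prediction outputs for one biosample."""
    pred_dir = Path(abc_dir) / "results" / biosample_name / "Predictions"
    if not pred_dir.is_dir():
        return {"thresholded": [], "all_putative": None}
    # Thresholded E-G links: one file per threshold value.
    thresholded = sorted(
        p for p in pred_dir.iterdir() if THRESHOLDED.fullmatch(p.name)
    )
    # All (unthresholded) E-G links with their ABC Score.
    all_putative = pred_dir / "EnhancerPredictionsAllPutative.tsv.gz"
    return {
        "thresholded": thresholded,
        "all_putative": all_putative if all_putative.exists() else None,
    }
```

Calling ``find_prediction_files(".", "K562_chr22")`` after a run would return the paths that exist; a biosample with no ``Predictions`` folder yields an empty result instead of raising.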
57 changes: 4 additions & 53 deletions docs/usage/methods.rst
@@ -32,22 +32,6 @@ Description:
#. Remove any regions listed in the 'blocklist' and include any regions listed in the 'includelist'
#. Merge any overlapping regions


-**Example:**
-
-.. code-block:: console
-$ python workflow/scripts/makeCandidateRegions.py \
---narrowPeak results/K562_chr22/Peaks/macs2_peaks.narrowPeak.sorted \
---accessibility example_chr/chr22/ENCFF860XAE.chr22.sorted.se.bam \
---outDir results/K562_chr22/Peaks \
---chrom_sizes reference/hg38/GRCh38_EBV.no_alt.chrom.sizes.tsv \
---chrom_sizes_bed results/tmp/reference/hg38/GRCh38_EBV.no_alt.chrom.sizes.tsv.bed \
---regions_blocklist reference/hg38/GRCh38_unified_blacklist.bed \
---regions_includelist example_chr/chr22/RefSeqCurated.170308.bed.CollapsedGeneBounds.chr22.hg38.TSS500bp.bed \
---peakExtendFromSummit 250 \
---nStrongestPeak 150000
The method of defining candidate elements includes the following steps:

- Peak-calling with MACS2
@@ -136,22 +120,6 @@ Output
Description:
- Counts DNase-seq (or ATAC-seq) and H3K27ac ChIP-seq reads in candidate enhancer regions

-**Example:**
-
-.. code-block:: console
-$ python workflow/scripts/run.neighborhoods.py \
---candidate_enhancer_regions results/K562_chr22/Peaks/macs2_peaks.narrowPeak.sorted.candidateRegions.bed \
---DHS example_chr/chr22/ENCFF860XAE.chr22.sorted.se.bam \
---default_accessibility_feature DHS \
---chrom_sizes reference/hg38/GRCh38_EBV.no_alt.chrom.sizes.tsv \
---chrom_sizes_bed results/tmp/reference/hg38/GRCh38_EBV.no_alt.chrom.sizes.tsv.bed \
---outdir results/K562_chr22/Neighborhoods \
---genes results/K562_chr22/processed_genes_file.bed \
---ubiquitously_expressed_genes reference/UbiquitouslyExpressedGenes.txt \
---qnorm reference/EnhancersQNormRef.K562.txt \
---H3K27ac example_chr/chr22/ENCFF790GFL.chr22.sorted.se.bam
2.1. Activity scales with read counts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Enhancer activity in the ABC model is estimated by counting reads in peaks (from DNase-seq, H3K27ac ChIP-seq, etc.). The quantitative signal in these assays is informative regarding the strength of enhancers, and the ABC model assumes that this relationship is linear.
@@ -344,26 +312,6 @@ Description:
- Makes predictions following the Activity by Contact model
- Utilizes HiC data for contact; otherwise, uses powerlaw

-**Example:**
-
-.. code-block:: console
-$ python workflow/scripts/predict.py \
---enhancers results/K562_chr22/Neighborhoods/EnhancerList.txt \
---outdir results/K562_chr22/Predictions \
---score_column ABC.Score \
---chrom_sizes reference/hg38/GRCh38_EBV.no_alt.chrom.sizes.tsv \
---accessibility_feature DHS \
---cellType K562_chr22 \
---genes results/K562_chr22/Neighborhoods/GeneList.txt \
---hic_gamma 1.024238616787792 \
---hic_scale 5.9594510043736655 \
---hic_file https://www.encodeproject.org/files/ENCFF621AIY/@@download/ENCFF621AIY.hic \
---hic_type hic \
---hic_resolution 5000 \
---scale_hic_using_powerlaw
5. Interpreting the ABC score
------------------------------------

@@ -374,7 +322,7 @@ model against CRISPR enhancer perturbation in K562 cells (

These analyses show that ABC scores reliably predict enhancer-gene regulatory interactions
that were experimentally inferred in the CRISPR experiments. At the recall of 70%, an ABC model
-using DNase-seq + cell-type specific Hi-C data achieves a precision of 52%, meaning around half of
+using DNase-seq + cell-type specific Hi-C data achieves a precision of 51%, meaning around half of
the predicted enhancer-gene regulatory interactions will be true positives. The ABC scores
themselves correlate with the CRISPR effect size on gene expression when perturbing an enhancer,
however not in a precise linear fashion. This probably has different technical and biological
@@ -390,5 +338,8 @@ common ABC models can be found in the table below:
:file: /tables/perf_summary.csv
:header-rows: 1

+We automatically choose the best threshold based on your input, but you can specify a threshold value
+yourself in the config.yaml file.

Our CRISPR benchmarking pipeline can be used to infer thresholds for non-standard ABC models and is
available on `Github <https://github.com/EngreitzLab/CRISPR_comparison>`_.
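
Note on the "Precision @ 70% recall" and "Threshold @ 70% recall" columns: the sketch below shows one common way to read such values off a ranked list of scored pairs. It is an illustration under stated assumptions, not the method used by the CRISPR benchmarking pipeline linked above; the function name ``threshold_at_recall`` and the toy data in the usage note are hypothetical.

```python
def threshold_at_recall(scores, labels, target_recall=0.7):
    """Return (threshold, precision, recall) at the highest score cutoff
    whose recall reaches target_recall.

    scores: ABC scores, one per enhancer-gene pair.
    labels: truthy where the pair is an experimentally supported positive.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_pos = sum(1 for y in labels if y)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Walk the pairs from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        j = i
        # Pairs with identical scores are admitted together.
        while j < len(ranked) and ranked[j][0] == cutoff:
            tp += 1 if ranked[j][1] else 0
            j += 1
        recall = tp / total_pos
        if recall >= target_recall:
            # j pairs score >= cutoff, so precision is tp / j.
            return cutoff, tp / j, recall
        i = j
    raise ValueError("target_recall cannot be reached")
```

For example, with scores ``[0.9, 0.8, 0.7, 0.6, 0.5, 0.4]`` and labels ``[True, True, False, True, False, True]``, the first cutoff that reaches 70% recall is 0.6, where three of the four positives are recovered among four predicted pairs.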
2 changes: 1 addition & 1 deletion docs/usage/scATAC.rst
@@ -33,7 +33,7 @@ Let's convert the fragment file to .tagAlign and index it
(abc-env) [atan5133@sh03-04n23 /oak/stanford/groups/engreitz/Users/atan5133/data] (job 37050919) $ cd encode_scatac_dcc_2/results/ENCSR308ZGJ-1/fragments
-(abc-env) [atan5133@sh03-04n23 /oak/stanford/groups/engreitz/Users/atan5133/data/encode_scatac_dcc_2/results/ENCSR308ZGJ-1/fragments] (job 37050919) $ LC_ALL=C zcat fragments.tsv.gz | sed '/^#/d' | awk -v OFS='\t' '{mid=int(($2+$3)/2); print $1,$2,mid,"N",1000,"+"; print $1,mid,$3,"N",1000,"-"}' | sort -k 1,1V -k 2,2n -k3,3n --parallel 5 | bgzip -c > tagAlign.gz # Adjust --parallel 5 based on number of cpus you have. The more cpus, the faster
+(abc-env) [atan5133@sh03-04n23 /oak/stanford/groups/engreitz/Users/atan5133/data/encode_scatac_dcc_2/results/ENCSR308ZGJ-1/fragments] (job 37050919) $ LC_ALL=C zcat fragments.tsv.gz | sed '/^#/d' | awk -v OFS='\t' '{mid=int(($2+$3)/2); print $1,$2,mid,"N",1000,"+"; print $1,mid+1,$3,"N",1000,"-"}' | sort -k 1,1V -k 2,2n -k3,3n --parallel 5 | bgzip -c > tagAlign.gz # Adjust --parallel 5 based on number of cpus you have. The more cpus, the faster
(abc-env) [atan5133@sh03-04n24 /oak/stanford/groups/engreitz/Users/atan5133/data/encode_scatac_dcc_2/results/ENCSR308ZGJ-1/fragments] (job 37151429) $ tabix -p bed tagAlign.gz
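
Note on the ``mid`` to ``mid+1`` change: the sketch below mirrors the updated awk expression in Python so the effect on the two emitted records is easy to see. It is an illustration only; the function name ``fragment_to_tagalign`` is hypothetical and the pipeline itself uses the awk one-liner shown in this hunk.

```python
def fragment_to_tagalign(chrom, start, end):
    """Split one fragment into a '+' record and a '-' record,
    following the updated awk expression."""
    # Integer midpoint; matches awk's int(($2+$3)/2) for non-negative coordinates.
    mid = (start + end) // 2
    return [
        (chrom, start, mid, "N", 1000, "+"),
        # The second record now starts at mid + 1 instead of mid.
        (chrom, mid + 1, end, "N", 1000, "-"),
    ]
```

For a fragment ``chr22 100 201`` the midpoint is 150, so the records are ``chr22 100 150 N 1000 +`` and ``chr22 151 201 N 1000 -``; before this change the second record would have started at 150.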
13 changes: 13 additions & 0 deletions reference/abc_thresholds.tsv
@@ -0,0 +1,13 @@
accessibility has_h3k27ac hic_type threshold
DHS TRUE intact_hic 0.027
DHS FALSE intact_hic 0.024
DHS TRUE avg 0.016
DHS TRUE powerlaw 0.017
DHS FALSE avg 0.016
ATAC TRUE intact_hic 0.025
DHS FALSE powerlaw 0.016
ATAC TRUE avg 0.016
ATAC TRUE powerlaw 0.016
ATAC FALSE intact_hic 0.021
ATAC FALSE avg 0.012
ATAC FALSE powerlaw 0.013
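
Note on how ``threshold: null`` could be resolved against this table: a minimal sketch, assuming the file is tab-separated and that a null config value means "look up the row matching the input types". The function names ``load_thresholds`` and ``resolve_threshold`` are hypothetical, and how the pipeline derives ``hic_type`` for a biosample is not shown in this diff.

```python
import csv
import io

# Rows copied from reference/abc_thresholds.tsv (assumed tab-separated).
ABC_THRESHOLDS_TSV = """accessibility\thas_h3k27ac\thic_type\tthreshold
DHS\tTRUE\tintact_hic\t0.027
DHS\tFALSE\tintact_hic\t0.024
DHS\tTRUE\tavg\t0.016
DHS\tTRUE\tpowerlaw\t0.017
DHS\tFALSE\tavg\t0.016
ATAC\tTRUE\tintact_hic\t0.025
DHS\tFALSE\tpowerlaw\t0.016
ATAC\tTRUE\tavg\t0.016
ATAC\tTRUE\tpowerlaw\t0.016
ATAC\tFALSE\tintact_hic\t0.021
ATAC\tFALSE\tavg\t0.012
ATAC\tFALSE\tpowerlaw\t0.013
"""


def load_thresholds(tsv_text):
    """Hypothetical helper: map (accessibility, has_h3k27ac, hic_type) to a threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        (row["accessibility"], row["has_h3k27ac"] == "TRUE", row["hic_type"]):
            float(row["threshold"])
        for row in reader
    }


def resolve_threshold(configured, accessibility, has_h3k27ac, hic_type, table):
    """Hypothetical helper: an explicit config value wins; None falls back to the table."""
    if configured is not None:
        return float(configured)
    return table[(accessibility, has_h3k27ac, hic_type)]
```

With this sketch, ``resolve_threshold(None, "DHS", True, "intact_hic", table)`` gives 0.027, while ``resolve_threshold(0.02, "DHS", True, "intact_hic", table)`` keeps the user-supplied 0.02.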
4 changes: 2 additions & 2 deletions tests/config/generic_config.yml
@@ -6,7 +6,7 @@ TEST_CONFIG_NAME: "generic"
biosamplesTable: "tests/config/test_biosamples.tsv" # replace with your own config/config-biosamples.tsv

### OUTPUT DATA
-predictions_results_dir: "tests/test_output/generic"
+results_dir: "tests/test_output/generic/"

### REFERENCE FILES
ref:
@@ -40,7 +40,7 @@ params_predict:

params_filter_predictions:
score_column: 'ABC.Score'
-threshold: .02
+threshold: null
include_self_promoter: True
only_expressed_genes: False
