Merge pull request #62 from cnobles/vc_incorp

Implement Version Control on IncorpData
cnobles · Aug 22, 2019 · 84682be
2 parents 8298da7 + ecce001
commit 84682be
Showing 60 changed files with 3,005 additions and 1,486 deletions.
diff --git a/.version b/.version
@@ -1 +1 @@
-v0.9.9
+v1.0.0
diff --git a/README.md b/README.md
@@ -10,194 +10,17 @@ Bioinformatic pipeline for processing iGUIDE and GUIDE-seq samples.
 ### Description
 iGUIDE is a pipeline written in [snakemake](http://snakemake.readthedocs.io/) for processing and analyzing double-strand DNA break events. These events may be induced, such as by designer nucleases like Cas9, or spontaneous, as produced through DNA replication or ionizing radiation. A laboratory bench-side protocol accompanies this software pipeline, and can be found [**https://doi.org/10.1186/s13059-019-1625-3**](https://doi.org/10.1186/s13059-019-1625-3). 
 
-Below, this readme gives the reader a overview of the pipeline, including how to install and process a sample dataset. Processing a sample data set is broken into three parts: 
-
-1) developing a configuration file and sample information
-2) setting up a run directory and acquiring the sequence data
-3) initializing the pipeline and understanding the output
-
-More complete documentation can be found on [ReadTheDocs.io](https://iguide.readthedocs.io/en/latest/index.html).
-
-### Install
-To install iGUIDE, simply clone the repository to the desired destination:
-
-```
-git clone https://github.com/cnobles/iGUIDE.git
-```
-
-Then initiate the install using the install script. If you would like the installed environment to be named something other than 'iguide', the new conda environment name can be provided to the 'install.sh' script as provided below.
-
-```
-cd path/to/iGUIDE
-bash install.sh
-
-# Or
-
-cd path/to/iGUIDE
-bash install.sh -e {env_name}
-
-# Or include simulation test
-cd path/to/iGUIDE
-bash install.sh -t
-
-# For help with install options:
-cd path/to/iGUIDE
-bash install.sh -h
-```
-
-### An Example Run
-To perform a local test of running the iGUIDE informatic pipeline, run the below code after installing. This block first activates your conda environment, 'iguide' by default, and then creates a test directory within the analysis directory. The run information is stored in the run specific configuration file (config file). Using the '-np' flag with the snakemake call will perform a dry-run (won't actually process anything) and print the commands to the terminal, so you can see what snakemake is about to perform. Next, the test data can be moved to the input directory underneath the new test run directory or the path to the input data needs to be included in the config file. Then the entirety of processing can start. 
-
-```
-# If conda is not in your path ...
-
-source ${HOME}/miniconda3/etc/profile.d/conda.sh
-
-# Activate iguide environment
-
-conda activate iguide
-
-# After constructing the config file and having reference files (i.e. sampleinfo)
-# You can check the samples associated with the run.
-
-iguide list_samples configs/simulation.config.yml
-
-# Create test analysis directory
-
-iguide setup configs/simulation.config.yml
-
-# Process a simulation dataset
-
-iguide run configs/simulation.config.yml -- -np
-iguide run configs/simulation.config.yml -- --latency-wait 30
-
-# Processing will complete with several reports, but if additional analyses are required,
-# you can re-evaluate a run by its config file. Multiple runs can be evaluated together, 
-# just include multiple config files.
-
-iguide eval configs/simulation.config.yml \
-  -o analysis/simulation/output/iguide.eval.simulation.test.rds \
-  -s sampleInfo/simulation.supp.csv
-
-# After evaluation, generate a report in a different format than standard.
-# Additionally the evaluation and report generation step can be combined using 
-# config file(s) as inputs for the 'report' subcommand (using the -c flag instead of -e).
-
-iguide report -e analysis/simulation/output/iguide.eval.simulation.test.rds \
-  -o analysis/simulation/reports/report.simulation.pdf \
-  -s sampleInfo/simulation.supp.csv \
-  -t pdf
-
-# When you are all finished and ready to archive / remove excess files, a minimal configuration
-# can be achieved with the 'clean' subcommand.
-
-iguide clean configs/simulation.config.yml
-
-# Or you realized you messed up all the input and need to restart
-
-iguide clean configs/simulation.config.yml --remove_proj
-
-# Deactivate the environment
-
-conda deactivate
-```
+To get started, check out the iGUIDE documentation at [iGUIDE.ReadTheDocs.io](https://iguide.readthedocs.io/).
 
 ### Changelog:
 
-**v0.9.9 (August 9th,2019) - Additional updates**
-
-* Implemented support for BWA aligner
-* Added tools (samqc) for working with other SAM/BAM output aligners as well
-* Switched iguide support code to iguideSupport R-package and added unit tests
-* Fixed bugs related to quoted table inputs (csv/tsv)
-* Implemented a method to skip demultiplexing, see documentation for setup
-* Resoved a number of issues identified, check GitHub for history!
-
-**v0.9.9 (June 10th, 2019)**
-
-* Modified the assimilate + evaluate workflow
-  + Assimilate now only includes reference genome data, meaning a cleaner intermediate file
-  + Evaluate will now handle ref. gene sets and further analysis
-  + This increases the modularity and consistancy of the workflow
-* Revised the iGUIDE Report format to be more informational and clearer
-* Revised a bit of the workflow to make reprocessing smoother
-* Updated BLAT coupling script to be more memory efficient
-* Fixed TravisCI testing!
-* Changed stat workflow, now restarting analysis won't init a total reproc.
-
-**v0.9.8 (April 19th, 2019)**
-
-* iGUIDE can now support non-Cas9 nucleases as well!
-  + Implemented nuclease profiles into configs
-  + Updated assimilation, evaluation, and reporting scripts
-* Added default resources to allow simpler HPC processing
-* Included flexible system for identifying on-target sites
-  + Config can accept a range rather than a single site
-  + Acceptable notation: chr4:+:397-416 and chr3:*:397
-* Changed build nomenclature from v0.9.3 to b0.9.3
-  + So as not to confuse with version
-* Added 'summary' subcommand to generate a consise text-based report
-  + Working in the same manner as 'report', can generate from config(s) or eval file
-* Added short stats-based report to be produced at the end of processing
-* Additional bugfixes.
-
-**v0.9.7 (March 6th, 2019)**
-
-* Hotfix to workflow.
-* Changed 'setup' subcommand to python script based rather than snakemake.
-* Changed file organization.
-
-**v0.9.6 (March 5th, 2019)**
-
-* Introduced process workflow steps: assimilate and evaluate
-  + Assimilate aligned data and compare with targeting sequences
-    + Core data object that can be combined across runs / projects
-  + Evaluated data incorporates reference data and statistical models
-    + A staple data object for reports and can be constructed from multiple runs
-* Included new subcommands 'eval' and modified 'report'
-  + report from either config(s) or eval dataset
-* Cleaned up file structure
-* Updated documentation in code and docs.
-* Implemented accuracy and retention checks with simulation dataset.
-* Updated simulation dataset with larger set to test analysis.
-
-**v0.9.5 (February 19th, 2019)**
-
-* Updated demultiplexing to be more efficient and better HPC compatible.
-* Added RefSeq Extended* reference gene sets
-  + 'ext' includes curated, predicted, and other RefSeq sets
-  + 'ext.nomodel' includes only curated and other RefSeq sets
-* Incorporated resource allocation for job dependent memory consumption
-  + Works great with HPC to specify memory requirements
-* Streamlined input for report generation by only requiring config(s)
-
-
-**v0.9.4 (January 30th, 2019)**
-
-* Updated 'report' utility and formating
-  + custom templates now accepted
-  + included as subcommand, check with 'iguide report -h'
-  + pdf and html options report 'nicely' even when printed from either
-* Updated build to v0.9.2 to support new formating in report
-* Builds are constructed from spec files rather than yaml requirements
-* Included the 'clean' subcommand to reduce size of processed projects
-  + after cleaning a project, only terminal data files will remain
-
-**v0.9.3 (January 11th, 2019)**
-
-* Added 'list_samples' subcommand to list samples within a project.
-* Caught a few bugs and worked them out for smoother processing and reports.
-
-**v0.9.2 (January 7th, 2019)**
-
-* Modified test dataset to run tests quicker and implemented CirclCI checking.
-
-**v0.9.1 (January 6th, 2019)**
-
-* Fixed problematic install for first time conda installers.
-
-**v0.9.0 (January 4th, 2019)**
+**v1.0.0 (August 15th, 2019)**
 
-* Initial release.
-* Supports setup and analysis of GUIDE-seq and iGUIDE experiments.
-* Documentation on [ReadTheDocs.io](https://iguide.readthedocs.io/en/latest/index.html).
+* Release of version 1.0.0!!!
+* iGUIDE is a computational pipeline that supports the detection of DSBs induced
+  by designer nucleases
+* Aligner support for BLAT and BWA is currently implemented; let us know if you
+  would like to see others.
+* Flexible pipeline processing built on Snakemake, supports a binning system
+  to better distribute workflow for whichever system it is being processed on
+* Documentation supporting a Quickstart and User Guide hosted by [ReadTheDocs](https://iguide.readthedocs.io/)
diff --git a/Snakefile b/Snakefile
@@ -115,7 +115,7 @@ if not "alignMB" in config:
     config["alignMB"] = 4000
 
 if not "qualCtrlMB" in config:
-    config["qualCtrlMB"] = 4000
+    config["qualCtrlMB"] = 8000
 
 if not "assimilateMB" in config:
     config["assimilateMB"] = 4000
@@ -125,16 +125,31 @@ if not "evaluateMB" in config:
 
 if not "reportMB" in config:
     config["reportMB"] = 4000
-
+
+if not "bins" in config:
+    config["bins"] = 5
+
+if not "level" in config:
+    config["level"] = 300000
+
 if not "readNamePattern" in config:
     config["readNamePattern"] = str("'[\\w\\:\\-\\+]+'")
 
+
+# Define BINS
+BINS = []
+
+for i in range(1, config["bins"] + 1, 1):
+    BINS.append("bin" + str(i).zfill(len(str(config["bins"]))))
+
+
 # Regex constraints on wildcards
 wildcard_constraints:
     sample="[\w\-\_]+",
     read="R[12]",
     read_type="[RI][12]",
-    req_type="[RI][12]"
+    req_type="[RI][12]",
+    bin=r"bin\d+"
 
 # Target Rules
 rule all:
@@ -153,6 +168,7 @@ if (config["skipDemultiplexing"]):
 else:
     include: "rules/demulti.rules"
 
+include: "rules/binning.rules"
 include: "rules/trim.rules"
 
 if (config["UMItags"]):
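
The Snakefile hunk above fills in a default number of bins and builds zero-padded bin labels that satisfy the new `bin` wildcard constraint. The following is a minimal standalone sketch of that logic, not the pipeline's own code: the helper name `make_bins` is hypothetical, and `config` is a plain dictionary standing in for Snakemake's config object.

```python
import re


def make_bins(n_bins):
    """Return labels bin1..binN, zero-padded to the width of n_bins.

    Hypothetical helper mirroring the loop added to the Snakefile.
    """
    width = len(str(n_bins))
    return ["bin" + str(i).zfill(width) for i in range(1, n_bins + 1)]


# Stand-in for the Snakemake config; the default of 5 matches the diff.
config = {}
config.setdefault("bins", 5)

BINS = make_bins(config["bins"])
print(BINS)

# Every label should satisfy the wildcard constraint added for `bin`.
assert all(re.fullmatch(r"bin\d+", label) for label in BINS)
```

Padding to the width of the largest index keeps the labels in the same order whether they are sorted numerically or lexically, which may be why the original loop uses `zfill`.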

diff --git a/configs/simulation.config.yml b/configs/simulation.config.yml
@@ -3,7 +3,6 @@ Run_Name : "simulation"
 Sample_Info : "sampleInfo/simulation.sampleInfo.csv"
 Supplemental_Info : "sampleInfo/simulation.supp.csv"
 Ref_Genome : "hg38"
-Ref_Genome_Path : "genomes/hg38.2bit"
 Aligner : "blat"
 UMItags : TRUE
 
@@ -119,6 +118,10 @@ R2odnMismatch : 0
 R2overMismatch : 4
 R2overMaxLength : 20
 
+# Binning
+bins : 3
+level : 250
+
 # Reference Alignment
 BLATparams : "-tileSize=11 -stepSize=9 -minIdentity=85 -maxIntron=5 -minScore=27 -dots=1000 -out=psl -noHead"
 BWAparams : "-k 30 -w 2500 -P -L 25 -a"

diff --git a/docs/changelog.rst b/docs/changelog.rst
@@ -0,0 +1,110 @@
+.. _changelog:
+
+ChangeLog
+=========
+
+**v1.0.0 (August 15th, 2019)**
+
+- Complete support for BLAT and BWA aligners
+- Included a binning system to distribute workload into smaller loads
+- Implemented a version tracking system into the intermediate data files
+  (incorp_sites)
+- Updated CLI with "hints" for snakemake processing
+
+**v0.9.9 (August 9th, 2019) - Additional updates**
+
+- Implemented support for BWA aligner
+- Added tools (samqc) for working with other SAM/BAM output aligners as well
+- Switched iguide support code to iguideSupport R-package and added unit tests
+- Fixed bugs related to quoted table inputs (csv/tsv)
+- Implemented a method to skip demultiplexing, see documentation for setup
+- Resolved a number of issues identified, check GitHub for history!
+
+**v0.9.9 (June 10th, 2019)**
+
+- Revised the iGUIDE Report format to be more informational and clearer
+- Revised a bit of the workflow to make reprocessing smoother
+- Updated BLAT coupling script to be more memory efficient
+- Fixed TravisCI testing!
+- Changed stat workflow, now restarting analysis won't initiate a total 
+  reprocessing.
+- Modified the assimilate + evaluate workflow
+- Assimilate now only includes reference genome data, meaning a cleaner
+  intermediate file
+- Evaluate will now handle ref. gene sets and further analysis
+- This increases the modularity and consistency of the workflow
+
+
+**v0.9.8 (April 19th, 2019)**
+
+- iGUIDE can now support non-Cas9 nucleases as well!
+- Implemented nuclease profiles into configs
+- Updated assimilation, evaluation, and reporting scripts
+- Added default resources to allow simpler HPC processing
+- Included flexible system for identifying on-target sites
+- Config can accept a range rather than a single site
+- Acceptable notation: chr4:+:397-416 and chr3:\*:397
+- Changed build nomenclature from v0.9.3 to b0.9.3, so as not to confuse with
+  version
+- Added 'summary' subcommand to generate a concise text-based report
+- Added short stats-based report to be produced at the end of processing
+- Additional bugfixes.
+
+**v0.9.7 (March 6th, 2019)**
+
+- Hotfix to workflow.
+- Changed 'setup' subcommand to python script based rather than snakemake.
+- Changed file organization.
+
+**v0.9.6 (March 5th, 2019)**
+
+- Introduced process workflow steps: assimilate and evaluate
+- Assimilate aligned data and compare with targeting sequences
+- Incorp_sites now a core data object that can be combined across runs
+- Evaluated data incorporates reference data and statistical models
+- A staple data object for reports and can be constructed from multiple runs
+- Included new subcommands 'eval' and modified 'report', report from either
+  config(s) or eval dataset
+- Cleaned up file structure
+- Updated documentation in code and docs.
+- Implemented accuracy and retention checks with simulation dataset.
+- Updated simulation dataset with larger set to test analysis.
+
+**v0.9.5 (February 19th, 2019)**
+
+- Updated demultiplexing to be more efficient and better HPC compatible.
+- Added RefSeq Extended reference gene sets
+- 'ext' includes curated, predicted, and other RefSeq sets
+- 'ext.nomodel' includes only curated and other RefSeq sets
+- Incorporated resource allocation for job dependent memory consumption, works
+  great with HPC to specify memory requirements
+- Streamlined input for report generation by only requiring config(s)
+
+**v0.9.4 (January 30th, 2019)**
+
+- Updated 'report' utility and formatting. Custom templates now accepted.
+  Included as subcommand, check with 'iguide report -h'. PDF and HTML options
+  report 'nicely' even when printed from either
+- Updated build to v0.9.2 to support new formatting in report
+- Builds are constructed from spec files rather than yaml requirements
+- Included the 'clean' subcommand to reduce size of processed projects. After
+  cleaning a project, only terminal data files will remain
+
+**v0.9.3 (January 11th, 2019)**
+
+- Added 'list_samples' subcommand to list samples within a project.
+- Caught a few bugs and worked them out for smoother processing and reports.
+
+**v0.9.2 (January 7th, 2019)**
+
+- Modified test dataset to run tests quicker and implemented CircleCI checking.
+
+**v0.9.1 (January 6th, 2019)**
+
+- Fixed problematic install for first time conda installers.
+
+**v0.9.0 (January 4th, 2019)**
+
+- Initial release.
+- Supports setup and analysis of GUIDE-seq and iGUIDE experiments.
+- Documentation on `ReadTheDocs.io <https://iguide.readthedocs.io/en/latest/index.html>`_.