diff --git a/.version b/.version index 8031930e..0ec25f75 100644 --- a/.version +++ b/.version @@ -1 +1 @@ -v0.9.9 +v1.0.0 diff --git a/README.md b/README.md index 58a17f0a..fb4bc4aa 100644 --- a/README.md +++ b/README.md @@ -10,194 +10,17 @@ Bioinformatic pipeline for processing iGUIDE and GUIDE-seq samples. ### Description iGUIDE is a pipeline written in [snakemake](http://snakemake.readthedocs.io/) for processing and analyzing double-strand DNA break events. These events may be induced, such as by designer nucleases like Cas9, or spontaneous, as produced through DNA replication or ionizing radiation. A laboratory bench-side protocol accompanies this software pipeline, and can be found [**https://doi.org/10.1186/s13059-019-1625-3**](https://doi.org/10.1186/s13059-019-1625-3). -Below, this readme gives the reader a overview of the pipeline, including how to install and process a sample dataset. Processing a sample data set is broken into three parts: - -1) developing a configuration file and sample information -2) setting up a run directory and acquiring the sequence data -3) initializing the pipeline and understanding the output - -More complete documentation can be found on [ReadTheDocs.io](https://iguide.readthedocs.io/en/latest/index.html). - -### Install -To install iGUIDE, simply clone the repository to the desired destination: - -``` -git clone https://github.com/cnobles/iGUIDE.git -``` - -Then initiate the install using the install script. If you would like the installed environment to be named something other than 'iguide', the new conda environment name can be provided to the 'install.sh' script as provided below. - -``` -cd path/to/iGUIDE -bash install.sh - -# Or - -cd path/to/iGUIDE -bash install.sh -e {env_name} - -# Or include simulation test -cd path/to/iGUIDE -bash install.sh -t - -# For help with install options: -cd path/to/iGUIDE -bash install.sh -h -``` - -### An Example Run -To perform a local test of running the iGUIDE informatic pipeline, run the below code after installing. This block first activates your conda environment, 'iguide' by default, and then creates a test directory within the analysis directory. The run information is stored in the run specific configuration file (config file). Using the '-np' flag with the snakemake call will perform a dry-run (won't actually process anything) and print the commands to the terminal, so you can see what snakemake is about to perform. Next, the test data can be moved to the input directory underneath the new test run directory or the path to the input data needs to be included in the config file. Then the entirety of processing can start. - -``` -# If conda is not in your path ... - -source ${HOME}/miniconda3/etc/profile.d/conda.sh - -# Activate iguide environment - -conda activate iguide - -# After constructing the config file and having reference files (i.e. sampleinfo) -# You can check the samples associated with the run. - -iguide list_samples configs/simulation.config.yml - -# Create test analysis directory - -iguide setup configs/simulation.config.yml - -# Process a simulation dataset - -iguide run configs/simulation.config.yml -- -np -iguide run configs/simulation.config.yml -- --latency-wait 30 - -# Processing will complete with several reports, but if additional analyses are required, -# you can re-evaluate a run by its config file. Multiple runs can be evaluated together, -# just include multiple config files. 
- -iguide eval configs/simulation.config.yml \ - -o analysis/simulation/output/iguide.eval.simulation.test.rds \ - -s sampleInfo/simulation.supp.csv - -# After evaluation, generate a report in a different format than standard. -# Additionally the evaluation and report generation step can be combined using -# config file(s) as inputs for the 'report' subcommand (using the -c flag instead of -e). - -iguide report -e analysis/simulation/output/iguide.eval.simulation.test.rds \ - -o analysis/simulation/reports/report.simulation.pdf \ - -s sampleInfo/simulation.supp.csv \ - -t pdf - -# When you are all finished and ready to archive / remove excess files, a minimal configuration -# can be achieved with the 'clean' subcommand. - -iguide clean configs/simulation.config.yml - -# Or you realized you messed up all the input and need to restart - -iguide clean configs/simulation.config.yml --remove_proj - -# Deactivate the environment - -conda deactivate -``` +To get started, checkout the iGUIDE documentation at [iGUIDE.ReadTheDocs.io](https://iguide.readthedocs.io/). ### Changelog: -**v0.9.9 (August 9th,2019) - Additional updates** - -* Implemented support for BWA aligner -* Added tools (samqc) for working with other SAM/BAM output aligners as well -* Switched iguide support code to iguideSupport R-package and added unit tests -* Fixed bugs related to quoted table inputs (csv/tsv) -* Implemented a method to skip demultiplexing, see documentation for setup -* Resoved a number of issues identified, check GitHub for history! - -**v0.9.9 (June 10th, 2019)** - -* Modified the assimilate + evaluate workflow - + Assimilate now only includes reference genome data, meaning a cleaner intermediate file - + Evaluate will now handle ref. gene sets and further analysis - + This increases the modularity and consistancy of the workflow -* Revised the iGUIDE Report format to be more informational and clearer -* Revised a bit of the workflow to make reprocessing smoother -* Updated BLAT coupling script to be more memory efficient -* Fixed TravisCI testing! -* Changed stat workflow, now restarting analysis won't init a total reproc. - -**v0.9.8 (April 19th, 2019)** - -* iGUIDE can now support non-Cas9 nucleases as well! - + Implemented nuclease profiles into configs - + Updated assimilation, evaluation, and reporting scripts -* Added default resources to allow simpler HPC processing -* Included flexible system for identifying on-target sites - + Config can accept a range rather than a single site - + Acceptable notation: chr4:+:397-416 and chr3:*:397 -* Changed build nomenclature from v0.9.3 to b0.9.3 - + So as not to confuse with version -* Added 'summary' subcommand to generate a consise text-based report - + Working in the same manner as 'report', can generate from config(s) or eval file -* Added short stats-based report to be produced at the end of processing -* Additional bugfixes. - -**v0.9.7 (March 6th, 2019)** - -* Hotfix to workflow. -* Changed 'setup' subcommand to python script based rather than snakemake. -* Changed file organization. 
-
-**v0.9.6 (March 5th, 2019)**
-
-* Introduced process workflow steps: assimilate and evaluate
-  + Assimilate aligned data and compare with targeting sequences
-  + Core data object that can be combined across runs / projects
-  + Evaluated data incorporates reference data and statistical models
-  + A staple data object for reports and can be constructed from multiple runs
-* Included new subcommands 'eval' and modified 'report'
-  + report from either config(s) or eval dataset
-* Cleaned up file structure
-* Updated documentation in code and docs.
-* Implemented accuracy and retention checks with simulation dataset.
-* Updated simulation dataset with larger set to test analysis.
-
-**v0.9.5 (February 19th, 2019)**
-
-* Updated demultiplexing to be more efficient and better HPC compatible.
-* Added RefSeq Extended* reference gene sets
-  + 'ext' includes curated, predicted, and other RefSeq sets
-  + 'ext.nomodel' includes only curated and other RefSeq sets
-* Incorporated resource allocation for job dependent memory consumption
-  + Works great with HPC to specify memory requirements
-* Streamlined input for report generation by only requiring config(s)
-
-
-**v0.9.4 (January 30th, 2019)**
-
-* Updated 'report' utility and formating
-  + custom templates now accepted
-  + included as subcommand, check with 'iguide report -h'
-  + pdf and html options report 'nicely' even when printed from either
-* Updated build to v0.9.2 to support new formating in report
-* Builds are constructed from spec files rather than yaml requirements
-* Included the 'clean' subcommand to reduce size of processed projects
-  + after cleaning a project, only terminal data files will remain
-
-**v0.9.3 (January 11th, 2019)**
-
-* Added 'list_samples' subcommand to list samples within a project.
-* Caught a few bugs and worked them out for smoother processing and reports.
-
-**v0.9.2 (January 7th, 2019)**
-
-* Modified test dataset to run tests quicker and implemented CirclCI checking.
-
-**v0.9.1 (January 6th, 2019)**
-
-* Fixed problematic install for first time conda installers.
-
-**v0.9.0 (January 4th, 2019)**
+**v1.0.0 (August 15th, 2019)**
 
-* Initial release.
-* Supports setup and analysis of GUIDE-seq and iGUIDE experiments.
-* Documentation on [ReadTheDocs.io](https://iguide.readthedocs.io/en/latest/index.html).
+* Release of version 1.0.0!!!
+* iGUIDE is a computational pipeline that supports the detection of DSBs induced
+  by designer nucleases
+* Aligner support for BLAT and BWA is currently implemented; let us know if you
+  would like to see others.
+* Flexible pipeline processing built on Snakemake; supports a binning system
+  to better distribute the workload across whichever system it runs on
+* Documentation supporting a Quickstart and User Guide hosted on [ReadTheDocs](https://iguide.readthedocs.io/)
diff --git a/Snakefile b/Snakefile
index 1df27096..2635e211 100644
--- a/Snakefile
+++ b/Snakefile
@@ -115,7 +115,7 @@ if not "alignMB" in config:
     config["alignMB"] = 4000
 
 if not "qualCtrlMB" in config:
-    config["qualCtrlMB"] = 4000
+    config["qualCtrlMB"] = 8000
 
 if not "assimilateMB" in config:
     config["assimilateMB"] = 4000
@@ -125,16 +125,31 @@ if not "evaluateMB" in config:
 
 if not "reportMB" in config:
     config["reportMB"] = 4000
-
+
+if not "bins" in config:
+    config["bins"] = 5
+
+if not "level" in config:
+    config["level"] = 300000
+
 if not "readNamePattern" in config:
     config["readNamePattern"] = str("'[\\w\\:\\-\\+]+'")
+
+# Define BINS: zero-padded bin names (e.g. bins = 5 gives bin1 ... bin5)
+BINS = []
+
+for i in range(1, config["bins"] + 1, 1):
+    BINS.append("bin" + str(i).zfill(len(str(config["bins"]))))
+
+
 # Regex constraints on wildcards
 wildcard_constraints:
     sample="[\w\-\_]+",
     read="R[12]",
     read_type="[RI][12]",
-    req_type="[RI][12]"
+    req_type="[RI][12]",
+    bin="bin[\d]+"
 
 # Target Rules
 rule all:
@@ -153,6 +168,7 @@ if (config["skipDemultiplexing"]):
 else:
     include: "rules/demulti.rules"
 
+include: "rules/binning.rules"
 include: "rules/trim.rules"
 
 if (config["UMItags"]):
diff --git a/configs/simulation.config.yml b/configs/simulation.config.yml
index d26d844f..39c7e44e 100644
--- a/configs/simulation.config.yml
+++ b/configs/simulation.config.yml
@@ -3,7 +3,6 @@ Run_Name : "simulation"
 Sample_Info : "sampleInfo/simulation.sampleInfo.csv"
 Supplemental_Info : "sampleInfo/simulation.supp.csv"
 Ref_Genome : "hg38"
-Ref_Genome_Path : "genomes/hg38.2bit"
 Aligner : "blat"
 UMItags : TRUE
@@ -119,6 +118,10 @@ R2odnMismatch : 0
 R2overMismatch : 4
 R2overMaxLength : 20
 
+# Binning
+bins : 3
+level : 250
+
 # Reference Alignment
 BLATparams : "-tileSize=11 -stepSize=9 -minIdentity=85 -maxIntron=5 -minScore=27 -dots=1000 -out=psl -noHead"
 BWAparams : "-k 30 -w 2500 -P -L 25 -a"
diff --git a/docs/changelog.rst b/docs/changelog.rst
new file mode 100644
index 00000000..8d464ef8
--- /dev/null
+++ b/docs/changelog.rst
@@ -0,0 +1,110 @@
+.. _changelog:
+
+ChangeLog
+=========
+
+**v1.0.0 (August 15th, 2019)**
+
+- Complete support for BLAT and BWA aligners
+- Included a binning system to distribute workload into smaller loads
+- Implemented a version tracking system into the intermediate data files
+  (incorp_sites)
+- Updated CLI with "hints" for snakemake processing
+
+**v0.9.9 (August 9th, 2019) - Additional updates**
+
+- Implemented support for BWA aligner
+- Added tools (samqc) for working with other SAM/BAM output aligners as well
+- Switched iguide support code to iguideSupport R-package and added unit tests
+- Fixed bugs related to quoted table inputs (csv/tsv)
+- Implemented a method to skip demultiplexing, see documentation for setup
+- Resolved a number of issues identified, check GitHub for history!
+
+**v0.9.9 (June 10th, 2019)**
+
+- Revised the iGUIDE Report format to be more informational and clearer
+- Revised a bit of the workflow to make reprocessing smoother
+- Updated BLAT coupling script to be more memory efficient
+- Fixed TravisCI testing!
+- Changed stat workflow, now restarting analysis won't initiate a total
+  reprocessing.
+- Modified the assimilate + evaluate workflow
+- Assimilate now only includes reference genome data, meaning a cleaner
+  intermediate file
+- Evaluate will now handle ref. gene sets and further analysis
+- This increases the modularity and consistency of the workflow
+
+
+**v0.9.8 (April 19th, 2019)**
+
+- iGUIDE can now support non-Cas9 nucleases as well!
+- Implemented nuclease profiles into configs
+- Updated assimilation, evaluation, and reporting scripts
+- Added default resources to allow simpler HPC processing
+- Included flexible system for identifying on-target sites
+- Config can accept a range rather than a single site
+- Acceptable notation: chr4:+:397-416 and chr3:\*:397
+- Changed build nomenclature from v0.9.3 to b0.9.3, so as not to confuse with
+  version
+- Added 'summary' subcommand to generate a concise text-based report
+- Added short stats-based report to be produced at the end of processing
+- Additional bugfixes.
+
+**v0.9.7 (March 6th, 2019)**
+
+- Hotfix to workflow.
+- Changed 'setup' subcommand to be python script based rather than snakemake.
+- Changed file organization.
+
+**v0.9.6 (March 5th, 2019)**
+
+- Introduced process workflow steps: assimilate and evaluate
+- Assimilate aligned data and compare with targeting sequences
+- Incorp_sites now a core data object that can be combined across runs
+- Evaluated data incorporates reference data and statistical models
+- A staple data object for reports and can be constructed from multiple runs
+- Included new subcommands 'eval' and modified 'report', report from either
+  config(s) or eval dataset
+- Cleaned up file structure
+- Updated documentation in code and docs.
+- Implemented accuracy and retention checks with simulation dataset.
+- Updated simulation dataset with larger set to test analysis.
+
+**v0.9.5 (February 19th, 2019)**
+
+- Updated demultiplexing to be more efficient and more HPC compatible.
+- Added RefSeq Extended reference gene sets
+- 'ext' includes curated, predicted, and other RefSeq sets
+- 'ext.nomodel' includes only curated and other RefSeq sets
+- Incorporated resource allocation for job dependent memory consumption, works
+  great with HPC to specify memory requirements
+- Streamlined input for report generation by only requiring config(s)
+
+**v0.9.4 (January 30th, 2019)**
+
+- Updated 'report' utility and formatting. Custom templates now accepted.
+  Included as subcommand, check with 'iguide report -h'. PDF and HTML options
+  report 'nicely' even when printed from either
+- Updated build to v0.9.2 to support new formatting in report
+- Builds are constructed from spec files rather than yaml requirements
+- Included the 'clean' subcommand to reduce size of processed projects. After
+  cleaning a project, only terminal data files will remain
+
+**v0.9.3 (January 11th, 2019)**
+
+- Added 'list_samples' subcommand to list samples within a project.
+- Caught a few bugs and worked them out for smoother processing and reports.
+
+**v0.9.2 (January 7th, 2019)**
+
+- Modified test dataset to run tests more quickly and implemented CircleCI
+  checking.
+
+**v0.9.1 (January 6th, 2019)**
+
+- Fixed problematic install for first-time conda installers.
+
+**v0.9.0 (January 4th, 2019)**
+
+- Initial release.
+- Supports setup and analysis of GUIDE-seq and iGUIDE experiments.
+- Documentation on `ReadTheDocs.io <https://iguide.readthedocs.io/en/latest/index.html>`_.
diff --git a/docs/conf.py b/docs/conf.py index cb1165dd..1d670b7f 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -20,13 +20,13 @@ # -- Project information ----------------------------------------------------- project = 'iGUIDE' -copyright = '2018, Christopher Nobles, Ph.D.' +copyright = '2019, Christopher Nobles, Ph.D.' author = 'Christopher Nobles, Ph.D.' # The short X.Y version version = '' # The full version, including alpha/beta/rc tags -release = 'v0.2.6' +release = 'v1.0.0' # -- General configuration --------------------------------------------------- diff --git a/docs/iguide_aux_workflow_fig.jpg b/docs/iguide_aux_workflow_fig.jpg new file mode 100644 index 00000000..8f0b4e64 Binary files /dev/null and b/docs/iguide_aux_workflow_fig.jpg differ diff --git a/docs/iguide_prime_workflow_fig.jpg b/docs/iguide_prime_workflow_fig.jpg new file mode 100644 index 00000000..18355abe Binary files /dev/null and b/docs/iguide_prime_workflow_fig.jpg differ diff --git a/docs/iguide_subcmd_fig.jpg b/docs/iguide_subcmd_fig.jpg new file mode 100644 index 00000000..8a49ad81 Binary files /dev/null and b/docs/iguide_subcmd_fig.jpg differ diff --git a/docs/images/iguide_aux_workflow_fig.pdf b/docs/images/iguide_aux_workflow_fig.pdf new file mode 100644 index 00000000..0e2387ca Binary files /dev/null and b/docs/images/iguide_aux_workflow_fig.pdf differ diff --git a/docs/images/iguide_aux_workflow_fig.png b/docs/images/iguide_aux_workflow_fig.png new file mode 100644 index 00000000..691ec7e4 Binary files /dev/null and b/docs/images/iguide_aux_workflow_fig.png differ diff --git a/docs/images/iguide_prime_workflow_fig.pdf b/docs/images/iguide_prime_workflow_fig.pdf new file mode 100644 index 00000000..cb396c8e Binary files /dev/null and b/docs/images/iguide_prime_workflow_fig.pdf differ diff --git a/docs/images/iguide_prime_workflow_fig.png b/docs/images/iguide_prime_workflow_fig.png new file mode 100644 index 00000000..6e2741d6 Binary files /dev/null and b/docs/images/iguide_prime_workflow_fig.png differ diff --git a/docs/images/iguide_subcmd_fig.pdf b/docs/images/iguide_subcmd_fig.pdf new file mode 100644 index 00000000..e55d8c0f Binary files /dev/null and b/docs/images/iguide_subcmd_fig.pdf differ diff --git a/docs/images/iguide_subcmd_fig.png b/docs/images/iguide_subcmd_fig.png new file mode 100644 index 00000000..6da8bfeb Binary files /dev/null and b/docs/images/iguide_subcmd_fig.png differ diff --git a/docs/index.rst b/docs/index.rst index cd6f6956..33aa4e83 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -6,14 +6,11 @@ Welcome to iGUIDE's documentation ================================== -Description -=========== - -Software pipeline for processing and analyzing double-strand DNA break events. -These events may be induced, such as by designer nucleases like Cas9, or -spontaneous, as produced through DNA replication or ionizing radiation. A -laboratory bench-side protocol accompanies this software pipeline, and can be -found (https://doi.org/10.1186/s13059-019-1625-3). +iGUIDE is a software pipeline for processing and analyzing double-strand DNA +break events. These events may be induced, such as by designer nucleases like +Cas9, or spontaneous, as produced through DNA replication or ionizing +radiation. A laboratory bench-side protocol accompanies this software pipeline, +and can be found (https://doi.org/10.1186/s13059-019-1625-3). This documentation gives the reader an overview of the pipeline, including how to install and process a sample dataset. 
Processing a sample data set is @@ -23,17 +20,17 @@ broken into a few parts: #. Setting up a run directory and acquiring the sequence data #. Initializing the pipeline and understanding the output +After processing sequencing run(s), additional evaluation and analysis can be +performed by supplying supplemental data and combining specimen outputs from +several runs. + +To get started, see :ref:`quickstart`! + .. toctree:: :hidden: - :maxdepth: 4 + :maxdepth: 2 :caption: Contents: - pages/install.rst - pages/quickstart.rst - pages/config_setup.rst - pages/config_run.rst - pages/config_nucs.rst - pages/config_proc.rst - pages/sampleinfo.rst - pages/suppinfo.rst - pages/changelog.rst + quickstart.rst + usage.rst + changelog.rst diff --git a/docs/pages/changelog.rst b/docs/pages/changelog.rst deleted file mode 100644 index 8f9f7ca9..00000000 --- a/docs/pages/changelog.rst +++ /dev/null @@ -1,128 +0,0 @@ -.. _changelog: - -.. contents:: - :depth: 2 - -ChangeLog -========= - -**v0.9.9 (August 9th,2019) - Additional updates** - -* Implemented support for BWA aligner -* Added tools (samqc) for working with other SAM/BAM output aligners as well -* Switched iguide support code to iguideSupport R-package and added unit tests -* Fixed bugs related to quoted table inputs (csv/tsv) -* Implemented a method to skip demultiplexing, see documentation for setup -* Resoved a number of issues identified, check GitHub for history! - -**v0.9.9 (June 10th, 2019)** - -* Modified the assimilate + evaluate workflow - - - Assimilate now only includes reference genome data, meaning a cleaner intermediate file - - Evaluate will now handle ref. gene sets and further analysis - - This increases the modularity and consistancy of the workflow - -* Revised the iGUIDE Report format to be more informational and clearer -* Revised a bit of the workflow to make reprocessing smoother -* Updated BLAT coupling script to be more memory efficient -* Fixed TravisCI testing! -* Changed stat workflow, now restarting analysis won't initiate a total reprocessing. - -**v0.9.8 (April 19th, 2019)** - -* iGUIDE can now support non-Cas9 nucleases as well! - - - Implemented nuclease profiles into configs - - Updated assimilation, evaluation, and reporting scripts - -* Added default resources to allow simpler HPC processing -* Included flexible system for identifying on-target sites - - - Config can accept a range rather than a single site - - Acceptable notation: chr4:+:397-416 and chr3:*:397 - -* Changed build nomenclature from v0.9.3 to b0.9.3 - - - So as not to confuse with version - -* Added 'summary' subcommand to generate a consise text-based report - - - Working in the same manner as 'report', can generate from config(s) or eval file - -* Added short stats-based report to be produced at the end of processing -* Additional bugfixes. - -**v0.9.7 (March 6th, 2019)** - -* Hotfix to workflow. -* Changed 'setup' subcommand to python script based rather than snakemake. -* Changed file organization. 
- -**v0.9.6 (March 5th, 2019)** - -* Introduced process workflow steps: assimilate and evaluate - - - Assimilate aligned data and compare with targeting sequences - - + Core data object that can be combined across runs / projects - - - Evaluated data incorporates reference data and statistical models - - + A staple data object for reports and can be constructed from multiple runs - -* Included new subcommands 'eval' and modified 'report' - - - report from either config(s) or eval dataset - -* Cleaned up file structure -* Updated documentation in code and docs. -* Implemented accuracy and retention checks with simulation dataset. -* Updated simulation dataset with larger set to test analysis. - -**v0.9.5 (February 19th, 2019)** - -* Updated demultiplexing to be more efficient and better HPC compatible. -* Added RefSeq Extended* reference gene sets - - - 'ext' includes curated, predicted, and other RefSeq sets - - 'ext.nomodel' includes only curated and other RefSeq sets - -* Incorporated resource allocation for job dependent memory consumption - - - Works great with HPC to specify memory requirements - -* Streamlined input for report generation by only requiring config(s) - -**v0.9.4 (January 30th, 2019)** - -* Updated 'report' utility and formating - - - custom templates now accepted - - included as subcommand, check with 'iguide report -h' - - pdf and html options report 'nicely' even when printed from either - -* Updated build to v0.9.2 to support new formating in report -* Builds are constructed from spec files rather than yaml requirements -* Included the 'clean' subcommand to reduce size of processed projects - - - after cleaning a project, only terminal data files will remain - -**v0.9.3 (January 11th, 2019)** - -* Added 'list_samples' subcommand to list samples within a project. -* Caught a few bugs and worked them out for smoother processing and reports. - -**v0.9.2 (January 7th, 2019)** - -* Modified test dataset to run tests quicker and implemented CirclCI checking. - -**v0.9.1 (January 6th, 2019)** - -* Fixed problematic install for first time conda installers. - -**v0.9.0 (January 4th, 2019)** - -* Initial release. -* Supports setup and analysis of GUIDE-seq and iGUIDE experiments. -* Documentation on [ReadTheDocs.io](https://iguide.readthedocs.io/en/latest/index.html). diff --git a/docs/pages/config_nucs.rst b/docs/pages/config_nucs.rst deleted file mode 100644 index 6295887a..00000000 --- a/docs/pages/config_nucs.rst +++ /dev/null @@ -1,133 +0,0 @@ -.. _configinfo: - -.. contents:: - :depth: 2 - - - -Config - Nuclease Profiles -========================== - -An additional component to the first part of the config file, is the Nuclease -Profiles. The user can specify which nuclease they are using and include -and profile to help identify edit sites. Nuclease can range from Cas9 to Cpf1 -or TALEN based nickases. - -**Note:** For TALEN and dual flanking nickases / nucleases, each side will need -to be input as a different target. Specify in `Target_Sequences` the sequence -and `On_Target_Sites` the actual editing site. Make sure you include two -distinct identifiers for the sequences on-target sites, then specify the -target treatment as `{target_seq1};{target_seq2}. - -Any name can be given in the `Nuclease` section, but that name needs to match -the profile name as well. So if you want to call it "Cas9v2", then just make -sure you have a profile named "Cas9v2". - -Below is some ascii art that indicates the differences between nucleases. 
-Additionally, below the art are example profiles for input into the iGUIDE -software:: - - Editing strategies by designer nucleases - Cas9 : - >< PAM - ATGCATGCATGCATGCATGCA TGG (sense strand) - - TGCATGCATGCATGCATGCA NGG # gRNA - |||||||||||||||||||| ||| - TACGTACGTACGTACGTACGT ACC (anti-sense strand) - >< # Dominant cutpoint - - Cpf1 : Also known as Cas12a (similar nuclease structure for CasX) - >< # Dominant cutpoint - GTTTG ATGCATGCATGCATGCATGCATGCATGC (sense strand) - PAM - TTTV ATGCATGCATGCATGCATGCA # gRNA, nuclease activity leave overhang - |||| ||||||||||||||||||||| - CTAAC TACGTACGTACGTACGTACGTACGTACG (anti-sense strand) - >< # Dominant cutpoint - - TALEN : Protin-DNA binding domain fused with FokI nickase - ATATATATATATATATATAT GCATGCATGCATGCAT GCGCGCGCGCGCGCGCGCGC (sense strand) - \\\\\\\\\\\\\\\\\\\\ - |-------> - <-------| - \\\\\\\\\\\\\\\\\\\\ - TATATATATATATATATATA CGTACGTACGTACGTA CGCGCGCGCGCGCGCGCGCG (anti-sense strand) - # Proteins bind flanking the cleavage site and cut in the "insert" sequence. - - CasCLOVER : Clo051 or another nickases with CRISPR-based binding domains - ATCCT ATGCATGCATGCATGCATGC TTAACCGGTTAACCGG TACGTACGTACGTACGTACG CGGTC - ||| |||||||||||||||||||| (sense strand) - PAM Target Sequence \-------> - <-------\ Target Sequence PAM - (anti-sense strand) |||||||||||||||||||| ||| - TAGGA TACGTACGTACGTACGTACG AATTGGCCAATTGGCC ATGCATGCATGCATGCATGC GCCAG - - -Below are the example profiles:: - - Nuclease_Profiles : - Cas9 : - PAM : "NGG" - PAM_Loc : "3p" - PAM_Tol : 1 - Cut_Offset : -4 - Insert_size : FALSE - - Cpf1 : - PAM : "TTTV" - PAM_Loc : "5p" - PAM_Tol : 1 - Cut_Offset : 26 #(Anywhere between 23 and 28) - Insert_size : FALSE - - CasX : - PAM : "TTCN" - PAM_Loc : "5p" - PAM_Tol : 1 - Cut_Offset : 22 #(Anywhere between 16 and 29) - Insert_size : FALSE - - TALEN : - PAM : FALSE - PAM_Loc : FALSE - PAM_Tol : 0 - Cut_Offset : Mid_insert - Insert_size : "15:21" - - CasCLOVER : - PAM : "NGG" - PAM_Loc : "3p" - PAM_Tol : 1 - Cut_Offset : Mid_insert - Insert_size : "10:30" - - -Profile parameters ------------------- - -``PAM`` - protospacer adjacent motif - should be specified here and can contain - ambiguous nucleotides. - -``PAM_Loc`` - indicates the location of the PAM with respect to the pattern, either '5p', - '3p' or FALSE. - -``PAM_Tol`` - indicates the tolerance for mismatches in the PAM sequence (ignorned if PAM - is FALSE). - -``Cut_Offset`` - indicates the offset from the 5' nucleotide of the PAM sequence where the - nuclease creates a double strand break, unless PAM is FALSE, then the 5' - position of the target sequence (also accepts "mid_insert" to specify middle - of region between paired alignments). - -``Insert_size`` - is used if target sequences are expected to flank each other for editing, - such as with TALENs, and indicates the expected size of the insert. To input - a range, delimit the min and max by a colon, ie. 15:21. All names of - nucleases used to treat specimens need to have a profile. Additional profiles - should be added under the 'Nuclease_Profiles' parameter. - diff --git a/docs/pages/config_proc.rst b/docs/pages/config_proc.rst deleted file mode 100644 index 082b8455..00000000 --- a/docs/pages/config_proc.rst +++ /dev/null @@ -1,214 +0,0 @@ -.. _configinfo: - -.. 
contents:: - :depth: 4 - -Configs - Processing Information -================================ - -Below are parameters that are used to process the large amount of data, such as -setting memory suggestions if resources are specified or parameters for sequence -alignments. While these figues may not be relevant to the bench scientist, they -are particulars for computational scientists. - -Resource management is not required, but it can help when using HPC or limiting -jobs. You are encouraged to spend some time optimizing if you would like, these -parameters work out well on the designers platform. - - -iGUIDE configuration -"""""""""""""""""""" - -``Read_Types`` - This parameter should include which read types will be used in the analysis, - i.e. ``["R1", "R2", "I1", "I2"]``. This follows a list notation is Python. If - only single barcoding or some other method is employed and a read type is not - included, simply leave it out of the example. - -``Genomic_Reads`` - This parameter is similar to the ``Read_Types`` but only indicates which reads - contain genomic information rather than indexing. - - -Memory Management -""""""""""""""""" - -``defaultMB / demultiMB / trimMB / filtMB / consolMB / alignMB / coupleMB / assimilateMB / evaluateMB / reportMB`` - Controls the amount of memory allocated to each of these processes during - snakemake processing. While working on a server or multicored machine, these - parameters will work internally to help schedule jobs. Each value will act as - an upper limit for the amount of MB of RAM to expect the process to take, and - schedule jobs appropriately using the ``--resources mem_mb={limitMB}`` flag with - snakemake. During HPC use, these parameters can be combined with the cluster config - to schedule specific memory requirements for jobs. Additionally, if the - ``--restart-times {x}`` is used where "x" is the number of times to restart a job - if it fails, then the amount of memory for the job will increase by a unit of the - parameter. For example, if a trimming job fails because it runs out of memory, then - restarting the job will try to allocate 2 times the memory for the second attempt. - All parameters should be in megabytes (MB). - - -Demultiplexing parameters -""""""""""""""""""""""""" - -``skipDemultiplexing`` - Logical (either TRUE or FALSE) to indicate if demultiplexing should be carried - out. If TRUE, sequence files (*.fastq.gz) need to be placed or linked in the - input_data directory of an existing project directory (as with ``iguide setup``), - one sequence file for each type (R1, R2, I1, I2). These need to be identified - in the "Run" portion of the config file. If FALSE, then demultiplexed files need - to be included in the input_data directory of an existing project directory. The - files need to be appropriately named, in the format of ``{sampleName}.{readtype}.fastq.gz``, - where ``sampleName`` matches the 'sampleName' column found in the associated 'sampleInfo' - file, and ``readtype`` is R1, R2, I1, or I2. If ``UMItags`` is ``FALSE``, then only R1 and R2 - file types are required for analysis, if ``UMItags`` is ``TRUE``, then I2 is a - required file type as well. - -``barcode{1/2}Length`` - Integer values indicating the number of nucleotides in the barcodes or - indexing sequences. - -``barcode{1/2}`` - Character values (i.e. ``"I1"``) indicating which reads to find the associated - indexing information for demultiplexing. 
- -``bc{1/2}Mismatch`` - An integer value indicating the number of tolarated mismatches in the barcode - sequences for either barcode 1 or 2. - - -Sequence trimming -""""""""""""""""" - -``R{1/2}leadMismatch`` - Integer values indicating the number of allowed mismatches in either R1 or R2 - leading sequence trimming. Recommend to set to less than 10% error. - -``R2odnMismatch`` - Integer value indicating the number of allowed mismatches in the unprimed - ODN sequence, typically should be set to 0. - -``R{1/2}overMismatch`` - Integer values indicating the number of allowed mismatches in either R1 or R2 - overreading trimming. This is converted into a percent matching and should be - thought of as a number of mismatches allowed out of the total length of the - overreading trim sequence. - -``R{1/2}overMaxLength`` - Searching for overread trimming in sequences can be time consuming while not - producing different results. For this the total length of searched for - sequences can be limited here. For example, if ``ATGCGTCGATCGTACTGCGTTCGAC`` - is used as the overreading sequence, and 5 mismatches are allowed, then the - tolerance will be 5/25 or 80% matching, but only the first 20 nucleotides of - the sequence will be aligned for overtrimming, ``ATGCGTCGATCGTACTGCGT``. With - an 80% matching requirement, 16 out of 20 nucleotides will need to align for - overread trimming to be initiated. - - -Reference Alignment -""""""""""""""""""" - -``BLATparams`` - A character string to be included with the BLAT call. For options, please see - the BLAT help options by typing ``blat`` into the commandline after - activating ``iguide``. - -``BWAparams`` - A character string to be inclued with the BWA call. BWA is not currently - supported, so this parameter is currently silent. - - -Post-alignment filtering -"""""""""""""""""""""""" - -``maxAlignStart`` - Integer value indicating the number of nucleotides at the beginning of the - alignment that will be allowed to not align. Another way of thinking of this - is the maximum start position on the query rather than the target reference. - A default value of 5 means that the alignment needs to start in the first 5 - nucleotides or the alignment is discarded during quality control filtering. - -``minPercentIdentity`` - This is a value between 0 and 100 indicating the minimum global percent - identity allow for an alignment. If an alignment has less, then it is - discarded during quality control filtering. - -``{min/max}TempLength`` - Specify the minimum (min) and maximum (max) template length expected. Joined - alignments between R1 and R2 the are outside of this range are considered - artifacts and are discarded or classified as chimeras. - - -Post-processing -""""""""""""""" - -``refGenes / oncoGeneList / specialGeneList`` - These are special reference files in either text or BioConductoR's - GenomicRanges objects. They can be in an '.rds' format or table format - ('.csv' or '.tsv'). The ``file`` parameter should indicate the file path to - the file (relative paths should be relative to the SnakeFile), and the - ``symbolCol`` parameter should indicate the column in the data object which - contains the reference names to be used in the analysis. - -``maxTargetMismatch`` - The maximum number of mismatches between the reference genome and target - sequence allowed for consideration to be a target matched incorporation - site. This is an integer value and is compared to the target sequence(s). 
- -``upstreamDist`` - The distance upstream of the incorporation site to look for a target - similar sequence within the criteria specified by ``maxTargetMismatch``. - -``downstreamDist`` - The distance downstream of the incorporation site to look / include for a - target similar sequence within the criteria specified by - ``maxTargetMismatch``. - -``pileUpMin`` - An integer value indicating the number of alignments required to overlap - before being considered a 'pileUp'. - -``recoverMultihits`` - While multihit alignments are often difficult to analyze, some information - can still be gleamed from the data given reasonable assumptions. Adjusting - this parameter to ``TRUE`` will still only focuses on sites that are uniquely - mapped, but if a multihit includes a unique site and other locations, - contributions are given to the unique site location. Further, reads and their - contributions, umitags and fragments, are not double counted but instead - evenly distributed to all included unique sites. **Note**, some sequencing - artifacts may arrise in "off-target" associated sites. Users should be careful - to conclude anything from these alignment artifacts. Leaving this option as - ``FALSE`` is recommended if the user does not have a target sequence that - locates a repetitive sequence. - - -Report -"""""" - -``suppFile`` - Logical (``TRUE`` or ``FALSE``), if the supplemental file provided in - ``Supplemental_Info`` should be used in the default report generated at the - end of processing. If set to ``FALSE``, the ``Supplemental_Info`` parameter - is not required for processing. - -``{tables/figures}`` - Logicals indicating if tables and figures should be generated from the report. - Data will be included under the ``reports`` directory in the project run directory. - For figures, both PDF and PNG formats will be generated if set to ``TRUE`` at 300 dpi - while tables will be generated in a comma-separated values (csv) format. - -``reportData`` - Logical indicating if a RData object should be saved during the report - generation in the ``reports`` directory. - -``infoGraphic`` - Logical indicating if an info graphic displaying the genomic distribution of - incorporations should be generated at the beginning of the report. While - aesthetically pleasing, the graphic gives the report a unique twist and can - provide the knowledgeable user with information about the report at the very - beginning. - -``signature`` - Character string included at the beginning of reports to denote the author, - analyst, laboratory, etc. Make sure you change if you don't want Chris - getting credit for your work. diff --git a/docs/pages/config_run.rst b/docs/pages/config_run.rst deleted file mode 100644 index 86e1d578..00000000 --- a/docs/pages/config_run.rst +++ /dev/null @@ -1,203 +0,0 @@ -.. _configinfo: - -.. contents:: - :depth: 4 - - -Config - Run Specific Information -================================= - -Run configuration -""""""""""""""""" - -``Run_Name`` - This is the name of the sequencing run, and should only contain alpha-numeric - characters. Underscores (``_``) and dashes (``-``) are also allowed within the - run name parameters. Other symbols should not be included, such as a dot - (``.``). The run name is further used by the software to link files and - directories together, so it will need to be consistent whenever it is used. - Examples include: iGUIDE_190201_B6V99, 181213_PD1_T-cell_exp. - -``Sample_Info`` - This is a file path to the sample information file. 
It can either be an - absolute file path or relative file path. If the file path is relative though, - it will need to be relative to the Snakefile used by the iGUIDE software. For - more information about this file, please see the Sample Information page. - -``Supplemental_Info`` - Similar to ``Sample_Info``, this is a file path to a supplementary file which - can contain information related to experimental parameters or patient - information. This will be used during the report output, which will group - samples with identical parameters. The format for this file is quite loose, - and it only requires a single column ``Specimen``, which should match the - names of specimens in the sample information file. For more information about - this file, please see the Supplemental Information page. If no file is to be - used, set the value for this parameter to ``"."`` and make sure to set the - ``suppFile`` in the run protion of the config to ``FALSE``. - -``Ref_Genome`` - This is a designation for the reference genome to used during processing. The - genome will need to be included in the R libraries through BioConductoR prior - to running the software. The human genome draft ``hg38`` is included by - default. Please see information on the BioConductoR package 'BSgenome' for - installing alternative genomes. - -``Ref_Genome_Path`` - This is the file path (following the same workflow as the ``Sample_Info`` - parameter) to a reference genome file, if one is already available in a fasta - format. - -``Aligner`` - Options include either 'blat' or 'bwa', though at this time, only 'blat' is - supported. Future versions of iGUIDE will support other alignment softwares. - -``UMItags`` - This is a logical parameter indicating whether to use unique molecular indices - (UMI) sequence tags ('TRUE') or to only use unique fragments lengths (see - `SonicAbundance `) to quantify - abundances of unique observations. - - -Sequence files -"""""""""""""" - -``Seq_Path`` - This is the file path to the sequence files. Rather than repeating the path - for each below, just include the path to the directory containing the files. - -``R1 / R2 / I1 / I2`` - These parameters should be the file names of the sequence files to be - analyzed by the iGUIDE software. It is recommened to pass complete sequencing - files to iGUIDE rather than demultiplexing prior to analysis. - -``Demulti_Dir`` - Path to the directory containing demultiplexed sequence data. This is still - under development and may present with bugs. - - -SampleInfo formating -"""""""""""""""""""" - -``Sample_Name_Column`` - This is the name of the column in the sample information file which contains - identifiable information about samples. An appropriate format for the sample - names is "{specimen}-{rep}" where 'specimen' is an alpha-numeric designator - for the specimen and 'rep' is a numeric identifier for technical or biological - replicates, separated by a dash (``-``). Replicates will be pooled during the - final analysis, so if you want them to be separate in the report, make sure - you give each specimen a different identifier. For example, iGSP0002-1 and - iGSP0002-2, will be pooled together for the report and analysis, but - iGSP0002-1 and iGSP0003-1 will not. These names will be used in naming files, - so do not include any special characters that will confuse file managment. - Try to stick to common delimiters, such as "-", "_", ".". 
A good practice is - to put specimen identifiers at the beginning, replicate identifiers at the end - following a "-", and anything else descriptive in the middle. For example, - iGSP0002-neg-1, can specify the orientation the sample was processed with. - - -Sequence information -"""""""""""""""""""" - -``R{1/2}_Leading_Trim`` - Sequence to be removed from the 5' or beginning of the R1 or R2 sequences. - Commonly a linker or fixed sequence that is part of the priming scheme during - amplification. If no sequence should be removed, just include ``"."``. If the - sequence is sample or specimen specific, it can be included in the sample - information file and indicated in these fields as ``"sampleInfo:{column}"``, - where 'column' is the column name with the data in the sample information - file. - -``R{1/2}_Overreading_Trim`` - Similar to the ``Leading_Trim`` parameters, these parameters indicate the - sequence that should be removed from the 3' or end of the reads if it is - present. Again, if no sequence should be removed, use a ``"."`` or if the data - is present in the sample information file, ``"sampleInfo:{column}"``. - -``R2_Leading_Trim_ODN`` - This is a key parameter difference between iGUIDE and the original GUIDEseq - method. This parameter indicates the sequence that is part of the dsODN but is - **not** primed against. This sequence should directly follow the - ``R2_Leading_Trim`` sequence and should be a reverse complement of the - beginning of the ``R1_Overreading_Trim`` sequence if the iGUIDE dsODN is being - used. For GUIDEseq, simply include ``"."``, or if you have multiple sequences, - then specify in the sample information file as ``"sampleInfo:{column}"``. - - -Target sequence information -""""""""""""""""""""""""""" - -``Target_Sequences`` - This parameter specifies the target sequences, **not including** the PAM - sequences for guide RNAs. An acceptable input format would be - ``{target_name} : "{sequence}"`` (i.e. ``B2M.3 : "GAGTAGCGCGAGCACAGCTANGG"``) - and additional target sequences can be included, one per line, and each - indented at the same level. The input format of - ``{target_name} : {target_seq}`` needs to be maintained for proper function. - The 'target_name' in this situation will need to match the 'target_name' used - in the ``On_Target_Sites`` and ``Treatment`` parameters. 'target_name' should - follow a common format, and use standard delimiters, such as "-", "_", and - ".". For example: ``B2M.3``, ``TRAC.1.5``, ``TruCD33v5``. - -``On_Target_Sites`` - This parameter indicates the specific location for editing by the target - enzyme. There should be one line for each on-target site, even if there are - more than one on-target sites for a given target sequence. Typically the input - format should follow ``{target_name} : "{seqname}:{+/-}:{position}"``, where - 'target_name' matches the name of the given target sequence, and if multiple - on-target sites exist, then the names can be expanded using a - ``{target_name}'#`` notation. Additionally, the notation can be expanded to - ``{target_name} : "{seqname}:{+/-/*}:{min.position}-{max.position}"``, where - '*' indicates either orientation and 'min.position' and 'max.position' - represent the numerical range for the on-target site. The value for each - on-target site specifies the location or genomic coordinates of nuclease - activity. 
The 'seqname' indicates the chromosome or sequence name, an - orientation of '+' or '-' is given to the location depending on the editing - orientation (in line with positional numbering is '+' and opposite is '-', - unknown or both is '*'), and the 'position' or 'min/max.position' indicates - the nucleotide(s) of editing. For Cas9, the position of editing is commonly - between the 3rd and 4th nucleotide from the 3' end of the targeting sequence - (not including the PAM). Being off by a nucleotide or so will not cause any - problems. Example below. - - .. code-block:: shell - - On_Target_Sites : - TRAC.5 : "chr14:+:22547664" - TRBC.4'1 : "chr7:+:142792020" - TRBC.4'2 : "chr7:+:142801367" - PD1.3 : "chr2:-:241858808" - TRAC.3.4 : "chr14:-:22550616-22550625" - B2M.3 : "chr15:*:44711569-44711570" - CIITA.15.1 : "chr16:+:10916399" - - -Specimen target treatment -""""""""""""""""""""""""" - -``Treatment`` - This parameter indicates how samples were treated. If samples were all treated - differently, then this information can be included in the sample information - file as ``all : "sampleInfo:{column}"`` where 'column' is the name of the - column with the information. If a single sample was treated with more than one - target sequence, then delimit multiple target names by a semicolon (``;``), - i.e. ``all : "B2M;TRAC;TRBC"``. Additionally, each specimen can be indicated - individually on a new line. Only specimen names should be given here and - provided individually, not sample identifiers. This means that if your sample - names follow the suggested format, "{specimen}-{replicate}", you would only - specify the "{specimen} : {treatment}" underneath this parameter. - - -Specimen nuclease treatment - -``Nuclease`` - Similar to target treatment above, this parameter dictates which nuclease(s) - where used on the specimens. This refers to the class of nuclease, such as - Cas9 or Cpf1, which behave differently when they edit DNA. Notation can follow - the same as above, if all specimens were treated with the same class of - nuclease, then just specify 'all : "{nuclease_profile}"', or list out by - specimen. Additionally you can specify the column in sampleInfo in the same - format as above. Currently, iGUIDE does not support processing for specimens - with multiple classes of nuclease profiles. Only one profile can be specified - per specimen. - - \ No newline at end of file diff --git a/docs/pages/config_setup.rst b/docs/pages/config_setup.rst deleted file mode 100644 index b7224ae8..00000000 --- a/docs/pages/config_setup.rst +++ /dev/null @@ -1,40 +0,0 @@ -.. _configinfo: - -.. contents:: - :depth: 4 - -Config - Setup -============== - -Configuration files, or configs for short, contain both run-related and -pipeline-related information. This is by design. For reproducibility it is -easiest to have what was processed and how it was processed in the same -location. There should be one config file for each sequencing run to be -processed. Below is a brief summary of how to 'configure' your config file to -your specific run. - -Config files need to be named in the format '{RunName}.config.yml', where -``{RunName}`` is a parameter set within the config file for the run. For -example, the default run configuration file is named ``simulation.config.yml``, -so the run name is ``simulation``. - -Config files can be deposited anywhere in the users directory, but a dediacted -directory has been included in the release of iGUIDE. For convienence, config -files can be placed in ``iGUIDE/configs/``. 
- -For sample specific information, input is more easily placed in a sampleInfo -file. See the included section regarding sample info files. - -Config File Layout ------------------- - -Config files are in a ``yaml`` format, but are broken into two parts. The first -contains run specific information that should be filled out by an individual -familiar with the sequence data used in the laboratory bench-side protocol. -Additionally, they should be aware of the biochemistry related to the enzymes -and sequences they are using. - -The second part (below the divide ``----``) should be filled out by an -individual familiar with the bioinformatic processing. Explanations of the -different portions can be found in the following pages. - diff --git a/docs/pages/install.rst b/docs/pages/install.rst deleted file mode 100644 index b4fc8c7d..00000000 --- a/docs/pages/install.rst +++ /dev/null @@ -1,37 +0,0 @@ -.. _install: - -.. contents:: - :depth: 3 - -======= -Install -======= - -To install iGUIDE, simply clone the repository to the desired destination. - -.. code-block:: shell - - git clone https://github.com/cnobles/iGUIDE.git - -Then initiate the install using the install script. If you would like the -installed environment to be named something other than 'iguide', the new conda -environment name can be provided to the ``install.sh`` script as provided below. - -.. code-block:: shell - - cd path/to/iGUIDE - bash install.sh - -Or: - -.. code-block:: shell - - cd path/to/iGUIDE - bash install.sh -e {env_name} - -Additionally, help information on how to use the ``install.sh`` can be accessed -by: - -.. code-block:: shell - - bash install.sh -h diff --git a/docs/pages/sampleinfo.rst b/docs/pages/sampleinfo.rst deleted file mode 100644 index 2b25bb4d..00000000 --- a/docs/pages/sampleinfo.rst +++ /dev/null @@ -1,40 +0,0 @@ -.. _sampleinfo: - -.. contents:: - :depth: 2 - - -Sample Information Files -======================== - -Sample information files (or sampleInfo files) contain information that may -change from specimen to specimen. They need to contain at lease 3 columns of -information: sample names, barcode 1, and barcode 2 sequences. Additionally, -other parameters defined in the config file can be defined in the sample -information file if they change from specimen to specimen. - -Run specific config file will need to point to the sample information files. For -convienence, a directory can be found at ``iGUIDE/sampleInfo/`` for depositing -these files. - -SampleInfo files need to have a specific naming format that follows -'{RunName}.sampleinfo.csv'. - -An appropriate format for the sample names is "{specimen}-{rep}" where -'specimen' is an alpha-numeric designator for the specimen and 'rep' is a -numeric identifier for technical or biological replicates, separated by a dash -(``-``). Replicates will be pooled during the final analysis, so if you want -them to be separate in the report, make sure you give each specimen a different -identifier. - -For example, iGSP0002-1 and iGSP0002-2, will be pooled together for -the report and analysis, but iGSP0002-1 and iGSP0003-1 will not. These names -will be used in naming files, so do not include any special characters that will -confuse file managment. Try to stick to common delimiters, such as ``-`` and ``_``. -Using a dot, ``.``, as a delimiter is not currently supported. - -A good practice is to put specimen identifiers at the beginning, replicate -identifiers at the end following a "-", and anything else descriptive in the -middle. 
For example, iGSP0002-neg-1, can specify the orientation the sample was -processed with. - diff --git a/docs/pages/suppinfo.rst b/docs/pages/suppinfo.rst deleted file mode 100644 index 1c283955..00000000 --- a/docs/pages/suppinfo.rst +++ /dev/null @@ -1,45 +0,0 @@ -.. _sampleinfo: - -.. contents:: - :depth: 2 - - -Supplemental Information Files -======================== - -Supplemental information files (or supp files) contain information that may -change from specimen to specimen. They have only one required column, -"Specimen", but subsequence columns will be used to define conditions. Let's use -the below supp file as an example.:: - - # Supplemental csv file example, padding included for visualization - Specimen, Nuclease, gRNA - iGXA, Cas9, TRAC - iGXB, Cas9, TRAC - iGXC, Cas9, B2M - iGXD, Cas9, B2M - iGXE, Mock, Mock - iGXF, Mock, Mock - -This type of setup would indicate that there are 6 specimens to be analyzed -(iGXA - iGXF). Each of these would correlate with their sampleName'd replicates, -so for iGXA, all samples with the format iGXA-{number} or iGXA-{info}-{number} -would be pooled into the iGXA specimen. - -Additionally, there are three conditions, defined by the distinct data excluding -information in the "Specimen" column. So in this case, the conditions are -"Cas9-TRAC", "Cas9-B2M", and "Mock-Mock". Within the report format, there are -several analyses that are conditionally based rather than specimen based. This -adds to the flexibility and utility of the reporting functions supplied with -iGUIDE. - -If the user would rather ever specimen analyzed independently and reported in -that manner, then they can either run a report without a supp file or in a supp -file include a column that distinguishes each specimen from each other. - -Column names and formating are transferred directly into the report. -Additionally, this files sets the order presented in the report. If "iGXC" -comes before "iGXB" in the supp file, the it will be orderd as so throughout the -report. Conditions, as well, follow this format. As presented above, the report -will order the conditions in the following order "Cas9-TRAC", "Cas9-B2M", and -"Mock-Mock", which is the order of first observation. diff --git a/docs/pages/quickstart.rst b/docs/quickstart.rst similarity index 62% rename from docs/pages/quickstart.rst rename to docs/quickstart.rst index 3705f635..415529b0 100644 --- a/docs/pages/quickstart.rst +++ b/docs/quickstart.rst @@ -1,15 +1,42 @@ .. _quickstart: +Quickstart Guide +================ + .. contents:: :depth: 2 +Install +******* + +To install iGUIDE, simply clone the repository to the desired destination.:: + + git clone https://github.com/cnobles/iGUIDE.git + +Then initiate the install using the install script. 
If you would like the
+installed environment to be named something other than 'iguide', the new conda
+environment name can be provided to the ``install.sh`` script as provided
+below.::
+
+    cd path/to/iGUIDE
+    bash install.sh
+
+Or::
+
+    cd path/to/iGUIDE
+    bash install.sh -e {env_name}
+
+Additionally, help information on how to use the ``install.sh`` can be accessed
+by::
+
+    bash install.sh -h
 
 
-Initializing a Run
-------------------
-Once the config and sampleInfo files have been configured, a run directory can
-be created using the command below where {ConfigFile} is the path to your
+Set Up a Run
+************
+
+Once the config and sampleInfo files have been configured, a run directory
+can be created using the command below, where {ConfigFile} is the path to your
 configuration file::
 
     cd path/to/iGUIDE
@@ -43,25 +70,25 @@ files into the /analysis/{RunName}/input_data directory.
 Copy the fastq.gz files from the sequencing instrument into this directory if
 you do not have paths to the files specified in the config file.
 
-Currently, iGUIDE needs each of the sequencing files (R1, R2, I1, and I2) for
+iGUIDE typically uses each of the sequencing files (R1, R2, I1, and I2) for
 processing since it is based on a dual barcoding scheme. If I1 and I2 are
 concatenated into the read names of R1 and R2, it is recommended that you run
 ``bcl2fastq ... --create-fastq-for-index-reads`` on the machine output
 directory to generate the I1 and I2 files.
 
 As iGUIDE has its own demultiplexing, it is recommended not to use the Illumina
-machine demultiplexing through input of index sequences in the SampleSheet.csv.
-See SampleSheet example in XXX. If sequence files are demultiplexed, they can be
-concatenated together into one file for each type of read using 'zcat'.
+machine demultiplexing through input of index sequences in the SampleSheet.csv.
+If your sequence data has already been demultiplexed though, please see the
+:ref:`usage` for setup instructions.
 
-List Samples for a Run
----------------------
+List Samples in a Run
+*********************
 
 As long as the config and sampleInfo files are present and in their respective
 locations, you can get a quick view of what samples are related to the project.
-Using the 'list_samples' subcommand will produce an overview table on the
-console or write the table to a file (specified by the output option).
+Using the ``iguide list_samples`` command will produce an overview table on
+the console or write the table to a file (specified by the output option).
 Additionally, if a supplemental information file is associated with the run, the
 data will be combined with the listed table.::
 
@@ -77,7 +104,7 @@
 
 Processing a Run
-----------------
+****************
 
 Once the input_data directory has the required sequencing files, the run can be
 processed using the following command::
 
@@ -87,41 +114,20 @@
 
 Snakemake offers a great number of resources for managing the processing
 through the pipeline. I recommend familiarizing yourself with the utility
-(https://snakemake.readthedocs.io/en/stable/). Here are some helpful snakemake
-options that can be passed to iGUIDE by appending to the iguide command after
-``--``:
-
-* ``[--configfile X]`` associate a specific configuration for processing,
-  essential for processing but already passed in by ``iguide``.
-* ``[--cores X]`` multicored processing, specified cores to use by X.
-* ``[--nolock]`` process multiple runs a the same time, from different sessions.
-* ``[--notemp]`` keep all temporary files, otherwise removed.
-* ``[--keep-going]`` will keep processing if one or more job error out.
-* ``[-w X, --latency-wait X]`` wait X seconds for the output files to appear
-  before erroring out.
-* ``[--restart-times X]`` X is the number of time to restart a job if it fails.
-  Defaults to 0, but is used in ``iguide`` to increase memory allocation.
-* ``[--resources mem_mb=X]`` Defined resources, for ``iguide`` the mem_mb is the
-  MB units to allow for memory allocation to the whole run. For HPC, this can be
-  coupled with ``--cluster-config`` to request specific resources for each job.
-* ``[--rerun-incomplete, --ri]`` Re-run all jobs that the output is recognized
-  as incomplete, useful if your run gets terminated before finishing.
-* ``[--cluster-config FILE]`` A JSON or YAML file that defines wildcards used
-  for HPC.
-
-
-An Example Run
---------------
+(https://snakemake.readthedocs.io/en/stable/).
+
+
+An Example Workflow
+*******************
 
 To perform a local test of running the iGUIDE informatic pipeline, run the
 below code after installing. This block first activates your conda environment,
-``iguide`` by default, and then creates a test directory within the analysis
+'iguide' by default, and then creates a test directory within the analysis
 directory. The run information is stored in the run specific configuration file
 (config file). Using the ``-np`` flag with the snakemake call will perform a
 dry-run (won't actually process anything) and print the commands to the
-terminal, so you can see what snakemake is about to perform. Next, the test data
-is moved to the input directory underneath the new test run directory. Then the
-entirety of processing can start.::
+terminal, so you can see what snakemake is about to perform. Then the entirety
+of processing can start.::
 
@@ -162,7 +168,7 @@ entirety of processing can start.::
      -s sampleInfo/simulation.supp.csv \
      -t pdf
 
-  # When you are all finished and ready to archive / remove excess files, a minimal configuration
+  # When you are all finished and ready to archive / remove excess files, a minimal structure
   # can be achieved with the 'clean' subcommand.
 
   iguide clean configs/simulation.config.yml
 
@@ -175,37 +181,9 @@ entirety of processing can start.::
 
   conda deactivate
 
-Uninstall
----------
-
-To uninstall iGUIDE, the user will need to remove the environment and the
-directory.
-
-To remove the environment and channels used with conda::
-
-  cd path/to/iGUIDE
-  bash etc/uninstall.sh
-
-Or::
-
-  cd path/to/iGUIDE
-  bash etc/uninstall.sh {env_name}
-
-If the user would rather remove the environment created for iGUIDE, it is
-recommended directly use conda. This will leave the channels within the conda
-config for use with other conda configurations::
-
-  conda env remove -n iguide
-
-Or::
-
-  conda env remove -n {env_name}
-
-To remove the iGUIDE directory and conda, the following two commands can be
-used::
 
-  # Remove iGUIDE directory and software
-  rm -r path/to/iGUIDE
+Reviewing Results
+*****************
 
-  # Remove conda
-  rm -r path/to/miniconda3
+The output reports from a run are deposited under
+``analysis/{RunName}/reports``. For more information on output files, see
+:ref:`usage`!
diff --git a/docs/usage.rst b/docs/usage.rst
new file mode 100644
index 00000000..241e03a1
--- /dev/null
+++ b/docs/usage.rst
@@ -0,0 +1,1261 @@
+.. _usage:
+
+User Guide
+==========
+
+.. contents::
+   :depth: 3
+
+
+Nomenclature and Semantics
+**************************
+
+Before diving too far in, it is important to understand some of the
+nomenclature and semantics used throughout this documentation. Focus is mostly
+on words that are important to the workflow and that may be ambiguous when used
+without definition. Some of these words may seem like they overlap in
+definition, and in certain situations, they do. In these situations, we should
+still give a proper designation to each category to distinguish them from each
+other.
+
+
+The three S's
+-------------
+
+We'll focus first on the three S's that relate to what we are working with:
+Subject, Specimen, and Sample.
+
+* Subject: the who or what we are working with; this could be a patient or an
+  experiment. It is important to remember which of the downstream identifiers
+  are associated with a specific subject.
+* Specimen: is collected from a subject and is the start of the protocol. A
+  specimen for iGUIDE could be considered a tube of starting gDNA. This will be
+  the actual material that will be worked with for the protocol.
+* Sample: While many people commonly use specimen and sample interchangeably,
+  here we note that a sample comes from a specimen. We make this distinction
+  because we realize there are multiple ways to work up a single specimen, and
+  each of these different ways is a different sample. Samples are taken from the
+  specimen, just like the specimen is taken from the Subject.
+
+In the following workflow, you'll notice that in certain places, we refer to
+'sampleName' (such as in the sampleInfo file), or 'specimen' (such as in the
+supplemental data file). These designations are consistent with the above
+definitions and it is expected that the user will follow these conventions.
+
+How do we distinguish Subject, Specimen, and Sample? During processing, these
+identifiers will need to be distinguished from each other using different
+nomenclature. Below is an example of a naming scheme for the three
+identifiers.::
+
+  Subject     Specimen     Sample
+  {patient}   {Spec.ID}    {Spec.ID-info-rep.ID}
+  pND405      iGSP0015     iGSP0015-neg-1
+  pND405      iGSP0015     iGSP0015-neg-2
+  pND405      iGSP0015     iGSP0015-neg-3
+
+Here we have an example workflow. Subject identifiers are not usually part of
+the processing; we consider the Subject typically during data interpretation
+through reviewing the output reports. Subject identifiers can be included in
+supplemental files with runs. Specimens can have identifiers (or IDs). For
+iGUIDE, it is easiest to use a single alpha-numeric string as an identifier
+(without delimiters!).
+
+Following the above practice, the specimen ID can be included in a sample ID
+(or sampleName) along with additional information. As indicated above, iGUIDE
+will treat sampleNames as three-part strings: the specimen ID is at the
+beginning, delimited (or separated by a "-") from additional information.
+The last part of the string is a replicate identifier, expected to be numeric.
+In practice, we find it best to create 4 samples for processing from a single
+specimen. This limits the possibility for PCR jackpotting and allows an analyst
+to utilize capture-recapture statistics for population estimation. The remainder
+of the string that is not captured in the first or last components is not
+directly used by iGUIDE, except as part of the unique sample identifier.
+Therefore, it is a great place to indicate sample-specific treatments.
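+
+As a minimal sketch, the three samples above might appear in a sampleInfo file
+as follows (the column names mirror the sample information section later in
+this guide, and the barcode sequences are hypothetical placeholders).::
+
+  # Hypothetical sampleInfo csv rows for the naming scheme above
+  sampleName, barcode1, barcode2
+  iGSP0015-neg-1, ACGTACGT, TTGGCCAA
+  iGSP0015-neg-2, ACGTACGT, CCAATTGG
+  iGSP0015-neg-3, ACGTACGT, GGTTAACC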
+
+Given the above example, three different samples have been indicated, all from
+a single specimen and a single subject. During processing, the user will
+indicate each sample in the sampleName column of the sampleInfo file. When
+iGUIDE returns the analysis, each specimen will be indicated. So while three
+samples go into the pipeline, data will be combined in the output to represent
+the single specimen.
+
+More information about specimens and sampleNames can be found later in this
+user guide.
+
+
+The Experiment
+--------------
+
+While writing this documentation, I thought it would be helpful to explain in a
+general sense what an experiment might look like in the terminology of this
+software.
+
+For a given subject (patient, individual experiment, ...) that has been
+treated with the marker dsODN during genome editing, specimens are harvested
+from various conditions (with nuclease, with different targets controlled by
+gRNAs, ...). This harvesting yields genomic DNA, which is commonly cataloged
+into a specimen database holding metadata and experimental parameters for the
+different specimens.
+
+Samples are then taken from these specimens, typically 4 samples (see the
+protocol from the iGUIDE manuscript), and processed through the iGUIDE
+protocol. Before sequencing, a sampleInfo sheet would be constructed where each
+row of a csv file indicates a different sample that was processed along with
+the sample's barcode and demultiplexing information.
+
+During sequencing (or after), a run-specific configuration file (config file)
+would be constructed by one or two parties. There is run-specific information
+that needs to be included, such as: target sequence patterns, nuclease profiles,
+treatment information, etc. If a variable changes throughout the samples, then
+it can be indicated in the sampleInfo file, while if it is constant, it can be
+indicated in the config file.
+
+The latter part of the config is reviewed and checked by the individual who will
+computationally process the run. This portion of the config file contains
+parameters that modify or tune the software to run on different systems.
+
+After the computational processing has completed, a stat report and analytical
+report are generated in the reports directory. These can be reviewed by the
+respective parties.
+
+Additionally, if multiple runs contain samples to be analyzed together,
+auxiliary commands in iGUIDE allow the computational analyst to generate new
+reports combining multiple sequencing runs together.
+
+If the user is unsure whether their experiment would work with this type of
+analysis, feel free to contact the maintainers of iGUIDE.
+
+
+Subcommands
+***********
+
+Once installed, iGUIDE utilization is broken down into subcommands as indicated
+in Figure 1 below. These commands are reviewed here to give the user an
+understanding of how the software works from a workflow viewpoint.
+
+.. figure:: /iguide_subcmd_fig.*
+   :figwidth: 75%
+   :align: center
+
+   Figure 1. iGUIDE Subcommands: setup, run, eval, report, summary, clean.
+
+Primary subcommands: Used for the standard or primary workflow of processing
+sequencing runs.
+
+* ``setup`` : This subcommand initiates a project or run directory. It requires
+  a config file and will create the new project directory within the
+  iGUIDE/analysis directory.
+* ``run`` : This subcommand will process a run given a config file using a
+  Snakemake workflow (https://snakemake.readthedocs.io/en/stable/).
+  Therefore, Snakemake-specific commands can be passed into the ``run``
+  subcommand. All Snakemake-specific commands should come after a ``--`` break
+  in the options.
+
+Auxiliary subcommands: Used for auxiliary workflows which dive further into
+analyzing the processed data.
+
+* ``eval`` : Short for evaluation, this subcommand will analyze a run's data and
+  yield an RDS file (R-based data file). Supplemental data can additionally be
+  passed into the evaluation to group specimens together for analysis and
+  include metadata. This output object holds a host of broad analyses that are
+  based on the input information.
+* ``report`` : This will generate a full report on the given config file(s) or
+  input evaluated RDS file. The report is produced by default as an html
+  document but can be changed to a pdf if the correct LaTeX libraries are
+  installed. Additionally, all figures and tables can be output as independent
+  files (pdf and png formats for figures and csv formats for tables).
+* ``summary`` : Similar to the report but with some reduced utility, this
+  subcommand will output a single text file that overviews the data. This is
+  readable on the terminal and is helpful for getting quick answers to data
+  questions if working on the command line.
+
+Additional subcommands: Used for cleanup and helpful suggestions for processing.
+
+* ``clean`` : After processing, most intermediate data files are removed as they
+  are designated temporary, but other files still exist within the run directory
+  that may inflate its size and are no longer needed, such as input data and
+  log files. The ``clean`` subcommand will remove files no longer required. A
+  "clean" run directory can still be used with ``eval``, ``report``, and
+  ``summary``. Additionally, this subcommand can remove the entire run directory
+  by passing the ``--remove_proj`` flag.
+* ``hints`` : Prints out a message with Snakemake option hints to help with
+  using the ``run`` subcommand.
+
+
+Workflows
+*********
+
+A workflow is simply how data is moved from an unprocessed state (like
+sequencing data off an Illumina sequencer) to a processed state (a final
+report). Below we will review the primary and auxiliary workflows iGUIDE is
+designed to handle.
+
+
+Primary Workflow
+----------------
+
+In the primary workflow, we consider how to get from input sequence information
+to processed reports. To initiate this process, the user needs to gather the
+information and complete two files, the configuration file (config file) and the
+sample information file (sampleInfo file). These two files will tell iGUIDE how
+to process the sequence information; sample-specific parameters should be
+included in the sampleInfo file while constant parameters can simply be
+specified in the config file. Once these two files are completed, they can be
+deposited into their respective directories (config file --> iGUIDE/configs and
+sampleInfo file --> iGUIDE/sampleInfo). Additionally, if a supplemental file
+(supp file) is to be included, it is easiest to deposit this file with the
+sampleInfo file, in iGUIDE/sampleInfo.
+
+.. figure:: /iguide_prime_workflow_fig.*
+   :figwidth: 100%
+   :align: center
+
+   Figure 2. Primary workflow for processing input sequencing files to processed
+   runs with data deliverables like reports and figures.
+
+With the config, sampleInfo, and potentially supp files in place, the user can
+use ``iguide setup {path/to/[run].config.yml}`` to create a new run directory.
+In Figure 2, three runs have been developed, named proj1, proj2, and proj3.
+Each of these would have a different config and sampleInfo file. With the files
+in their respective directories, the user would run
+``iguide setup configs/proj1.config.yml`` to create the "proj1" run directory
+in the analysis directory, and then repeat the command with the other two config
+files to have a total of three empty run directories under the analysis
+directory.
+
+Once the run directories are set up, the input data needs to be located. This
+can be done in a number of ways. In the config file, the user can specify the
+path to the sequence files (preferably not demultiplexed, see later sections
+for skipping demultiplexing). The user can create symbolic links to the data
+within the input_data directory of the run directory, or the user can simply
+deposit the sequence files (fastq.gz) into the input_data directory.
+
+With the config file, sampleInfo file, and sequencing files ready, the user can
+start processing with ``iguide run configs/{run}.config.yml``. Recall that the
+``run`` subcommand is built on a Snakemake workflow, so additional Snakemake
+options can be passed after ``--`` when issuing the command. For example,
+``iguide run configs/proj1.config.yml -- --cores 6 --nolock -k`` tells
+Snakemake to use 6 cores for processing, to not lock the working directory
+(helpful for running multiple processing runs at the same time), and to keep
+going even if one job has an error.
+
+Allowing the ``iguide run`` command to go to completion will yield a processed
+data run. At this point, if calling the same "run" command on a project,
+Snakemake should return a message indicating that there is nothing to do. If
+for some reason processing gets terminated, ``iguide run`` and Snakemake will
+pick up from where processing left off.
+
+If the user is content with the processing, then they can run the
+``iguide clean`` command to clean up a specific run directory (shown in
+Figure 3 below). This leaves the output data (useful in the auxiliary workflow)
+and the reports, but will remove input_data and log files. Additionally, if the
+user wants to remove the run directory completely, they can also use the
+``iguide clean`` command with an optional flag.
+
+
+Auxiliary Workflow
+------------------
+
+After running the primary workflow on several runs, or if the user would like
+to change specific parameters (gene lists, target sequences, ...), the
+auxiliary workflow becomes quite useful.
+
+.. figure:: /iguide_aux_workflow_fig.*
+   :figwidth: 100%
+   :align: center
+
+   Figure 3. Auxiliary workflow helps with subsequent analysis of the processed
+   data.
+
+There are three subcommands included in this workflow: ``eval``, ``report``, and
+``summary``. Each of them works in similar ways, but they have different
+outputs.
+
+The ``iguide eval`` command is the focal point of the auxiliary workflow. This
+command will process one or more runs and analyze them in a consistent manner,
+so the user is confident they don't have a mixed data set. This subcommand will
+output a binary R-based file (\*.rds) which can be read into an R environment
+with the function base::readRDS(). This file contains a host of analyses and
+can be used with the other two subcommands, ``report`` and ``summary``.
+
+The ``iguide report`` command will output an html or pdf analysis of the
+evaluated dataset. This is the standard deliverable from the iGUIDE package.
+Additionally, the command can generate the figures and tables along with the
+report.
+``iguide summary`` is very similar, but it only generates a text-file based
+report. Both will take ``eval`` output files as an input, but they can also be
+used with the same input as would be given to ``eval``: config file(s).
+
+Supplemental files carrying specimen-based metadata can also be included in the
+auxiliary commands. Any specimen not indicated in the supp file will be dropped
+from the analysis. This means the user can select which samples are included in
+the analysis by specifying the associated specimens to include, even if the
+specimens are across multiple runs.
+
+With this knowledge in hand, the remainder of the documentation should have more
+context as to how it is applied to processing data with the iGUIDE software.
+
+
+Requirements
+************
+
+- A relatively recent Linux computer with more than 2 GB of RAM
+
+We do not currently support Windows or Mac. (iGUIDE may be able to run on
+Windows using the `WSL <https://docs.microsoft.com/en-us/windows/wsl/about>`_,
+but it has not been tested).
+
+
+Installing
+**********
+
+To install iGUIDE, simply clone the repository to the desired destination.::
+
+  git clone https://github.com/cnobles/iGUIDE.git
+
+Then initiate the install using the install script. If the user would like the
+installed environment to be named something other than 'iguide', the new conda
+environment name can be provided to the ``install.sh`` script as shown below.::
+
+  cd path/to/iGUIDE
+  bash install.sh
+
+Or specify a different environment name.::
+
+  cd path/to/iGUIDE
+  bash install.sh -e {env_name}
+
+Additionally, help information on how to use ``install.sh`` can be accessed
+with the ``-h`` flag.::
+
+  bash install.sh -h
+
+
+Testing
+-------
+
+If the user would like to run a test of the software during the installation,
+the install script has a ``-t`` option that helps with just that. The below
+command will install the software with the environment named 'iguide' and test
+the software with the built-in simulated dataset during installation. Be ready
+for the testing to take a little bit of time though (up to 30 mins or so).::
+
+  bash install.sh -e iguide -t
+
+Otherwise, the testing can be initiated after install using the following
+command.::
+
+  bash etc/tests/test.sh {env} {cores}
+
+Where ``{env}`` would be the environment the user would like to test, "iguide"
+by default, and ``{cores}`` would be the number of cores to run the test on. The
+test will complete faster given more cores.
+
+The test dataset can be regenerated with a script provided in the
+iGUIDE/etc/tests/construct_scripts directory, ``simulate_incorp_data.R``. This
+script is configured by a partner config.yml file, ``sim_config.yml``. With a
+quick look through this configuration, the user can change the size of the
+simulated data output, rerun the script to generate new data, and develop a new
+test for iGUIDE.::
+
+  cd etc/tests/construct_scripts
+  Rscript simulate_incorp_data.R sim_config.yml
+
+There are two scripts included in the tools/rscripts directory that work with
+the simulated data. The first is designed to check the accuracy compared to the
+"truth" dataset that the simulated data was built on. To run that script, follow
+the command below.::
+
+  Rscript tools/rscripts/check_test_accuracy.R configs/simulation.config.yml etc/tests/Data/truth.csv -v
+
+The second script checks output files by their md5 digests; therefore, any
+changes to the test (including generating new data, changing the aligner,
+changing parameters, ...) could make the test fail.::
+
+  Rscript tools/rscripts/check_file_digests.R etc/tests/simulation.digests.yml -v
+
+Both testing scripts will exit with exit code 1 if they fail, which makes them
+easy to build into integration testing.
+
+
+Updating
+--------
+
+Over time, components of iGUIDE will be updated, including environmental builds,
+the commandline interface (python library or lib), and the supporting R-package
+(iguideSupport or pkg), as well as the standard code base. To update these, pull
+the latest release from GitHub with the following command after installation.::
+
+  git pull origin master
+
+Once this has updated, the user should update their install by running the
+install script with the update option.::
+
+  bash install.sh -u all
+
+It is recommended to update everything if the user is unsure of what has been
+updated. If the user just wants to update specific parts of the software
+though, they can use ``env``, ``pkg``, or ``lib`` after the ``-u`` flag to
+specify a component.
+
+It is recommended that after updating, the user rerun the testing scripts to
+make sure the software is working appropriately on the specified system.
+
+
+Uninstalling
+------------
+
+To uninstall iGUIDE, the user will need to remove the environment and the
+directory.
+
+To remove the environment and channels used with conda::
+
+  cd path/to/iGUIDE
+  bash etc/uninstall.sh
+
+Or::
+
+  cd path/to/iGUIDE
+  bash etc/uninstall.sh {env_name}
+
+If the user would rather remove only the environment created for iGUIDE, it is
+recommended to use conda directly. This will leave the channels within the
+conda config for use with other conda configurations::
+
+  conda env remove -n iguide
+
+Or::
+
+  conda env remove -n {env_name}
+
+To remove the iGUIDE directory and conda, the following two commands can be
+used::
+
+  # Remove iGUIDE directory and software
+  rm -r path/to/iGUIDE
+
+  # Remove conda
+  rm -r path/to/miniconda3
+
+
+Config Files
+************
+
+Configuration files, or configs for short, contain both run-related and
+pipeline-related information. This is by design. For reproducibility, it is
+easiest to have what was processed and how it was processed in the same
+location. There should be one config file for each sequencing run to be
+processed. Below is a brief summary of how to 'configure' your config file for
+your specific run.
+
+Config files need to be named in the format '{RunName}.config.yml', where
+``{RunName}`` is a parameter set within the config file for the run. For
+example, the default run configuration file is named ``simulation.config.yml``,
+so the run name is ``simulation``.
+
+Config files can be deposited anywhere in the user's directory, but a dedicated
+directory has been included in the release of iGUIDE. For convenience, config
+files can be placed in ``iGUIDE/configs/``.
+
+For sample-specific information, input is more easily placed in a sampleInfo
+file. See the included section regarding sample info files.
+
+
+File Layout
+-----------
+
+Config files are in a ``yaml`` format, but are broken into two parts. The first
+contains run-specific information that should be filled out by an individual
+familiar with the sequence data used in the laboratory bench-side protocol.
+Additionally, they should be aware of the biochemistry related to the enzymes
+and sequences they are using.
+
+The second part (below the divide ``----``) should be filled out by an
+individual familiar with the bioinformatic processing.
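+
+As a minimal sketch, the overall layout looks something like the following
+(the parameters shown are a small subset drawn from the sections below, and
+the values are illustrative only).::
+
+  # Run specific information
+  Run_Name : "simulation"
+  Sample_Info : "sampleInfo/simulation.sampleinfo.csv"
+  Supplemental_Info : "sampleInfo/simulation.supp.csv"
+  Ref_Genome : "hg38"
+  Aligner : "blat"
+  UMItags : TRUE
+
+  ----
+
+  # Processing information
+  Read_Types : ["R1", "R2", "I1", "I2"]
+  Genomic_Reads : ["R1", "R2"]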
+Explanations of the different portions can be found in the following sections.
+
+
+Run Specific Information
+------------------------
+
+Run configuration
+"""""""""""""""""
+
+``Run_Name``
+  This is the name of the sequencing run, and should only contain alpha-numeric
+  characters. Underscores (``_``) and dashes (``-``) are also allowed within
+  the run name parameter. Other symbols should not be included, such as a dot
+  (``.``). The run name is further used by the software to link files and
+  directories together, so it will need to be consistent whenever it is used.
+  Examples include: iGUIDE_190201_B6V99, 181213_PD1_T-cell_exp.
+
+``Sample_Info``
+  This is a file path to the sample information file. It can either be an
+  absolute file path or a relative file path. If the file path is relative
+  though, it will need to be relative to the Snakefile used by the iGUIDE
+  software. For more information about this file, please see the Sample
+  Information section.
+
+``Supplemental_Info``
+  Similar to ``Sample_Info``, this is a file path to a supplementary file which
+  can contain information related to experimental parameters or patient
+  information. This will be used during the report output, which will group
+  samples with identical parameters. The format for this file is quite loose,
+  and it only requires a single column ``Specimen``, which should match the
+  names of specimens in the sample information file. For more information about
+  this file, please see the Supplemental Information section. If no file is to
+  be used, set the value for this parameter to ``"."`` and make sure to set the
+  ``suppFile`` in the run portion of the config to ``FALSE``.
+
+``Ref_Genome``
+  This is a designation for the reference genome to be used during processing.
+  The genome will need to be included in the R libraries through Bioconductor
+  prior to running the software. The human genome draft ``hg38`` is included by
+  default. Please see information on the Bioconductor package 'BSgenome' for
+  installing alternative genomes.
+
+``Aligner``
+  Options include either 'blat' or 'bwa', though at this time, only 'blat' is
+  supported. Future versions of iGUIDE may support other aligners. Please
+  contact the maintainers if you have a favorite you would like to see listed
+  here.
+
+``UMItags``
+  This is a logical parameter indicating whether to use unique molecular index
+  (UMI) sequence tags ('TRUE') or to only use unique fragment lengths (see
+  SonicAbundance) to quantify abundances of unique observations.
+
+
+Sequence files
+""""""""""""""
+
+``Seq_Path``
+  This is the file path to the sequence files. Rather than repeating the path
+  for each below, just include the path to the directory containing the files.
+
+``R1 / R2 / I1 / I2``
+  These parameters should be the file names of the sequence files to be
+  analyzed by the iGUIDE software. It is recommended to pass complete
+  sequencing files to iGUIDE rather than demultiplexing prior to analysis.
+
+
+SampleInfo formatting
+"""""""""""""""""""""
+
+``Sample_Name_Column``
+  This is the name of the column in the sample information file which contains
+  identifiable information about samples. An appropriate format for the sample
+  names is "{specimen}-{rep}" where 'specimen' is an alpha-numeric designator
+  for the specimen and 'rep' is a numeric identifier for technical or
+  biological replicates, separated by a dash (``-``). Replicates will be pooled
+  during the final analysis, so if you want them to be separate in the report,
+  make sure you give each specimen a different identifier. For example,
+  iGSP0002-1 and iGSP0002-2 will be pooled together for the report and
+  analysis, but iGSP0002-1 and iGSP0003-1 will not. These names will be used in
+  naming files, so do not include any special characters that will confuse file
+  management. Try to stick to common delimiters, such as "-" and "_". A good
+  practice is to put specimen identifiers at the beginning, replicate
+  identifiers at the end following a "-", and anything else descriptive in the
+  middle. For example, iGSP0002-neg-1 can specify the priming orientation the
+  sample was processed with.
+
+
+Sequence information
+""""""""""""""""""""
+
+``R{1/2}_Leading_Trim``
+  Sequence to be removed from the 5' or beginning of the R1 or R2 sequences.
+  Commonly a linker or fixed sequence that is part of the priming scheme during
+  amplification. If no sequence should be removed, just include ``"."``. If the
+  sequence is sample or specimen specific, it can be included in the sample
+  information file and indicated in these fields as ``"sampleInfo:{column}"``,
+  where 'column' is the column name with the data in the sample information
+  file.
+
+``R{1/2}_Overreading_Trim``
+  Similar to the ``Leading_Trim`` parameters, these parameters indicate the
+  sequence that should be removed from the 3' or end of the reads if it is
+  present. Again, if no sequence should be removed, use a ``"."`` or if the
+  data is present in the sample information file, ``"sampleInfo:{column}"``.
+
+``R2_Leading_Trim_ODN``
+  This is a key parameter difference between iGUIDE and the original GUIDEseq
+  method. This parameter indicates the sequence that is part of the dsODN but
+  is **not** primed against. This sequence should directly follow the
+  ``R2_Leading_Trim`` sequence and should be a reverse complement of the
+  beginning of the ``R1_Overreading_Trim`` sequence if the iGUIDE dsODN is
+  being used. For GUIDEseq, simply include ``"."``, or if you have multiple
+  sequences, then specify them in the sample information file as
+  ``"sampleInfo:{column}"``.
+
+
+Target sequence information
+"""""""""""""""""""""""""""
+
+``Target_Sequences``
+  This parameter specifies the target sequences, **not including** the PAM
+  sequences for guide RNAs. An acceptable input format would be
+  ``{target_name} : "{sequence}"`` (i.e. ``B2M.3 : "GAGTAGCGCGAGCACAGCTANGG"``)
+  and additional target sequences can be included, one per line, and each
+  indented at the same level. The input format of
+  ``{target_name} : {target_seq}`` needs to be maintained for proper function.
+  The 'target_name' in this situation will need to match the 'target_name' used
+  in the ``On_Target_Sites`` and ``Treatment`` parameters. 'target_name' should
+  follow a common format, and use standard delimiters, such as "-", "_", and
+  ".". For example: ``B2M.3``, ``TRAC.1.5``, ``TruCD33v5``.
+
+``On_Target_Sites``
+  This parameter indicates the specific location for editing by the target
+  enzyme. There should be one line for each on-target site, even if there is
+  more than one on-target site for a given target sequence. Typically the input
+  format should follow ``{target_name} : "{seqname}:{+/-}:{position}"``, where
+  'target_name' matches the name of the given target sequence, and if multiple
+  on-target sites exist, then the names can be expanded using a
+  ``{target_name}'#`` notation.
+  Additionally, the notation can be expanded to
+  ``{target_name} : "{seqname}:{+/-/*}:{min.position}-{max.position}"``, where
+  '*' indicates either orientation and 'min.position' and 'max.position'
+  represent the numerical range for the on-target site. The value for each
+  on-target site specifies the location or genomic coordinates of nuclease
+  activity. The 'seqname' indicates the chromosome or sequence name, an
+  orientation of '+' or '-' is given to the location depending on the editing
+  orientation (in line with positional numbering is '+' and opposite is '-',
+  unknown or both is '*'), and the 'position' or 'min/max.position' indicates
+  the nucleotide(s) of editing. For Cas9, the position of editing is commonly
+  between the 3rd and 4th nucleotide from the 3' end of the targeting sequence
+  (not including the PAM). Being off by a nucleotide or so will not cause any
+  problems. Example below.::
+
+    On_Target_Sites :
+      TRAC.5 : "chr14:+:22547664"
+      TRBC.4'1 : "chr7:+:142792020"
+      TRBC.4'2 : "chr7:+:142801367"
+      PD1.3 : "chr2:-:241858808"
+      TRAC.3.4 : "chr14:-:22550616-22550625"
+      B2M.3 : "chr15:*:44711569-44711570"
+      CIITA.15.1 : "chr16:+:10916399"
+
+
+Specimen target treatment
+"""""""""""""""""""""""""
+
+``Treatment``
+  This parameter indicates how samples were treated. If samples were all
+  treated differently, then this information can be included in the sample
+  information file as ``all : "sampleInfo:{column}"`` where 'column' is the
+  name of the column with the information. If a single sample was treated with
+  more than one target sequence, then delimit multiple target names by a
+  semicolon (``;``), i.e. ``all : "B2M;TRAC;TRBC"``. Additionally, each
+  specimen can be indicated individually on a new line. Only specimen names
+  should be given here and provided individually, not sample identifiers. This
+  means that if your sample names follow the suggested format,
+  "{specimen}-{replicate}", you would only specify the
+  "{specimen} : {treatment}" underneath this parameter.
+
+
+Specimen nuclease treatment
+"""""""""""""""""""""""""""
+
+``Nuclease``
+  Similar to the target treatment above, this parameter dictates which
+  nuclease(s) were used on the specimens. This refers to the class of nuclease,
+  such as Cas9 or Cpf1, which behave differently when they edit DNA. Notation
+  can follow the same format as above: if all specimens were treated with the
+  same class of nuclease, then just specify 'all : "{nuclease_profile}"', or
+  list out by specimen. Additionally, you can specify the column in sampleInfo
+  in the same format as above. Currently, iGUIDE does not support processing
+  for specimens with multiple classes of nuclease profiles. Only one profile
+  can be specified per specimen.
+
+``Nuclease_Profiles``
+  See the section on nuclease profiles below.
+
+
+Processing Information
+----------------------
+
+Below are parameters that are used to process the large amount of data, such as
+memory suggestions for when resources are specified, or parameters for sequence
+alignments. While these values may not be relevant to the bench scientist, they
+are particulars for computational scientists.
+
+Resource management is not required, but it can help when using HPC or limiting
+jobs. You are encouraged to spend some time optimizing if you would like; these
+parameters work out well on the designer's platform.
+
+
+iGUIDE configuration
+""""""""""""""""""""
+
+``Read_Types``
+  This parameter should include which read types will be used in the analysis,
+  i.e. ``["R1", "R2", "I1", "I2"]``. This follows list notation in Python. If
+  only single barcoding or some other method is employed and a read type is not
+  included, simply leave it out of the example.
+
+``Genomic_Reads``
+  This parameter is similar to ``Read_Types`` but only indicates which reads
+  contain genomic information rather than indexing.
+
+``readNamePattern``
+  This is a regex pattern with which to gather read names; it should not make
+  the read name sequencing-orientation specific, as R1 and R2 should have the
+  same read name. The default works well for Illumina-based read names,
+  ``[\w\:\-\+]+``. For R-based scripts to interpret the regex correctly, you
+  will need to use double escapes, ``[\\w\\:\\-\\+]+``.
+
+
+Memory Management
+"""""""""""""""""
+
+``defaultMB / demultiMB / trimMB / filtMB / consolMB / alignMB / qualCtrlMB / assimilateMB / evaluateMB / reportMB``
+  Controls the amount of memory allocated to each of these processes during
+  snakemake processing. While working on a server or multicored machine, these
+  parameters will work internally to help schedule jobs. Each value will act as
+  an upper limit for the amount of MB of RAM to expect the process to take, and
+  jobs will be scheduled appropriately using the
+  ``--resources mem_mb={limitMB}`` flag with Snakemake. During HPC use, these
+  parameters can be combined with the cluster config to schedule specific
+  memory requirements for jobs. Additionally, if ``--restart-times {x}`` is
+  used, where "x" is the number of times to restart a job if it fails, then the
+  amount of memory for the job will increase by a unit of the parameter. For
+  example, if a trimming job fails because it runs out of memory, then
+  restarting the job will try to allocate 2 times the memory for the second
+  attempt. All parameters should be in megabytes (MB).
+
+
+Demultiplexing parameters
+"""""""""""""""""""""""""
+
+``skipDemultiplexing``
+  Logical (either TRUE or FALSE) to indicate if demultiplexing should be
+  skipped. If FALSE, sequence files (\*.fastq.gz) need to be placed or linked
+  in the input_data directory of an existing project directory (as with
+  ``iguide setup``), one sequence file for each type (R1, R2, I1, I2). These
+  need to be identified in the "Run" portion of the config file. If TRUE, then
+  demultiplexed files need to be included in the input_data directory of an
+  existing project directory. The files need to be appropriately named, in the
+  format of ``{sampleName}.{readtype}.fastq.gz``, where ``sampleName`` matches
+  the 'sampleName' column found in the associated 'sampleInfo' file, and
+  ``readtype`` is R1, R2, I1, or I2. If ``UMItags`` is ``FALSE``, then only R1
+  and R2 file types are required for analysis; if ``UMItags`` is ``TRUE``, then
+  I2 is a required file type as well.
+
+``barcode{1/2}Length``
+  Integer values indicating the number of nucleotides in the barcodes or
+  indexing sequences.
+
+``barcode{1/2}``
+  Character values (i.e. ``"I1"``) indicating which reads hold the associated
+  indexing information for demultiplexing.
+
+``bc{1/2}Mismatch``
+  An integer value indicating the number of tolerated mismatches in the barcode
+  sequences for either barcode 1 or 2.
+
+
+Sequence trimming
+"""""""""""""""""
+
+``R{1/2}leadMismatch``
+  Integer values indicating the number of allowed mismatches in either R1 or R2
+  leading sequence trimming. Recommended to be set to less than 10% error.
+
+``R2odnMismatch``
+  Integer value indicating the number of allowed mismatches in the unprimed
+  ODN sequence; typically this should be set to 0.
+
+``R{1/2}overMismatch``
+  Integer values indicating the number of allowed mismatches in either R1 or R2
+  overreading trimming. This is converted into a percent matching and should be
+  thought of as a number of mismatches allowed out of the total length of the
+  overreading trim sequence.
+
+``R{1/2}overMaxLength``
+  Searching for overread trimming in sequences can be time consuming while not
+  producing different results. For this reason, the total length of the
+  searched-for sequence can be limited here. For example, if
+  ``ATGCGTCGATCGTACTGCGTTCGAC`` is used as the overreading sequence, and 5
+  mismatches are allowed, then the tolerance will be 5/25 or 80% matching, but
+  only the first 20 nucleotides of the sequence will be aligned for
+  overtrimming, ``ATGCGTCGATCGTACTGCGT``. With an 80% matching requirement, 16
+  out of 20 nucleotides will need to align for overread trimming to be
+  initiated.
+
+Binning
+"""""""
+
+``bins``
+  A number of bins to separate filtered sequences into for more parallel
+  processing. Increasing the number of bins can help spread out the work
+  required for processing to keep memory requirements lower.
+
+``level``
+  A number indicating the number of reads that should be targeted for each bin.
+  Bins will be filled to the level amount, leaving remaining bins empty if
+  previous bins contain all the reads. Additionally, if all bins would
+  "overflow", then reads will be evenly distributed across the number of bins.
+
+Reference Alignment
+"""""""""""""""""""
+
+``BLATparams``
+  A character string to be included with the BLAT call. A suggested example has
+  been provided in the simulation config file. For options, please see the BLAT
+  help options by typing ``blat`` into the commandline after activating
+  ``iguide``.
+
+``BWAparams``
+  A character string to be included with the BWA call. A suggested example has
+  been provided in the simulation config file. For options, please see BWA help
+  by typing ``bwa mem`` into the commandline after activating ``iguide``.
+
+
+Post-alignment filtering
+""""""""""""""""""""""""
+
+``maxAlignStart``
+  Integer value indicating the number of nucleotides at the beginning of the
+  alignment that will be allowed to not align. Another way of thinking of this
+  is the maximum start position on the query rather than the target reference.
+  A default value of 5 means that the alignment needs to start in the first 5
+  nucleotides or the alignment is discarded during quality control filtering.
+
+``minPercentIdentity``
+  This is a value between 0 and 100 indicating the minimum global percent
+  identity allowed for an alignment. If an alignment has less, then it is
+  discarded during quality control filtering.
+
+``{min/max}TempLength``
+  Specify the minimum (min) and maximum (max) template length expected. Joined
+  alignments between R1 and R2 that are outside of this range are considered
+  artifacts and are discarded or classified as chimeras.
+
+
+Post-processing
+"""""""""""""""
+
+``refGenes / oncoGeneList / specialGeneList``
+  These are special reference files in either text or Bioconductor's
+  GenomicRanges objects. They can be in an '.rds' format or a table format
+  ('.csv' or '.tsv'). The ``file`` parameter should indicate the file path to
+  the file (relative paths should be relative to the Snakefile), and the
+  ``symbolCol`` parameter should indicate the column in the data object which
+  contains the reference names to be used in the analysis.
+
+``maxTargetMismatch``
+  The maximum number of mismatches between the reference genome and target
+  sequence allowed for consideration to be a target-matched incorporation
+  site. This is an integer value and is compared to the target sequence(s).
+
+``upstreamDist``
+  The distance upstream of the incorporation site to look for a target-similar
+  sequence within the criteria specified by ``maxTargetMismatch``.
+
+``downstreamDist``
+  The distance downstream of the incorporation site to look for / include a
+  target-similar sequence within the criteria specified by
+  ``maxTargetMismatch``.
+
+``pileUpMin``
+  An integer value indicating the number of alignments required to overlap
+  before being considered a 'pileUp'.
+
+``recoverMultihits``
+  While multihit alignments are often difficult to analyze, some information
+  can still be gleaned from the data given reasonable assumptions. Adjusting
+  this parameter to ``TRUE`` will still only focus on sites that are uniquely
+  mapped, but if a multihit includes a unique site and other locations,
+  contributions are given to the unique site location. Further, reads and their
+  contributions, umitags and fragments, are not double counted but instead
+  evenly distributed to all included unique sites. **Note**, some sequencing
+  artifacts may arise in "off-target" associated sites. Users should be careful
+  about concluding anything from these alignment artifacts. Leaving this option
+  as ``FALSE`` is recommended if the user does not have a target sequence that
+  locates to a repetitive sequence.
+
+
+Report
+""""""
+
+``suppFile``
+  Logical (``TRUE`` or ``FALSE``) indicating if the supplemental file provided
+  in ``Supplemental_Info`` should be used in the default report generated at
+  the end of processing. If set to ``FALSE``, the ``Supplemental_Info``
+  parameter is not required for processing.
+
+``{tables/figures}``
+  Logicals indicating if tables and figures should be generated from the
+  report. Data will be included under the ``reports`` directory in the project
+  run directory. For figures, both PDF and PNG formats will be generated if set
+  to ``TRUE`` at 300 dpi, while tables will be generated in a comma-separated
+  values (csv) format.
+
+``reportData``
+  Logical indicating if an RData object should be saved during the report
+  generation in the ``reports`` directory.
+
+``infoGraphic``
+  Logical indicating if an info graphic displaying the genomic distribution of
+  incorporations should be generated at the beginning of the report. While
+  aesthetically pleasing, the graphic gives the report a unique twist and can
+  provide the knowledgeable user with information about the report at the very
+  beginning.
+
+``signature``
+  Character string included at the beginning of reports to denote the author,
+  analyst, laboratory, etc. Make sure you change this if you don't want Chris
+  getting credit for your work.
+
+
+Nuclease Profiles
+-----------------
+
+An additional component of the first part of the config file is the Nuclease
+Profiles section. The user can specify which nuclease they are using and
+include a profile to help identify edit sites. Nucleases can range from Cas9
+to Cpf1 or TALEN-based nickases.
+
+**Note:** For TALEN and dual flanking nickases or nucleases, each side will
+need to be input as a different target. Specify in ``Target_Sequences`` the
+sequence and ``On_Target_Sites`` the actual editing site, as sketched below.
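+
+As a minimal sketch, a TALEN pair might be entered with one target per arm
+(the names, sequences, and coordinates here are hypothetical placeholders).::
+
+  Target_Sequences :
+    TAL1.L : "TGTCCTCTCTGCCCAGTT"
+    TAL1.R : "AACTGGGCAGAGAGGACA"
+
+  On_Target_Sites :
+    TAL1.L : "chr2:+:72933869"
+    TAL1.R : "chr2:-:72933910"
+
+  Treatment :
+    all : "TAL1.L;TAL1.R"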
+Make sure you include two distinct identifiers for the sequences' on-target
+sites, then specify the target treatment as ``{target_seq1};{target_seq2}``.
+
+Any name can be given in the ``Nuclease`` section, but that name needs to match
+the profile name as well. So if you want to call it "Cas9v2", then just make
+sure you have a profile named "Cas9v2" (a sketch of this pairing appears at the
+end of this section).
+
+Below is some ASCII art that indicates the differences between nucleases.
+Additionally, below the art are example profiles for input into the iGUIDE
+software.::
+
+  Editing strategies by designer nucleases
+  Cas9 :
+                     ><      PAM
+  ATGCATGCATGCATGCATGCA      TGG     (sense strand)
+
+   TGCATGCATGCATGCATGCA      NGG     # gRNA
+   ||||||||||||||||||||      |||
+  TACGTACGTACGTACGTACGT      ACC     (anti-sense strand)
+                     ><              # Dominant cutpoint
+
+  Cpf1 : Also known as Cas12a (similar nuclease structure for CasX)
+             ><                              # Dominant cutpoint
+  GTTTG      ATGCATGCATGCATGCATGCATGCATGC    (sense strand)
+   PAM
+  TTTV       ATGCATGCATGCATGCATGCA           # gRNA, nuclease activity leaves an overhang
+  ||||       |||||||||||||||||||||
+  CTAAC      TACGTACGTACGTACGTACGTACGTACG    (anti-sense strand)
+                                  ><         # Dominant cutpoint
+
+  TALEN : Protein-DNA binding domain fused with FokI nickase
+  ATATATATATATATATATAT GCATGCATGCATGCAT GCGCGCGCGCGCGCGCGCGC (sense strand)
+  \\\\\\\\\\\\\\\\\\\\
+                       |------->
+                        <-------|
+                                        \\\\\\\\\\\\\\\\\\\\
+  TATATATATATATATATATA CGTACGTACGTACGTA CGCGCGCGCGCGCGCGCGCG (anti-sense strand)
+  # Proteins bind flanking the cleavage site and cut in the "insert" sequence.
+
+  CasCLOVER : Clo051 or another nickase with CRISPR-based binding domains
+  ATCCT ATGCATGCATGCATGCATGC TTAACCGGTTAACCGG TACGTACGTACGTACGTACG CGGTC
+    |||  ||||||||||||||||||||                        (sense strand)
+   PAM   Target Sequence     \------->
+                    <-------\        Target Sequence            PAM
+   (anti-sense strand)       |||||||||||||||||||| |||
+  TAGGA TACGTACGTACGTACGTACG AATTGGCCAATTGGCC ATGCATGCATGCATGCATGC GCCAG
+
+
+Below are the example profiles.::
+
+  Nuclease_Profiles :
+    Cas9 :
+      PAM : "NGG"
+      PAM_Loc : "3p"
+      PAM_Tol : 1
+      Cut_Offset : -4
+      Insert_size : FALSE
+
+    Cpf1 :
+      PAM : "TTTV"
+      PAM_Loc : "5p"
+      PAM_Tol : 1
+      Cut_Offset : 26 #(Anywhere between 23 and 28)
+      Insert_size : FALSE
+
+    CasX :
+      PAM : "TTCN"
+      PAM_Loc : "5p"
+      PAM_Tol : 1
+      Cut_Offset : 22 #(Anywhere between 16 and 29)
+      Insert_size : FALSE
+
+    TALEN :
+      PAM : FALSE
+      PAM_Loc : FALSE
+      PAM_Tol : 0
+      Cut_Offset : Mid_insert
+      Insert_size : "15:21"
+
+    CasCLOVER :
+      PAM : "NGG"
+      PAM_Loc : "3p"
+      PAM_Tol : 1
+      Cut_Offset : Mid_insert
+      Insert_size : "10:30"
+
+
+Profile parameters
+""""""""""""""""""
+
+``PAM``
+  The protospacer adjacent motif should be specified here and can contain
+  ambiguous nucleotides.
+
+``PAM_Loc``
+  Indicates the location of the PAM with respect to the pattern, either '5p',
+  '3p' or FALSE.
+
+``PAM_Tol``
+  Indicates the tolerance for mismatches in the PAM sequence (ignored if PAM
+  is FALSE).
+
+``Cut_Offset``
+  Indicates the offset from the 5' nucleotide of the PAM sequence where the
+  nuclease creates a double strand break, unless PAM is FALSE, in which case
+  the offset is from the 5' position of the target sequence (also accepts
+  "Mid_insert" to specify the middle of the region between paired alignments).
+
+``Insert_size``
+  Used if target sequences are expected to flank each other for editing, such
+  as with TALENs, and indicates the expected size of the insert. To input a
+  range, delimit the min and max by a colon, i.e. 15:21. All names of nucleases
+  used to treat specimens need to have a profile. Additional profiles should be
+  added under the 'Nuclease_Profiles' parameter.
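+
+As a minimal sketch, matching a custom ``Nuclease`` name to its profile might
+look like the following (the name "Cas9v2" is illustrative, and the profile
+values simply mirror the Cas9 profile above).::
+
+  Nuclease :
+    all : "Cas9v2"
+
+  Nuclease_Profiles :
+    Cas9v2 :
+      PAM : "NGG"
+      PAM_Loc : "3p"
+      PAM_Tol : 1
+      Cut_Offset : -4
+      Insert_size : FALSE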
+
+
+Sample Information Files
+************************
+
+Sample information files (or sampleInfo files) contain information that may
+change from specimen to specimen. They need to contain at least 3 columns of
+information: sample names, barcode 1, and barcode 2 sequences. Additionally,
+other parameters defined in the config file can be defined in the sample
+information file if they change from specimen to specimen.
+
+The run-specific config file will need to point to the sample information
+file. For convenience, a directory can be found at ``iGUIDE/sampleInfo/`` for
+depositing these files.
+
+SampleInfo files need to have a specific naming format that follows
+'{RunName}.sampleinfo.csv'.
+
+An appropriate format for the sample names is "{specimen}-{rep}" where
+'specimen' is an alpha-numeric designator for the specimen and 'rep' is a
+numeric identifier for technical or biological replicates, separated by a dash
+(``-``). Replicates will be pooled during the final analysis, so if you want
+them to be separate in the report, make sure you give each specimen a different
+identifier.
+
+For example, iGSP0002-1 and iGSP0002-2 will be pooled together for
+the report and analysis, but iGSP0002-1 and iGSP0003-1 will not. These names
+will be used in naming files, so do not include any special characters that
+will confuse file management. Try to stick to common delimiters, such as ``-``
+and ``_``. Using a dot, ``.``, as a delimiter is not currently supported.
+
+A good practice is to put specimen identifiers at the beginning, replicate
+identifiers at the end following a "-", and anything else descriptive in the
+middle. For example, iGSP0002-neg-1 can specify the orientation the sample was
+processed with.
+
+
+Supplemental Information Files
+******************************
+
+Supplemental information files (or supp files) contain information that may
+change from specimen to specimen. They have only one required column,
+"Specimen", but subsequent columns will be used to define conditions. Let's use
+the below supp file as an example.::
+
+  # Supplemental csv file example, padding included for visualization
+  Specimen, Nuclease, gRNA
+  iGXA,     Cas9,     TRAC
+  iGXB,     Cas9,     TRAC
+  iGXC,     Cas9,     B2M
+  iGXD,     Cas9,     B2M
+  iGXE,     Mock,     Mock
+  iGXF,     Mock,     Mock
+
+This type of setup would indicate that there are 6 specimens to be analyzed
+(iGXA - iGXF). Each of these would correlate with its sampleName-designated
+replicates, so for iGXA, all samples with the format iGXA-{number} or
+iGXA-{info}-{number} would be pooled into the iGXA specimen.
+
+Additionally, there are three conditions, defined by the distinct data
+excluding information in the "Specimen" column. So in this case, the conditions
+are "Cas9-TRAC", "Cas9-B2M", and "Mock-Mock". Within the report format, there
+are several analyses that are condition-based rather than specimen-based. This
+adds to the flexibility and utility of the reporting functions supplied with
+iGUIDE.
+
+If the user would rather every specimen be analyzed independently and reported
+in that manner, then they can either run a report without a supp file or
+include in the supp file a column that distinguishes each specimen from the
+others.
+
+Column names and formatting are transferred directly into the report.
+Additionally, this file sets the order presented in the report. If "iGXC"
+comes before "iGXB" in the supp file, then it will be ordered as such
+throughout the report. Conditions, as well, follow this format.
+As presented above, the report will order the conditions as "Cas9-TRAC",
+"Cas9-B2M", and "Mock-Mock", which is the order of first observation.
+
+
+Setup a Run
+***********
+
+Once the config and sampleInfo files have been configured, a run directory
+can be created using the command below where {ConfigFile} is the path to your
+configuration file::
+
+  cd path/to/iGUIDE
+  iguide setup {ConfigFile}
+
+The directory should look like this (RunName is specified in the ConfigFile)::
+
+  > tree analysis/{RunName}
+  analysis/{RunName}/
+  ├── config.yml -> {path to ConfigFile}
+  ├── input_data
+  ├── logs
+  ├── output
+  ├── process_data
+  └── reports
+
+Components within the run directory:
+
+* config.yml - This is a symbolic link to the config file for the run
+* input_data - Directory where input fastq.gz files can be deposited
+* logs - Directory containing log files from processing steps
+* output - Directory containing output data from the analysis
+* process_data - Directory containing intermediate processing files
+* reports - Directory containing output reports and figures
+
+As a current convention, all processing is done within the analysis directory.
+The above command will create a file directory under the analysis directory for
+the run specified by the config ('/iGUIDE/analysis/{RunName}'). At the end of
+this process, iGUIDE will give the user a note to deposit the input sequence
+files into the /analysis/{RunName}/input_data directory. Copy the fastq.gz
+files from the sequencing instrument into this directory if you do not have
+paths to the files specified in the config file.
+
+iGUIDE typically uses each of the sequencing files (R1, R2, I1, and I2) for
+processing since it is based on a dual barcoding scheme. If I1 and I2 are
+concatenated into the read names of R1 and R2, it is recommended that you run
+``bcl2fastq ... --create-fastq-for-index-reads`` on the machine output
+directory to generate the I1 and I2 files.
+
+As iGUIDE has its own demultiplexing, it is recommended not to use the Illumina
+machine demultiplexing through input of index sequences in the SampleSheet.csv.
+If your sequence data has already been demultiplexed, though, please see the
+``skipDemultiplexing`` parameter described above for setup instructions.
+
+
+List Samples in a Run
+*********************
+
+As long as the config and sampleInfo files are present and in their respective
+locations, you can get a quick view of what samples are related to the project.
+Using the ``iguide list_samples`` command will produce an overview table on
+the console or write the table to a file (specified by the output option).
+Additionally, if a supplemental information file is associated with the run,
+the data will be combined with the listed table.::
+
+  > iguide list_samples configs/simulation.config.yml
+
+  Specimen Info for : simulation.
+
+   specimen   replicates       gRNA        nuclease
+  ---------- ------------ --------------- ----------
+     iGXA         1            TRAC         Cas9v1
+     iGXB         1        TRAC;TRBC;B2M    Cas9v1
+     iGXD         1             NA            NA
+
+
+Processing a Run
+****************
+
+Once the input_data directory has the required sequencing files, the run can be
+processed using the following command::
+
+  cd path/to/iGUIDE/
+  iguide run {ConfigFile}
+
+Snakemake offers a great number of resources for managing the processing
+through the pipeline. I recommend familiarizing yourself with the utility
+(https://snakemake.readthedocs.io/en/stable/). Here are some helpful snakemake
+options that can be passed to iGUIDE by appending to the iguide command after
+``--``:
+
+* ``[--cores X]`` multicore processing, with X specifying the number of cores
+  to use.
+* ``[--nolock]`` prevents locking of the working directory, allows for multiple
+  sessions to run at the same time.
+* ``[--notemp]`` keep all temporary files which are otherwise removed.
+* ``[-k, --keep-going]`` will keep processing if one or more jobs error out.
+* ``[-w X, --latency-wait X]`` wait X seconds for the output files to appear
+  before erroring out.
+* ``[--restart-times X]`` X is the number of times to restart a job if it
+  fails. Defaults to 0, but is used in ``iguide`` to increase memory
+  allocation.
+* ``[--resources mem_mb=X]`` Defines resources; for ``iguide``, mem_mb is the
+  number of MB to allow for memory allocation to the whole run. For HPC, this
+  can be coupled with ``--cluster-config`` to request specific resources for
+  each job.
+* ``[--rerun-incomplete, --ri]`` Re-run all jobs whose output is recognized
+  as incomplete, useful if your run gets terminated before finishing.
+* ``[--cluster-config FILE]`` A JSON or YAML file that defines wildcards used
+  for HPC.
+
+
+Outputs and Reports
+*******************
+
+After the ``iguide run`` command has completed, the final run directory will
+contain a number of output and report files depending on the config parameters.
+Additionally, if the user is content with the analysis, they can use the
+``iguide clean`` command to "clean up" the run directory. This will remove
+input data files, log files, and any remaining process data files, but will
+leave output and report files. This makes the "cleaned" run directories still
+compatible with the auxiliary workflow. A clean run directory will look
+something like the below tree.::
+
+  > tree analysis/{RunName}
+  analysis/{RunName}/
+  ├── config.yml -> {path to ConfigFile}
+  ├── input_data
+  ├── process_data
+  ├── logs
+  ├── output
+  |   ├── incorp_sites.{RunName}.rds
+  |   ├── stats.core.{RunName}.csv
+  |   └── stats.eval.{RunName}.csv
+  └── reports
+      ├── report.{RunName}.html
+      ├── runstats.{RunName}.html
+      └── summary.{RunName}.txt
+
+There are several standard output files. The ``incorp_sites.{RunName}.rds`` is
+the intermediate object that can be reprocessed into final data objects and
+reports if the user would like to change most parameters. The ``stats`` files
+contain processing related information in a condensed form. These stats can be
+viewed in a more interpretable fashion from the ``runstats.{RunName}.html``
+report.
+
+The ``report.{RunName}.html`` would be the main data analysis report. The
+``summary`` is a similar report but in a text-based format. There are ample
+descriptions within the report template that will be included with the report.
+But if the user would like to customize this report, then they can modify the
+report template, found at
+``tools/rscripts/report_templates/iGUIDE_report_template.Rmd``. Custom Rmd
+templates can also be provided through the ``iguide report`` command, which
+will use ``eval`` output objects to "knit" reports in html or pdf output
+formats.
+
+
+Contacts
+********
+
+Should you have any questions or comments and would like to contact the
+maintainer and designer of the iGUIDE software,
+please send an email to Chris [dot] L [dot] Nobles [at] Gmail [dot] com, with
+iGUIDE in the subject.
diff --git a/etc/build.b0.9.8.txt b/etc/build.b1.0.0.txt similarity index 100% rename from etc/build.b0.9.8.txt rename to etc/build.b1.0.0.txt diff --git a/etc/depreciated_builds/build.b0.9.8.txt b/etc/depreciated_builds/build.b0.9.8.txt new file mode 100644 index 00000000..9b77dcf5 --- /dev/null +++ b/etc/depreciated_builds/build.b0.9.8.txt @@ -0,0 +1,318 @@ +# This file may be used to create an environment using: +# $ conda create --name --file +# platform: linux-64 +@EXPLICIT +https://repo.anaconda.com/pkgs/main/linux-64/_libgcc_mutex-0.1-main.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/_r-mutex-1.0.1-anacondar_1.tar.bz2 +https://repo.anaconda.com/pkgs/main/linux-64/binutils_impl_linux-64-2.31.1-h6176602_1.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2019.6.16-hecc5488_0.tar.bz2 +https://repo.anaconda.com/pkgs/main/linux-64/libgfortran-ng-7.3.0-hdf63c60_0.tar.bz2 +https://repo.anaconda.com/pkgs/main/linux-64/libstdcxx-ng-9.1.0-hdf63c60_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/pandoc-2.7.3-0.tar.bz2 +https://repo.anaconda.com/pkgs/main/linux-64/binutils_linux-64-2.31.1-h6176602_7.tar.bz2 +https://repo.anaconda.com/pkgs/main/linux-64/libgcc-ng-9.1.0-hdf63c60_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/bzip2-1.0.8-h516909a_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/expat-2.2.5-he1b5a44_1003.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/fribidi-1.0.5-h516909a_1002.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/gcc_impl_linux-64-7.3.0-habb00fd_1.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/gmp-6.1.2-hf484d3e_1000.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/graphite2-1.3.13-hf484d3e_1000.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/icu-64.2-he1b5a44_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/jpeg-9c-h14c3975_1001.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/libdeflate-1.0-h14c3975_1.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libffi-3.2.1-he1b5a44_1006.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libiconv-1.15-h516909a_1005.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libopenblas-0.3.6-h6e990d7_6.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libtool-2.4.6-h14c3975_1002.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libunistring-0.9.10-h14c3975_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libuuid-2.32.1-h14c3975_1000.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/lz4-c-1.8.3-he1b5a44_1001.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/make-4.2.1-h14c3975_2004.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/ncurses-6.1-hf484d3e_1002.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/openssl-1.1.1c-h516909a_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/pcre-8.41-hf484d3e_1003.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/perl-5.26.2-h516909a_1006.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/pixman-0.38.0-h516909a_1003.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/pthread-stubs-0.4-h14c3975_1001.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/xorg-kbproto-1.0.7-h14c3975_1002.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/xorg-libice-1.0.10-h516909a_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/xorg-libxau-1.0.9-h14c3975_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/xorg-libxdmcp-1.1.3-h516909a_0.tar.bz2 
+https://conda.anaconda.org/conda-forge/linux-64/xorg-renderproto-0.11.1-h14c3975_1002.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/xorg-xextproto-7.3.0-h14c3975_1002.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/xorg-xproto-7.0.31-h14c3975_1007.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/xz-5.2.4-h14c3975_1001.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/yaml-0.1.7-h14c3975_1001.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/zlib-1.2.11-h516909a_1005.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/bwa-0.7.17-hed695b0_6.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/gcc_linux-64-7.3.0-h553295d_7.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/gettext-0.19.8.1-hc5be6a0_1002.tar.bz2 +https://repo.anaconda.com/pkgs/main/linux-64/gfortran_impl_linux-64-7.3.0-hdf63c60_1.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/gxx_impl_linux-64-7.3.0-hdf63c60_1.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libblas-3.8.0-11_openblas.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libedit-3.1.20170329-hf8c457e_1001.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libpng-1.6.37-hed695b0_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libprotobuf-3.9.1-h8b12597_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libssh2-1.8.2-h22169c7_2.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libxcb-1.13-h14c3975_1002.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libxml2-2.9.9-hee79883_2.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/readline-8.0-hf8c457e_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/tk-8.6.9-hed695b0_1002.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/xorg-libsm-1.2.3-h84519dc_1000.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/zstd-1.4.0-h3b9ef0a_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/blat-36-0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/bwidget-1.9.11-0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/freetype-2.10.0-he983fc9_0.tar.bz2 +https://repo.anaconda.com/pkgs/main/linux-64/gfortran_linux-64-7.3.0-h553295d_7.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/glib-2.58.3-h6f030ca_1002.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/gxx_linux-64-7.3.0-h553295d_7.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/krb5-1.16.3-h05b26f9_1001.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libcblas-3.8.0-11_openblas.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libidn2-2.1.1-h14c3975_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/liblapack-3.8.0-11_openblas.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libtiff-4.0.10-h57b8799_1003.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/sqlite-3.29.0-hcee41ef_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/tktable-2.10-h555a92e_1.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/xorg-libx11-1.6.8-h516909a_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/fontconfig-2.13.1-h86ecdb6_1001.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/gsl-2.5-h294904e_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/libcurl-7.65.3-hda55be3_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/python-3.6.7-h357f687_1005.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/wget-1.20.1-h90d6eec_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/xorg-libxext-1.3.4-h516909a_0.tar.bz2 
+https://conda.anaconda.org/conda-forge/linux-64/xorg-libxrender-0.9.10-h516909a_1002.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/xorg-libxt-1.1.5-h516909a_1003.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/appdirs-1.4.3-py_1.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/asn1crypto-0.24.0-py36_1003.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/async-timeout-3.0.1-py_1000.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/attrs-19.1.0-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/cachetools-2.1.0-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/cairo-1.16.0-hfb77d84_1002.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/certifi-2019.6.16-py36_1.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/chardet-3.0.4-py36_1003.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/configargparse-0.13.0-py_1.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/curl-7.65.3-hf8cf82a_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/datrie-0.8-py36h516909a_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/decorator-4.4.0-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/docutils-0.15.2-py36_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/filechunkio-1.6-py36_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/ftputil-3.2-py36_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/idna-2.8-py36_1000.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/jmespath-0.9.4-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/markupsafe-1.1.1-py36h14c3975_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/multidict-4.5.2-py36h14c3975_1000.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/numpy-1.17.0-py36h95a1406_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/prettytable-0.7.2-py_3.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/psutil-5.6.3-py36h516909a_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/pyasn1-0.4.6-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/pycparser-2.19-py36_1.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/pytz-2019.2-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/pyyaml-5.1.2-py36h516909a_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/ratelimiter-1.2.0-py36_1000.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/requests-2.13.0-py36_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/ruamel-1.0-py36_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/semantic_version-2.6.0-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/six-1.12.0-py36_1000.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/smmap2-2.0.5-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/typing_extensions-3.7.4-py36_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/urllib3-1.12-py36_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/wrapt-1.11.2-py36h516909a_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/xmlrunner-1.7.7-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/xorg-libxpm-3.5.12-h14c3975_1002.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/cffi-1.12.3-py36h8022711_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/gitdb2-2.0.5-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/google-resumable-media-0.3.2-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/harfbuzz-2.4.0-h9f30f68_2.tar.bz2 
+https://conda.anaconda.org/conda-forge/linux-64/idna_ssl-1.1.0-py36_1000.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/pyasn1-modules-0.0.5-py36_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/pyrsistent-0.15.4-py36h516909a_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/python-dateutil-2.8.0-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/python-irodsclient-0.7.0-py_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/rsa-3.1.4-py36_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/samtools-1.9-h8571acd_11.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/setuptools-41.0.1-py36_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/yarl-1.3.0-py36h14c3975_1000.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/aiohttp-3.5.4-py36h14c3975_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/bcrypt-3.1.6-py36h516909a_1.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/botocore-1.10.84-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/cryptography-2.7-py36h72c5cf5_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/dropbox-7.3.1-py36_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/gitpython-2.1.13-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/google-auth-1.2.1-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/jinja2-2.10.1-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/jsonschema-3.0.2-py36_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/networkx-2.3-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/pandas-0.25.0-py36hb3f55d8_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/pango-1.42.4-he7ab937_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/protobuf-3.9.1-py36he1b5a44_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/pygments-2.4.2-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/pynacl-1.3.0-py36h14c3975_1000.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/ruamel.yaml-0.16.0-py36h516909a_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/wheel-0.33.4-py36_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/aioeasywebdav-2.4.0-py36_1000.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/googleapis-common-protos-1.6.0-py36_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/graphviz-2.40.1-h5933667_1.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/paramiko-2.6.0-py36_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/pip-19.2.1-py36_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/r-base-3.5.1-hfb2a302_1009.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/s3transfer-0.1.13-py36_1001.tar.bz2 +https://conda.anaconda.org/bioconda/noarch/snakemake-minimal-5.5.4-py_1.tar.bz2 +https://conda.anaconda.org/bioconda/noarch/bioconductor-biocgenerics-0.28.0-r351_1.tar.bz2 +https://conda.anaconda.org/bioconda/noarch/bioconductor-genomeinfodbdata-1.2.1-r351_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/bioconductor-zlibbioc-1.28.0-r351h14c3975_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/boto3-1.7.84-py_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/google-api-core-1.14.1-py36_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/pygraphviz-1.3.1-py36_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/pysftp-0.2.9-py36_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-assertthat-0.2.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-backports-1.1.2-r351h96ca727_0.tar.bz2 
+https://conda.anaconda.org/r/linux-64/r-base64enc-0.1_3-r351h96ca727_4.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-bh-1.66.0_1-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-bindr-0.1.1-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-bitops-1.0_6-r351h96ca727_4.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-boot-1.3_20-r351hf348343_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-brew-1.0_6-r351h6115d3f_4.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-clipr-0.4.1-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-cluster-2.0.7_1-r351hac1494b_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-codetools-0.2_15-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-colorspace-1.3_2-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-commonmark-1.5-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-crayon-1.3.4-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-curl-3.2-r351hadc6856_1.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/r-data.table-1.11.6-r351hc070d10_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-dbi-1.0.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-dichromat-2.0_0-r351h6115d3f_4.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-digest-0.6.15-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-fansi-0.2.3-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-findpython-1.0.3-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-foreign-0.8_71-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-formatr-1.5-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/r-futile.options-1.0.1-r35h6115d3f_1001.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-getopt-1.20.2-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-git2r-0.23.0-r351h96ca727_1.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-glue-1.3.0-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-gtable-0.2.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-highr-0.7-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/r-hwriter-1.3.2-r35h6115d3f_1002.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-iterators-1.0.10-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-jsonlite-1.5-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-kernsmooth-2.23_15-r351hac1494b_4.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-labeling-0.3-r351h6115d3f_4.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-lattice-0.20_35-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-lazyeval-0.2.1-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-magrittr-1.5-r351h6115d3f_4.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-mass-7.3_50-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/r-matrixstats-0.54.0-r35hcdcec82_1001.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-mime-0.5-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-nnet-7.3_12-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-openssl-1.0.2-r351h96ca727_1.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-pkgconfig-2.0.1-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-plogr-0.2.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/r-polyclip-1.10_0-r35h0357c0b_1.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-praise-1.0.0-r351h6115d3f_4.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-proto-1.0.0-r351h6115d3f_0.tar.bz2 
+https://conda.anaconda.org/r/linux-64/r-r6-2.2.2-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-rcolorbrewer-1.1_2-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-rcpp-0.12.18-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-rematch-1.0.1-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-rlang-0.2.1-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-rpart-4.1_13-r351hd10c6a6_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-rstudioapi-0.7-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/r-snow-0.4_3-r35h6115d3f_1001.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-spatial-7.3_11-r351hd10c6a6_4.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/r-statmod-1.4.32-r35h6e990d7_1.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/r-stringi-1.4.3-r35h0e574ca_3.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-utf8-1.1.4-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-viridislite-0.3.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-whisker-0.3_2-r351hf348343_4.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-withr-2.1.2-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-xfun-0.3-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-xml-3.98_1.12-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-yaml-2.2.0-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/bioconductor-biobase-2.42.0-r351h14c3975_1.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/bioconductor-s4vectors-0.20.1-r351h14c3975_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/google-cloud-core-1.0.3-py_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-argparse-1.1.1-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-bindrcpp-0.2.2-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-class-7.3_14-r351hd10c6a6_4.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-cli-1.0.0-r351h6115d3f_1.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/r-farver-1.1.0-r35h0357c0b_1.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-foreach-1.4.4-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-hms-0.4.2-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-htmltools-0.3.6-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-httr-1.3.1-r351h6115d3f_1.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/r-lambda.r-1.2.3-r35h6115d3f_1001.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-latticeextra-0.6_28-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-markdown-0.8-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-matrix-1.2_14-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-memoise-1.1.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-munsell-0.5.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-nlme-3.1_137-r351ha65eedd_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/r-pander-0.6.3-r35h0357c0b_1.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-plyr-1.8.4-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-rcurl-1.95_4.11-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-rprojroot-1.3_2-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-stringr-1.3.1-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-tinytex-0.6-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-xml2-1.2.0-r351h29659fb_0.tar.bz2 
+https://conda.anaconda.org/bioconda/linux-64/bioconductor-iranges-2.16.0-r351h14c3975_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/google-cloud-storage-1.17.0-py_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-desc-1.2.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-devtools-1.13.6-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-evaluate-0.11-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/conda-forge/noarch/r-futile.logger-1.4.3-r35h6115d3f_1002.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-igraph-1.2.2-r351h80f5a37_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-lubridate-1.7.4-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-mgcv-1.8_24-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-pillar-1.3.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-rcppeigen-0.3.3.4.0-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-reshape2-1.4.3-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-scales-0.5.0-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-selectr-0.4_1-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-survival-2.42_6-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-testthat-2.0.0-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/r-tweenr-1.0.1-r35h0357c0b_1001.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/bioconductor-biocparallel-1.16.6-r351h1c2f66e_0.tar.bz2 +https://conda.anaconda.org/bioconda/noarch/bioconductor-genomeinfodb-1.18.1-r351_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/bioconductor-xvector-0.22.0-r351h14c3975_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-knitr-1.20-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-processx-3.1.0-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-recommended-3.5.1-r351_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-rvest-0.3.2-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-tibble-1.4.2-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/bioconda/noarch/snakemake-5.5.4-1.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/bioconductor-biostrings-2.50.2-r351h14c3975_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/bioconductor-delayedarray-0.8.0-r351h14c3975_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/bioconductor-genomicranges-1.34.0-r351h14c3975_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-3.5.1-r351_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-callr-2.0.4-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-cellranger-1.1.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-forcats-0.3.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-ggplot2-3.0.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-purrr-0.2.5-r351h96ca727_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-readr-1.1.1-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-rmarkdown-1.10-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-webshot-0.5.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/bioconductor-rsamtools-1.34.0-r351hf484d3e_0.tar.bz2 +https://conda.anaconda.org/bioconda/noarch/bioconductor-summarizedexperiment-1.12.0-r351_0.tar.bz2 +https://conda.anaconda.org/conda-forge/linux-64/r-ggforce-0.2.2-r35h0357c0b_1.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-haven-1.1.2-r351h29659fb_0.tar.bz2 
+https://conda.anaconda.org/conda-forge/noarch/r-kableextra-1.1.0-r35h6115d3f_1.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-pkgbuild-1.0.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-readxl-1.1.0-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-reprex-0.2.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-tidyselect-0.2.4-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/bioconductor-genomicalignments-1.18.1-r351h14c3975_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-dplyr-0.7.6-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-pkgload-1.0.0-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/bioconductor-rtracklayer-1.42.1-r351h9d9f1b6_1.tar.bz2 +https://conda.anaconda.org/bioconda/linux-64/bioconductor-shortread-1.40.0-r351hf484d3e_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-dbplyr-1.2.2-r351hf348343_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-roxygen2-6.1.0-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-tidyr-0.8.1-r351h29659fb_0.tar.bz2 +https://conda.anaconda.org/bioconda/noarch/bioconductor-bsgenome-1.50.0-r351_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-broom-0.5.0-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/bioconda/noarch/bioconductor-bsgenome.hsapiens.ucsc.hg38-1.4.1-r351_5.tar.bz2 +https://conda.anaconda.org/bioconda/noarch/bioconductor-hiannotator-1.16.0-r351_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-modelr-0.1.2-r351h6115d3f_0.tar.bz2 +https://conda.anaconda.org/r/linux-64/r-tidyverse-1.2.1-r351h6115d3f_0.tar.bz2 diff --git a/etc/tests/simulation.digests.yml b/etc/tests/simulation.digests.yml index 7e106573..c6628d2a 100644 --- a/etc/tests/simulation.digests.yml +++ b/etc/tests/simulation.digests.yml @@ -8,9 +8,9 @@ file1 : file2 : name : "stats.core.simulation.csv" path : "analysis/simulation/output/stats.core.simulation.csv" - md5 : "45c21291e86fdb00632d4e2915ff9e64" + md5 : "4ad30f6a2f42ee69e404cdcad7a6d5d1" file3 : name : "stats.eval.simulation.csv" path : "analysis/simulation/output/stats.eval.simulation.csv" - md5 : "57158a4826685a8024b3281284ac42b6" + md5 : "26cd4b00fa40212cd01027d1c11cd76f" diff --git a/install.sh b/install.sh index b35881cd..e652d5c9 100755 --- a/install.sh +++ b/install.sh @@ -68,6 +68,7 @@ __iguide_env="${arg_e:-iguide}" __run_iguide_tests=false __reqs_install=false __update_lib=false +__update_pkg=false __update_env=false __req_r_version="3.4.1" __old_path=$PATH @@ -163,7 +164,7 @@ function __test_iguideSupport () { fi } -function __test_iguidelib() { +function __test_iguidelib () { if [[ $(__test_env) = true ]]; then activate_iguide command -v iguide &> /dev/null && echo true || echo false @@ -173,7 +174,7 @@ function __test_iguidelib() { fi } -function __test_iguide() { +function __test_iguide () { if [[ $(__test_env) = true ]]; then $(bash ${__iguide_dir}/etc/tests/test.sh ${__iguide_env} &> /dev/null) && \ echo true || echo false @@ -221,7 +222,7 @@ function install_environment () { local install_options="--quiet --file etc/requirements.yml" debug_capture conda env update --name=$__iguide_env ${install_options} 2>&1 else - local install_options="--quiet --yes --file etc/build.b0.9.8.txt" + local install_options="--quiet --yes --file etc/build.b1.0.0.txt" debug_capture conda create --name=$__iguide_env ${install_options} 2>&1 fi @@ -333,7 +334,7 @@ else if [[ $__reqs_install = "true" ]]; then __build_source="etc/requirements.yml" else - __build_source="etc/build.b0.9.8.txt" + 
__build_source="etc/build.b1.0.0.txt" fi info "Creating iGUIDE environment..." diff --git a/rules/align.blat.rules b/rules/align.blat.rules index 84cb893c..e4a903cc 100644 --- a/rules/align.blat.rules +++ b/rules/align.blat.rules @@ -3,14 +3,14 @@ rule align: input: - seq = RUN_DIR + "/process_data/consol/{sample}.{read}.consol.fasta", + seq = RUN_DIR + "/process_data/consol/{sample}.{read}.{bin}.consol.fasta", genome = ancient(ROOT_DIR + "/genomes/" + config["Ref_Genome"] + ".2bit") output: - temp(RUN_DIR + "/process_data/align/{sample}.{read}.psl") + temp(RUN_DIR + "/process_data/align/{sample}.{read}.{bin}.psl") params: config["BLATparams"] log: - RUN_DIR + "/logs/{sample}.{read}.blat.log" + RUN_DIR + "/logs/{sample}.{read}.{bin}.blat.log" resources: mem_mb = lambda wildcards, attempt: attempt * config["alignMB"] shell: @@ -26,8 +26,8 @@ rule align: """ rule compress_align: - input: RUN_DIR + "/process_data/align/{sample}.{read}.psl" - output: temp(RUN_DIR + "/process_data/align/{sample}.{read}.psl.gz") + input: RUN_DIR + "/process_data/align/{sample}.{read}.{bin}.psl" + output: temp(RUN_DIR + "/process_data/align/{sample}.{read}.{bin}.psl.gz") resources: mem_mb = lambda wildcards, attempt: attempt * config["defaultMB"] shell: "gzip {input}" diff --git a/rules/align.bwa.rules b/rules/align.bwa.rules index 6fd1e188..ea1247bf 100644 --- a/rules/align.bwa.rules +++ b/rules/align.bwa.rules @@ -28,8 +28,8 @@ rule index_ref: rule align: input: - R1 = RUN_DIR + "/process_data/filtered/{sample}.R1.filt.fastq.gz", - R2 = RUN_DIR + "/process_data/filtered/{sample}.R2.filt.fastq.gz", + R1 = RUN_DIR + "/process_data/filtered/{sample}.R1.{bin}.filt.fastq.gz", + R2 = RUN_DIR + "/process_data/filtered/{sample}.R2.{bin}.filt.fastq.gz", genome=ancient(ROOT_DIR + "/genomes/" + config["Ref_Genome"] + ".fasta"), aux1=ancient(ROOT_DIR + "/genomes/" + config["Ref_Genome"] + ".amb"), aux2=ancient(ROOT_DIR + "/genomes/" + config["Ref_Genome"] + ".ann"), @@ -37,12 +37,12 @@ rule align: aux4=ancient(ROOT_DIR + "/genomes/" + config["Ref_Genome"] + ".pac"), aux5=ancient(ROOT_DIR + "/genomes/" + config["Ref_Genome"] + ".sa") output: - temp(RUN_DIR + "/process_data/align/{sample}.unsorted.bam") + temp(RUN_DIR + "/process_data/align/{sample}.{bin}.unsorted.bam") params: bwa=config["BWAparams"], index=ROOT_DIR + "/genomes/" + config["Ref_Genome"] log: - RUN_DIR + "/logs/{sample}.bwa.log" + RUN_DIR + "/logs/{sample}.{bin}.bwa.log" resources: mem_mb = lambda wildcards, attempt: attempt * config["alignMB"] shell: diff --git a/rules/arch.rules b/rules/arch.rules index 9007fe33..8625aed0 100644 --- a/rules/arch.rules +++ b/rules/arch.rules @@ -6,21 +6,29 @@ rule core_stat_matrix: input: demulti=RUN_DIR + "/process_data/stats/" + RUN + ".demulti.stat", trimR1=expand( - RUN_DIR + "/process_data/stats/{sample}.R1.trim.stat", sample=SAMPLES), + RUN_DIR + "/process_data/stats/{sample}.R1.{bin}.trim.stat", + sample=SAMPLES, bin=BINS), trimPrimer=expand( - RUN_DIR + "/process_data/stats/{sample}.R2.primer.trim.stat", - sample=SAMPLES), + RUN_DIR + "/process_data/stats/{sample}.R2.{bin}.primer.trim.stat", + sample=SAMPLES, bin=BINS), trimODN=expand( - RUN_DIR + "/process_data/stats/{sample}.R2.trim.stat", sample=SAMPLES), + RUN_DIR + "/process_data/stats/{sample}.R2.{bin}.trim.stat", + sample=SAMPLES, bin=BINS), umitags=expand( - RUN_DIR + "/process_data/stats/{sample}.umitags.stat", sample=SAMPLES), + RUN_DIR + "/process_data/stats/{sample}.{bin}.umitags.stat", + sample=SAMPLES, bin=BINS), filt=expand( - RUN_DIR + 
"/process_data/stats/{sample}.filt.stat", sample=SAMPLES), + RUN_DIR + "/process_data/stats/{sample}.{bin}.filt.stat", + sample=SAMPLES, bin=BINS), consol=expand( - RUN_DIR + "/process_data/stats/{sample}.{read}.consol.stat", - sample=SAMPLES, read=READS), + RUN_DIR + "/process_data/stats/{sample}.{read}.{bin}.consol.stat", + sample=SAMPLES, read=READS, bin=BINS), align=expand( - RUN_DIR + "/process_data/stats/{sample}.align.stat", sample=SAMPLES), + RUN_DIR + "/process_data/stats/{sample}.{bin}.align.stat", + sample=SAMPLES, bin=BINS), + multihits=expand( + RUN_DIR + "/process_data/stats/{sample}.multihits.stat", + sample=SAMPLES), assim=RUN_DIR + "/process_data/stats/" + RUN + ".assim.stat" output: RUN_DIR + "/output/stats.core." + RUN + ".csv" @@ -29,7 +37,10 @@ rule core_stat_matrix: data_dir=RUN_DIR + "/process_data/stats/" resources: mem_mb = lambda wildcards, attempt: attempt * config["defaultMB"] - shell: "Rscript {params.tool} -d {params.data_dir} -o {output}" + shell: + """ + Rscript {params.tool} -d {params.data_dir} -o {output} + """ rule eval_stat_matrix: input: RUN_DIR + "/output/" + RUN + ".eval.stat" @@ -43,7 +54,8 @@ rule eval_stat_matrix: rule gen_stat_report: input: core = RUN_DIR + "/output/stats.core." + RUN + ".csv", - eval = RUN_DIR + "/output/stats.eval." + RUN + ".csv" + eval = RUN_DIR + "/output/stats.eval." + RUN + ".csv", + site = RUN_DIR + "/output/incorp_sites." + RUN + ".rds" output: RUN_DIR + "/reports/runstats." + RUN + ".html" params: tool = ROOT_DIR + "/tools/rscripts/generate_stat_report.R", @@ -53,5 +65,6 @@ rule gen_stat_report: mem_mb=lambda wildcards, attempt: attempt * config["reportMB"] shell: """ - Rscript {params.tool} {input} -o {output} -c {params.config} > {log} 2>&1 + Rscript {params.tool} -r {input.core} -e {input.eval} -i {input.site} \ + -o {output} -c {params.config} > {log} 2>&1 """ diff --git a/rules/binning.rules b/rules/binning.rules new file mode 100644 index 00000000..88d88319 --- /dev/null +++ b/rules/binning.rules @@ -0,0 +1,31 @@ +# -*- mode: Snakemake -*- +# Sequence Binning Rules + +rule bin_passing_reads: + input: + expand( + RUN_DIR + "/process_data/demulti/{{sample}}.{req_type}.fastq.gz", + req_type=REQ_TYPES + ) + output: + temp(expand( + RUN_DIR + "/process_data/binned/{{sample}}.{req_type}.{bin}.fastq.gz", + req_type=REQ_TYPES, bin=BINS + )) + params: + tool=ROOT_DIR + "/tools/rscripts/bin_seqs.R", + outdir=RUN_DIR + "/process_data/binned", + bins=config["bins"], + level=config["level"], + readNamePatternArg=config["readNamePattern"] + log: + RUN_DIR + "/logs/{sample}.bin.log" + resources: + mem_mb=lambda wildcards, attempt: attempt * config["defaultMB"] + shell: + """ + Rscript {params.tool} {input} -o {params.outdir} \ + -b {params.bins} -l {params.level} --compress \ + --readNamePattern {params.readNamePatternArg} > {log} 2>&1 + """ + diff --git a/rules/consol.rules b/rules/consol.rules index 548cf2ce..4339b645 100644 --- a/rules/consol.rules +++ b/rules/consol.rules @@ -3,15 +3,15 @@ rule consolidate: input: - RUN_DIR + "/process_data/filtered/{sample}.{read}.filt.fastq.gz" + RUN_DIR + "/process_data/filtered/{sample}.{read}.{bin}.filt.fastq.gz" output: - consol=temp(RUN_DIR + "/process_data/consol/{sample}.{read}.consol.fasta"), - key=temp(RUN_DIR + "/process_data/consol/{sample}.{read}.key.csv"), - stat=temp(RUN_DIR + "/process_data/stats/{sample}.{read}.consol.stat") + consol=temp(RUN_DIR + "/process_data/consol/{sample}.{read}.{bin}.consol.fasta"), + key=temp(RUN_DIR + 
"/process_data/consol/{sample}.{read}.{bin}.key.csv"), + stat=temp(RUN_DIR + "/process_data/stats/{sample}.{read}.{bin}.consol.stat") params: tool=ROOT_DIR + "/tools/rscripts/consol.R" log: - RUN_DIR + "/logs/{sample}.{read}.consol.log" + RUN_DIR + "/logs/{sample}.{read}.{bin}.consol.log" resources: mem_mb=lambda wildcards, attempt: attempt * config["consolMB"] shell: diff --git a/rules/consol_stub.rules b/rules/consol_stub.rules index e5a90f38..2c876f3c 100644 --- a/rules/consol_stub.rules +++ b/rules/consol_stub.rules @@ -3,7 +3,7 @@ rule collect_consol_stub: output: - stat=temp(RUN_DIR + "/process_data/stats/{sample}.{read}.consol.stat") + stat=temp(RUN_DIR + "/process_data/stats/{sample}.{read}.{bin}.consol.stat") resources: mem_mb=lambda wildcards, attempt: attempt * config["defaultMB"] shell: "touch {output.stat}" diff --git a/rules/filt.rules b/rules/filt.rules index b2159cad..7bc9bb48 100644 --- a/rules/filt.rules +++ b/rules/filt.rules @@ -3,17 +3,17 @@ rule seq_filter: input: - R1=RUN_DIR + "/process_data/trimmed/{sample}.R1.trim.fastq.gz", - R2=RUN_DIR + "/process_data/trimmed/{sample}.R2.trim.fastq.gz" + R1=RUN_DIR + "/process_data/trimmed/{sample}.R1.{bin}.trim.fastq.gz", + R2=RUN_DIR + "/process_data/trimmed/{sample}.R2.{bin}.trim.fastq.gz" output: - R1=temp(RUN_DIR + "/process_data/filtered/{sample}.R1.filt.fastq.gz"), - R2=temp(RUN_DIR + "/process_data/filtered/{sample}.R2.filt.fastq.gz"), - stat=temp(RUN_DIR + "/process_data/stats/{sample}.filt.stat") + R1=temp(RUN_DIR + "/process_data/filtered/{sample}.R1.{bin}.filt.fastq.gz"), + R2=temp(RUN_DIR + "/process_data/filtered/{sample}.R2.{bin}.filt.fastq.gz"), + stat=temp(RUN_DIR + "/process_data/stats/{sample}.{bin}.filt.stat") params: tool=ROOT_DIR + "/tools/rscripts/filt.R", readNamePatternArg=config["readNamePattern"] log: - RUN_DIR + "/logs/{sample}.filt.log" + RUN_DIR + "/logs/{sample}.{bin}.filt.log" resources: mem_mb=lambda wildcards, attempt: attempt * config["filtMB"] shell: diff --git a/rules/process.rules b/rules/process.rules index 4e6553df..d6ab6c1d 100644 --- a/rules/process.rules +++ b/rules/process.rules @@ -4,7 +4,8 @@ rule all_uniq_sites: input: expand( - RUN_DIR + "/process_data/post_align/{sample}.uniq.csv", sample=SAMPLES) + RUN_DIR + "/process_data/post_align/{sample}.{bin}.uniq.csv", + sample=SAMPLES, bin=BINS) output: temp(RUN_DIR + "/output/unique_sites." 
+ RUN + ".csv") params: @@ -19,19 +20,41 @@ rule all_uniq_sites: done """ +rule combine_multihits: + input: + expand( + RUN_DIR + "/process_data/post_align/{{sample}}.{bin}.multihits.rds", + bin=BINS) + output: + hits=temp(RUN_DIR + "/process_data/multihits/{sample}.multihits.rds"), + stat=temp(RUN_DIR + "/process_data/stats/{sample}.multihits.stat") + params: + tool=ROOT_DIR + "/tools/rscripts/combine_multihits.R", + dir=RUN_DIR + "/process_data/post_align", + pattern="{sample}.bin[0-9]+.multihits.rds" + log: + RUN_DIR + "/logs/{sample}.multihits.log" + resources: + mem_mb=lambda wildcards, attempt: attempt * config["qualCtrlMB"] + shell: + """ + Rscript {params.tool} -d {params.dir} -p {params.pattern} \ + -o {output.hits} -s {output.stat} > {log} 2>&1 + """ + def all_umitag_inputs(wildcards): if (config["UMItags"]): return expand( - RUN_DIR + "/process_data/indices/{sample}.umitags.fasta.gz", - sample=SAMPLES) + RUN_DIR + "/process_data/indices/{sample}.{bin}.umitags.fasta.gz", + sample=SAMPLES, bin=BINS) else: return [] def all_multi_inputs(wildcards): if (config["recoverMultihits"]): return expand( - RUN_DIR + "/process_data/post_align/{sample}.multihits.rds", + RUN_DIR + "/process_data/multihits/{sample}.multihits.rds", sample=SAMPLES) else: return [] @@ -45,8 +68,10 @@ rule assimilate_sites: incorp=RUN_DIR + "/output/incorp_sites." + RUN + ".rds", stat=temp(RUN_DIR + "/process_data/stats/" + RUN + ".assim.stat") params: - config=RUN_DIR + "/" + "config.yml", - tool=ROOT_DIR + "/tools/rscripts/assimilate_incorp_data.R" + config=RUN_DIR + "/config.yml", + tool=ROOT_DIR + "/tools/rscripts/assimilate_incorp_data.R", + umitagDir=RUN_DIR + "/process_data/indices", + multiDir=RUN_DIR + "/process_data/multihits" log: RUN_DIR + "/logs/" + RUN + ".assim.log" resources: @@ -54,9 +79,9 @@ rule assimilate_sites: run: call_str="Rscript {params.tool} {input.sites} -o {output.incorp} " if (config["UMItags"]): - call_str=call_str + "-u {input.umitag} " + call_str=call_str + "-u {params.umitagDir} " if (config["recoverMultihits"]): - call_str=call_str + "-m {input.multi} " + call_str=call_str + "-m {params.multiDir} " call_str=call_str + "-c {params.config} --stat {output.stat} > {log} 2>&1" shell(call_str) diff --git a/rules/quality.blat.rules b/rules/quality.blat.rules index 9ac60788..638a32c3 100644 --- a/rules/quality.blat.rules +++ b/rules/quality.blat.rules @@ -3,15 +3,15 @@ rule post_align: input: - sampleR1=RUN_DIR + "/process_data/align/{sample}.R1.psl.gz", - sampleR2=RUN_DIR + "/process_data/align/{sample}.R2.psl.gz", - keyR1=RUN_DIR + "/process_data/consol/{sample}.R1.key.csv", - keyR2=RUN_DIR + "/process_data/consol/{sample}.R2.key.csv" + sampleR1=RUN_DIR + "/process_data/align/{sample}.R1.{bin}.psl.gz", + sampleR2=RUN_DIR + "/process_data/align/{sample}.R2.{bin}.psl.gz", + keyR1=RUN_DIR + "/process_data/consol/{sample}.R1.{bin}.key.csv", + keyR2=RUN_DIR + "/process_data/consol/{sample}.R2.{bin}.key.csv" output: - uniq=temp(RUN_DIR + "/process_data/post_align/{sample}.uniq.csv"), - chimera=temp(RUN_DIR + "/process_data/post_align/{sample}.chimera.rds"), - multihit=temp(RUN_DIR + "/process_data/post_align/{sample}.multihits.rds"), - stat=temp(RUN_DIR + "/process_data/stats/{sample}.align.stat") + uniq=temp(RUN_DIR + "/process_data/post_align/{sample}.{bin}.uniq.csv"), + chimera=temp(RUN_DIR + "/process_data/post_align/{sample}.{bin}.chimera.rds"), + multihit=temp(RUN_DIR + "/process_data/post_align/{sample}.{bin}.multihits.rds"), + stat=temp(RUN_DIR + 
"/process_data/stats/{sample}.{bin}.align.stat") params: tool=ROOT_DIR + "/tools/rscripts/couple.R", ref=config["Ref_Genome"], @@ -21,7 +21,7 @@ rule post_align: maxLen=config["maxTempLength"], readNamePatternArg=config["readNamePattern"] log: - RUN_DIR + "/logs/{sample}.couple.log" + RUN_DIR + "/logs/{sample}.{bin}.couple.log" resources: mem_mb=lambda wildcards, attempt: attempt * config["qualCtrlMB"] shell: diff --git a/rules/quality.sam.rules b/rules/quality.sam.rules index f1db3aad..bc63d33a 100644 --- a/rules/quality.sam.rules +++ b/rules/quality.sam.rules @@ -2,28 +2,28 @@ # Quality Processing: SAM-output aligners rule sort_bam: - input: RUN_DIR + "/process_data/align/{sample}.unsorted.bam" - output: temp(RUN_DIR + "/process_data/align/{sample}.bam") + input: RUN_DIR + "/process_data/align/{sample}.{bin}.unsorted.bam" + output: temp(RUN_DIR + "/process_data/align/{sample}.{bin}.bam") resources: mem_mb = lambda wildcards, attempt: attempt * config["defaultMB"] shell: "samtools sort {input} -o {output}" rule index_bam: - input: RUN_DIR + "/process_data/align/{sample}.bam" - output: temp(RUN_DIR + "/process_data/align/{sample}.bai") + input: RUN_DIR + "/process_data/align/{sample}.{bin}.bam" + output: temp(RUN_DIR + "/process_data/align/{sample}.{bin}.bai") resources: mem_mb = lambda wildcards, attempt: attempt * config["defaultMB"] shell: "samtools index -b {input} {output}" rule post_align: input: - bam=RUN_DIR + "/process_data/align/{sample}.bam", - bai=RUN_DIR + "/process_data/align/{sample}.bai" + bam=RUN_DIR + "/process_data/align/{sample}.{bin}.bam", + bai=RUN_DIR + "/process_data/align/{sample}.{bin}.bai" output: - uniq=temp(RUN_DIR + "/process_data/post_align/{sample}.uniq.csv"), - chimera=temp(RUN_DIR + "/process_data/post_align/{sample}.chimera.rds"), - multihit=temp(RUN_DIR + "/process_data/post_align/{sample}.multihits.rds"), - stat=temp(RUN_DIR + "/process_data/stats/{sample}.align.stat") + uniq=temp(RUN_DIR + "/process_data/post_align/{sample}.{bin}.uniq.csv"), + chimera=temp(RUN_DIR + "/process_data/post_align/{sample}.{bin}.chimera.rds"), + multihit=temp(RUN_DIR + "/process_data/post_align/{sample}.{bin}.multihits.rds"), + stat=temp(RUN_DIR + "/process_data/stats/{sample}.{bin}.align.stat") params: tool=ROOT_DIR + "/tools/rscripts/samqc.R", ref=config["Ref_Genome"], @@ -33,7 +33,7 @@ rule post_align: maxLen=config["maxTempLength"], readNamePatternArg=config["readNamePattern"] log: - RUN_DIR + "/logs/{sample}.samqc.log" + RUN_DIR + "/logs/{sample}.{bin}.samqc.log" resources: mem_mb=lambda wildcards, attempt: attempt * config["qualCtrlMB"] shell: diff --git a/rules/trim.rules b/rules/trim.rules index 0b35453d..9b6bf8ca 100644 --- a/rules/trim.rules +++ b/rules/trim.rules @@ -3,10 +3,10 @@ rule seq_trim_R1: input: - RUN_DIR + "/process_data/demulti/{sample}.R1.fastq.gz" + RUN_DIR + "/process_data/binned/{sample}.R1.{bin}.fastq.gz" output: - trim=temp(RUN_DIR + "/process_data/trimmed/{sample}.R1.trim.fastq.gz"), - stat=temp(RUN_DIR + "/process_data/stats/{sample}.R1.trim.stat") + trim=temp(RUN_DIR + "/process_data/trimmed/{sample}.R1.{bin}.trim.fastq.gz"), + stat=temp(RUN_DIR + "/process_data/stats/{sample}.R1.{bin}.trim.stat") params: tool=ROOT_DIR + "/tools/rscripts/trim.R", lead=lambda wildcards: R1_LEAD[wildcards.sample], @@ -15,7 +15,7 @@ rule seq_trim_R1: overMis=config["R1overMismatch"], overLen=config["R1overMaxLength"] log: - RUN_DIR + "/logs/{sample}.R1.trim.log" + RUN_DIR + "/logs/{sample}.R1.{bin}.trim.log" resources: mem_mb=lambda wildcards, attempt: attempt * 
config["trimMB"] shell: @@ -29,10 +29,10 @@ rule seq_trim_R1: rule seq_trim_R2_primer: input: - RUN_DIR + "/process_data/demulti/{sample}.R2.fastq.gz" + RUN_DIR + "/process_data/binned/{sample}.R2.{bin}.fastq.gz" output: - trim=temp(RUN_DIR + "/process_data/trimmed/primer/{sample}.R2.primer.trim.fastq.gz"), - stat=temp(RUN_DIR + "/process_data/stats/{sample}.R2.primer.trim.stat") + trim=temp(RUN_DIR + "/process_data/trimmed/primer/{sample}.R2.{bin}.primer.trim.fastq.gz"), + stat=temp(RUN_DIR + "/process_data/stats/{sample}.R2.{bin}.primer.trim.stat") params: tool=ROOT_DIR + "/tools/rscripts/trim.R", lead=lambda wildcards: R2_LEAD[wildcards.sample], @@ -41,7 +41,7 @@ rule seq_trim_R2_primer: overMis=config["R2overMismatch"], overLen=config["R2overMaxLength"] log: - RUN_DIR + "/logs/{sample}.R2.primer.trim.log" + RUN_DIR + "/logs/{sample}.R2.{bin}.primer.trim.log" resources: mem_mb=lambda wildcards, attempt: attempt * config["trimMB"] shell: @@ -55,16 +55,16 @@ rule seq_trim_R2_primer: rule seq_trim_R2_odn: input: - RUN_DIR + "/process_data/trimmed/primer/{sample}.R2.primer.trim.fastq.gz" + RUN_DIR + "/process_data/trimmed/primer/{sample}.R2.{bin}.primer.trim.fastq.gz" output: - trim=temp(RUN_DIR + "/process_data/trimmed/{sample}.R2.trim.fastq.gz"), - stat=temp(RUN_DIR + "/process_data/stats/{sample}.R2.trim.stat") + trim=temp(RUN_DIR + "/process_data/trimmed/{sample}.R2.{bin}.trim.fastq.gz"), + stat=temp(RUN_DIR + "/process_data/stats/{sample}.R2.{bin}.trim.stat") params: tool=ROOT_DIR + "/tools/rscripts/trim.R", lead=lambda wildcards: R2_LEAD_ODN[wildcards.sample], leadMis=config["R2odnMismatch"] log: - RUN_DIR + "/logs/{sample}.R2.odn.trim.log" + RUN_DIR + "/logs/{sample}.R2.{bin}.odn.trim.log" resources: mem_mb=lambda wildcards, attempt: attempt * config["trimMB"] shell: diff --git a/rules/umitag.rules b/rules/umitag.rules index 87b23372..e6e241ea 100644 --- a/rules/umitag.rules +++ b/rules/umitag.rules @@ -3,17 +3,17 @@ rule collect_umitags: input: - RUN_DIR + "/process_data/demulti/{sample}.I2.fastq.gz" + RUN_DIR + "/process_data/binned/{sample}.I2.{bin}.fastq.gz" output: - seq=temp(RUN_DIR + "/process_data/indices/{sample}.I2.trim.fastq.gz"), - umi=temp(RUN_DIR + "/process_data/indices/{sample}.umitags.fasta.gz"), - stat=temp(RUN_DIR + "/process_data/stats/{sample}.umitags.stat") + seq=temp(RUN_DIR + "/process_data/indices/{sample}.I2.{bin}.trim.fastq.gz"), + umi=temp(RUN_DIR + "/process_data/indices/{sample}.{bin}.umitags.fasta.gz"), + stat=temp(RUN_DIR + "/process_data/stats/{sample}.{bin}.umitags.stat") params: tool=ROOT_DIR + "/tools/rscripts/trim.R", seq=lambda wildcards: UMIseqs[wildcards.sample], mis=config["bc2Mismatch"] log: - RUN_DIR + "/logs/{sample}.umitag.log" + RUN_DIR + "/logs/{sample}.{bin}.umitag.log" resources: mem_mb=lambda wildcards, attempt: attempt * config["trimMB"] shell: diff --git a/tools/dev/parse_input_seqs.R b/tools/dev/parse_input_seqs.R deleted file mode 100644 index 3792a40c..00000000 --- a/tools/dev/parse_input_seqs.R +++ /dev/null @@ -1,31 +0,0 @@ -args <- commandArgs(trailingOnly = TRUE) -seq_file <- ShortRead::readFastq(args[1]) - -seq_file <- split( - seq_file, - ceiling(seq_along(seq_file)/(length(seq_file)/as.numeric(args[3]))) -) - -input_path <- unlist(stringr::str_split(args[1], "/")) -output_path <- args[2] - -sample_name <- unlist( - stringr::str_split(tail(input_path, n = 1), stringr::fixed(".")) -) - -sample_name <- paste(sample_name[1:2], collapse = ".") - -null <- lapply( - seq_along(seq_file), - function(i){ - - sam_name <- paste0( - 
sample_name, ".chunk", stringr::str_pad(i, 2, pad = "0"), ".fastq.gz") - - ShortRead::writeFastq( - object = seq_file[[i]], - file = file.path(output_path, sam_name), - compress = TRUE) - - } -) diff --git a/tools/dev/snakefile.high.par b/tools/dev/snakefile.high.par deleted file mode 100644 index fc48eee6..00000000 --- a/tools/dev/snakefile.high.par +++ /dev/null @@ -1,147 +0,0 @@ -# iGUIDE : Improved Genome-wide Unbiased Identification of Double-strand DNA brEaks -# Additional Snakefile for processing samples with higher levels of parallelization -# -# Author : Christopher Nobles, Ph.D. -# useful modules sys, os, csv - -import os -import sys -import re -import yaml -import configparser -from tools.pytools.defs import * - -# Working paths -RUN = config["Run_Name"] -ROOT_DIR = config["Install_Directory"] -RUN_DIR = config["Install_Directory"] + "/analysis/" + RUN - -# Import sampleInfo -if ".csv" in config["Sample_Info"]: - delim = "," -elif ".tsv" in config["Sample_Info"]: - delim = "\t" -else: - raise SystemExit("Sample Info file needs to contain extention '.csv' or '.tsv'.") - -sampleInfo = import_sample_info( - config["Sample_Info"], config["Sample_Name_Column"], delim) - -# Configuration -ROOT_DIR = "/home/cnobles/iGUIDE" -SAMPLE_PATH = ROOT_DIR + "/analysis/iGUIDE_Set0/processData/GSSP0004-neg" -SPECIMEN = "GSSP0004-neg" -REPS = [1, 2, 3, 4] -NUM_CHUNKS = 50 -CHUNKS = ["chunk%02d" % i for i in range(1,NUM_CHUNKS+1)] - -SAMPLES=sampleInfo[config["Sample_Name_Column"]] -TYPES=config["Read_Types"] -READS=config["Genomic_Reads"] -R1_LEAD=choose_sequence_data(config["R1_Leading_Trim"], sampleInfo) -R1_OVER=choose_sequence_data(config["R1_Overreading_Trim"], sampleInfo) -R2_LEAD=choose_sequence_data(config["R2_Leading_Trim"], sampleInfo) -R2_LEAD_ODN=choose_sequence_data(config["R2_Leading_Trim_ODN"], sampleInfo) -R2_OVER=choose_sequence_data(config["R2_Overreading_Trim"], sampleInfo) - -READ_TYPES = ["R1", "R2"] -REF_GENOME = "hg38" -REF_GENOME_2BIT = ROOT_DIR + "/genomes/" + REF_GENOME + ".2bit" - -wildcard_constraints: - sample="GSSP\d{4}-\w{3}-\d", - read="R\d", - chunk="chunk\d{2}" - -# Target Rule -rule all: - input: - sites=expand(RUN_DIR + "/output/uniqSites/GSSP0004-neg-{rep}.uniq.csv", rep = ["1", "2", "3", "4"]), - umis=expand(RUN_DIR + "/processData/GSSP0004-neg-{rep}.umitags.fasta.gz", rep = ["1", "2", "3", "4"]) - -# Workflow Rules -include: "rules/demultiplex/demulti.rules" -include: "rules/sequence_trim/trim.rules" -if (config["UMItags"]): - include: "rules/sequence_trim/umitag.rules" - UMIseqs = sampleInfo["barcode2"] -include: "rules/filter/filt.rules" - -rule make_chunk_dir: - input: "configs/" + RUN + ".config.yml" - output: SAMPLE_PATH - shell: "mkdir {output}" - -rule chunk_seqs: - input: RUN_DIR + "/processData/{sample}.{read}.filt.fastq.gz" - output: temp(expand(SAMPLE_PATH + "/{{sample}}.{{read}}.{chunk}.fastq.gz", chunk = CHUNKS)) - params: - chunks=NUM_CHUNKS, - outpath=SAMPLE_PATH - shell: - "Rscript tools/rtools/parse_input_seqs.R {input} {params.outpath} {params.chunks}" - -rule consol: - input: SAMPLE_PATH + "/{sample}.{read}.{chunk}.fastq.gz" - output: - consol=temp(SAMPLE_PATH + "/{sample}.{read}.{chunk}.consol.fasta"), - key=temp(SAMPLE_PATH + "/{sample}.{read}.{chunk}.key.csv") - shell: - """ - TOOL="{ROOT_DIR}/tools/seqConsolidateR/seqConsolidate.R" - Rscript ${{TOOL}} {input} -o {output.consol} -k {output.key} - """ - -rule align: - input: - seq=SAMPLE_PATH + "/{sample}.{read}.{chunk}.consol.fasta", - genome=ancient(REF_GENOME_2BIT) - output: - 
SAMPLE_PATH + "/{sample}.{read}.{chunk}.psl" - params: - "-tileSize=11 -stepSize=9 -minIdentity=85 -maxIntron=5 -minScore=27 -dots=1000 -out=psl -noHead" - shell: - "blat {input.genome} {input.seq} {output} {params}" - -rule compress_align: - input: SAMPLE_PATH + "/{sample}.{read}.{chunk}.psl" - output: SAMPLE_PATH + "/{sample}.{read}.{chunk}.psl.gz" - shell: "gzip {input}" - -rule generate_2bit: - input: REF_GENOME - output: REF_GENOME_2BIT - shell: - "Rscript {ROOT_DIR}/tools/rtools/generate_2bit_genome.R {input} {output}" - -rule post_align: - input: - sampleR1=SAMPLE_PATH + "/{sample}.R1.{chunk}.psl.gz", - sampleR2=SAMPLE_PATH + "/{sample}.R2.{chunk}.psl.gz", - keyR1=SAMPLE_PATH + "/{sample}.R1.{chunk}.key.csv", - keyR2=SAMPLE_PATH + "/{sample}.R2.{chunk}.key.csv" - output: temp(SAMPLE_PATH + "/{sample}.{chunk}.uniq.csv") - params: - ref=REF_GENOME, - start=5, - pct=95, - minLen=30, - maxLen=2500 - shell: - """ - TOOL="{ROOT_DIR}/tools/blatCoupleR/blatCouple.R" - Rscript ${{TOOL}} {input.sampleR2} {input.sampleR1} \ - -k {input.keyR2} {input.keyR1} \ - -o {output} \ - -g {params.ref} --maxAlignStart {params.start} --minPercentIdentity {params.pct} \ - --minTempLength {params.minLen} --maxTempLength {params.maxLen} - """ - -rule combine: - input: expand(SAMPLE_PATH + "/{{sample}}.{chunk}.uniq.csv", chunk = CHUNKS) - output: SAMPLE_PATH + "/{sample}.uniq.csv" - shell: - """ - ls {input} | xargs head -n 1 -q | uniq > {output} - cat {input} | sed '/seqnames/d' >> {output} - """ diff --git a/tools/iguidelib/iguidelib/scripts/clean.py b/tools/iguidelib/iguidelib/scripts/clean.py index 57dda8b6..76e7f718 100644 --- a/tools/iguidelib/iguidelib/scripts/clean.py +++ b/tools/iguidelib/iguidelib/scripts/clean.py @@ -37,9 +37,9 @@ def main( argv = sys.argv ): ) parser.add_argument( - "-q", "--quiet", + "-k", "--keep_input", action="store_true", - help = "Will not print messages." + help = "Will not remove files from the input_data directory." ) parser.add_argument( @@ -48,6 +48,12 @@ def main( argv = sys.argv ): help = "Removes the entire project analysis directory. This will delete everything." ) + parser.add_argument( + "-q", "--quiet", + action="store_true", + help = "Will not print messages." 
+    )
+
     parser.add_argument(
         "-i", "--iguide_dir",
         default = os.getenv("IGUIDE_DIR"),
@@ -80,7 +86,10 @@ def main( argv = sys.argv ):
             sys.exit(1)
 
     if not args.remove_proj:
-        directories_to_clean = ["input_data", "logs", "process_data"]
+        directories_to_clean = ["logs", "process_data"]
+
+        if not args.keep_input:
+            directories_to_clean.append("input_data")
 
     files_to_clean = []
     for directory in directories_to_clean:
diff --git a/tools/iguidelib/iguidelib/scripts/command.py b/tools/iguidelib/iguidelib/scripts/command.py
index b9e162b3..abec93af 100644
--- a/tools/iguidelib/iguidelib/scripts/command.py
+++ b/tools/iguidelib/iguidelib/scripts/command.py
@@ -11,6 +11,7 @@ from iguidelib.scripts.report import main as Report
 from iguidelib.scripts.summary import main as Summary
 from iguidelib.scripts.clean import main as Clean
+from iguidelib.scripts.hints import main as Hints
 
 def main():
 
@@ -21,12 +22,14 @@ def main():
         " setup \tCreate a new config file for a project using local data.\n"
         " run \tExecute the iGUIDE pipeline.\n\n"
         " auxiliary:\n"
+        " list_samples \tOutput a list of samples from a project.\n"
         " eval \tEvaluate a set or sets of assimilated iGUIDE outputs.\n"
         " report \tGenerate a custom report from iGUIDE output files.\n"
-        " summary \tGenerate a consise summary from iGUIDE output files.\n"
-        " list_samples \tOutput a list of samples from a project.\n"
+        " summary \tGenerate a concise summary from iGUIDE output files.\n\n"
+        " additional:\n"
         " config \t[inDev] Modify or update iGUIDE config files.\n"
-        " clean \tCleanup project directory to reduce size. Keeps terminal files.\n\n"
+        " clean \tCleanup project directory to reduce size. Keeps terminal files.\n"
+        " hints \tPrint helpful snakemake options to use during processing.\n\n"
     ).format(version=__version__)
 
     parser = argparse.ArgumentParser(
@@ -51,7 +54,7 @@ def main():
 
     sub_cmds = [
         "setup", "run", "eval", "report",
-        "summary", "config", "list_samples", "clean"
+        "summary", "config", "list_samples", "clean", "hints"
     ]
 
     if not args.command in sub_cmds:
@@ -85,5 +88,7 @@ def main():
         ListSamples(remaining)
     elif args.command == "clean":
         Clean(remaining)
+    elif args.command == "hints":
+        Hints()
     else:
         parser.print_help()
diff --git a/tools/iguidelib/iguidelib/scripts/hints.py b/tools/iguidelib/iguidelib/scripts/hints.py
new file mode 100644
index 00000000..66f1b629
--- /dev/null
+++ b/tools/iguidelib/iguidelib/scripts/hints.py
@@ -0,0 +1,21 @@
+def main():
+
+    print(
+        "SnakeMake Command Hints:\n\n"
+        " --cores [X] \tMulticore processing; X specifies the number of cores to use.\n"
+        " --nolock \tDon't lock the iGUIDE directory; allows running multiple 'iguide run' processes.\n"
+        " --notemp \tKeep all temporary files, which are removed by default during processing.\n"
+        " --keep-going \tKeep processing even if one job has an error.\n"
+        " -k \tShort option for --keep-going.\n"
+        " --latency-wait [X] \tWait X seconds after a job completes for output verification; can help with slow servers.\n"
+        " -w [X] \tShort option for '--latency-wait'.\n"
+        " --restart-times X \tX is the number of times to restart a job if it fails.
Increases 'attempt' each time.\n" + " --resources mem_mb=X\tControls the resource limit for 'mem_mb' to help manage pipeline processing.\n" + " --rerun-incomplete \tRe-run all jobs that were not complete before termination.\n" + " --ri \tShort option for '--rerun-incomplete'.\n" + " --cluster-config X \tA JSON or YAML file that defines wildcards used for HPC.\n\n" + "Remember to pass these options after the '--' flag!\n" + "Usage:\n" + " iguide run -- --nolock --cores 12 --keep-going\n\n" + "For more help, see the docs at http://iguide.readthedocs.io." + ) diff --git a/tools/iguidelib/iguidelib/scripts/report.py b/tools/iguidelib/iguidelib/scripts/report.py index f751d8b3..3f8b2617 100644 --- a/tools/iguidelib/iguidelib/scripts/report.py +++ b/tools/iguidelib/iguidelib/scripts/report.py @@ -111,6 +111,11 @@ def main( argv = sys.argv ): help = "Includes an opening graphic on the report." ) + parser.add_argument( + "--override", action="store_true", + help = "Override software and build version control checks on input data." + ) + parser.add_argument( "--template", nargs = 1, default = "tools/rscripts/report_templates/iGUIDE_report_template.Rmd", @@ -197,6 +202,9 @@ def main( argv = sys.argv ): if args.support is not None: eval_comps = eval_comps + ["-s"] + args.support + + if args.override is True: + eval_comps = eval_comps + ["--override"] eval_comps = eval_comps + ["--iguide_dir", str(iguide_directory)] diff --git a/tools/rscripts/assimilate_incorp_data.R b/tools/rscripts/assimilate_incorp_data.R index 6968684f..6ce0a978 100644 --- a/tools/rscripts/assimilate_incorp_data.R +++ b/tools/rscripts/assimilate_incorp_data.R @@ -39,19 +39,18 @@ parser$add_argument( ) parser$add_argument( - "-u", "--umitags", nargs = "+", type = "character", + "-u", "--umitags", nargs = 1, type = "character", help = paste( - "Path(s) to associated fasta files containing read specific", - "random captured sequences. Multiple file paths can be separated by", - "a space." + "Path to directory with associated fasta files containing read specific", + "random captured sequences (*.umitags.fasta.gz)." ) ) parser$add_argument( - "-m", "--multihits", nargs = "+", type = "character", + "-m", "--multihits", nargs = 1, type = "character", help = paste( - "Path(s) to associated multihit files (.rds) as produced by coupling", - "alignment output files. Multiple file paths can be separated by a space." + "Path to directory with associated multihit files (*.multihits.rds) as", + "produced by coupling alignment output files." 
) ) @@ -122,6 +121,15 @@ print( ) +# Get versioning ---- +soft_version <- as.character(read.delim( + file = file.path(root_dir, ".version"), header = FALSE)) + +build_version <- list.files(file.path(root_dir, "etc")) %>% + grep(pattern = "build.b[0-9\\.]+.*", x = ., value = TRUE) %>% + stringr::str_extract(pattern = "b[0-9]+\\.[0-9]+\\.[0-9]+") + + # Inputs and parameters ---- # Run parameters and sample parameters config <- yaml::yaml.load_file(args$config) @@ -227,7 +235,13 @@ if( all(!is.null(args$multihits)) ){ seqinfo = GenomeInfoDb::seqinfo(ref_genome) ) - multi_reads <- unlist(GRangesList(lapply(args$multihits, function(x){ + multihit_files <- list.files(path = args$multihits, full.names = TRUE) + + multihit_files <- multihit_files[ + stringr::str_detect(multihit_files, ".multihits.rds") + ] + + multi_reads <- unlist(GRangesList(lapply(multihit_files, function(x){ multi <- readRDS(x) GenomeInfoDb::seqinfo(multi$unclustered_multihits) <- @@ -327,7 +341,13 @@ rm(temp_table) if( all(!is.null(args$umitags)) ){ - umitags <- lapply(args$umitags, ShortRead::readFasta) + umitag_files <- list.files(path = args$umitags, full.names = TRUE) + + umitag_files <- umitag_files[ + stringr::str_detect(umitag_files, ".umitags.fasta") + ] + + umitags <- lapply(umitag_files, ShortRead::readFasta) umitags <- serialAppendS4(umitags) reads$umitag <- as.character(ShortRead::sread(umitags))[ @@ -363,9 +383,17 @@ if( args$stat != FALSE ){ # Output data ---- ## rds file that can be read into evaluation or reports or loaded into a ## database with some additional scripting. -reads %>% - dplyr::select(-lociPairKey, -readPairKey) %>% - saveRDS(file = args$output) +fmt_reads <- reads %>% + dplyr::select(-lociPairKey, -readPairKey) + +output_file <- list( + "soft_version" = soft_version, + "build_version" = build_version, + "config" = config, + "reads" = fmt_reads +) + +saveRDS(output_file, file = args$output) if( all(sapply(output_files, file.exists)) ){ message("Successfully completed script.") diff --git a/tools/rscripts/bin_seqs.R b/tools/rscripts/bin_seqs.R new file mode 100644 index 00000000..1f7ec0af --- /dev/null +++ b/tools/rscripts/bin_seqs.R @@ -0,0 +1,229 @@ +#' For anyone reviewing the code below, the following is a small style guide +#' outlining the various formats for the code. +#' +#' Names with "_": objects, including data.frames, GRanges, vectors, ... +#' Names in camelCase format: functions or components of objects (i.e. columns +#' within a data.frame). +#' Names with ".": arguments / options for functions + +options(stringsAsFactors = FALSE, scipen = 99, width = 120) + +# Set up and gather command line arguments ---- +parser <- argparse::ArgumentParser( + description = "Separate sequence files into bins of appropriate size.", + usage = paste( + "Rscript bin_seqs.R <seqs> [-o <dir>] [optional args] [-h/--help]" + ) +) + +parser$add_argument( + "seqs", nargs = "+", type = "character", + help = paste( + "Path(s) to sequence files to separate into bins. Only read names in the first", + "file will be used for indexing and splitting. Make sure all files have", + "the same content! Read order will be set by the first file. Fasta or Fastq", + "formats allowed, as well as gzipped compression." + ) +) + +parser$add_argument( + "-o", "--output", nargs = 1, type = "character", default = ".", + help = "Directory for output files to be written. Default: '.'" +) + +parser$add_argument( + "-b", "--bins", nargs = 1, type = "integer", default = 2L, + help = "The number of bins to separate files into, default is 2."
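+  # Editorial note: e.g. '-b 4' splits reads evenly across 4 bins, while adding '-l/--level' fills bins sequentially up to that level instead.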
+) + +parser$add_argument( + "-l", "--level", nargs = 1, type = "integer", default = 0L, + help = paste( + "Fill level for each bin. If specified, then the script will fill files to the", + "specified level with reads before filling the next file, sequentially.", + "If the total number of reads would fill all bins to their level, then", + "reads will be evenly distributed across all bins, which is the default", + "behavior. Default value: 0." + ) +) + +parser$add_argument( + "--compress", action = "store_true", + help = paste( + "Output sequence file(s) in gzipped format. Otherwise this relies on the", + "input format." + ) +) + +parser$add_argument( + "--readNamePattern", nargs = 1, type = "character", + default = "[\\w\\:\\-\\+]+", + help = paste( + "Regular expression for pattern matching read names. Should not contain", + "R1/R2/I1/I2 specific components. Default is [\\w\\:\\-\\+]+" + ) +) + + +args <- parser$parse_args(commandArgs(trailingOnly = TRUE)) + +# Create output directory if not currently available ---- +if( !dir.exists(args$output) ){ + + dir.create(args$output) + if(!dir.exists(args$output)) stop("Cannot create output folder.\n") + args$output <- normalizePath(args$output) + +} + +# Load sequence files +seq_files <- lapply(args$seqs, function(x){ + + if( stringr::str_detect(x, ".fastq") | stringr::str_detect(x, ".fq") ){ + return(ShortRead::readFastq(x)) + }else{ + return(ShortRead::readFasta(x)) + } + +}) + + +# Score indices from first sequence for binning input sequences +if( length(seq_files[[1]]) <= args$bins * args$level ){ + + seq_idx <- split( + seq_along(seq_files[[1]]), + ceiling(seq_along(seq_files[[1]]) / args$level) + ) + + if( length(seq_idx) < args$bins ){ + seq_idx <- c( + seq_idx, + split(integer(), seq(length(seq_idx)+1, args$bins, 1)) + ) + } + +}else{ + + seq_idx <- split( + seq_along(seq_files[[1]]), + ceiling( + seq_along(seq_files[[1]])/(length(seq_files[[1]])/as.numeric(args$bins)) + ) + ) + +} + +seq_names <- stringr::str_extract( + as.character(ShortRead::id(seq_files[[1]])), + args$readNamePattern +) + +seq_name_list <- lapply(seq_idx, function(i) seq_names[i]) + +# Split and write sequences to output directory +output_files <- strsplit(args$seqs, "/") + +output_files <- unlist(mapply( + function(i, j) output_files[[i]][j], + i = seq_along(output_files), + j = lengths(output_files), + SIMPLIFY = FALSE +)) + +if( any(stringr::str_detect(output_files, ".gz$")) | args$compress ){ + args$compress <- TRUE +}else{ + args$compress <- FALSE +} + +expanded_output_file_names <- lapply(output_files, function(x){ + + x <- stringr::str_remove(x, ".gz$") + + ext <- unlist(strsplit(x, "\\.")) + lead <- paste(ext[-length(ext)], collapse = ".") + ext <- ext[length(ext)] + + bins <- stringr::str_pad(seq_len(args$bins), nchar(args$bins), pad = 0) + exp_names <- paste0(lead, ".bin", bins, ".", ext) + + if( args$compress ){ + exp_names <- paste0(exp_names, ".gz") + } + + exp_names + +}) + +# Write output files +null <- mapply( + function(seqs, outputs, idx_names){ + + null <- mapply( + function(idx, outfile){ + + matched_idx <- match(idx, stringr::str_extract( + as.character(ShortRead::id(seqs)), args$readNamePattern + )) + + if( any(table(matched_idx) > 1) ){ + stop("\n ReadNamePattern is ambiguous, please refine.") + } + + if( file.exists(file.path(args$output, outfile)) ){ + unlink(file.path(args$output, outfile)) + } + + if( stringr::str_detect(outfile, ".fastq") | + stringr::str_detect(outfile, ".fq") ){ + + ShortRead::writeFastq( + object = seqs[matched_idx], + file
= file.path(args$output, outfile), + compress = args$compress + ) + + }else{ + + ShortRead::writeFasta( + object = seqs[matched_idx], + file = file.path(args$output, outfile), + compress = args$compress + ) + + } + + }, + idx = idx_names, + outfile = outputs + ) + + }, + seqs = seq_files, + outputs = expanded_output_file_names, + MoreArgs = list(idx_names = seq_name_list) +) + +# Check for output files +if( + all(file.exists(file.path(args$output, unlist(expanded_output_file_names)))) +){ + + cat( + "\nAll files written to output directory:\n ", + paste( + file.path(args$output, unlist(expanded_output_file_names)), + collapse = "\n " + ), + "\n" + ) + + q(save = "no", status = 0) + +}else{ + + stop("\n Could not confirm existence of all output files.\n") + +} + diff --git a/tools/rscripts/check_file_digests.R b/tools/rscripts/check_file_digests.R index e9563958..357964ba 100644 --- a/tools/rscripts/check_file_digests.R +++ b/tools/rscripts/check_file_digests.R @@ -122,7 +122,15 @@ readFile <- function(path, root){ }else{ - return(readRDS(path)) + rds_import <- readRDS(path) + + if( class(rds_import) == "list" ){ + return(rds_import[[ + which(sapply(rds_import, class) == "data.frame") + ]]) + }else{ + return(rds_import) + } } diff --git a/tools/rscripts/check_test_accuracy.R b/tools/rscripts/check_test_accuracy.R index 5d0d499d..31d6e180 100644 --- a/tools/rscripts/check_test_accuracy.R +++ b/tools/rscripts/check_test_accuracy.R @@ -145,7 +145,22 @@ readFile <- function(path, root){ }else if( any(exts %in% c("rds")) ){ - return(readRDS(path)) + rds_import <- readRDS(path) + + if( class(rds_import) == "list" ){ + + if( any(sapply(rds_import, class) == "data.frame") ){ + + idx <- which(sapply(rds_import, class) == "data.frame") + return(rds_import[[idx[1]]]) + + }else{ + return(as.data.frame(rds_import[[1]], row.names = NULL)) + } + + }else{ + return(rds_import) + } }else{ @@ -172,14 +187,15 @@ names(check_data) <- c("uniq_sites") check_data$multihits <- suppressMessages(dplyr::bind_rows(lapply( sample_info$sampleName, function(x){ - y <- readFile( + + readFile( paste0( - "analysis/", run_config$Run_Name, "/process_data/post_align/", + "analysis/", run_config$Run_Name, "/process_data/multihits/", x, ".multihits.rds" ), args$iguide_dir ) - as.data.frame(y$unclustered_multihits, row.names = NULL) + }), .id = "specimen" )) diff --git a/tools/rscripts/collect_stats.R b/tools/rscripts/collect_stats.R index fab86baa..597d3586 100644 --- a/tools/rscripts/collect_stats.R +++ b/tools/rscripts/collect_stats.R @@ -87,16 +87,43 @@ long_data <- dplyr::bind_rows( .id = "type" ) +fmt_long_data <- long_data %>% + dplyr::distinct(type, sampleName, metric, count) %>% + dplyr::mutate( + bin = stringr::str_extract(type, "bin[0-9]+"), + read = ifelse( + stringr::str_detect(type, "R[12]."), + ifelse(stringr::str_detect(type, "R1."), "R1", "R2"), + NA + ), + type = stringr::str_remove(type, "bin[0-9]+.") + ) %>% + dplyr::group_by(sampleName, type, metric, read) %>% + dplyr::summarise(count = sum(count)) %>% + dplyr::ungroup() %>% + dplyr::filter( + (stringr::str_detect(metric, "multihit") & + stringr::str_detect(type, "multihits")) | + !stringr::str_detect(metric, "multihit") + ) %>% + dplyr::mutate(type = ifelse(type == "multihits", "align", type)) %>% + dplyr::ungroup() + # Transform data into a wide format wide_data <- dplyr::mutate( - long_data, + fmt_long_data, type = paste0(type, ".", metric), type = factor(type, levels = unique(type)) ) %>% - dplyr::select(-metric) %>% - dplyr::distinct() %>% +
dplyr::select(-metric, -read) %>% tidyr::spread(type, count) +wide_cols <- names(wide_data) + +wide_data <- wide_data[ + ,c("sampleName", sort(wide_cols[-match("sampleName", wide_cols)])) +] + # Write data to output write.csv(wide_data, file = args$output, quote = FALSE, row.names = FALSE) diff --git a/tools/rscripts/combine_multihits.R b/tools/rscripts/combine_multihits.R new file mode 100644 index 00000000..3c164532 --- /dev/null +++ b/tools/rscripts/combine_multihits.R @@ -0,0 +1,386 @@ +#!/usr/bin/env Rscript + +#' For those reviewing the code below, the following is a small style guide +#' outlining the various formats for the code. +#' +#' Names with "_": objects, including data.frames, GRanges, vectors, ... +#' Names in camelCase format: functions or components of objects (i.e. columns +#' within a data.frame). +#' Names with ".": arguments / options for functions + +options(stringsAsFactors = FALSE, scipen = 99, width = 120) +suppressMessages(library(magrittr)) + +code_dir <- dirname(sub( + pattern = "--file=", + replacement = "", + x = grep("--file=", commandArgs(trailingOnly = FALSE), value = TRUE) +)) + +# Set up and gather command line arguments +parser <- argparse::ArgumentParser( + description = "Script for combining multihit objects together.", + usage = paste( + "Rscript combine_multihits.R -d <dir> -p <pattern>", + "[-h/--help, -v/--version] [optional args]" + ) +) + +parser$add_argument( + "-d", "--dir", nargs = 1, type = "character", + help = paste( + "Directory in which to look for multihit files. Combine with 'pattern'", + "to select specific files." + ) +) + +parser$add_argument( + "-p", "--pattern", nargs = 1, type = "character", default = ".", + help = paste( + "Pattern to identify files within the directory specified to combine.", + "Regex patterns supported through R. Default: '.'" + ) +) + +parser$add_argument( + "-o", "--output", nargs = 1, type = "character", required = TRUE, + help = "Output file name. Output format only supports R-based rds format." +) + +parser$add_argument( + "-s", "--stat", nargs = 1, type = "character", default = FALSE, + help = "Stat output name. Stats output in long csv file format." +) + + +args <- parser$parse_args(commandArgs(trailingOnly = TRUE)) + + +# Check output file name +if( !stringr::str_detect(args$output, ".rds$") ){ + + stop(paste( + "\n Output file name must be in rds format.", + "\n Please change name to have the proper extension (*.rds).\n" + )) + +} + + +# Print Inputs to terminal +input_table <- data.frame( + "Variables" = paste0(names(args), " :"), + "Values" = sapply(seq_along(args), function(i){ + paste(args[[i]], collapse = ", ") + }) +) + +input_table <- input_table[ + match( + c("dir :", "pattern :", "output :", "stat :"), + input_table$Variables + ), +] + +cat("\nCombine Multihit Inputs:\n") +print( + data.frame(input_table), + right = FALSE, + row.names = FALSE +) + +# Clear output file and prep output path +write(c(), file = args$output) +args$output <- normalizePath(args$output) +unlink(args$output) + +# Check for input files +input_files <- list.files(path = args$dir) + +if( args$pattern != "."
){ + input_files <- input_files[stringr::str_detect(input_files, args$pattern)] +} + +if( length(input_files) == 0 ){ + + cat("\nWarning:\n No input files identified, writing empty output files.\n") + + saveRDS( + object = list( + "unclustered_multihits" = GenomicRanges::GRanges(), + "clustered_multihit_positions" = GenomicRanges::GRangesList(), + "clustered_multihit_lengths" = IRanges::RleList() + ), + file = args$output + ) + + if( args$stat != FALSE ){ + write.table( + x = data.frame(), file = args$stat, + sep = ",", row.names = FALSE, col.names = FALSE, quote = FALSE + ) + } + +}else{ + + cat(paste( + "\n Multihit files to join together (first few shown):\n ", + paste(head(file.path(args$dir, input_files)), collapse = "\n ") + )) + +} + + +# Load supporting scripts +source(file.path(code_dir, "supporting_scripts", "printHead.R")) +source(file.path(code_dir, "supporting_scripts", "writeOutputFile.R")) + +## Set up stat object +if( args$stat != FALSE ){ + + sampleName <- unlist(strsplit(args$output, "/")) + + sampleName <- unlist( + strsplit(sampleName[length(sampleName)], ".", fixed = TRUE) + )[1] + + stat <- data.frame( + sampleName = vector("character"), + metric = vector("character"), + count = vector("character") + ) + +} + + +# Read in files ---- +multihit_input <- lapply(file.path(args$dir, input_files), readRDS) + +multihits <- unlist(GenomicRanges::GRangesList(lapply( + multihit_input, "[[", "unclustered_multihits" +))) + +num_alignments <- length(multihits) +num_reads <- length(unique(names(multihits))) + +# Message +cat( + "\nA total of", + format(num_alignments, big.mark = ","), + "alignments will be clustered from", + format(num_reads, big.mark = ","), + "reads.\n" +) + + +# Group and characterize multihits +# Multihits are reads that align to multiple locations in the reference +# genome. There are bound to always be a certain proportion of reads aligning +# to repeated sequence due to the high degree of repeated DNA elements +# within genomes. The final object generated, "multihitData", is a list of +# three objects. "unclustered_multihits" is a GRanges object where every +# alignment for every multihit read is present in rows. +# "clustered_multihit_positions" returns all the possible integration site +# positions for the multihit. Lastly, "clustered_multihit_lengths" contains the +# length of the templates mapping to the multihit clusters, used for +# abundance calculations. + +unclustered_multihits <- GenomicRanges::GRanges() +clustered_multihit_positions <- GenomicRanges::GRangesList() +clustered_multihit_lengths <- list() + +if( length(multihits) > 0 ){ + + #' As the loci are expanded from the coupled_loci object, unique templates + #' and readPairKeys are present in the readPairKeys unlisted from the + #' paired_loci object. + multihit_templates <- multihits + + multihit_keys <- multihits %>% + as.data.frame(row.names = NULL) %>% + dplyr::distinct(sampleName, ID, readPairKey) %>% + dplyr::select(sampleName, ID, readPairKey) + + #' Medians are based on all the potential sites for a given read, which will + #' be identical for all reads associated with a readPairKey.
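Editorial aside, not part of the patch: the median computation that follows collapses every candidate alignment width for a readPairKey into one value. A minimal, self-contained R sketch with toy ranges and hypothetical keys ("a", "b"):

```
suppressMessages(library(GenomicRanges))

# Toy alignments: two candidate templates for key "a", one for key "b"
gr <- GRanges(
  seqnames    = c("chr1", "chr1", "chr2"),
  ranges      = IRanges(start = c(100, 200, 50), width = c(40, 60, 55)),
  readPairKey = c("a", "a", "b")
)

# Split by key, then take the rounded median width per key
grl <- split(gr, gr$readPairKey)
round(IRanges::median(GenomicRanges::width(grl)))
#  a  b
# 50 55
```

Every read carrying key "a" would then be assigned the single median length of 50, which is what the pipeline code below does at scale.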
+ multihit_medians <- round( + IRanges::median(GenomicRanges::width(GenomicRanges::GRangesList(split( + x = multihit_templates, + f = multihit_templates$readPairKey + )))) + ) + + multihit_keys$medians <- multihit_medians[ + as.character(multihit_keys$readPairKey) + ] + + multihits_pos <- GenomicRanges::flank( + x = multihit_templates, width = -1, start = TRUE + ) + + multihits_red <- GenomicRanges::reduce( + x = multihits_pos, min.gapwidth = 5L, with.revmap = TRUE + ) #! Should make min.gapwidth an option + + revmap <- multihits_red$revmap + + axil_nodes <- as.character(S4Vectors::Rle( + values = multihit_templates$readPairKey[min(revmap)], + lengths = lengths(revmap) + )) + + nodes <- multihit_templates$readPairKey[unlist(revmap)] + edgelist <- unique(matrix( c(axil_nodes, nodes), ncol = 2 )) + + multihits_cluster_data <- igraph::clusters( + igraph::graph.edgelist(el = edgelist, directed = FALSE) + ) + + clus_key <- data.frame( + row.names = unique(as.character(t(edgelist))), + "clusID" = multihits_cluster_data$membership + ) + + multihits_pos$clusID <- clus_key[ + as.character(multihits_pos$readPairKey), "clusID" + ] + + multihits_pos <- multihits_pos[order(multihits_pos$clusID)] + + clustered_multihit_index <- as.data.frame( + GenomicRanges::mcols(multihits_pos) + ) + + multihit_loci_rle <- S4Vectors::Rle(factor( + x = clustered_multihit_index$lociPairKey, + levels = unique(clustered_multihit_index$lociPairKey) + )) + + multihit_loci_intL <- S4Vectors::split( + multihit_loci_rle, clustered_multihit_index$clusID + ) + + clustered_multihit_positions <- GenomicRanges::granges( + x = multihits_pos[ + match( + x = BiocGenerics::unlist(S4Vectors::runValue(multihit_loci_intL)), + table = clustered_multihit_index$lociPairKey + ) + ] + ) + + clustered_multihit_positions <- GenomicRanges::split( + x = clustered_multihit_positions, + f = S4Vectors::Rle( + values = seq_along(multihit_loci_intL), + lengths = S4Vectors::width(S4Vectors::runValue( + multihit_loci_intL + )@partitioning) + ) + ) + + readPairKey_cluster_index <- unique( + clustered_multihit_index[,c("readPairKey", "clusID")] + ) + + multihit_keys$clusID <- readPairKey_cluster_index$clusID[ + match( + as.character(multihit_keys$readPairKey), + readPairKey_cluster_index$readPairKey + ) + ] + + multihit_keys <- multihit_keys[order(multihit_keys$medians),] + + clustered_multihit_lengths <- split( + x = S4Vectors::Rle(multihit_keys$medians), + f = multihit_keys$clusID + ) + + #' Expand the multihit_templates object from readPairKey-specific to + #' read-specific.
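Editorial aside, not part of the patch: the expansion below converts one row per readPairKey into one row per read. A toy sketch of the IntegerList idiom it relies on (hypothetical keys and read IDs):

```
suppressMessages(library(IRanges))

keys <- data.frame(
  readPairKey = c("a", "a", "b"),
  ID          = c("read1", "read2", "read3")
)
tmpl <- data.frame(readPairKey = c("a", "b"), locus = c("chr1:100", "chr2:50"))

# Group key-table row indices by readPairKey, ordered to match 'tmpl'
exp_idx <- IRanges::IntegerList(split(seq_len(nrow(keys)), keys$readPairKey))
exp_idx <- exp_idx[tmpl$readPairKey]

# Repeat each template row once per associated read, then attach read IDs
tmpl_exp <- tmpl[rep(seq_along(exp_idx), lengths(exp_idx)), ]
tmpl_exp$ID <- keys$ID[unlist(exp_idx)]
tmpl_exp
#   readPairKey    locus    ID
#   a             chr1:100  read1
#   a             chr1:100  read2
#   b             chr2:50   read3
```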
+ multihit_keys <- multihit_keys[order(multihit_keys$readPairKey),] + + multihit_readPair_read_exp <- IRanges::IntegerList( + split(x = seq_len(nrow(multihit_keys)), f = multihit_keys$readPairKey) + ) + + unclustered_multihits <- multihit_templates + + multihit_readPair_read_exp <- multihit_readPair_read_exp[ + as.character(unclustered_multihits$readPairKey) + ] + + unclustered_multihits <- unclustered_multihits[S4Vectors::Rle( + values = seq_along(unclustered_multihits), + lengths = S4Vectors::width(multihit_readPair_read_exp@partitioning) + )] + + names(unclustered_multihits) <- multihit_keys$ID[ + BiocGenerics::unlist(multihit_readPair_read_exp) + ] + + unclustered_multihits$ID <- multihit_keys$ID[ + BiocGenerics::unlist(multihit_readPair_read_exp) + ] + + unclustered_multihits$sampleName <- multihit_keys$sampleName[ + BiocGenerics::unlist(multihit_readPair_read_exp) + ] + +} + +stopifnot( + length(clustered_multihit_positions) == length(clustered_multihit_lengths) +) + +multihitData <- list( + "unclustered_multihits" = unclustered_multihits, + "clustered_multihit_positions" = clustered_multihit_positions, + "clustered_multihit_lengths" = clustered_multihit_lengths +) + +writeOutputFile(multihitData, file = args$output, format = "rds") + +printHead( + data.frame( + "multihit_reads" = length(unique(names(unclustered_multihits))), + "multihit_alignments" = length(unique(unclustered_multihits)), + "multihit_clusters" = length(clustered_multihit_positions), + "multihit_lengths" = sum(lengths(clustered_multihit_lengths)) + ), + title = "Multihit metrics", + caption = "Metrics summarizing reads that align to multiple locations." +) + +if( args$stat != FALSE ){ + + add_stat <- data.frame( + sampleName = sampleName, + metric = c("multihit.reads", "multihit.lengths", "multihit.clusters"), + count = c( + length(unique(names(unclustered_multihits))), + sum(lengths(clustered_multihit_lengths)), + length(clustered_multihit_positions)) + ) + + stat <- rbind(stat, add_stat) + +} + +if( args$stat != FALSE ){ + + write.table( + x = stat, file = args$stat, + sep = ",", row.names = FALSE, + col.names = FALSE, quote = FALSE + ) + +} + +if( file.exists(args$output) ){ + cat("\n Output file generated :", args$output, "\n") + q(save = "no", status = 0) +}else{ + stop("\n Could not verify existence of output file:\n ", args$output, "\n") +} + diff --git a/tools/rscripts/demulti.R b/tools/rscripts/demulti.R index be1bf4a6..5c9abd0d 100644 --- a/tools/rscripts/demulti.R +++ b/tools/rscripts/demulti.R @@ -133,7 +133,7 @@ parser$add_argument( parser$add_argument( "--readNamePattern", nargs = 1, type = "character", default = "[\\w\\:\\-\\+]+", - help = desc$readNamePatter + help = desc$readNamePattern ) diff --git a/tools/rscripts/descriptions/samqc.desc.yml b/tools/rscripts/descriptions/samqc.desc.yml index df3c6e8d..59c4ce1e 100644 --- a/tools/rscripts/descriptions/samqc.desc.yml +++ b/tools/rscripts/descriptions/samqc.desc.yml @@ -14,8 +14,6 @@ maxAlignStart : "Maximum allowable distance from the start of the sequence to ke minPercentIdentity : "Minimal global (whole sequence) percent identity required to keep alignment. Default = 95 (0-100)." minTempLength : "Minimum value for paired template length to consider. Default = 30 (bps)." maxTempLength : "Maximum value for paired template length to consider. Default = 2500 (bps)." -anchorFlags: "Flag code(s) identifying alignments as anchored, as in they are anchored to something of interest. Default: 81 97 337 353 369 65 321."
-adriftFlags: "Flag code(s) identifying alignments with more random placement, such as by DNA fragmentation or sonication. Default: 129 145 161 385 401 417 433." keepAltChr : "By default, blatCoupleR will remove alignments from psl files aligning to alternative chromosome, unknown, and random sequences, ex. chr7_*_alt. Using this option will keep these alignments, which may increase multihit outputs." readNamePattern : "Regular expression for pattern matching read names. Should not contain R1/R2/I1/I2 specific components. Default is [\\w:-]+" saveImage : "Output file name for saved image. Include '.RData'. ie. debug.RData" diff --git a/tools/rscripts/evaluate_incorp_data.R b/tools/rscripts/evaluate_incorp_data.R index 3246e90f..0db4b796 100644 --- a/tools/rscripts/evaluate_incorp_data.R +++ b/tools/rscripts/evaluate_incorp_data.R @@ -48,6 +48,11 @@ parser$add_argument( ) ) +parser$add_argument( + "--override", action = "store_true", + help = "Override software and build version control checks." +) + parser$add_argument( "-q", "--quiet", action = "store_true", help = "Hide standard output messages." ) @@ -278,9 +283,9 @@ if( multihit_option ){ ){ stop( - "\n Inconsistant upstream or downstream distances between config files.\n", - " Comparisons between groups with different run specific criteria\n", - " is not recommended when considering the recover multihit option.\n" + "\n Inconsistent upstream or downstream distances between config files.", + "\n Comparisons between groups with different run-specific criteria", + "\n are not recommended when considering the recover multihit option.\n" ) } @@ -556,7 +561,7 @@ cond_overview <- spec_overview %>% dplyr::select(specimen, condition) -# Beginnin analysis ---- +# Beginning analysis ---- if( !args$quiet ) cat("\nStarting analysis...\n") ## Read in experimental data and contatenate different sets @@ -568,12 +573,18 @@ input_data <- lapply(configs, function(x){ ) if( file.exists(file.path(root_dir, path)) ){ - return(readRDS(file.path(root_dir, path))) + y <- readRDS(file.path(root_dir, path)) }else if( file.exists(path) ){ - return(readRDS(path)) + y <- readRDS(path) }else{ - stop("\n Cannot find edited_sites file: ", x, ".\n") + stop("\n Cannot find incorp_sites file: ", x, ".\n") } + + y$reads %>% + dplyr::mutate( + soft.version = y$soft_version, + build.version = y$build_version + ) }) %>% dplyr::bind_rows(.id = "run.set") %>% @@ -586,6 +597,25 @@ if( !multihit_option ){ input_data <- dplyr::filter(input_data, type == "uniq") } +## Check versioning for imported data ---- +vc_check <- input_data %>% + dplyr::distinct(run.set, soft.version, build.version) + +input_data <- dplyr::select(input_data, -soft.version, -build.version) + +cat("\nVersioning:\n") +print(vc_check, right = FALSE, row.names = FALSE) + +if( dplyr::n_distinct(vc_check$soft.version) > 1 | + dplyr::n_distinct(vc_check$build.version) > 1 ){ + + if( args$override ){ + warning("Data processed under different software versions.") + }else{ + stop("\n Data processed with inconsistent software versions.") + } + +} ## Format input alignments ---- ## Determine abundance metrics, with or without UMItags @@ -1625,6 +1655,7 @@ saveRDS( "configs" = configs, "soft_version" = soft_version, "build_version" = build_version, + "input_vc" = vc_check, "specimen_levels" = specimen_levels ), "spec_info" = list( diff --git a/tools/rscripts/generate_stat_report.R b/tools/rscripts/generate_stat_report.R index f397a437..d312e518 100644 --- a/tools/rscripts/generate_stat_report.R +++
b/tools/rscripts/generate_stat_report.R @@ -23,13 +23,24 @@ parser <- argparse::ArgumentParser( ) parser$add_argument( - "stats", nargs = 2, type = "character", + "-r", "--core", nargs = 1, type = "character", help = paste( - "Stat objects generated by iGUIDE run command. Input order: core, eval.", - "Requires csv format." + "Core stat object generated by iGUIDE run command. Requires csv format." ) ) +parser$add_argument( + "-e", "--eval", nargs = 1, type = "character", + help = paste( + "Eval stat object generated by iGUIDE run command. Requires csv format." + ) +) + +parser$add_argument( + "-i", "--incorpSites", nargs = 1, type = "character", required = TRUE, + help = "Unique sites file (rds format) from the project directory." +) + parser$add_argument( "-o", "--output", nargs = 1, type = "character", required = TRUE, help = "Output report file, extension not required." @@ -52,7 +63,7 @@ parser$add_argument( ) parser$add_argument( - "-i", "--iguide_dir", nargs = 1, type = "character", default = "IGUIDE_DIR", + "--iguide_dir", nargs = 1, type = "character", default = "IGUIDE_DIR", help = "iGUIDE install directory path, do not change for normal applications." ) @@ -79,10 +90,13 @@ code_dir <- dirname(sub( )) # Check input file ---- -core_file <- args$stats[1] -eval_file <- args$stats[2] +core_file <- args$core +eval_file <- args$eval +sites_file <- args$incorpSites -if( !file.exists(core_file) | !file.exists(eval_file) ){ +if( + !file.exists(core_file) | !file.exists(eval_file) | !file.exists(sites_file) + ){ stop("\n Cannot find input stat files. Check inputs.") } @@ -153,10 +167,26 @@ build_version <- list.files(file.path(root_dir, "etc")) %>% signature <- config[["signature"]] # Load input data ---- -core_stat_df <- read.csv(core_file) +core_stat_df <- read.csv(core_file) %>% + dplyr::select( + -align.unique.reads, -align.unique.algns, -align.unique.loci + ) + eval_stat_df <- read.csv(eval_file) +site_stat_df <- readRDS(sites_file)$reads %>% + dplyr::filter(type == "uniq") %>% + dplyr::group_by(sampleName) %>% + dplyr::summarise( + align.unique.reads = dplyr::n_distinct(ID), + align.unique.algns = dplyr::n_distinct(seqnames, start, end, strand), + align.unique.loci = dplyr::n_distinct( + seqnames, strand, ifelse(strand == "+", start, end) + ) + ) + stat_df <- dplyr::full_join(core_stat_df, eval_stat_df, by = "sampleName") %>% + dplyr::full_join(site_stat_df, by = "sampleName") %>% dplyr::mutate_all(function(x) ifelse(is.na(x), rep(0, length(x)), x)) sampleName_levels <- unique(stat_df$sampleName) @@ -188,8 +218,8 @@ sampleName_levels <- c( read_tbl <- dplyr::select( stat_df, c(sampleName, demulti.reads, R1.trim.reads, R2.primer.trim.reads, R2.trim.reads, if( config$UMItags ) umitags.reads, filt.reads, - if( tolower(config$Aligner) == "blat") R1.consol.reads, - if( tolower(config$Aligner) == "blat") R2.consol.reads, + if( tolower(config$Aligner) == "blat" ) R1.consol.reads, + if( tolower(config$Aligner) == "blat" ) R2.consol.reads, align.unique.reads, align.chimera.reads, align.multihit.reads )) %>% dplyr::mutate(sampleName = factor(sampleName, levels = sampleName_levels)) %>% diff --git a/tools/rscripts/report_templates/iGUIDE_stat_template.Rmd b/tools/rscripts/report_templates/iGUIDE_stat_template.Rmd index 48e477e1..fa9cc7c0 100644 --- a/tools/rscripts/report_templates/iGUIDE_stat_template.Rmd +++ b/tools/rscripts/report_templates/iGUIDE_stat_template.Rmd @@ -145,7 +145,7 @@ read_print <- kable( kableExtra::add_header_above( c( " ", " ", "Trimming" = 3, if( config$UMItags ) " ", - " ", if(
tolower(config$Aligner) == "blat") "Consol" = 2, + " ", if( tolower(config$Aligner) == "blat") c("Consol" = 2), "Alignment" = 3 ) ) %>% diff --git a/tools/rscripts/samqc.R b/tools/rscripts/samqc.R index 06f00670..d17624be 100644 --- a/tools/rscripts/samqc.R +++ b/tools/rscripts/samqc.R @@ -81,17 +81,15 @@ parser$add_argument( ) parser$add_argument( - "--anchorFlags", nargs = "+", type = "integer", - default = c(81L, 97L, 337L, 353L, 369L, 65L, 321L), help = desc$anchorFlags -) - -parser$add_argument( - "--adriftFlags", nargs = "+", type = "integer", - default = c(129L, 145L, 161L, 385L, 401L, 417L, 433L), help = desc$adriftFlags + "--keepAltChr", action = "store_true", help = desc$keepAltChr ) parser$add_argument( - "--keepAltChr", action = "store_true", help = desc$keepAltChr + "--batches", nargs = 1, type = "integer", default = 25L, + help = paste( + "A tuning parameter to batch process the alignments, specifies how many", + "batches to use. Default: 25." + ) ) parser$add_argument( @@ -120,7 +118,7 @@ input_table <- input_table[ c("bam :", "bai :", "uniqOutput :", "condSites :", "chimeras :", "multihits :", "stat :", "refGenome :", "maxAlignStart :", "minPercentIdentity :", "minTempLength :", "maxTempLength :", - "anchorFlags :", "adriftFlags :", "keepAltChr :", "readNamePattern :" + "keepAltChr :", "readNamePattern :" ), input_table$Variables ), @@ -217,11 +215,19 @@ if( args$stat != FALSE ){ #' @param tags character vector indicating the additional tags to import. Again, #' refer to the SAMtools or BWA manual for tag names. -loadBAM <- function(bam, bai, params, tags){ +loadBAM <- function(bam, bai, params, tags, onlyPairMapped = TRUE){ algn <- unlist(Rsamtools::scanBam( file = bam, index = bai, - param = Rsamtools::ScanBamParam(what = params, tag = tags) + param = Rsamtools::ScanBamParam( + flag = Rsamtools::scanBamFlag( + isPaired = ifelse(onlyPairMapped, TRUE, NA), + isUnmappedQuery = ifelse(onlyPairMapped, FALSE, NA), + hasUnmappedMate = ifelse(onlyPairMapped, FALSE, NA) + ), + what = params, + tag = tags + ) ), recursive = FALSE ) @@ -249,17 +255,17 @@ calcPctID <- function(cigar, MD){ data.frame("cig" = cigar, "md" = MD, stringsAsFactors = FALSE) %>% dplyr::mutate( - mismatch = S4Vectors::rowSums(matrix( + mismatch = rowSums(matrix( stringr::str_extract_all(md, "[ATGC]", simplify = TRUE) %in% c("A", "T", "G", "C"), nrow = n()), na.rm = TRUE ), - match = S4Vectors::rowSums(matrix(as.numeric(gsub( + match = rowSums(matrix(as.numeric(gsub( "M", "",stringr::str_extract_all(cig, "[0-9]+M", simplify = TRUE) )), nrow = n()), na.rm = TRUE ) - mismatch, - length = S4Vectors::rowSums(matrix(as.numeric(gsub( + length = rowSums(matrix(as.numeric(gsub( "[HSMIDX=]", "", stringr::str_extract_all( cig, "[0-9]+[HSMIDX=]", simplify = TRUE ) @@ -306,10 +312,10 @@ cntClipped <- function(cigar, type = "both", end = "5p"){ } # Capture all patterns and return integer of clipped bases - S4Vectors::rowSums(matrix(as.numeric( + rowSums(matrix(as.numeric( gsub("[HS]", "", stringr::str_extract_all( - cigar, query_pat, simplify = TRUE) - ) + cigar, query_pat, simplify = TRUE + )) ), nrow = length(cigar) ), @@ -335,8 +341,9 @@ cntClipped <- function(cigar, type = "both", end = "5p"){ #' @param maxLen numeric or integer value indicating the maximum distance #' between the two alignments that should be considered. #' @param refGen BSgenome object or other object with GenomeInfoDb::seqinfo. +#' This method is deprecated in favor of the batched implementation below.
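Editorial aside, not part of the patch: the rewritten processAlignments() below serializes its work through the new `batches` parameter. The split idiom it uses, sketched with toy values:

```
# Split N grouped read IDs into 'batches' roughly equal, ordered chunks
ids <- paste0("read", 1:10)   # hypothetical read IDs
batches <- 3L

batch_list <- split(
  seq_along(ids),
  ceiling(seq_along(ids) / (length(ids) / batches))
)

lengths(batch_list)
# 1 2 3
# 3 3 4
```

Each batch is then handled independently inside an lapply, which bounds how many anchor/adrift pairings are expanded at once.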
-processAlignments <- function(id, chr, strand, pos, width, type, minLen = 30L, +.processAlignments <- function(id, chr, strand, pos, width, type, minLen = 30L, maxLen = 2500L, refGen = NULL){ # Check inputs @@ -463,44 +470,186 @@ processAlignments <- function(id, chr, strand, pos, width, type, minLen = 30L, } +#' Process alignment data to valid paired-end alignments representing the input +#' template DNA. +#' @param id character vector indicating grouping of alignments. +#' @param chr character vector of seqnames. If using reference genome, these +#' will need to match seqnames present in the reference object passed to +#' `refGen`. +#' @param strand character vector of strand or alignment orientation, must be +#' either "+" or "-". +#' @param pos numeric or integer vector indicating the "start" of the alignment. +#' @param width numeric or integer vector indicating the width of the alignment. +#' @param type character vector indicating type of alignment +#' ("anchor" or "adrift"). +#' @param minLen numeric or integer value indicating the minimum distance +#' between the two alignments that should be considered. +#' @param maxLen numeric or integer value indicating the maximum distance +#' between the two alignments that should be considered. +#' @param refGen BSgenome object or other object with GenomeInfoDb::seqinfo. +#' @param batches integer indicating the number of batches to serialize the +#' data processing with. The number of reads analyzed within a batch will be +#' the number of unique `id`'s divided by the `batches`. + +processAlignments <- function(id, chr, strand, pos, width, type, minLen = 30L, + maxLen = 2500L, refGen = NULL, batches = 25L){ + + # Check inputs + inputs <- list( + "grp" = id, "chr" = chr, "strand" = strand, + "pos" = pos, "width" = width, "type" = type + ) + + stopifnot( length(unique(sapply(inputs, length))) == 1 ) # All same length + + # Combine into data.frame and build GenomicRanges + input_df <- as.data.frame(inputs) %>% + dplyr::mutate( + grp = as.character(grp), + start = pos, + end = pos + width - 1, + type = as.character(type), + strand = as.character(strand), + pos = ifelse(strand == "+", start, end) + ) %>% + dplyr::select(grp, chr, strand, pos, type) + + idx_list <- IRanges::IntegerList(split(seq_len(nrow(input_df)), input_df$grp)) + + anchor_idx_list <- idx_list[ + IRanges::LogicalList(split(input_df$type == "anchor", input_df$grp)) + ] + + adrift_idx_list <- idx_list[ + IRanges::LogicalList(split(input_df$type == "adrift", input_df$grp)) + ] + + batch_list <- split( + seq_along(idx_list), + ceiling(seq_along(idx_list) / (length(idx_list) / batches)) + ) + + dplyr::bind_rows(lapply(seq_along(batch_list), function(i){ + + print(i) + idxs <- batch_list[[i]] + + # Identify which reads to analyze + x <- names(idx_list)[idxs] + + # Pull in all anchors associated with reads + anchor_aligns <- input_df[unlist(anchor_idx_list[x]),] + + # Pull in all adrift alignments associated with reads + adrift_aligns <- input_df[unlist(adrift_idx_list[x]),] %>% + dplyr::select(grp, "chr.d" = chr, "strand.d" = strand, "pos.d" = pos) + + anc_idx <- IRanges::IntegerList( + split(seq_len(nrow(anchor_aligns)), anchor_aligns$grp) + ) + + adr_idx <- IRanges::IntegerList( + split(seq_len(nrow(adrift_aligns)), adrift_aligns$grp) + ) + + exp_anc_idxs <- unlist(lapply( + seq_along(anc_idx), + function(i) rep(anc_idx[[i]], each = length(adr_idx[[i]])) + )) + + adrift_aligns[ + unlist(unname(adr_idx[rep(names(anc_idx), lengths(anc_idx))])), + ] %>% + dplyr::mutate( + chr.n =
anchor_aligns$chr[exp_anc_idxs], + strand.n = anchor_aligns$strand[exp_anc_idxs], + pos.n = anchor_aligns$pos[exp_anc_idxs] + ) %>% + dplyr::filter( + # Filter for opposite strands + strand.n != strand.d, + # Filter for correct size window + ifelse(strand.n == "+", pos.d - pos.n, pos.n - pos.d) >= minLen, + ifelse(strand.n == "+", pos.d - pos.n, pos.n - pos.d) <= maxLen, + # Filter for same chromosome + chr.n == chr.d + ) %>% + dplyr::mutate( + start = ifelse(strand.n == "+", pos.n, pos.d), + end = ifelse(strand.n == "+", pos.d, pos.n) + ) %>% + dplyr::select( + "id" = grp, "chr" = chr.n, "strand" = strand.n, start, end + ) + + })) + + +} + +#' Determine if pair of reads are mapped +#' @param flag numeric or integer vector of flag codes indicating mapping +#' status. This integer will be converted into binary bits and decoded to +#' determine if the flag indicates paired mapping. +#' @description Given flag integer codes, this function returns a logical +#' vector indicating whether both reads of the pair are mapped. If one or both +#' reads are unmapped, then the return is "FALSE". + +pair_is_mapped <- function(flag){ + + # Check if input is in the correct format. + stopifnot( all(is.numeric(flag) | is.integer(flag)) ) + + # Switch flag codes to binary bit matrix + x <- matrix(as.integer(intToBits(flag)), ncol = 32, byrow = TRUE) + + # Flag codes designate 3rd and 4th bits to indicate unmapped read or mate + # As long as both are zero, then the pair of reads are both mapped + rowSums(x[,c(3,4)]) == 0 + +} + +#' Determine if the alignment is for the read or mate +#' @param flag numeric or integer vector of flag codes indicating mapping +#' status. This integer will be converted into binary bits and decoded to +#' determine if the flag indicates read or mate mapping. +#' @param output character vector of length 2, indicating the values to return +#' when the alignment is for the read or the mate, respectively. +#' @description Given flag integer codes, this function returns a logical or +#' character vector to indicate if the alignment is for the read or mate. + +read_or_mate <- function(flag, output = NULL){
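+  # Worked example (editorial note): SAM flag 81 = 1 (paired) + 16 (read on
+  # reverse strand) + 64 (first in pair); intToBits(81)[1:8] gives bits 1..8
+  # as 1,0,0,0,1,0,1,0, so bit 7 is set and this alignment is the read (R1).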
+ # Check if input is in the correct format. + stopifnot( all(is.numeric(flag) | is.integer(flag)) ) + + # Switch flag codes to binary bit matrix + x <- matrix(as.integer(intToBits(flag)), ncol = 32, byrow = TRUE) + + # Flag codes designate the 7th bit to indicate the 1st read and the 8th for the mate + # If the 7th bit is set, the alignment belongs to the read (R1); otherwise to the mate (R2) + if( is.null(output) ){ + return(x[,c(7)] == 1) + }else{ + return(ifelse(x[,c(7)] == 1, output[1], output[2])) + } + +} # Additional parameters ---- # BAM parameters to get from file bam_params <- c( - "qname", "flag", "rname", "strand", "pos", "qwidth", "mapq", "cigar", "isize" + "qname", "flag", "rname", "strand", "pos", "qwidth", "mapq", "cigar" ) # BAM Tags to get from files bam_tags <- c("MD") -# Flag codes -- may need to be updated after reviewing more data -anchor_flags <- args$anchorFlags -adrift_flags <- args$adriftFlags - -if( length(intersect(anchor_flags, adrift_flags)) != 0 ){ - stop("\n Flags specifying anchor and adrift alignments are intersecting.\n") -} - # Import read alignments and filter on input criteria ---- input_hits <- loadBAM( bam = args$bam, bai = args$bai, params = bam_params, tags = bam_tags ) -unkn_flags <- unique(input_hits$flag)[ - !unique(input_hits$flag) %in% c(anchor_flags, adrift_flags) -] - -if( length(unkn_flags) != 0 ){ - - warning(paste0( - " Unknown flags found in alignments: ", - paste(unkn_flags, collapse = " "), - "\n These reads will be binned in with artifactual chimera output if", - "\n specified during input.\n" - )) - -} - # Top of inputs from alignments printHead( input_hits, @@ -521,13 +670,14 @@ if( nrow(input_hits) == 0 ){ ## Initial quality filtering: min percent ID, minimum size, max align start ---- read_hits <- input_hits %>% + dplyr::mutate( + pairMapped = pair_is_mapped(flag), + type = read_or_mate(flag, c("anchor", "adrift")) + ) %>% + dplyr::filter(pairMapped) %>% dplyr::mutate( clip5p = cntClipped(cigar), - pctID = calcPctID(cigar, MD), - type = ifelse( - flag %in% anchor_flags, "anchor", ifelse( - flag %in% adrift_flags, "adrift", NA) - ) + pctID = calcPctID(cigar, MD) ) %>% dplyr::filter( pctID >= args$minPercentIdentity, @@ -535,12 +685,26 @@ read_hits <- input_hits %>% clip5p <= args$maxAlignStart ) +read_wo_pairs_after_init_filter <- read_hits %>% + dplyr::group_by(qname) %>% + dplyr::summarise( + anchors = sum(type == "anchor"), + adrifts = sum(type == "adrift") + ) %>% + dplyr::filter(anchors == 0 | adrifts == 0) %>% + dplyr::pull(qname) + +read_hits <- dplyr::filter( + read_hits, !qname %in% read_wo_pairs_after_init_filter +) + # Stop if there are no remaining alignments -if( nrow(read_hits) == 0 ){ +if( nrow(read_hits) == 0 | dplyr::n_distinct(read_hits$type) == 1 ){ cat( "\nNo valid alignments were found within the data given input criteria.\n" ) + writeNullOutput(args) q() @@ -550,7 +714,8 @@ if( nrow(read_hits) == 0 ){ all_valid_aligns <- with( read_hits, processAlignments( - qname, rname, strand, pos, qwidth, type, refGen = ref_genome + qname, rname, strand, pos, qwidth, type, + refGen = ref_genome, batches = args$batches ) ) %>% dplyr::mutate(