Single-end bclconvert samplesheet #29

Merged (20 commits) on Dec 15, 2023
6 changes: 4 additions & 2 deletions .github/workflows/dryrun.yaml
@@ -9,14 +9,16 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          submodules: 'true'
      - uses: docker://snakemake/snakemake:stable
      - name: Dry Run with test data
        run: |
          docker run -h cn0000 -v $PWD:/opt2 -w /opt2 snakemake/snakemake:stable /bin/bash -c \
          "pip install -r requirements.txt;./weave run -s /opt2/.tests/illumnia_demux -o /opt2/.tests/illumnia_demux/dry_run_out --local --dry-run /opt2/.tests/illumnia_demux"
          "source get_submods.sh; pip install -r requirements.txt;./weave run -s /opt2/.tests/illumnia_demux -o /opt2/.tests/illumnia_demux/dry_run_out --local --dry-run /opt2/.tests/illumnia_demux"
      - name: View the pipeline config file
        run: |
          echo "Generated config file for pipeline...." && cat $PWD/.tests/illumnia_demux/dry_run_out/.config/config_job_0.json
          echo "Generated config file for pipeline...." && cat $PWD/.tests/illumnia_demux/dry_run_out/illumnia_demux/.config/config_job_0.json
      - name: Lint Snakefile
        continue-on-error: true
        run: |
1 change: 1 addition & 0 deletions .gitignore
@@ -2,3 +2,4 @@ logs
.tests/illumnia_demux/dry_run_out
.snakemake
site
output
32 changes: 32 additions & 0 deletions .tests/illumnia_demux/singled_end.csv
@@ -0,0 +1,32 @@
[Header],
IEMFileVersion,2.20.0.422
Investigator Name,NA
Date,2023-09-21 14:15:47
Workflow,LabKey Sample Sheet Gen
Application,NextSeq
Instrument Type,NextSeq
Instrument,NB551182
Assay,Production Run 01
Index Adapters,IDT-Ilmn RNA UD Indexes SetB Ligation
Chemistry,NextSeq Mid Output Kit v2.5 (150 cycles) (130M)
Operator,13
,
[Reads],
Read01,148
Index01,10
Index02,10
Read02,
,
[Settings],
,
[Data],
Sample_Project,SampleID,Sample_ID,MIDSet,IndexSet,Lane,I7_Index_ID,I5_Index_ID,Index,Index2,Well,Description
EXP_PROJ_SE,13664,LIB_04942_01,B,,,UDP0156,UDP0156,GCAATATTCA,AATTGGCGCC,D08,TSB_24h_1
EXP_PROJ_SE,13665,LIB_04943_01,B,,,UDP0157,UDP0157,CTAGATTGCG,CGCCATATCT,E08,TSB_24h_2
EXP_PROJ_SE,13666,LIB_04944_01,B,,,UDP0158,UDP0158,CGATGCGGTT,ACCAAGCAGG,F08,TSB_24h_3
EXP_PROJ_SE,13667,LIB_04945_01,B,,,UDP0159,UDP0159,TCCGGACTAG,ATTGTTCGTC,G08,uMT_24h_1
EXP_PROJ_SE,13668,LIB_04946_01,B,,,UDP0160,UDP0160,GTGACGGAGC,TGGACCGCCA,H08,uMT_24h_2
EXP_PROJ_SE,13669,LIB_04947_01,B,,,UDP0161,UDP0161,AATTCCATCT,GTAACTGAAG,A09,uMT_24h_3
EXP_PROJ_SE,13670,LIB_04948_01,B,,,UDP0162,UDP0162,TTAACGGTGT,ACGGTCAGGA,B09,Sepi_24h_1
EXP_PROJ_SE,13671,LIB_04949_01,B,,,UDP0163,UDP0163,ACTTGTTATC,TCTAGGCGCG,C09,Sepi_24h_2
EXP_PROJ_SE,13672,LIB_04950_01,B,,,UDP0164,UDP0164,CGTGTACCAG,AACTTATCCT,D09,Sepi_24h_3
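
The single-end signal in this new test sheet is the empty `Read02` value under `[Reads]` (compare `Read01,148`). A minimal shell sketch of that check (illustration only; the real parsing presumably lives in the `IllumniaSampleSheet` class this PR introduces, and the exit-code convention here is an assumption):

```bash
# Sketch: a sheet describes a single-end run when Read02 has no cycle count.
# Exit 0 = single-end, 1 = paired-end, 2 = no Read02 entry found.
awk -F',' '
    /^Read02,/ { found = 1; exit (length($2) ? 1 : 0) }
    END        { if (!found) exit 2 }
' .tests/illumnia_demux/singled_end.csv && echo "single-end" || echo "paired-end or unknown"
```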
1 change: 0 additions & 1 deletion INSTALL.md

This file was deleted.

93 changes: 61 additions & 32 deletions README.md
@@ -1,44 +1,73 @@
# Introduction
<div align="center">

<h1>weave 🔬</h1>

**_An awesome metagenomics and metatranscriptomics pipeline_**

This repository was created to contain the demultiplexing, sample sheet generation, and analysis-initialization workflows that exist across NIH.gov infrastructures.
[![tests](https://github.com/OpenOmics/weave/workflows/tests/badge.svg)](https://github.com/OpenOmics/weave/actions/workflows/main.yaml) [![docs](https://github.com/OpenOmics/weave/workflows/docs/badge.svg)](https://github.com/OpenOmics/weave/actions/workflows/docs.yml) [![GitHub issues](https://img.shields.io/github/issues/OpenOmics/weave?color=brightgreen)](https://github.com/OpenOmics/weave/issues) [![GitHub license](https://img.shields.io/github/license/OpenOmics/weave)](https://github.com/OpenOmics/weave/blob/main/LICENSE)

<i>
This is the home of the pipeline, weave. Its long-term goal: to provide accurate quantification, taxonomic classification, and functional profiling of assembled (bacterial and archaeal) metagenomes!
</i>
</div>

# Installation
## Overview
Welcome to weave's documentation! This guide is the main source of documentation for users getting started with [weave](https://github.com/OpenOmics/weave/).

Please make sure to set up `~/.netrc`, or another means of LabKey authentication, with the appropriate server information and file permissions.
The **`./weave`** pipeline is composed of two subcommands to set up and run the pipeline across different systems. Each of the available subcommands performs a different function (a quick sketch of both follows the overview below):

# Operation
__*REQUIREMENT*: manual run execution entry point__
Some developing options for determining how to find run directories:
- `datetime.now()`: query the top directory of the NGS data for directories younger than now minus one day
- placing .lock files as breadcrumbs and walking directories for ones without them
<section align="center" markdown="1" style="display: flex; flex-flow: row wrap; justify-content: space-around;">

Logging would be essential; some ideas about its operation:
- SQLite to catalogue which runs have and have not been analyzed, plus the breadcrumb lock
- a log entry point in the script that shows the last 10 or so entries
- what to store is not entirely nailed down; some obvious meta information:
- run id, directory, run time, exit code, outputs, execution start, execution stop, manual/automatic execution
!!! inline custom-grid-button ""

# Software design & development plan (SDDP)
[<code style="font-size: 1em;">weave <b>run</b></code>](usage/run.md)
Run the weave pipeline with your input files.

The value-add of this software is still unclear: it may be a modular drop-in system or a one-shot temporary solution. It can be as simple as a few lines of bash engineered to do the three things needed for this workflow:
- demultiplex (from bespoke instruments to a modular system with configurations for multiple sequencing platforms)
- generate a sample sheet from this directory and the LIMS (LabKey is the initial use case, but do we support others?)
- trigger the OpenOmics pipelines for analysis

Or it can be as complex as a workflow that supports drop-in configurations for different instruments and clusters, or anything in between those two extremes.
!!! inline custom-grid-button ""

The idea is to begin with a simplistic system built in anticipation of modularity, and the code will be written to embrace that modularity as well as possible, with forward thinking kept in mind. If we don't find utility in this work at a broad scale, then we will accept the simple approach as bespoke and move forward.
[<code style="font-size: 1em;">weave <b>cache</b></code>](usage/cache.md)
Downloads the reference files for the pipeline to a selected directory.

# Requirements
</section>
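
Both subcommands are run from a clone of the repository. A minimal sketch of exploring them (assuming only the standard `--help` flag an argparse-style CLI provides; the linked usage pages are the authoritative reference):

```bash
./weave run --help     # options for demultiplexing + QC of a run directory
./weave cache --help   # options for downloading reference files
```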

Requirements for this software should be minimal: Python 3.8+, snakemake, singularity, a cron daemon, and a user account with crontab access.
**weave** is a two-pronged pipeline: the first prong detects and uses the appropriate Illumina software to demultiplex the ensemble collection of reads into individual samples and convert the sequencing information into the FASTQ file format. From there, the second prong is a distributed, parallelized step that uses a variety of commonly accepted next-generation sequencing tools to report, visualize, and calculate the quality of the reads after sequencing. **weave** makes use of the ubiquitous containerization software [Singularity](https://sylabs.io/)<sup>2</sup> for modularity, and the robust pipelining DSL [Snakemake](https://snakemake.github.io/)<sup>3</sup>.

Python package requirements are listed in `requirements.txt`.
**weave**'s common use is to gauge the quality of reads for potential downstream analysis. Since bioinformatic analysis requires robust and accurate data to draw scientific conclusions, this helps save time and resources when analyzing the voluminous amounts of sequencing data collected routinely.

Several of the applications that **weave** uses to visualize and report quality metrics are listed below (a minimal aggregation sketch follows the list):
- [Kraken](https://github.com/DerrickWood/kraken2)<sup>7</sup>, kmer analysis
> **Review comment (Contributor):** Please also change 71 to 7

- [Kaiju](https://bioinformatics-centre.github.io/kaiju/)<sup>4</sup>, kmer analysis
- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), fastq statistics
- [fastp](https://github.com/OpenGene/fastp)<sup>6</sup>, fastq adapter removal (trimming)
- [FastQ Screen](https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/)<sup>5</sup>, taxonomic quantification
- [MultiQC](https://multiqc.info/)<sup>1</sup>, ensemble QC results
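
Each of these tools emits its own per-sample report; MultiQC is the step that rolls everything into a single page. A minimal sketch of that final aggregation (the directory layout here is an assumption, not weave's actual output structure):

```bash
# aggregate per-tool QC reports (FastQC, fastp, Kraken, etc.) into one HTML report
multiqc path/to/qc_output/ -o path/to/qc_output/multiqc
```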


## Dependencies
**System Requirements:** `singularity>=3.5`
**Python Requirements:** `snakemake>=5.14.0`, `pyyaml`, `progressbar`, `requests`, `terminaltables`, `tabulate`

Please refer to the complete [installation documents](https://openomics.github.io/weave/install/) for detailed information.

## Installation
```bash
# clone repo
git clone https://github.com/OpenOmics/weave.git
cd weave
# create virtual environment
python -m venv ~/.my_venv
# activate environment
source ~/.my_venv/bin/activate
pip install -r requirements.txt
```
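
After installing, a dry run against the bundled test data makes a quick smoke test. This sketch mirrors the command in `.github/workflows/dryrun.yaml` above; the output path is arbitrary:

```bash
# preview the jobs weave would schedule for the bundled test run directory
./weave run -s .tests/illumnia_demux \
    -o .tests/illumnia_demux/dry_run_out \
    --local --dry-run .tests/illumnia_demux
```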


## Contribute
This site is a living document, created for and by members like you. weave is maintained by the members of OpenOmics and is improved by continuous feedback! We encourage you to contribute new content and make improvements to existing content via a pull request to our [GitHub repository](https://github.com/OpenOmics/weave).


## References
<sup>**1.** [Philip Ewels, Måns Magnusson, Sverker Lundin, Max Käller, MultiQC: summarize analysis results for multiple tools and samples in a single report, Bioinformatics, Volume 32, Issue 19, October 2016, Pages 3047–3048.](https://doi.org/10.1093/bioinformatics/btw354)</sup>
<sup>**2.** [Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459.](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0177459)</sup>
<sup>**3.** [Koster, J. and S. Rahmann (2018). "Snakemake-a scalable bioinformatics workflow engine." Bioinformatics 34(20): 3600.](https://academic.oup.com/bioinformatics/article/28/19/2520/290322)</sup>
<sup>**4.** [Menzel P., Ng K.L., Krogh A. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7:11257](http://www.nature.com/ncomms/2016/160413/ncomms11257/full/ncomms11257.html)</sup>
<sup>**5.** [Wingett SW and Andrews S. FastQ Screen: A tool for multi-genome mapping and quality control [version 2; referees: 4 approved]. F1000Research 2018, 7:1338](https://doi.org/10.12688/f1000research.15931.2)</sup>
<sup>**6.** [Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890.](https://doi.org/10.1093/bioinformatics/bty560)</sup>
<sup>**7.** [Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019).](https://doi.org/10.1186/s13059-019-1891-0)</sup>
14 changes: 7 additions & 7 deletions docs/index.md
@@ -77,10 +77,10 @@ If you use this software, please cite it as below:
```

## References
<sup>**1.** [Philip Ewels, Måns Magnusson, Sverker Lundin, Max Käller, MultiQC: summarize analysis results for multiple tools and samples in a single report, Bioinformatics, Volume 32, Issue 19, October 2016, Pages 3047–3048.](https://doi.org/10.1093/bioinformatics/btw354)</sup>
<sup>**2.** [Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459.](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0177459)</sup>
<sup>**3.** [Koster, J. and S. Rahmann (2018). "Snakemake-a scalable bioinformatics workflow engine." Bioinformatics 34(20): 3600.](https://academic.oup.com/bioinformatics/article/28/19/2520/290322)</sup>
<sup>**4.** [Menzel P., Ng K.L., Krogh A. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7:11257](http://www.nature.com/ncomms/2016/160413/ncomms11257/full/ncomms11257.html)</sup>
<sup>**5.** [Wingett SW and Andrews S. FastQ Screen: A tool for multi-genome mapping and quality control [version 2; referees: 4 approved]. F1000Research 2018, 7:1338](https://doi.org/10.12688/f1000research.15931.2)</sup>
<sup>**6.** [Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884–i890.](https://doi.org/10.1093/bioinformatics/bty560)</sup>
<sup>**7.** [Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019).](https://doi.org/10.1186/s13059-019-1891-0)</sup>
File renamed without changes.
5 changes: 4 additions & 1 deletion scripts/config.py
@@ -93,7 +93,8 @@ def get_resource_config():

def base_config(keys=None, qc=True):
    base_keys = ('runs', 'run_ids', 'project', 'rnums', 'bcl_files', \
                 'sample_sheet', 'samples', 'sids', 'out_to', 'demux_input_dir')
                 'sample_sheet', 'samples', 'sids', 'out_to', 'demux_input_dir', \
                 'bclconvert', 'demux_data')
    this_config = {k: [] for k in base_keys}
    this_config['resources'] = get_resource_config()
    this_config['runqc'] = qc
@@ -148,10 +149,12 @@ def get_bigsky_seq_dirs():

DIRECTORY_CONFIGS = {
    "bigsky": {
        "seqroot": "/gs1/RTS/NextGen/SequencerRuns/",
        "seq": get_bigsky_seq_dirs(),
        "profile": Path(Path(__file__).parent.parent, "utils", "profiles", "bigsky").resolve(),
    },
    "biowulf": {
        "seqroot": "/data/RTB_GRS/SequencerRuns/",
        "seq": get_biowulf_seq_dirs(),
        "profile": Path(Path(__file__).parent.parent, "utils", "profiles", "biowulf").resolve(),
    }
54 changes: 34 additions & 20 deletions scripts/files.py
@@ -7,7 +7,7 @@
import xml.etree.ElementTree as ET
from os import access as check_access, R_OK, W_OK
from functools import partial
from .sample_sheet import SampleSheet
from .samplesheet import IllumniaSampleSheet
from .config import get_current_server, LABKEY_CONFIGS, DIRECTORY_CONFIGS


@@ -28,6 +28,14 @@ def get_all_seq_dirs(top_dir, server):
    return _dirs


def check_if_demuxed(data_dir):
    is_demuxed = False
    if Path(data_dir, 'Analysis').exists():
        if list(Path(data_dir, 'Analysis').rglob('*.fastq*')):
            is_demuxed = True
    return is_demuxed


def valid_run_output(output_directory, dry_run=False):
    if dry_run:
        return Path(output_directory).absolute()
@@ -74,10 +82,7 @@ def sniff_samplesheet(ss):
    Given a sample sheet file, return the appropriate function to parse the
    sheet.
    """
    # TODO:
    # catalogue and check for multiple types of sample sheets; so far just
    # the NextSeq, MiniSeq, CellRanger are the only supported formats
    return SampleSheet
    return IllumniaSampleSheet


def parse_samplesheet(ss):
@@ -92,16 +97,17 @@ def is_dir_staged(server, run_dir):
"""
filter check for wheter or not a directory has the appropriate breadcrumbs or not

RTAComplete.txt - file transfer from instrument breadcrumb, CSV file with values:
Run Date, Run time, Instrument ID
"""
this_labkey_project = LABKEY_CONFIGS[server]['container_path']
TRANSFER_BREADCRUMB = 'RTAComplete.txt'
# SS_SHEET_EXISTS = LabKeyServer.runid2samplesheeturl(server, this_labkey_project, run_dir.name)
CopyComplete.txt - file transfer from instrument breadcrumb, blank (won't be there on instruments != NextSeq2k)

RTAComplete.txt - sequencing breadcrumb, CSV file with values:
Run Date, Run time, Instrument ID

RunInfo.xml - XML metainformation (RunID, Tiles, etc)
"""
analyzed_checks = [
Path(run_dir, TRANSFER_BREADCRUMB).exists(),
# SS_SHEET_EXISTS is not None
Path(run_dir, 'RTAComplete.txt').exists(),
Path(run_dir, 'SampleSheet.csv').exists(),
Path(run_dir, 'RunInfo.xml').exists(),
]
return all(analyzed_checks)
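
For anyone checking a run directory by hand, the staging test above reduces to three breadcrumb files, and `check_if_demuxed` to the presence of FASTQs under `Analysis/`. A rough shell equivalent (sketch only; `$run_dir` is a placeholder):

```bash
run_dir=/path/to/run   # placeholder
# is_dir_staged analogue: all three breadcrumbs must exist
for crumb in RTAComplete.txt SampleSheet.csv RunInfo.xml; do
    [ -e "$run_dir/$crumb" ] || { echo "not staged: missing $crumb"; exit 1; }
done
# check_if_demuxed analogue: any FASTQ under Analysis/ means demux already ran
if find "$run_dir/Analysis" -name '*.fastq*' 2>/dev/null | grep -q .; then
    echo "staged and already demultiplexed"
else
    echo "staged, not yet demultiplexed"
fi
```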

@@ -117,7 +123,7 @@ def find_demux_dir(run_dir):

def get_run_directories(runids, seq_dir=None):
    host = get_current_server()
    seq_dirs = Path(seq_dir).absolute() if seq_dir else DIRECTORY_CONFIGS[host]['seq']
    seq_dirs = Path(seq_dir).absolute() if seq_dir else Path(DIRECTORY_CONFIGS[host]['seqroot'])
    seq_contents = [_child for _child in seq_dirs.iterdir()]
    seq_contents_names = [child for child in map(lambda d: d.name, seq_contents)]

@@ -137,16 +143,24 @@ def get_run_directories(runids, seq_dir=None):
    for run_p in run_paths:
        rid = run_p.name
        this_run_info = dict(run_id=rid)
        runinfo_xml = ET.parse(Path(run_p, 'RunInfo.xml').absolute())

        try:
            xml_rid = runinfo_xml.find("Run").attrib['Id']
        except (KeyError, AttributeError):
            xml_rid = None

        if Path(run_p, 'SampleSheet.csv').exists():
            this_run_info['samplesheet'] = parse_samplesheet(Path(run_p, 'SampleSheet.csv').absolute())
        elif Path(run_p, f'SampleSheet_{rid}.csv').exists():
            this_run_info['samplesheet'] = parse_samplesheet(Path(run_p, f'SampleSheet_{rid}.csv').absolute())
        elif xml_rid and Path(run_p, f'SampleSheet_{xml_rid}.csv').exists():
            this_run_info['samplesheet'] = parse_samplesheet(Path(run_p, f'SampleSheet_{xml_rid}.csv').absolute())
        else:
            raise FileNotFoundError(f'Run {rid}({run_p}) does not have a sample sheet.')
        if Path(run_p, 'RunInfo.xml').exists():
            run_xml = ET.parse(Path(run_p, 'RunInfo.xml').absolute()).getroot()
            this_run_info.update({info.tag: info.text for run in run_xml for info in run \
            raise FileNotFoundError(f'Run {rid}({run_p}) does not have a find-able sample sheet.')

        this_run_info.update({info.tag: info.text for run in runinfo_xml.getroot() for info in run \
                              if info.text is not None and info.text.strip() not in ('\n', '')})
        else:
            raise FileNotFoundError(f'Run {rid}({run_p}) does not have a RunInfo.xml file.')
        run_return.append((run_p, this_run_info))

    if invalid_runs: