Merge pull request #58 from sbslee/0.15.0-dev

0.15.0 dev
sbslee · May 3, 2022 · 0c8c33a
2 parents bbac2c7 + 83ef1fa
commit 0c8c33a

Showing 20 changed files with 608 additions and 62 deletions.
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -1,6 +1,21 @@
 Changelog
 *********
 
+0.15.0 (2022-05-03)
+-------------------
+
+* Add new optional arguments ``--genes`` and ``--exclude`` to :command:`prepare-depth-of-coverage` command.
+* Add new command :command:`slice-bam`.
+* Add new command :command:`print-data`.
+* Fix typo "statistcs" to "statistics" throughout the package.
+* Update :meth:`sdk.utils.simulate_copy_number` method to automatically handle duplicate sample names.
+* Improve CNV caller for CYP2A6, CYP2B6, CYP2D6, CYP2E1, GSTM1, SLC22A2, SULT1A1, UGT1A4, UGT2B15, UGT2B17.
+* Add new CNV calls for CYP2A6: ``Deletion2Hom``, ``Hybrid5``, ``Hybrid6``, ``PseudogeneDeletion``.
+* Add new CNV call for CYP2D6: ``Tandem2F``.
+* Add new CNV call for GSTM1: ``Normal,Deletion2``.
+* Add new CNV call for SULT1A1: ``Unknown1``.
+* Add new CNV call for UGT2B17: ``Deletion,PartialDeletion3``.
+
 0.14.0 (2022-04-03)
 -------------------
 

diff --git a/README.rst b/README.rst
@@ -357,7 +357,7 @@ currently defined semantic types:
 - ``SampleTable[Results]``
     * TSV file for storing various results for each sample.
     * Requires following metadata: ``Gene``, ``Assembly``, ``SemanticType``.
-- ``SampleTable[Statistcs]``
+- ``SampleTable[Statistics]``
     * TSV file for storing control gene's various statistics on read depth for each sample. Used for converting target gene's read depth to copy number.
     * Requires following metadata: ``Control``, ``Assembly``, ``SemanticType``, ``Platform``.
 - ``VcfFrame[Consolidated]``
@@ -370,11 +370,12 @@ currently defined semantic types:
     * VcfFrame for storing target gene's phased variant data.
     * Requires following metadata: ``Platform``, ``Gene``, ``Assembly``, ``SemanticType``, ``Program``.
 
-Wroking with archive files
+Working with archive files
 --------------------------
 
 To demonstrate how easy it is to work with PyPGx archive files, below we will
-show some examples. First, download an archive:
+show some examples. First, download an archive to play with, which has
+``SampleTable[Results]`` as semantic type:
 
 .. code-block:: text
 
@@ -389,6 +390,14 @@ Let's print its metadata:
     Assembly=GRCh37
     SemanticType=SampleTable[Results]
 
+Now print its main data (but display first sample only):
+
+.. code-block:: text
+
+    $ pypgx print-data grch37-CYP2D6-results.zip | head -n 2
+    	Genotype	Phenotype	Haplotype1	Haplotype2	AlternativePhase	VariantData	CNV
+    HG00276_PyPGx	*4/*5	Poor Metabolizer	*4;*10;*74;*2;	*10;*74;*2;	;	*4:22-42524947-C-T:0.913;*10:22-42526694-G-A,22-42523943-A-G:1.0,1.0;*74:22-42525821-G-T:1.0;*2:default;	DeletionHet
+
 We can unzip it to extract files inside (note that ``tmpcty4c_cr`` is the
 original folder name):
 
@@ -500,7 +509,7 @@ input data is from whole genome sequencing (WGS) or targeted sequencing
 This pipeline supports SV detection based on copy number analysis for genes
 that are known to have SV. Therefore, if the target gene is associated with
 SV (e.g. CYP2D6) it's strongly recommended to provide a
-``CovFrame[DepthOfCoverage]`` file and a ``SampleTable[Statistcs]`` file in
+``CovFrame[DepthOfCoverage]`` file and a ``SampleTable[Statistics]`` file in
 addtion to a VCF file containing SNVs/indels. If the target gene is not
 associated with SV (e.g. CYP3A5) providing a VCF file alone is enough. You can
 visit the `Genes <https://pypgx.readthedocs.io/en/latest/genes.html>`__ page
@@ -515,6 +524,9 @@ HaplotypeCaller). See the `Variant caller choice <https://pypgx.readthedocs.
 io/en/latest/faq.html#variant-caller-choice>`__ section for detailed
 discussion on when to use either option.
 
+Check out the `GeT-RM WGS tutorial <https://pypgx.readthedocs.io/en/latest/
+tutorials.html#get-rm-wgs-tutorial>`__ to see this pipeline in action.
+
 Chip pipeline
 -------------
 
@@ -534,6 +546,9 @@ The pipeline currently does not support SV detection. Please post a GitHub
 issue if you want to contribute your development skills and/or data for
 devising an SV detection algorithm.
 
+Check out the `Coriell Affy tutorial <https://pypgx.readthedocs.io/en/latest/
+tutorials.html#coriell-affy-tutorial>`__ to see this pipeline in action.
+
 Long-read pipeline
 ------------------
 
@@ -664,11 +679,13 @@ For getting help on the CLI:
        prepare-depth-of-coverage
                            Prepare a depth of coverage file for all target
                            genes with SV from BAM files.
+       print-data          Print the main data of specified archive.
        print-metadata      Print the metadata of specified archive.
        run-chip-pipeline   Run genotyping pipeline for chip data.
        run-long-read-pipeline
                            Run genotyping pipeline for long-read sequencing data.
        run-ngs-pipeline    Run genotyping pipeline for NGS data.
+       slice-bam           Slice BAM file for all genes used by PyPGx.
        test-cnv-caller     Test CNV caller for target gene.
        train-cnv-caller    Train CNV caller for target gene.
    

diff --git a/docs/cli.rst b/docs/cli.rst
@@ -55,11 +55,13 @@ For getting help on the CLI:
        prepare-depth-of-coverage
                            Prepare a depth of coverage file for all target
                            genes with SV from BAM files.
+       print-data          Print the main data of specified archive.
        print-metadata      Print the metadata of specified archive.
        run-chip-pipeline   Run genotyping pipeline for chip data.
        run-long-read-pipeline
                            Run genotyping pipeline for long-read sequencing data.
        run-ngs-pipeline    Run genotyping pipeline for NGS data.
+       slice-bam           Slice BAM file for all genes used by PyPGx.
        test-cnv-caller     Test CNV caller for target gene.
        train-cnv-caller    Train CNV caller for target gene.
    
@@ -201,13 +203,13 @@ compute-control-statistics
    [Example] For the VDR gene from WGS data:
      $ pypgx compute-control-statistics \
      VDR \
-     control-statistcs.zip \
+     control-statistics.zip \
      1.bam 2.bam
    
    [Example] For a custom region from targeted sequencing data:
      $ pypgx compute-control-statistics \
      chr1:100-200 \
-     control-statistcs.zip \
+     control-statistics.zip \
      bam.list \
      --bed probes.bed
 
@@ -218,7 +220,7 @@ compute-copy-number
 
    $ pypgx compute-copy-number -h
    usage: pypgx compute-copy-number [-h] [--samples-without-sv TEXT [TEXT ...]]
-                                    read-depth control-statistcs copy-number
+                                    read-depth control-statistics copy-number
    
    Compute copy number from read depth for target gene.
    
@@ -233,7 +235,7 @@ compute-copy-number
    Positional arguments:
      read-depth            Input archive file with the semantic type
                            CovFrame[ReadDepth].
-     control-statistcs     Input archive file with the semantic type
+     control-statistics    Input archive file with the semantic type
                            SampleTable[Statistics].
      copy-number           Output archive file with the semantic type
                            CovFrame[CopyNumber].
@@ -703,6 +705,7 @@ prepare-depth-of-coverage
 
    $ pypgx prepare-depth-of-coverage -h
    usage: pypgx prepare-depth-of-coverage [-h] [--assembly TEXT] [--bed PATH]
+                                          [--genes TEXT [TEXT ...]] [--exclude]
                                           depth-of-coverage bams [bams ...]
    
    Prepare a depth of coverage file for all target genes with SV from BAM files.
@@ -713,22 +716,26 @@ prepare-depth-of-coverage
    have star alleles defined only by SNVs/indels (e.g. CYP3A5).
    
    Positional arguments:
-     depth-of-coverage  Output archive file with the semantic type
-                        CovFrame[DepthOfCoverage].
-     bams               One or more input BAM files. Alternatively, you can
-                        provide a text file (.txt, .tsv, .csv, or .list)
-                        containing one BAM file per line.
+     depth-of-coverage     Output archive file with the semantic type
+                           CovFrame[DepthOfCoverage].
+     bams                  One or more input BAM files. Alternatively, you can
+                           provide a text file (.txt, .tsv, .csv, or .list)
+                           containing one BAM file per line.
    
    Optional arguments:
-     -h, --help         Show this help message and exit.
-     --assembly TEXT    Reference genome assembly (default: 'GRCh37')
-                        (choices: 'GRCh37', 'GRCh38').
-     --bed PATH         By default, the input data is assumed to be WGS. If
-                        it's targeted sequencing, you must provide a BED file
-                        to indicate probed regions. Note that the 'chr' prefix
-                        in contig names (e.g. 'chr1' vs. '1') will be
-                        automatically added or removed as necessary to match
-                        the input BAM's contig names.
+     -h, --help            Show this help message and exit.
+     --assembly TEXT       Reference genome assembly (default: 'GRCh37')
+                           (choices: 'GRCh37', 'GRCh38').
+     --bed PATH            By default, the input data is assumed to be WGS. If
+                           it's targeted sequencing, you must provide a BED file
+                           to indicate probed regions. Note that the 'chr' prefix
+                           in contig names (e.g. 'chr1' vs. '1') will be
+                           automatically added or removed as necessary to match
+                           the input BAM's contig names.
+     --genes TEXT [TEXT ...]
+                           List of genes to include.
+     --exclude             Exclude specified genes. Ignored when --genes is not
+                           used.
    
    [Example] From WGS data:
      $ pypgx prepare-depth-of-coverage \
@@ -741,6 +748,22 @@ prepare-depth-of-coverage
      bam.list \
      --bed probes.bed
 
+print-data
+==========
+
+.. code-block:: text
+
+   $ pypgx print-data -h
+   usage: pypgx print-data [-h] input
+   
+   Print the main data of specified archive.
+   
+   Positional arguments:
+     input       Input archive file.
+   
+   Optional arguments:
+     -h, --help  Show this help message and exit.
+
 print-metadata
 ==============
 
@@ -876,7 +899,7 @@ run-ngs-pipeline
                            CovFrame[DepthOfCoverage].
      --control-statistics PATH
                            Archive file with the semantic type
-                           SampleTable[Statistcs].
+                           SampleTable[Statistics].
      --platform TEXT       Genotyping platform (default: 'WGS') (choices: 'WGS',
                            'Targeted')
      --assembly TEXT       Reference genome assembly (default: 'GRCh37')
@@ -897,7 +920,7 @@ run-ngs-pipeline
                            Do not plot copy number profile.
      --do-not-plot-allele-fraction
                            Do not plot allele fraction profile.
-     --cnv-caller PATH     Archive file with the semantic type Model[CNV]. By 
+     --cnv-caller PATH     Archive file with the semantic type Model[CNV]. By
                            default, a pre-trained CNV caller in the ~/pypgx-bundle
                            directory will be used.
    
@@ -913,17 +936,43 @@ run-ngs-pipeline
      CYP2D6-pipeline \
      --variants variants.vcf.gz \
      --depth-of-coverage depth-of-coverage.tsv \
-     --control-statistcs control-statistics-VDR.zip
+     --control-statistics control-statistics-VDR.zip
    
    [Example] To genotype the CYP2D6 gene from targeted sequencing data:
      $ pypgx run-ngs-pipeline \
      CYP2D6 \
      CYP2D6-pipeline \
      --variants variants.vcf.gz \
      --depth-of-coverage depth-of-coverage.tsv \
-     --control-statistcs control-statistics-VDR.zip \
+     --control-statistics control-statistics-VDR.zip \
      --platform Targeted
 
+slice-bam
+=========
+
+.. code-block:: text
+
+   $ pypgx slice-bam -h
+   usage: pypgx slice-bam [-h] [--assembly TEXT] [--genes TEXT [TEXT ...]]
+                          [--exclude]
+                          input output
+   
+   Slice BAM file for all genes used by PyPGx.
+   
+   Positional arguments:
+     input                 Input BAM file. It must be already indexed to allow
+                           random access.
+     output                Output BAM file.
+   
+   Optional arguments:
+     -h, --help            Show this help message and exit.
+     --assembly TEXT       Reference genome assembly (default: 'GRCh37')
+                           (choices: 'GRCh37', 'GRCh38').
+     --genes TEXT [TEXT ...]
+                           List of genes to include.
+     --exclude             Exclude specified genes. Ignored when --genes is not
+                           used.
+
 test-cnv-caller
 ===============
 

diff --git a/docs/create.py b/docs/create.py
@@ -384,7 +384,7 @@
 - ``SampleTable[Results]``
     * TSV file for storing various results for each sample.
     * Requires following metadata: ``Gene``, ``Assembly``, ``SemanticType``.
-- ``SampleTable[Statistcs]``
+- ``SampleTable[Statistics]``
     * TSV file for storing control gene's various statistics on read depth for each sample. Used for converting target gene's read depth to copy number.
     * Requires following metadata: ``Control``, ``Assembly``, ``SemanticType``, ``Platform``.
 - ``VcfFrame[Consolidated]``
@@ -397,11 +397,12 @@
     * VcfFrame for storing target gene's phased variant data.
     * Requires following metadata: ``Platform``, ``Gene``, ``Assembly``, ``SemanticType``, ``Program``.
 
-Wroking with archive files
+Working with archive files
 --------------------------
 
 To demonstrate how easy it is to work with PyPGx archive files, below we will
-show some examples. First, download an archive:
+show some examples. First, download an archive to play with, which has
+``SampleTable[Results]`` as semantic type:
 
 .. code-block:: text
 
@@ -416,6 +417,14 @@
     Assembly=GRCh37
     SemanticType=SampleTable[Results]
 
+Now print its main data (but display first sample only):
+
+.. code-block:: text
+
+    $ pypgx print-data grch37-CYP2D6-results.zip | head -n 2
+    	Genotype	Phenotype	Haplotype1	Haplotype2	AlternativePhase	VariantData	CNV
+    HG00276_PyPGx	*4/*5	Poor Metabolizer	*4;*10;*74;*2;	*10;*74;*2;	;	*4:22-42524947-C-T:0.913;*10:22-42526694-G-A,22-42523943-A-G:1.0,1.0;*74:22-42525821-G-T:1.0;*2:default;	DeletionHet
+
 We can unzip it to extract files inside (note that ``tmpcty4c_cr`` is the
 original folder name):
 
@@ -527,7 +536,7 @@
 This pipeline supports SV detection based on copy number analysis for genes
 that are known to have SV. Therefore, if the target gene is associated with
 SV (e.g. CYP2D6) it's strongly recommended to provide a
-``CovFrame[DepthOfCoverage]`` file and a ``SampleTable[Statistcs]`` file in
+``CovFrame[DepthOfCoverage]`` file and a ``SampleTable[Statistics]`` file in
 addtion to a VCF file containing SNVs/indels. If the target gene is not
 associated with SV (e.g. CYP3A5) providing a VCF file alone is enough. You can
 visit the `Genes <https://pypgx.readthedocs.io/en/latest/genes.html>`__ page
@@ -542,6 +551,9 @@
 io/en/latest/faq.html#variant-caller-choice>`__ section for detailed
 discussion on when to use either option.
 
+Check out the `GeT-RM WGS tutorial <https://pypgx.readthedocs.io/en/latest/
+tutorials.html#get-rm-wgs-tutorial>`__ to see this pipeline in action.
+
 Chip pipeline
 -------------
 
@@ -561,6 +573,9 @@
 issue if you want to contribute your development skills and/or data for
 devising an SV detection algorithm.
 
+Check out the `Coriell Affy tutorial <https://pypgx.readthedocs.io/en/latest/
+tutorials.html#coriell-affy-tutorial>`__ to see this pipeline in action.
+
 Long-read pipeline
 ------------------