Merge pull request #48 from sbslee/0.13.0-dev

0.13.0 dev
sbslee · Mar 1, 2022 · d8ceee8
2 parents 274cb31 + 6647721
commit d8ceee8

Showing 43 changed files with 2,190 additions and 1,015 deletions.
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -1,6 +1,30 @@
 Changelog
 *********
 
+0.13.0 (2022-03-01)
+-------------------
+
+* Add new genotyping platform, ``LongRead``, to :command:`import-variants` command.
+* Add new command :command:`run-long-read-pipeline`.
+* Remove ``Code`` column from ``cnv-table.csv`` file. From now on, CNV codes will be generated on the fly.
+* Add new method :meth:`api.core.load_cpic_table`.
+* Move following errors from ``api.core`` submodule to ``sdk.utils`` submodule: :class:`AlleleNotFoundError`, :class:`GeneNotFoundError`, :class:`NotTargetGeneError`, :class:`PhenotypeNotFoundError`, :class:`VariantNotFoundError`.
+* Combine optional arguments ``--bam`` and ``--fn`` into single positional argument ``bams`` for following commands: :command:`compute-control-statistics`, :command:`compute-target-depth`, :command:`prepare-depth-of-coverage`.
+* Rename ``output`` argument to ``copy-number`` for :command:`compute-copy-number` command.
+* Rename ``output`` argument to ``read-depth`` for :command:`compute-read-depth` command.
+* Combine optional arguments ``--gene`` and ``--region`` into single positional argument ``gene`` for :command:`compute-control-statistics` command.
+* Deprecate :meth:`sdk.utils.parse_input_bams` method.
+* Update :meth:`api.utils.predict_alleles` method to match ``0.31.0`` version of ``fuc`` package.
+* Fix bug in :command:`filter-samples` command when ``--exclude`` argument is used for archive files with SampleTable type.
+* Remove unnecessary optional argument ``assembly`` from :meth:`api.core.get_ref_allele`.
+* Improve CNV caller for CYP2A6, CYP2B6, CYP2D6, CYP2E1, CYP4F2, GSTM1, SLC22A2, SULT1A1, UGT1A4, UGT2B15, and UGT2B17.
+* Add a new CNV call for CYP2D6: ``PseudogeneDeletion``.
+* In CYP2E1 CNV nomenclature, ``PartialDuplication`` has been renamed to ``PartialDuplicationHet`` and a new CNV call ``PartialDuplicationHom`` has been added. Furthermore, calling algorithm for CYP2E1\*S1 allele has been updated. When partial duplication is present, from now on the algorithm requires only \*7 to call \*S1 instead of both \*7 and \*4.
+* Add a new CNV call for SLC22A2: ``Intron9Deletion,Exon11Deletion``.
+* Add a new CNV call for UGT1A4: ``Intron1PartialDup``.
+* Add new CNV calls for UGT2B15: ``PartialDeletion3`` and ``Deletion``.
+* Add a new CNV call for UGT2B17: ``Deletion,PartialDeletion2``. Additionally, several CNV calls have been renamed: ``Normal`` → ``Normal,Normal``; ``DeletionHet`` → ``Normal,Deletion``; ``DeletionHom`` → ``Deletion,Deletion``; ``PartialDeletionHet`` → ``Deletion,PartialDeletion1``.
+
 0.12.0 (2022-01-29)
 -------------------
 
@@ -21,7 +45,7 @@ Changelog
 * Fix minor bug in :command:`compute-copy-number` command.
 * Update :command:`plot-cn-af` command to check input files more rigorously.
 * Improve CNV caller for CYP2A6, CYP2D6, and SLC22A2.
-* Add new method :meth:`sdk.utils.add_cn_samples` method.
+* Add new method :meth:`sdk.utils.add_cn_samples`.
 * Update :command:`compare-genotypes` command to output CNV comparison results as well.
 * Update :command:`estimate-phase-beagle` command. From now on, the 'chr' prefix in contig names (e.g. 'chr1' vs. '1') will be automatically added or removed as necessary to match the reference VCF’s contig names.
 * Add index files for 1KGP reference haplotype panels.

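The UGT2B17 renames in the 0.13.0 changelog entry above affect any downstream code that still stores the old CNV call names. A minimal sketch of a lookup table for translating them follows. It is not part of PyPGx; the names `UGT2B17_CNV_RENAMES` and `migrate_ugt2b17_cnv` are hypothetical, and only the four old/new pairs come from the changelog.

```python
# Old -> new UGT2B17 CNV call names, copied from the 0.13.0 changelog entry.
# The table and helper below are illustrative only (not a PyPGx API).
UGT2B17_CNV_RENAMES = {
    "Normal": "Normal,Normal",
    "DeletionHet": "Normal,Deletion",
    "DeletionHom": "Deletion,Deletion",
    "PartialDeletionHet": "Deletion,PartialDeletion1",
}


def migrate_ugt2b17_cnv(call):
    """Return the 0.13.0 name for a UGT2B17 CNV call.

    Names that are not in the rename table (including names that are
    already in the new format) are returned unchanged.
    """
    return UGT2B17_CNV_RENAMES.get(call, call)


if __name__ == "__main__":
    for old_name in ["Normal", "DeletionHet", "Deletion,PartialDeletion2"]:
        print(old_name, "->", migrate_ugt2b17_cnv(old_name))
```

Passing unknown names through unchanged keeps the helper safe to run on results that already use the new nomenclature, such as the newly added ``Deletion,PartialDeletion2`` call.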
diff --git a/README.rst b/README.rst
@@ -33,6 +33,9 @@ The package is written in Python, and supports both command line interface
 (CLI) and application programming interface (API) whose documentations are
 available at the `Read the Docs <https://pypgx.readthedocs.io/en/latest/>`_.
 
+PyPGx can be used to predict PGx genotypes and phenotypes using various
+genomic data, including data from next-generation sequencing (NGS), single
+nucleotide polymorphism (SNP) array, and long-read sequencing. Importantly,
 PyPGx is compatible with both of the Genome Reference Consortium Human (GRCh)
 builds, GRCh37 (hg19) and GRCh38 (hg38).
 
@@ -172,7 +175,7 @@ directory in order for PyPGx to correctly access the moved files:
 .. code-block:: text
 
    $ cd ~
-   $ git clone --branch 0.12.0 --depth 1 https://github.com/sbslee/pypgx-bundle
+   $ git clone --branch 0.13.0 --depth 1 https://github.com/sbslee/pypgx-bundle
 
 This is undoubtedly annoying, but absolutely necessary for portability
 reasons because PyPGx has been growing exponentially in file size due to the
@@ -189,35 +192,43 @@ sv>`__ such as gene deletions, duplications, and hybrids. You can visit the
 `Genes <https://pypgx.readthedocs.io/en/latest/genes.html>`__ page to see the
 list of genes with SV.
 
-Some of the SV events can be quite challenging to detect accurately with
-next-generation sequencing (NGS) data due to misalignment of sequence reads
-caused by sequence homology with other gene family members (e.g. CYP2D6 and
-CYP2D7). PyPGx attempts to address this issue by training a `support vector
-machine (SVM) <https://scikit-learn.org/stable/modules/generated/sk
-learn.svm.SVC.html>`__-based multiclass classifier using the `one-vs-rest
-strategy <https://scikit-learn.org/stable/modules/generated/sklearn.multi
-class.OneVsRestClassifier.html>`__ for each gene for each GRCh build. Each
-classifier is trained using copy number profiles of real NGS samples as well
-as simulated ones.
+Some of the SV events can be quite challenging to detect accurately with NGS
+data due to misalignment of sequence reads caused by sequence homology with
+other gene family members (e.g. CYP2D6 and CYP2D7). PyPGx attempts to address
+this issue by training a `support vector machine (SVM) <https://scikit-
+learn.org/stable/modules/generated/sklearn.svm.SVC.html>`__-based multiclass
+classifier using the `one-vs-rest strategy <https://scikit-learn.org/stable
+/modules/generated/sklearn.multiclass.OneVsRestClassifier.html>`__ for each
+gene for each GRCh build. Each classifier is trained using copy number
+profiles of real NGS samples as well as simulated ones.
 
 You can plot copy number profile and allele fraction profile with PyPGx to
 visually inspect SV calls. Below are CYP2D6 examples:
 
 .. list-table::
    :header-rows: 1
-   :widths: 20 80
+   :widths: 10 30 60
 
    * - SV Name
+     - Gene Model
      - Profile
    * - Normal
+     - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-1.png
      - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-8.png
    * - DeletionHet
+     - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-2.png
      - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-1.png
+   * - DeletionHom
+     - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-3.png
+     - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-6.png
    * - Duplication
+     - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-4.png
      - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-2.png
    * - Tandem3
+     - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-11.png
      - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-9.png
    * - Tandem2C
+     - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-10.png
      - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-7.png
 
 GRCh37 vs. GRCh38
@@ -229,10 +240,10 @@ may be tempted to use tools like ``LiftOver`` to convert GRCh37 to GRCh38, or
 vice versa, but deep down you know it's going to be a mess (and please don't
 do this). The good news is, PyPGx supports both of the builds!
 
-In many of the PyPGx actions, you can simply indicate which human genome
-build to use. For example, you can use ``assembly`` for the API and
-``--assembly`` for the CLI. **Note that GRCh37 will always be the default.**
-Below is an example of using the API:
+In many PyPGx actions, you can simply indicate which genome build to use. For
+example, for GRCh38 data you can use ``--assembly GRCh38`` in CLI and
+``assembly='GRCh38'`` in API. **Note that GRCh37 will always be the
+default.** Below is an example of using the API:
 
 .. code:: python3
 
@@ -300,7 +311,7 @@ as pairs of ``=``-separated keys and values (e.g. ``Assembly=GRCh37``):
       - ``CYP2D6``, ``GSTT1``
     * - ``Platform``
       - Genotyping platform.
-      - ``WGS``, ``Targeted``, ``Chip``
+      - ``WGS``, ``Targeted``, ``Chip``, ``LongRead``
     * - ``Program``
       - Name of the phasing program.
       - ``Beagle``, ``SHAPEIT``
@@ -411,16 +422,69 @@ input and outputs a ``SampleTable[Phenotypes]`` file:
 Pipelines
 =========
 
-PyPGx provides two pipelines for performing PGx genotype analysis: NGS pipeline and chip pipeline.
+PyPGx currently provides three pipelines for performing PGx genotype analysis
+of a single gene for one or multiple samples: NGS pipeline, chip pipeline, and
+long-read pipeline. In addition to genotyping, each pipeline will perform
+phenotype prediction based on genotype results. All pipelines are compatible
+with both GRCh37 and GRCh38 (e.g. for GRCh38 use ``--assembly GRCh38`` in CLI
+and ``assembly='GRCh38'`` in API).
 
-**NGS pipeline**
+NGS pipeline
+------------
 
 .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/flowchart-ngs-pipeline.png
 
-**Chip pipeline**
+Implemented as ``pypgx run-ngs-pipeline`` in CLI and
+``pypgx.pipeline.run_ngs_pipeline`` in API, this pipeline is designed for
+processing short-read data (e.g. Illumina). Users must specify whether the
+input data is from whole genome sequencing (WGS) or targeted sequencing
+(custom targeted panel sequencing or whole exome sequencing).
+
+This pipeline supports SV detection based on copy number analysis for genes
+that are known to have SV. Therefore, if the target gene is associated with
+SV (e.g. CYP2D6), it's strongly recommended to provide a
+``CovFrame[DepthOfCoverage]`` file and a ``SampleTable[Statistics]`` file in
+addition to a VCF file containing SNVs/indels. If the target gene is not
+associated with SV (e.g. CYP3A5), providing a VCF file alone is enough. You can
+visit the `Genes <https://pypgx.readthedocs.io/en/latest/genes.html>`__ page
+to see the full list of genes with SV. For details on SV detection algorithm,
+please see the `Structural variation detection <https://pypgx.readthedocs.io/
+en/latest/readme.html#structural-variation-detection>`__ section.
+
+Chip pipeline
+-------------
 
 .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/flowchart-chip-pipeline.png
 
+Implemented as ``pypgx run-chip-pipeline`` in CLI and
+``pypgx.pipeline.run_chip_pipeline`` in API, this pipeline is designed for
+DNA chip data (e.g. Global Screening Array from Illumina). Before feeding the
+input VCF to the pipeline, it's recommended to perform variant imputation
+with a large reference haplotype panel (e.g. `TOPMed Imputation
+Server <https://imputation.biodatacatalyst.nhlbi.nih.gov/>`__).
+Alternatively, it's possible to perform variant imputation with the 1000
+Genomes Project (1KGP) data as reference within PyPGx using ``--impute`` in
+CLI and ``impute=True`` in API.
+
+The pipeline currently does not support SV detection. Please post a GitHub
+issue if you want to contribute your development skills and/or data for
+devising an SV detection algorithm.
+
+Long-read pipeline
+------------------
+
+.. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/flowchart-long-read-pipeline.png
+
+Implemented as ``pypgx run-long-read-pipeline`` in CLI and
+``pypgx.pipeline.run_long_read_pipeline`` in API, this pipeline is designed
+for long-read data (e.g. Pacific Biosciences and Oxford Nanopore
+Technologies). The input VCF must be phased using a read-backed haplotype
+phasing tool such as `WhatsHap <https://github.com/whatshap/whatshap>`__.
+
+The pipeline currently does not support SV detection. Please post a GitHub
+issue if you want to contribute your development skills and/or data for
+devising an SV detection algorithm.
+
 Getting help
 ============
 
@@ -437,50 +501,50 @@ For getting help on the CLI:
    
    positional arguments:
      COMMAND
-       call-genotypes      Call genotypes for the target gene.
-       call-phenotypes     Call phenotypes for the target gene.
-       combine-results     Combine various results for the target gene.
+       call-genotypes      Call genotypes for target gene.
+       call-phenotypes     Call phenotypes for target gene.
+       combine-results     Combine various results for target gene.
        compare-genotypes   Calculate concordance between two genotype results.
        compute-control-statistics
-                           Compute summary statistics for the control gene from 
-                           BAM files.
+                           Compute summary statistics for control gene from BAM
+                           files.
        compute-copy-number
-                           Compute copy number from read depth for the target 
-                           gene.
+                           Compute copy number from read depth for target gene.
        compute-target-depth
-                           Compute read depth for the target gene from BAM files.
+                           Compute read depth for target gene from BAM files.
        create-consolidated-vcf
                            Create a consolidated VCF file.
-       create-regions-bed  Create a BED file which contains all regions used by 
+       create-regions-bed  Create a BED file which contains all regions used by
                            PyPGx.
        estimate-phase-beagle
-                           Estimate haplotype phase of observed variants with 
+                           Estimate haplotype phase of observed variants with
                            the Beagle program.
        filter-samples      Filter Archive file for specified samples.
-       import-read-depth   Import read depth data for the target gene.
-       import-variants     Import variant (SNV/indel) data for the target gene
+       import-read-depth   Import read depth data for target gene.
+       import-variants     Import SNV/indel data for target gene.
        plot-bam-copy-number
                            Plot copy number profile from CovFrame[CopyNumber].
        plot-bam-read-depth
                            Plot read depth profile with BAM data.
-       plot-cn-af          Plot both copy number profile and allele fraction 
+       plot-cn-af          Plot both copy number profile and allele fraction
                            profile in one figure.
        plot-vcf-allele-fraction
                            Plot allele fraction profile with VCF data.
        plot-vcf-read-depth
                            Plot read depth profile with VCF data.
-       predict-alleles     Predict candidate star alleles based on observed 
+       predict-alleles     Predict candidate star alleles based on observed
                            variants.
-       predict-cnv         Predict CNV for the target gene based on copy number 
-                           data.
+       predict-cnv         Predict CNV from copy number data for target gene.
        prepare-depth-of-coverage
-                           Prepare a depth of coverage file for all target 
-                           genes with SV.
+                           Prepare a depth of coverage file for all target
+                           genes with SV from BAM files.
        print-metadata      Print the metadata of specified archive.
-       run-chip-pipeline   Run PyPGx's genotyping pipeline for chip data.
-       run-ngs-pipeline    Run PyPGx's genotyping pipeline for NGS data.
-       test-cnv-caller     Test a CNV caller for the target gene.
-       train-cnv-caller    Train a CNV caller for the target gene.
+       run-chip-pipeline   Run genotyping pipeline for chip data.
+       run-long-read-pipeline
+                           Run genotyping pipeline for long-read sequencing data.
+       run-ngs-pipeline    Run genotyping pipeline for NGS data.
+       test-cnv-caller     Test CNV caller for target gene.
+       train-cnv-caller    Train CNV caller for target gene.
    
    optional arguments:
      -h, --help            Show this help message and exit.