Merge pull request #48 from sbslee/0.13.0-dev
0.13.0 dev
sbslee authored Mar 1, 2022
2 parents 274cb31 + 6647721 commit d8ceee8
Showing 43 changed files with 2,190 additions and 1,015 deletions.
26 changes: 25 additions & 1 deletion CHANGELOG.rst
@@ -1,6 +1,30 @@
Changelog
*********

0.13.0 (2022-03-01)
-------------------

* Add new genotyping platform, ``LongRead``, to :command:`import-variants` command.
* Add new command :command:`run-long-read-pipeline`.
* Remove ``Code`` column from ``cnv-table.csv`` file. From now on, CNV codes will be generated on the fly.
* Add new method :meth:`api.core.load_cpic_table`.
* Move following errors from ``api.core`` submodule to ``sdk.utils`` submodule: :class:`AlleleNotFoundError`, :class:`GeneNotFoundError`, :class:`NotTargetGeneError`, :class:`PhenotypeNotFoundError`, :class:`VariantNotFoundError`.
* Combine optional arguments ``--bam`` and ``--fn`` into single positional argument ``bams`` for following commands: :command:`compute-control-statistics`, :command:`compute-target-depth`, :command:`prepare-depth-of-coverage`.
* Rename ``output`` argument to ``copy-number`` for :command:`compute-copy-number` command.
* Rename ``output`` argument to ``read-depth`` for :command:`compute-read-depth` command.
* Combine optional arguments ``--gene`` and ``--region`` into single positional argument ``gene`` for :command:`compute-control-statistics` command.
* Deprecate :meth:`sdk.utils.parse_input_bams` method.
* Update :meth:`api.utils.predict_alleles` method to match ``0.31.0`` version of ``fuc`` package.
* Fix bug in :command:`filter-samples` command when ``--exclude`` argument is used for archive files with SampleTable type.
* Remove unnecessary optional argument ``assembly`` from :meth:`api.core.get_ref_allele`.
* Improve CNV caller for CYP2A6, CYP2B6, CYP2D6, CYP2E1, CYP4F2, GSTM1, SLC22A2, SULT1A1, UGT1A4, UGT2B15, and UGT2B17.
* Add a new CNV call for CYP2D6: ``PseudogeneDeletion``.
* In CYP2E1 CNV nomenclature, ``PartialDuplication`` has been renamed to ``PartialDuplicationHet`` and a new CNV call ``PartialDuplicationHom`` has been added. Furthermore, calling algorithm for CYP2E1\*S1 allele has been updated. When partial duplication is present, from now on the algorithm requires only \*7 to call \*S1 instead of both \*7 and \*4.
* Add a new CNV call for SLC22A2: ``Intron9Deletion,Exon11Deletion``.
* Add a new CNV call for UGT1A4: ``Intron1PartialDup``.
* Add new CNV calls for UGT2B15: ``PartialDeletion3`` and ``Deletion``.
* Add a new CNV call for UGT2B17: ``Deletion,PartialDeletion2``. Additionally, several CNV calls have been renamed: ``Normal`` → ``Normal,Normal``; ``DeletionHet`` → ``Normal,Deletion``; ``DeletionHom`` → ``Deletion,Deletion``; ``PartialDeletionHet`` → ``Deletion,PartialDeletion1``.
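
Scripts that still hold CNV calls written by 0.12.0 could translate them with a lookup table built from the renames listed above. This is only a sketch: ``RENAMED_CNV_CALLS`` and ``migrate_cnv_call`` are hypothetical names, not part of PyPGx. The table is keyed by gene because an old name such as ``Normal`` is renamed for UGT2B17 only.

```python
# Hypothetical helper, not part of PyPGx: translate CNV call names written
# by 0.12.0 into the names used from 0.13.0 onward.
RENAMED_CNV_CALLS = {
    "CYP2E1": {
        "PartialDuplication": "PartialDuplicationHet",
    },
    "UGT2B17": {
        "Normal": "Normal,Normal",
        "DeletionHet": "Normal,Deletion",
        "DeletionHom": "Deletion,Deletion",
        "PartialDeletionHet": "Deletion,PartialDeletion1",
    },
}


def migrate_cnv_call(gene, call):
    """Return the 0.13.0 name for a CNV call, or the call unchanged."""
    return RENAMED_CNV_CALLS.get(gene, {}).get(call, call)
```

Calls for genes or names that are not in the table pass through untouched, so the helper can be applied to a whole results column without filtering first.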

0.12.0 (2022-01-29)
-------------------

@@ -21,7 +45,7 @@ Changelog
* Fix minor bug in :command:`compute-copy-number` command.
* Update :command:`plot-cn-af` command to check input files more rigorously.
* Improve CNV caller for CYP2A6, CYP2D6, and SLC22A2.
* Add new method :meth:`sdk.utils.add_cn_samples` method.
* Add new method :meth:`sdk.utils.add_cn_samples`.
* Update :command:`compare-genotypes` command to output CNV comparison results as well.
* Update :command:`estimate-phase-beagle` command. From now on, the 'chr' prefix in contig names (e.g. 'chr1' vs. '1') will be automatically added or removed as necessary to match the reference VCF’s contig names.
* Add index files for 1KGP reference haplotype panels.
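
The 'chr' prefix handling described above for :command:`estimate-phase-beagle` can be pictured with a small sketch. This is not the code PyPGx uses; ``match_contig_style`` is a hypothetical helper that only illustrates the add-or-remove rule.

```python
def match_contig_style(contig, reference_contigs):
    """Rename ``contig`` so its 'chr' prefix matches the reference contigs.

    Hypothetical helper for illustration only; not taken from PyPGx.
    """
    reference_has_chr = any(c.startswith("chr") for c in reference_contigs)
    has_chr = contig.startswith("chr")
    if reference_has_chr and not has_chr:
        # Reference uses 'chr1' style, input uses '1' style.
        return "chr" + contig
    if not reference_has_chr and has_chr:
        # Reference uses '1' style, input uses 'chr1' style.
        return contig[len("chr"):]
    return contig
```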
148 changes: 106 additions & 42 deletions README.rst
@@ -33,6 +33,9 @@ The package is written in Python, and supports both command line interface
(CLI) and application programming interface (API) whose documentations are
available at the `Read the Docs <https://pypgx.readthedocs.io/en/latest/>`_.

PyPGx can be used to predict PGx genotypes and phenotypes using various
genomic data, including data from next-generation sequencing (NGS), single
nucleotide polymorphism (SNP) array, and long-read sequencing. Importantly,
PyPGx is compatible with both of the Genome Reference Consortium Human (GRCh)
builds, GRCh37 (hg19) and GRCh38 (hg38).

Expand Down Expand Up @@ -172,7 +175,7 @@ directory in order for PyPGx to correctly access the moved files:
.. code-block:: text
$ cd ~
$ git clone --branch 0.12.0 --depth 1 https://github.com/sbslee/pypgx-bundle
$ git clone --branch 0.13.0 --depth 1 https://github.com/sbslee/pypgx-bundle
This is undoubtedly annoying, but absolutely necessary for portability
reasons because PyPGx has been growing exponentially in file size due to the
@@ -189,35 +192,43 @@ sv>`__ such as gene deletions, duplications, and hybrids. You can visit the
`Genes <https://pypgx.readthedocs.io/en/latest/genes.html>`__ page to see the
list of genes with SV.

Some of the SV events can be quite challenging to detect accurately with
next-generation sequencing (NGS) data due to misalignment of sequence reads
caused by sequence homology with other gene family members (e.g. CYP2D6 and
CYP2D7). PyPGx attempts to address this issue by training a `support vector
machine (SVM) <https://scikit-learn.org/stable/modules/generated/sk
learn.svm.SVC.html>`__-based multiclass classifier using the `one-vs-rest
strategy <https://scikit-learn.org/stable/modules/generated/sklearn.multi
class.OneVsRestClassifier.html>`__ for each gene for each GRCh build. Each
classifier is trained using copy number profiles of real NGS samples as well
as simulated ones.
Some of the SV events can be quite challenging to detect accurately with NGS
data due to misalignment of sequence reads caused by sequence homology with
other gene family members (e.g. CYP2D6 and CYP2D7). PyPGx attempts to address
this issue by training a `support vector machine (SVM) <https://scikit-
learn.org/stable/modules/generated/sklearn.svm.SVC.html>`__-based multiclass
classifier using the `one-vs-rest strategy <https://scikit-learn.org/stable
/modules/generated/sklearn.multiclass.OneVsRestClassifier.html>`__ for each
gene for each GRCh build. Each classifier is trained using copy number
profiles of real NGS samples as well as simulated ones.

You can plot copy number and allele fraction profiles with PyPGx to visually
inspect SV calls. Below are CYP2D6 examples:

.. list-table::
:header-rows: 1
:widths: 20 80
:widths: 10 30 60

* - SV Name
- Gene Model
- Profile
* - Normal
- .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-1.png
- .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-8.png
* - DeletionHet
- .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-2.png
- .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-1.png
* - DeletionHom
- .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-3.png
- .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-6.png
* - Duplication
- .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-4.png
- .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-2.png
* - Tandem3
- .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-11.png
- .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-9.png
* - Tandem2C
- .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-10.png
- .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-7.png

GRCh37 vs. GRCh38
@@ -229,10 +240,10 @@ may be tempted to use tools like ``LiftOver`` to convert GRCh37 to GRCh38, or
vice versa, but deep down you know it's going to be a mess (and please don't
do this). The good news is, PyPGx supports both of the builds!

In many of the PyPGx actions, you can simply indicate which human genome
build to use. For example, you can use ``assembly`` for the API and
``--assembly`` for the CLI. **Note that GRCh37 will always be the default.**
Below is an example of using the API:
In many PyPGx actions, you can simply indicate which genome build to use. For
example, for GRCh38 data you can use ``--assembly GRCh38`` in CLI and
``assembly='GRCh38'`` in API. **Note that GRCh37 will always be the
default.** Below is an example of using the API:

.. code:: python3
Expand Down Expand Up @@ -300,7 +311,7 @@ as pairs of ``=``-separated keys and values (e.g. ``Assembly=GRCh37``):
- ``CYP2D6``, ``GSTT1``
* - ``Platform``
- Genotyping platform.
- ``WGS``, ``Targeted``, ``Chip``
- ``WGS``, ``Targeted``, ``Chip``, ``LongRead``
* - ``Program``
- Name of the phasing program.
- ``Beagle``, ``SHAPEIT``
Expand Down Expand Up @@ -411,16 +422,69 @@ input and outputs a ``SampleTable[Phenotypes]`` file:
Pipelines
=========

PyPGx provides two pipelines for performing PGx genotype analysis: NGS pipeline and chip pipeline.
PyPGx currently provides three pipelines for performing PGx genotype analysis
of a single gene for one or multiple samples: NGS pipeline, chip pipeline, and
long-read pipeline. In addition to genotyping, each pipeline will perform
phenotype prediction based on genotype results. All pipelines are compatible
with both GRCh37 and GRCh38 (e.g. for GRCh38 use ``--assembly GRCh38`` in CLI
and ``assembly='GRCh38'`` in API).

**NGS pipeline**
NGS pipeline
------------

.. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/flowchart-ngs-pipeline.png

**Chip pipeline**
Implemented as ``pypgx run-ngs-pipeline`` in CLI and
``pypgx.pipeline.run_ngs_pipeline`` in API, this pipeline is designed for
processing short-read data (e.g. Illumina). Users must specify whether the
input data is from whole genome sequencing (WGS) or targeted sequencing
(custom targeted panel sequencing or whole exome sequencing).

This pipeline supports SV detection based on copy number analysis for genes
that are known to have SV. Therefore, if the target gene is associated with
SV (e.g. CYP2D6), it's strongly recommended to provide a
``CovFrame[DepthOfCoverage]`` file and a ``SampleTable[Statistics]`` file in
addition to a VCF file containing SNVs/indels. If the target gene is not
associated with SV (e.g. CYP3A5), providing a VCF file alone is enough. You can
visit the `Genes <https://pypgx.readthedocs.io/en/latest/genes.html>`__ page
to see the full list of genes with SV. For details on the SV detection
algorithm, please see the `Structural variation detection <https://pypgx.readthedocs.io/
en/latest/readme.html#structural-variation-detection>`__ section.

Chip pipeline
-------------

.. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/flowchart-chip-pipeline.png

Implemented as ``pypgx run-chip-pipeline`` in CLI and
``pypgx.pipeline.run_chip_pipeline`` in API, this pipeline is designed for
DNA chip data (e.g. Global Screening Array from Illumina). Before feeding the
input VCF to the pipeline, it's recommended to perform variant imputation on
it using a large reference haplotype panel (e.g. `TOPMed Imputation
Server <https://imputation.biodatacatalyst.nhlbi.nih.gov/>`__).
Alternatively, it's possible to perform variant imputation within PyPGx,
using the 1000 Genomes Project (1KGP) data as reference, with ``--impute`` in
CLI and ``impute=True`` in API.

The pipeline currently does not support SV detection. Please post a GitHub
issue if you want to contribute your development skills and/or data for
devising an SV detection algorithm.

Long-read pipeline
------------------

.. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/flowchart-long-read-pipeline.png

Implemented as ``pypgx run-long-read-pipeline`` in CLI and
``pypgx.pipeline.run_long_read_pipeline`` in API, this pipeline is designed
for long-read data (e.g. Pacific Biosciences and Oxford Nanopore
Technologies). The input VCF must be phased using a read-backed haplotype
phasing tool such as `WhatsHap <https://github.com/whatshap/whatshap>`__.

The pipeline currently does not support SV detection. Please post a GitHub
issue if you want to contribute your development skills and/or data for
devising an SV detection algorithm.
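
The three pipelines line up with the ``Platform`` metadata values (``WGS``, ``Targeted``, ``Chip``, ``LongRead``). As a rough sketch of how a wrapper script might pick the CLI command for a dataset, assuming that mapping holds: ``PIPELINE_BY_PLATFORM`` and ``pipeline_command`` are hypothetical names, not part of PyPGx, and the function returns only the command name. The arguments each pipeline expects should be taken from its ``-h`` output.

```python
# Assumed mapping from the Platform metadata value to the CLI pipeline
# command; the command names are the ones given in the sections above.
PIPELINE_BY_PLATFORM = {
    "WGS": "run-ngs-pipeline",
    "Targeted": "run-ngs-pipeline",
    "Chip": "run-chip-pipeline",
    "LongRead": "run-long-read-pipeline",
}


def pipeline_command(platform):
    """Return the pypgx subcommand suited to the given platform."""
    try:
        return PIPELINE_BY_PLATFORM[platform]
    except KeyError:
        # Fail loudly rather than silently defaulting to one pipeline.
        raise ValueError(f"Unknown platform: {platform!r}") from None
```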

Getting help
============

@@ -437,50 +501,50 @@ For getting help on the CLI:
positional arguments:
COMMAND
call-genotypes Call genotypes for the target gene.
call-phenotypes Call phenotypes for the target gene.
combine-results Combine various results for the target gene.
call-genotypes Call genotypes for target gene.
call-phenotypes Call phenotypes for target gene.
combine-results Combine various results for target gene.
compare-genotypes Calculate concordance between two genotype results.
compute-control-statistics
Compute summary statistics for the control gene from
BAM files.
Compute summary statistics for control gene from BAM
files.
compute-copy-number
Compute copy number from read depth for the target
gene.
Compute copy number from read depth for target gene.
compute-target-depth
Compute read depth for the target gene from BAM files.
Compute read depth for target gene from BAM files.
create-consolidated-vcf
Create a consolidated VCF file.
create-regions-bed Create a BED file which contains all regions used by
create-regions-bed Create a BED file which contains all regions used by
PyPGx.
estimate-phase-beagle
Estimate haplotype phase of observed variants with
Estimate haplotype phase of observed variants with
the Beagle program.
filter-samples Filter Archive file for specified samples.
import-read-depth Import read depth data for the target gene.
import-variants Import variant (SNV/indel) data for the target gene
import-read-depth Import read depth data for target gene.
import-variants Import SNV/indel data for target gene.
plot-bam-copy-number
Plot copy number profile from CovFrame[CopyNumber].
plot-bam-read-depth
Plot read depth profile with BAM data.
plot-cn-af Plot both copy number profile and allele fraction
plot-cn-af Plot both copy number profile and allele fraction
profile in one figure.
plot-vcf-allele-fraction
Plot allele fraction profile with VCF data.
plot-vcf-read-depth
Plot read depth profile with VCF data.
predict-alleles Predict candidate star alleles based on observed
predict-alleles Predict candidate star alleles based on observed
variants.
predict-cnv Predict CNV for the target gene based on copy number
data.
predict-cnv Predict CNV from copy number data for target gene.
prepare-depth-of-coverage
Prepare a depth of coverage file for all target
genes with SV.
Prepare a depth of coverage file for all target
genes with SV from BAM files.
print-metadata Print the metadata of specified archive.
run-chip-pipeline Run PyPGx's genotyping pipeline for chip data.
run-ngs-pipeline Run PyPGx's genotyping pipeline for NGS data.
test-cnv-caller Test a CNV caller for the target gene.
train-cnv-caller Train a CNV caller for the target gene.
run-chip-pipeline Run genotyping pipeline for chip data.
run-long-read-pipeline
Run genotyping pipeline for long-read sequencing data.
run-ngs-pipeline Run genotyping pipeline for NGS data.
test-cnv-caller Test CNV caller for target gene.
train-cnv-caller Train CNV caller for target gene.
optional arguments:
-h, --help Show this help message and exit.