diff --git a/CHANGELOG.rst b/CHANGELOG.rst index f5959baf..318dbb75 100644 --- a/CHANGELOG.rst +++ b/CHANGELOG.rst @@ -1,6 +1,21 @@ Changelog ********* +0.15.0 (2022-05-03) +------------------- + +* Add new optional arguments ``--genes`` and ``--exclude`` to :command:`prepare-depth-of-coverage` command. +* Add new command :command:`slice-bam`. +* Add new command :command:`print-data`. +* Fix typo "statistcs" to "statistics" throughout the package. +* Update :meth:`sdk.utils.simulate_copy_number` method to automatically handle duplicate sample names. +* Improve CNV caller for CYP2A6, CYP2B6, CYP2D6, CYP2E1, GSTM1, SLC22A2, SULT1A1, UGT1A4, UGT2B15, UGT2B17. +* Add new CNV calls for CYP2A6: ``Deletion2Hom``, ``Hybrid5``, ``Hybrid6``, ``PseudogeneDeletion``. +* Add new CNV call for CYP2D6: ``Tandem2F``. +* Add new CNV call for GSTM1: ``Normal,Deletion2``. +* Add new CNV call for SULT1A1: ``Unknown1``. +* Add new CNV call for UGT2B17: ``Deletion,PartialDeletion3``. + 0.14.0 (2022-04-03) ------------------- diff --git a/README.rst b/README.rst index a3fd0334..26b0be65 100644 --- a/README.rst +++ b/README.rst @@ -357,7 +357,7 @@ currently defined semantic types: - ``SampleTable[Results]`` * TSV file for storing various results for each sample. * Requires following metadata: ``Gene``, ``Assembly``, ``SemanticType``. -- ``SampleTable[Statistcs]`` +- ``SampleTable[Statistics]`` * TSV file for storing control gene's various statistics on read depth for each sample. Used for converting target gene's read depth to copy number. * Requires following metadata: ``Control``, ``Assembly``, ``SemanticType``, ``Platform``. - ``VcfFrame[Consolidated]`` @@ -370,11 +370,12 @@ currently defined semantic types: * VcfFrame for storing target gene's phased variant data. * Requires following metadata: ``Platform``, ``Gene``, ``Assembly``, ``SemanticType``, ``Program``. 
-Wroking with archive files +Working with archive files -------------------------- To demonstrate how easy it is to work with PyPGx archive files, below we will -show some examples. First, download an archive: +show some examples. First, download an archive to play with, which has +``SampleTable[Results]`` as semantic type: .. code-block:: text @@ -389,6 +390,14 @@ Let's print its metadata: Assembly=GRCh37 SemanticType=SampleTable[Results] +Now print its main data (but display first sample only): + +.. code-block:: text + + $ pypgx print-data grch37-CYP2D6-results.zip | head -n 2 + Genotype Phenotype Haplotype1 Haplotype2 AlternativePhase VariantData CNV + HG00276_PyPGx *4/*5 Poor Metabolizer *4;*10;*74;*2; *10;*74;*2; ; *4:22-42524947-C-T:0.913;*10:22-42526694-G-A,22-42523943-A-G:1.0,1.0;*74:22-42525821-G-T:1.0;*2:default; DeletionHet + We can unzip it to extract files inside (note that ``tmpcty4c_cr`` is the original folder name): @@ -500,7 +509,7 @@ input data is from whole genome sequencing (WGS) or targeted sequencing This pipeline supports SV detection based on copy number analysis for genes that are known to have SV. Therefore, if the target gene is associated with SV (e.g. CYP2D6) it's strongly recommended to provide a -``CovFrame[DepthOfCoverage]`` file and a ``SampleTable[Statistcs]`` file in +``CovFrame[DepthOfCoverage]`` file and a ``SampleTable[Statistics]`` file in addtion to a VCF file containing SNVs/indels. If the target gene is not associated with SV (e.g. CYP3A5) providing a VCF file alone is enough. You can visit the `Genes `__ page @@ -515,6 +524,9 @@ HaplotypeCaller). See the `Variant caller choice `__ section for detailed discussion on when to use either option. +Check out the `GeT-RM WGS tutorial `__ to see this pipeline in action. + Chip pipeline ------------- @@ -534,6 +546,9 @@ The pipeline currently does not support SV detection. 
Please post a GitHub issue if you want to contribute your development skills and/or data for devising an SV detection algorithm. +Check out the `Coriell Affy tutorial `__ to see this pipeline in action. + Long-read pipeline ------------------ @@ -664,11 +679,13 @@ For getting help on the CLI: prepare-depth-of-coverage Prepare a depth of coverage file for all target genes with SV from BAM files. + print-data Print the main data of specified archive. print-metadata Print the metadata of specified archive. run-chip-pipeline Run genotyping pipeline for chip data. run-long-read-pipeline Run genotyping pipeline for long-read sequencing data. run-ngs-pipeline Run genotyping pipeline for NGS data. + slice-bam Slice BAM file for all genes used by PyPGx. test-cnv-caller Test CNV caller for target gene. train-cnv-caller Train CNV caller for target gene. diff --git a/docs/cli.rst b/docs/cli.rst index 1b22d6c0..e09639ef 100644 --- a/docs/cli.rst +++ b/docs/cli.rst @@ -55,11 +55,13 @@ For getting help on the CLI: prepare-depth-of-coverage Prepare a depth of coverage file for all target genes with SV from BAM files. + print-data Print the main data of specified archive. print-metadata Print the metadata of specified archive. run-chip-pipeline Run genotyping pipeline for chip data. run-long-read-pipeline Run genotyping pipeline for long-read sequencing data. run-ngs-pipeline Run genotyping pipeline for NGS data. + slice-bam Slice BAM file for all genes used by PyPGx. test-cnv-caller Test CNV caller for target gene. train-cnv-caller Train CNV caller for target gene. 
@@ -201,13 +203,13 @@ compute-control-statistics [Example] For the VDR gene from WGS data: $ pypgx compute-control-statistics \ VDR \ - control-statistcs.zip \ + control-statistics.zip \ 1.bam 2.bam [Example] For a custom region from targeted sequencing data: $ pypgx compute-control-statistics \ chr1:100-200 \ - control-statistcs.zip \ + control-statistics.zip \ bam.list \ --bed probes.bed @@ -218,7 +220,7 @@ compute-copy-number $ pypgx compute-copy-number -h usage: pypgx compute-copy-number [-h] [--samples-without-sv TEXT [TEXT ...]] - read-depth control-statistcs copy-number + read-depth control-statistics copy-number Compute copy number from read depth for target gene. @@ -233,7 +235,7 @@ compute-copy-number Positional arguments: read-depth Input archive file with the semantic type CovFrame[ReadDepth]. - control-statistcs Input archive file with the semantic type + control-statistics Input archive file with the semantic type SampleTable[Statistics]. copy-number Output archive file with the semantic type CovFrame[CopyNumber]. @@ -703,6 +705,7 @@ prepare-depth-of-coverage $ pypgx prepare-depth-of-coverage -h usage: pypgx prepare-depth-of-coverage [-h] [--assembly TEXT] [--bed PATH] + [--genes TEXT [TEXT ...]] [--exclude] depth-of-coverage bams [bams ...] Prepare a depth of coverage file for all target genes with SV from BAM files. @@ -713,22 +716,26 @@ prepare-depth-of-coverage have star alleles defined only by SNVs/indels (e.g. CYP3A5). Positional arguments: - depth-of-coverage Output archive file with the semantic type - CovFrame[DepthOfCoverage]. - bams One or more input BAM files. Alternatively, you can - provide a text file (.txt, .tsv, .csv, or .list) - containing one BAM file per line. + depth-of-coverage Output archive file with the semantic type + CovFrame[DepthOfCoverage]. + bams One or more input BAM files. Alternatively, you can + provide a text file (.txt, .tsv, .csv, or .list) + containing one BAM file per line. 
Optional arguments: - -h, --help Show this help message and exit. - --assembly TEXT Reference genome assembly (default: 'GRCh37') - (choices: 'GRCh37', 'GRCh38'). - --bed PATH By default, the input data is assumed to be WGS. If - it's targeted sequencing, you must provide a BED file - to indicate probed regions. Note that the 'chr' prefix - in contig names (e.g. 'chr1' vs. '1') will be - automatically added or removed as necessary to match - the input BAM's contig names. + -h, --help Show this help message and exit. + --assembly TEXT Reference genome assembly (default: 'GRCh37') + (choices: 'GRCh37', 'GRCh38'). + --bed PATH By default, the input data is assumed to be WGS. If + it's targeted sequencing, you must provide a BED file + to indicate probed regions. Note that the 'chr' prefix + in contig names (e.g. 'chr1' vs. '1') will be + automatically added or removed as necessary to match + the input BAM's contig names. + --genes TEXT [TEXT ...] + List of genes to include. + --exclude Exclude specified genes. Ignored when --genes is not + used. [Example] From WGS data: $ pypgx prepare-depth-of-coverage \ @@ -741,6 +748,22 @@ prepare-depth-of-coverage bam.list \ --bed probes.bed +print-data +========== + +.. code-block:: text + + $ pypgx print-data -h + usage: pypgx print-data [-h] input + + Print the main data of specified archive. + + Positional arguments: + input Input archive file. + + Optional arguments: + -h, --help Show this help message and exit. + print-metadata ============== @@ -876,7 +899,7 @@ run-ngs-pipeline CovFrame[DepthOfCoverage]. --control-statistics PATH Archive file with the semantic type - SampleTable[Statistcs]. + SampleTable[Statistics]. --platform TEXT Genotyping platform (default: 'WGS') (choices: 'WGS', 'Targeted') --assembly TEXT Reference genome assembly (default: 'GRCh37') @@ -897,7 +920,7 @@ run-ngs-pipeline Do not plot copy number profile. --do-not-plot-allele-fraction Do not plot allele fraction profile. 
- --cnv-caller PATH Archive file with the semantic type Model[CNV]. By + --cnv-caller PATH Archive file with the semantic type Model[CNV]. By default, a pre-trained CNV caller in the ~/pypgx-bundle directory will be used. @@ -913,7 +936,7 @@ run-ngs-pipeline CYP2D6-pipeline \ --variants variants.vcf.gz \ --depth-of-coverage depth-of-coverage.tsv \ - --control-statistcs control-statistics-VDR.zip + --control-statistics control-statistics-VDR.zip [Example] To genotype the CYP2D6 gene from targeted sequencing data: $ pypgx run-ngs-pipeline \ @@ -921,9 +944,35 @@ run-ngs-pipeline CYP2D6-pipeline \ --variants variants.vcf.gz \ --depth-of-coverage depth-of-coverage.tsv \ - --control-statistcs control-statistics-VDR.zip \ + --control-statistics control-statistics-VDR.zip \ --platform Targeted +slice-bam +========= + +.. code-block:: text + + $ pypgx slice-bam -h + usage: pypgx slice-bam [-h] [--assembly TEXT] [--genes TEXT [TEXT ...]] + [--exclude] + input output + + Slice BAM file for all genes used by PyPGx. + + Positional arguments: + input Input BAM file. It must be already indexed to allow + random access. + output Output BAM file. + + Optional arguments: + -h, --help Show this help message and exit. + --assembly TEXT Reference genome assembly (default: 'GRCh37') + (choices: 'GRCh37', 'GRCh38'). + --genes TEXT [TEXT ...] + List of genes to include. + --exclude Exclude specified genes. Ignored when --genes is not + used. + test-cnv-caller =============== diff --git a/docs/create.py b/docs/create.py index 2fbeefe9..e16dc502 100644 --- a/docs/create.py +++ b/docs/create.py @@ -384,7 +384,7 @@ - ``SampleTable[Results]`` * TSV file for storing various results for each sample. * Requires following metadata: ``Gene``, ``Assembly``, ``SemanticType``. -- ``SampleTable[Statistcs]`` +- ``SampleTable[Statistics]`` * TSV file for storing control gene's various statistics on read depth for each sample. Used for converting target gene's read depth to copy number. 
* Requires following metadata: ``Control``, ``Assembly``, ``SemanticType``, ``Platform``. - ``VcfFrame[Consolidated]`` @@ -397,11 +397,12 @@ * VcfFrame for storing target gene's phased variant data. * Requires following metadata: ``Platform``, ``Gene``, ``Assembly``, ``SemanticType``, ``Program``. -Wroking with archive files +Working with archive files -------------------------- To demonstrate how easy it is to work with PyPGx archive files, below we will -show some examples. First, download an archive: +show some examples. First, download an archive to play with, which has +``SampleTable[Results]`` as semantic type: .. code-block:: text @@ -416,6 +417,14 @@ Assembly=GRCh37 SemanticType=SampleTable[Results] +Now print its main data (but display first sample only): + +.. code-block:: text + + $ pypgx print-data grch37-CYP2D6-results.zip | head -n 2 + Genotype Phenotype Haplotype1 Haplotype2 AlternativePhase VariantData CNV + HG00276_PyPGx *4/*5 Poor Metabolizer *4;*10;*74;*2; *10;*74;*2; ; *4:22-42524947-C-T:0.913;*10:22-42526694-G-A,22-42523943-A-G:1.0,1.0;*74:22-42525821-G-T:1.0;*2:default; DeletionHet + We can unzip it to extract files inside (note that ``tmpcty4c_cr`` is the original folder name): @@ -527,7 +536,7 @@ This pipeline supports SV detection based on copy number analysis for genes that are known to have SV. Therefore, if the target gene is associated with SV (e.g. CYP2D6) it's strongly recommended to provide a -``CovFrame[DepthOfCoverage]`` file and a ``SampleTable[Statistcs]`` file in +``CovFrame[DepthOfCoverage]`` file and a ``SampleTable[Statistics]`` file in addtion to a VCF file containing SNVs/indels. If the target gene is not associated with SV (e.g. CYP3A5) providing a VCF file alone is enough. You can visit the `Genes `__ page @@ -542,6 +551,9 @@ io/en/latest/faq.html#variant-caller-choice>`__ section for detailed discussion on when to use either option. +Check out the `GeT-RM WGS tutorial `__ to see this pipeline in action. 
+ Chip pipeline ------------- @@ -561,6 +573,9 @@ issue if you want to contribute your development skills and/or data for devising an SV detection algorithm. +Check out the `Coriell Affy tutorial `__ to see this pipeline in action. + Long-read pipeline ------------------ diff --git a/docs/genes.rst b/docs/genes.rst index 741871ea..c883c20d 100644 --- a/docs/genes.rst +++ b/docs/genes.rst @@ -533,7 +533,7 @@ Below is a summary table: - `chr4:68640596-68676652 `__ - * - :ref:`genes:UGT2B17` - - + - - ✅ - - @@ -725,6 +725,17 @@ Below is comprehensive summary of SV described from real NGS studies: - - - + * - \*4 + - Deletion2Hom + - \*4/\*4 + - + - :download:`Model ` + - :download:`Profile ` + - :download:`Profile ` + - WGS + - `1KGP `__ + - NA21093 + - * - \*4 - Deletion3Het - \*4/\*9 @@ -812,7 +823,7 @@ Below is comprehensive summary of SV described from real NGS studies: - WGS - `1KGP `__ - NA18516 - - \*34 has axons 1-4 of CYP2A7 origin and axons 5-9 of CYP2A6 origin (breakpoint in intron 4). + - \*34 has exons 1-4 of CYP2A7 origin and exons 5-9 of CYP2A6 origin (breakpoint in intron 4). 
* - - Hybrid4 - Indeterminate @@ -824,6 +835,28 @@ Below is comprehensive summary of SV described from real NGS studies: - `1KGP `__ - NA20515 - + * - + - Hybrid5 + - Indeterminate + - + - :download:`Model ` + - :download:`Profile ` + - :download:`Profile ` + - WGS + - `1KGP `__ + - HG00155 + - + * - + - Hybrid6 + - Indeterminate + - + - :download:`Model ` + - :download:`Profile ` + - :download:`Profile ` + - WGS + - `1KGP `__ + - HG00141 + - * - - PseudogeneDuplication - \*1/\*18 @@ -846,6 +879,17 @@ Below is comprehensive summary of SV described from real NGS studies: - `1KGP `__ - NA20828 - + * - + - PseudogeneDeletion + - Indeterminate + - + - :download:`Model ` + - :download:`Profile ` + - :download:`Profile ` + - WGS + - `1KGP `__ + - HG00625 + - Filtered alleles for CYP2A6 --------------------------- @@ -1169,6 +1213,17 @@ Below is comprehensive summary of SV described from real NGS studies: - - - + * - + - Tandem2F + - Indeterminate + - + - :download:`Model ` + - :download:`Profile ` + - :download:`Profile ` + - WGS + - `1KGP `__ + - HG00458 + - * - \*13+\*1 - Tandem3 - \*1/\*13+\*1 @@ -1780,6 +1835,17 @@ Below is comprehensive summary of SV described from real NGS studies: - `GeT-RM `__ - NA18855 - + * - \*0 + - Normal,Deletion2 + - \*0/\*A + - + - :download:`Model ` + - :download:`Profile ` + - :download:`Profile ` + - WGS + - `GeT-RM `__ + - NA21097 + - * - \*0 - DeletionHom - \*0/\*0 @@ -2231,6 +2297,17 @@ Below is comprehensive summary of SV described from real NGS studies: - `GeT-RM `__ - NA19143 - + * - + - Unknown1 + - Indeterminate + - + - :download:`Model ` + - :download:`Profile ` + - :download:`Profile ` + - WGS + - `GeT-RM `__ + - HG01085 + - TBXAS1 ====== @@ -2554,6 +2631,17 @@ Below is comprehensive summary of SV described from real NGS studies: - `1KGP `__ - NA19189 - + * - \*2 + - Deletion,PartialDeletion3 + - Indeterminate + - + - :download:`Model ` + - :download:`Profile ` + - :download:`Profile ` + - WGS + - `1KGP `__ + - NA21090 + - * - 
- Normal,PartialDeletion3 - Indeterminate diff --git a/docs/tutorials.rst b/docs/tutorials.rst index 307ebd5e..fe296e23 100644 --- a/docs/tutorials.rst +++ b/docs/tutorials.rst @@ -35,10 +35,13 @@ available for download and use from the `European Nucleotide Archive `__. We will be using this WGS dataset throughout the tutorial. -Because downloading the entire WGS dataset is not feasible for most users due -to its file size (i.e. a 30x WGS sample ≈ 90 GB), I have prepared input files -ranging from 2 KB to 17.6 MB, for both GRCh37 and GRCh38. You can download -those from: +Obtaining input files +--------------------- + +Because downloading the entire WGS dataset is probably not feasible for most +users due to its large file size (e.g. a 30x WGS sample ≈ 90 GB), I have prepared +input files ranging from 2 KB to 25.5 MB, for both GRCh37 and GRCh38. You can +easily download these with: .. code-block:: text @@ -51,10 +54,12 @@ those from: $ wget https://raw.githubusercontent.com/sbslee/pypgx-data/main/getrm-wgs-tutorial/grch38-depth-of-coverage.zip $ wget https://raw.githubusercontent.com/sbslee/pypgx-data/main/getrm-wgs-tutorial/grch38-control-statistics-VDR.zip -Please visit the :ref:`readme:Pipelines` page for details on how to generate -the input files. - -Let's look at the metadata for some of these files: +Let's take a look at the metadata for some of these files. If you're not +familiar with what metadata is, please visit `Archive file, semantic type, +and metadata `__. The first one we'll +look at is an archive file with the semantic type +``CovFrame[DepthOfCoverage]``: .. code-block:: text @@ -62,19 +67,139 @@ Let's look at the metadata for some of these files: Assembly=GRCh37 SemanticType=CovFrame[DepthOfCoverage] Platform=WGS + +We can see that the above archive was created using WGS data aligned to GRCh37. +It has the following data structure: + +.. 
code-block:: text + + $ pypgx print-data grch37-depth-of-coverage.zip | head + Chromosome Position NA18519_PyPGx HG01190_PyPGx NA12006_PyPGx NA18484_PyPGx NA07055_PyPGx NA18980_PyPGx NA19213_PyPGx NA12813_PyPGx NA19003_PyPGx NA10831_PyPGx NA18524_PyPGx NA10851_PyPGx NA18966_PyPGx HG00589_PyPGx NA18855_PyPGx NA18544_PyPGx NA18518_PyPGx NA18973_PyPGx NA19143_PyPGx NA18992_PyPGx NA12873_PyPGx NA19207_PyPGx NA18942_PyPGx NA19178_PyPGx NA19789_PyPGx NA19122_PyPGx NA19174_PyPGx NA18868_PyPGx HG00436_PyPGx HG00276_PyPGx NA19239_PyPGx NA19109_PyPGx NA20509_PyPGx NA10854_PyPGx NA19226_PyPGx NA10847_PyPGx NA18552_PyPGx NA18526_PyPGx NA07029_PyPGx NA06991_PyPGx NA11832_PyPGx NA21781_PyPGx NA12145_PyPGx NA19007_PyPGx NA18861_PyPGx NA12156_PyPGx NA18952_PyPGx NA18565_PyPGx NA19920_PyPGx NA12003_PyPGx NA20296_PyPGx NA07019_PyPGx NA07056_PyPGx NA11993_PyPGx NA19147_PyPGx NA19819_PyPGx NA07000_PyPGx NA18540_PyPGx NA19095_PyPGx NA18509_PyPGx NA19917_PyPGx NA18617_PyPGx NA07357_PyPGx NA19176_PyPGx NA18959_PyPGx NA07348_PyPGx NA18564_PyPGx NA19908_PyPGx NA11839_PyPGx NA12717_PyPGx + chr1 110227417 17 0 9 12 12 13 10 0 0 0 0 1 14 10 4 26 7 6 0 0 4 19 8 6 0 15 0 17 20 0 0 15 10 11 0 7 18 0 0 0 0 22 11 0 6 0 0 0 24 17 17 12 19 0 14 0 0 13 15 8 0 24 0 10 + chr1 110227418 17 0 9 12 12 13 10 0 0 0 0 1 14 10 4 26 8 8 0 0 4 19 9 6 0 15 0 18 20 0 0 16 10 11 0 8 18 0 0 0 0 22 11 0 6 0 0 0 24 17 17 12 20 0 14 0 0 13 15 8 0 24 0 10 + chr1 110227419 17 0 10 12 12 13 10 0 0 0 0 1 14 10 4 27 8 8 0 0 5 19 9 6 0 16 0 18 20 0 0 16 11 11 0 8 18 0 0 0 0 22 12 0 6 0 0 0 24 17 17 12 20 0 14 0 0 14 15 8 0 24 0 10 + chr1 110227420 17 0 10 13 13 12 10 0 0 0 0 1 14 10 3 27 8 8 0 0 5 18 9 6 0 15 0 18 19 0 0 16 11 11 0 8 16 0 0 0 0 22 12 0 6 0 0 0 24 19 17 11 19 0 13 0 0 14 15 8 0 23 0 10 + chr1 110227421 17 0 10 13 13 12 10 0 0 0 0 1 13 10 3 27 8 8 0 0 5 18 8 7 0 15 0 19 19 0 0 16 11 11 0 8 15 0 0 0 0 22 12 0 6 0 0 0 25 20 17 11 19 0 13 0 0 15 15 8 0 23 0 10 + chr1 110227422 18 0 10 13 13 12 10 0 0 0 0 1 13 10 
3 27 8 8 0 0 5 18 9 7 0 15 0 19 19 0 0 17 11 11 0 8 15 0 0 0 0 21 12 0 6 0 0 0 25 20 18 11 19 0 13 0 0 16 15 9 0 23 0 10 + chr1 110227423 18 0 10 13 13 12 10 0 0 0 0 1 13 10 3 25 8 8 0 0 5 18 9 7 0 15 0 19 18 0 0 17 11 11 0 9 15 0 0 0 0 21 13 0 6 0 0 0 25 20 18 11 19 0 13 0 0 17 15 9 0 23 0 10 + chr1 110227424 18 0 10 13 13 12 10 0 0 0 0 1 13 10 3 25 8 8 0 0 5 18 9 7 0 15 0 19 18 0 0 17 11 11 0 9 15 0 0 0 0 21 13 0 6 0 0 0 26 20 18 11 19 0 14 0 0 16 15 9 0 23 0 10 + chr1 110227425 19 0 11 13 13 12 10 0 0 0 0 1 13 10 3 25 8 8 0 0 5 18 9 8 0 15 0 20 18 0 0 17 11 11 0 9 15 0 0 0 0 21 13 0 6 0 0 0 26 20 18 13 19 0 15 0 0 16 15 9 0 23 0 10 + +The second one is an archive file with the semantic type +``SampleTable[Statistics]``: + +.. code-block:: text + $ pypgx print-metadata grch38-control-statistics-VDR.zip Control=VDR Assembly=GRCh38 SemanticType=SampleTable[Statistics] Platform=WGS +Note that this archive was created using WGS data aligned to GRCh38 and the +VDR gene as control locus, and has following data structure: + +.. code-block:: text + + $ pypgx print-data grch38-control-statistics-VDR.zip | head + count mean std min 25% 50% 75% max + NA19213_PyPGx 69459.0 40.464317079140216 7.416070659882781 5.0 35.0 40.0 45.0 67.0 + HG00436_PyPGx 69459.0 39.05070617198635 7.041075412533929 3.0 34.0 39.0 44.0 66.0 + NA12006_PyPGx 69459.0 44.49780446018514 7.565078889270334 6.0 39.0 44.0 50.0 73.0 + NA12156_PyPGx 69459.0 39.53788565916584 7.463158820634827 3.0 34.0 39.0 44.0 66.0 + NA12813_PyPGx 69459.0 37.33543529276264 6.920597209929764 7.0 33.0 37.0 42.0 67.0 + NA19207_PyPGx 69459.0 40.59959112570005 7.042408883522744 4.0 36.0 41.0 45.0 63.0 + NA07029_PyPGx 69459.0 38.69389136037086 7.075488283784741 2.0 34.0 39.0 44.0 67.0 + NA18980_PyPGx 69459.0 34.79616752328712 6.685174389736681 1.0 30.0 35.0 39.0 59.0 + NA18973_PyPGx 69459.0 36.43840251083373 7.0885860461926296 3.0 32.0 37.0 41.0 66.0 + +Finally, we'll look at the input VCF. 
Note that it's not an archive file per +se, but we can still peek at its data: + +.. code-block:: text + + $ zcat grch37-variants.vcf.gz | grep "#CHROM" -A 5 + #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA18519_PyPGx HG01190_PyPGx NA12006_PyPGx NA18484_PyPGx NA07055_PyPGx NA18980_PyPGx NA19213_PyPGx NA12813_PyPGx NA19003_PyPGx NA10831_PyPGx NA18524_PyPGx NA10851_PyPGx NA18966_PyPGx HG00589_PyPGx NA18855_PyPGx NA18544_PyPGx NA18518_PyPGx NA18973_PyPGx NA19143_PyPGx NA18992_PyPGx NA12873_PyPGx NA19207_PyPGx NA18942_PyPGx NA19178_PyPGx NA19789_PyPGx NA19122_PyPGx NA19174_PyPGx NA18868_PyPGx HG00436_PyPGx HG00276_PyPGx NA19239_PyPGx NA19109_PyPGx NA20509_PyPGx NA10854_PyPGx NA19226_PyPGx NA10847_PyPGx NA18552_PyPGx NA18526_PyPGx NA07029_PyPGx NA06991_PyPGx NA11832_PyPGx NA21781_PyPGx NA12145_PyPGx NA19007_PyPGx NA18861_PyPGx NA12156_PyPGx NA18952_PyPGx NA18565_PyPGx NA19920_PyPGx NA12003_PyPGx NA20296_PyPGx NA07019_PyPGx NA07056_PyPGx NA11993_PyPGx NA19147_PyPGx NA19819_PyPGx NA07000_PyPGx NA18540_PyPGx NA19095_PyPGx NA18509_PyPGx NA19917_PyPGx NA18617_PyPGx NA07357_PyPGx NA19176_PyPGx NA18959_PyPGx NA07348_PyPGx NA18564_PyPGx NA19908_PyPGx NA11839_PyPGx NA12717_PyPGx + chr1 47261780 . 
T C 235.707 PASS DP=1519;VDB=0.326231;SGB=-40.8249;RPBZ=0.398415;MQBZ=-15.2308;MQSBZ=0.889911;BQBZ=-10.8447;SCBZ=0.105486;FS=0;MQ0F=0;AC=120;AN=140;DP4=205,13,1153,122;MQ=49 GT:PL:AD 0/0:0,57,255:19,0 0/1:204,0,172:10,11 1/1:240,45,0:0,15 0/1:147,0,165:11,10 1/1:246,54,0:0,18 1/1:255,66,0:0,22 0/1:134,0,182:15,9 1/1:255,87,0:0,29 1/1:231,54,0:0,18 1/1:224,57,0:0,19 1/1:248,36,0:0,12 0/1:120,0,176:9,7 1/1:255,54,0:0,18 1/1:198,75,0:0,25 0/1:168,0,127:7,12 1/1:255,57,0:0,19 0/1:105,0,183:9,5 1/1:223,51,0:0,17 1/1:255,63,0:0,21 1/1:255,80,0:1,31 1/1:189,60,0:0,20 0/1:148,0,214:10,12 1/1:191,45,0:0,15 0/1:98,0,175:15,6 1/1:255,69,0:0,23 0/1:158,0,100:7,16 0/1:161,0,114:5,12 0/1:255,0,138:9,14 1/1:247,81,0:0,27 1/1:227,57,0:0,19 1/1:255,63,0:0,21 1/1:255,69,0:0,23 1/1:255,75,0:0,25 1/1:255,84,0:0,28 0/1:202,0,190:14,15 1/1:224,69,0:0,23 1/1:255,66,0:0,22 1/1:255,63,0:0,21 1/1:255,39,0:0,13 1/1:255,51,0:0,17 1/1:255,72,0:0,24 1/1:231,63,0:0,21 1/1:255,78,0:0,26 1/1:255,75,0:0,25 0/1:145,0,227:16,10 1/1:200,72,0:0,24 1/1:205,72,0:0,24 1/1:207,66,0:0,22 0/1:109,0,172:12,8 0/1:174,0,135:9,14 1/1:255,66,0:0,22 1/1:255,45,0:0,15 1/1:249,54,0:0,18 1/1:255,54,0:0,18 1/1:230,72,0:0,24 1/1:247,63,0:0,21 1/1:211,81,0:0,27 1/1:255,54,0:0,18 0/1:167,0,193:13,13 1/1:255,72,0:0,24 0/1:76,0,159:11,4 1/1:236,66,0:0,22 1/1:255,78,0:0,26 1/1:218,45,0:0,15 1/1:255,60,0:0,20 1/1:255,66,0:0,22 1/1:202,78,0:0,26 1/1:255,81,0:0,27 0/1:181,0,176:16,11 1/1:231,33,0:0,11 + chr1 47261821 . 
G A 174.846 PASS DP=1722;VDB=0.413935;SGB=-18.2343;RPBZ=0.238211;MQBZ=-1.89867;MQSBZ=6.49061;BQBZ=1.3413;SCBZ=0.173613;FS=0;MQ0F=0;AC=1;AN=140;DP4=1407,277,14,2;MQ=52 GT:PL:AD 0/0:0,81,255:27,0 0/0:0,84,255:28,0 0/0:0,60,255:20,0 0/0:0,90,239:30,0 0/0:0,60,221:20,0 0/0:0,84,255:28,0 0/0:0,84,241:28,0 0/0:0,81,255:27,0 0/0:0,63,190:21,0 0/1:200,0,127:11,11 0/0:0,63,255:21,0 0/0:0,75,255:25,0 0/0:0,63,255:21,0 0/0:0,63,215:21,0 0/0:0,69,216:23,0 0/0:0,75,255:25,0 0/0:0,54,244:18,0 0/0:0,57,212:19,0 0/0:0,90,255:30,0 0/0:0,96,255:32,0 0/0:0,72,241:24,0 0/0:0,72,223:24,0 0/0:0,54,191:18,0 0/0:0,75,223:25,0 0/0:0,75,255:25,0 0/0:0,90,222:30,0 0/0:0,54,180:18,0 0/0:0,99,255:33,0 0/0:0,93,255:31,0 0/0:0,66,212:22,0 0/0:0,72,255:24,0 0/0:0,75,243:25,0 0/0:0,72,255:24,0 0/0:0,69,255:27,1 0/0:0,102,250:34,0 0/0:0,81,186:27,0 0/0:0,66,255:22,0 0/0:0,72,255:24,0 0/0:0,50,236:21,1 0/0:0,60,255:20,0 0/0:0,75,255:25,0 0/0:0,54,182:18,0 0/0:0,75,255:25,0 0/0:0,78,255:26,0 0/0:0,81,233:27,0 0/0:0,78,153:26,0 0/0:0,75,180:25,0 0/0:0,60,174:20,0 0/0:0,51,189:17,0 0/0:0,84,234:28,0 0/0:0,63,255:21,0 0/0:0,48,210:16,0 0/0:0,63,231:21,0 0/0:0,69,255:23,0 0/0:0,81,252:27,0 0/0:0,69,178:23,0 0/0:0,69,221:23,0 0/0:0,57,255:19,0 0/0:0,75,217:25,0 0/0:0,93,255:31,0 0/0:0,54,231:18,0 0/0:0,96,211:32,0 0/0:0,93,255:31,0 0/0:0,54,211:18,0 0/0:0,66,243:22,0 0/0:0,72,222:24,0 0/0:0,90,236:30,0 0/0:0,78,242:26,0 0/0:0,87,255:29,0 0/0:0,45,255:15,0 + chr1 47261822 . 
A T 232.856 PASS DP=1729;VDB=0.568499;SGB=-11.6626;RPBZ=-0.581723;MQBZ=-14.8734;MQSBZ=6.53808;BQBZ=1.09344;SCBZ=1.03879;FS=0;MQ0F=0;AC=88;AN=140;DP4=544,110,864,174;MQ=52 GT:PL:AD 0/0:0,81,255:27,0 0/1:255,0,226:12,17 1/1:255,60,0:0,20 0/0:0,87,255:29,0 0/0:0,63,255:21,0 0/1:152,0,255:15,11 0/1:182,0,223:17,11 1/1:255,81,0:0,27 0/1:128,0,189:13,8 1/1:255,69,0:0,23 1/1:255,66,0:0,22 0/1:246,0,193:11,14 1/1:255,60,0:0,20 1/1:255,60,0:0,20 0/0:0,66,255:22,0 1/1:255,75,0:0,25 0/0:0,54,255:18,0 1/1:255,54,0:0,18 0/1:209,0,255:19,10 0/1:255,0,255:16,16 1/1:255,72,0:0,24 0/1:145,0,248:15,10 0/1:113,0,170:9,6 0/1:153,0,206:16,8 1/1:255,69,0:0,23 0/0:0,87,255:29,0 0/1:149,0,187:9,10 0/1:255,0,171:12,20 0/1:176,0,255:16,13 0/1:218,0,145:11,13 0/1:221,0,218:14,10 0/1:237,0,184:11,15 1/1:255,72,0:0,24 1/1:255,84,0:0,28 0/1:254,0,194:16,18 1/1:255,75,0:0,25 1/1:255,60,0:0,20 1/1:255,69,0:0,23 0/0:0,69,255:23,0 0/0:0,60,255:20,0 1/1:255,72,0:0,24 1/1:236,54,0:0,18 1/1:255,75,0:0,25 0/1:155,0,255:18,10 0/0:0,81,255:27,0 1/1:212,75,0:0,25 0/1:196,0,133:10,15 0/1:171,0,155:9,11 0/1:105,0,188:10,7 0/1:182,0,219:14,13 1/1:255,63,0:0,21 1/1:255,48,0:0,16 1/1:255,63,0:0,21 1/1:255,72,0:0,24 1/1:255,78,0:0,26 1/1:232,63,0:0,21 0/0:0,66,255:22,0 0/1:150,0,215:10,7 0/1:180,0,178:13,12 0/1:243,0,190:12,18 0/1:106,0,222:11,6 0/1:212,0,193:13,19 1/1:255,87,0:0,29 1/1:255,57,0:0,19 0/1:203,0,189:9,13 1/1:255,69,0:0,23 0/1:233,0,146:9,20 0/0:0,81,255:27,0 0/1:180,0,249:19,9 1/1:255,45,0:0,15 + chr1 47261869 . 
C T 235.707 PASS DP=1863;VDB=0.677143;SGB=5.02317;RPBZ=-2.55997;MQBZ=-8.87433;MQSBZ=3.1481;BQBZ=26.6865;SCBZ=0.647961;FS=0;MQ0F=0;AC=88;AN=140;DP4=522,174,834,311;MQ=56 GT:PL:AD 0/0:0,84,255:28,0 0/1:255,0,194:12,20 1/1:255,69,0:0,23 0/0:0,93,255:31,0 0/0:0,69,255:23,0 0/1:216,0,255:17,11 0/1:218,0,238:14,14 1/1:255,90,0:0,30 0/1:190,0,148:10,9 1/1:255,60,0:0,20 1/1:255,81,0:0,27 0/1:255,0,192:13,13 1/1:255,81,0:0,27 1/1:255,81,0:1,31 0/0:0,65,255:26,1 1/1:255,78,0:0,26 0/0:0,51,255:17,0 1/1:255,63,0:0,21 0/1:240,0,244:17,12 0/1:255,0,255:17,17 1/1:255,69,0:0,23 0/1:186,0,239:16,11 0/1:247,0,255:15,12 0/1:231,0,221:17,11 1/1:255,69,0:0,23 0/0:0,87,255:29,0 0/1:196,0,198:11,11 0/1:255,0,213:16,20 0/1:232,0,238:16,13 0/1:255,0,175:13,15 0/1:223,0,245:22,11 0/1:255,0,255:15,16 1/1:255,81,0:0,27 1/1:255,99,0:0,33 0/1:255,0,209:16,16 1/1:255,87,0:0,29 1/1:255,75,0:0,25 1/1:255,75,0:0,25 0/0:0,66,255:22,0 0/0:0,63,255:21,0 1/1:255,78,0:0,26 1/1:218,54,0:0,18 1/1:255,78,0:0,26 0/1:173,0,255:23,12 0/0:0,72,255:24,0 1/1:255,75,0:0,25 0/1:213,0,168:11,13 0/1:247,0,188:11,12 0/1:195,0,124:6,9 0/1:173,0,205:16,12 1/1:255,66,0:0,22 1/1:255,72,0:0,24 1/1:255,54,0:0,18 1/1:255,93,0:0,31 1/1:255,84,0:0,28 1/1:255,66,0:0,22 0/0:0,48,255:21,1 0/1:190,0,255:13,8 0/1:255,0,173:9,13 0/1:255,0,214:16,18 0/1:202,0,179:12,11 0/1:255,0,218:16,17 1/1:255,84,0:0,28 1/1:255,81,0:0,27 0/1:255,0,111:7,18 1/1:255,69,0:0,23 0/1:255,0,213:13,19 0/0:0,66,255:22,0 0/1:253,0,247:21,13 1/1:255,75,0:0,25 + chr1 47261936 . 
C T 232.857 PASS DP=2179;VDB=0.991573;SGB=71.95;RPBZ=0.621331;MQBZ=0.919674;MQSBZ=-0.0215108;BQBZ=10.1541;SCBZ=0.212854;FS=0;MQ0F=0;AC=17;AN=140;DP4=1145,745,173,83;MQ=59 GT:PL:AD 0/0:0,87,255:29,0 0/0:0,117,255:39,0 0/0:0,72,255:24,0 0/0:0,105,255:35,0 0/1:205,0,189:10,16 0/1:255,0,230:10,15 0/0:0,96,255:32,0 0/0:0,96,255:32,0 0/1:225,0,222:13,12 0/0:0,69,255:23,0 0/0:0,105,255:35,0 0/0:0,78,255:26,0 0/0:0,114,255:38,0 0/0:0,123,255:41,0 0/1:210,0,255:18,10 0/0:0,105,255:35,0 0/0:0,78,255:26,0 0/0:0,90,255:30,0 0/0:0,96,255:32,0 0/0:0,108,255:36,0 0/0:0,84,255:28,0 0/0:0,75,255:25,0 0/1:255,0,255:15,13 0/0:0,93,255:31,0 0/0:0,84,255:28,0 0/0:0,87,255:29,0 0/0:0,81,255:27,0 0/0:0,111,255:37,0 0/1:255,0,183:10,16 0/1:255,0,251:15,17 0/0:0,108,255:36,0 0/0:0,99,255:33,0 0/0:0,102,255:34,0 0/0:0,99,255:33,0 0/0:0,105,255:35,0 0/0:0,117,255:39,0 0/0:0,78,255:26,0 0/0:0,102,255:34,0 1/1:255,75,0:0,25 1/1:255,99,0:0,33 0/0:0,78,255:26,0 0/0:0,66,255:22,0 0/0:0,96,255:32,0 0/0:0,87,255:29,0 0/0:0,81,255:27,0 0/0:0,93,255:31,0 0/1:224,0,252:15,13 0/0:0,96,255:32,0 0/0:0,81,255:27,0 0/0:0,102,255:34,0 0/0:0,87,255:29,0 0/0:0,108,255:36,0 0/0:0,69,255:23,0 0/0:0,96,255:32,0 0/0:0,96,255:32,0 0/0:0,93,255:31,0 1/1:255,99,0:0,33 0/0:0,81,255:27,0 0/0:0,87,255:29,0 0/0:0,102,255:34,0 0/0:0,81,255:27,0 0/1:255,0,255:20,17 0/0:0,93,255:31,0 0/0:0,84,255:28,0 0/1:100,0,255:22,6 0/0:0,87,255:29,0 0/1:255,0,255:24,19 0/0:0,78,255:26,0 0/0:0,102,255:34,0 0/0:0,66,255:22,0 + +At this point, you are ready to move on to the next step: +:ref:`tutorials:Genotyping genes with SV`. + +(Optional) Creating input files +------------------------------- + +Optionally, in case you are interested in creating the above input files on your +own, I have also prepared "mini" BAM files for GRCh37, in which the original +sequencing data from GeT-RM have been sliced to contain only the genes used by +PyPGx: + +.. 
code-block:: text + + $ mkdir grch37-bam + $ wget https://storage.googleapis.com/sbslee-bucket/pypgx/getrm-wgs-tutorial/grch37-bam.list + $ head -n 6 grch37-bam.list + https://storage.googleapis.com/sbslee-bucket/pypgx/getrm-wgs-tutorial/grch37-bam/HG00276_PyPGx.sorted.markdup.recal.bai + https://storage.googleapis.com/sbslee-bucket/pypgx/getrm-wgs-tutorial/grch37-bam/HG00276_PyPGx.sorted.markdup.recal.bam + https://storage.googleapis.com/sbslee-bucket/pypgx/getrm-wgs-tutorial/grch37-bam/HG00436_PyPGx.sorted.markdup.recal.bai + https://storage.googleapis.com/sbslee-bucket/pypgx/getrm-wgs-tutorial/grch37-bam/HG00436_PyPGx.sorted.markdup.recal.bam + https://storage.googleapis.com/sbslee-bucket/pypgx/getrm-wgs-tutorial/grch37-bam/HG00589_PyPGx.sorted.markdup.recal.bai + https://storage.googleapis.com/sbslee-bucket/pypgx/getrm-wgs-tutorial/grch37-bam/HG00589_PyPGx.sorted.markdup.recal.bam + $ wget -i grch37-bam.list -P grch37-bam + +You will also need a reference FASTA file when creating the input VCF: + +.. code-block:: text + + $ wget https://storage.googleapis.com/sbslee-bucket/ref/grch37/genome.fa + $ wget https://storage.googleapis.com/sbslee-bucket/ref/grch37/genome.fa.fai + +Once you are finished downloading the mini BAM files and the reference FASTA +file, you can create the input VCF: + +.. code-block:: text + + $ pypgx create-input-vcf \ + grch37-variants.vcf.gz \ + genome.fa \ + grch37-bam/*.bam + +Note that this step can take some time to run. For example, it takes about 1 +hour to finish on my personal MacBook Air (M1, 2020) with 8 GB of memory. + +Next, we will compute depth of coverage for genes that are known to have SV: + +.. code-block:: text + + $ pypgx prepare-depth-of-coverage \ + grch37-depth-of-coverage.zip \ + grch37-bam/*.bam + +This step should be quick. It finishes in less than 30 seconds on my laptop. + +Finally, we will compute control statistics using the VDR gene as control +locus, which is required when converting read depth to copy number: + +.. 
code-block:: text + + $ pypgx compute-control-statistics \ + VDR \ + grch37-control-statistics-VDR.zip \ + grch37-bam/*.bam + +This step should be quick as well. It finishes in less than 5 seconds with my +laptop. + Genotyping genes with SV ------------------------ The first gene we are going to genotype is CYP2D6, which has almost 150 star alleles including those with SV (e.g. gene deletions, duplications, and hybrids). To this end, we will run PyPGx's next-generation sequencing (NGS) -pipeline: +pipeline (see :ref:`readme:NGS pipeline` for more details): .. code-block:: text @@ -103,7 +228,71 @@ Above will create a number of archive files: In addition to these files, PyPGx will have also created two directories called ``copy-number-profile`` and ``allele-fraction-profile``. -Now let's make sure the genotype results are correct by comparing them with the validation data: +Let's take a look at the results: + +.. code-block:: text + + $ pypgx print-data grch37-CYP2D6-pipeline/results.zip | head + Genotype Phenotype Haplotype1 Haplotype2 AlternativePhase VariantData CNV + HG00589_PyPGx *1/*21 Intermediate Metabolizer *21;*2; *1; ; *21:22-42524213-C-CG:0.378;*1:22-42522613-G-C,22-42523943-A-G:0.645,0.625;*2:default; Normal + NA07019_PyPGx *1/*4 Intermediate Metabolizer *1; *4;*10;*74;*2; ; *4:22-42524947-C-T:0.452;*10:22-42523943-A-G,22-42526694-G-A:1.0,0.448;*74:22-42525821-G-T:0.424;*1:22-42522613-G-C,22-42523943-A-G:0.361,1.0;*2:default; Normal + NA10851_PyPGx *1/*4 Intermediate Metabolizer *1; *4;*10;*74;*2; ; *4:22-42524947-C-T:0.467;*10:22-42523943-A-G,22-42526694-G-A:0.95,0.421;*74:22-42525821-G-T:0.447;*1:22-42522613-G-C,22-42523943-A-G:0.486,0.95;*2:default; Normal + NA18484_PyPGx *1/*17 Normal Metabolizer *1; *17;*2; ; *17:22-42525772-G-A:0.6;*1:22-42522613-G-C,22-42523943-A-G:0.625,0.391;*2:default; Normal + NA12006_PyPGx *4/*41 Intermediate Metabolizer *41;*2; *4;*10;*2; *69; 
*69:22-42526694-G-A,22-42523805-C-T:0.473,0.528;*4:22-42524947-C-T:0.448;*10:22-42523943-A-G,22-42526694-G-A:0.545,0.473;*41:22-42523805-C-T:0.528;*2:default; Normal + HG00436_PyPGx *2x2/*71 Indeterminate *71;*1; *2; ; *71:22-42526669-C-T:0.433;*1:22-42522613-G-C,22-42523943-A-G:0.462,0.353;*2:default; Duplication + NA19213_PyPGx *1/*1 Normal Metabolizer *1; *1; ; *1:22-42522613-G-C,22-42523943-A-G:1.0,1.0; Normal + NA19207_PyPGx *2x2/*10 Normal Metabolizer *10;*2; *2; ; *10:22-42523943-A-G,22-42526694-G-A:0.366,0.25;*2:default; Duplication + NA07029_PyPGx *1/*35 Normal Metabolizer *35;*2; *1; ; *1:22-42522613-G-C,22-42523943-A-G:0.596,0.476;*35:22-42526763-C-T:0.405;*2:default; Normal + +You can read :ref:`readme:Results interpretation` for details on how to +interpret the PyPGx results. + +Next, we can manually inspect SV calls by visualizing copy number and allele +fraction for the CYP2D6 locus (read :ref:`readme:Structural variation +detection` for details). For example, above results indicate that the samples +``HG00589_PyPGx`` and ``HG00436_PyPGx`` have ``Normal`` and ``Duplication`` +as CNV calls, respectively: + +.. list-table:: + :header-rows: 1 + :widths: 10 45 45 + + * - Sample + - Copy Number + - Allele Fraction + * - HG00589_PyPGx + - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/getrm-wgs-tutorial/HG00589-copy-number.png + - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/getrm-wgs-tutorial/HG00589-allele-fraction.png + * - HG00436_PyPGx + - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/getrm-wgs-tutorial/HG00436-copy-number.png + - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/getrm-wgs-tutorial/HG00436-allele-fraction.png + +If you want to prepare publication quality figures, it's strongly recommended +to combine copy number and allele fraction profiles together: + +.. 
code-block:: text + + $ pypgx plot-cn-af \ + grch37-CYP2D6-pipeline/copy-number.zip \ + grch37-CYP2D6-pipeline/imported-variants.zip \ + --samples HG00589_PyPGx HG00436_PyPGx + +.. list-table:: + :header-rows: 1 + :widths: 10 90 + + * - Sample + - Profile + * - HG00589_PyPGx + - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/getrm-wgs-tutorial/HG00589-combined.png + * - HG00436_PyPGx + - .. image:: https://raw.githubusercontent.com/sbslee/pypgx-data/main/getrm-wgs-tutorial/HG00436-combined.png + +Note that above also adds a fitted line on top of each copy number profile to +display what the SV classifier actually "sees". + +Now let's make sure the genotype results are correct by comparing them with +the validation data: .. code-block:: text diff --git a/pypgx/__init__.py b/pypgx/__init__.py index 9214d35f..5b78bf50 100644 --- a/pypgx/__init__.py +++ b/pypgx/__init__.py @@ -53,7 +53,9 @@ predict_alleles, predict_cnv, prepare_depth_of_coverage, + print_data, print_metadata, + slice_bam, test_cnv_caller, train_cnv_caller, ) diff --git a/pypgx/api/core.py b/pypgx/api/core.py index 9e1916b8..90cdc201 100644 --- a/pypgx/api/core.py +++ b/pypgx/api/core.py @@ -1123,7 +1123,7 @@ def predict_phenotype(gene, a, b): gene deletion, duplication, and tandem arrangement. For detailed implementation, please see the `Phenotype prediction - `__ section. Parameters @@ -1199,7 +1199,7 @@ def predict_score(gene, allele): activity score system. For detailed implementation, please see the `Phenotype prediction - `__ section. 
Parameters diff --git a/pypgx/api/data/cnv-table.csv b/pypgx/api/data/cnv-table.csv index be638710..06e409a5 100644 --- a/pypgx/api/data/cnv-table.csv +++ b/pypgx/api/data/cnv-table.csv @@ -14,6 +14,10 @@ CYP2A6,Duplication1 CYP2A6,Duplication2 CYP2A6,Duplication3 CYP2A6,Tandem +CYP2A6,Deletion2Hom +CYP2A6,Hybrid5 +CYP2A6,Hybrid6 +CYP2A6,PseudogeneDeletion CYP2B6,Normal CYP2B6,Hybrid CYP2B6,Duplication @@ -35,6 +39,7 @@ CYP2D6,Unknown1 CYP2D6,Unknown2 CYP2D6,PseudogeneDeletion CYP2D6,PseudogeneDownstreamDel +CYP2D6,Tandem2F CYP2E1,Normal CYP2E1,Duplication1 CYP2E1,Duplication2 @@ -54,6 +59,7 @@ GSTM1,UpstreamDeletionHet GSTM1,"DeletionHet,UpstreamDeletionHet" GSTM1,PartialDuplication GSTM1,"DeletionHet,Deletion2" +GSTM1,"Normal,Deletion2" GSTT1,Normal GSTT1,DeletionHet GSTT1,DeletionHom @@ -68,6 +74,7 @@ SULT1A1,DeletionHom SULT1A1,Duplication SULT1A1,Multiplication1 SULT1A1,Multiplication2 +SULT1A1,Unknown1 UGT1A4,Normal UGT1A4,Intron1DeletionA UGT1A4,Intron1DeletionB @@ -86,3 +93,4 @@ UGT2B17,"Deletion,Deletion" UGT2B17,"Deletion,PartialDeletion1" UGT2B17,"Deletion,PartialDeletion2" UGT2B17,"Normal,PartialDeletion3" +UGT2B17,"Deletion,PartialDeletion3" diff --git a/pypgx/api/genotype.py b/pypgx/api/genotype.py index 2e5a2f4c..33ea825b 100644 --- a/pypgx/api/genotype.py +++ b/pypgx/api/genotype.py @@ -374,7 +374,7 @@ def one_row(self, r): s1, s2 = core.sort_alleles([a1, a2], by='priority', gene=self.gene, assembly=self.assembly) if r.CNV in ['Normal', 'AssumeNormal', 'UpstreamDeletionHet']: result = [a1, a2] - elif r.CNV in ['DeletionHet', 'DeletionHet,UpstreamDeletionHet']: + elif r.CNV in ['DeletionHet', 'DeletionHet,UpstreamDeletionHet', 'Normal,Deletion2']: result = [s1, '*0'] elif r.CNV in ['DeletionHom', 'DeletionHet,Deletion2']: result = ['*0', '*0'] diff --git a/pypgx/api/pipeline.py b/pypgx/api/pipeline.py index c1b3f858..548a4466 100644 --- a/pypgx/api/pipeline.py +++ b/pypgx/api/pipeline.py @@ -262,7 +262,7 @@ def run_ngs_pipeline( 
depth_of_coverage.check_metadata('Assembly', assembly) if control_statistics is None: - raise ValueError('SV detection requires SampleTable[Statistcs]') + raise ValueError('SV detection requires SampleTable[Statistics]') if isinstance(control_statistics, str): control_statistics = sdk.Archive.from_file(control_statistics) diff --git a/pypgx/api/utils.py b/pypgx/api/utils.py index ae3007c8..8f6b1a23 100644 --- a/pypgx/api/utils.py +++ b/pypgx/api/utils.py @@ -8,6 +8,7 @@ import zipfile import subprocess import os +import sys import pickle import warnings @@ -366,7 +367,7 @@ def compute_control_statistics( Returns ------- pypgx.Archive - Archive object with the semantic type SampleTable[Statistcs]. + Archive object with the semantic type SampleTable[Statistics]. """ gene_table = core.load_gene_table() @@ -426,7 +427,7 @@ def compute_copy_number( ---------- read_depth : str or pypgx.Archive Archive file or object with the semantic type CovFrame[ReadDepth]. - control_statistcs : str or pypgx.Archive + control_statistics : str or pypgx.Archive Archive file or object with the semandtic type SampleTable[Statistics]. samples_without_sv : list, optional @@ -1184,7 +1185,7 @@ def predict_cnv(copy_number, cnv_caller=None): return sdk.Archive(metadata, data) def prepare_depth_of_coverage( - bams, assembly='GRCh37', bed=None + bams, assembly='GRCh37', bed=None, genes=None, exclude=False ): """ Prepare a depth of coverage file for all target genes with SV from BAM @@ -1208,6 +1209,10 @@ def prepare_depth_of_coverage( Note that the 'chr' prefix in contig names (e.g. 'chr1' vs. '1') will be automatically added or removed as necessary to match the input BAM's contig names. + genes : list, optional + List of genes to include. + exclude : bool, default: False + Exclude specified genes. Ignored when ``genes=None``. 
Returns ------- @@ -1220,7 +1225,8 @@ def prepare_depth_of_coverage( } regions = create_regions_bed( - merge=True, sv_genes=True, assembly=assembly, + merge=True, sv_genes=True, assembly=assembly, genes=genes, + exclude=exclude ).to_regions() cf = pycov.CovFrame.from_bam(bams, regions=regions, zero=True) @@ -1246,6 +1252,33 @@ def prepare_depth_of_coverage( return sdk.Archive(metadata, cf) +def print_data(input): + """ + Print the main data of specified archive. + + Parameters + ---------- + input : pypgx.Archive + Archive file. + """ + archive = sdk.Archive.from_file(input) + if 'SampleTable' in archive.type: + data = archive.data.to_csv(sep='\t') + elif 'CovFrame' in archive.type: + data = archive.data.to_string() + elif 'VcfFrame' in archive.type: + data = archive.data.to_string() + else: + raise ValueError(f"Data cannot be printed for {archive.type}") + + # https://docs.python.org/3/library/signal.html#note-on-sigpipe + try: + print(data, end='') + except BrokenPipeError: + devnull = os.open(os.devnull, os.O_WRONLY) + os.dup2(devnull, sys.stdout.fileno()) + sys.exit(1) + def print_metadata(input): """ Print the metadata of specified archive. @@ -1260,6 +1293,29 @@ def print_metadata(input): with zf.open(f'{parent}/metadata.txt') as f: print(f.read().decode('utf-8').strip()) +def slice_bam( + input, output, assembly='GRCh37', genes=None, exclude=False +): + """ + Slice BAM file for all genes used by PyPGx. + + Parameters + ---------- + input + Input BAM file. It must be already indexed to allow random access. + output : str + Output BAM file. + assembly : {'GRCh37', 'GRCh38'}, default: 'GRCh37' + Reference genome assembly. + genes : list, optional + List of genes to include. + exclude : bool, default: False + Exclude specified genes. Ignored when ``genes=None``. 
+ """ + bf = create_regions_bed(merge=True, assembly=assembly, + genes=genes, exclude=exclude) + pybam.slice(input, bf, path=output) + def test_cnv_caller( cnv_caller, copy_number, cnv_calls, confusion_matrix=None ): diff --git a/pypgx/cli/compute_control_statistics.py b/pypgx/cli/compute_control_statistics.py index 007499aa..b0257094 100644 --- a/pypgx/cli/compute_control_statistics.py +++ b/pypgx/cli/compute_control_statistics.py @@ -17,13 +17,13 @@ [Example] For the VDR gene from WGS data: $ pypgx {fuc.api.common._script_name()} \\ VDR \\ - control-statistcs.zip \\ + control-statistics.zip \\ 1.bam 2.bam [Example] For a custom region from targeted sequencing data: $ pypgx {fuc.api.common._script_name()} \\ chr1:100-200 \\ - control-statistcs.zip \\ + control-statistics.zip \\ bam.list \\ --bed probes.bed """ diff --git a/pypgx/cli/compute_copy_number.py b/pypgx/cli/compute_copy_number.py index 9b084fab..892412ff 100644 --- a/pypgx/cli/compute_copy_number.py +++ b/pypgx/cli/compute_copy_number.py @@ -33,8 +33,8 @@ def create_parser(subparsers): CovFrame[ReadDepth].""" ) parser.add_argument( - 'control_statistcs', - metavar='control-statistcs', + 'control_statistics', + metavar='control-statistics', help= """Input archive file with the semantic type SampleTable[Statistics].""" @@ -56,7 +56,7 @@ def create_parser(subparsers): def main(args): result = utils.compute_copy_number( - args.read_depth, args.control_statistcs, + args.read_depth, args.control_statistics, samples_without_sv=args.samples_without_sv ) result.to_file(args.copy_number) diff --git a/pypgx/cli/prepare_depth_of_coverage.py b/pypgx/cli/prepare_depth_of_coverage.py index bf066162..987be753 100644 --- a/pypgx/cli/prepare_depth_of_coverage.py +++ b/pypgx/cli/prepare_depth_of_coverage.py @@ -71,9 +71,24 @@ def create_parser(subparsers): automatically added or removed as necessary to match the input BAM's contig names.""" ) + parser.add_argument( + '--genes', + metavar='TEXT', + nargs='+', + help= 
+"""List of genes to include.""" + ) + parser.add_argument( + '--exclude', + action='store_true', + help= +"""Exclude specified genes. Ignored when --genes is not +used.""" + ) def main(args): archive = utils.prepare_depth_of_coverage( - args.bams, assembly=args.assembly, bed=args.bed + args.bams, assembly=args.assembly, bed=args.bed, genes=args.genes, + exclude=args.exclude ) archive.to_file(args.depth_of_coverage) diff --git a/pypgx/cli/print_data.py b/pypgx/cli/print_data.py new file mode 100644 index 00000000..70ba9478 --- /dev/null +++ b/pypgx/cli/print_data.py @@ -0,0 +1,27 @@ +import sys + +from ..api import utils + +import fuc +import pysam + +description = f""" +Print the main data of specified archive. +""" + +def create_parser(subparsers): + parser = fuc.api.common._add_parser( + subparsers, + fuc.api.common._script_name(), + description=description, + help= +"""Print the main data of specified archive.""" + ) + parser.add_argument( + 'input', + help= +"""Input archive file.""" + ) + +def main(args): + utils.print_data(args.input) diff --git a/pypgx/cli/run_ngs_pipeline.py b/pypgx/cli/run_ngs_pipeline.py index 03273333..2adf18df 100644 --- a/pypgx/cli/run_ngs_pipeline.py +++ b/pypgx/cli/run_ngs_pipeline.py @@ -26,7 +26,7 @@ CYP2D6-pipeline \\ --variants variants.vcf.gz \\ --depth-of-coverage depth-of-coverage.tsv \\ - --control-statistcs control-statistics-VDR.zip + --control-statistics control-statistics-VDR.zip [Example] To genotype the CYP2D6 gene from targeted sequencing data: $ pypgx {fuc.api.common._script_name()} \\ @@ -34,7 +34,7 @@ CYP2D6-pipeline \\ --variants variants.vcf.gz \\ --depth-of-coverage depth-of-coverage.tsv \\ - --control-statistcs control-statistics-VDR.zip \\ + --control-statistics control-statistics-VDR.zip \\ --platform Targeted """ @@ -78,7 +78,7 @@ def create_parser(subparsers): metavar='PATH', help= """Archive file with the semantic type -SampleTable[Statistcs].""" +SampleTable[Statistics].""" ) parser.add_argument( 
'--platform', @@ -150,7 +150,7 @@ def create_parser(subparsers): '--cnv-caller', metavar='PATH', help= -"""Archive file with the semantic type Model[CNV]. By +"""Archive file with the semantic type Model[CNV]. By default, a pre-trained CNV caller in the ~/pypgx-bundle directory will be used.""" ) diff --git a/pypgx/cli/slice_bam.py b/pypgx/cli/slice_bam.py new file mode 100644 index 00000000..37278cb8 --- /dev/null +++ b/pypgx/cli/slice_bam.py @@ -0,0 +1,58 @@ +import sys + +from ..api import utils + +import fuc +import pysam + +description = f""" +Slice BAM file for all genes used by PyPGx. +""" + +def create_parser(subparsers): + parser = fuc.api.common._add_parser( + subparsers, + fuc.api.common._script_name(), + description=description, + help= +"""Slice BAM file for all genes used by PyPGx.""" + ) + parser.add_argument( + 'input', + help= +"""Input BAM file. It must be already indexed to allow +random access.""" + ) + parser.add_argument( + 'output', + help= +"""Output BAM file.""" + ) + parser.add_argument( + '--assembly', + metavar='TEXT', + default='GRCh37', + help= +"""Reference genome assembly (default: 'GRCh37') +(choices: 'GRCh37', 'GRCh38').""" + ) + parser.add_argument( + '--genes', + metavar='TEXT', + nargs='+', + help= +"""List of genes to include.""" + ) + parser.add_argument( + '--exclude', + action='store_true', + help= +"""Exclude specified genes. 
Ignored when --genes is not +used.""" + ) + +def main(args): + utils.slice_bam( + args.input, args.output, assembly=args.assembly, genes=args.genes, + exclude=args.exclude + ) diff --git a/pypgx/sdk/utils.py b/pypgx/sdk/utils.py index b02a93bc..54ac7c82 100644 --- a/pypgx/sdk/utils.py +++ b/pypgx/sdk/utils.py @@ -314,7 +314,14 @@ def simulate_copy_number( s = data - noise s[data == 0] = 0 s[s < 0] = 0 - target.data.df[f'{sv}_{i+1}'] = s + + j = 1 + name = f'{sv}_{i+j}' + while name in target.data.samples: + j += 1 + name = f'{sv}_{i+j}' + + target.data.df[name] = s return target diff --git a/pypgx/version.py b/pypgx/version.py index ef919940..a842d05a 100644 --- a/pypgx/version.py +++ b/pypgx/version.py @@ -1 +1 @@ -__version__ = '0.14.0' +__version__ = '0.15.0'