From 9985825ede2b5f54702fcf0a01995aebbb741d4f Mon Sep 17 00:00:00 2001 From: marianattestad Date: Wed, 19 Aug 2020 13:30:32 -0700 Subject: [PATCH] Update README with pointers to useful information. * Add a section about what DeepVariant supports in terms of data samples at the top of the documentation. * Add links to a few blog posts on the front page where they are relevant. * Add a section on "How DeepVariant works" with a diagram that includes pileup images. PiperOrigin-RevId: 327496328 --- LICENSE | 2 +- README.md | 69 +++++++++++++----- docs/deepvariant-gvcf-support.md | 3 +- .../DeepVariant-gvcf-sizes-figure.png | Bin .../DeepVariant-workflow-figure.png | Bin docs/images/inference_flow_diagram.svg | 1 + docs/trio-merge-case-study.md | 26 ++++--- 7 files changed, 69 insertions(+), 32 deletions(-) rename docs/{ => images}/DeepVariant-gvcf-sizes-figure.png (100%) rename docs/{ => images}/DeepVariant-workflow-figure.png (100%) create mode 100644 docs/images/inference_flow_diagram.svg diff --git a/LICENSE b/LICENSE index 486de6e8..7f6763f0 100644 --- a/LICENSE +++ b/LICENSE @@ -1,4 +1,4 @@ -Copyright 2017 Google LLC. +Copyright 2020 Google LLC. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: diff --git a/README.md b/README.md index c96e812f..3d942c8d 100644 --- a/README.md +++ b/README.md @@ -4,12 +4,29 @@ [![announcements](https://img.shields.io/badge/announcements-blue)](https://groups.google.com/d/forum/deepvariant-announcements) [![blog](https://img.shields.io/badge/blog-orange)](https://goo.gl/deepvariant) -DeepVariant is an analysis pipeline that uses a deep neural network to call -genetic variants from next-generation DNA sequencing data. 
DeepVariant relies on -[Nucleus](https://github.com/google/nucleus), a library of Python and C++ code -for reading and writing data in common genomics file formats (like SAM and VCF) -designed for painless integration with the -[TensorFlow](https://www.tensorflow.org/) machine learning framework. +DeepVariant is a deep learning-based variant caller that takes aligned reads (in +BAM or CRAM format), produces pileup image tensors from them, classifies each +tensor using a convolutional neural network, and finally reports the results in +a standard VCF or gVCF file. + +DeepVariant supports: + +* Germline variant-calling in diploid organisms. + * For somatic data or any other samples where the genotypes go beyond two + copies of DNA, DeepVariant will not work out of the box because the only + genotypes supported are hom-alt, het, and hom-ref. + * The models included with DeepVariant are only trained on human data. For + other organisms, see the + [blog post on non-human variant-calling](https://google.github.io/deepvariant/posts/2018-12-05-improved-non-human-variant-calling-using-species-specific-deepvariant-models/) + for some possible pitfalls and how to handle them. +* Calling from NGS and long-read sequencing data. + * NGS (Illumina) data for either a + [whole genome](docs/deepvariant-case-study.md) or + [whole exome](docs/deepvariant-exome-case-study.md). + * PacBio HiFi data, see the + [PacBio case study](docs/deepvariant-pacbio-model-case-study.md). + * ONT long-read data by using + [PEPPER-DeepVariant](https://github.com/kishwarshafin/pepper/blob/master/docs/PEPPER_variant_calling.md). ## How to run @@ -30,21 +47,21 @@ docker run \ --num_shards=$(nproc) **This will use all your cores to run make_examples. 
Feel free to change.** ``` -To see all flags you can use, run: -``` -docker run google/deepvariant:"${BIN_VERSION}" --help -``` - +To see all flags you can use, run: `docker run +google/deepvariant:"${BIN_VERSION}" --help` If you're using GPUs, or want to use Singularity instead, see -[Quick Start](docs/deepvariant-quick-start.md) for more details. +[Quick Start](docs/deepvariant-quick-start.md) for more details or see all the +[setup options](#deepvariant_setup) available including solutions on external +platforms. For more information, also see: - * [Full documentation list](docs/README.md) - * [Best practices for multi-sample variant calling with DeepVariant](docs/trio-merge-case-study.md) - * [(Advanced) Training tutorial](docs/deepvariant-training-case-study.md) - +* [Full documentation list](docs/README.md) +* [Detailed usage guide](docs/deepvariant-details.md) with more information on + the input and output file formats and how to work with them. +* [Best practices for multi-sample variant calling with DeepVariant](docs/trio-merge-case-study.md) +* [(Advanced) Training tutorial](docs/deepvariant-training-case-study.md) ## How to cite @@ -69,7 +86,9 @@ doi: https://doi.org/10.1101/2020.02.10.942086 * **High accuracy** - In 2016 DeepVariant won [PrecisionFDA Truth Challenge](https://precision.fda.gov/challenges/truth/results) for best SNP Performance. DeepVariant maintains high accuracy across data - from different sequencing technologies, prep methods, and species. + from different sequencing technologies, prep methods, and species. For + [lower coverage](https://google.github.io/deepvariant/posts/2019-09-10-twenty-is-the-new-thirty-comparing-current-and-historical-wgs-accuracy-across-coverage/), + using DeepVariant makes an especially great difference. 
* **Flexibility** - Out-of-the-box use for [PCR-positive](https://ai.googleblog.com/2018/04/deepvariant-accuracy-improvements-for.html) samples and @@ -94,6 +113,22 @@ doi: https://doi.org/10.1101/2020.02.10.942086 (1): Time estimates do not include mapping. +## How DeepVariant works + +![diagram of stages in DeepVariant](docs/images/inference_flow_diagram.svg) + +For more information on the pileup images and how to read them, please see the +["Looking through DeepVariant's Eyes" blog post](https://google.github.io/deepvariant/posts/2020-02-20-looking-through-deepvariants-eyes/). + +DeepVariant relies on [Nucleus](https://github.com/google/nucleus), a library of +Python and C++ code for reading and writing data in common genomics file formats +(like SAM and VCF) designed for painless integration with the +[TensorFlow](https://www.tensorflow.org/) machine learning framework. Nucleus +was built with DeepVariant in mind and open-sourced separately so it can be used +by anyone in the genomics research community for other projects. See this blog +post on +[Using Nucleus and TensorFlow for DNA Sequencing Error Correction](https://google.github.io/deepvariant/posts/2019-01-31-using-nucleus-and-tensorflow-for-dna-sequencing-error-correction/). + ## DeepVariant Setup ### Prerequisites diff --git a/docs/deepvariant-gvcf-support.md b/docs/deepvariant-gvcf-support.md index 9b27ca92..d28a5f62 100644 --- a/docs/deepvariant-gvcf-support.md +++ b/docs/deepvariant-gvcf-support.md @@ -145,8 +145,7 @@ number of records generated relative to the baseline of a 50x whole genome with `--gvcf_gq_binsize 1`) at different coverage levels, for GQ bins of size 1, 3, 5, and 10. The value of each bar is written in blue font above it for clarity. 
-![gVCF -size](DeepVariant-gvcf-sizes-figure.png?raw=true "DeepVariant gVCF sizes") +![gVCF size](images/DeepVariant-gvcf-sizes-figure.png?raw=true "DeepVariant gVCF sizes") ### Runtime diff --git a/docs/DeepVariant-gvcf-sizes-figure.png b/docs/images/DeepVariant-gvcf-sizes-figure.png similarity index 100% rename from docs/DeepVariant-gvcf-sizes-figure.png rename to docs/images/DeepVariant-gvcf-sizes-figure.png diff --git a/docs/DeepVariant-workflow-figure.png b/docs/images/DeepVariant-workflow-figure.png similarity index 100% rename from docs/DeepVariant-workflow-figure.png rename to docs/images/DeepVariant-workflow-figure.png diff --git a/docs/images/inference_flow_diagram.svg b/docs/images/inference_flow_diagram.svg new file mode 100644 index 00000000..67633780 --- /dev/null +++ b/docs/images/inference_flow_diagram.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/trio-merge-case-study.md b/docs/trio-merge-case-study.md index 8d2a8986..df62e638 100644 --- a/docs/trio-merge-case-study.md +++ b/docs/trio-merge-case-study.md @@ -80,10 +80,10 @@ aria2c -c -x10 -s10 -d "${DIR}" https://storage.googleapis.com/deepvariant/exome There have been newer version of the truth files, including [v4.1, GRCh37 for HG002](ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_v4.1_SmallVariantDraftBenchmark_12182019/GRCh37), -and [v4.2, GRCh38 for HG002-4](ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_v4.2_SmallVariantDraftBenchmark_07092020/). +and +[v4.2, GRCh38 for HG002-4](ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_v4.2_SmallVariantDraftBenchmark_07092020/). In the future we will plan to update this documentation with newer versions. - HG002: ``` @@ -188,12 +188,12 @@ When we ran on this WES trio, it took only about 13 seconds. 
For more details on performance, see [GLnexus performance guide](https://github.com/dnanexus-rnd/GLnexus/wiki/Performance). -For a WGS cohort, we recommend using `--config -DeepVariantWGS` instead of `DeepVariantWES`. Another preset -`DeepVariant_unfiltered` is available in `glnexus:v1.2.7` or later versions for -merging DeepVariant gVCFs with no QC filters or genotype revision (see [GitHub -issue #326](https://github.com/google/deepvariant/issues/326) for a potential -use case). The details of these presets can be found +For a WGS cohort, we recommend using `--config DeepVariantWGS` instead of +`DeepVariantWES`. Another preset `DeepVariant_unfiltered` is available in +`glnexus:v1.2.7` or later versions for merging DeepVariant gVCFs with no QC +filters or genotype revision (see +[GitHub issue #326](https://github.com/google/deepvariant/issues/326) for a +potential use case). The details of these presets can be found [here](../deepvariant/cohort_best_practice). ## Annotate the merged VCF with Mendelian discordance information using RTG Tools @@ -275,8 +275,9 @@ do done ``` -| Sample | [3]ts | [4]tv | [5]ts/tv | [6]ts (1st ALT) | [7]tv (1st ALT) | [8]ts/tv (1st ALT) | -| ------ | ----- | ----- | -------- | --------------- | --------------- | ------------------ | +| Sample | [3]ts | [4]tv | [5]ts/tv | [6]ts (1st | [7]tv (1st | [8]ts/tv (1st | +: : : : : ALT) : ALT) : ALT) : +| ------ | ----- | ----- | -------- | ---------- | ---------- | ------------- | | HG002 | 30016 | 11709 | 2.56 | 30002 | 11693 | 2.57 | | HG003 | 29880 | 11747 | 2.54 | 29871 | 11731 | 2.55 | | HG004 | 30133 | 11860 | 2.54 | 30120 | 11848 | 2.54 | @@ -296,8 +297,9 @@ done Which resulted in this table: -| Sample | [3]ts | [4]tv | [5]ts/tv | [6]ts (1st ALT) | [7]tv (1st ALT) | [8]ts/tv (1st ALT) | -| ------ | ----- | ----- | -------- | --------------- | --------------- | ------------------ | +| Sample | [3]ts | [4]tv | [5]ts/tv | [6]ts (1st | [7]tv (1st | [8]ts/tv (1st | +: : : : : ALT) : ALT) : 
ALT) : +| ------ | ----- | ----- | -------- | ---------- | ---------- | ------------- | | HG002 | 24474 | 9255 | 2.64 | 24469 | 9245 | 2.65 | | HG003 | 24175 | 9182 | 2.63 | 24172 | 9174 | 2.63 | | HG004 | 24313 | 9334 | 2.60 | 24306 | 9327 | 2.61 |