Section 6: Gene-based Annotation

This section demonstrates how to run our gene-based annotation process for a VCF/mVCF table.

There are two functionalities supported by AnnotationHive regarding gene annotation: 1) finding the closest gene to each variant, and 2) finding all genes that overlap with each variant within an input proximity threshold.
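To make the two modes concrete, here is a minimal Python sketch of the interval logic on toy data. It is not AnnotationHive's implementation, which operates on BigQuery tables. The gene names and coordinates, the function names, and the distance definition (0 when the variant overlaps the gene, otherwise the gap to the nearest gene boundary) are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # inclusive (assumed convention)
    end: int    # inclusive (assumed convention)


def distance(gene: Gene, chrom: str, v_start: int, v_end: int) -> Optional[int]:
    """Gap between a variant and a gene: 0 if they overlap, None if on different chromosomes."""
    if gene.chrom != chrom:
        return None
    if v_end < gene.start:       # variant lies entirely before the gene
        return gene.start - v_end
    if v_start > gene.end:       # variant lies entirely after the gene
        return v_start - gene.end
    return 0                     # intervals overlap


def closest_gene(genes: List[Gene], chrom: str, v_start: int, v_end: int) -> Optional[Gene]:
    """Mode 1: the single gene with the smallest distance to the variant."""
    scored = [(distance(g, chrom, v_start, v_end), g) for g in genes]
    scored = [(d, g) for d, g in scored if d is not None]
    if not scored:
        return None
    return min(scored, key=lambda pair: pair[0])[1]


def genes_within(genes: List[Gene], chrom: str, v_start: int, v_end: int,
                 threshold: int) -> List[Gene]:
    """Mode 2: every gene whose distance to the variant is at most `threshold`."""
    hits = []
    for g in genes:
        d = distance(g, chrom, v_start, v_end)
        if d is not None and d <= threshold:
            hits.append(g)
    return hits


# Hypothetical toy annotation set, not real gene coordinates.
GENES = [
    Gene("GENE_A", "chr17", 1000, 2000),
    Gene("GENE_B", "chr17", 5000, 6000),
    Gene("GENE_C", "chr13", 1000, 2000),
]

print(closest_gene(GENES, "chr17", 2500, 2500).name)
print([g.name for g in genes_within(GENES, "chr17", 2500, 2500, 3000)])
```

With a variant at chr17:2500, mode 1 returns only GENE_A (500 bases away), while mode 2 with a threshold of 3000 returns both GENE_A and GENE_B.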

Finding the closest gene for each variant. Here are the key options:

--geneBasedAnnotation=true
--geneBasedMinAnnotation=true

mvn compile exec:java -Dexec.mainClass=com.google.cloud.genomics.cba.StartAnnotationHiveEngine -Dexec.args="BigQueryAnnotateVariants --projectId=<Your_Google_Cloud_Project_Name> --runner=DataflowRunner --stagingLocation=gs://<Your_Google_Cloud_Bucket_Name>/<Staging_Address>/ --bigQueryDatasetId=<YOUR_BigQuery_Dataset_ID> --genericAnnotationTables=<Table address Plus selected fields> (e.g., myProject:myPublicAnnotationSets.hg19_refGene:name:name2 - selecting name and name2 from hg19_refGene table) --VCFTables=<VCF_Table_Names>(e.g., genomics-public-data:1000_genomes_phase_3.variants_20150220_release) --outputBigQueryTable=<Output_Table_Name> --geneBasedAnnotation=true --geneBasedMinAnnotation=true --searchRegions=<chromID1:Start1:End1,Chrom2,Start2;...;ChromN:StartN:EndN>" -Pdataflow-runner

Here is an example:

mvn compile exec:java -Dexec.mainClass=com.google.cloud.genomics.cba.StartAnnotationHiveEngine -Dexec.args="BigQueryAnnotateVariants --projectId=<Your_Google_Cloud_Project_Name> --runner=DataflowRunner --stagingLocation=gs://<Your_Google_Cloud_Bucket_Name>/staging/ --bigQueryDatasetId=test --genericAnnotationTables=<Your_Google_Cloud_Project_Name>:AnnotationHive.hg19_UCSC_refGene:name:name2 --geneBasedAnnotation=true --geneBasedMinAnnotation=true --outputBigQueryTable=BRCA1_BRCA2_closest_genes_test_chr17 --VCFTables=<Your_Google_Cloud_Project_Name>:test.NA12877_chr17 --searchRegions=chr17:41196311:41277499" -Pdataflow-runner
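Because a malformed region is only discovered after the job has been submitted, it can help to check the value locally first. The following is a small sketch that validates a single chrom:start:end region like the one in the example above. The `Region` type and `parse_region` helper are hypothetical and not part of AnnotationHive, and only the single-region form is handled.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_region(text: str) -> Region:
    """Parse one 'chrom:start:end' string and reject obviously invalid values."""
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected chrom:start:end, got {text!r}")
    # int() raises ValueError itself if a coordinate is not numeric.
    chrom, start, end = parts[0], int(parts[1]), int(parts[2])
    if not chrom or start < 0 or end < start:
        raise ValueError(f"invalid coordinates in {text!r}")
    return Region(chrom, start, end)


region = parse_region("chr17:41196311:41277499")
print(region)
```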

Finding all genes that overlap each variant within a specified proximity threshold. Here are the key options:

--geneBasedAnnotation=true
--proximityThreshold=10000

mvn compile exec:java -Dexec.mainClass=com.google.cloud.genomics.cba.StartAnnotationHiveEngine -Dexec.args="BigQueryAnnotateVariants --projectId=<Your_Google_Cloud_Project_Name> --runner=DataflowRunner --stagingLocation=gs://<Your_Google_Cloud_Bucket_Name>/<Staging_Address>/ --bigQueryDatasetId=<YOUR_BigQuery_Dataset_ID> --genericAnnotationTables=<Table address Plus selected fields> (e.g., myProject:myPublicAnnotationSets.hg19_refGene:name:name2 - selecting name and name2 from hg19_refGene table) --VCFTables=<VCF_Table_Names>(e.g., genomics-public-data:1000_genomes_phase_3.variants_20150220_release) --outputBigQueryTable=<Output_Table_Name> --geneBasedAnnotation=true --proximityThreshold=<An_Integer_Number>" -Pdataflow-runner
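No filled-in example is given for this mode, so the snippet below sketches one way to assemble the command. It uses values modeled on the closest-gene example, replacing --geneBasedMinAnnotation=true with --proximityThreshold=10000. The `build_command` helper and the output table name are hypothetical; every flag comes from the template above.

```python
import shlex
from typing import Dict, List


def build_command(options: Dict[str, str]) -> List[str]:
    """Assemble the mvn invocation as an argument list (no shell quoting needed)."""
    exec_args = "BigQueryAnnotateVariants " + " ".join(
        f"--{name}={value}" for name, value in options.items()
    )
    return [
        "mvn", "compile", "exec:java",
        "-Dexec.mainClass=com.google.cloud.genomics.cba.StartAnnotationHiveEngine",
        f"-Dexec.args={exec_args}",
        "-Pdataflow-runner",
    ]


options = {
    "projectId": "<Your_Google_Cloud_Project_Name>",
    "runner": "DataflowRunner",
    "stagingLocation": "gs://<Your_Google_Cloud_Bucket_Name>/staging/",
    "bigQueryDatasetId": "test",
    "genericAnnotationTables":
        "<Your_Google_Cloud_Project_Name>:AnnotationHive.hg19_UCSC_refGene:name:name2",
    "VCFTables": "<Your_Google_Cloud_Project_Name>:test.NA12877_chr17",
    # Hypothetical table name chosen for this sketch.
    "outputBigQueryTable": "genes_within_10kb_test_chr17",
    "geneBasedAnnotation": "true",
    "proximityThreshold": "10000",
}

command = build_command(options)
# shlex.join (Python 3.8+) prints a copy-pasteable, shell-quoted form.
print(shlex.join(command))
```

After replacing the placeholders, the list can be passed to `subprocess.run(command, check=True)` from the AnnotationHive project directory; passing a list rather than a single string avoids quoting problems with the space-separated -Dexec.args value.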
