Per-position allele count reporting utility (like bam-readcount) #297

jpdna · 2018-02-20T02:49:45Z

Looking for general comments about utility and approach - not ready for final review yet.

This PR adds functionality to produce a per-position report of the number of reads reporting an A/C/G/T/N or an insertion/deletion at each genomic position covered by at least one read in an input alignmentRecordRDD.

The functionality is intended to mirror that of the bam-readcount program (https://github.com/genome/bam-readcount), which is useful for getting such per-position stats for downstream applications such as testing new variant calling models or evaluating background sequencing noise at a given position.

The approach of this implementation is to perform an alignmentRecordRDD.mapPartitions, in which each partition builds an in-memory hash of type mutable.Map[ReferencePosition, AlleleCounts]. Each hash is returned as an RDD[(ReferencePosition, AlleleCounts)], and the results are then combined across partitions by a reduceByKey that adds the counts at each position, which resolves positions covered by reads from more than one partition. This per-partition approach seemed efficient because the size of the hash is limited to the fraction of the genome in a partition (which can be small if the input is sorted by genomic position), and it avoids a flatMap that would produce a large RDD with every base of every read as an element.
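As a rough illustration of the partition-then-merge idea, here is a minimal plain-Python sketch (not the Scala/Spark code in this PR). The function names `count_partition` and `merge_counts`, the `(contig, start, sequence)` read tuples, and the toy data are all hypothetical, and reads are assumed to be ungapped to keep the example short.

```python
from collections import Counter, defaultdict
from functools import reduce


def count_partition(reads):
    """Build one in-memory map {(contig, pos): Counter} for a single partition."""
    counts = defaultdict(Counter)
    for contig, start, sequence in reads:
        # Simplifying assumption: every read aligns without gaps or clipping.
        for offset, base in enumerate(sequence):
            counts[(contig, start + offset)][base] += 1
    return counts


def merge_counts(left, right):
    """Add counts position by position, similar in spirit to a reduceByKey."""
    merged = defaultdict(Counter, {pos: Counter(c) for pos, c in left.items()})
    for position, alleles in right.items():
        merged[position].update(alleles)  # Counter.update adds counts
    return merged


# Two hypothetical partitions; the second one overlaps positions 104-105
# that are also covered by a read in the first partition.
partitions = [
    [("chr1", 100, "ACGT"), ("chr1", 102, "GTTA")],
    [("chr1", 104, "TAGC")],
]

per_partition = [count_partition(p) for p in partitions]
allele_counts = reduce(merge_counts, per_partition)

for position in sorted(allele_counts):
    print(position, dict(allele_counts[position]))
```

Each partition only ever holds counts for the positions its own reads cover, and the merge step is what reconciles positions such as `("chr1", 104)` that appear in more than one partition.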

The CIGAR/MD reading algorithm was adapted from the existing DiscoverVariants, but the context of reporting every covered position and the use of mapPartitions is sufficiently different that I think a new function/object makes sense rather than adding options to DiscoverVariants.
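For readers unfamiliar with CIGAR walking, the following is a simplified plain-Python sketch of how per-position counts could be derived from one read. It is not the DiscoverVariants-derived code from this PR: it uses only the CIGAR string and read sequence (no MD tag), the name `count_read` is hypothetical, and attributing an insertion to the reference base just before it is an assumption made for this example.

```python
import re
from collections import Counter, defaultdict

# One CIGAR element: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_read(counts, start, cigar, sequence):
    """Add one read's observations to counts, keyed by reference position."""
    ref_pos = start   # current position on the reference
    read_idx = 0      # current index into the read sequence
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":
            # Aligned bases consume both the reference and the read.
            for i in range(length):
                counts[ref_pos + i][sequence[read_idx + i]] += 1
            ref_pos += length
            read_idx += length
        elif op == "I":
            # Assumption: count the insertion at the preceding reference base.
            counts[ref_pos - 1]["ins"] += 1
            read_idx += length
        elif op == "D":
            # A deletion is counted at every deleted reference position.
            for i in range(length):
                counts[ref_pos + i]["del"] += 1
            ref_pos += length
        elif op == "N":
            ref_pos += length   # skipped region: reference only, no counts
        elif op == "S":
            read_idx += length  # soft clip: read only, no counts
        # H and P consume neither the reference nor the read.
    return counts


counts = defaultdict(Counter)
count_read(counts, 100, "2S3M1I2M2D1M", "NNACGTTAC")

for position in sorted(counts):
    print(position, dict(counts[position]))
```

In this toy read, the two soft-clipped bases are ignored, the insertion is recorded at position 102 alongside the aligned G, and positions 105 and 106 each receive one deletion count.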

Before finishing up this PR I wanted to get feedback on:

Does this user-facing per-position summary function already exist in Avocado or elsewhere in ADAM?
Is a command-line utility in Avocado a reasonable place for this functionality to live?

Todo:

Further validate, compare against bam-readcount output, and add tests.

AmplabJenkins · 2018-02-20T02:50:01Z

Can one of the admins verify this patch?

fnothaft · 2018-02-20T05:12:12Z

+1! This looks like a great start. I'll make more of a pass over this later this week. Thanks for the first cut @jpdna!

fnothaft · 2018-02-20T05:12:21Z

Jenkins, add to whitelist.

coveralls · 2018-02-20T06:08:06Z

Coverage decreased (-3.2%) to 76.705% when pulling d6137af on jpdna:posAlleleStats into e0979dd on bigdatagenomics:master.

AmplabJenkins · 2018-02-20T06:09:41Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/avocado-prb/232/

Build result: FAILURE

GitHub pull request #297 of commit d6137af automatically merged.
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-staging-worker-02 (ubuntu staging-02 staging) in workspace /home/jenkins/workspace/avocado-prb
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/bigdatagenomics/avocado.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/avocado.git
> git --version # timeout=10
> git fetch --tags --progress https://github.com/bigdatagenomics/avocado.git +refs/pull/:refs/remotes/origin/pr/
> git rev-parse origin/pr/297/merge^{commit} # timeout=10
> git branch -a -v --no-abbrev --contains 19dfbc2 # timeout=10
Checking out Revision 19dfbc2 (origin/pr/297/merge)
> git config core.sparsecheckout # timeout=10
> git checkout -f 19dfbc267513fdb14751f43d9775194719e3c13f
First time build. Skipping changelog.
Triggering avocado-prb » 2.6.0,2.11,2.0.0,centos
Triggering avocado-prb » 2.3.0,2.10,2.0.0,centos
Triggering avocado-prb » 2.3.0,2.11,2.0.0,centos
Triggering avocado-prb » 2.6.0,2.10,2.0.0,centos
avocado-prb » 2.6.0,2.11,2.0.0,centos completed with result FAILURE
avocado-prb » 2.3.0,2.10,2.0.0,centos completed with result FAILURE
avocado-prb » 2.3.0,2.11,2.0.0,centos completed with result FAILURE
avocado-prb » 2.6.0,2.10,2.0.0,centos completed with result FAILURE
Test FAILed.

jpdna added 4 commits February 19, 2018 08:49

posAlleleStats working

0f3dca3

fix alleles - working

948b70e

phred filter working

bab6695

added reduceByKey to merge partitions

d6137af

fnothaft mentioned this pull request Mar 7, 2018

base counts per position bigdatagenomics/adam#1825

Closed
