diff --git a/.gitignore b/.gitignore
index 50bbbee..2441103 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,3 +4,6 @@ out
 jars
 project
 target
+.bloop
+.metals
+.vscode
diff --git a/docs/tools/SvPileup.md b/docs/tools/SvPileup.md
index 2891721..308909f 100644
--- a/docs/tools/SvPileup.md
+++ b/docs/tools/SvPileup.md
@@ -20,11 +20,11 @@ Two output files will be created:
    tag.
 
 The `be` SAM tag contains a comma-delimited list of breakpoints to which a given alignment belongs. Each element is
-a semi-colon delimited, with four fields:
+semi-colon delimited, with four fields:
 
 1. The unique breakpoint identifier (same identifier found in the tab-delimited output).
-2. Either "left" or "right, corresponding to if the read shows evidence of the genomic left or right side of the
-   breakpoint as found in the breakpoint file (i.e. `left_pos` or `right_pos`).
+2. Either "left" or "right, corresponding to whether the read shows evidence of the genomic left or right side of
+   the breakpoint as found in the breakpoint file (i.e. `left_pos` or `right_pos`).
 3. Either "from" or "into", such that when traversing the breakpoint would read through "from" and then into
    "into" in the sequencing order of the read pair. For a split-read alignment, the "from" contains the aligned
    portion of the read that comes from earlier in the read in sequencing order. For an alignment of a read-pair
diff --git a/docs/tools/index.md b/docs/tools/index.md
index 349f179..f44a6b1 100644
--- a/docs/tools/index.md
+++ b/docs/tools/index.md
@@ -4,7 +4,8 @@ title: fgsv tools
 
 # fgsv tools
 
-The following tools are available in fgsv version 0.0.3-994cece.
+The following tools are available in fgsv version 0.0.3-3f469cb.
+
 ## All tools
 
 All tools.
diff --git a/src/main/scala/com/fulcrumgenomics/sv/tools/SvPileup.scala b/src/main/scala/com/fulcrumgenomics/sv/tools/SvPileup.scala
index 7e12d11..1533544 100644
--- a/src/main/scala/com/fulcrumgenomics/sv/tools/SvPileup.scala
+++ b/src/main/scala/com/fulcrumgenomics/sv/tools/SvPileup.scala
@@ -5,7 +5,7 @@ import com.fulcrumgenomics.bam.api.{SamSource, SamWriter}
 import com.fulcrumgenomics.bam.{Bams, Template}
 import com.fulcrumgenomics.commons.io.PathUtil
 import com.fulcrumgenomics.commons.util.LazyLogging
-import com.fulcrumgenomics.fasta.SequenceDictionary
+import com.fulcrumgenomics.fasta.{SequenceDictionary, Topology}
 import com.fulcrumgenomics.sopt.{arg, clp}
 import com.fulcrumgenomics.sv.EvidenceType._
 import com.fulcrumgenomics.sv._
@@ -52,11 +52,11 @@ object TargetBedRequirement extends FgBioEnum[TargetBedRequirement] {
     |   tag.
     |
     |The `be` SAM tag contains a comma-delimited list of breakpoints to which a given alignment belongs. Each element is
-    |a semi-colon delimited, with four fields:
+    |semi-colon delimited, with four fields:
     |
     |1. The unique breakpoint identifier (same identifier found in the tab-delimited output).
-    |2. Either "left" or "right, corresponding to if the read shows evidence of the genomic left or right side of the
-    |   breakpoint as found in the breakpoint file (i.e. `left_pos` or `right_pos`).
+    |2. Either "left" or "right, corresponding to whether the read shows evidence of the genomic left or right side of
+    |   the breakpoint as found in the breakpoint file (i.e. `left_pos` or `right_pos`).
     |3. Either "from" or "into", such that when traversing the breakpoint would read through "from" and then into
     |   "into" in the sequencing order of the read pair. For a split-read alignment, the "from" contains the aligned
     |   portion of the read that comes from earlier in the read in sequencing order. For an alignment of a read-pair
@@ -68,7 +68,7 @@ object TargetBedRequirement extends FgBioEnum[TargetBedRequirement] {
     |Therefore, if the template (alignments for a read pair) contain both types of evidence, then the `be` tag
     |will only be added to the split-read alignments (i.e. the primary and supplementary alignments of the read
     |in the pair that has split-read evidence), and will not be found in the mate's alignment.
-    |
+    |
     |## Example output
     |
     |The following shows two breakpoints:
@@ -153,7 +153,8 @@ class SvPileup
           maxWithinReadDistance = maxAlignedSegmentInnerDistance,
           maxReadPairInnerDistance = maxReadPairInnerDistance,
           minUniqueBasesToAdd = minUniqueBasesToAdd,
-          slop = slop
+          slop = slop,
+          dict = source.dict,
         )
 
         val filteredEvidences = targets match {
@@ -320,12 +321,14 @@ object SvPileup extends LazyLogging {
    *                            adding them.
    * @param slop the number of bases of slop to allow when determining which records to track for the
    *             left or right side of an aligned segment when merging segments
+   * @param dict the sequence dictionary to use for determining if a contig is circular
    */
  def findBreakpoints(template: Template,
                      maxWithinReadDistance: Int,
                      maxReadPairInnerDistance: Int,
                      minUniqueBasesToAdd: Int,
-                     slop: Int = 0
+                     slop: Int = 0,
+                     dict: SequenceDictionary
                     ): IndexedSeq[BreakpointEvidence] = {
    val segments = AlignedSegment.segmentsFrom(template, minUniqueBasesToAdd=minUniqueBasesToAdd, slop=slop)
 
@@ -334,11 +337,11 @@ object SvPileup extends LazyLogging {
        NoBreakpoints
      case 2 =>
        // Special case for 2 since most templates will generate two segments and we'd like it to be efficient
-        val bp = findBreakpoint(segments.head, segments.last, maxWithinReadDistance, maxReadPairInnerDistance)
+        val bp = findBreakpoint(segments.head, segments.last, maxWithinReadDistance, maxReadPairInnerDistance, dict)
        if (bp.isEmpty) NoBreakpoints else bp.toIndexedSeq
      case _ =>
        segments.iterator.sliding(2).flatMap { case Seq(seg1, seg2) =>
-          findBreakpoint(seg1, seg2, maxWithinReadDistance, maxReadPairInnerDistance)
+          findBreakpoint(seg1, seg2, maxWithinReadDistance, maxReadPairInnerDistance, dict)
        }.toIndexedSeq
    }
  }
@@ -347,9 +350,10 @@ object SvPileup extends LazyLogging {
  private def findBreakpoint(seg1: AlignedSegment,
                             seg2: AlignedSegment,
                             maxWithinReadDistance: Int,
-                             maxReadPairInnerDistance: Int): Option[BreakpointEvidence] = {
+                             maxReadPairInnerDistance: Int,
+                             dict: SequenceDictionary): Option[BreakpointEvidence] = {
    if (isInterContigBreakpoint(seg1, seg2) ||
-      isIntraContigBreakpoint(seg1, seg2, maxWithinReadDistance, maxReadPairInnerDistance)
+      isIntraContigBreakpoint(seg1, seg2, maxWithinReadDistance, maxReadPairInnerDistance, dict)
    ) {
      val ev = if (seg1.origin.isInterRead(seg2.origin)) EvidenceType.ReadPair else EvidenceType.SplitRead
      Some(BreakpointEvidence(from=seg1, into=seg2, evidence=ev))
@@ -370,9 +374,10 @@ object SvPileup extends LazyLogging {
    r1.refIndex != r2.refIndex
  }
 
-  /** Determines if the two segments are provide evidence of a breakpoint joining two different regions from
-    * the same contig. Returns true if:
-    *   - the two segments overlap (implying some kind of duplication) (note overlapping reads will get a merged seg)
+  /** Determines if the two segments provide evidence of a breakpoint joining two different regions from
+    * the same contig. If the contig is circular (i.e. labeled `TP:circular` in the `SQ` header), then
+    * reads that span the origin are considered contiguous. Returns true if:
+    *   - the two segments overlap (implying some kind of duplication) (note: overlapping reads will get a merged seg)
    *   - the strand of the two segments differ (implying an inversion or other rearrangement)
    *   - the second segment is before the first segment on the genome
    *   - the distance between the two segments is larger than the maximum allowed (likely a deletion)
@@ -381,29 +386,46 @@ object SvPileup extends LazyLogging {
    * @param seg2 the second alignment segment
    * @param maxWithinReadDistance the maximum distance between segments if they are from the same read
    * @param maxBetweenReadDistance the maximum distance between segments if they are from different reads
+    * @param dict the sequence dictionary to use for determining if a contig is circular
    */
  def isIntraContigBreakpoint(seg1: AlignedSegment,
                              seg2: AlignedSegment,
                              maxWithinReadDistance: Int,
-                              maxBetweenReadDistance: Int): Boolean = {
+                              maxBetweenReadDistance: Int,
+                              dict: SequenceDictionary): Boolean = {
    require(seg1.range.refIndex == seg2.range.refIndex)
 
    // The way aligned segments are generated for a template, if we have all the reads in the expected orientation
-    // the segments should all come out on the same strand. Therefore any difference in strand is odd. In addition
-    // any segment that "moves backwards" down the genome is odd, as genome position and read position should increase
-    // together.
-    if (seg1.positiveStrand != seg2.positiveStrand) true
-    else if (seg1.positiveStrand && seg2.range.start < seg1.range.end) true
-    else if (!seg1.positiveStrand && seg1.range.start < seg2.range.start) true
-    else {
-      val maxDistance = if (seg1.origin.isInterRead(seg2.origin)) maxBetweenReadDistance else maxWithinReadDistance
-
-      val innerDistance = {
-        if (seg1.range.start <= seg2.range.start) seg2.range.start - seg1.range.end
-        else seg1.range.start - seg2.range.end
+    // the segments should all come out on the same strand. Therefore any difference in strand is odd.
+    val positiveStrand = seg1.positiveStrand
+    positiveStrand != seg2.positiveStrand || {
+      // Otherwise, any segment that "moves backwards" down the genome is odd, as genome position and read position
+      // should increase together (unless the contig is circular).
+      val contig = dict(seg1.range.refIndex)
+      val isCircular = contig.topology.contains(Topology.Circular)
+      (!isCircular && (
+        (positiveStrand && seg2.range.start < seg1.range.end) ||
+        (!positiveStrand && seg1.range.start < seg2.range.end)
+      )) || {
+        // If the contig is circular and the segments span the origin, treat them as contiguous when
+        // calculating the distance between them.
+        val innerDistance = if (isCircular && positiveStrand && seg2.range.end <= seg1.range.start) {
+          require(seg1.range.end <= contig.length)
+          (contig.length - seg1.range.end) + seg2.range.start
+        }
+        else if (isCircular && !positiveStrand && seg1.range.end <= seg2.range.start) {
+          require(seg2.range.end <= contig.length)
+          (contig.length - seg2.range.end) + seg1.range.start
+        }
+        else if (seg1.range.start <= seg2.range.start) {
+          seg2.range.start - seg1.range.end
+        }
+        else {
+          seg1.range.start - seg2.range.end
+        }
+        val maxDistance = if (seg1.origin.isInterRead(seg2.origin)) maxBetweenReadDistance else maxWithinReadDistance
+        innerDistance > maxDistance
      }
-
-      innerDistance > maxDistance
    }
  }
}
diff --git a/src/test/scala/com/fulcrumgenomics/sv/tools/SvPileupTest.scala b/src/test/scala/com/fulcrumgenomics/sv/tools/SvPileupTest.scala
index 3998a97..47bd574 100644
--- a/src/test/scala/com/fulcrumgenomics/sv/tools/SvPileupTest.scala
+++ b/src/test/scala/com/fulcrumgenomics/sv/tools/SvPileupTest.scala
@@ -5,6 +5,7 @@ import com.fulcrumgenomics.alignment.Cigar
 import com.fulcrumgenomics.bam.Template
 import com.fulcrumgenomics.bam.api.{SamRecord, SamWriter}
 import com.fulcrumgenomics.commons.io.PathUtil
+import com.fulcrumgenomics.fasta.{SequenceDictionary, SequenceMetadata, Topology}
 import com.fulcrumgenomics.sv.EvidenceType.{ReadPair, SplitRead}
 import com.fulcrumgenomics.sv.SegmentOrigin.{Both, ReadOne, ReadTwo}
 import com.fulcrumgenomics.sv._
@@ -47,35 +48,35 @@ class SvPileupTest extends UnitSpec {
    // any overlapping segments or jumping backwards between segments is indicative of a breakpoint
 
    // What you might get from a read pair with a gap between the two reads
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=ReadTwo), maxWithinReadDistance=5, maxBetweenReadDistance=150) shouldBe false
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=ReadTwo), maxWithinReadDistance=5, maxBetweenReadDistance=150, dict=builder.dict) shouldBe false
 
    // same segment but jumping backwards
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=earlier, maxWithinReadDistance=5, maxBetweenReadDistance=150) shouldBe true
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=earlier, maxWithinReadDistance=5, maxBetweenReadDistance=150, dict=builder.dict) shouldBe true
 
    // overlapping segments but still jumping backwards
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=overlap, maxWithinReadDistance=5, maxBetweenReadDistance=150) shouldBe true
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=overlap, maxWithinReadDistance=5, maxBetweenReadDistance=150, dict=builder.dict) shouldBe true
 
    // non-overlapping segments from the same read, testing various values for maxWithinReadDistance
    // Note that the inner distance between two blocks is defined as `later.start - earlier.end`, so for
    // this case that is 150-100 = 50, so a breakpoint should be called when the maxWithinReadDistance < 50.
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later, maxWithinReadDistance= 5, maxBetweenReadDistance=150) shouldBe true
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later, maxWithinReadDistance=25, maxBetweenReadDistance=150) shouldBe true
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later, maxWithinReadDistance=48, maxBetweenReadDistance=150) shouldBe true
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later, maxWithinReadDistance=49, maxBetweenReadDistance=150) shouldBe true
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later, maxWithinReadDistance=50, maxBetweenReadDistance=150) shouldBe false
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later, maxWithinReadDistance=51, maxBetweenReadDistance=150) shouldBe false
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later, maxWithinReadDistance= 5, maxBetweenReadDistance=150, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later, maxWithinReadDistance=25, maxBetweenReadDistance=150, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later, maxWithinReadDistance=48, maxBetweenReadDistance=150, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later, maxWithinReadDistance=49, maxBetweenReadDistance=150, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later, maxWithinReadDistance=50, maxBetweenReadDistance=150, dict=builder.dict) shouldBe false
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later, maxWithinReadDistance=51, maxBetweenReadDistance=150, dict=builder.dict) shouldBe false
 
    // non-overlapping segments where the later segment is "both" so indicates a split read breakpoint
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=Both), maxWithinReadDistance= 5, maxBetweenReadDistance=150) shouldBe true
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=Both), maxWithinReadDistance=25, maxBetweenReadDistance=150) shouldBe true
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=Both), maxWithinReadDistance=49, maxBetweenReadDistance=150) shouldBe true
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=Both), maxWithinReadDistance=75, maxBetweenReadDistance=150) shouldBe false
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=Both), maxWithinReadDistance= 5, maxBetweenReadDistance=150, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=Both), maxWithinReadDistance=25, maxBetweenReadDistance=150, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=Both), maxWithinReadDistance=49, maxBetweenReadDistance=150, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=Both), maxWithinReadDistance=75, maxBetweenReadDistance=150, dict=builder.dict) shouldBe false
 
    // non-overlapping segments where the later segment is "Read2" so between read distance should be used
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=ReadTwo), maxWithinReadDistance=5, maxBetweenReadDistance=5 ) shouldBe true
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=ReadTwo), maxWithinReadDistance=5, maxBetweenReadDistance=25) shouldBe true
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=ReadTwo), maxWithinReadDistance=5, maxBetweenReadDistance=49) shouldBe true
-    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=ReadTwo), maxWithinReadDistance=5, maxBetweenReadDistance=75) shouldBe false
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=ReadTwo), maxWithinReadDistance=5, maxBetweenReadDistance=5, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=ReadTwo), maxWithinReadDistance=5, maxBetweenReadDistance=25, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=ReadTwo), maxWithinReadDistance=5, maxBetweenReadDistance=49, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(seg1=earlier, seg2=later.copy(origin=ReadTwo), maxWithinReadDistance=5, maxBetweenReadDistance=75, dict=builder.dict) shouldBe false
  }
 
  "SvPileup.isIntraContigBreakpoint" should "identify when two segments flip strand" in {
@@ -88,28 +89,33 @@ class SvPileupTest extends UnitSpec {
    val r2 = f2.copy(positiveStrand=false)
 
    // Simple tests that should not call breakpoints
-    SvPileup.isIntraContigBreakpoint(f1, f2, maxWithinReadDistance=500, maxBetweenReadDistance=500) shouldBe false
-    SvPileup.isIntraContigBreakpoint(r2, r1, maxWithinReadDistance=500, maxBetweenReadDistance=500) shouldBe false
+    SvPileup.isIntraContigBreakpoint(f1, f2, maxWithinReadDistance=500, maxBetweenReadDistance=500, dict=builder.dict) shouldBe false
+    SvPileup.isIntraContigBreakpoint(r2, r1, maxWithinReadDistance=500, maxBetweenReadDistance=500, dict=builder.dict) shouldBe false
 
    // Now what if we make them different reads
-    SvPileup.isIntraContigBreakpoint(f1, f2.copy(origin=ReadTwo), maxWithinReadDistance=500, maxBetweenReadDistance=500) shouldBe false
-    SvPileup.isIntraContigBreakpoint(r2, r1.copy(origin=ReadTwo), maxWithinReadDistance=500, maxBetweenReadDistance=500) shouldBe false
+    SvPileup.isIntraContigBreakpoint(f1, f2.copy(origin=ReadTwo), maxWithinReadDistance=500, maxBetweenReadDistance=500, dict=builder.dict) shouldBe false
+    SvPileup.isIntraContigBreakpoint(r2, r1.copy(origin=ReadTwo), maxWithinReadDistance=500, maxBetweenReadDistance=500, dict=builder.dict) shouldBe false
 
    // But any combination on different strands should yield a breakpoint
-    SvPileup.isIntraContigBreakpoint(f1, r1, maxWithinReadDistance=500, maxBetweenReadDistance=500) shouldBe true
-    SvPileup.isIntraContigBreakpoint(f1, r2, maxWithinReadDistance=500, maxBetweenReadDistance=500) shouldBe true
-    SvPileup.isIntraContigBreakpoint(f2, r1, maxWithinReadDistance=500, maxBetweenReadDistance=500) shouldBe true
-    SvPileup.isIntraContigBreakpoint(f2, r2, maxWithinReadDistance=500, maxBetweenReadDistance=500) shouldBe true
-    SvPileup.isIntraContigBreakpoint(r1, f1, maxWithinReadDistance=500, maxBetweenReadDistance=500) shouldBe true
-    SvPileup.isIntraContigBreakpoint(r1, f2, maxWithinReadDistance=500, maxBetweenReadDistance=500) shouldBe true
-    SvPileup.isIntraContigBreakpoint(r2, f1, maxWithinReadDistance=500, maxBetweenReadDistance=500) shouldBe true
-    SvPileup.isIntraContigBreakpoint(r2, f2, maxWithinReadDistance=500, maxBetweenReadDistance=500) shouldBe true
+    SvPileup.isIntraContigBreakpoint(f1, r1, maxWithinReadDistance=500, maxBetweenReadDistance=500, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(f1, r2, maxWithinReadDistance=500, maxBetweenReadDistance=500, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(f2, r1, maxWithinReadDistance=500, maxBetweenReadDistance=500, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(f2, r2, maxWithinReadDistance=500, maxBetweenReadDistance=500, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(r1, f1, maxWithinReadDistance=500, maxBetweenReadDistance=500, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(r1, f2, maxWithinReadDistance=500, maxBetweenReadDistance=500, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(r2, f1, maxWithinReadDistance=500, maxBetweenReadDistance=500, dict=builder.dict) shouldBe true
+    SvPileup.isIntraContigBreakpoint(r2, f2, maxWithinReadDistance=500, maxBetweenReadDistance=500, dict=builder.dict) shouldBe true
  }
 
  //////////////////////////////////////////////////////////////////////////////
  // Objects and functions used in testing findBreakpoint()
  //////////////////////////////////////////////////////////////////////////////
-  private val builder = new SamBuilder(readLength=100)
+  private val builder = {
+    val seqs = (Range.inclusive(1, 22) ++ Seq("X", "Y")).map { chr =>
+      SequenceMetadata(name="chr" + chr, length=200e6.toInt)
+    } ++ Seq(SequenceMetadata(name="chrM", length=16000, topology = Some(Topology.Circular)))
+    new SamBuilder(readLength=100, sd=Some(SequenceDictionary(seqs:_*)))
+  }
  import SamBuilder.{Minus, Plus, Strand}
 
  /** Construct a read/rec with the information necessary for breakpoint detection. */
@@ -136,7 +142,8 @@ class SvPileupTest extends UnitSpec {
    template = t,
    maxWithinReadDistance = 5,
    maxReadPairInnerDistance = 1000,
-    minUniqueBasesToAdd = 10
+    minUniqueBasesToAdd = 10,
+    dict = builder.dict
  )
 
  /** Short hand for constructing a BreakpointEvidence. */
@@ -293,6 +300,24 @@ class SvPileupTest extends UnitSpec {
    )
  }
 
+  it should "not call a breakpoint from a read pair on opposite sides of a circular contig origin" in {
+    val template = t(
+      r("chrM", 15800, Plus, r=1, cigar="100M", supp=false),
+      r("chrM", 100, Minus, r=2, cigar="100M", supp=false),
+    )
+    call(template) should contain theSameElementsInOrderAs IndexedSeq.empty
+  }
+
+  it should "not call a breakpoint from a split read that spans a circular contig origin" in {
+    val template = t(
+      r("chrM", 15951, Plus, r=1, cigar="50M50S", supp=false),
+      r("chrM", 1, Plus, r=1, cigar="50S50M", supp=true),
+      r("chrM", 300, Minus, r=2, cigar="100M", supp=false),
+    )
+    call(template) should contain theSameElementsInOrderAs IndexedSeq.empty
+  }
+
+
  it should "call a breakpoint from a single-end split read with no mate" in {
    val r1Half1 = r("chr1", 100, Plus, r=0, cigar="50M50S", supp=false)
    val r1Half2 = r("chr7", 800, Plus, r=0, cigar="50S50M", supp=true)
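
The origin-spanning distance rule added to `isIntraContigBreakpoint` may be easier to follow in isolation. The following is a minimal sketch of the positive-strand case only, not the fgsv implementation: `Seg`, `CircularDistance` and `innerDistance` are hypothetical stand-ins for illustration, and the coordinates are taken from the two new `chrM` tests (contig length 16000), assuming 1-based closed segment coordinates.

```scala
// Illustrative sketch only: `Seg` and `CircularDistance` are hypothetical, not fgsv types.
final case class Seg(start: Int, end: Int)

object CircularDistance {
  /** Inner distance between two positive-strand segments, where `first` precedes `second` in read order.
    * If the contig is circular and `second` ends at or before the start of `first`, the gap is
    * measured through the origin rather than backwards along the contig. */
  def innerDistance(first: Seg, second: Seg, contigLength: Int, circular: Boolean): Int = {
    if (circular && second.end <= first.start) {
      require(first.end <= contigLength, "segment extends past the end of the contig")
      (contigLength - first.end) + second.start
    }
    else if (first.start <= second.start) second.start - first.end
    else first.start - second.end
  }

  def main(args: Array[String]): Unit = {
    // Split read from the new test: 50M50S at chrM:15951 and 50S50M at chrM:1
    println(innerDistance(Seg(15951, 16000), Seg(1, 50), contigLength = 16000, circular = true))
    // Read pair from the new test: 100M at chrM:15800 and 100M at chrM:100
    println(innerDistance(Seg(15800, 15899), Seg(100, 199), contigLength = 16000, circular = true))
    // Same split read measured linearly, as it would be on a non-circular contig
    println(innerDistance(Seg(15951, 16000), Seg(1, 50), contigLength = 16000, circular = false))
  }
}
```

Under these assumptions the split read has an inner distance of 1 and the read pair 201, both within the limits used by the test helper (`maxWithinReadDistance = 5`, `maxReadPairInnerDistance = 1000`), which is consistent with the tests expecting no breakpoint. In the patched code a non-circular contig never reaches the distance calculation for these inputs, because the "moves backwards" check returns `true` first; the linear value of 15901 is printed here only for contrast.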