-
Notifications
You must be signed in to change notification settings - Fork 1
Roadmap: Graph Summarization #3
Comments
My Goal for this week is adding sequence and sequence identity penalties into my summarization prototype. |
…ions with unordered set equality in string representation.
…xt: add Node.horizontal_merge()
…dded Node.intersection() and union()
I've read "Distribution of recombination crossovers and the origin of haplotype blocks" (Wang et al. 2002) and "The structure of haplotype blocks in the human genome" (Gabriel 2002). The first paper points to the 4 Gamete Test for recombination points outlined in (Kaplan 1985). The approach appeals to me because it's very binary and easily implemented. We'd eventually want to upgrade to (Gabriel 2002) method. The 4 Gamete Test is clearly overly sensitive, however that's precisely what we'd want for the first level of summarization. Everything that can be easily dismissed gets dismissed first, leaving us with a pile of candidates we can devote compute power to telling if they're false positives / erroneous. I'm currently working on a generalization of their approach to graph genomes with an arbitrary number of paths and arbitrary number of alternatives. I think it's possible to make a 5 gamete test, a 6 gamete test and so on that may serve as zoom layers. I'll keep working on it for now. |
From 2004 Paper: https://academic.oup.com/bioinformatics/article/21/2/263/186662 |
Haploblockerhttp://dx.doi.org/10.1534/genetics.119.302283 Haploblocker looks like they have a lot of reusable concepts and methods. Figure 2 (below) looks very similar to the graph summarization algorithm that I specced out and prototyped for graph genomes. The key unexpected similarity is that they use SNP data to build a graph with slices that represent all the variation. So from dissimilar data, they build essentially the same data structure as us. I think this would work on full sequences. Figure 5 looks very similar to a glimpse of what I was proposing to do with sorting and aggregating Paths into Ribbons. Bernardo (Earlham) pointed out that a dendrogram of similar paths was necessarily local, not global. This figure is a great depiction of why. TubeMap gives us the option to gradually rearrange ordering as we go, so we could conceivable keep fragmentation in check. Though we need to represent that graphically if we're allowing paths to migrate from one ribbon color to another. Computing Time"with the full dataset 895 (501 haplotypes, 80,200 SNPs) needing 55 seconds on default, 75 896 seconds with a target coverage and 13.3 minutes in the adaptive 897 mode." They appear to agree on the applicability of the method to sequence. |
Adapting Haploblocker method to Graph GenomesMany of the concepts transfer 1:1. I suppose "Great minds think alike." The question of what order to do the operations in is answered by the paper: "we first alternately apply SM and SG until no further changes occur. Next, we apply the sequence of NN, SM, SG, SM until fixation of the window cluster." HaploBlocker MCMB may actually serve as a decent zoom level. The most zoomed in would have zero Block-filtering. As we zoom out, we increase MCMB and save each intermediary state along the way of summarization through simplification. This is a sleek solution to an arbitrary threshold: "without prior information, we recommend to set a target on what share of the SNP dataset is represented by at least one block (“coverage”)". Notable Deviations
Defining SlicesSlices are necessary to Translate MSA / Haplotype Methods to a Graph Genome.
Programmatically this can be determined by:
Deletions: An issue to address is the absence or deletion variants that will alter what “1 node downstream” means. If this is a large deletion, then they shouldn’t be compared, but small deletions are an alternative without a representative node. |
The slices concept is a limited version of that of bubbles. See
"Superbubbles, Ultrabubbles and Cacti"
https://www.biorxiv.org/content/10.1101/101493v1.
The most general version of these structures are called "snarls". In vg,
there is a method to decompose the graph into a set of snarls. However,
this implementation has some scaling problems. The snarl tree would be the
thing you'd like to use here.
…On Wed, Jul 3, 2019, 08:19 Josiah Seaman ***@***.***> wrote:
Adapting Haploblocker method to Graph Genomes
Many of the concepts transfer 1:1. I suppose "Great minds think alike".
Simple Merge (SM) = Merge Horizontal
Split Group (SG) = Split Vertical
Neglect Node (NN) = Merge Vertical
Extended-block-identification = Migrate Paths (which I had also marked
optional)
The question of what order to do the operations in is answered by the
paper: "we first alternately apply SM and SG until no further changes
occur. Next, we apply the sequence of NN, SM, SG, SM until fixation of the
window cluster."
HaploBlocker MCMB may actually serve as a decent zoom level. The most
zoomed in would have zero Block-filtering. As we zoom out, we increase MCMB
and save each intermediary state along the way of summarization through
simplification. This is a sleek solution to an arbitrary threshold:
"without prior information, we recommend to set a target on what share of
the SNP dataset is represented by at least one block (“coverage”)".
Notable Deviations
- Multiple paths can share the same node, so I'm not sure whether
their block overlap concept is the same or not.
- Block-identification is a large scale Horizontal Merge. It looks
like a second, high level way of removing uninformative variation. It's
unclear at this time if we will include this or not.
- Block-extension: the choice of 1 in 20 windows having a mismatch is
another parameter we may associate with zoom.
- Visual Ordering: Figure 5 (above) shows a gradual decay of coherence
because ordering is strictly preserved. Our advantage is we can use TubeMap
path crossovers to keep the appearance of blocks looking coherent. It'd be
good to know how much of this decoherence is due to set membership and how
much of it is due to simple limitations of 1D sorting.
Defining Slices
*Slices are necessary to Translate MSA / Haplotype Methods to a Graph
Genome.*
Slices are a set of nodes which can be reasonably compared with each other
and considered “alternatives” in a branch of the graph genome. These
correspond to nucleotides in a vertical slice of an MSA. In graph genomes,
it’s a bit more complicated, so here is a proposed technical definition. In
order for a set of Nodes to be in the same slice they need to:
1. Be placed in structurally equivalent positions
2. Be non-diverging i.e. in the same bubble. Large translocations are
divergent.
Programmatically this can be determined by:
3. From Node B, step upstream to Node A and take all nodes downstream
of A to find Node C. This can be an irregular stepsize, meaning that later
summarization steps after horizontal merges will bring a node C into the
same slice with B when previously that was not shared.
4. From Nodes B and C, iterate downstream 10 steps and every node ID
into a set. If the two sets share a member node ID it means they share a
downstream node and are part of the same bubble. Maximum bubble size is
purposely defined within some limited view window. Higher levels of
summarization means each node is more sequence and thus the bubble size in
nucleotides will be larger. This will result in large bubble alternative
not being compared until near the last stages of summarization when working
with large structural features.
*Deletions:* An issue to address is the absence or deletion variants that
will alter what “1 node downstream” means. If this is a large deletion,
then they shouldn’t be compared, but small deletions are an alternative
without a representative node.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#3>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AABDQEIXL6ZEY7AJ2PXOHQ3P5SKMLANCNFSM4HW4RTGA>
.
|
Thank you for the tip Eric! I was thinking of slices as a very limited
scope of the beginning of a bubble. But if we can get a local snarl at any
point in the graph, that's terrific. I'll have to look at the technical
details to see how far that can be taken. I was hoping to do it iteratively
while summarizing.
…On Wed, Jul 3, 2019, 17:16 Erik Garrison ***@***.***> wrote:
The slices concept is a limited version of that of bubbles. See
"Superbubbles, Ultrabubbles and Cacti"
https://www.biorxiv.org/content/10.1101/101493v1.
The most general version of these structures are called "snarls". In vg,
there is a method to decompose the graph into a set of snarls. However,
this implementation has some scaling problems. The snarl tree would be the
thing you'd like to use here.
On Wed, Jul 3, 2019, 08:19 Josiah Seaman ***@***.***> wrote:
> Adapting Haploblocker method to Graph Genomes
>
> Many of the concepts transfer 1:1. I suppose "Great minds think alike".
> Simple Merge (SM) = Merge Horizontal
> Split Group (SG) = Split Vertical
> Neglect Node (NN) = Merge Vertical
> Extended-block-identification = Migrate Paths (which I had also marked
> optional)
>
> The question of what order to do the operations in is answered by the
> paper: "we first alternately apply SM and SG until no further changes
> occur. Next, we apply the sequence of NN, SM, SG, SM until fixation of
the
> window cluster."
>
> HaploBlocker MCMB may actually serve as a decent zoom level. The most
> zoomed in would have zero Block-filtering. As we zoom out, we increase
MCMB
> and save each intermediary state along the way of summarization through
> simplification. This is a sleek solution to an arbitrary threshold:
> "without prior information, we recommend to set a target on what share of
> the SNP dataset is represented by at least one block (“coverage”)".
> Notable Deviations
>
> - Multiple paths can share the same node, so I'm not sure whether
> their block overlap concept is the same or not.
> - Block-identification is a large scale Horizontal Merge. It looks
> like a second, high level way of removing uninformative variation. It's
> unclear at this time if we will include this or not.
> - Block-extension: the choice of 1 in 20 windows having a mismatch is
> another parameter we may associate with zoom.
> - Visual Ordering: Figure 5 (above) shows a gradual decay of coherence
> because ordering is strictly preserved. Our advantage is we can use
TubeMap
> path crossovers to keep the appearance of blocks looking coherent. It'd
be
> good to know how much of this decoherence is due to set membership and
how
> much of it is due to simple limitations of 1D sorting.
>
> Defining Slices
>
> *Slices are necessary to Translate MSA / Haplotype Methods to a Graph
> Genome.*
> Slices are a set of nodes which can be reasonably compared with each
other
> and considered “alternatives” in a branch of the graph genome. These
> correspond to nucleotides in a vertical slice of an MSA. In graph
genomes,
> it’s a bit more complicated, so here is a proposed technical definition.
In
> order for a set of Nodes to be in the same slice they need to:
>
> 1. Be placed in structurally equivalent positions
> 2. Be non-diverging i.e. in the same bubble. Large translocations are
> divergent.
> Programmatically this can be determined by:
> 3. From Node B, step upstream to Node A and take all nodes downstream
> of A to find Node C. This can be an irregular stepsize, meaning that
later
> summarization steps after horizontal merges will bring a node C into the
> same slice with B when previously that was not shared.
> 4. From Nodes B and C, iterate downstream 10 steps and every node ID
> into a set. If the two sets share a member node ID it means they share a
> downstream node and are part of the same bubble. Maximum bubble size is
> purposely defined within some limited view window. Higher levels of
> summarization means each node is more sequence and thus the bubble size
in
> nucleotides will be larger. This will result in large bubble alternative
> not being compared until near the last stages of summarization when
working
> with large structural features.
>
> *Deletions:* An issue to address is the absence or deletion variants that
> will alter what “1 node downstream” means. If this is a large deletion,
> then they shouldn’t be compared, but small deletions are an alternative
> without a representative node.
>
> —
> You are receiving this because you are subscribed to this thread.
> Reply to this email directly, view it on GitHub
> <
#3
>,
> or mute the thread
> <
https://github.com/notifications/unsubscribe-auth/AABDQEIXL6ZEY7AJ2PXOHQ3P5SKMLANCNFSM4HW4RTGA
>
> .
>
—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
<#3>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AARG2FGYZQQEULECFSSXBHLP5TGEBANCNFSM4HW4RTGA>
.
|
I actually think that the snarl decomposition can be problematic, and I'm interested in a path-based method to determine this. Maybe that's more similar to the slices you're describing. The idea would be to find regions where you have colinear-ish alignment between different paths in the graph. Within these chains, places where they don't match are bubbles, and these can be identified by the start and and end of the bubble. Bigger, complex structures are harder to see, and to some extent get ignored by following this definition. |
Complex structures should be ignored by designed because those are areas
where we don't want to oversimplify. Slices are for the very common very
simple case. I'd like to still keep my graph a graph where it's locally
complicated, especially if that is rare.
…On Wed, Jul 3, 2019, 18:14 Erik Garrison ***@***.***> wrote:
I actually think that the snarl decomposition can be problematic, and I'm
interested in a path-based method to determine this. Maybe that's more
similar to the slices you're describing. The idea would be to find regions
where you have colinear-ish alignment between different paths in the graph.
Within these chains, places where they don't match are bubbles, and these
can be identified by the start and and end of the bubble. Bigger, complex
structures are harder to see, and to some extent get ignored by following
this definition.
—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
<#3>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AARG2FGCSB5KEM3B5CZPQK3P5TM6RANCNFSM4HW4RTGA>
.
|
@josiahseaman how is this all related to the variable length markov chains used in Beagle? |
@ekg I wasn't familiar with Beagle. Looking through Browning's 2006 and 2007 papers I can see there's similarities but I'm not 100% sure how to translate it. I'm talking with the author of Haploblocker today, who I'm hoping has more experience with the wider literature than myself. What would you propose we use from the lesson learned with Beagle? If there is an adjustable graph summarization out there, I'd be very happy to just use off the shelf software and close this issue. That would allow us to focus on visualization. |
A couple of thoughts on the use of HaploBlocker for zooming etc. Merge Vertical + Data qualityI think merge vertical is much more similar to allowing for 1 error/deviation per 20 markers than what I call neglect nodes. In neglect nodes I am removing haplotypes locally from the cluster. By this rare variants (which will not be in any haplotype block later) are not further considered - not sure you want that. An alternative for this would potentially be to introduce a node that is including all rare variants. In regard to that there should be a way to account for the data quality. From experience I would assume 2-3% calling errors for low-coverage sequencing (0.5x or similar), <1% for high-coverage (20x+) and 0.2-0.5% on a SNP chip. Further the share of missingness is potentially different – e.g. with 0.5x you would expect 60-70% of all alleles to be called NA even though they are present). ZoomingFor high level zooming the current merging techniques are probably not enough. Looking for long range relations is probably required for that (e.g. group-wise IBD as in HaploBlocker or screening for commonly occurring paths in vg?). As for HaploBlocker this comes with some challenges:
Computing timeThe dataset I used contained genomic data from a single maize landrace and thereby little genomic diversity. For substainitly more diverse computing time should increase – as only one core was used this should still not be the biggest issue. It was enough for 1000G Humans Phase 3. But there are still lots of potential improvements (mostly in the merging procedure and in case of included NAs) to speed it up if it has to be calculated more than once. FurtherHow do you define a “complex structure” – aren’t those like the most interesting to have a graph based representation of? |
Status Update: Python HaploBlocker is running - 13 AugustTorsten and I have been working to develop a Jupyter notebook implementation of the first three steps of HaploBlocker. Input data and tests are feature equivalent. It was very useful to have a working R implementation to compare against for tests. I then created a Django app and migrated the code and concepts into HaploBlocker/haplonetwork.py and tests.py. Split_groups (which I had conceived of as split vertical) is working but could still use more thorough tests. The main goal of Graph Summarization is within sight. We still need to integrate HaploBlocker and Graph together as a database application. This will be a pretty major overhaul with it's own issue #23. Once that is done we should have:
Next:
We're well to the point where it would be helpful to have a front end #24 so we can visualize and browse our test data sets. |
…s can visit Nodes from any zoom layer. There is one path per accession per zoom layer. Paths and nodes contain links to their parent/child summaries.
Implement a method of summarizing the variation inside of a graph.
Prototype Progress
The text was updated successfully, but these errors were encountered: