-
Notifications
You must be signed in to change notification settings - Fork 1
Backend Graph Specification #4
Comments
Version 2 - Summary Layers
((Updated by Josiah 2019-08-15: Clarified Ribbon vs. Path. Added AggregateTraversal)) Version 3 - Ribbon Haplotypes
Version 4 - Annotation
|
|
For the first issue, I prefer to store a Path index into Node. Because it is simpler than using Intersection object. |
For the second issue, checkpoints seem ad-hoc ideas to represent nucleotide coordinates on paths, but they are useful when users wish to understand what part of the whole genome do they see. Suppose if paths contain a list of segments with a checkpoint, and if you want to merge two segments across the checkpoint, then it becomes unclear how to handle the checkpoint. However, if you specify checkpoints on the farthest view (aka. bird's eyes), then the checkpoints can propagate to all zoom layers including nucleotide zoom level because we can assume that we do not merge nodes on zooming up. I wonder if there is a good way to represent checkpoints in this spec. |
Check points are ad hoc, but they may scale very nicely all the same. To be clear, checkpoints are just a speed optimization feature. It's possible to start at the beginning of a path and count every nucleotide in every node until you reach the index you are looking for. Checkpoints are just a precomputed sum stored haphazardly throughout the path. So you can delete a checkpoint and nothing bad happens, it just takes fractionally longer to calculate. As the graph is iteratively summarized, checkpoints would become increasingly dense and outnumber nodes, which is bad. So every time a merge operation spans a checkpoint, just delete the checkpoint and append the two lists together. That way there's roughly the same ratio of checkpoint to node. The nice thing about preserving checkpoints at large scale that were precomputed at nucleotide scale is that there's no rounding error propagation. They'll stay as accurate as the graph allows. The competitor solution is to have a "coordinate mapping" service that's a completely separate data structure. This may be cleaner in the long run. Either way, we don't have to decide yes/no on checkpoints immediately. We need to decide #5 first. @ekg Does VG already have a better solution for the problem of mapping old TAIR10 coordinates onto a graph in random access? |
@josiahseaman what's a TAIR10 coordinate? In vg we have the xg index, which handles coordinate mappings of the graph. I've submitted a proposal for the upcoming Biohackathon that extends this model with a server interface that could be used by graph genome browsers. |
Great! That sounds like exactly what we're looking for. So we'd use the service you create at BioHackathon in combination with the links from node to summary node (Version 2 - Summary Stacks) to be able to find for example, the beginning and end node at zoom level 3 corresponding to nucleotides 23,758,110 and 45,687,084. That'd be two calls to Erik's service and three lookups to iterate up zoom levels. I think we can put that question to rest now. P.S. TAIR10 is just an example of a particular genome assembly, I could also have said Hg38 coordinate just to point out the coordinate concept is dependent on a particular assembly. |
xg provides graph to genome path position mappings. That's pretty much the basis for these assembly coordinate systems. |
My goal: I have a python data interface to interact with xg server on version 1. |
Toshiyuki: |
This week:
|
…between Nodes and Paths. Paths get built from Node declarations. Still needs work on repr()
…on with a Paths first open structure with no Slices.
As I was working on commit 098b4cb I was taking notes for @6br . There were a couple of decisions I wasn't sure about. The big change was the lack of slices in Graph class. Instead, everything is just Nodes and Paths. I ended up keeping dictionaries of both which should be replaced with a database lookup in the future. I managed to replace GFA.to_graph with only this: def to_graph(self):
# Extract all paths into graph
path_names = [p.name for p in self.gfa.paths]
graph = Graph(path_names) # Paths can be empty at start
for path in self.gfa.paths:
for node in path.segment_names:
graph.append_node_to_path(node.name, node.orient, path.name)
for segment in self.gfa.segments:
graph.nodes[segment.name].seq = segment.sequence
return graph This build the full sequence for all accessions and contains the full graph structure. But it doesn't do slices or topological sort. The GFA lines: Why is to_gfa a @Property? There's nothing incorrect, it just surprised me. The future of graph testsMy mockup declaration format has reached the end of its life. You really can't declare inversions or any profoundly non-linear structure with this format: |
Thank you for your comments. As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.
In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example, |
Node should definitely have an ID, it's currently the key in my node
dictionary. I'm uncomfortable with this implementation because it should be
a database lookup, but we haven't started django or flare yet. We need to
get your branch and mine merged ASAP. If you've decided on flare, we need
it for node id tracking.
In going to subclass Graph to SlicedGraph, so we can separate sort and
slicing to it's own optional step.
…On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama ***@***.***> wrote:
Thank you for your comments.
Your modification for GFA.to_graph is reasonable because you decided to
change the declaration of Graph class. Then, I will move the algorithms to
convert paths to slices to the other method. I think slices should be
managed in our backend server, so it makes sense to implement these
algorithms by ourselves.
As long as we use paths as the internal structure, then we don't have to
replicate nodes, but however, I think we need to allow node replication if
we wish to convert paths to slices.
pathA: 1+, 2+, 3+, 4+
pathB: 1+, 3+, 2+, 4+
In this case, we cannot simply dagify without replication of nodes.
Instead, it is better to replicate node 2+ as 2+’. For example,
Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]),
([Node(2+’, (B))]), ([Node(4+, (A, B))]).
In addition, I think each Node should have id, we call it "index" to
distinguish nodes that share the same sequence. Node id is already defined
in an original GFA file, and I added this modification of the definition of
Node class on dagify branch.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#4>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ>
.
On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama ***@***.***> wrote:
Thank you for your comments.
Your modification for GFA.to_graph is reasonable because you decided to
change the declaration of Graph class. Then, I will move the algorithms to
convert paths to slices to the other method. I think slices should be
managed in our backend server, so it makes sense to implement these
algorithms by ourselves.
As long as we use paths as the internal structure, then we don't have to
replicate nodes, but however, I think we need to allow node replication if
we wish to convert paths to slices.
pathA: 1+, 2+, 3+, 4+
pathB: 1+, 3+, 2+, 4+
In this case, we cannot simply dagify without replication of nodes.
Instead, it is better to replicate node 2+ as 2+’. For example,
Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]),
([Node(2+’, (B))]), ([Node(4+, (A, B))]).
In addition, I think each Node should have id, we call it "index" to
distinguish nodes that share the same sequence. Node id is already defined
in an original GFA file, and I added this modification of the definition of
Node class on dagify branch.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#4>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ>
.
|
Minor note: node name should be have + that's path specific, not specific
to the node.
…On Wed, Jul 10, 2019, 11:21 Josiah Seaman ***@***.***> wrote:
Node should definitely have an ID, it's currently the key in my node
dictionary. I'm uncomfortable with this implementation because it should be
a database lookup, but we haven't started django or flare yet. We need to
get your branch and mine merged ASAP. If you've decided on flare, we need
it for node id tracking.
In going to subclass Graph to SlicedGraph, so we can separate sort and
slicing to it's own optional step.
On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama ***@***.***>
wrote:
> Thank you for your comments.
> Your modification for GFA.to_graph is reasonable because you decided to
> change the declaration of Graph class. Then, I will move the algorithms to
> convert paths to slices to the other method. I think slices should be
> managed in our backend server, so it makes sense to implement these
> algorithms by ourselves.
>
> As long as we use paths as the internal structure, then we don't have to
> replicate nodes, but however, I think we need to allow node replication if
> we wish to convert paths to slices.
>
> pathA: 1+, 2+, 3+, 4+
>
> pathB: 1+, 3+, 2+, 4+
>
>
> In this case, we cannot simply dagify without replication of nodes.
> Instead, it is better to replicate node 2+ as 2+’. For example,
> Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]),
> ([Node(2+’, (B))]), ([Node(4+, (A, B))]).
> In addition, I think each Node should have id, we call it "index" to
> distinguish nodes that share the same sequence. Node id is already defined
> in an original GFA file, and I added this modification of the definition of
> Node class on dagify branch.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#4>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ>
> .
>
On Wed, Jul 10, 2019, 02:26 Toshiyuki Yokoyama ***@***.***>
wrote:
> Thank you for your comments.
> Your modification for GFA.to_graph is reasonable because you decided to
> change the declaration of Graph class. Then, I will move the algorithms to
> convert paths to slices to the other method. I think slices should be
> managed in our backend server, so it makes sense to implement these
> algorithms by ourselves.
>
> As long as we use paths as the internal structure, then we don't have to
> replicate nodes, but however, I think we need to allow node replication if
> we wish to convert paths to slices.
>
> pathA: 1+, 2+, 3+, 4+
>
> pathB: 1+, 3+, 2+, 4+
>
>
> In this case, we cannot simply dagify without replication of nodes.
> Instead, it is better to replicate node 2+ as 2+’. For example,
> Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]),
> ([Node(2+’, (B))]), ([Node(4+, (A, B))]).
> In addition, I think each Node should have id, we call it "index" to
> distinguish nodes that share the same sequence. Node id is already defined
> in an original GFA file, and I added this modification of the definition of
> Node class on dagify branch.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#4>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/AARG2FC32YK6PCZDMIBSXYLP6U3FPANCNFSM4HW6S6IQ>
> .
>
|
…node id is imported in GFA.to_graph
…a now working with new Graph definition, no need for slices.
…between Nodes and Paths. Paths get built from Node declarations. Still needs work on repr()
I confirmed modifications on NodeTraversal and PathIndex. |
#4: Add DAGify method for linearizing the order of nodes and creating SlicedGraph using recursive longest common strings.
…taining features except for SlicedGraph
…taining features except for SlicedGraph
…s can visit Nodes from any zoom layer. There is one path per accession per zoom layer. Paths and nodes contain links to their parent/child summaries.
((Updated by Josiah 2019-07-10))
((Updated by Josiah 2019-08-15: Removed PathIndex and Slices))
Version 1 - Basics
The text was updated successfully, but these errors were encountered: