Skip to content
This repository has been archived by the owner on Mar 20, 2020. It is now read-only.

Backend Graph Specification #4

Closed
6br opened this issue Jun 11, 2019 · 16 comments
Closed

Backend Graph Specification #4

6br opened this issue Jun 11, 2019 · 16 comments
Assignees
Labels
enhancement New feature or request

Comments

@6br
Copy link
Contributor

6br commented Jun 11, 2019

((Updated by Josiah 2019-07-10))
((Updated by Josiah 2019-08-15: Removed PathIndex and Slices))

Version 1 - Basics

  • GraphGenome is the top level object which holds references to:
    • The set of all nodes in the Graph by uuid
    • Set of all Paths in the Graph by accession name
      • One path per accession (contiguous). Each chromosome = ?
    • Method for topological sort to present graph nodes in a consistent order every time
  • Paths contain
    • One unique accession (a.k.a. specimen) name
    • List of NodeTraversals in order (reverse lookup)
    • Can visit the same node multiple times
      • Each node traversal is tracked by NodeTraversal and annotations (epigenetic state, etc.) is tied to a NodeTraversal, not the Node itself.
    • Ex. 5+ 9+ 10+ 50+ 23- 78+
  • Node contains:
    • Sequence (optional)
    • uuid to distinguish from other nodes of same sequence
    • Set of Node Traversals intersecting the node (reverse lookup)
      • You can ask each Traversal what is the next "downstream" or "upstream" node. Meaning you can build a set of neighboring nodes, including self for duplications.
      • Ex. { (Path5, Order 67, +) (Path10, Order 160, -), (Path5, Order 2756, +) }
  • NodeTraversal
    • Node - being traversed
    • Path - the specimen doing the traversing
    • strand: either + or - represents reverse complements and inversions
    • Order - order in which the traversal occurs in this path
      • e.g. 5 would be the 5th node visited by the Path and N+1 is the downstream node
@6br
Copy link
Contributor Author

6br commented Jun 11, 2019

Version 2 - Summary Layers

  • GraphGenome has a list of SummaryLayers, at increasing levels of summarization.
  • Node Changes:
    • Node has link to summary_parent: a node in the next SummaryLayer up.
    • Backlinks from one node can lead to multiple, more specific nodes when zooming in.
  • Nodes: Tracking Alternative Haplotypes
    • While summarizing some sequence information was lost. We need to quantify how much.
    • Nodes from split_groups contain a link to their alternative_allele and a sequence size and percentage identity.
    • alternative_allele ForeignKey(Node)
    • alternative_overlap_bp number of base pairs the two alleles agree on
      • When simple_merge Nodes with alternative_allele, alternative_overlap_bp can be added to ensure percentages don't drift based on nodes of differing sizes.
  • Path changes:
    • Each specimen will have one Path per SummaryLayer
    • Paths belong to one summary level. Paths can be built using the transitive property on Node summary_parents.

((Updated by Josiah 2019-08-15: Clarified Ribbon vs. Path. Added AggregateTraversal))

Version 3 - Ribbon Haplotypes

  • Key idea: instead of sending 1,000 paths for 1,000 specimens, there are ~20 Ribbons sent with specimens counts. A detail query returns a list of specimens in each ribbon. Ribbons are a container to summarize many paths that share commonality i.e. a haplotype.
  • Ribbons contain:
    • Dendrogram of paths computed like a gene tree of similar sequences, show how diverged each pair of specimens are
      • Similarity is defined by the sequence of nodes along their path
    • Branch lengths for similarity
    • Presented Ribbons are not hard boundaries, just a cutoff based on data and level of detail. There is a fundamental problem with coloring local versus global solutions will not match. Further discussion/investigation is required.
  • AggregateTraversal (NodeTraversal): Ribbons have higher level traversal objects
    • coverage_count The number of specimens inside this ribbon that actually traverse this node. From this can be calculated the number of specimens bypassing a node.
    • Possibly sum in copy number expansion as well (this is summarizing)

Version 4 - Annotation

  • Annotation Stack is a list of annotations
  • Annotation metadata is a tree/graph of annotations that represents relationship between annotations
    • For example, a gene contains isoforms, and an isoform contain introns and exons.
  • Annotation has two types of
    • Contiguous annotation (i.e. paths)
    • Continuous value annotation (vg pack format, gGFF format)
      (WIP)

@josiahseaman
Copy link
Member

josiahseaman commented Jun 11, 2019

  • Did you end up choosing Node storing a Path index, or Nodes and Paths both pointing to an Intersection object? Both versions are currently in the spec. Once you've decided, let's remove the redundant spec.
  • In version 2, checkpoints should probably be their own object. Paths would contain a list of segments with a checkpoint at the beginning of every segment. I'm still not convinced checkpoints are ideal, but it's an acceptable rough draft.

@6br
Copy link
Contributor Author

6br commented Jun 12, 2019

For the first issue, I prefer to store a Path index into Node. Because it is simpler than using Intersection object.

@6br
Copy link
Contributor Author

6br commented Jun 12, 2019

For the second issue, checkpoints seem ad-hoc ideas to represent nucleotide coordinates on paths, but they are useful when users wish to understand what part of the whole genome do they see. Suppose if paths contain a list of segments with a checkpoint, and if you want to merge two segments across the checkpoint, then it becomes unclear how to handle the checkpoint. However, if you specify checkpoints on the farthest view (aka. bird's eyes), then the checkpoints can propagate to all zoom layers including nucleotide zoom level because we can assume that we do not merge nodes on zooming up. I wonder if there is a good way to represent checkpoints in this spec.

@josiahseaman
Copy link
Member

Check points are ad hoc, but they may scale very nicely all the same. To be clear, checkpoints are just a speed optimization feature. It's possible to start at the beginning of a path and count every nucleotide in every node until you reach the index you are looking for. Checkpoints are just a precomputed sum stored haphazardly throughout the path. So you can delete a checkpoint and nothing bad happens, it just takes fractionally longer to calculate. As the graph is iteratively summarized, checkpoints would become increasingly dense and outnumber nodes, which is bad. So every time a merge operation spans a checkpoint, just delete the checkpoint and append the two lists together. That way there's roughly the same ratio of checkpoint to node. The nice thing about preserving checkpoints at large scale that were precomputed at nucleotide scale is that there's no rounding error propagation. They'll stay as accurate as the graph allows.

The competitor solution is to have a "coordinate mapping" service that's a completely separate data structure. This may be cleaner in the long run. Either way, we don't have to decide yes/no on checkpoints immediately. We need to decide #5 first.

@ekg Does VG already have a better solution for the problem of mapping old TAIR10 coordinates onto a graph in random access?

@ekg
Copy link

ekg commented Jun 13, 2019

@josiahseaman what's a TAIR10 coordinate?

In vg we have the xg index, which handles coordinate mappings of the graph. I've submitted a proposal for the upcoming Biohackathon that extends this model with a server interface that could be used by graph genome browsers.

@josiahseaman
Copy link
Member

Great! That sounds like exactly what we're looking for. So we'd use the service you create at BioHackathon in combination with the links from node to summary node (Version 2 - Summary Stacks) to be able to find for example, the beginning and end node at zoom level 3 corresponding to nucleotides 23,758,110 and 45,687,084. That'd be two calls to Erik's service and three lookups to iterate up zoom levels. I think we can put that question to rest now.

P.S. TAIR10 is just an example of a particular genome assembly, I could also have said Hg38 coordinate just to point out the coordinate concept is dependent on a particular assembly.

@ekg
Copy link

ekg commented Jun 13, 2019

xg provides graph to genome path position mappings. That's pretty much the basis for these assembly coordinate systems.

@josiahseaman josiahseaman added the enhancement New feature or request label Jun 17, 2019
@6br
Copy link
Contributor Author

6br commented Jun 17, 2019

My goal: I have a python data interface to interact with xg server on version 1.

6br pushed a commit to 6br/vgbrowser that referenced this issue Jun 19, 2019
6br pushed a commit to 6br/vgbrowser that referenced this issue Jun 19, 2019
6br pushed a commit to 6br/vgbrowser that referenced this issue Jun 19, 2019
6br pushed a commit to 6br/vgbrowser that referenced this issue Jun 19, 2019
6br pushed a commit that referenced this issue Jun 20, 2019
6br pushed a commit that referenced this issue Jun 23, 2019
6br pushed a commit that referenced this issue Jun 24, 2019
@josiahseaman
Copy link
Member

Toshiyuki:
"We have several candidates to worth trying as a backend data store. First I tried xg. Erik will implement a server version of xg, but it is not yet available. So, I decided to use a binary version of xg so far. Next, we need to interact with xg using GFA format. I searched the library to represent gfa format, and I found gfapy. But, due to dialects of GFA format, the output of gfapy and xg is different in several cases. Also, there is no interface to compare gfa formats. I implemented an initial version of interface between gfa and xg (gfa.py). I am not satisfied with the function of gfapy itself. Another approach is to have our own data format in python class (graph.py). If we select this approach, I need to convert them into gfa format to store in xg format or other file formats. We first need to choose two approach (1) "

@josiahseaman
Copy link
Member

josiahseaman commented Jun 24, 2019

This week:

  • I'll split out my Graph object code to a python file and push it to GitHub.
  • Toshiyuki will add import from GFA or VG into our python object format.
  • Finish Graph summarization with sequence for Split_vertical, propagate_split, merge_horizontal

josiahseaman added a commit that referenced this issue Jun 24, 2019
6br pushed a commit that referenced this issue Jun 28, 2019
6br pushed a commit that referenced this issue Jun 28, 2019
6br pushed a commit that referenced this issue Jun 30, 2019
6br pushed a commit that referenced this issue Jun 30, 2019
6br pushed a commit that referenced this issue Jul 1, 2019
6br pushed a commit that referenced this issue Jul 1, 2019
6br pushed a commit that referenced this issue Jul 5, 2019
6br pushed a commit that referenced this issue Jul 5, 2019
6br pushed a commit that referenced this issue Jul 8, 2019
6br pushed a commit that referenced this issue Jul 9, 2019
6br pushed a commit that referenced this issue Jul 9, 2019
josiahseaman added a commit that referenced this issue Jul 9, 2019
…between Nodes and Paths. Paths get built from Node declarations. Still needs work on repr()
josiahseaman added a commit that referenced this issue Jul 9, 2019
6br pushed a commit that referenced this issue Jul 9, 2019
6br pushed a commit that referenced this issue Jul 9, 2019
josiahseaman added a commit that referenced this issue Jul 9, 2019
…on with a Paths first open structure with no Slices.
@josiahseaman
Copy link
Member

As I was working on commit 098b4cb I was taking notes for @6br . There were a couple of decisions I wasn't sure about. The big change was the lack of slices in Graph class. Instead, everything is just Nodes and Paths. I ended up keeping dictionaries of both which should be replaced with a database lookup in the future.

I managed to replace GFA.to_graph with only this:

    def to_graph(self):
        # Extract all paths into graph
        path_names = [p.name for p in self.gfa.paths]
        graph = Graph(path_names)  # Paths can be empty at start
        for path in self.gfa.paths:
            for node in path.segment_names:
                graph.append_node_to_path(node.name, node.orient, path.name)
        for segment in self.gfa.segments:
            graph.nodes[segment.name].seq = segment.sequence
        return graph

This build the full sequence for all accessions and contains the full graph structure. But it doesn't do slices or topological sort. The GFA lines: L 1 + 2 + 0M have not yet been addressed.

Why is to_gfa a @Property? There's nothing incorrect, it just surprised me.

The future of graph tests

My mockup declaration format has reached the end of its life. You really can't declare inversions or any profoundly non-linear structure with this format: [['CAAATAAG', {x, y, z}], ['A', {y, z}, 'G', {x}], ['C', {x, y, z}], ['TTG', {x, y, z}], ['A', {z}, 'G', {x, y}], ['AAATTTTCTGGAGTTCTAT', {x, y, z}], ['T', {x, y, z}], ['ATAT', {x, y, z}], ['T', {x, y, z}], ['CCAACTCTCTG', {x, y, z}]]. It's useful but adding strands to the path mentions is not equivalent to having strands in paths when traversing nodes. We may just want to have a bunch of pickle files from here on out to support our tests. When we first make a test, we use a debugger to look at the structure and verify it by hand, then pickle it to a file and compare against the pickle in our test. The test test_load_gfa_to_graph() isn't passing but that is because it has nothing to compare to. As far as I can tell, the gfa.to_graph function is working correctly.

@6br
Copy link
Contributor Author

6br commented Jul 10, 2019

Thank you for your comments.
Your modification for GFA.to_graph is reasonable because you decided to change the declaration of Graph class. Then, I will move the algorithms to convert paths to slices to the other method. I think slices should be managed in our backend server, so it makes sense to implement these algorithms by ourselves.

As long as we use paths as the internal structure, then we don't have to replicate nodes, but however, I think we need to allow node replication if we wish to convert paths to slices.

pathA: 1+, 2+, 3+, 4+ 
pathB: 1+, 3+, 2+, 4+

In this case, we cannot simply dagify without replication of nodes. Instead, it is better to replicate node 2+ as 2+’. For example,
Slice([Node(1+, (A, B))]), Slice([Node(2+, (A))]), ([Node(3+, (A, B))]), ([Node(2+’, (B))]), ([Node(4+, (A, B))]).
In addition, I think each Node should have id, we call it "index" to distinguish nodes that share the same sequence. Node id is already defined in an original GFA file, and I added this modification of the definition of Node class on dagify branch.

@josiahseaman
Copy link
Member

josiahseaman commented Jul 10, 2019 via email

@josiahseaman
Copy link
Member

josiahseaman commented Jul 10, 2019 via email

josiahseaman added a commit that referenced this issue Jul 10, 2019
josiahseaman added a commit that referenced this issue Jul 10, 2019
…a now working with new Graph definition, no need for slices.
josiahseaman added a commit that referenced this issue Jul 10, 2019
…between Nodes and Paths. Paths get built from Node declarations. Still needs work on repr()
@6br
Copy link
Contributor Author

6br commented Jul 11, 2019

I confirmed modifications on NodeTraversal and PathIndex.

josiahseaman added a commit that referenced this issue Jul 19, 2019
#4: Add DAGify method for linearizing the order of nodes and creating SlicedGraph using recursive longest common strings.
josiahseaman added a commit that referenced this issue Aug 15, 2019
josiahseaman added a commit that referenced this issue Aug 15, 2019
josiahseaman added a commit that referenced this issue Aug 22, 2019
…s can visit Nodes from any zoom layer. There is one path per accession per zoom layer. Paths and nodes contain links to their parent/child summaries.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants