This document explains the algorithm KBase uses to load new versions of graph data (at the time of writing, this is targeted towards taxonomy and ontology graphs) as a batch into a graph database that may contain one or more previous versions of the data. KBase intends to use ArangoDB for graph queries, and so many of the conventions for field names, etc., originate there. The algorithm is based on ArangoDB time travelling.
- Only one instance of the loader may run at once; otherwise the behavior is unspecified.
- There is only one extant node per external ID (e.g. per NCBI Taxonomy ID) at any one time.
- With the algorithm described, any queries that are performed while a load is in progress are not repeatable, as the database is in an inconsistent state.
  - See delta_load_atomicity.md for potential solutions.
- This algorithm does not address relinking data that is linked to the graph, but not part of the graph, to merged nodes.
  - For example, re-linking genomes that were linked to a taxonomy node to the merge target when that node is merged.
- The algorithm only considers merges that occurred in the current load delta.
  - NCBI merge data is cumulative and does not specify when a merge occurred.
  - The algorithm expects that both nodes exist in the prior load.
    - This means that deleted -> merged transitions (which occur in NCBI) will be ignored.
  - Taking deleted nodes into account becomes complex:
    - Nodes can transition from merged -> deleted or vice versa.
    - We would need to check for merge edges in previously deleted nodes.
    - What if a node is merged, deleted, and merged again? Do we add merge edges for both cases? How do we know when the merge actually occurred without reconstructing the entire graph history every load?
    - If tax dumps are skipped, merging becomes even more complex.
- `_key` defines the unique ID of the node.
- `id` defines the ID of the node. This is usually the ID of the node from the external data source (e.g. a taxid for NCBI taxonomy). This ID is not guaranteed to be unique within the graph, but is unique for a particular version of the graph.
- `first_version` is a string that denotes in which version the node first appeared.
- `last_version` is a string that denotes in which version the node last appeared.
- `created` denotes the date, in unix epoch milliseconds, when the node began to exist.
- `expired` denotes the date, in unix epoch milliseconds, when the node ceased to exist.
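For concreteness, a hypothetical extant node document might look like the following. All concrete values are illustrative assumptions, and `expired` uses a large sentinel integer to stand in for the `∞` used in the pseudocode below:

```python
# A hypothetical NCBI taxonomy node, extant from version "v1" onwards.
node = {
    "_key": "562_v1",             # node.id + '_' + version (the chosen key scheme)
    "id": "562",                  # external ID, e.g. an NCBI taxid
    "first_version": "v1",
    "last_version": "v1",
    "created": 1563900000000,     # unix epoch ms
    "expired": 9007199254740991,  # sentinel for "not expired" (an assumption)
    # ... any other schema fields, e.g. scientific_name, rank
}
```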
- `_key` defines the unique ID of the edge.
- `id` defines the ID of the edge. This is usually the ID of the edge from the external data, or an ID generated from the data. This ID uniquely identifies the edge in a particular load. An ID and a timestamp or a load version uniquely identifies the edge in all loads.
- `_from` uniquely defines the node from which the edge originates. It contains a `_key` value.
- `_to` uniquely defines the node where the edge terminates. It contains a `_key` value.
- `from` contains the node `id` from which the edge originates.
- `to` contains the node `id` where the edge terminates.
- `first_version` is a string that denotes in which version the edge first appeared.
- `last_version` is a string that denotes in which version the edge last appeared.
- `created` denotes the date, in unix epoch milliseconds, when the edge began to exist.
- `expired` denotes the date, in unix epoch milliseconds, when the edge ceased to exist.
- `type` denotes the type of the edge: either `std` for an edge which is part of the taxonomy or ontology (e.g. the graph proper) or `merge` for an edge that denotes that the `from` node has been merged into the `to` node (and therefore the `from` node must be expired).
  - Rather than a `type` field, edge types may be implemented as separate collections, depending on the use case.
- In ArangoDB, the `_from` and `_to` fields are prefixed by the name of the collection in which the node resides.
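As an illustration, a hypothetical `merge` edge recording that taxid 1234 was merged into taxid 562 might look like this. The key scheme, values, and the `taxon_nodes` collection name are assumptions for this example:

```python
# A hypothetical merge edge created in version "v2". The "taxon_nodes/"
# prefix on _from/_to follows the ArangoDB collection-name convention.
merge_edge = {
    "_key": "1234_562_v2",          # one possible unique key scheme (illustrative)
    "id": "1234_562",               # generated from the data
    "_from": "taxon_nodes/1234_v1",
    "_to": "taxon_nodes/562_v1",
    "from": "1234",
    "to": "562",
    "first_version": "v2",
    "last_version": "v2",
    "created": 1563990000000,
    "expired": 9007199254740991,    # sentinel for "not expired"
    "type": "merge",
}
```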
| Input | Description |
| --- | --- |
| `nodes_source` | a source of node data |
| `edges_source` | a source of edge data |
| `merge_source` | a source of merged node data |
| `timestamp` | the timestamp (as the unix epoch in milliseconds) to use as the `created` date for the new nodes and edges. `timestamp - 1` will be used as the `expired` date for deleted or merged nodes and edges. |
| `version` | the version of the load. This is applied to nodes and edges and used to create unique `_key`s from node IDs. |
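A minimal sketch of how these inputs might be supplied to a loader. All names here, including `delta_load` itself and the file names, are hypothetical stand-ins rather than a real KBase API; the pseudocode below treats the inputs as free variables:

```python
import time

# Hypothetical driver for the pseudocode below.
timestamp = int(time.time() * 1000)  # unix epoch ms; created date for this load
version = "ncbi_2019_07_23"          # load version string (assumed format)

delta_load(
    nodes_source=open("nodes.dmp"),
    edges_source=open("nodes.dmp"),   # NCBI keeps parent taxid links in the same dump
    merge_source=open("merged.dmp"),  # optional; omit if the source has no merge data
    timestamp=timestamp,
    version=version,
)
```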
def create_node(node):
    node._key = generate_key(...)
    node.first_version = version
    node.last_version = version
    node.created = timestamp
    node.expired = ∞
    save_node_in_db(node)

def create_edge(from_node, to_node, edge):
    edge._from = from_node._key
    edge._to = to_node._key
    edge.from = from_node.id
    edge.to = to_node.id
    edge.first_version = version
    edge.last_version = version
    edge.created = timestamp
    edge.expired = ∞
    save_edge_in_db(edge)

def main():
    for node in get_nodes(nodes_source):
        existing = get_node_from_db(node.id, timestamp)
        if not existing:
            create_node(node)
        elif existing != node:
            # For continuous data you'd need the deletion of the node & edges and creation
            # of the new node and edges to be a transaction.
            # Since this is a batch load we don't worry about it. The whole load would have
            # to be a transaction to make the db stable during the load.
            set_node_expiration_in_db(existing._key, timestamp - 1)
            create_node(node)
        else:
            set_last_node_version_in_db(node._key, version)  # mark node as extant in current load

    # For merges, we only consider merges that occurred in the current release - i.e. if either
    # node doesn't exist, we ignore the merge. It starts getting complicated otherwise.
    # It is assumed the set of nodes from get_nodes() and the set of nodes from get_merges() are
    # disjoint.
    if merge_source:
        for merged_id, merged_into_id in get_merges(merge_source):
            merged = get_node_from_db(merged_id, timestamp)
            merged_into = get_node_from_db(merged_into_id, timestamp)
            if merged and merged_into:
                # We don't need to check whether the edge exists because the algorithm will never
                # leave a node undeleted with an outgoing merge edge. If both nodes exist,
                # there's no preexisting edge.
                set_node_expiration_in_db(merged._key, timestamp - 1)
                create_edge(merged, merged_into, {type: merge})

    # Since not all sources of graphs possess delete info, we figure it out ourselves.
    # May be possible to do this in one query.
    for node in find_extant_nodes_without_last_version_in_db(timestamp, version):
        set_node_expiration_in_db(node._key, timestamp - 1)

    for edge in get_edges(edges_source):
        existing = get_edge_from_db(edge.id, timestamp)
        edge.type = std
        from_node = get_node_from_db(edge.from, timestamp)
        to_node = get_node_from_db(edge.to, timestamp)
        if existing:
            if existing != edge or existing._from != from_node._key or existing._to != to_node._key:
                set_edge_expiration_in_db(existing._key, timestamp - 1)
                create_edge(from_node, to_node, edge)
            else:
                # mark edge as extant in current load
                set_last_edge_version_in_db(existing._key, version)
        else:
            create_edge(from_node, to_node, edge)

    # If a node's edges are changed but the node is otherwise unchanged, the old edges will not
    # be deleted, so we need to expire any edges that aren't in the current version but still
    # exist.
    # May be possible to do this in one query.
    for edge in find_extant_edges_without_last_version_in_db(timestamp, version):
        set_edge_expiration_in_db(edge._key, timestamp - 1)
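The `*_from_db` helpers are left abstract above. As one possible ArangoDB realization, `get_node_from_db(id, timestamp)` reduces to a point-in-time lookup against the `id, expired, created` index described below. This is a sketch using the python-arango driver; the database name, credentials, and the `taxon_nodes` collection name are assumptions:

```python
from arango import ArangoClient  # python-arango driver

db = ArangoClient().db("kbase", username="loader", password="...")  # assumed setup

NODE_QUERY = """
FOR n IN @@col
    FILTER n.id == @id
    FILTER n.created <= @ts AND n.expired >= @ts
    LIMIT 1
    RETURN n
"""

def get_node_from_db(node_id, timestamp, col="taxon_nodes"):
    # A node is extant at `timestamp` iff created <= timestamp <= expired, so
    # expiring a node at `timestamp - 1` removes it from this lookup at `timestamp`.
    cursor = db.aql.execute(
        NODE_QUERY, bind_vars={"@col": col, "id": node_id, "ts": timestamp}
    )
    return next(cursor, None)
```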
- Node and edge equality does not include the `_key`, `_to`, `_from`, `created`, `expired`, and `*_version` fields, but does include any other fields defined on the node or edge as part of the graph schema.
- `generate_key(...)` could have a number of different implementations:
  - completely random, like a UUID
  - `node.id + '_' + version`
  - `node.id` + a UUID per load
  - `node.id` + a user-specified suffix per load
  - The only requirement is that it is guaranteed to be unique.
  - As of 2019-7-23, `node.id + '_' + version` was chosen.
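A minimal sketch of the chosen scheme; the concrete signature is an assumption, since the pseudocode only calls `generate_key(...)`:

```python
def generate_key(node_id: str, version: str) -> str:
    # The scheme chosen as of 2019-7-23. Unique as long as version strings are
    # never reused and IDs/versions cannot collide across the '_' separator.
    return node_id + "_" + version

assert generate_key("562", "v2") == "562_v2"
```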
The following indexes are needed for decent performance of the delta loader and of node / edge finding and traversal queries. Multiple fields separated by commas indicate a compound index. Note that "Yes" in the Unique? column does not mean the index should be declared unique in the database, as that can prevent sharding; it only documents that the indexed fields are unique in practice.
Indexes on the node collection:

| Index | Purpose | Unique? |
| --- | --- | --- |
| `_key` | default | Yes |
| `id, expired, created` | find nodes via an external ID and a timestamp. Used to locate the prior node when updating a node. | Yes |
| `expired, created, last_version` | find extant nodes with or without a particular load version. Used to expire extant nodes not in the current load. | No |
Indexes on the edge collection:

| Index | Purpose | Unique? |
| --- | --- | --- |
| `_key` | default | Yes |
| `_from` | default | No |
| `_to` | default | No |
| `id, expired, created` | find edges via an external ID and a timestamp. Used to locate the prior edge when updating an edge. | Yes |
| `expired, created, last_version` | find extant edges with or without a particular load version. Used to expire extant edges not in the current load. | No |
| `_from, expired, created` | traverse downstream from a node given a timestamp. | No |
| `_to, expired, created` | traverse upstream from a node given a timestamp. | No |
The `expired` field is always before the `created` field in indexes, as the most common use case will be excluding non-extant, or expired, nodes.

We may also want indexes on `to` and `from`, but they are not necessary for the delta loader or traversals.
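As a sketch, the non-default indexes above could be created with the python-arango driver roughly as follows. The collection names are assumptions, and persistent indexes are one reasonable choice of index type:

```python
from arango import ArangoClient

db = ArangoClient().db("kbase", username="loader", password="...")  # assumed setup

nodes = db.collection("taxon_nodes")  # assumed collection name
nodes.add_persistent_index(fields=["id", "expired", "created"])
nodes.add_persistent_index(fields=["expired", "created", "last_version"])

edges = db.collection("taxon_edges")  # assumed collection name
edges.add_persistent_index(fields=["id", "expired", "created"])
edges.add_persistent_index(fields=["expired", "created", "last_version"])
edges.add_persistent_index(fields=["_from", "expired", "created"])
edges.add_persistent_index(fields=["_to", "expired", "created"])

# Note: none of these pass unique=True, per the sharding caveat above.
```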
- To speed up the algorithm:
  - The nodes and edges could be split into parts and parallelized.
    - The nodes must all be loaded prior to the edges to establish the `_key` field.
      - Unless you really want to make the algorithm complex.
  - The new taxonomy could be loaded into a temporary DB and then DB queries used to update the real DB based on comparison to the temporary DB.
    - In the case of ArangoDB, Foxx might be useful.
- We do not use the proxy vertexes described in the time travelling article, as we have very few large nodes and updates are expected to occur as a batch, not continuously. This may need to change in the future.
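To illustrate how the `_from, expired, created` traversal index is exercised, here is a hedged AQL sketch of a time-scoped downstream traversal, reusing the `db` handle from the earlier sketches. The collection names, start vertex, and depth are illustrative assumptions:

```python
# Traverse up to 3 levels downstream of a start node as the graph existed
# at @ts, skipping merge edges and vertices not extant at that time.
TRAVERSAL_QUERY = """
FOR v, e IN 1..3 OUTBOUND @start @@edges
    FILTER e.created <= @ts AND e.expired >= @ts
    FILTER e.type == 'std'
    FILTER v.created <= @ts AND v.expired >= @ts
    RETURN v
"""

cursor = db.aql.execute(
    TRAVERSAL_QUERY,
    bind_vars={
        "start": "taxon_nodes/562_v1",  # assumed start vertex
        "@edges": "taxon_edges",        # assumed edge collection
        "ts": 1563990000000,
    },
)
```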