New diff query and parser #3280

ajtmccarty · 2024-05-11T00:59:56Z

New query and Python parsing class for getting a diff for a given base_branch, diff_branch, from_time and to_time.

The Query

The goal is to get every "full path" (Root -> Node -> Attribute/Relationship -> Value/Other Node) that has been altered on one of the given branches in the given time frame. First, we identify every edge that should be included in the diff. This is fairly straightforward because we are just filtering edges on branch, from, and to, as we do in all our cypher queries. Determining which "full paths" to include in the response is harder, and most of the lines in the query deal with that.

There are only 6 basic paths we need to consider for a given diff:

(:Root)<-[:IS_PART_OF]-(:Node)-[:HAS_ATTRIBUTE]->(:Attribute)-[:HAS_VALUE]->(:AttributeValue)           //1
(:Root)<-[:IS_PART_OF]-(:Node)-[:HAS_ATTRIBUTE]->(:Attribute)-[:IS_VISIBLE|IS_PROTECTED]->(:Boolean)    //2
(:Root)<-[:IS_PART_OF]-(:Node)-[:HAS_ATTRIBUTE]->(:Attribute)-[:HAS_OWNER|HAS_SOURCE]->(:Node)          //3
(:Root)<-[:IS_PART_OF]-(:Node)-[:IS_RELATED]->(:Relationship)-[:IS_RELATED]-(:Node)                     //4
(:Root)<-[:IS_PART_OF]-(:Node)-[:IS_RELATED]-(:Relationship)-[:IS_VISIBLE|IS_PROTECTED]->(:Boolean)     //5
(:Root)<-[:IS_PART_OF]-(:Node)-[:IS_RELATED]-(:Relationship)-[:HAS_OWNER|HAS_SOURCE]->(:Node)           //6
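As a rough illustration, the six shapes listed above can be written down as plain data and used to classify a path by its three edge types. This is only a sketch: PATH_SHAPES and classify_path are hypothetical names that do not come from the PR, and edge direction is ignored.

```python
from __future__ import annotations

# Hypothetical sketch: each shape is (first edge, second edge, allowed third edges).
# Edge direction is ignored; the names here are illustrative, not taken from the PR.
PATH_SHAPES: dict[int, tuple[str, str, frozenset[str]]] = {
    1: ("IS_PART_OF", "HAS_ATTRIBUTE", frozenset({"HAS_VALUE"})),
    2: ("IS_PART_OF", "HAS_ATTRIBUTE", frozenset({"IS_VISIBLE", "IS_PROTECTED"})),
    3: ("IS_PART_OF", "HAS_ATTRIBUTE", frozenset({"HAS_OWNER", "HAS_SOURCE"})),
    4: ("IS_PART_OF", "IS_RELATED", frozenset({"IS_RELATED"})),
    5: ("IS_PART_OF", "IS_RELATED", frozenset({"IS_VISIBLE", "IS_PROTECTED"})),
    6: ("IS_PART_OF", "IS_RELATED", frozenset({"HAS_OWNER", "HAS_SOURCE"})),
}


def classify_path(edge_types: tuple[str, str, str]) -> int | None:
    """Return the number (1-6) of the shape matching three edge types, else None."""
    first, second, third = edge_types
    for number, (shape_first, shape_second, shape_third) in PATH_SHAPES.items():
        if (first, second) == (shape_first, shape_second) and third in shape_third:
            return number
    return None
```

A lookup like this could help a test assert that every path returned by the query falls into one of the six expected shapes.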

Each path has 4 cypher nodes and 3 cypher edges in it. The query runs in this order:

1. MATCH (p:Node|Attribute|Relationship)-[diff_rel]->(q): get every cypher edge that is on one of the branches in the diff and has a change within the timeframe of the diff. This should be relatively fast because we have indices on from and branch for HAS_ATTRIBUTE and HAS_VALUE, but it might be worth adding more on the other edge types in the query: IS_PART_OF, IS_RELATED, etc.
2. Get the full paths with the deepest edges in the diff: HAS_VALUE, IS_VISIBLE, IS_PROTECTED, HAS_OWNER, and HAS_SOURCE. If the path identified is on the diff_branch, we need to make sure we include the latest base_branch version of that path so that we can correctly set previous_value and new_value. If the edge in question touches a Relationship node, we also need to get the far side of that relationship so that we can correctly set the peer_id of the relationship being updated in the diff.
3. Get the full paths that include a diff edge of type HAS_ATTRIBUTE or IS_RELATED. This includes getting all the paths with the latest edges in the given timeframe below the diff edge, preferring paths on the same branch as the diff_rel. For example, if a HAS_ATTRIBUTE edge is included in the diff, we also want to include all the HAS_VALUE, IS_VISIBLE, IS_PROTECTED, HAS_SOURCE, and HAS_OWNER paths that connect to the HAS_ATTRIBUTE via the linked Attribute node.
4. Get the full paths that include a diff edge of type IS_PART_OF. This includes getting all the paths with the latest edges in the given timeframe below the diff edge, preferring paths on the same branch as the diff_rel. For example, if a node is added or deleted on the branch, then IS_PART_OF will be included in our diff_rels and we need to get all the latest paths in the given timeframe, with preference for the branch of the diff_rel.
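The edge selection and the per-edge-type handling described above can be sketched in plain Python over in-memory edges. This is not the cypher query from the PR: Edge, in_diff, and STEP_FOR_EDGE_TYPE are hypothetical names, and the rule that an edge "has a change within the timeframe" when its from or to timestamp falls inside the window is an assumption made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Edge:
    """Hypothetical in-memory stand-in for a cypher edge with branch/from/to properties."""

    edge_type: str
    branch: str
    from_time: datetime
    to_time: datetime | None = None  # None means the edge has not been closed


def in_diff(edge: Edge, branches: set[str], from_time: datetime, to_time: datetime) -> bool:
    """Assumed rule: the edge is on a diff branch and starts or ends inside the window."""
    if edge.branch not in branches:
        return False
    started = from_time <= edge.from_time <= to_time
    ended = edge.to_time is not None and from_time <= edge.to_time <= to_time
    return started or ended


# Which of the later steps handles a diff edge, keyed by edge type, following
# the grouping in the description: deepest edges, then middle edges, then IS_PART_OF.
STEP_FOR_EDGE_TYPE = {
    "HAS_VALUE": "deepest", "IS_VISIBLE": "deepest", "IS_PROTECTED": "deepest",
    "HAS_OWNER": "deepest", "HAS_SOURCE": "deepest",
    "HAS_ATTRIBUTE": "middle", "IS_RELATED": "middle",
    "IS_PART_OF": "top",
}

window_start = datetime(2024, 5, 1, tzinfo=timezone.utc)
window_end = datetime(2024, 5, 10, tzinfo=timezone.utc)
edges = [
    Edge("HAS_VALUE", "branch-a", datetime(2024, 5, 5, tzinfo=timezone.utc)),
    Edge("HAS_VALUE", "main", datetime(2024, 4, 1, tzinfo=timezone.utc)),
    Edge("IS_PART_OF", "other", datetime(2024, 5, 5, tzinfo=timezone.utc)),
]
diff_edges = [e for e in edges if in_diff(e, {"main", "branch-a"}, window_start, window_end)]
```

Here only the first edge is selected: the second is on a diff branch but has no change inside the window, and the third is on a branch that is not part of the diff.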

The Python parser

The DiffQueryParser class and its associated dataclasses ingest every "full path" returned by the query and turn them into a simple hierarchical data structure (see dataclasses in core.diff.model.path). Hopefully, this simple internal data structure is enough to handle everything we want to do with the diff.
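To make the idea of folding flat paths into a hierarchy concrete, here is a minimal sketch. The class names (SketchRoot, SketchNode, SketchAttribute, SketchProperty) and the add_path method are hypothetical and are not the dataclasses in core.diff.model.path. It covers attribute paths only and assumes that a base_branch path supplies previous_value while a diff_branch path supplies new_value.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchProperty:
    property_type: str  # e.g. "HAS_VALUE" or "IS_VISIBLE"
    previous_value: Any = None
    new_value: Any = None


@dataclass
class SketchAttribute:
    name: str
    properties: dict[str, SketchProperty] = field(default_factory=dict)


@dataclass
class SketchNode:
    uuid: str
    attributes: dict[str, SketchAttribute] = field(default_factory=dict)


@dataclass
class SketchRoot:
    nodes: dict[str, SketchNode] = field(default_factory=dict)

    def add_path(
        self, node_uuid: str, attr_name: str, property_type: str, value: Any, on_diff_branch: bool
    ) -> None:
        """Fold one flat Node -> Attribute -> Value path into the hierarchy."""
        node = self.nodes.setdefault(node_uuid, SketchNode(uuid=node_uuid))
        attr = node.attributes.setdefault(attr_name, SketchAttribute(name=attr_name))
        prop = attr.properties.setdefault(property_type, SketchProperty(property_type=property_type))
        # Assumption: the base branch path gives the old value, the diff branch path the new one.
        if on_diff_branch:
            prop.new_value = value
        else:
            prop.previous_value = value


root = SketchRoot()
root.add_path("node-1", "name", "HAS_VALUE", "old-name", on_diff_branch=False)
root.add_path("node-1", "name", "HAS_VALUE", "new-name", on_diff_branch=True)
```

Both paths land on the same property object, so the two values end up side by side under one node and one attribute.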

Remaining work

This new logic is not actually used anywhere yet. We will need to replace existing diff logic with this new query and class and perhaps make changes along the way.

Schema Diff: our schemas are stored in the same manner as our data, so this should work just fine for schema diffs, and it should be pretty easy to identify which part of a given diff is schema vs data based on the kind of the nodes in the diff
Identify Conflicts: update existing conflict-checking logic to use the new classes
Serialization: new components to handle serializing the internal diff data classes for the API, including a summary
Merging: merge logic can be updated to use the new diff data structure
Caching and retrieving: we can save calculated diffs into cypher in a simple graph structure and then retrieve them
Combining Diff: new component to combine multiple diffs across a given timeframe to get the aggregated diff for that timeframe
Pagination: new component to paginate a diff for a given pair of branches and timeframe

cla-assistant · 2024-05-27T14:16:48Z

All committers have signed the CLA.

ogenstad

Left some minor comments, looks good to me though I'd probably need to spend a bit more time to understand everything.

ogenstad · 2024-07-12T12:50:28Z

backend/infrahub/api/diff/serializer.py

+
+class DiffSummarySerializer:
+    async def serialize(self, diff_root: DiffRoot) -> dict[str, Any]:  # pylint: disable=unused-argument
+        return {}


Will these classes be used anywhere?

I suppose I didn't really tie them into the rest of the skeleton
I've deleted them for now

ogenstad · 2024-07-12T12:53:28Z

backend/infrahub/core/diff/combiner.py

+
+class DiffCombiner:
+    def combine(self, earlier_diff: DiffRoot, later_diff: DiffRoot) -> DiffRoot:  # pylint: disable=unused-argument
+        return earlier_diff


I'm guessing this is mostly a placeholder for now.

yes there are a few stub classes in here for me/the team to fill in later

ogenstad · 2024-07-12T12:58:24Z

backend/infrahub/core/diff/coordinator.py

+            await self.diff_repo.save_diff_root(diff_root=calculated_diffs.diff_branch_diff)
+            missing_time_range_diffs.append(calculated_diffs.diff_branch_diff)
+        full_time_range_diffs = calculated_timeframe_diffs + missing_time_range_diffs
+        full_time_range_diffs.sort(key=lambda dr: dr.from_time)


Do we want to support this sort option on the objects themselves?

we might at some point, but right now it is just a list, so I think sorting it with the method is acceptable for now
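For reference, supporting the sort on the objects themselves could look roughly like the sketch below. TimedDiff is a hypothetical stand-in for DiffRoot; only the from_time field is taken from the snippet above, and the label field is invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(order=True)
class TimedDiff:
    """Hypothetical stand-in for DiffRoot that orders instances by from_time only."""

    from_time: datetime
    label: str = field(compare=False)  # excluded from ordering and equality


diffs = [
    TimedDiff(datetime(2024, 6, 3, tzinfo=timezone.utc), "third"),
    TimedDiff(datetime(2024, 6, 1, tzinfo=timezone.utc), "first"),
    TimedDiff(datetime(2024, 6, 2, tzinfo=timezone.utc), "second"),
]
ordered = sorted(diffs)  # no key function needed
```

One trade-off of this design is that equality then also compares from_time only, which is one reason keeping the explicit key function on the list may be the safer choice for now.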

ogenstad · 2024-07-12T13:00:03Z

backend/infrahub/core/diff/exceptions.py

@@ -0,0 +1,12 @@
+from neo4j.graph import Path


Potentially add a from __future__ import annotations and move this import into a TYPE_CHECKING block.

Not sure where it will be imported, but it's probably safer: if we later run agents that don't have access to the database, we don't want the neo4j module to be loaded at all.

good idea. made some updates to do this
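A minimal sketch of the suggested pattern follows. The exception name InvalidPathSketchError and its constructor are hypothetical and are not the contents of exceptions.py; the point is only that the neo4j import sits under TYPE_CHECKING, so it is seen by type checkers but never executed at runtime.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by type checkers; the neo4j module is not loaded at runtime.
    from neo4j.graph import Path


class InvalidPathSketchError(Exception):
    """Hypothetical exception that carries a cypher path for error reporting."""

    def __init__(self, cypher_path: Path, message: str | None = None) -> None:
        self.cypher_path = cypher_path
        super().__init__(message or f"Invalid path: {cypher_path!r}")
```

Because from __future__ import annotations keeps annotations as strings, the Path annotation is never resolved at runtime, so this module can be imported in a process where the neo4j driver is not installed.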

ajtmccarty added group/backend Issue related to the backend (API Server, Git Agent) type/housekeeping Maintenance task labels May 11, 2024

ajtmccarty requested a review from dgarros May 11, 2024 01:00

ajtmccarty force-pushed the 2834-diff-query branch from 76c2bc1 to 0861581 Compare May 13, 2024 23:56

ajtmccarty force-pushed the 2834-diff-query branch from 0861581 to 5fbfcda Compare June 6, 2024 01:51

ajtmccarty changed the base branch from develop to diff-refactor June 7, 2024 19:19

ajtmccarty force-pushed the 2834-diff-query branch from 5fbfcda to 2454472 Compare June 9, 2024 23:29

ajtmccarty force-pushed the 2834-diff-query branch from b29cca8 to 470cf90 Compare June 18, 2024 00:33

github-actions bot added type/documentation Improvements or additions to documentation group/frontend Issue related to the frontend (React) group/python-sdk group/sync-engine Issue related to the Synchronization engine group/ci Issue related to the CI pipeline labels Jun 18, 2024

ajtmccarty force-pushed the 2834-diff-query branch from 470cf90 to 109f160 Compare June 19, 2024 00:14

github-actions bot removed type/documentation Improvements or additions to documentation group/frontend Issue related to the frontend (React) group/sync-engine Issue related to the Synchronization engine group/ci Issue related to the CI pipeline labels Jun 19, 2024

ajtmccarty force-pushed the diff-refactor branch from 1d4a286 to 4df897b Compare June 19, 2024 13:47

ajtmccarty changed the base branch from diff-refactor to develop June 19, 2024 15:45

ajtmccarty force-pushed the 2834-diff-query branch from 109f160 to a56dcb6 Compare June 19, 2024 15:45

ajtmccarty changed the title ~~WIP new query for getting diff~~ New diff query and parser Jun 19, 2024

ajtmccarty marked this pull request as ready for review June 19, 2024 19:35

ajtmccarty requested a review from a team June 19, 2024 19:35

This was referenced Jun 19, 2024

Diff: verify new diff query work for schema changes #3697

Open

Diff: new component to identify diff conflicts #3698

Closed

Diff: component to serialize internal diff dataclasses #3699

Open

Diff: update merge logic to use new Diff components #3700

Open

This was referenced Jun 19, 2024

Diff: database storage for calculated diff #3701

Closed

Diff: ability to combine incremental diffs #3702

Open

ajtmccarty added 19 commits July 1, 2024 15:55

include properties in relationship delete query

f0fb1eb

fix reversed property edges

f891177

format

e698f61

fix reversed property edges

a3fd6dc

WIP new query for getting diff

f70245e

more WIP on new single diff query

d5b91d7

some updates to the DiffAllPathsQuery

17a4a43

move some directories and use an enum

535f077

WIP classes to work with new diff query and structure

5255a48

new diff internal structure and query parser

121d9ac

refactor diff query parser to set actions/timestamps during final pass

f8e9c57

another unit test

daa2687

initial support for relationships

6440ecf

full and correct support for relationships with unit tests

86ebc74

remove unused file

93e13b3

remove WIP code pieces

f3dd837

mangle query so that memgraph can use it

a7de821

format

5739989

fix unit test import

d124131

ajtmccarty force-pushed the 2834-diff-query branch from 3f4a3a1 to d124131 Compare July 1, 2024 22:55

ajtmccarty added 2 commits July 1, 2024 17:28

type fixes, class skeleton

6f2aecb

Merge branch 'develop' into 2834-diff-query

d1b76cb

ogenstad reviewed Jul 12, 2024

ajtmccarty added 3 commits July 15, 2024 11:50

Merge branch 'develop' into 2834-diff-query

4242087

small cleanups

82665b0

Merge branch 'develop' into 2834-diff-query

c527ad7

ajtmccarty merged commit 57ce8bb into develop Jul 16, 2024
45 checks passed

ajtmccarty deleted the 2834-diff-query branch July 16, 2024 16:40
