Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New diff query and parser #3280

Merged
merged 24 commits into from
Jul 16, 2024
Merged

New diff query and parser #3280

merged 24 commits into from
Jul 16, 2024

Conversation

ajtmccarty
Copy link
Contributor

@ajtmccarty ajtmccarty commented May 11, 2024

New query and Python parsing class for getting a diff for a given base_branch, diff_branch, from_time and to_time.

The Query

The goal is to get every "full path" (Root -> Node -> Attribute/Relationship -> Value/Other Node) that has been altered on one of the given branches in the given time frame. First, we identify every edge that should be included in the diff. This is actually pretty straightforward b/c we are just filtering edges based on branch, from, and to like we do in all our cypher queries. Determining which "full paths" to include in the response is somewhat harder and is what most of the lines in the query are working on.

There are only 6 basic paths we need to consider about for a given diff

(:Root)<-[:IS_PART_OF]-(:Node)-[:HAS_ATTRIBUTE]->(:Attribute)-[:HAS_VALUE]->(:AttributeValue)           //1
(:Root)<-[:IS_PART_OF]-(:Node)-[:HAS_ATTRIBUTE]->(:Attribute)-[:IS_VISIBLE|IS_PROTECTED]->(:Boolean)    //2
(:Root)<-[:IS_PART_OF]-(:Node)-[:HAS_ATTRIBUTE]->(:Attribute)-[:HAS_OWNER|HAS_SOURCE]->(:Node)          //3
(:Root)<-[:IS_PART_OF]-(:Node)-[:IS_RELATED]->(:Relationship)-[:IS_RELATED]-(:Node)                     //4
(:Root)<-[:IS_PART_OF]-(:Node)-[:IS_RELATED]-(:Relationship)-[:IS_VISIBLE|IS_PROTECTED]->(:Boolean)     //5
(:Root)<-[:IS_PART_OF]-(:Node)-[:IS_RELATED]-(:Relationship)-[:HAS_OWNER|HAS_SOURCE]->(:Node)           //6

Each path has 4 cypher nodes and 3 cypher edges in it. The query runs in this order

  1. MATCH (p:Node|Attribute|Relationship)-[diff_rel]->(q) get every cypher edge that is on one of the branches in the diff and has a change within the timeframe of the diff. This should be relatively fast b/c we've got indices from and branch on HAS_ATTRIBUTE and HAS_VALUE, but it might be worth adding more on the other edge types in the query: IS_PART_OF, IS_RELATED, etc.
  2. get the full paths with the deepest edges in the diff: this includes the HAS_VALUE, IS_VISIBLE, IS_PROTECTED, HAS_OWNER, and HAS_SOURCE. if the path identified is on the diff_branch, we need to make sure we include the latest base_branch version of that path so that we can correctly set previous_value and new_value. if the edge in question touches a Relationship node, we also need to get the far side of that relationship so that we can correctly set the peer_id of the relationship being updated in the diff.
  3. get the full paths that include a diff edge of type HAS_ATTRIBUTE or IS_RELATED. this includes getting all the paths with the latest edges in the given timeframe below the diff edge, preferring paths on the same branch as the diff_rel. For example, if a HAS_ATTRIBUTE edge is included in the diff, we would also want to include all the HAS_VALUE, IS_VISIBLE, IS_PROTECTED, HAS_SOURCE, and HAS_OWNER paths that connect to the HAS_ATTRIBUTE via the linked Attribute node
  4. get the full paths that include a diff edge of type IS_PART_OF. this includes getting all the paths with the latest edges in the given timeframe below the diff edge, preferring paths on the same branch as the diff_rel. For example, if a node is added or deleted on the branch, then IS_PART_OF will be included in our diff_rels and we need to get all the latest paths in the given timeframe with preference for the branch of the diff_rel

The Python parser

The DiffQueryParser class and its associated dataclasses ingest every "full path" returned by the query and turn them into a simple hierarchical datastructure (see dataclasses in core.diff.model.path). Hopefully, this simple internal data structure is enough to handle everything we want to do with the diff

Remaining work

This new logic is not actually used anywhere yet. We will need to replace existing diff logic with this new query and class and perhaps make changes along the way.

  • Schema Diff: our schema are stored in the same manner as our data, so this should work just fine for schema diffs and it should be pretty easy to identify which part of a given diff is schema vs data based on the kind of the nodes in the diff
  • Identify Conflicts: update existing conflict-checking logic to use the new classes
  • Serialization: new components to handle serializing the internal diff data classes for the API, including a summary
  • Merging: merge logic can be updated to use the new diff data structure
  • Caching and retrieving: we can save calculated diffs into cypher in a simple graph structure and then retrieve them
  • Combining Diff: new component to combine multiple diffs across a given timeframe to get the aggregated diff for that timeframe
  • Pagination: new component to paginate a diff for a given pair of branches and timeframe

@ajtmccarty ajtmccarty added group/backend Issue related to the backend (API Server, Git Agent) type/housekeeping Maintenance task labels May 11, 2024
@ajtmccarty ajtmccarty requested a review from dgarros May 11, 2024 01:00
Copy link

cla-assistant bot commented May 27, 2024

CLA assistant check
All committers have signed the CLA.

Copy link

cla-assistant bot commented May 27, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@ajtmccarty ajtmccarty changed the base branch from develop to diff-refactor June 7, 2024 19:19
@github-actions github-actions bot added type/documentation Improvements or additions to documentation group/frontend Issue related to the frontend (React) group/python-sdk group/sync-engine Issue related to the Synchronization engine group/ci Issue related to the CI pipeline labels Jun 18, 2024
@github-actions github-actions bot removed type/documentation Improvements or additions to documentation group/frontend Issue related to the frontend (React) group/sync-engine Issue related to the Synchronization engine group/ci Issue related to the CI pipeline labels Jun 19, 2024
@ajtmccarty ajtmccarty changed the base branch from diff-refactor to develop June 19, 2024 15:45
@ajtmccarty ajtmccarty changed the title WIP new query for getting diff New diff query and parser Jun 19, 2024
@ajtmccarty ajtmccarty marked this pull request as ready for review June 19, 2024 19:35
@ajtmccarty ajtmccarty requested a review from a team June 19, 2024 19:35
Copy link
Contributor

@ogenstad ogenstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments, looks good to me though I'd probably need to spend a bit more time to understand everything.


class DiffSummarySerializer:
async def serialize(self, diff_root: DiffRoot) -> dict[str, Any]: # pylint: disable=unused-argument
return {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will these classes be used anywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I didn't really tie them into the rest of the skeleton
I've deleted them for now


class DiffCombiner:
def combine(self, earlier_diff: DiffRoot, later_diff: DiffRoot) -> DiffRoot: # pylint: disable=unused-argument
return earlier_diff
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this is mostly a placeholder for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes there are a few stub classes in here for me/the team to fill in later

await self.diff_repo.save_diff_root(diff_root=calculated_diffs.diff_branch_diff)
missing_time_range_diffs.append(calculated_diffs.diff_branch_diff)
full_time_range_diffs = calculated_timeframe_diffs + missing_time_range_diffs
full_time_range_diffs.sort(key=lambda dr: dr.from_time)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to support this sort option on the objects themselves?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might at some point, but right now it is just a list, so I think sorting it with the method is acceptable for now

@@ -0,0 +1,12 @@
from neo4j.graph import Path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially add a from __future__ import annotations and move this import into a TYPECHECKING block.

Not sure where it will be imported but it's probably better to be sure if we later run agents that doesn't have access to the database then we don't want the neo4j module to be loaded at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea. made some updates to do this

@ajtmccarty ajtmccarty merged commit 57ce8bb into develop Jul 16, 2024
45 checks passed
@ajtmccarty ajtmccarty deleted the 2834-diff-query branch July 16, 2024 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
group/backend Issue related to the backend (API Server, Git Agent) type/housekeeping Maintenance task
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants