ensure properties for each node type are consistent #54

Open
bhpayne opened this issue Jan 20, 2025 · 2 comments

bhpayne commented Jan 20, 2025

Is there a way to ensure the consistency of the graph?

  • does each node type (expression, derivation, step) have the same properties? (e.g., every expression should have name_latex present.)
  • are the values for each property consistent? (e.g., a property expected to hold true or false should not also contain a value like elephant.)
  • do edges connect the right nodes? (e.g., expression HAS_INPUT expression is not a valid edge.)
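
The second and third checks could be sketched in plain Python once nodes and edges have been read out of the graph. This is only a sketch under stated assumptions: the `ALLOWED_VALUES` and `ALLOWED_EDGES` tables, and the property name `is_reviewed`, are hypothetical placeholders rather than the project's actual schema. Only the invalid `expression HAS_INPUT expression` edge comes from the list above.

```python
# Hypothetical schema tables; the real allowed values and edge types are assumptions.
ALLOWED_VALUES = {"is_reviewed": {"true", "false"}}  # hypothetical property name
ALLOWED_EDGES = {("step", "HAS_INPUT", "expression")}  # hypothetical valid edge


def invalid_values(nodes):
    """Return (node id, property, value) for every value outside its allowed set."""
    problems = []
    for node in nodes:
        for prop, allowed in ALLOWED_VALUES.items():
            if prop in node and node[prop] not in allowed:
                problems.append((node.get("id"), prop, node[prop]))
    return problems


def invalid_edges(edges):
    """Return the (source label, relationship type, target label) triples not in the schema."""
    return [edge for edge in edges if edge not in ALLOWED_EDGES]


# Made-up sample data for illustration only.
nodes = [
    {"id": "1", "is_reviewed": "true"},
    {"id": "2", "is_reviewed": "elephant"},
]
edges = [
    ("step", "HAS_INPUT", "expression"),
    ("expression", "HAS_INPUT", "expression"),
]
print(invalid_values(nodes))
print(invalid_edges(edges))
```

Keeping the schema in two small tables means new node types or relationship types only need a new entry, not new checking code.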
bhpayne added the enhancement (New feature or request) label Jan 20, 2025
bhpayne self-assigned this Jan 20, 2025
bhpayne moved this to "Todo - not urgent" in v8_flask_neo4j Jan 20, 2025

bhpayne commented Jan 20, 2025

We can check that nodes of a given type at least have the same properties present using something like

MATCH (n:expression)
UNWIND keys(n) AS propertyName
RETURN propertyName, count(*) AS count
ORDER BY count DESC

and then manually check that every property has the same count, which should equal the number of expression nodes.
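
The manual comparison could be automated along these lines. This is a sketch, not the project's code: the function name `inconsistent_properties` is hypothetical, and `session` is assumed to behave like a session from the official Neo4j Python driver, where `run()` returns an iterable of records and `single()` returns one record.

```python
import re

COUNT_QUERY = (
    "MATCH (n:{label}) "
    "UNWIND keys(n) AS propertyName "
    "RETURN propertyName, count(*) AS count "
    "ORDER BY count DESC"
)
TOTAL_QUERY = "MATCH (n:{label}) RETURN count(n) AS total"


def inconsistent_properties(session, label):
    """Return {propertyName: count} for properties missing from at least one node."""
    # The label is formatted into the query text, so validate it first.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label):
        raise ValueError(f"unexpected label: {label!r}")
    total = session.run(TOTAL_QUERY.format(label=label)).single()["total"]
    rows = session.run(COUNT_QUERY.format(label=label))
    # A property is consistent only if it appears on every node with this label.
    return {r["propertyName"]: r["count"] for r in rows if r["count"] != total}
```

An empty result means every property is present on every node of that type. With the driver, `session` would typically come from `driver.session()`.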

Source: Gemini 2.0 Flash Thinking Experimental.

Prompt:

Cypher is a query language for Neo4j, a property graph database.

The Cypher query

MATCH (n:expression) WHERE n.id='1114820451' RETURN n

matches nodes where the property id has a specific value.

A node may have more than one property.

Provide a Cypher query that counts how many nodes have each property present.

Thoughts:

Let's break down the thought process to arrive at the Cypher query for counting node properties.

1. Understanding the Goal:

The core objective is to find out, for each distinct property name across all nodes, how many nodes possess that property. This immediately suggests we need to group by the property name.

2. Accessing Node Properties:

How do we access the properties of a node in Cypher? We can use a pattern like (n) to match all nodes. To get the names of the properties, the keys(n) function is the key. This function returns a list of strings representing the property names.

3. Handling Multiple Properties Per Node:

A node can have many properties. If we just used keys(n), each node would contribute a list of property names. We need to process these lists individually. This is where UNWIND comes in. UNWIND takes a list and turns each element of the list into a separate row. So, UNWIND keys(n) AS propName will iterate through the property names of each node n.

4. Grouping by Property Name:

Now that we have individual property names in each row (thanks to UNWIND), we can group them. Cypher has no GROUP BY clause; grouping is implicit. When a RETURN (or WITH) clause contains an aggregation such as count(*), the non-aggregated expressions in that clause, here propName, act as the grouping keys.

5. Counting Within Each Group:

After grouping, we need to count the number of times each property name appears. The count(*) aggregation function does exactly this within each group.

6. Returning the Results:

Finally, we need to specify what we want to return. We want the property name and its count, so RETURN propName, count(*) is the correct RETURN clause.

Putting it all Together (Initial Draft):

Based on the above steps, a first attempt at the query might look like this:

MATCH (n)
UNWIND keys(n) AS propName
RETURN propName, count(*)

7. Considering Empty Property Sets (Refinement):

What happens if a node has no properties? keys(n) would return an empty list. UNWIND on an empty list results in no rows being generated for that node. This is the desired behavior – nodes without properties shouldn't contribute to the counts.

8. Considering Ordering (Optional Refinement):

The order of the results isn't explicitly requested, but it's often helpful to see the counts sorted. Adding an ORDER BY clause improves readability. Sorting by count in descending order seems like a reasonable default.

9. Final Query:

Incorporating the optional ordering leads to the final, refined query:

MATCH (n)
WITH keys(n) AS propertyKeys
UNWIND propertyKeys AS propertyName
RETURN propertyName, count(*) AS count
ORDER BY count DESC

Alternative (Less Efficient) Approach and Why it's not Preferred:

One might initially think of something like:

MATCH (n)
RETURN keys(n), count(n) // Incorrect

This is wrong because keys(n) returns the whole list of property names, and that list becomes the grouping key. The query therefore counts nodes per distinct list of property names, not per individual property. We need to break down the property lists using UNWIND.
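
The difference between the two groupings can be emulated in plain Python. The sample nodes below are made up for illustration, and `Counter` stands in for Cypher's implicit grouping; this is an analogy, not how Neo4j executes the query.

```python
from collections import Counter

# Hypothetical nodes, represented as property dictionaries.
sample_nodes = [
    {"id": "1", "name_latex": "a"},
    {"id": "2", "name_latex": "b"},
    {"id": "3"},
]

# Analogue of RETURN keys(n), count(n): one group per distinct key list.
per_key_list = Counter(tuple(node.keys()) for node in sample_nodes)

# Analogue of UNWIND keys(n) followed by count(*): one group per property name.
per_property = Counter(key for node in sample_nodes for key in node)

print(per_key_list)
print(per_property)
```

The first counter has one entry per combination of properties, while the second has one entry per property, which is what the question asks for.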

Another potential (and less efficient) approach might involve iterating through all possible property names and then counting nodes with that specific property. However, this requires knowing all possible property names in advance, which is not practical for a general solution. The UNWIND approach is dynamic and works regardless of the specific properties present.

This detailed thought process illustrates how to break down the problem, leverage specific Cypher functions, and refine the solution to achieve the desired outcome.

Response

MATCH (n)
UNWIND keys(n) AS propertyName
RETURN propertyName, count(*) AS count
ORDER BY count DESC

Explanation:

  1. MATCH (n): This matches all nodes in the graph and assigns each node to the variable n.
  2. UNWIND keys(n) AS propertyName:
    • keys(n) returns a list of the property names of the current node n.
    • UNWIND takes this list and creates a separate row for each element in the list. So, if a node has properties {'id': '123', 'name': 'John'} , this part will generate two rows: one with propertyName as 'id' and another with propertyName as 'name'.
  3. RETURN propertyName, count(*) AS count:
    • Cypher has no GROUP BY clause. Grouping is implicit: because the RETURN clause contains the aggregate count(*), the non-aggregated expression propertyName becomes the grouping key.
    • count(*) counts the number of rows within each group (i.e., the number of nodes that have a specific propertyName).
    • AS count assigns the alias "count" to the result of the count(*) aggregation.
  4. ORDER BY count DESC: This sorts the results in descending order based on the count, showing the properties present in the most nodes first.

How it Works:

The query iterates through each node, extracts its property names, and then "unwinds" those names into individual rows. This allows us to count the occurrences of each property name across all nodes.

Example:

Let's say you have the following nodes:

Node 1: {id: 'A', name: 'Alice'}
Node 2: {name: 'Bob', age: 30}
Node 3: {id: 'C', city: 'New York'}
Node 4: {name: 'David'}

The query would produce the following output:

propertyName | count
-------------|-------
name         | 3
id           | 2
age          | 1
city         | 1

This indicates that the property "name" is present in 3 nodes, "id" in 2 nodes, and "age" and "city" each in 1 node.
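
The same example can be reproduced outside the database as a quick sanity check. This is a plain-Python emulation of the query's logic using the four example nodes above, not a call to Neo4j.

```python
from collections import Counter

# The four example nodes, represented as property dictionaries.
example_nodes = [
    {"id": "A", "name": "Alice"},
    {"name": "Bob", "age": 30},
    {"id": "C", "city": "New York"},
    {"name": "David"},
]

# "UNWIND keys(n)" becomes the inner loop; "count(*)" becomes the Counter.
counts = Counter(key for node in example_nodes for key in node.keys())

# "ORDER BY count DESC" corresponds to most_common().
for property_name, count in counts.most_common():
    print(f"{property_name} | {count}")
```

In Cypher the relative order of properties with equal counts is not specified by this query, so only the counts themselves should be relied on.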


bhpayne commented Jan 20, 2025

By design, not every property is present on every node. For example,

MATCH (n:value_with_units)
UNWIND keys(n) AS propertyName
RETURN propertyName, count(*) AS count
ORDER BY count DESC

returns

<Record propertyName='id' count=20>
<Record propertyName='number_decimal' count=20>
<Record propertyName='number_power' count=20>
<Record propertyName='note_latex' count=20>
<Record propertyName='author_name_latex' count=20>
<Record propertyName='dimension_length_unit' count=10>
<Record propertyName='dimension_time_unit' count=6>
<Record propertyName='dimension_mass_unit' count=6>
<Record propertyName='dimension_temperature_unit' count=2>
<Record propertyName='dimension_amount_of_substance_unit' count=2>
<Record propertyName='dimension_charge_unit' count=1>

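which is appropriate: the five core properties appear on all 20 nodes, while each dimension_*_unit property appears only on some of them.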
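
A check that tolerates optional properties might look like the following sketch. The split into required and optional properties is an assumption inferred from the counts above, not a documented schema, and `check_counts` is a hypothetical helper name.

```python
# Counts copied from the query output above.
records = [
    ("id", 20),
    ("number_decimal", 20),
    ("number_power", 20),
    ("note_latex", 20),
    ("author_name_latex", 20),
    ("dimension_length_unit", 10),
    ("dimension_time_unit", 6),
    ("dimension_mass_unit", 6),
    ("dimension_temperature_unit", 2),
    ("dimension_amount_of_substance_unit", 2),
    ("dimension_charge_unit", 1),
]
TOTAL_NODES = 20  # assumption: inferred from the count for id
# Assumption: properties seen on all nodes are required.
REQUIRED = {"id", "number_decimal", "number_power", "note_latex", "author_name_latex"}
OPTIONAL_PREFIX = "dimension_"  # assumption: these properties are optional by design


def check_counts(records, total, required, optional_prefix):
    """Return a list of human-readable problems; an empty list means consistent."""
    seen = dict(records)
    problems = []
    for name in sorted(required):
        if seen.get(name, 0) != total:
            problems.append(f"required property {name} on {seen.get(name, 0)} of {total} nodes")
    for name in seen:
        if name not in required and not name.startswith(optional_prefix):
            problems.append(f"unexpected property {name}")
    return problems


print(check_counts(records, TOTAL_NODES, REQUIRED, OPTIONAL_PREFIX))
```

Required properties must appear on every node, optional ones may appear on any number of nodes, and anything else is flagged as unexpected.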
