File Size Limit for .graphml import? #447

Open
MikeB2019x opened this issue Jun 28, 2023 · 8 comments

Comments

@MikeB2019x

Expected Behavior

I have been using the following command to import .graphml files into Neo4j:

CALL apoc.import.graphml("xxx.graphml", {readLabels: true, storeNodeIds:true})

This has worked in the past with .graphml files up to 1 GB in size.

Actual Behavior

I've recently had to work with larger .graphml files. Importing a 3 GB file proceeded without error, but while all the nodes were imported, only half the edges were. No error or warning was raised.

Note that a .graphml file is an XML document that begins with meta info, followed by node info, followed by edge info. Since the import stops midway through the edge info, I'm wondering whether there is a setting or limit on the number of lines or the size of the .graphml file?

How to Reproduce the Problem

  1. Create a large graph in networkx (4.6M nodes with 20 float attributes each, 4.8M edges).
  2. Export it as a .graphml file.
  3. Import the .graphml file into Neo4j using the command above.

Versions

  • OS: Mac Pro M1 w/ Ventura 13.3.1 (a)
  • Neo4j: 4.4.0 (community)
  • Neo4j-Apoc: 4.4.0.1
@gem-neo4j
Contributor

Hi! I tried this out and did run into issues with a larger file, although in my case I could see an OOM in my logs (have you checked the logs? Perhaps you have this too). The fix for me was to adjust the batchSize in the config:

CALL apoc.import.graphml("xxx.graphml", {readLabels: true, storeNodeIds:true, batchSize: 100})

I am unsure of the optimum number here, but the default is 20,000, so I imagine 100 was extremely low 😅

I'll also ticket this to see if we can either make performance improvements or at least throw an exception instead of crashing the query!

Let me know if this helps :)

@MikeB2019x
Author

MikeB2019x commented Jun 29, 2023

Thank you for the 'batch size' tip; it will be useful because the next batch of files will be larger.

Note, my situation is slightly different as there is no error; it's just that half the edges are ignored/not imported. For example, the graphml contains 4M nodes and 4M edges, but after an error-free upload the Neo4j db shows 4M nodes and 2M edges. I checked whether the graphml had duplicate edges, but that is not the case.

@gem-neo4j
Contributor

The logs are in debug.log :)

The file is imported in batches of transactions, so if all the edges come last in the file, the import could crash before it reaches them, while the transactions that already committed the nodes stay committed. That might explain the discrepancy.
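
To illustrate that explanation, here is a toy Python model of a batched import that dies partway through. It is only a sketch of the reasoning, not APOC's actual code: `import_in_batches`, `fail_at`, and the sample items are all made up for the example.

```python
def import_in_batches(items, batch_size, fail_at=None):
    """Commit items in batches and return whatever was committed.

    fail_at is a hypothetical index at which the import dies (e.g. an OOM).
    """
    committed = []
    batch = []
    for index, item in enumerate(items):
        if fail_at is not None and index == fail_at:
            # Simulated crash: the open batch is lost, earlier commits stay.
            return committed
        batch.append(item)
        if len(batch) == batch_size:
            committed.extend(batch)  # "commit" this transaction
            batch = []
    committed.extend(batch)  # commit the final, possibly partial, batch
    return committed


# A file laid out like GraphML: all nodes first, then all edges.
items = [("node", i) for i in range(6)] + [("edge", i) for i in range(6)]
result = import_in_batches(items, batch_size=4, fail_at=9)
nodes = sum(1 for kind, _ in result if kind == "node")
edges = sum(1 for kind, _ in result if kind == "edge")
print(nodes, edges)  # 6 2
```

In this model every node survives, but only the edges from batches committed before the failure do, which is the kind of node/edge mismatch described above.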

@MikeB2019x
Author

MikeB2019x commented Jun 30, 2023

Yeah, found them =D This is the log from executing the import, with no batchSize set, into an empty db. Nothing in it indicates an error to me, or am I missing something?

2023-06-30 03:16:54.710+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1241, gcTime=1272, gcCount=2}
2023-06-30 03:17:01.769+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.228] version=227, last transaction in previous log=25471, rotation took 50 millis.
2023-06-30 03:17:12.914+0000 INFO  [o.n.c.i.ExecutionEngine] [neo4j/bcb61400] Discarded stale query from the query cache after 861 seconds. Reason: NodesAllCardinality changed from 10.0 to 599999.0, which is a divergence of 0.9999833333055556 which is greater than threshold 0.614853273406578. Query id: 83
2023-06-30 03:17:18.541+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.229] version=228, last transaction in previous log=25474, rotation took 67 millis, started after 16705 millis.
2023-06-30 03:17:23.979+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=323, gcTime=404, gcCount=1}
2023-06-30 03:17:36.260+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.230] version=229, last transaction in previous log=25477, rotation took 67 millis, started after 17651 millis.
2023-06-30 03:17:53.573+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.231] version=230, last transaction in previous log=25480, rotation took 68 millis, started after 17246 millis.
2023-06-30 03:18:11.850+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.232] version=231, last transaction in previous log=25483, rotation took 85 millis, started after 18192 millis.
2023-06-30 03:18:12.934+0000 INFO  [o.n.c.i.ExecutionEngine] [neo4j/bcb61400] Discarded stale query from the query cache after 59 seconds. Reason: NodesAllCardinality changed from 599999.0 to 2599999.0, which is a divergence of 0.7692310650888712 which is greater than threshold 0.7404586799070909. Query id: 92
2023-06-30 03:18:33.509+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.233] version=232, last transaction in previous log=25486, rotation took 118 millis, started after 21541 millis.
2023-06-30 03:18:56.764+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.234] version=233, last transaction in previous log=25489, rotation took 128 millis, started after 23127 millis.
2023-06-30 03:19:21.378+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.235] version=234, last transaction in previous log=25492, rotation took 104 millis, started after 24509 millis.
2023-06-30 03:19:51.403+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.236] version=235, last transaction in previous log=25497, rotation took 179 millis, started after 29846 millis.
2023-06-30 03:19:53.004+0000 INFO  [o.n.c.i.ExecutionEngine] [neo4j/bcb61400] Discarded stale query from the query cache after 99 seconds. Reason: CardinalityByLabelsAndRelationshipType(None,None,None) changed from 1.0 to 722148.0, which is a divergence of 0.9999986152423049 which is greater than threshold 0.7329806490780797. Query id: 107
2023-06-30 03:20:16.987+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.237] version=236, last transaction in previous log=25503, rotation took 270 millis, started after 25315 millis.

@gem-neo4j
Contributor

Hmm okay, how does the query log look for it? Also, did it work when you tried the batchSize? I can't reproduce a case where it just misses the relationships 🙈

@MikeB2019x
Author

MikeB2019x commented Jun 30, 2023

Thank you for the replies! Yes, the process works when using batchSize, but I get the same result, i.e. half the edges but no error. Note that I have confirmed the .graphml file is correct: if I open it in networkx, all nodes and edges are present.
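
For files too large to load comfortably, the same sanity check can be sketched with Python's standard library alone, by streaming the XML and counting `<node>` and `<edge>` elements. This is a minimal sketch: `count_graphml_elements` is a made-up helper name, and the inline sample document is invented test data standing in for a real file path.

```python
import io
import xml.etree.ElementTree as ET


def count_graphml_elements(source):
    """Stream a GraphML document and count its <node> and <edge> elements."""
    counts = {"node": 0, "edge": 0}
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop the XML namespace, e.g. "{http://graphml.graphdrawing.org/xmlns}edge".
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag in counts:
            counts[tag] += 1
            elem.clear()  # discard the element's children and attributes to limit memory use
    return counts


# Tiny invented sample; pass a file path instead to check a real export.
sample = io.BytesIO(b"""<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <node id="node_invoice__204779" />
    <node id="node_payments__200180" />
    <edge source="node_invoice__204779" target="node_payments__200180" />
  </graph>
</graphml>
""")
sample_counts = count_graphml_elements(sample)
print(sample_counts)  # {'node': 2, 'edge': 1}
```

Comparing these counts with `count(n)` and a directed `count(e)` in Neo4j shows whether anything was actually dropped during import.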

@MikeB2019x
Author

Okay, here's what happened. This is an example of the edges as represented in the .graphml:

    <edge source="node_invoice__204779" target="node_payments__200180" />

Notice there is no label. The edge list that was used does not have any relationship name; it just specifies the two nodes. After import, this is what was appearing in the browser:
[image]
The number is exactly half of the number of edges. What looks to have happened is that a label has been added during or after import, but only to a subset of the edges. If I click on one of the edges I get:
[image]
I had noticed the 'related' tag but didn't think it through, as I assumed it had been applied to all the edges. So:

MATCH()-[e]->() RETURN count(e)

Returns 2323193, which matched what I was seeing. But when I used:

MATCH ()-[e]-() return count(e)
MATCH ()-[e:RELATED]-() return count(e)

both return 4646386. Why would there be a difference between these queries? I expected them all to return the same value. And even if there was a difference, I would have expected the UI to show the result of the last two queries. Thoughts?

@gem-neo4j
Copy link
Contributor

The "RELATED" type is added as every relationship must have one type, and if none is specified APOC adds that generic one.

The reason those 2 queries return double the amount is that they return every relationship twice. A pattern with no direction matches each relationship once from each end node, i.e. as (a)-->(b) and again as (b)<--(a). If you only want one of each, you need to add a direction :)
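
The doubling can be mimicked in a few lines of Python. This is only a toy model of the matching behaviour described above, with made-up relationship data, not how Neo4j evaluates the pattern internally.

```python
# Toy directed relationships as (start, end) pairs; the names are made up.
relationships = [("a", "b"), ("b", "c"), ("c", "a")]

# Directed pattern ()-[e]->(): each relationship matches once.
directed_matches = [(start, end) for start, end in relationships]

# Undirected pattern ()-[e]-(): each relationship matches once per orientation,
# with the two end nodes swapped.
undirected_matches = []
for start, end in relationships:
    undirected_matches.append((start, end))
    undirected_matches.append((end, start))

print(len(directed_matches), len(undirected_matches))  # 3 6
```

The numbers in this thread fit the same pattern: 2 × 2323193 = 4646386, so the directed count is the real number of relationships in the database.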
