I use a graph to accumulate pages while I crawl. The crawler is a stream of pages: an above page (the previous one) and a sub page (the current one). I have to record this relationship so I can find paths between pages later.
I noticed that past the nth node it becomes impossible to compute paths. Literally: when my crawler reached its goal and needed to collect the paths from page A to page B, it ran all night and produced no result (the calculation was definitely still going, judging by the load).
So I decided to log every thousandth relationship recorded and try to find all paths from the starting page to the current page (I record page-to-subpage relationships).
The results are as follows (relation of a sub page to its above page -> time to find all paths from the first page to that sub page):
I used this to record the relationship, where abv and sub are structs, and abv_ref and sub_ref are strings.
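The original recording snippet did not survive above, so here is a minimal sketch of what recording such edges with libgraph typically looks like. This is an assumption, not the author's actual code; the URLs are illustrative, standing in for `abv_ref` and `sub_ref`:

```elixir
# Record each "above page -> sub page" relationship as a directed edge.
# The vertex values here are the URL strings described above.
graph =
  Graph.new(type: :directed)
  |> Graph.add_edge("https://example.com/", "https://example.com/a")
  |> Graph.add_edge("https://example.com/a", "https://example.com/b")

# Later, enumerate every path between two pages:
Graph.get_paths(graph, "https://example.com/", "https://example.com/b")
#=> [["https://example.com/", "https://example.com/a", "https://example.com/b"]]
```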
Tested like this:
```elixir
IO.puts("Task now: #{x}")

if rem(x, 1000) === 0 do
  IO.puts("Trying to get all paths from the start to current urls...")
  {d, result} = :timer.tc(fn -> Graph.get_paths(graph, start_ref, sub_ref) end)
  IO.puts("Got result in #{d}µs (#{div(d, 1_000_000)}s): #{inspect(result)}")
end
```
Where result is a list of strings, and start_ref is a string as well.
I don't know whether my problem is unique or a general property of libgraph. Is there a benchmark that tests such a large number of relations? Send links if available; if not, my suggestion is to make one.
centuriononon changed the title from "Infinity complexity" to "Unexpectedly endless pathfinding from the nth size." on Nov 15, 2023
The get_paths function performs a depth-first traversal of the entire graph in order to discover all the paths, so it is expected that it would get increasingly expensive as the size of the graph increases, but 11k nodes should not be a problem. I've certainly worked with larger graphs with this library.
That said, it isn't clear to me what the structure of your graph is: How dense is the graph? What is the depth vs breadth? How long are the paths you expect to observe on average? All of this will help discover where the bug here is.
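One way to answer these density questions is to report basic structural stats for the graph. A sketch, assuming libgraph's `Graph.info/1` (which returns vertex and edge counts in recent versions); the `GraphStats` module name is hypothetical:

```elixir
# Rough structural stats: vertex count, edge count, and average out-degree,
# which together hint at how dense the crawl graph is.
defmodule GraphStats do
  def report(graph) do
    %{num_vertices: v, num_edges: e} = Graph.info(graph)
    avg = if v > 0, do: e / v, else: 0.0
    IO.puts("vertices=#{v} edges=#{e} avg_out_degree=#{Float.round(avg, 2)}")
  end
end

GraphStats.report(Graph.new() |> Graph.add_edge(:a, :b) |> Graph.add_edge(:a, :c))
```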
> The get_paths function performs a depth-first traversal of the entire graph in order to discover all the paths, so it is expected that it would get increasingly expensive as the size of the graph increases, but 11k nodes should not be a problem. I've certainly worked with larger graphs with this library.
Okay, I see.
> That said, it isn't clear to me what the structure of your graph is: How dense is the graph? What is the depth vs breadth? How long are the paths you expect to observe on average? All of this will help discover where the bug here is.
I am going to reproduce the same conditions I had in that case: multiple layers in depth and a large number of nodes in breadth (around 200-1000 per node). So I would say we can create a similar structure manually, under controlled conditions, within a test.
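A sketch of such a synthetic structure, with a `Layered` module name and small sizes chosen for illustration: a single start vertex followed by `depth` layers of `breadth` vertices each, every vertex fully connected to the next layer. The number of distinct paths from the start to a vertex in layer d is breadth^(d - 1), which is why enumerating all paths can run for a very long time even on a modest node count:

```elixir
# Build a layered DAG: :start, then `depth` layers of `breadth` vertices,
# with each layer fully connected to the next. Path counts grow as
# breadth^(layer - 1), so all-paths enumeration blows up exponentially.
defmodule Layered do
  def build(depth, breadth) do
    layers = for d <- 1..depth, do: for(i <- 1..breadth, do: {:v, d, i})

    graph =
      Enum.reduce(List.first(layers), Graph.new(), fn v, g ->
        Graph.add_edge(g, :start, v)
      end)

    layers
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.reduce(graph, fn [above, below], g ->
      for a <- above, b <- below, reduce: g do
        g -> Graph.add_edge(g, a, b)
      end
    end)
  end
end

# With depth 3 and breadth 5 there are 5 * 5 = 25 paths from :start to any
# vertex in the last layer; with breadth 500 it is already 250_000.
graph = Layered.build(3, 5)
IO.puts(length(Graph.get_paths(graph, :start, {:v, 3, 1})))
```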
It may take a while, but once I get started and have some initial results, I'll let you know!
Machine: Intel Core i5-8250U (1.60GHz), 16 GB RAM
OS: Ubuntu 22.04
Elixir version: Elixir 1.15.0-rc.0
OTP version: 25.0
libgraph version: 0.16.0