Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor performance of Graph.delete_vertex/2 in large graphs #80

Open
bartblast opened this issue Apr 9, 2024 · 8 comments · May be fixed by #84
Open

Poor performance of Graph.delete_vertex/2 in large graphs #80

bartblast opened this issue Apr 9, 2024 · 8 comments · May be fixed by #84

Comments

@bartblast
Copy link

I'm getting ~18 ms on vertex deletes in large graphs, is this expected? And if so, is it theoretically possible to do anything about it?

@stevensonmt
Copy link
Contributor

stevensonmt commented Jan 10, 2025

Large graphs in terms of number of vertices, number of edges, or both? Looking at the relevant code I would think just number of nodes should not cause an issue, but number of edges would be more impactful.

  def delete_vertex(
        %__MODULE__{out_edges: oe, in_edges: ie, edges: em, vertex_identifier: vertex_identifier} =
          g,
        v
      ) do
    vs = g.vertices
    ls = g.vertex_labels

    with v_id <- vertex_identifier.(v),
         true <- Map.has_key?(vs, v_id),
         oe <- Map.delete(oe, v_id),
         ie <- Map.delete(ie, v_id),
         vs <- Map.delete(vs, v_id),
         ls <- Map.delete(ls, v_id) do
      oe = for {id, ns} <- oe, do: {id, MapSet.delete(ns, v_id)}, into: %{}
      ie = for {id, ns} <- ie, do: {id, MapSet.delete(ns, v_id)}, into: %{}
      em = for {{id1, id2}, _} = e <- em, v_id != id1 && v_id != id2, do: e, into: %{}
      %__MODULE__{g | vertices: vs, vertex_labels: ls, out_edges: oe, in_edges: ie, edges: em}
    else
      _ -> g
    end
  end

Style-wise and performance-wise (though a tiny bit) I think pattern matching on the vertices and vertex_labels fields the way all other fields are matched in the function head would be better. For each vertex deleted it traverses 3 collections of edges (out edges, in edges, and edges). At some point that gets pricy. Can probably do a single pass over em, then reduce into {oe, ie, em} and you only have to MapSet.delete(ns, v_id) for ids you know were an edge with the deleted vertex.
Something like:

  def delete_vertex(
        %__MODULE__{vertices: vs, vertex_labels: ls, out_edges: oe, in_edges: ie, edges: em, vertex_identifier: vertex_identifier} =
          g,
        v
      ) do
   
    with v_id <- vertex_identifier.(v),
         true <- Map.has_key?(vs, v_id),
         oe <- Map.delete(oe, v_id),
         ie <- Map.delete(ie, v_id),
         vs <- Map.delete(vs, v_id),
         ls <- Map.delete(ls, v_id) do
    {em, oe, ie} = 
        em
        |> Enum.reduce({em, oe, ie}, fn 
                                     {{v_id, id2} = k, _}, {e, o, i} -> {Map.delete(e, k), o, Map.update!(i, id2, fn ns -> MapSet.delete(ns, v_id) end)}
                                     {{id1, v_id} = k, _}, {e, o, i} -> {Map.delete(e, k), Map.update!(o, id1, fn ns -> MapSet.delete(ns, v_id) end), i} 
                                      _, acc -> acc 
                       end)

      %__MODULE__{g | vertices: vs, vertex_labels: ls, out_edges: oe, in_edges: ie, edges: em}
    else
      _ -> g
    end
  end

Alternatively change the data structure to just %__MODULE__{edges: %{v_id => out: %MapSet{}, in: %MapSet{}}} then retrieving all edges has to combine the out and in fields but calling all the out and in edges independently should be equivalent to current. Updating it one pass when deleting a vertex becomes simpler using Map.pop(edges, v_id) to get the sets of out and in vertices to visit and remove v_id. Not sure how changing that data structure would complicate the rest of the lib though.

EDIT
Just did a dirty test in iex with my fork implementing the above.

g = Graph.new()
vs = 1..1_000_000 |> Enum.to_list()
vs |> Enum.each(fn i -> Graph.add_vertex(g, i) end)
vs |> Enum.zip(Enum.reverse(vs)) |> Enum.each(fn {i,o} -> Graph.add_edge(g, i, o) end)
vs |> Enum.zip(Enum.reverse(vs)) |> Enum.each(fn {i,o} -> Graph.add_edge(g, o, i) end)
:timer.tc(fn -> 1..1_000_000//7 |> Enum.each(fn i -> Graph.delete_vertex(g, i) end) end, :millisecond)

Weirdly it returns almost the exact same time for the current version and for my change. I think my test is flawed. 121-128ms current version, 118-125ms my version over ten runs of the delete call. Not sure if my example is too simplistic, not enough data, or I've just misdiagnosed the problem.

@stevensonmt
Copy link
Contributor

stevensonmt commented Jan 11, 2025

So the issue was the test I was using. With the following I get 25-30x speedup using my version versus the previous version:

g = (Graph.new() |> Graph.add_edges(Enum.map(1..1_000_000, fn i -> {i, 1_000_000 - i} end)))
:timer.tc(fn -> g |> Graph.delete_vertex(Enum.random(1..1_000_000)) end, :millisecond)
# {45, #Graph<type: directed, num_vertices: 999896, num_edges: 999998>}
:timer.tc(fn -> g |> Graph.delete_vertex_old(Enum.random(1..1_000_000)) end, :millisecond)
# {1228, #Graph<type: directed, num_vertices: 999896, num_edges: 999998>}

For graphs of 1000 nodes it shows ~8-10x speedup over the current implementation.
For graphs of 100 nodes it shows ~3-4x speedup.
For graphs of 10 nodes there is almost no significant difference, maybe 1.1-1.2x improvement.

PR submitted #84

@stevensonmt
Copy link
Contributor

I think there might be a similar performance issue with replace_vertex for the same reason. Will open new issue.

@bartblast
Copy link
Author

Large graphs in terms of number of vertices, number of edges, or both?

@stevensonmt The performance issue (~18ms for vertex deletes) is occurring on Hologram's call graph when installed in an application - so it's dealing with a large number of nodes (MFAs) and edges (function calls) in total.

@bartblast
Copy link
Author

@bitwalker Fixing this issue would speedup a lot Hologram incremental compilation and live reloading.

@bitwalker
Copy link
Owner

@bartblast I'll try and get to this today, sorry for the delay - unfortunately I have minimal time to maintain some of my open source projects for the time being.

@stevensonmt
Copy link
Contributor

stevensonmt commented Jan 13, 2025

Thinking about this some more and I think there might be an even more efficient way to do this. Let's say there are 1000 vertices, each vertex has a 100 incident edges and 100 emanating edges. That's 200k edges for the whole graph. The old version was visiting each vertex twice for the incident and emanating edge collection and then all the edges in the graph once. The proposed version only has to visit the collection of all edges. I think there is a way to visit only the subset of edges involving the vertex of interest. Not where I can test this out until this evening, but something like this:

def delete_vertex(
        %__MODULE__{vertices: vs, vertex_labels: ls, out_edges: oe, in_edges: ie, edges: em, vertex_identifier: vertex_identifier} =
          g,
        v
      ) do
   
    with v_id <- vertex_identifier.(v),
         true <- Map.has_key?(vs, v_id),
         {outs, oe} <- Map.pop(oe, v_id),
         {ins, ie} <- Map.pop(ie, v_id),
         vs <- Map.delete(vs, v_id),
         ls <- Map.delete(ls, v_id) do

    {ie, em} = outs 
               |> Enum.reduce({ie, em}, fn id1, {i, e} -> 
                         {Map.update!(i, id1, fn ns -> MapSet.delete(ns, v_id) end), Map.delete(e, {v_id, id1})} 
                end)

    {oe, em} = ins
               |> Enum.reduce({oe, em}, fn id1, {o, e} -> 
                        {Map.update!(o, id1, fn ns -> MapSet.delete(ns, v_id) end), Map.delete(e, {id1, v_id})} 
                 end)

      %__MODULE__{g | vertices: vs, vertex_labels: ls, out_edges: oe, in_edges: ie, edges: em}
    else
      _ -> g
    end
  end

Now instead of visiting 200k edges it visits 200 that are identified as involving v_id.

Does this make sense?

Need to test it. Intuitively I think this might be faster for sparsely connected graphs (each vertex has relatively few edges) where the current proposal could be faster for densely connected graphs (each vertex is connected to nearly every other vertex).

@stevensonmt
Copy link
Contributor

stevensonmt commented Jan 14, 2025

So my intuition was correct for a sparsely connected graph the version above is vastly superior. This is for a sparse graph with many vertices, each with one in edge and one out edge.

g = (Graph.new() |> Graph.add_edges(Enum.map(1..1_000_000, fn i -> {i, 1_000_000 - i} end)))
i = Enum.random(1..1_000_000)
test = fn g, i, del_fun -> del_fun.(g, i) end
g1 = test.(g, i, fn g, i -> Graph.delete_vertex(g, i) end)
g2 = test.(g, i, fn g, i -> Graph.delete_vertex_alt(g, i) end)
g3 = test.(g, i, fn g, i -> Graph.delete_vertex_old(g, i) end)
# sanity check to make sure they all do the same thing.
g1 == g2 # true
g1 == g3 # true
g2 == g3 # true
i in Graph.vertices(g) # true
i in Graph.vertices(g1) # false
i in Graph.vertices(g2) # false
i in Graph.vertices(g3) # false
Graph.Edge.new(i, 1_000_000 - i) in Graph.edges(g) # true
Graph.Edge.new(i, 1_000_000 - i) in Graph.edges(g1) # false
Graph.Edge.new(i, 1_000_000 - i) in Graph.edges(g2) # false
Graph.Edge.new(i, 1_000_000 - i) in Graph.edges(g3) # false
# swapping order of vertices as {1_000_000 - i, i} was the same

funs = [fn g, i -> Graph.delete_vertex(g, i) end, fn g, i -> Graph.delete_vertex_alt(g, i) end, fn g, i -> Graph.delete_vertex_old(g, i) end]
funs |> Task.async_stream(fn fun -> :timer.tc(fn -> test.g, i, fun) end, :microsecond) end) |> Enum.to_list()
#[
#  ok: {1148844, #Graph<type: directed, num_vertices: 999896, num_edges: 999998>}, # current proposal in PR
#  ok: {116, #Graph<type: directed, num_vertices: 999896, num_edges: 999998>}, # new approach as above
#  ok: {2397265, #Graph<type: directed, num_vertices: 999896, num_edges: 999998>} # version in current release
#]

Subsequent runs were similar.

Changing the graph to ~2k vertices but ~1.9M edges yields:

vs = Stream.repeatedly(fn -> Enum.random(1..10_000) end) |> Enum.take(1_000)
g = (Graph.new() |> Graph.add_edges(1..1_000 |> Enum.flat_map(fn i -> vs |> Enum.map(fn v -> [{v,i}, {i, v}] end) end) |> List.flatten()))
[g1, g2, g3] = funs |> Enum.map(fn fun -> test.(g, 2513, fun) end)
g == g1
#false
g == g2
#false
g == g3
#false
g1 == g2
#true
g2 == g3
#true
g1 == g3
#true
# won't continue to bore you with evidence that I checked that the edges were actually deleted appropriately but they were
funs |> Task.async_stream(fn fun -> :timer.tc(fn -> test.(g, 2513, fun) end) end) |> Enum.to_list()
#[
#  ok: {1546322, #Graph<type: directed, num_vertices: 1869, num_edges: 1898944>},
#  ok: {10594, #Graph<type: directed, num_vertices: 1869, num_edges: 1898944>},
#  ok: {1578386, #Graph<type: directed, num_vertices: 1869, num_edges: 1898944>}
#]

So still significantly better. You also see the old version and the current PR version converging in a graph with fewer nodes but more edges. This is because the old version is more heavily impacted by the number of nodes with any edges but both are relatively equally affected by the number of total edges in the graph. The new version is affected only by the number of edges connected to the vertex being deleted. A more extreme graph with 1_000_000 vertices with 999_999 edges would likely show an even bigger improvement but my poor 14-year-old i5 w/16Gb slow RAM just can't.

EDIT
Caught a bug with the above implementation. When the vertex being deleted has no in edges or out edges there's a problem piping nil into Enum.reduce. Fixing that now and will update PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants