DAG Control flow errors when abstracting function definitions to separate compilation units / methods #1210

ptheywood · 2024-06-11T10:38:22Z

Encountered a bug when working on a non-trivial model, where when using the DAG api (dependsOn etc) errors would occur when abstracting the definition of agent function behaviours and inclusion in the control flow DAG to mehtods in a separate file which take a reference of the ModelDescription object. The same abstraciton but using layers behaves fine.

E.g. something along the lines of (untested)

main.cu

// ... 

#include "flamegpu/flamegpu.h"

FLAMEGPU_AGENT_FUNCTION(foo, flamegpu::MessageNone, flamegpu::MessageNone) {
    // ...
    return flamegpu::ALIVE;
}

FLAMEGPU_AGENT_FUNCTION(bar, flamegpu::MessageNone, flamegpu::MessageNone) {
    // ...
    return flamegpu::ALIVE;
}

int main(int argc, char* argv[]) {
    // Define the model, agent and 2 agent funcs
    flamegpu::ModelDescription model("model");
    flamegpu::AgentDescription agent = model.Agent("agent");
    flamegpu::AgentFunctionDescription foo_desc = agent.newFunction("foo", foo);
    flamegpu::AgentFunctionDescription bar_desc = agent.newFunction("bar", bar);

    // Foo runs first
    model.addExecutionRoot(foo_desc);
    // bad depends on foo
    bar_desc.dependsOn(foo_desc);

    // Build the execution graph 
    model.generateLayers();

    // Construct the model.
    flamegpu::CUDASimulation simulation(model);

    // ...

    return 0;
}

Splitting out the agent funciton(s) into methods in a .cu file, with an associated header

other.cuh

#include "flamegpu/flamegpu.h"
namespace other {
void define(flamegpu::ModelDescription& model);
}  // namespace other```

other.cu

#include "other.h"

FLAMEGPU_AGENT_FUNCTION(foo, flamegpu::MessageNone, flamegpu::MessageNone) {
    // ...
    return flamegpu::ALIVE;
}

FLAMEGPU_AGENT_FUNCTION(bar, flamegpu::MessageNone, flamegpu::MessageNone) {
    // ...
    return flamegpu::ALIVE;
}
namespace other {
void define(flamegpu::ModelDescription& model){
    flamegpu::AgentDescription agent = model.Agent("agent");
    flamegpu::AgentFunctionDescription foo_desc = agent.newFunction("foo", foo);
    flamegpu::AgentFunctionDescription bar_desc = agent.newFunction("bar", bar);

    // add to the DAG, ideally in a separate method by getting mutable refs to functions, but error occurred even without that.

    // Foo runs first
    model.addExecutionRoot(foo_desc);
    // bad depends on foo
    bar_desc.dependsOn(foo_desc);
}
}  // namespace other

main.cu

#include "flamegpu/flamegpu.h"
#include "other.h"

int main(int argc, char* argv[]) {
    // Define the model, agent and 2 agent funcs
    flamegpu::ModelDescription model("model");
    flamegpu::AgentDescription agent = model.newAgent("agent");

    other::define(model);

    // Build the execution graph 
    model.generateLayers();

    // Construct the model.
    flamegpu::CUDASimulation simulation(model);

    // ...

    return 0;
}

In the separate larger model where this occurred, this resulted in runtime errors under linux (CUDA 12.5, GCC 11) resulted in runtime errors for the split case, while the first case was fine.

The runtime error was:

terminate called after throwing an instance of 'std::bad_array_new_length'
  what():  std::bad_array_new_length

Which via gdb had a backtrace pointing at DependencyNode::getDependents called by DependencyGraph::validateSubTree(DependencyNode* node, std::vector<DependencyNode*>& functionStack)

DependencyNode::dependents is a std::vector<DependencyNode*> dependents; but it does not appear to get explicitly initialised anywhere, which may be the problem (or it might not, as a debug build reproduced the error).

The text was updated successfully, but these errors were encountered:

Robadob · 2024-06-11T10:41:52Z

but it does not appear to get explicitly initialised anywhere, which may be the problem (or it might not, as a debug build reproduced the error).

Default implicit constructor, which is implicitly called by subclass's constructor. I don't think that is the problem.

plietar · 2024-12-04T16:29:38Z

I'm getting a similar issue.

The problem is that the AgentFunctionDescription goes at out scope at the end of the define function and removed from the stack, but the model's graph still holds pointers to it in its list of roots. Any attempt to use the graph after define returns leads to reading the recycled memory from the AgentFunctionDescription, causing the crashes.

The layer API does not have this issue because it stores std::shared_ptr<AgentFunctionData>, whose lifespan is managed automatically.

I think this is a pretty big foot gun, since any non-trivial model is likely going to be split into a function to define it and a function to run it.

I'm not sure I understand the nuance between AgentFunctionDescription and AgentFunctionData, but if the dependency node stuff was moved from the description object to the data, then the graph could hold pointers to the AgentFunctionData instead, which are guaranteed to live as long as the model. The graph might even be able to use std::shared_ptr, since it is presumably to be acyclic.

Robadob · 2024-12-04T16:49:19Z

I wrote the original ModelDescription hierarchy, however the dependency graph API was developed by @MILeach and I've not used it heavily.

At a quick glance, I agree with you the problem appears to be because the AgentFunctionDescription is being stored, rather than it's corresponding AgentFunctionData. Within the ModelDescription hierarchy, the Description suffixed classes are merely disposable interfaces, whereas the Data suffixed classes (structs?) are all owned by ModelData (which is belongs to ModelDescription during model definition).

FLAMEGPU2/src/flamegpu/model/DependencyNode.cpp

Line 55 in dfae015

dependencies.push_back(&dependency);

A little harder to resolve, given that Description/Data separation, I guess DependencyNode requires splitting too. Likewise, HostFunctionDescription does not have an equivalent HostFunctionData currently.

Sadly, a bit too much work required given how busy I am at current.

Robadob · 2024-12-04T17:23:27Z

Possibly of interest @plietar, I think this is the only open-source model we have that's split across multiple files. It pre-dates the DAG functionality.

https://github.com/primagesheffield/flamegpu2-neuroblastoma

Robadob added bug Priority: High labels Dec 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DAG Control flow errors when abstracting function definitions to separate compilation units / methods #1210

DAG Control flow errors when abstracting function definitions to separate compilation units / methods #1210

ptheywood commented Jun 11, 2024 •

edited

Loading

Robadob commented Jun 11, 2024

plietar commented Dec 4, 2024

Robadob commented Dec 4, 2024 •

edited

Loading

Robadob commented Dec 4, 2024

DAG Control flow errors when abstracting function definitions to separate compilation units / methods #1210

DAG Control flow errors when abstracting function definitions to separate compilation units / methods #1210

Comments

ptheywood commented Jun 11, 2024 • edited Loading

Robadob commented Jun 11, 2024

plietar commented Dec 4, 2024

Robadob commented Dec 4, 2024 • edited Loading

Robadob commented Dec 4, 2024

ptheywood commented Jun 11, 2024 •

edited

Loading

Robadob commented Dec 4, 2024 •

edited

Loading