Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAG Control flow errors when abstracting function definitions to separate compilation units / methods #1210

Open
ptheywood opened this issue Jun 11, 2024 · 4 comments

Comments

@ptheywood
Copy link
Member

ptheywood commented Jun 11, 2024

Encountered a bug when working on a non-trivial model, where when using the DAG api (dependsOn etc) errors would occur when abstracting the definition of agent function behaviours and inclusion in the control flow DAG to mehtods in a separate file which take a reference of the ModelDescription object. The same abstraciton but using layers behaves fine.

E.g. something along the lines of (untested)

main.cu

// ... 

#include "flamegpu/flamegpu.h"

FLAMEGPU_AGENT_FUNCTION(foo, flamegpu::MessageNone, flamegpu::MessageNone) {
    // ...
    return flamegpu::ALIVE;
}

FLAMEGPU_AGENT_FUNCTION(bar, flamegpu::MessageNone, flamegpu::MessageNone) {
    // ...
    return flamegpu::ALIVE;
}

int main(int argc, char* argv[]) {
    // Define the model, agent and 2 agent funcs
    flamegpu::ModelDescription model("model");
    flamegpu::AgentDescription agent = model.Agent("agent");
    flamegpu::AgentFunctionDescription foo_desc = agent.newFunction("foo", foo);
    flamegpu::AgentFunctionDescription bar_desc = agent.newFunction("bar", bar);

    // Foo runs first
    model.addExecutionRoot(foo_desc);
    // bad depends on foo
    bar_desc.dependsOn(foo_desc);

    // Build the execution graph 
    model.generateLayers();

    // Construct the model.
    flamegpu::CUDASimulation simulation(model);

    // ...

    return 0;
}


Splitting out the agent funciton(s) into methods in a .cu file, with an associated header

other.cuh

#include "flamegpu/flamegpu.h"
namespace other {
void define(flamegpu::ModelDescription& model);
}  // namespace other```

other.cu

#include "other.h"

FLAMEGPU_AGENT_FUNCTION(foo, flamegpu::MessageNone, flamegpu::MessageNone) {
    // ...
    return flamegpu::ALIVE;
}

FLAMEGPU_AGENT_FUNCTION(bar, flamegpu::MessageNone, flamegpu::MessageNone) {
    // ...
    return flamegpu::ALIVE;
}
namespace other {
void define(flamegpu::ModelDescription& model){
    flamegpu::AgentDescription agent = model.Agent("agent");
    flamegpu::AgentFunctionDescription foo_desc = agent.newFunction("foo", foo);
    flamegpu::AgentFunctionDescription bar_desc = agent.newFunction("bar", bar);

    // add to the DAG, ideally in a separate method by getting mutable refs to functions, but error occurred even without that.

    // Foo runs first
    model.addExecutionRoot(foo_desc);
    // bad depends on foo
    bar_desc.dependsOn(foo_desc);
}
}  // namespace other

main.cu

#include "flamegpu/flamegpu.h"
#include "other.h"

int main(int argc, char* argv[]) {
    // Define the model, agent and 2 agent funcs
    flamegpu::ModelDescription model("model");
    flamegpu::AgentDescription agent = model.newAgent("agent");

    other::define(model);

    // Build the execution graph 
    model.generateLayers();

    // Construct the model.
    flamegpu::CUDASimulation simulation(model);

    // ...

    return 0;
}

In the separate larger model where this occurred, this resulted in runtime errors under linux (CUDA 12.5, GCC 11) resulted in runtime errors for the split case, while the first case was fine.

The runtime error was:

terminate called after throwing an instance of 'std::bad_array_new_length'
  what():  std::bad_array_new_length

Which via gdb had a backtrace pointing at DependencyNode::getDependents called by DependencyGraph::validateSubTree(DependencyNode* node, std::vector<DependencyNode*>& functionStack)

DependencyNode::dependents is a std::vector<DependencyNode*> dependents; but it does not appear to get explicitly initialised anywhere, which may be the problem (or it might not, as a debug build reproduced the error).

@Robadob
Copy link
Member

Robadob commented Jun 11, 2024

but it does not appear to get explicitly initialised anywhere, which may be the problem (or it might not, as a debug build reproduced the error).

Default implicit constructor, which is implicitly called by subclass's constructor. I don't think that is the problem.

@plietar
Copy link
Contributor

plietar commented Dec 4, 2024

I'm getting a similar issue.

The problem is that the AgentFunctionDescription goes at out scope at the end of the define function and removed from the stack, but the model's graph still holds pointers to it in its list of roots. Any attempt to use the graph after define returns leads to reading the recycled memory from the AgentFunctionDescription, causing the crashes.

The layer API does not have this issue because it stores std::shared_ptr<AgentFunctionData>, whose lifespan is managed automatically.

I think this is a pretty big foot gun, since any non-trivial model is likely going to be split into a function to define it and a function to run it.

I'm not sure I understand the nuance between AgentFunctionDescription and AgentFunctionData, but if the dependency node stuff was moved from the description object to the data, then the graph could hold pointers to the AgentFunctionData instead, which are guaranteed to live as long as the model. The graph might even be able to use std::shared_ptr, since it is presumably to be acyclic.

@Robadob
Copy link
Member

Robadob commented Dec 4, 2024

I wrote the original ModelDescription hierarchy, however the dependency graph API was developed by @MILeach and I've not used it heavily.

At a quick glance, I agree with you the problem appears to be because the AgentFunctionDescription is being stored, rather than it's corresponding AgentFunctionData. Within the ModelDescription hierarchy, the Description suffixed classes are merely disposable interfaces, whereas the Data suffixed classes (structs?) are all owned by ModelData (which is belongs to ModelDescription during model definition).

dependencies.push_back(&dependency);

A little harder to resolve, given that Description/Data separation, I guess DependencyNode requires splitting too. Likewise, HostFunctionDescription does not have an equivalent HostFunctionData currently.

Sadly, a bit too much work required given how busy I am at current.

@Robadob
Copy link
Member

Robadob commented Dec 4, 2024

Possibly of interest @plietar, I think this is the only open-source model we have that's split across multiple files. It pre-dates the DAG functionality.

https://github.com/primagesheffield/flamegpu2-neuroblastoma

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants