TestMLPotential.py fails with nnpops implementation #25
also, when I add the
when i call
It looks like a case where the model is on one device and the input tensor is on a different one.
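For context, that failure mode generally just means the model's weights and the input tensor live on different devices. A minimal libtorch sketch of keeping them on the same device (illustrative only; the model file name is made up and this is not the code from TestMLPotential.py):

```cpp
#include <torch/script.h>
#include <torch/torch.h>
#include <iostream>

int main() {
    // Load a TorchScript module and pick a device; use CUDA only if it is usable.
    torch::jit::script::Module module = torch::jit::load("model.pt");
    torch::Device device(torch::cuda::is_available() ? torch::kCUDA : torch::kCPU);
    module.to(device);

    // The input must live on the same device as the module's parameters,
    // otherwise the forward call fails with a device-mismatch error.
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1}).to(device));
    at::Tensor output = module.forward(inputs).toTensor();
    std::cout << output << std::endl;
    return 0;
}
```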
What exactly does that mean? Are you running it on a computer without a GPU? Or do you mean there is a GPU, but you're specifying the CPU platform when you create your context? The PyTorch plugin is supposed to work with all platforms.
running on a computer without a GPU.
right. I am just trying to figure out why this is the case/how to fix. the platform is
TestMLPotential.py passes when I run it locally. Perhaps the problem is that you have CUDA installed (so PyTorch tries to use it), but you don't have any CUDA compatible GPUs (so it fails when it tries)?
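A quick way to sanity-check that hypothesis from libtorch (just an illustrative check, not part of the test suite):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
    // If the CUDA libraries are installed but no usable GPU is visible,
    // is_available() returns false and device_count() reports 0.
    std::cout << "CUDA available: " << std::boolalpha
              << torch::cuda::is_available() << "\n";
    std::cout << "CUDA devices:   " << torch::cuda::device_count() << "\n";
    return 0;
}
```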
sorry, i should clarify; i am trying to run
i definitely have both, and i can make
I think we're talking about different things, since a few different errors are described above. I was referring to the
yes, i don't disagree that is the case. This is not the blocking issue, just a passing observation (which you correctly clarified). I am primarily concerned with integrating the
@peastman , if i reduce the problem to this code snippet, pull
and i'm not sure how to debug this.
Let me make sure I understand. This error happens when all of the following are true:
If any one of those is not true, it works. Is that correct? How does this relate to the original problem you described up at the top? That one produced an exception about tensors being on different devices, while this one produces
correct.
I don't know. i haven't tried all of the permutations; however, i need the latter two points to be
the precise issue is not the problem; after more digging, and showing this, I realized there is an edge case associated with running nnpops-implemented
Here's the error I get when running your example.
Notice the message "Unknown device: 87" near the top. Each time I run it, there's a different number. That makes me think it might be a problem with uninitialized memory somewhere. I'm not sure where it's getting the number from though. The error happens in the first line, `module = torch::jit::load(owner.getFile());`.
The above was using the main branch, so it actually wasn't using the NNPOps optimized version. Strange...
that's especially strange. I haven't encountered that. (unintentionally closed issue); i can't tell if this is a version issue, but all of my packages come from
I think this may be an issue with incompatible versions of pytorch. Investigating...
I was compiling OpenMM-Torch against a version of libtorch downloaded from https://pytorch.org, and I think it was incompatible with the one from conda. I needed to do that because the conda version was missing the CMake files needed to compile against it. I updated to the newest conda package (PyTorch 1.10.0), and now it does include the CMake files. But when I try to compile against it, all the test cases fail to build with the errors
The packages installed to build pytorch can differ from the packages installed to run it when you just
I don't think so. The link errors refer to standard C++ functions. Usually that indicates a binary incompatibility of some sort, either libraries were compiled with different ABIs or different versions of libstdc++.
I was thinking that it might be trying to use your system libraries instead of the conda-forge built libraries installed via the packages appearing in the
@peastman: i was playing around with the
If I set
and add a
i can manipulate the global parameter and make calls to the
I'm not sure how easy it would be to find the root cause of the
your thoughts?
@peastman: Since it will take a while to establish why putting a
@dominicrufa could you post the output of
@peastman, my
if you are referring to my omm installation, I am using a nightly build from conda-forge; i'm not building from source.
I'm referring to the OpenMM-Torch plugin. Do you build it from source or install with conda?
Is there an issue with the build environments of
@peastman, is this specifically what is throwing the
Correct. If I manually restore the context, the error goes away. But if I then follow with a second energy evaluation, we get a CUDA error inside PyTorch.
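For illustration, a rough sketch of what "manually restoring the context" around the model call could look like (this is not the workaround that was eventually merged, and the function name is made up):

```cpp
#include <cuda.h>
#include <torch/script.h>

// Remember which CUDA context was current before invoking the TorchScript
// model, and put it back afterwards if PyTorch has disturbed the context stack.
at::Tensor forwardWithContextRestore(torch::jit::script::Module& module,
                                     std::vector<torch::jit::IValue>& inputs) {
    CUcontext saved = nullptr;
    cuCtxGetCurrent(&saved);

    at::Tensor output = module.forward(inputs).toTensor();

    CUcontext current = nullptr;
    cuCtxGetCurrent(&current);
    if (current != saved)
        cuCtxSetCurrent(saved);  // restore what the caller had current
    return output;
}
```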
@peastman , if it is indeed a pytorch bug, would it make more sense to use this hack in the meantime since the time horizon for the pytorch bugfix is an unknown? I only say this because this issue is blocking for me. if you'd prefer to avoid the hack, I'll open a PR fixing the problem with the hack (for reference's sake) as a temporary workaround that i can integrate into my downstream workflow.
Perhaps the NVIDIA folks like @dmclark17 might be able to help us here since it involves a few community codes?
Sure—I can do some investigating and try to reproduce on my end.
I'm still getting up to speed on how contexts are being handled here—have you tried popping the current context before the PyTorch code and then pushing it afterwards?
Contexts are handled by the ContextSelector class. It pushes the context in its constructor and pops it in the destructor. To use it, you create an instance as a local variable. The context is current from that line to the end of the enclosing block. Here is the method where the problem occurs. There are ContextSelectors to set the context for two short blocks, one in lines 97-101 and another in lines 145-150. It does not set a context at the point where the PyTorch model is invoked (either line 114 or 119). And usually that works. But it fails when the TorchForce is inside a CustomCVForce. In that case, this whole method is called from https://github.com/openmm/openmm/blob/c7af17c8ba2b6c3667e5126b494d1972b1b6d254/platforms/common/src/CommonKernels.cpp#L5389. The invoking method has already placed a context onto the stack, and PyTorch removes it. This does suggest a workaround: possibly we could modify the implementation of CustomCVForce to not have a context set when it calls
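In rough outline, that RAII pattern looks something like this (a simplified sketch, not the actual OpenMM ContextSelector source):

```cpp
#include <cuda.h>

// The constructor makes the given context current and the destructor pops it,
// so the context is current only for the lifetime of the local variable.
class ScopedContext {
public:
    explicit ScopedContext(CUcontext context) {
        cuCtxPushCurrent(context);
    }
    ~ScopedContext() {
        CUcontext popped = nullptr;
        cuCtxPopCurrent(&popped);
    }
};

void exampleKernelCall(CUcontext context) {
    {
        ScopedContext selector(context);  // context becomes current here
        // ... work that needs the context ...
    }                                     // context popped at end of block
    // No context is current here; this is roughly where the PyTorch model
    // would be invoked in the real code.
}
```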
The workaround is in openmm/openmm#3533.
Thanks for the explanation! I'm trying to create a standalone reproducer to make sure I understand and can communicate the issue. I am loading in a simple model that multiplies an input tensor by two. I created it using the following:
The C++ code looks like this:
I am seeing the following output:
In this case, it doesn't seem like PyTorch is changing the context. On the other hand, if there isn't a current context when the JIT module is executed, it seems like PyTorch is creating a new context and leaving it on the stack. It doesn't seem like this is the expected behavior given the error observed with OpenMM-torch. Do you have any ideas on how to make the example more realistic? Thanks!
If you move the lines that load the module up to the top of `main()`:

```cpp
int main() {
    torch::jit::script::Module module = torch::jit::load("../model.pt");
    module.to(at::kCUDA);
    cuInit(0);
    CUcontext ctx, myContext;
    CUdevice dev;
    CUresult res;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, CU_CTX_SCHED_SPIN, dev);
    printContext("After creation");
    cuCtxPushCurrent(ctx);
    printContext("After push");
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1,}).to(at::kCUDA));
    at::Tensor output = module.forward(inputs).toTensor();
    printContext("After run");
}
```

Here is the output I get.
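The `printContext` helper used in these snippets isn't shown above; a plausible sketch, assuming it just reports the current context via `cuCtxGetCurrent`, would be:

```cpp
#include <cuda.h>
#include <cstdio>

// Report which CUDA context, if any, is currently bound to the calling thread.
static void printContext(const char* label) {
    CUcontext current = nullptr;
    cuCtxGetCurrent(&current);
    std::printf("%s: current context = %p\n", label, (void*) current);
}
```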
Interesting—I am not able to reproduce that on my end; with that ordering, I am seeing:
I think I'm using the CUDA toolkit that conda installed with PyTorch—I'm not sure if that could be causing the difference.
What version of PyTorch do you have? I was testing with 1.9.1.
I am using 1.11.0 and linking to the libtorch that comes with the conda installation. I will try using 1.9.1!
@peastman, can you merge the workaround with
Merged. We should still figure out what's going on with PyTorch, but it should fix the immediate problem. What version of PyTorch were you using when you encountered the problem?
@peastman, were you able to see the problem with
Yes.
I didn't make any changes to code in this repo.
@peastman , which pull request did you use to reproduce the problem?
The one you said to use, #21.
right, yes. sorry for the confusion. i think it just needs to be rebased with
I am able to reproduce the issue with 1.9.0:
I am not seeing anything about CUDA contexts in the 1.11.0 release notes.
I've been looking into the difference between PyTorch 1.9 and 1.11, and it seems like 1.9 is calling
Would it be possible to try to reproduce the original bug with PyTorch 1.11 to see if it is fixed? I need to use #21 to reproduce, correct?
@dominicrufa : Was this fixed? |
closing as this is fixed with
I'm not too familiar with `torch` tracebacks, but it seems like `Torch` isn't robust to the placement of arrays onto different devices:

@peastman, any idea what is going wrong here? or perhaps @raimis knows what is wrong.

alternatively, if i try to run this without GPUs, it throws a runtime error:

do we generally want to make this package robust to the platform type? or only to `CUDA`?