Segmentation fault when creating simulation context with simple TorchForce force #88
Your script runs fine for me. Can you try running it inside
I ran the code using the above instructions. This is what I got when I typed
This may be related to #84.
I have created the environment:

    conda env create mmh/openmm-8-beta-linux
    conda activate openmm-8-beta-linux
    conda install -c conda-forge openmmtools
    conda list

The script above runs without a problem:
The main difference with @FranklinHu1's environment is, I suspect, related to conda-forge/openmm-torch-feedstock#20. @FranklinHu1, what version of
I have been able to recreate this error using conda. Method to get the error (conda version 22.9.0):
Method to run without error:
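The original command listing was not preserved; based on the rest of the thread, the working method was creating the environment with mamba instead of conda. A rough sketch, assuming mamba is installed:

```bash
# Create the same published environment, but let mamba do the solving
mamba env create mmh/openmm-8-beta-linux
conda activate openmm-8-beta-linux
# openmmtools is only needed for the alanine dipeptide test system
mamba install -c conda-forge openmmtools
```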
This is a side-by-side diff of mamba list and conda list:
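The diff itself did not survive extraction; one way to produce such a comparison (hypothetical environment names, assuming a bash shell for the process substitution):

```bash
# Side-by-side comparison of the packages installed by conda vs. mamba
diff -y <(conda list -n env-from-conda) <(conda list -n env-from-mamba)
```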
Note that the working mamba version does use
This is issue conda-forge/openmm-torch-feedstock#20. My conda installs, for example:
Mamba installs the versions that are built for e.g.:
@sef43 good catch! Still, I don't understand why I get PyTorch 1.12.1 and you get 1.11.0 with
@FranklinHu1 could you install with
I have discovered the reason for this. I was using
The versions of
So
There are versions of
Yes, I have been able to run my debugging script by creating the openmm beta environment using mamba instead of conda, as @sef43 suggested. I am now having problems loading my more advanced PyTorch models due to their dependency on packages like torch cluster and torch geometric. For just torch cluster, I tried to install it with conda using the command
which gives the same problem encountered with torch cluster in issue #87. In any case, I think my specific issue with the initial debugging script I posted is resolved with the mamba environment workaround, and I am now following issue #87 for the fix to torch cluster and the other torch dependencies.
I still can't reproduce this, even using conda and PyTorch 1.12.1. Here's the sequence of commands I typed:
@peastman The error seems to occur when you use conda in a Linux environment which does not have CUDA available. Conda can detect whether CUDA is available: look at the output of conda info
and check the virtual packages it lists.
If I run your commands in a Linux environment which does have CUDA, it works. However, on a Linux node which does not have CUDA available, where conda does not detect a CUDA virtual package,
then when I run the commands I get the segmentation fault (because conda installs the incompatible pytorch). You should be able to simulate this by overriding the conda virtual package (see https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-virtual.html):
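A sketch of what that override might look like, based on the virtual-packages documentation linked above (the environment spec is the same one used earlier in the thread):

```bash
# Pretend CUDA is not available, even on a node that has it,
# by overriding conda's __cuda virtual package for this command only
CONDA_OVERRIDE_CUDA="" conda env create mmh/openmm-8-beta-linux

# Conversely, pretend a specific CUDA version is available
CONDA_OVERRIDE_CUDA="11.2" conda env create mmh/openmm-8-beta-linux
```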
I can reproduce the segmentation fault, on a Linux node which has CUDA, with these lines:
Output:
Thanks, I can reproduce it with that sequence of commands. I compared environments created with mamba (which works) and conda (which fails). They install identical builds of pytorch, so that isn't the problem. But they install different builds of openmm-torch, and the two available builds were made against different pytorch versions:

- 1.11.0
- 1.12.0

I'm not sure how to determine which of the two packages is which, but presumably this means conda is incorrectly installing the wrong one for the PyTorch version it has installed?
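One way to see which openmm-torch build each environment actually contains (a hedged sketch; the environment names are placeholders, and the build strings encode the pytorch variant they were compiled against):

```bash
# Compare the openmm-torch and pytorch builds in the two environments
conda list -n env-from-conda 'openmm-torch|pytorch'
conda list -n env-from-mamba 'openmm-torch|pytorch'
```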
@mikemhenry do you have any idea what might be causing the behavior described above (#88 (comment))? Is this a bug in conda, or is there a problem in how we specify the constraints in the recipe?
My suspicion is that the behaviour is due to the fact that all the recent openmm-torch uploads are CUDA builds. You can view this by running a package search against conda-forge, as sketched below.
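A hedged sketch of such a query (the exact command used in the thread and its example output were not preserved):

```bash
# List all openmm-torch builds on conda-forge together with their dependencies;
# recent builds depend on a CUDA-enabled pytorch and on cudatoolkit
conda search -c conda-forge openmm-torch --info
```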
Conda will not be able to install them unless it detects you have CUDA on your system. However, the older uploads of openmm-torch can still be installed without CUDA. I believe the fix is to either provide a Linux CPU-only build of openmm-torch which depends on
That explains why it installs pytorch 1.11 instead of 1.12. But in that case, it ought to install the openmm-torch package that was built against pytorch 1.11. I don't really understand the constraints in the recipe. In the build requirements it lists:

    - pytorch                            # [build_platform != target_platform]
    - pytorch =*={{ torch_proc_type }}*  # [build_platform != target_platform]

Those are only used when cross compiling, so I don't think they're relevant. Then they're specified under the run requirements:

    # Leaving two dependencies helps rerender correctly
    # The first gets filled in by the global pinnings
    # The second gets the processor type
    - pytorch
    - pytorch =*={{ torch_proc_type }}*

And finally it includes this constraint:

    run_constrained:
      # 2022/02/05 hmaarrfk
      # While conda packaging seems to allow us to specify
      # constraints on the same package in different lines
      # the resulting package doesn't have the ability to
      # be specified in multiples lines
      # This makes it tricky to use run_exports
      # we add the GPU constraint in the run_constrained
      # to allow us to have "two" constraints on the
      # running package
      - pytorch =*={{ torch_proc_type }}*
@hmaarrfk since your name is mentioned in the comment above, I wondered if you had any idea about this issue? The short version is that installing with conda on a machine without CUDA gives you pytorch 1.11 together with the openmm-torch build that was made against a different pytorch, and creating a Context then segfaults. On the other hand, if you install with mamba, it correctly installs the package that was built against 1.11 and therefore works. (Either way, it never installs pytorch 1.12. It seems the newer packages aren't supported on computers without CUDA?)
I think I used a "feature" or a happy mistake of mamba, and not conda. Conda seems to ignore the constraint, leading to incompatible versions being installed. I'm not sure of another syntax that would work with our migration pipeline. You are free to try something. You can also try to use conda-libmamba-solver. One thing that may have changed: we've been somewhat convinced that we should use higher build numbers to help prioritize CUDA builds for machines that support them. Given this, we can probably adjust the run export to pin GPU builds to the GPU, and CPU to CPU or GPU. Given the constraints from the overall environment, it is likely that the versions will be correctly installed.
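A rough sketch of trying the libmamba solver (option names have changed across conda releases, so treat these as a starting point rather than exact syntax):

```bash
# Install the alternative solver into the base environment
conda install -n base conda-libmamba-solver

# Recent conda releases let you make it the default solver;
# older releases used an "experimental_solver" setting instead
conda config --set solver libmamba
```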
Thanks! That's helpful information.
Is it possible to include that in an environment file so it would automatically be used when building an environment? Or would we need to tell users to install it first? If the latter, it's probably simpler to just tell them to use mamba. I'm trying to dig into the built packages to understand better what is happening. Here is the metadata for one of the openmm-torch packages:

{
"arch": "x86_64",
"build": "cuda112py310h02d4f52_1",
"build_number": 1,
"constrains": [
"pytorch =*=cuda*"
],
"depends": [
"__glibc >=2.17",
"cudatoolkit >=11.2,<12",
"libgcc-ng >=12",
"libstdcxx-ng >=12",
"ocl-icd >=2.3.1,<3.0a0",
"ocl-icd-system",
"openmm >=8.0.0beta,<8.1.0a0",
"python >=3.10,<3.11.0a0",
"python_abi 3.10.* *_cp310",
"pytorch >=1.12.0,<1.13.0a0"
],
"license": "MIT",
"license_family": "MIT",
"name": "openmm-torch",
"platform": "linux",
"subdir": "linux-64",
"timestamp": 1665505325676,
"version": "1.0beta"
}

There are a few things I notice about this. First, the constrains field pins pytorch to a CUDA build (pytorch =*=cuda*). On the other hand, the depends field only requires pytorch >=1.12.0,<1.13.0a0, without restricting the processor type.
Can you explain how that would work? I've never been completely clear on the relationship between the
I really don't like pytorch-gpu as a name. I think it makes writing recipes harder. Constraints were added to conda-build in conda/conda-build#2001.
Got it, thanks. So that leaves two questions.
Perhaps conda doesn't expect the same package to be listed in both places? Since it is, the specification in
The thing is that the
What would happen if we just left out the
Thanks @hmaarrfk for helping out here!
Correct.
Theoretically, the higher build number of recent pytorch versions should cause conda and mamba to prefer the GPU builds, if possible. Honestly, my recommendation would be:
We can also try to export the GPU requirement at build time. This would require updated requirements. What I was scared of is that if you build locally, then it may:
I think this case might be rare, but I think it would be very confusing to debug.
@peastman I can update the environment files, I can also grep our docs and switch recommendations to use mamba. I can also do some builds with the second half of suggestions, but I do worry about making things harder to debug. |
We'll still have to build against multiple versions of pytorch. We build native libraries that link to libtorch, which isn't binary compatible across major releases.
Definitely! It sounds like removing the
@mikemhenry what do you think?
@peastman Sounds good! I will get a new build out the door ASAP!
Adding pytorch 1.13 here: conda-forge/openmm-torch-feedstock#31
Merging conda-forge/openmm-torch-feedstock#31 appears to have fixed the problem. After installing with conda as described above, I can now run the script without it crashing.
Yes, also fixed for me.
Thanks! I'll close this then. |
Hello,
I am running into a segmentation fault when adding a simple TorchForce force to the alanine dipeptide test system from openmmtools. This is similar to issue #87, but I am trying and failing to do something far simpler.
For my environment, I am using the openmm-8-beta-linux environment generated from the following command:
conda env create mmh/openmm-8-beta-linux
The only modification I made to the environment is installing openmmtools to gain access to the alanine dipeptide system I have been using for debugging. A printout of my environment is as follows:
The script I have been trying to run is as follows:
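The original listing was not preserved here; what follows is only a minimal sketch of a script of this kind, where the trivial ForceModule, the model.pt filename, and the sum-of-squares "energy" are illustrative assumptions rather than the actual model:

```python
import torch
import openmm
from openmm import unit
from openmmtools import testsystems
from openmmtorch import TorchForce

# A trivial TorchScript model: the "energy" is just the sum of squared positions
class ForceModule(torch.nn.Module):
    def forward(self, positions):
        return torch.sum(positions ** 2)

torch.jit.script(ForceModule()).save('model.pt')

# Alanine dipeptide test system from openmmtools
ala = testsystems.AlanineDipeptideVacuum()
system = ala.system
system.addForce(TorchForce('model.pt'))

integrator = openmm.LangevinMiddleIntegrator(
    300 * unit.kelvin, 1.0 / unit.picosecond, 0.002 * unit.picoseconds)

# The segmentation fault occurs here, while the Context is being created
context = openmm.Context(system, integrator)
print('Context created successfully')
```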
As the comment indicates, a segmentation fault occurs when building the simulation. The exact cause of the error is the initialization of the context, and the segmentation fault can be triggered directly by calling
openmm.openmm.Context()
with the modified alanine system and the integrator. I have tried this approach on two different Linux systems and have run into the same segmentation fault both times. Any clarification or help would be greatly appreciated. Thank you!