Make OptimizedTorchANI robust to changes to device between calls. #113
Conversation
Ensure position and species are on the same device in OptimizedTorchANI
I'm not sure this is the right fix. I think the real problem is in OpenMM-Torch, and this just masks the symptoms. TorchForce used to store the name of the file containing the model. When you created a Context (and therefore a TorchForceImpl), it would load the file to create a module. openmm/openmm-torch#97 changed it to make TorchForce directly store the module. I think the correct solution is to make TorchForceImpl clone the model. That will ensure that every Context again has its own independent copy.
I think it will require the torch model code to handle the devices as done by @RaulPPelaez here. How would it work for a simple pure PyTorch model, e.g.:

```python
import torch as pt
from openmm import Context, LocalEnergyMinimizer, Platform, System, VerletIntegrator
from openmmtorch import TorchForce

scale = 1.0e10
platform = "CUDA"
device = "cuda"

class Model(pt.nn.Module):
    def __init__(self, scale, device):
        super().__init__()
        self.device = device
        self.scale = scale
        self.r0 = pt.tensor([0.0, 0.0, 0.0], device=device)

    def forward(self, positions):
        positions = positions.to(self.device)  # <- without this line it will not work
        return self.scale * pt.sum(positions - self.r0)**2

model = pt.jit.script(Model(scale, device))
force = TorchForce(model)

system = System()
system.addForce(force)
for _ in range(2):
    system.addParticle(1)

platform = Platform.getPlatformByName(platform)
context = Context(system, VerletIntegrator(1), platform)
context.setPositions([[0, 0, 0], [1, 0, 0]])

LocalEnergyMinimizer.minimize(context)
```

Typically you have to define the device for some tensors in the constructor, and in the forward method you then expect the positions to be on the same device. If they instead arrive on CPU rather than CUDA, because LocalEnergyMinimizer has detected that the forces are large, the forward method needs code that copies all the tensors onto the same device. Or is there a different way to do this without the explicit copy in forward?
@RaulPPelaez, I think @peastman is right. Each context should have a separate copy of the Torch module, so it can be initialized once on a specific device and never changes. This will ensure the isolation of the contexts, and NNPOps won't need to handle device changes.
@sef43 the module shouldn't have explicit device assignments. Rather, you create parameters and/or buffers so that PyTorch can move them to the right device. OpenMM-Torch already uses that mechanism (https://github.com/openmm/openmm-torch/blob/e9f2ae24f00138740ee6683ea4ccd476c268c183/platforms/cuda/src/CudaTorchKernels.cpp#L78).
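As a minimal sketch of that buffer mechanism (illustrative only, not code from the PR), the toy model above can be written without any hard-coded device: `r0` is registered as a buffer, so moving the module also moves it.

```python
import torch as pt

class Model(pt.nn.Module):
    def __init__(self, scale: float):
        super().__init__()
        self.scale = scale
        # Register r0 as a buffer instead of hard-coding a device:
        # Module.to(device) then moves it together with the rest of the module.
        self.register_buffer("r0", pt.tensor([0.0, 0.0, 0.0]))

    def forward(self, positions):
        # OpenMM-Torch moves the whole module (including r0) to the platform's
        # device, so r0 and the incoming positions normally live on the same device.
        return self.scale * pt.sum(positions - self.r0) ** 2

model = pt.jit.script(Model(1.0e10))
# OpenMM-Torch does the equivalent of this when it sets up the kernel:
# model.to("cuda")
```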
I believe I am missing something about the issue @peastman is describing. Would it not be enough to make TorchForce::getModule return a freshly loaded copy of the module?

```cpp
const torch::jit::Module TorchForce::getModule() const {
    std::stringstream output_stream;
    this->module.save(output_stream);
    return torch::jit::load(output_stream);
}
```

This way TorchForceImpl::initialize gets a just-loaded module each time:

```cpp
void TorchForceImpl::initialize(ContextImpl& context) {
    auto module = owner.getModule();
    // Create the kernel.
    kernel = context.getPlatform().createKernel(CalcTorchForceKernel::Name(), context);
    kernel.getAs<CalcTorchForceKernel>().initialize(context.getSystem(), owner, module);
}
```

As far as I understand, this is equivalent to the behavior of TorchForceImpl before openmm/openmm-torch#97.

EDIT: I made a mistake; the fix above does actually fix the error, and it makes sense to me why.
I opened openmm/openmm-torch#116 with the fix suggested by @peastman. Hence, while this PR does make SymmetryFunctions robust to changing devices, I am not sure it is worth merging.
Solves #112.
AFAIK, the error in #112 comes from OpenMM changing the location of just the positions, so that NNPOps ends up being fed tensors on two different devices further down the line.
The original reproducer by @raimis is solved by simply sending the positions to the same device as the tensor with the atomic numbers (which always stays on the same device). I did this by modifying OptimizedTorchANI:
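The gist of that change, as a rough sketch (the wrapper `DeviceAligningANI` and the `ToyModel` below are made up for illustration; the actual PR modifies OptimizedTorchANI.forward directly):

```python
import torch

class DeviceAligningANI(torch.nn.Module):
    """Hypothetical wrapper showing the idea: move the positions to the device of
    the species tensor, which keeps the device it was created on."""

    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def forward(self, species: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        positions = positions.to(species.device)  # the core of the fix
        return self.model(species, positions)

class ToyModel(torch.nn.Module):
    """Stand-in for the real ANI model; just sums the positions."""
    def forward(self, species: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        return positions.sum()

model = DeviceAligningANI(ToyModel())
species = torch.tensor([[1, 6]])                  # stays on one device
positions = torch.zeros((1, 2, 3), device="cpu")  # may arrive on a different device
print(model(species, positions))                  # no device-mismatch error
```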
In this PR, I also restructured SymmetryFunctions a bit. Instead of creating one implementation and storing it for the duration of the execution, the class now holds a map from devices to implementations, creating or fetching the necessary one according to where the positions are stored (see the sketch below).
I did this because the positions (and only the positions) suddenly changing device leaves us with an ambiguous decision. In other words, when the check that positions and species are on the same device fails, should OptimizedTorchANI:
1. Move the positions back to the device the model was initialized on, or
2. Follow the positions, requiring every component to handle inputs whose device can change?
In the first case, we simply move the positions back to the original device when required.
In the second case, we must ensure every component can handle inputs whose device changes between calls.
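A rough Python sketch of the per-device map mentioned above (the names `PerDeviceImpl` and `SymmetryFunctionsSketch` are invented for this illustration; the real code is NNPOps' SymmetryFunctions):

```python
import torch
from typing import Dict

class PerDeviceImpl:
    """Hypothetical stand-in for a device-specific symmetry-function implementation."""
    def __init__(self, device: torch.device):
        self.device = device

    def compute(self, positions: torch.Tensor) -> torch.Tensor:
        return positions.sum()  # placeholder computation

class SymmetryFunctionsSketch:
    """Keeps a map from device to implementation and creates or fetches the right
    one based on where the positions currently live."""

    def __init__(self):
        self._impls: Dict[torch.device, PerDeviceImpl] = {}

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        device = positions.device
        if device not in self._impls:   # lazily create one implementation per device
            self._impls[device] = PerDeviceImpl(device)
        return self._impls[device].compute(positions)

sf = SymmetryFunctionsSketch()
print(sf.forward(torch.zeros((2, 3))))  # a CPU call creates and uses a CPU implementation
```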
OTOH, this makes me think: OpenMM suddenly changing the device of the positions without correctly informing NNPOps (perhaps by calling model.to(device)?) sounds to me like a bug in either OpenMM or OpenMM-Torch.
Finally I also applied the formatter.