
Fix CMake not adding some tests #104

Merged 7 commits into openmm:master on Jul 24, 2023

Conversation

@RaulPPelaez (Contributor)

Solves #98

@RaulPPelaez (Contributor, Author)

Now the CI is picking up all tests.
@mikemhenry I suspect if you try the GPU runner the OpenCL tests will fail, could you try?

@mikemhenry (Collaborator)

Will do!

@mikemhenry (Collaborator)

@RaulPPelaez good call with that 20 min timeout; it got stuck in a loop: https://github.com/openmm/NNPOps/actions/runs/5085521894/jobs/9139106717

@RaulPPelaez (Contributor, Author) commented on May 26, 2023

It does not seem stuck in a loop to me; it seems like it just takes too long. The waterbox test does take a long time on the CPU. See one of the CPU CI runs: https://github.com/openmm/NNPOps/actions/runs/5078935616/jobs/9124055531?pr=104
It takes 24 minutes (mainly installing CUDA, I guess?); the one that does not install CUDA takes 10 minutes.
Could it be that the CPU in the AWS runner is too slow?
Maybe the GPU runner could filter and run just the GPU tests? EDIT: Not sure how to achieve this.
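
One possible way to do that filtering, as a sketch only (this is not what the PR implements): since deviceString is a parametrize value, "cuda" appears in the generated test IDs, so pytest's -k keyword filter can select or exclude those cases.

```python
# Sketch: select tests by keyword, relying on 'cuda' appearing in the
# parametrized test IDs (e.g. test_model_serialization[1hvj-cuda]).
# Equivalent to `pytest test -k cuda` on the command line.
import pytest

pytest.main(["test", "-k", "cuda"])        # GPU runner: CUDA cases only
# pytest.main(["test", "-k", "not cuda"])  # CPU runner: everything else
```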

@RaulPPelaez (Contributor, Author)

Also, no idea where this error comes from and why it's only there in the GPU runner:

8: deviceString = 'cpu', molFile = '1hvj'
8: 
8:     @pytest.mark.parametrize('deviceString', ['cpu', 'cuda'])
8:     @pytest.mark.parametrize('molFile', ['1hvj', '1hvk', '2iuz', '3hkw', '3hky', '3lka', '3o99'])
8:     def test_model_serialization(deviceString, molFile):
8:     
8:         if deviceString == 'cuda' and not torch.cuda.is_available():
8:             pytest.skip('CUDA is not available')
8:     
8:         from NNPOps.EnergyShifter import TorchANIEnergyShifter
8:     
8:         device = torch.device(deviceString)
8:     
8: >       mol = mdtraj.load(os.path.join(molecules, f'{molFile}_ligand.mol2'))
8: 
8: test/TestEnergyShifter.py:80: 
8: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
8: ../3/envs/nnpops/lib/python3.10/site-packages/mdtraj/core/trajectory.py:396: in load
8:     kwargs["top"] = _parse_topology(kwargs.get("top", filename_or_filenames[0]), **topkwargs)
8: ../3/envs/nnpops/lib/python3.10/site-packages/mdtraj/core/trajectory.py:181: in _parse_topology
8:     topology = load_mol2(top, **kwargs).topology
8: ../3/envs/nnpops/lib/python3.10/site-packages/mdtraj/formats/mol2.py:91: in load_mol2
8:     atoms, bonds = mol2_to_dataframes(filename)
8: ../3/envs/nnpops/lib/python3.10/site-packages/mdtraj/formats/mol2.py:200: in mol2_to_dataframes
8:     data = dict((key, list(grp)) for key, grp in itertools.groupby(f, _parse_mol2_sections))
8: ../3/envs/nnpops/lib/python3.10/site-packages/mdtraj/formats/mol2.py:200: in <genexpr>
8:     data = dict((key, list(grp)) for key, grp in itertools.groupby(f, _parse_mol2_sections))
8: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
8: 
8: self = <encodings.ascii.IncrementalDecoder object at 0x7fb6956f8520>
8: input = b'@<TRIPOS>MOLECULE\n    A78\n  115   118     0     0     0\nSMALL\nUSER_CHARGES\n\n\n@<TRIPOS>ATOM\n      1 C1       ....8968   -1.7691   16.1457 hc         1 A78      0.039867\n    104 H47         9.7154   -5.7103   17.7810 hc         1 '
8: final = False
8: 
8:     def decode(self, input, final=False):
8: >       return codecs.ascii_decode(input, self.errors)[0]
8: E       UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 217: ordinal not in range(128)
8: 
8: ../3/envs/nnpops/lib/python3.10/encodings/ascii.py:26: UnicodeDecodeError
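
A quick, illustrative way to locate the offending bytes (a hypothetical diagnostic, not part of the CI):

```python
# Scan the mol2 file for bytes outside the ASCII range and print each
# offset with some surrounding context; the traceback above reports
# byte 0xc2 at position 217.
with open("1hvj_ligand.mol2", "rb") as f:
    data = f.read()
for i, b in enumerate(data):
    if b > 0x7F:
        print(f"offset {i}: byte 0x{b:02x}, context {data[max(0, i - 20):i + 20]!r}")
```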

@mikemhenry (Collaborator)

I've done this before with pytest + decorators. I'll try bumping up the timeout first. I thought it was stuck in a loop since I wasn't looking closely and it seemed to keep running the same pytest tests, but now I see they are separate invocations.

Re: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 217: ordinal not in range(128)

It does seem to be consistent. It shows up only for test_model_serialization[1hvj-cpu] and test_model_serialization[1hvj-cuda] in TestBatchedNN.py and TestEnergyShifter.py. I will see if it still shows up when I bump the timeout.

I will also look at running only the CUDA versions of the tests on the GPU runner, since that will save time and money, but I want to get everything passing first.

Thank you for your patience on this!

@mikemhenry (Collaborator)

11: FAILED test/TestSymmetryFunctions.py::test_model_serialization[1hvj-cpu] - UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 217: ordinal not in range(128)
11: FAILED test/TestSymmetryFunctions.py::test_model_serialization[1hvj-cuda] - UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 217: ordinal not in range(128)
11: FAILED test/TestSymmetryFunctions.py::test_non_default_stream[1hvj] - UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 217: ordinal not in range(128)
10: FAILED test/TestSpeciesConverter.py::test_model_serialization[1hvj-cpu] - UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 217: ordinal not in range(128)
10: FAILED test/TestSpeciesConverter.py::test_model_serialization[1hvj-cuda] - UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 217: ordinal not in range(128)
8: FAILED test/TestEnergyShifter.py::test_model_serialization[1hvj-cpu] - UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 217: ordinal not in range(128)
8: FAILED test/TestEnergyShifter.py::test_model_serialization[1hvj-cuda] - UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 217: ordinal not in range(128)
5: FAILED test/TestBatchedNN.py::test_model_serialization[1hvj-cpu] - UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 217: ordinal not in range(128)
5: FAILED test/TestBatchedNN.py::test_model_serialization[1hvj-cuda] - UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 217: ordinal not in range(128)

Looks like it is all the same file... maybe there is something weird with it? I'll see if I can reproduce on AWS.

@mikemhenry (Collaborator)

There is something weird with the file:

(nnpops) [ec2-user@ip-10-0-142-194 build]$ file ../src/pytorch/molecules/1hvk_ligand.mol2
../src/pytorch/molecules/1hvk_ligand.mol2: ASCII text
(nnpops) [ec2-user@ip-10-0-142-194 build]$ file ../src/pytorch/molecules/1hvj_ligand.mol2
../src/pytorch/molecules/1hvj_ligand.mol2: data

@mikemhenry (Collaborator)

Viewing the file, it does have junk in it:
[screenshot omitted]
It looks weird on GitHub too:
[screenshot omitted]

@RaulPPelaez @raimis
Where did this come from?

@RaulPPelaez (Contributor, Author)

> Thank you for your patience on this!

What are you saying! Thank you, man!

> Where did this come from?

I have no idea.
Looking at the mol2 spec, I see no reason why non-ASCII characters should be there. Seems like some formatting error to me.
I ran `tr -cd '\11\12\15\40-\176'` on the file; let's try now.
What really surprises me is that, thus far, the test just happily ate the non-ASCII characters.
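
For reference, a minimal Python equivalent of that tr filter (file name assumed); it keeps only tab, LF, CR, and printable ASCII:

```python
# Mirror `tr -cd '\11\12\15\40-\176'`: keep tab (\11), LF (\12), CR (\15)
# and the printable ASCII range \40-\176, dropping everything else.
keep = set(b"\t\n\r") | set(range(0x20, 0x7F))
with open("1hvj_ligand.mol2", "rb") as f:
    data = f.read()
with open("1hvj_ligand.mol2", "wb") as f:
    f.write(bytes(b for b in data if b in keep))
```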

@RaulPPelaez (Contributor, Author)

There are other strange characters in that file:

12 N22 3.1777 2.0550 10.9450 z| 1 A78 -0.557900

Seems to me like that | should not be there. Maybe @raimis can tell us more...
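
For context, splitting that record by whitespace shows the `|` landing in the atom-type column of the `@<TRIPOS>ATOM` layout (atom_id, atom_name, x, y, z, atom_type, subst_id, subst_name, charge). Illustrative only:

```python
# The suspect record, split into the standard mol2 ATOM fields.
record = "12 N22 3.1777 2.0550 10.9450 z| 1 A78 -0.557900"
atom_id, atom_name, x, y, z, atom_type, subst_id, subst_name, charge = record.split()
print(atom_type)  # -> 'z|'; a plain type without the '|' would be expected
```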

@sef43 (Contributor) commented on May 29, 2023

> What really surprises me is that, thus far, the test just happily ate the non-ASCII characters.

Ah, I have seen this error before, but it went away with a different MDTraj version, so I didn't investigate further.

@mikemhenry (Collaborator)

I haven't really looked into how the tests are using this file, so perhaps the z| atom type doesn't matter. Since this is an Amazon Linux image, it is possible the default locale (which controls the default text encoding) is something like C instead of UTF-8, which is why it barfs here and not on a modern Linux desktop OS. I can probably hack around it (well, setting a UTF-8 locale isn't really a hack), but I think it would be better for the file to get fixed.
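
A quick way to check that theory, as a sketch: Python's default text encoding follows the locale, so under a C locale open() decodes as ASCII and trips over the 0xc2 byte, while under a UTF-8 locale the same read succeeds.

```python
# Under LANG=C this typically prints 'ANSI_X3.4-1968' (ASCII) and the read
# below fails exactly like the CI traceback; under a UTF-8 locale it
# prints 'UTF-8' and the read succeeds.
import locale

print(locale.getpreferredencoding())
with open("1hvj_ligand.mol2") as f:  # open() uses the locale's preferred encoding
    f.read()
```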

@mikemhenry (Collaborator)

Testing here: https://github.com/openmm/NNPOps/actions/runs/5469785425

@RaulPPelaez (Contributor, Author)

Seems like the issue with the non-ASCII file was fixed. All tests pass in your GPU runner!

@mikemhenry (Collaborator)

So this should be good to merge in. Any objections?

@mikemhenry (Collaborator)

Requested some reviewers

@RaulPPelaez (Contributor, Author)

One of the CI jobs fails to install CUDA with "no space left on device"; it happens sometimes, and I do not think we can do anything about it. I will trigger a rerun.

@RaulPPelaez (Contributor, Author)

Success! Please review again, @peastman.

@peastman (Member) commented on Jul 7, 2023

Looks good as far as I can tell.

@mikemhenry (Collaborator)

Sweet, I will re-run the GPU tests to make sure we are good, and if they pass I will get this merged in!

@RaulPPelaez (Contributor, Author)

GPU tests passed too. Let's merge!
cc @raimis @peastman

@raimis merged commit 5e2438d into openmm:master on Jul 24, 2023.
@raimis mentioned this pull request on Jul 24, 2023.