Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prevent un-necessary recalculation of RMSD matrix #20

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

tanmoy7989
Copy link
Member

all-by-all RMSD distance matrix is calculated only when sampling_precision calculated is requested. However, I have a use-case where I don't want to calculate sampling precision, but rather have prior information about the cluster threshold. I want to just cluster the models based on this threshold, and obtain the cluster precision. But model precision calculation without doing sampling precision calculation is possibly only when the Distances_Matrix.data.npy file is already created and contains the RMSD distance matrix.

So, I changed this to the RMSD distance matrix being calculated and stored in Distances_Matrix.data.npy always (irrespective of whether sampling precision calculation is skipped or not), except when the script is being re-run and that file already exists.

Copy link
Collaborator

@shruthivis shruthivis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me, can merge.

@benmwebb benmwebb closed this Nov 19, 2021
@benmwebb benmwebb reopened this Nov 19, 2021
Copy link
Member

@benmwebb benmwebb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems dangerous to me to cache the matrix file here (see inline comments).

inner_data = rmsd_calculation.get_rmsds_matrix(
conforms, args.mode, args.align, args.cores, symm_groups)
conforms, args.mode, args.align, args.cores, symm_groups)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff seems unnecessary since you don't change any logic, only whitespace. In fact, it may even cause flake8 to complain.

# afterwards (so that we retain the original IMP orientation)
numpy.save("conforms", conforms)

if not os.path.isfile("Distances_Matrix.data.npy"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the user

  • runs a simulation
  • does analysis, creating Distances_Matrix.data.npy
  • goes back and runs a new simulation
  • does a 2nd round of analysis?

Won't this file then be out of date and result in a hard-to-diagnose issue (matrix is for the first run but is used in the second run)? If so, one fix might be to stat the matrix/RMF/PDB files and only skip the matrix creation if the .npy file is newer than all RMF/PDB. Another would be to never reuse your cached matrix unless the user explicitly requests it with a --use-cache option or similar. Always better to err on the side of caution with caching.

@benmwebb benmwebb closed this Dec 16, 2021
@benmwebb benmwebb reopened this Dec 16, 2021
@benmwebb benmwebb closed this Dec 16, 2021
@benmwebb benmwebb reopened this Dec 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants