-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prevent un-necessary recalculation of RMSD matrix #20
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me, can merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems dangerous to me to cache the matrix file here (see inline comments).
inner_data = rmsd_calculation.get_rmsds_matrix( | ||
conforms, args.mode, args.align, args.cores, symm_groups) | ||
conforms, args.mode, args.align, args.cores, symm_groups) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This diff seems unnecessary since you don't change any logic, only whitespace. In fact, it may even cause flake8 to complain.
# afterwards (so that we retain the original IMP orientation) | ||
numpy.save("conforms", conforms) | ||
|
||
if not os.path.isfile("Distances_Matrix.data.npy"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the user
- runs a simulation
- does analysis, creating
Distances_Matrix.data.npy
- goes back and runs a new simulation
- does a 2nd round of analysis?
Won't this file then be out of date and result in a hard-to-diagnose issue (matrix is for the first run but is used in the second run)? If so, one fix might be to stat
the matrix/RMF/PDB files and only skip the matrix creation if the .npy
file is newer than all RMF/PDB. Another would be to never reuse your cached matrix unless the user explicitly requests it with a --use-cache
option or similar. Always better to err on the side of caution with caching.
all-by-all RMSD distance matrix is calculated only when sampling_precision calculated is requested. However, I have a use-case where I don't want to calculate sampling precision, but rather have prior information about the cluster threshold. I want to just cluster the models based on this threshold, and obtain the cluster precision. But model precision calculation without doing sampling precision calculation is possibly only when the
Distances_Matrix.data.npy
file is already created and contains the RMSD distance matrix.So, I changed this to the RMSD distance matrix being calculated and stored in
Distances_Matrix.data.npy
always (irrespective of whether sampling precision calculation is skipped or not), except when the script is being re-run and that file already exists.