prevent un-necessary recalculation of RMSD matrix #20

tanmoy7989 · 2021-11-19T08:52:15Z

The all-by-all RMSD distance matrix is calculated only when the sampling precision calculation is requested. However, I have a use case where I don't want to calculate sampling precision, because I already have prior information about the cluster threshold. I just want to cluster the models based on this threshold and obtain the cluster precision. But the model precision calculation without the sampling precision calculation is possible only when the Distances_Matrix.data.npy file has already been created and contains the RMSD distance matrix.

So, I changed this so that the RMSD distance matrix is always calculated and stored in Distances_Matrix.data.npy (irrespective of whether the sampling precision calculation is skipped), except when the script is being re-run and that file already exists.

shruthivis

Looks fine to me, can merge.

benmwebb

It seems dangerous to me to cache the matrix file here (see inline comments).

benmwebb · 2021-12-02T20:59:52Z

pyext/src/exhaust.py

        inner_data = rmsd_calculation.get_rmsds_matrix(
-                conforms, args.mode, args.align, args.cores, symm_groups)
+            conforms, args.mode, args.align, args.cores, symm_groups)


This diff seems unnecessary since you don't change any logic, only whitespace. In fact, it may even cause flake8 to complain.

benmwebb · 2021-12-02T21:01:47Z

pyext/src/exhaust.py

+    # afterwards (so that we retain the original IMP orientation)
+    numpy.save("conforms", conforms)
+
+    if not os.path.isfile("Distances_Matrix.data.npy"):


What happens if the user

1. runs a simulation
2. does analysis, creating Distances_Matrix.data.npy
3. goes back and runs a new simulation
4. does a 2nd round of analysis?

Won't this file then be out of date and result in a hard-to-diagnose issue (matrix is for the first run but is used in the second run)? If so, one fix might be to stat the matrix/RMF/PDB files and only skip the matrix creation if the .npy file is newer than all RMF/PDB. Another would be to never reuse your cached matrix unless the user explicitly requests it with a --use-cache option or similar. Always better to err on the side of caution with caching.
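For illustration, the two suggestions in this comment could be sketched roughly as follows. This is a minimal sketch and not code from the PR: the helper names `matrix_is_fresh`, `should_reuse_matrix`, and `build_parser` are hypothetical, the `--use-cache` flag is the reviewer's proposed name rather than an existing option, and requiring both checks at once is an assumption made here for extra caution (the comment presents them as alternatives).

```python
import argparse
from pathlib import Path


def matrix_is_fresh(matrix_path, input_paths):
    """Return True only if the cached matrix is newer than every input file.

    Hypothetical helper: input_paths stands for the RMF/PDB files that the
    matrix was computed from.
    """
    matrix = Path(matrix_path)
    if not matrix.is_file():
        return False
    inputs = [Path(p) for p in input_paths]
    # With no inputs, or a missing input, the cache cannot be vouched for.
    if not inputs or any(not p.is_file() for p in inputs):
        return False
    matrix_mtime = matrix.stat().st_mtime
    return all(p.stat().st_mtime < matrix_mtime for p in inputs)


def should_reuse_matrix(matrix_path, input_paths, use_cache=False):
    """Reuse the cache only on explicit request, and only if it is fresh."""
    return use_cache and matrix_is_fresh(matrix_path, input_paths)


def build_parser():
    """Parser carrying only the proposed (hypothetical) opt-in flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--use-cache", action="store_true",
        help="reuse an existing Distances_Matrix.data.npy if it is "
             "newer than all input files")
    return parser
```

With helpers like these, the existing `os.path.isfile("Distances_Matrix.data.npy")` test would be replaced by a call such as `should_reuse_matrix("Distances_Matrix.data.npy", input_files, args.use_cache)`, where `input_files` is whatever list of RMF/PDB paths the script already has. Note that modification times are only a heuristic: files copied or restored with preserved timestamps can defeat the check, which is one reason to keep the explicit opt-in as well.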

prevent un-necessary recalculation of RMSD matrix

3362820

tanmoy7989 requested review from benmwebb and shruthivis November 19, 2021 08:52

shruthivis approved these changes Nov 19, 2021

benmwebb closed this Nov 19, 2021

benmwebb reopened this Nov 19, 2021

benmwebb requested changes Dec 2, 2021

benmwebb closed this Dec 16, 2021

benmwebb reopened this Dec 16, 2021

benmwebb closed this Dec 16, 2021

benmwebb reopened this Dec 16, 2021
