Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Develop a unified class for water structure and dynamics #9

Open
wants to merge 6 commits into
base: devel
Choose a base branch
from

Conversation

lucasdekam
Copy link
Contributor

@lucasdekam lucasdekam commented Feb 25, 2025

This PR contains a new module analysis.py with a class WaterAnalysis that replaces the old WaterStructure classes. It analyzes the water structure, but also collects data to compute the dipole autocorrelation and water survival probability. The water survival probability calculation takes only seconds, whereas the waterdynamics-based SP class takes minutes.

I also added a function for automatically identifying surface indices, utils.guess_surface_indices.

Comparison to previous code

  • Survival probability
    image

  • Angle distribution
    image

  • count_in_region: also verified with calc_sel-water

  • Water structure profiles
    image

New features
Dipole autocorrelation, quite confident of correctness because the vector autocorrelation was also tested with the VACF code

image

Future tasks

  • Cleaning up the waterstructure.py and waterdynamics.py modules, I left the old code there for comparison. The files contain a bit too much code now. Maybe also move hbonds to a separate file
  • Writing unit tests (can do in the future)
  • Maybe add MSD in z-direction

Let me know if you want/need the comparison notebook, I wasn't sure whether to place it in the tests folder.

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features
    • Introduced a comprehensive analysis module for confined water simulations, offering insights into water density, orientation, angular distribution, dipole behavior, and survival metrics.
    • Added utilities for precise region segmentation and surface identification to enhance analysis accuracy and streamline simulation data evaluation.
    • New functions for calculating vector autocorrelation and survival probability added to improve analysis capabilities.
    • Enhanced functionality for analyzing water structure properties with additional functions for density and orientation profiles, and counting water molecules in specified regions.

* analyze time-averaged water structure, water dipole autocorrelation, and survival probability within the same class
* class collects data, then user can choose which analyses to perform by running methods
* methods interface to general functions in modules (waterstructure and waterdynamics)
* can be extended to include z-MSD
* utils now includes a function to automatically identify surface indices
Copy link

coderabbitai bot commented Feb 25, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The pull request introduces a new class WaterAnalysis for analyzing atomistic simulations of confined water, extending AnalysisBase. It includes methods for processing trajectory data and computing various properties of water molecules. Additionally, two utility functions are added for generating region masks and identifying surface indices. New functions for calculating vector autocorrelation and survival probability are introduced, along with several functions for analyzing water structure properties, including density and orientation profiles. Import statements are updated, and detailed documentation is added across the modules.

Changes

File Change Summary
WatAnalysis/analysis.py Added class WaterAnalysis (inheriting from AnalysisBase) with methods: run, density_profile, orientation_profile, costheta_profile, count_in_region, angular_distribution, dipole_autocorrelation, and survival_probability; internal methods _prepare, _single_frame, and _conclude added for frame processing.
WatAnalysis/utils.py Added functions get_region_masks and guess_surface_indices to generate boolean masks and identify surface atom indices; updated imports with Tuple, Atoms, and get_layers.
WatAnalysis/waterdynamics.py Introduced functions calc_vector_autocorrelation and calc_survival_probability to compute, respectively, the autocorrelation of vectors and the survival probability over lag times; updated imports and added a docstring.
WatAnalysis/waterstructure.py Added functions calc_density_profile, calc_orientation_profile, calc_costheta_profile, count_water_in_region, and calc_angular_distribution for computing water density, orientation, cosine profiles, counting molecules, and angular distributions in specified regions.

Sequence Diagram(s)

sequenceDiagram
    participant U as User
    participant WA as WaterAnalysis
    participant T as Trajectory Data

    U->>WA: Initialize analysis with simulation parameters
    WA->>WA: _prepare() - Set up analysis environment
    loop Process each frame
        WA->>T: Retrieve frame data
        WA->>WA: _single_frame(frame) - Process current frame
    end
    WA->>WA: _conclude() - Finalize calculations and average results
    WA->>U: Return computed analysis results
Loading

Possibly related PRs

Poem

I'm a rabbit hopping through the code,
Splashes of water, simulations in load.
With each frame, new secrets unfold,
Analysis so fresh, robust and bold.
Hoppy cheers to the codes we've bestowed! 🐰💧


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (3)
WatAnalysis/utils.py (2)

166-199: Clarify argument shape constraints in docstring.

While this function effectively creates boolean masks based on given intervals, it would be helpful to explicitly state in the docstring or via type hints that:

  1. z_coords must have shape (n_frames, n_atoms) or (n_atoms,) if operating on a single frame.
  2. z1 and z2 must each be one-dimensional arrays with the same length as z_coords’s first dimension (if multi-frame).

This helps future maintainers avoid shape mismatch mistakes.


201-251: Add safeguards to prevent infinite loops.

The while _crosses_z_boundary(atoms[atoms.symbols == element]): loop translates the slab by +1 Å each iteration until it no longer crosses the boundary. Although this approach works under typical slab configurations, a pathological case could cause the loop never to exit (e.g., extremely large or malformed cell dimensions). Consider:

  • Adding a maximum iteration count as a fallback.
  • Logging or printing the number of iterations to give users feedback if the loop runs too often.

This will make the function more robust.

WatAnalysis/waterstructure.py (1)

2-5: Expand the docstring to reflect new functionalities
These lines reference "time-averaged water structure properties" but do not clarify the additional angle/dipole autocorrelation features now offered by the module. Consider updating the docstring to briefly mention the new orientation and angular distribution analyses introduced below.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5515d88 and 7b34d1d.

📒 Files selected for processing (4)
  • WatAnalysis/analysis.py (1 hunks)
  • WatAnalysis/utils.py (2 hunks)
  • WatAnalysis/waterdynamics.py (1 hunks)
  • WatAnalysis/waterstructure.py (2 hunks)
🔇 Additional comments (6)
WatAnalysis/utils.py (1)

2-8: Use consistent import style and minimize redundancy where possible.

All these imports look correct and are commonly used in the subsequent code. If you plan to expand this module further, consider grouping imports from the same library together. For now, this is fine.

WatAnalysis/analysis.py (2)

74-102: Confirm that surf_ids is not None and is valid.

If surf_ids is missing or incorrectly shaped, downstream logic (e.g., accessing surf_ids[0] in _single_frame) may fail. Consider raising an exception or providing a fallback to compute surf_ids automatically, such as via utils.guess_surface_indices, to improve robustness.


195-196: Validate assumption about “valid dipoles.”

When only_valid_dipoles=True, you filter out NaN values in cos_theta to identify valid water molecules. Ensure there are no corner cases where cos_theta might be NaN for other reasons (e.g., a transient geometry issue) that do not necessarily mean “invalid water.” You may want to confirm that each molecule truly has exactly 2 hydrogens.

WatAnalysis/waterdynamics.py (1)

75-129: Validate indexing logic in the loop.

The nested loop:

for k in range(tau):
    intersection *= mask[k : -tau + k : step]
intersection *= mask[tau::step]

could risk out-of-bounds errors if tau approaches the total number of frames, or it might unintentionally double-count the final slice. Ensure that the step-based slicing is correct for your intended definition of “continuous presence” in the region.

WatAnalysis/waterstructure.py (2)

64-114: Check bin calculation to avoid potential zero or negative bins
When (z_surf[1] - z_surf[0]) < dz, the computed bin count might be zero or negative, causing a runtime warning or error. It would be prudent to validate these parameters or handle the scenario more gracefully.

You could, for instance, add a small check:

 if (z_surf[1] - z_surf[0]) <= dz:
+    raise ValueError("Bin width dz is too large or surfaces are too close.")
 counts, bin_edges = np.histogram(...)

226-263: Logic looks good for counting water molecules
This function correctly combines masks for the lower and upper surfaces. The code is structured and readable, with meaningful variable names.

Comment on lines +13 to +73
def calc_vector_autocorrelation(
max_tau: int,
delta_tau: int,
step: int,
vectors: np.ndarray,
mask: np.ndarray,
):
"""
Calculate the autocorrelation function for a vector quantity over time.

Parameters
----------
max_tau : int
Maximum lag time to calculate ACF for
delta_tau : int
Time interval between lag times (points on the C(tau) vs. tau curve)
step : int
Step size for time origins. If equal to max_tau, there is no overlap between
time windows considered in the calculation (so more uncorrelated).
vectors : numpy.ndarray
Array of vectors with shape (num_timesteps, num_particles, 3)
mask : numpy.ndarray
Boolean mask array indicating which particles to include, shape
(num_timesteps, num_particles)

Returns
-------
tau : numpy.ndarray
Array of lag times
acf : numpy.ndarray
Normalized autocorrelation function values for each lag time
"""
tau = np.arange(start=0, stop=max_tau, step=delta_tau)
acf = np.zeros(tau.shape)
mask = np.expand_dims(mask, axis=2)

# Calculate ACF for each lag time
for i, t in enumerate(tau):
n_selected_vectors = None
if t == 0:
# For t=0, just calculate the dot product with itself
dot_products = np.sum(
vectors * vectors * mask, axis=2
) # Shape: (num_timesteps, num_molecules)
n_selected_vectors = np.count_nonzero(mask)
else:
# For t > 0, calculate the dot products between shifted arrays
_vectors_0 = vectors[:-t:step] * mask[:-t:step] # dipole(t=0)
_vectors_t = vectors[t::step] * mask[t::step] # dipole(t=tau)
dot_products = np.sum(
_vectors_0 * _vectors_t, axis=2
) # Shape: ((num_timesteps - t)//step, num_molecules)
n_selected_vectors = np.count_nonzero(mask[:-t:step] * mask[t::step])

# Average over molecules and time origins
acf[i] = np.sum(dot_products) / n_selected_vectors

# Normalize the ACF
acf /= acf[0] # Normalize by the zero-lag value
return tau, acf

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Handle potential division by zero for ACF normalization.

When normalizing by acf[0], if no vectors are selected or if the dot product is zero at t=0, acf[0] will be 0, leading to NaN. Consider adding a check before normalization:

 if acf[0] == 0:
     # Handle edge case, e.g., set acf[...] = np.nan or skip normalization
 acf /= acf[0]

Committable suggestion skipped: line range outside the PR's diff.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
WatAnalysis/waterstructure.py (2)

64-71: Possible integer rounding for bin count.

When computing bins=int((z_surf[1] - z_surf[0]) / dz), floating-point inaccuracies may cause off-by-one issues. Consider rounding with math.floor or max(1, int(...)) to avoid an empty bin scenario if (z_surf[1] - z_surf[0]) < dz.


200-224: Validate zero-vector cases in calc_costheta_profile.

Like the dipole normalization concern elsewhere, NaN values can arise if norms are zero. Although you filter out np.isnan, consider systematically handling zero-length vectors or logging warnings for molecules missing hydrogens.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b34d1d and 3c0029d.

📒 Files selected for processing (2)
  • WatAnalysis/analysis.py (1 hunks)
  • WatAnalysis/waterstructure.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
WatAnalysis/waterstructure.py (1)
Learnt from: lucasdekam
PR: ChiahsinChu/WatAnalysis#9
File: WatAnalysis/waterstructure.py:170-224
Timestamp: 2025-02-25T16:12:43.768Z
Learning: The cosine theta profile in water analysis is an odd function across surfaces, meaning the proper symmetrization formula is (f(x) - f(-x))/2 which is implemented as `avg_cos_theta = (avg_cos_theta - avg_cos_theta[::-1]) / 2` in the code. This differs from even functions like density profiles which use addition for symmetrization.
🔇 Additional comments (4)
WatAnalysis/analysis.py (3)

74-92: Consider validating surf_ids and related atom indices.

If surf_ids is None or empty, or if the selected surfaces contain few atoms, the references in _single_frame may trigger unexpected behavior or indexing errors. You might add a check that raises a helpful exception if these indices are invalid.


95-101: Handle potential empty or invalid output from identify_water_molecules.

If the system lacks proper water molecules or if detection fails, self.water_dict could be empty. You may want to guard against downstream usage of an empty dictionary, e.g., by logging a warning or raising an exception.


183-210: Check consistency of only_valid_dipoles usage.

only_valid_dipoles filters out NaN entries in cos_theta, which is derived directly from normalized dipoles. Ensure it aligns with your analysis goals when dealing with incomplete or broken water geometries (sometimes H is missing). If partial water molecules can appear in your system, consider alternative filtering or logging.

WatAnalysis/waterstructure.py (1)

165-167: Confirm that subtractive symmetrization is intentional for an odd function.

The use of (rho_cos_theta - rho_cos_theta[::-1]) / 2 is valid if you are explicitly modeling an odd function across the surface centerline. If instead you intend to average symmetrical data (an even function), use addition instead. Verify the conceptual basis here to ensure correctness.

Comment on lines +143 to +155
# Calculate dipoles and project on self.axis
dipole = waterstructure.calc_water_dipoles(
coords_hydrogen,
coords_oxygen,
self.water_dict,
box=ts_box,
mic=self.min_vector,
)
dipole /= np.linalg.norm(dipole, axis=-1, keepdims=True)
np.copyto(self.results.dipoles[self._frame_index], dipole)
cos_theta = dipole[:, self.axis]
np.copyto(self.results.cos_theta[self._frame_index], cos_theta)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Prevent division by zero in normalized dipole calculation.

When calling dipole /= np.linalg.norm(dipole, axis=-1, keepdims=True), water molecules missing atoms (or incorrectly detected) might yield zero vectors. This produces NaNs or raises warnings. Consider adding a small epsilon check or filtering out zero-length vectors to avoid potential runtime issues.

 dipole_lengths = np.linalg.norm(dipole, axis=-1, keepdims=True)
+nonzero_mask = dipole_lengths[:,0] > 1e-8
+dipole[nonzero_mask] /= dipole_lengths[nonzero_mask]

Committable suggestion skipped: line range outside the PR's diff.

also remove spurious attribute self.save_velocities
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
WatAnalysis/waterdynamics.py (2)

70-71: 🛠️ Refactor suggestion

Address division by zero for ACF normalization.

Directly normalizing by acf[0] can cause division by zero if acf[0] equals zero. This was flagged in a past review as well.

-acf /= acf[0]  # Normalize by the zero-lag value
+if acf[0] != 0:
+    acf /= acf[0]
+else:
+    # Handle edge case: set acf[...] = np.nan or skip normalization

132-132: 🛠️ Refactor suggestion

Avoid dividing by zero for survival probability normalization.

As with the autocorrelation, normalizing with:

acf /= acf[0]

risks division by zero if acf[0] is zero. Consider the same fix as above to handle this edge case.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3c0029d and ffc648a.

📒 Files selected for processing (3)
  • WatAnalysis/analysis.py (1 hunks)
  • WatAnalysis/waterdynamics.py (1 hunks)
  • WatAnalysis/waterstructure.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
WatAnalysis/waterstructure.py (1)
Learnt from: lucasdekam
PR: ChiahsinChu/WatAnalysis#9
File: WatAnalysis/waterstructure.py:170-224
Timestamp: 2025-02-25T16:12:43.768Z
Learning: The cosine theta profile in water analysis is an odd function across surfaces, meaning the proper symmetrization formula is (f(x) - f(-x))/2 which is implemented as `avg_cos_theta = (avg_cos_theta - avg_cos_theta[::-1]) / 2` in the code. This differs from even functions like density profiles which use addition for symmetrization.
🔇 Additional comments (4)
WatAnalysis/analysis.py (2)

149-150: Guard against zero dipole norm.

If a water molecule’s two O–H vectors collapse to a zero vector (e.g. incomplete or incorrectly assigned water), normalizing the dipole at lines 149–150:

dipole /= np.linalg.norm(dipole, axis=-1, keepdims=True)

can produce NaNs or inf. Consider adding an epsilon or filtering zero-length vectors before normalization.


19-456: Overall implementation looks solid.

The WaterAnalysis class is cohesive and well-documented, with clear methods for density, orientation, dipole autocorrelation, and survival probability. No major concerns beyond the zero-vector edge case. Good job!

WatAnalysis/waterstructure.py (2)

292-293: Verify angle inputs for arccos.

If floating-point rounding pushes cos_theta[mask_lo] or -cos_theta[mask_hi] slightly outside [-1, 1], np.arccos can produce NaNs or raise warnings. Consider clamping the values to the valid range:

val = np.clip(val, -1.0, 1.0)

before calling arccos.


64-304: All new water structure methods appear correctly integrated.

These five functions (calc_density_profile, calc_orientation_profile, calc_costheta_profile, count_water_in_region, and calc_angular_distribution) align well with the code in analysis.py. Symmetrization for odd/even functions is clearly documented, and overall logic is consistent with the user’s design. LGTM!

Comment on lines +67 to +68
# Average over molecules and time origins
acf[i] = np.sum(dot_products) / n_selected_vectors
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Handle potential zero division in n_selected_vectors.

When t == 0 or if the boolean mask is empty for some reason, n_selected_vectors may end up being zero. This will raise an error or produce NaNs at line 68:

acf[i] = np.sum(dot_products) / n_selected_vectors

Consider adding a safety check or skipping the iteration if n_selected_vectors == 0.

# N(t,tau), shape: (num_timesteps - tau, )
n_t_tau = np.sum(intersection, axis=1)

acf[i] = np.mean(n_t_tau / n_t)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Prevent potential division by zero when computing survival probability.

In line 127:

acf[i] = np.mean(n_t_tau / n_t)

If n_t is zero, a division by zero occurs or NaNs are produced. Add a guard check for empty or zero denominators.

@lucasdekam lucasdekam changed the base branch from master to devel February 27, 2025 13:36
Added a notebook by accident
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant