Add HDF5 support for trajs and model_devis #259

Merged: 17 commits, Sep 10, 2024

zjgemi (Collaborator) commented Sep 3, 2024

Summary by CodeRabbit

  • New Features

    • Introduced new optional arguments for improved data handling and multitasking capabilities.
    • Added support for HDF5 formatted data in various modules.
    • Enhanced flexibility in input handling for multiple data formats.
  • Bug Fixes

    • Improved robustness in handling validation data structures.
  • Documentation

    • Updated documentation to clarify new parameters and their intended use.

coderabbitai bot commented Sep 3, 2024

Walkthrough

The changes enhance argument handling, data processing capabilities, and flexibility across various modules of the dpgen2 package. New optional parameters are introduced to functions, enabling better configuration and support for HDF5 datasets. The logic for handling valid data and model freezing is refined, and new methods are implemented to improve data writing processes.

Changes

| Files | Change Summary |
|---|---|
| dpgen2/entrypoint/args.py | Added use_hdf5 argument to run_diffcsp_args. |
| dpgen2/entrypoint/submit.py | Introduced RunRelaxHDF5; updated make_concurrent_learning_op to include explore_config; restructured workflow_concurrent_learning for multitasking data handling. |
| dpgen2/exploration/render/traj_render.py, dpgen2/exploration/render/traj_render_lammps.py | Updated get_model_devi and get_confs methods to accept Union[List[Path], List[HDF5Dataset]] as parameters. |
| dpgen2/exploration/selector/conf_selector.py, dpgen2/exploration/selector/conf_selector_frame.py | Modified select method to accept Union[List[Path], List[HDF5Dataset]] for trajs and model_devis. |
| dpgen2/op/select_confs.py | Updated get_input_sign method to accept Artifact(Union[List[Path], HDF5Datasets]) for trajs and model_devis. |
| dpgen2/exploration/scheduler/convergence_check_stage_scheduler.py, dpgen2/exploration/scheduler/scheduler.py, dpgen2/exploration/scheduler/stage_scheduler.py | Updated plan_next_iteration method to accept Union[List[Path], List[HDF5Dataset]] for trajs. |
| pyproject.toml | Updated pydflow version from >=1.6.57 to >=1.8.88. |
| tests/op/test_run_relax.py | Added empty dictionary under "expl_config" in the OPIO constructor within testRunRelax. |

Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 687b9c5 and 0499be9.

Files selected for processing (4)
  • dpgen2/exploration/scheduler/convergence_check_stage_scheduler.py (2 hunks)
  • dpgen2/exploration/scheduler/scheduler.py (3 hunks)
  • dpgen2/exploration/scheduler/stage_scheduler.py (3 hunks)
  • dpgen2/flow/dpgen_loop.py (3 hunks)
Additional comments not posted (7)
dpgen2/exploration/scheduler/stage_scheduler.py (2)

11-12: Approved import changes.

The addition of Union and HDF5Dataset is necessary for the new functionality to handle both paths and HDF5 datasets in the trajs parameter.

Also applies to: 14-15


Line range hint 95-106: Approved method changes with a suggestion to verify integration.

The update to the trajs parameter type in plan_next_iteration enhances the method's flexibility to handle different data sources. The documentation is updated accordingly, which is good for clarity.

Please ensure that the integration of HDF5Dataset is tested thoroughly to confirm that the system handles these datasets correctly across different scenarios.

dpgen2/exploration/scheduler/convergence_check_stage_scheduler.py (2)

8-8: Approved import changes.

The addition of Union and HDF5Dataset is necessary for the new functionality to handle both paths and HDF5 datasets in the trajs parameter.

Also applies to: 14-16


74-74: Approved method changes with a suggestion to verify integration.

The update to the trajs parameter type in plan_next_iteration enhances the method's flexibility to handle different data sources. The documentation is updated accordingly, which is good for clarity.

Please ensure that the integration of HDF5Dataset is tested thoroughly to confirm that the system handles these datasets correctly across different scenarios.

dpgen2/exploration/scheduler/scheduler.py (2)

8-8: Approved import changes.

The addition of Union and HDF5Dataset is necessary for the new functionality to handle both paths and HDF5 datasets in the trajs parameter.

Also applies to: 15-17


117-117: Approved method changes with a suggestion to verify integration.

The update to the trajs parameter type in plan_next_iteration enhances the method's flexibility to handle different data sources. The documentation is updated accordingly, which is good for clarity.

Please ensure that the integration of HDF5Dataset is tested thoroughly to confirm that the system handles these datasets correctly across different scenarios.

Also applies to: 126-126

dpgen2/flow/dpgen_loop.py (1)

96-96: Type annotation change approved, verify handling of new type.

The change in type annotation for the trajs key in the get_input_sign method is approved as it enhances flexibility. However, it's crucial to verify that the rest of the codebase correctly handles the new HDF5Datasets type to ensure compatibility.

Run the following script to verify the handling of the new type:

Verification successful

Handling of HDF5Datasets type is consistent across the codebase.

The codebase shows consistent handling of the trajs variable with the new type annotation Union[List[Path], HDF5Datasets]. The changes are integrated correctly, and the code is prepared to handle both types effectively.

  • Instances in dpgen2/exploration/selector/conf_selector_frame.py and other files confirm the correct handling of the new type.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify handling of the new HDF5Datasets type in the codebase.

# Test: Search for usage of `trajs` and check if it handles both types correctly. Expect: Proper handling of both List[Path] and HDF5Datasets.
rg --type python -A 5 $'trajs'

Length of output: 60


Script:

#!/bin/bash
# Description: Verify handling of the new HDF5Datasets type in the codebase.

# Test: Search for usage of `trajs` and check if it handles both types correctly. Expect: Proper handling of both List[Path] and HDF5Datasets.
rg --type py -A 5 $'trajs'

Length of output: 47826



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Outside diff range, codebase verification and nitpick comments (3)
dpgen2/op/__init__.py (1)

42-42: LGTM!

The new import statement for RunRelaxHDF5 is consistent with the existing import style in the file.

Regarding the unused import warning from Ruff, it's likely a false positive in this case. Importing an entity in __init__.py allows it to be accessed directly from the package level, even if it's not used within the __init__.py file itself.

If desired, you can resolve the warning by adding RunRelaxHDF5 to the __all__ list to explicitly mark it as part of the public interface:

__all__ = [
    ..., 
    "RunRelaxHDF5",
]

However, this is not strictly necessary if the project doesn't define __all__ for other entities.

Tools
Ruff

42-42: .run_relax.RunRelaxHDF5 imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

dpgen2/exploration/render/traj_render_lammps.py (1)

55-58: Consider simplifying the if-else block using a ternary operator.

The static analysis tool suggests using a ternary operator instead of the if-else block. This can simplify the code without changing its behavior.

Apply this diff to simplify the code:

-if isinstance(fname, HDF5Dataset):
-    dd = fname.get_data()
-else:
-    dd = np.loadtxt(fname)
+dd = fname.get_data() if isinstance(fname, HDF5Dataset) else np.loadtxt(fname)
Tools
Ruff

55-58: Use ternary operator dd = fname.get_data() if isinstance(fname, HDF5Dataset) else np.loadtxt(fname) instead of if-else-block

Replace if-else-block with dd = fname.get_data() if isinstance(fname, HDF5Dataset) else np.loadtxt(fname)

(SIM108)
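The dispatch that this comment discusses can be sketched in isolation. The following is a minimal, dependency-free illustration, not the dpgen2 implementation: `FakeHDF5Dataset` is a hypothetical stand-in that mimics only the `get_data()` call seen in the diff, and `load_text_table` is a pure-Python placeholder for `np.loadtxt` so the sketch runs without NumPy.

```python
from pathlib import Path
from typing import List, Union

Table = List[List[float]]


class FakeHDF5Dataset:
    """Hypothetical stand-in for an HDF5-backed dataset (assumption).

    Only the get_data() method used in the diff is mimicked here.
    """

    def __init__(self, rows: Table) -> None:
        self._rows = rows

    def get_data(self) -> Table:
        return self._rows


def load_text_table(fname: Path) -> Table:
    """Placeholder for np.loadtxt: parse whitespace-separated floats.

    Text after '#' is treated as a comment and blank lines are skipped.
    """
    rows: Table = []
    for line in Path(fname).read_text().splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            rows.append([float(tok) for tok in line.split()])
    return rows


def load_one_model_devi(fname: Union[Path, FakeHDF5Dataset]) -> Table:
    # Same shape as the suggested one-liner: pick the loader by input type.
    return (
        fname.get_data()
        if isinstance(fname, FakeHDF5Dataset)
        else load_text_table(fname)
    )


def load_all(sources: List[Union[Path, FakeHDF5Dataset]]) -> List[Table]:
    # Load every source with the same type-based dispatch.
    return [load_one_model_devi(s) for s in sources]
```

Whether the conditional expression or the original if-else block reads better is a style choice; both perform the same type check and call the same loader.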

dpgen2/op/run_lmp.py (1)

296-318: LGTM with a nitpick!

The freeze_model function implementation looks good.

Improve the error message.

Consider providing more context in the error message to help with debugging.

Apply this diff to improve the error message:

 def freeze_model(input_model, frozen_model, head=None):
     freeze_args = "-o %s" % frozen_model
     if head is not None:
         freeze_args += " --head %s" % head
     freeze_cmd = "dp --pt freeze -c %s %s" % (input_model, freeze_args)
     ret, out, err = run_command(freeze_cmd, shell=True)
     if ret != 0:
         logging.error(
             "".join(
                 (
                     "freeze failed\n",
-                    "command was",
+                    "command was: ",
                     freeze_cmd,
-                    "out msg",
+                    "\nout msg: ",
                     out,
                     "\n",
-                    "err msg",
+                    "err msg: ",
                     err,
                     "\n",
                 )
             )
         )
-        raise TransientError("freeze failed")
+        raise TransientError(f"Failed to freeze model {input_model} with command: {freeze_cmd}")
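As a rough sketch of the same idea, the command assembly and the error message can be split into two small helpers that are easy to test without running `dp`. The helper names `build_freeze_command` and `format_freeze_failure` are hypothetical, and quoting the arguments with `shlex.quote` is an addition that is not part of the diff above; only the `dp --pt freeze -c ... -o ... --head ...` shape comes from the reviewed code.

```python
import shlex
from typing import Optional


def build_freeze_command(
    input_model: str, frozen_model: str, head: Optional[str] = None
) -> str:
    """Assemble the freeze command string (hypothetical helper).

    shlex.quote is an assumption added here so that paths containing
    spaces survive shell=True; the reviewed code does not quote.
    """
    freeze_args = "-o %s" % shlex.quote(frozen_model)
    if head is not None:
        freeze_args += " --head %s" % shlex.quote(head)
    return "dp --pt freeze -c %s %s" % (shlex.quote(input_model), freeze_args)


def format_freeze_failure(cmd: str, out: str, err: str) -> str:
    """Build a log message with one labelled field per line."""
    return "\n".join(
        (
            "freeze failed",
            "command was: " + cmd,
            "out msg: " + out,
            "err msg: " + err,
        )
    )
```

Joining labelled fields with newlines avoids the run-together output that the original `"".join(...)` produces when the labels carry no separators.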

dpgen2/exploration/selector/conf_selector.py
@@ -1,3 +1,4 @@
import logging

Remove unused import.

The logging module is imported but not used in the code. Please remove it.

-import logging
Tools
Ruff

1-1: logging imported but unused

Remove unused import: logging

(F401)

Comment on lines 222 to 228
@staticmethod
def normalize_config(data={}):
    ta = RunRelax.relax_args()
    base = Argument("base", dict, ta)
    data = base.normalize_value(data, trim_pattern="_*")
    base.check_value(data, strict=False)
    return data

Replace the mutable default argument with None.

Using a mutable default argument can lead to unexpected behavior. Please replace it with None and initialize it within the function.

-def normalize_config(data={}):
+def normalize_config(data=None):
+    if data is None:
+        data = {}
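The pitfall behind this suggestion can be shown with a small, generic example. The two functions below are hypothetical and unrelated to `normalize_config` itself. Whether `normalize_value` actually mutates its input is not stated in this thread, so this only illustrates the general risk that rule B006 guards against.

```python
from typing import Dict, Optional


def count_shared(data: Dict[str, int] = {}) -> Dict[str, int]:
    # The default dict is created once, when the function is defined,
    # and the same object is reused by every call that omits `data`.
    data["calls"] = data.get("calls", 0) + 1
    return data


def count_fresh(data: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    # A new dict is created on each call that omits `data`.
    if data is None:
        data = {}
    data["calls"] = data.get("calls", 0) + 1
    return data
```

Calling `count_shared()` twice returns the same dict object with an accumulating counter, while each `count_fresh()` call starts from an empty dict.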
Suggested change
 @staticmethod
-def normalize_config(data={}):
+def normalize_config(data=None):
+    if data is None:
+        data = {}
     ta = RunRelax.relax_args()
     base = Argument("base", dict, ta)
     data = base.normalize_value(data, trim_pattern="_*")
     base.check_value(data, strict=False)
     return data
Tools
Ruff

223-223: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

Comment on lines 222 to 224
Argument(
    "model_frozen_head", str, optional=True, default=None, doc=doc_head
),

Update the argument documentation.

The argument has been renamed from head to model_frozen_head, but the documentation still refers to the old name.

Apply this diff to update the documentation:

-doc_head = "Select a head from multitask"
+doc_model_frozen_head = "Select a head from the multitask model to freeze"
 return [
     Argument("command", str, optional=True, default="lmp", doc=doc_lmp_cmd),
     Argument(
         "teacher_model_path",
         [BinaryFileInput, str],
         optional=True,
         default=None,
         doc=doc_teacher_model,
     ),
     Argument(
         "shuffle_models",
         bool,
         optional=True,
         default=False,
         doc=doc_shuffle_models,
     ),
     Argument(
-        "model_frozen_head", str, optional=True, default=None, doc=doc_head
+        "model_frozen_head", str, optional=True, default=None, doc=doc_model_frozen_head
     ),
 ]

Committable suggestion was skipped due to low confidence.

codecov bot commented Sep 3, 2024

Codecov Report

Attention: Patch coverage is 78.37838% with 8 lines in your changes missing coverage. Please review.

Project coverage is 83.65%. Comparing base (ce4ab3e) to head (0499be9).
Report is 13 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| dpgen2/op/run_relax.py | 71.42% | 6 Missing ⚠️ |
| dpgen2/exploration/render/traj_render_lammps.py | 77.77% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #259      +/-   ##
==========================================
- Coverage   83.70%   83.65%   -0.05%     
==========================================
  Files         104      104              
  Lines        5958     5990      +32     
==========================================
+ Hits         4987     5011      +24     
- Misses        971      979       +8     

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range, codebase verification and nitpick comments (1)
dpgen2/exploration/render/traj_render_lammps.py (1)

55-58: Consider using a ternary operator for conciseness.

The if-else block can be replaced with a ternary operator to make the code more concise without changing the functionality.

Apply this diff to refactor the code:

-if isinstance(fname, HDF5Dataset):
-    dd = fname.get_data()
-else:
-    dd = np.loadtxt(fname)
+dd = fname.get_data() if isinstance(fname, HDF5Dataset) else np.loadtxt(fname)
Tools
Ruff

55-58: Use ternary operator dd = fname.get_data() if isinstance(fname, HDF5Dataset) else np.loadtxt(fname) instead of if-else-block

Replace if-else-block with dd = fname.get_data() if isinstance(fname, HDF5Dataset) else np.loadtxt(fname)

(SIM108)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Outside diff range, codebase verification and nitpick comments (1)
dpgen2/exploration/render/traj_render_lammps.py (1)

62-65: LGTM! Consider using a ternary operator for conciseness.

The change to the logic for loading data allows the _load_one_model_devi method to handle HDF5 datasets directly, improving its capability to work with different data formats.

As suggested by Ruff, consider replacing the if-else block with a ternary operator for conciseness:

-if isinstance(fname, HDF5Dataset):
-    dd = fname.get_data()
-else:
-    dd = np.loadtxt(fname)
+dd = fname.get_data() if isinstance(fname, HDF5Dataset) else np.loadtxt(fname)
Tools
Ruff

62-65: Use ternary operator dd = fname.get_data() if isinstance(fname, HDF5Dataset) else np.loadtxt(fname) instead of if-else-block

Replace if-else-block with dd = fname.get_data() if isinstance(fname, HDF5Dataset) else np.loadtxt(fname)

(SIM108)

@@ -10,9 +10,13 @@
    Optional,
    Set,
    Tuple,
    Union,
)

import dpdata

Remove unused import.

The dpdata module is imported but unused.

Apply this diff to remove the unused import:

-import dpdata
Tools
Ruff

16-16: dpdata imported but unused

Remove unused import: dpdata

(F401)

Comment on lines 15 to 16
import dpdata
import numpy as np
@coderabbitai coderabbitai bot Sep 4, 2024

Remove unused imports.

The static analysis tool suggests that the following imports are unused:

  • dpdata (line 15)
  • numpy (line 16)

Please remove these unused imports to keep the code clean.

Tools
Ruff

15-15: dpdata imported but unused

Remove unused import: dpdata

(F401)


16-16: numpy imported but unused

Remove unused import: numpy

(F401)

Member

It seems Ruff hasn't been enabled in dpgen2.

@njzjz, you're correct that Ruff hasn't been enabled in the dpgen2 repository. If you want to enforce linting rules like removing unused imports, you might consider enabling Ruff in the future. For now, you can disregard the suggestion to remove the unused imports.

@wanghan-iapcm wanghan-iapcm left a comment

Could you please also support run_lmp, which seems to be straightforward.

zjgemi (Collaborator, Author) commented Sep 5, 2024

> Could you please also support run_lmp, which seems to be straightforward.

Sure.

zjgemi (Collaborator, Author) commented Sep 5, 2024

> Could you please also support run_lmp, which seems to be straightforward.

I realize that for run_lmp, each task outputs only a single trajectory and a single model_devi file. Since the outputs of each task must be stored in a separate file, merging them into an HDF5 file would bring little benefit.

On the other hand, in HDF5 mode, users cannot conveniently preview file contents in the UI. That is why HDF5 mode is not enabled by default and is intended for cases where a performance bottleneck is encountered.

@wanghan-iapcm wanghan-iapcm merged commit 3501db4 into deepmodeling:master Sep 10, 2024
9 checks passed