Hello,
I am a new SmartSim user. I am trying to integrate the training/inference workflow into a CFD code.
The training runs fine and I was able to save a .pt model.
However, when I try to load the model for inference, the pipeline crashes (log below). The problem seems to be with the following lines:
colo_model.add_ml_model('model',
                        cfg.inference.backend,
                        model=None,  # this is used if model is in memory
                        model_path=cfg.inference.model_path,
                        device=cfg.inference.device,
                        batch_size=cfg.inference.batch,
                        min_batch_size=cfg.inference.batch,
                        devices_per_node=cfg.inference.devices_per_node,
                        inputs=None, outputs=None)
After adding the model above, I start the DB with:
# Start the co-located model
block = False if cfg.train.executable else True
print("Launching SOD2D and SmartSim co-located DB ... ")
if len(cfg.sim.copy_files) > 0 or len(cfg.sim.link_files) > 0:
    colo_model.attach_generator_files(to_copy=list(cfg.sim.copy_files), to_symlink=list(cfg.sim.link_files))
exp.generate(colo_model, overwrite=True)
exp.start(colo_model, block=block, summary=True)
#exp.summary(style="html")
print("Done\n")
Without the add_ml_model call, the DB launches without any problems.
I am not sure how to debug the problem. Any help is appreciated!
Vishal
Traceback (most recent call last):
File "/apps/ACC/MINIFORGE/24.3.0-0/lib/python3.10/site-packages/smartsim/_core/entrypoints/colocated.py", line 246, in main
launch_models(client, db_models)
File "/apps/ACC/MINIFORGE/24.3.0-0/lib/python3.10/site-packages/smartsim/_core/entrypoints/colocated.py", line 232, in launch_models
model_name = launch_db_model(client, db_model)
File "/apps/ACC/MINIFORGE/24.3.0-0/lib/python3.10/site-packages/smartsim/_core/entrypoints/colocated.py", line 109, in launch_db_model
client.set_model_from_file(
File "/gpfs/home/bsc/bsc021712/install_nrsml/SmartRedis/src/python/module/smartredis/util.py", line 155, in smartredis_api_wrapper
raise getattr(error, exception_name)(cpp_error_str, method_name) from None
smartredis.error.RedisRuntimeError: Client.set_model_from_file execution failed
File "/home/bsc/bsc021712/install_nrsml/SmartRedis/src/cpp/redis.cpp", line 728, in SmartRedis library
Redis error when executing command: ERR Could not load backend
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/apps/ACC/MINIFORGE/24.3.0-0/lib/python3.10/site-packages/smartsim/_core/entrypoints/colocated.py", line 249, in main
raise SSInternalError(
smartsim.error.errors.SSInternalError: Failed to set model or script, could not connect to database
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/apps/ACC/MINIFORGE/24.3.0-0/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/apps/ACC/MINIFORGE/24.3.0-0/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/apps/ACC/MINIFORGE/24.3.0-0/lib/python3.10/site-packages/smartsim/_core/entrypoints/colocated.py", line 335, in <module>
main(
File "/apps/ACC/MINIFORGE/24.3.0-0/lib/python3.10/site-packages/smartsim/_core/entrypoints/colocated.py", line 258, in main
raise SSInternalError("Colocated entrypoint raised an error") from e
smartsim.error.errors.SSInternalError: Colocated entrypoint raised an error
error: list of process IDs must follow -p
Usage:
ps [options]
Try 'ps --help <simple|list|output|threads|misc|all>'
or 'ps --help <s|l|o|t|m|a>'
for additional help text.
For more details see ps(1).
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[8075,1],0]
Exit code: 1
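The `ERR Could not load backend` line in the log comes from the database side, so one possible first check is whether a RedisAI backend library matching `cfg.inference.backend` exists in the installation at all. The sketch below is only an illustration under stated assumptions: `find_redisai_backends` is a hypothetical helper, it assumes backend libraries are named `redisai_<backend>.so` (for example `redisai_torch.so`), and the directory to search has to be supplied by hand because its location depends on how SmartSim and RedisAI were built.

```python
from pathlib import Path


def find_redisai_backends(root):
    """Return sorted backend names found under root, e.g. ['torch'].

    Assumption: backend shared libraries are named redisai_<backend>.so.
    This only checks that the files exist, not that they can be loaded.
    """
    prefix = "redisai_"
    names = set()
    # Search recursively, since backends usually sit in nested folders.
    for lib in Path(root).rglob(prefix + "*.so"):
        names.add(lib.stem[len(prefix):])
    return sorted(names)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the directory that holds the RedisAI install.
    search_root = sys.argv[1] if len(sys.argv) > 1 else "."
    print(find_redisai_backends(search_root))
```

If the list does not contain the backend requested in `add_ml_model` (for a .pt model, presumably `torch`), the missing library would be a plausible cause of the error; if it does, the library may still fail to load for other reasons, such as unresolved shared-library dependencies.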