[BUG] Seg Fault When Deploying TF+HPS Model with merlin-tensorflow #440
Comments
@tuanavu Thanks for your feedback. We have decoupled/reorganized the third-party dependencies that HPS relies on after 23.06. Since all HPS/SOK related libraries are pre-installed, there is no need to set LD_PRELOAD. If you must set custom library file paths, it is recommended to use LD_LIBRARY_PATH instead (see the sketch below). FYI @EmmaQiaoCh @bashimao, please add your comments about the third-party dependency reorganization.
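For illustration, a minimal sketch of the recommended approach; the custom directory below is a placeholder, not a path from this issue:

```bash
# The HPS/SOK libraries are pre-installed in merlin-tensorflow images after 23.06,
# so no LD_PRELOAD is needed. If a custom library location is genuinely required,
# extend the dynamic loader's search path instead of force-preloading a library:
export LD_LIBRARY_PATH=/opt/custom/libs:${LD_LIBRARY_PATH}
```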
Hi @yingcanw, following up on this thread: without setting LD_PRELOAD, I got this error after deploying the model. I used
And here's the
Please provide more details on which step in the notebook outputs these error messages.
Sure. Steps to reproduce the behavior:
Note that the same model can be deployed and tested successfully with
From the brief reproduction steps you provided, I still haven't figured out at which specific step you hit these errors. I can only guess that you successfully completed the model training, created and saved the inference graph, and then hit the error in the step "Deploy SavedModel using HPS with Triton TensorFlow Backend". Since we do not have the same AWS environment, we have not been able to reproduce the issue on our local machines (T4/V100, Intel CPU, Ubuntu 22.04 with the 23.12 container).
Hi @yingcanw, the issue seems to circle back to the initial problem discussed in this thread: #440 (comment). When I set
Thank you for your corrections. There is a typo here: we have upgraded Python to 3.10 since 23.08, and we need to update the notebook to modify the Triton server launch command. However, I still haven't reproduced the issue you mentioned on 23.09. I still want to emphasize the difference described in #440 (comment): users are asked not to set the LD_PRELOAD variable independently (please pay attention to the bold part of the log). LD_PRELOAD is used as an argument when launching the Triton server, rather than as an environment variable; in other words, LD_PRELOAD should not be exported as a standalone environment variable (see the sketch below). I hope the above information makes it clearer how to solve the problem.
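To make the distinction concrete, here is a hedged sketch of the two forms (the library path is the one reported below in this issue; the model repository path is a placeholder):

```bash
# Form used in the demo notebook: LD_PRELOAD is given as a per-process prefix on the
# tritonserver command line, so only the tritonserver process preloads the HPS plugin.
LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so \
  tritonserver --model-repository=/models

# Form to avoid: exporting LD_PRELOAD container-wide (e.g. via a Kubernetes env entry).
# This also injects the plugin into python, bash, and every other process in the
# container, which is what the reporter observed crashing with exit code 139.
export LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so
```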
Hi @yingcanw, quick update: I believe I have figured out the root cause of the seg fault. It appears to be related to configuring the
Steps to reproduce the behavior:
Thanks a lot for your update. I think this error output can be easily misunderstood (we have verified that if the LD_PRELOAD parameter is set, circular dependencies will cause a seg fault). But we don't have the same AWS environment to reproduce this problem.
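One way to see why a container-wide LD_PRELOAD can produce such circular/duplicate loading is to inspect the plugin's own shared-object dependencies; a small sketch, assuming the library path quoted in this issue:

```bash
# Show the libraries the HPS plugin itself links against (e.g. TensorFlow runtime
# libraries). Preloading the plugin into every process forces these to be resolved
# before the host program itself is loaded, which is where a dependency cycle can arise.
ldd /usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so
```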
Describe the bug
I've encountered a segmentation fault while deploying a TensorFlow model with Hierarchical Parameter Server (HPS), following the instructions provided in the HPS TensorFlow Triton deployment demo notebook. The issue has been consistent across Merlin-TensorFlow images from merlin-tensorflow:23.08 to merlin-tensorflow:23.12, which use Python 3.10. Note that the issue doesn't happen with merlin-tensorflow <= 23.06, which uses Python 3.8.
When deploying in a Kubernetes environment with the environment variable LD_PRELOAD set to /usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so, the Triton inference server container terminates unexpectedly with exit code 139. Trying to import the HPS library within the container also leads to a segmentation fault.
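A minimal sketch of the import-time reproduction, assuming the container described above and that the Python module name matches the package directory in the reported path:

```bash
# With LD_PRELOAD exported container-wide, even a bare import of the HPS plugin crashes.
export LD_PRELOAD=/usr/local/lib/python3.10/dist-packages/merlin_hps-1.0.0-py3.10-linux-x86_64.egg/hierarchical_parameter_server/lib/libhierarchical_parameter_server.so
python3 -c "import hierarchical_parameter_server"
echo $?   # 139 = 128 + SIGSEGV(11), matching the container exit code reported below
```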
Error logs
Error: container triton terminated with exit code 139.
To Reproduce
Steps to reproduce the behavior:
Use the merlin-tensorflow:23.09 image and follow the deployment steps outlined in the HPS TensorFlow Triton deployment demo notebook to export the inference graph with HPS.
Environment (please complete the following information):
Additional context
The error suggests there might be an incompatibility with the Python version or a problem with HPS. Any insights or solutions would be greatly appreciated.