
ModelNotHereException causing 8 retry iterations exhausted for model #523

GolanLevy opened this issue Aug 6, 2024 · 0 comments

GolanLevy commented Aug 6, 2024

Describe the bug

From time to time, our system spins out of control, throwing many ModelNotHereExceptions which eventually lead to "8 retry iterations exhausted for model" errors.

Our registration process is completely automated and is triggered by a registerModel gRPC request (instead of a yaml configuration), followed by an ensureLoaded request to validate that the registration completed successfully.
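For reference, the flow is roughly the following. This is only a minimal sketch, assuming Python stubs generated from model-mesh.proto; the module names, field names and target address are placeholders/assumptions rather than our actual code:

```python
# Minimal sketch of the registration flow described above.
# `mmesh_pb2` / `mmesh_pb2_grpc` are placeholder names for stubs generated
# from model-mesh.proto; the field names below are assumptions, not verified.
import grpc
import mmesh_pb2
import mmesh_pb2_grpc

def register_and_verify(model_id: str, model_type: str, model_path: str):
    # Placeholder address for the internal model-mesh gRPC endpoint.
    channel = grpc.insecure_channel("modelmesh-serving.modelmesh:8033")
    stub = mmesh_pb2_grpc.ModelMeshStub(channel)

    # 1) Register the model via gRPC (no yaml / Predictor CR involved).
    stub.registerModel(mmesh_pb2.RegisterModelRequest(
        modelId=model_id,
        modelInfo=mmesh_pb2.ModelInfo(type=model_type, path=model_path),
        loadNow=True,
        sync=True,
    ))

    # 2) Validate that the registration completed and a copy is loaded.
    return stub.ensureLoaded(mmesh_pb2.EnsureLoadedRequest(
        modelId=model_id,
        sync=True,
    ))
```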

Models:
The issue is not consistent per model: a failing invocation of a model can succeed on the next try, if the request happens to be directed to a non-faulty mm pod (see the next section).

MM pods:
We have a few dozen mm pods, and the issue is very prominent in only some of them (<50%), referred to here as "faulty" pods. Faulty pods are still functioning, meaning they are able to serve, run predictions and invoke internal requests, but they have a very high error rate due to the ModelNotHereExceptions.
It looks like the faulty pods are somehow out of sync with ETCD and invoke internal requests seemingly at random.
None of the mm pods is new; all of them had been running for hours or days before the issue starts.
Note that non-faulty pods also throw these errors from time to time.

ETCD:
We do, however, suspect ETCD, since its pods were restarted (for reasons that are still unclear to us) and the faulty pods are exclusively ones that were created prior to the ETCD restart.
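To test the "out of sync with ETCD" theory, one thing we are considering is dumping model-mesh's registry keys straight from ETCD and comparing them with what the faulty pods appear to believe. A rough sketch using python-etcd3 follows; the host, port and root prefix are placeholders we would take from our etcd_connection config, and the key layout under that prefix is internal to model-mesh, so this is only for eyeballing:

```python
# Rough sketch: dump model-mesh's registry from ETCD for manual inspection.
# Requires `pip install etcd3`; host/port/prefix below are placeholders.
import etcd3

ETCD_HOST = "etcd.example.svc"   # placeholder
ETCD_PORT = 2379
ROOT_PREFIX = "/modelmesh"       # placeholder: root_prefix from our etcd_connection config

client = etcd3.client(host=ETCD_HOST, port=ETCD_PORT)

for value, meta in client.get_prefix(ROOT_PREFIX):
    # meta.key is bytes; truncate values so the dump stays readable.
    print(meta.key.decode(), "=>", value[:120])
```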

Mitigation:
The issue usually stops when there is a scale-in event, i.e. when some of the pods are terminated.
Note that a faulty pod might not itself be terminated; the errors can stop due to the termination of a different pod (perhaps the one the problematic model was loaded on).

Example:
In the attached log file, you can see that a newly registered model 4774912c is facing this issue, even though it was loaded on modelmesh-serving-triton-2.x-768448c4fb-q9564.
The external requests that hit the many faulty pods are directed to 8 pods, none of which is modelmesh-serving-triton-2.x-768448c4fb-q9564.

report.csv

As you can see, the situation is very peculiar and we are not sure how to investigate further.
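One per-pod check we are considering is port-forwarding to a single mm pod and asking it about the model over the KServe v2 gRPC API. A rough sketch, assuming Python stubs generated from KServe's grpc_predict_v2.proto (the module names are placeholders, and we are not sure whether model-mesh answers ModelReady locally or forwards it like an inference request, so this may not perfectly isolate one pod):

```python
# Sketch: ask one port-forwarded mm pod whether it considers model 4774912c ready.
# First: kubectl port-forward modelmesh-serving-triton-2.x-768448c4fb-q9564 8033:8033
# `grpc_predict_v2_pb2*` are placeholder names for stubs generated from
# KServe's grpc_predict_v2.proto.
import grpc
import grpc_predict_v2_pb2
import grpc_predict_v2_pb2_grpc

channel = grpc.insecure_channel("localhost:8033")
stub = grpc_predict_v2_pb2_grpc.GRPCInferenceServiceStub(channel)

resp = stub.ModelReady(grpc_predict_v2_pb2.ModelReadyRequest(name="4774912c"))
print("ready:", resp.ready)
```

Comparing the answer from the pod named in the log above with the answer from one of the faulty pods might help confirm where the copy actually lives, but we would appreciate guidance on whether this probe is meaningful.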
We are curious:

  1. Why does the ForwardingLB decide to direct inference requests to other pods, assuming the model is already loaded there?
  2. How should we continue this investigation?

Thanks!
