This repository has been archived by the owner on May 23, 2024. It is now read-only.

[bug]: Model not loading while using existing container image to set up MME on SageMaker #170

Open
abhi1793 opened this issue Oct 8, 2020 · 1 comment

Comments


abhi1793 commented Oct 8, 2020

Checklist

Concise Description:
Getting this error when invoking an MME on SageMaker that was set up using the 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.3.0-cpu-py37-ubuntu18.04 container image.

urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=14448): Max retries exceeded with url: /v1/models/d2295a7526f9df36354b8a2c4adc4f63 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f70966dba50>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
File "/sagemaker/python_service.py", line 157, in _handle_load_model_post
self._wait_for_model(model_name)
File "/sagemaker/python_service.py", line 247, in _wait_for_model
response = session.get(url)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 546, in get
return self.request('GET', url, **kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)

DLC image/dockerfile:
763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.3.0-cpu-py37-ubuntu18.04
Current behavior:

Expected behavior:
The model should load and return a prediction.
Additional context:
I have set up an MME using the above-mentioned container and am invoking the endpoint from a Lambda function. The model files are placed in S3 in the correct directory structure, with a version number.
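For context, an invocation like the one described above would look roughly like the following. This is a hedged sketch, not the reporter's actual Lambda code: the endpoint name, payload shape, and the `target_model_key` helper are assumptions; only `TargetModel` selecting the artifact by its S3 key is how MME invocation works.

```python
import json

def target_model_key(model_id):
    # An MME selects a model by the artifact's S3 key relative to the
    # prefix configured on the SageMaker model, e.g. "<id>.tar.gz".
    # (Assumed naming convention for illustration.)
    return f"{model_id}.tar.gz"

def invoke_mme(endpoint_name, model_id, payload):
    # boto3 is imported lazily so this module also loads outside AWS.
    import boto3
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        TargetModel=target_model_key(model_id),  # picks the model on the MME
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())
```

The first invocation of a given `TargetModel` triggers the container's load-model path, which is where the `_wait_for_model` call in the traceback above fails.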

@ajaykarpur ajaykarpur transferred this issue from aws/deep-learning-containers Oct 12, 2020
@ajaykarpur
Contributor

How large is your model? After the load-model request is sent, the container waits for a period of time to ensure the model is available to the model server. If the model is very large, however, the container might not wait long enough for the model to load, causing a ConnectionError.
