This repository has been archived by the owner on May 23, 2024. It is now read-only.

[Batch Transform] TF Serving receives requests before model is loaded #171

Open
sayradley opened this issue Oct 14, 2020 · 2 comments
sayradley commented Oct 14, 2020

Describe the bug
I've noticed that the first requests (around 50) to the serving server fail with HTTP 500. Once the model is loaded, subsequent incoming requests are processed normally.

To reproduce
Run an image transform job with batching enabled on an ml.p2.xlarge instance.

Expected behavior
The TF Serving server starts receiving requests only after the model is loaded and ready.

Screenshots or logs


"POST /invocations HTTP/1.1" 500 288 "-" "Go-http-client/1.1"

ERROR:python_service:exception handling request: HTTPConnectionPool(host='localhost', port=10001): Max retries exceeded with url: /v1/models/mdl:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc2c84cab10>: Failed to establish a new connection: [Errno 111] Connection refused')) | 

ERROR:python_service:exception handling request: HTTPConnectionPool(host='localhost', port=10001): Max retries exceeded with url: /v1/models/mdl:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc2c84cab10>: Failed to establish a new connection: [Errno 111] Connection refused'))

Traceback (most recent call last):  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 160, in _new_conn    (self._dns_host, self.port), self.timeout, **extra_kw  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection    raise err  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection    sock.connect(sa)  File "/usr/local/lib/python3.7/site-packages/gevent/_socket3.py", line 335, in connect    raise error(err, strerror(err)) | Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 160, in _new_conn (self._dns_host, self.port), self.timeout, **extra_kw File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection raise err File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection sock.connect(sa) File "/usr/local/lib/python3.7/site-packages/gevent/_socket3.py", line 335, in connect raise error(err, strerror(err))

ConnectionRefusedError: [Errno 111] Connection refused | ConnectionRefusedError: [Errno 111] Connection refused

Once the following log lines show up, every request is processed correctly (200):

Successfully loaded servable version {name: mdl version: 3}

Running gRPC ModelServer at 0.0.0.0:10000 ...
Exporting HTTP/REST API at:localhost:10001 ...

"POST /invocations HTTP/1.1" 200 40669 "-" "Go-http-client/1.1"

System information
763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.3.0-gpu-py37-cu102-ubuntu18.04

Additional context
I use this piece of code to start a job:

sagemaker.create_transform_job(
      TransformJobName='test',
      ModelName=mdl,
      MaxConcurrentTransforms=64,
      BatchStrategy='MultiRecord',
      Environment={
        'SAGEMAKER_TFS_ENABLE_BATCHING': 'true',
        'SAGEMAKER_TFS_MAX_BATCH_SIZE': '256',
        'SAGEMAKER_TFS_BATCH_TIMEOUT_MICROS': '100000'
      },
      TransformInput={
        'DataSource': {
          'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://data'
          }
        },
        'ContentType': 'application/x-image',
        'CompressionType': 'None',
        'SplitType': 'None'
      },
      TransformOutput={
        'S3OutputPath': self._prepare_athena_partition(),
        'AssembleWith': 'None'
      },
      TransformResources={
        'InstanceType': 'ml.p2.xlarge',
        'InstanceCount': 1,
      },
      DataProcessing={
        'JoinSource': 'None'
      }
)
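Since the transform job runs asynchronously, the 500s above only surface in the job's output and final status. A minimal polling sketch for watching the job, assuming `client` is the same boto3 SageMaker client used as `sagemaker` above (`wait_for_transform_job` is a hypothetical helper, not a SageMaker API):

```python
import time


def wait_for_transform_job(client, job_name, poll_s=30.0):
    """Poll DescribeTransformJob until the job reaches a terminal status.

    Returns the final DescribeTransformJob response dict.
    """
    terminal = {"Completed", "Failed", "Stopped"}
    while True:
        desc = client.describe_transform_job(TransformJobName=job_name)
        if desc["TransformJobStatus"] in terminal:
            return desc
        time.sleep(poll_s)
```

On `Failed`, the response's `FailureReason` field can indicate whether the failure came from the early 500s described in this issue.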
@icywang86rui
Contributor

Our ping logic doesn't check whether the model has actually loaded; it only waits a fixed period of time for the model to load. We need to fix the deep ping logic. What's the size of your model?
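A deep ping along those lines could poll TF Serving's REST model-status endpoint (`GET /v1/models/<name>`) and hold off on accepting /invocations traffic until a servable version reports state AVAILABLE. A minimal sketch, assuming the REST port 10001 and model name `mdl` from the logs above (the helper names are illustrative, not part of the container's actual code):

```python
import json
import time
import urllib.error
import urllib.request


def model_is_available(status_json):
    """Return True if any model version in a TF Serving model-status
    response reports state AVAILABLE."""
    versions = status_json.get("model_version_status", [])
    return any(v.get("state") == "AVAILABLE" for v in versions)


def wait_for_model(name="mdl", port=10001, timeout_s=300, interval_s=2.0):
    """Poll TF Serving's model-status endpoint until the model is loaded,
    or the deadline expires. Returns True once the model is AVAILABLE."""
    url = "http://localhost:{}/v1/models/{}".format(port, name)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if model_is_available(json.load(resp)):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # TF Serving not listening yet; keep polling
        time.sleep(interval_s)
    return False
```

This avoids the fixed-wait problem: the container would only start reporting healthy after the servable is actually loaded, instead of after an arbitrary delay.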

@sayradley
Author

300 MB
