This repository has been archived by the owner on May 23, 2024. It is now read-only.

[Batch Transform] TF Serving receives requests before model is loaded #171

Open
sayradley opened this issue Oct 14, 2020 · 2 comments
sayradley commented Oct 14, 2020

Describe the bug
I've noticed that the first requests (around 50) to the serving server fail with HTTP 500. Once the model is loaded, subsequent incoming requests are processed normally.

To reproduce
Run an image transform job with batching enabled on an ml.p2.xlarge instance.

Expected behavior
The TF Serving server starts receiving requests only after the model is loaded and ready.

Screenshots or logs


"POST /invocations HTTP/1.1" 500 288 "-" "Go-http-client/1.1"

ERROR:python_service:exception handling request: HTTPConnectionPool(host='localhost', port=10001): Max retries exceeded with url: /v1/models/mdl:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc2c84cab10>: Failed to establish a new connection: [Errno 111] Connection refused')) | 

ERROR:python_service:exception handling request: HTTPConnectionPool(host='localhost', port=10001): Max retries exceeded with url: /v1/models/mdl:predict (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc2c84cab10>: Failed to establish a new connection: [Errno 111] Connection refused'))

Traceback (most recent call last):  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 160, in _new_conn    (self._dns_host, self.port), self.timeout, **extra_kw  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection    raise err  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection    sock.connect(sa)  File "/usr/local/lib/python3.7/site-packages/gevent/_socket3.py", line 335, in connect    raise error(err, strerror(err)) | Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 160, in _new_conn (self._dns_host, self.port), self.timeout, **extra_kw File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection raise err File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection sock.connect(sa) File "/usr/local/lib/python3.7/site-packages/gevent/_socket3.py", line 335, in connect raise error(err, strerror(err))

ConnectionRefusedError: [Errno 111] Connection refused | ConnectionRefusedError: [Errno 111] Connection refused

Once the following log lines show up, every request is processed correctly (200):

Successfully loaded servable version {name: mdl version: 3}

Running gRPC ModelServer at 0.0.0.0:10000 ...
Exporting HTTP/REST API at:localhost:10001 ...

"POST /invocations HTTP/1.1" 200 40669 "-" "Go-http-client/1.1"

System information
763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:2.3.0-gpu-py37-cu102-ubuntu18.04

Additional context
I use this piece of code to start a job:

sagemaker.create_transform_job(
      TransformJobName='test',
      ModelName=mdl,
      MaxConcurrentTransforms=64,
      BatchStrategy='MultiRecord',
      Environment={
        'SAGEMAKER_TFS_ENABLE_BATCHING': 'true',
        'SAGEMAKER_TFS_MAX_BATCH_SIZE': '256',
        'SAGEMAKER_TFS_BATCH_TIMEOUT_MICROS': '100000'
      },
      TransformInput={
        'DataSource': {
          'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://data'
          }
        },
        'ContentType': 'application/x-image',
        'CompressionType': 'None',
        'SplitType': 'None'
      },
      TransformOutput={
        'S3OutputPath': self._prepare_athena_partition(),
        'AssembleWith': 'None'
      },
      TransformResources={
        'InstanceType': 'ml.p2.xlarge',
        'InstanceCount': 1,
      },
      DataProcessing={
        'JoinSource': 'None'
      }
)
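Since the transform job runs asynchronously, the 500s above only surface in the job's output and final status. A minimal polling sketch for watching the job, assuming `client` is the same boto3 SageMaker client used as `sagemaker` above (`wait_for_transform_job` is a hypothetical helper, not a SageMaker API):

```python
import time


def wait_for_transform_job(client, job_name, poll_s=30.0):
    """Poll DescribeTransformJob until the job reaches a terminal status.

    Returns the final DescribeTransformJob response dict.
    """
    terminal = {"Completed", "Failed", "Stopped"}
    while True:
        desc = client.describe_transform_job(TransformJobName=job_name)
        if desc["TransformJobStatus"] in terminal:
            return desc
        time.sleep(poll_s)
```

On `Failed`, the response's `FailureReason` field can indicate whether the failure came from the early 500s described in this issue.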
@icywang86rui
Contributor

Our ping logic doesn't check whether the model has actually loaded; it only waits a fixed period of time for the model to load. We need to fix the deep ping logic. What's the size of your model?
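A deep ping along those lines could poll TF Serving's REST model-status endpoint (`GET /v1/models/<name>`) and hold off on accepting /invocations traffic until a servable version reports state AVAILABLE. A minimal sketch, assuming the REST port 10001 and model name `mdl` from the logs above (the helper names are illustrative, not part of the container's actual code):

```python
import json
import time
import urllib.error
import urllib.request


def model_is_available(status_json):
    """Return True if any model version in a TF Serving model-status
    response reports state AVAILABLE."""
    versions = status_json.get("model_version_status", [])
    return any(v.get("state") == "AVAILABLE" for v in versions)


def wait_for_model(name="mdl", port=10001, timeout_s=300, interval_s=2.0):
    """Poll TF Serving's model-status endpoint until the model is loaded,
    or the deadline expires. Returns True once the model is AVAILABLE."""
    url = "http://localhost:{}/v1/models/{}".format(port, name)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if model_is_available(json.load(resp)):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # TF Serving not listening yet; keep polling
        time.sleep(interval_s)
    return False
```

This avoids the fixed-wait problem: the container would only start reporting healthy after the servable is actually loaded, instead of after an arbitrary delay.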

@sayradley
Author

300 MB
