During tests with Triton running on GPU (all 4 workers using the Triton GPU backend), the following error appeared in the logs:
2024-02-14 11:55:46,577 :: MainProcess :: MainThread :: INFO :: Running logo object detection for <Product 3184670016510 | off>, image https://static.openfoodfacts.org/images/products/318/467/001/6510/3.jpg
2024-02-14 11:55:55,991 :: MainProcess :: MainThread :: ERROR :: Traceback (most recent call last):
File "/opt/pysetup/.venv/lib/python3.11/site-packages/rq/worker.py", line 1075, in perform_job
rv = job.perform()
^^^^^^^^^^^^^
File "/opt/pysetup/.venv/lib/python3.11/site-packages/rq/job.py", line 854, in perform
self._result = self._execute()
^^^^^^^^^^^^^^^
File "/opt/pysetup/.venv/lib/python3.11/site-packages/rq/job.py", line 877, in _execute
result = self.func(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/robotoff/robotoff/workers/tasks/import_image.py", line 473, in run_logo_object_detection
image_prediction: ImagePrediction = run_object_detection_model( # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/robotoff/robotoff/insights/extraction.py", line 88, in run_object_detection_model
results = ObjectDetectionModelRegistry.get(model_name.value).detect_from_image(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/robotoff/robotoff/prediction/object_detection/core.py", line 146, in detect_from_image
response = grpc_stub.ModelInfer(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pysetup/.venv/lib/python3.11/site-packages/grpc/_channel.py", line 1161, in __call__
return _end_unary_response_blocking(state, call, False, None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pysetup/.venv/lib/python3.11/site-packages/grpc/_channel.py", line 1004, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INTERNAL
details = "onnx runtime error 6: Non-zero status code returned while running FusedConv node. Name:'FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/block2/unit_4/bottleneck_v1/conv1/Conv2D' Status Message: /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 2: out of memory ; GPU=0 ; hostname=261fe7d0b952 ; file=/workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size);
"
debug_error_string = "UNKNOWN:Error received from peer ipv4:34.140.4.148:8001 {created_time:"2024-02-14T11:55:55.085875785+00:00", grpc_status:13, grpc_message:"onnx runtime error 6: Non-zero status code returned while running FusedConv node. Name:\'FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/block2/unit_4/bottleneck_v1/conv1/Conv2D\' Status Message: /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 2: out of memory ; GPU=0 ; hostname=261fe7d0b952 ; file=/workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size); \n\n"}"
>
Traceback (most recent call last):
File "/opt/pysetup/.venv/lib/python3.11/site-packages/rq/worker.py", line 1075, in perform_job
rv = job.perform()
^^^^^^^^^^^^^
File "/opt/pysetup/.venv/lib/python3.11/site-packages/rq/job.py", line 854, in perform
self._result = self._execute()
^^^^^^^^^^^^^^^
File "/opt/pysetup/.venv/lib/python3.11/site-packages/rq/job.py", line 877, in _execute
result = self.func(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/robotoff/robotoff/workers/tasks/import_image.py", line 473, in run_logo_object_detection
image_prediction: ImagePrediction = run_object_detection_model( # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/robotoff/robotoff/insights/extraction.py", line 88, in run_object_detection_model
results = ObjectDetectionModelRegistry.get(model_name.value).detect_from_image(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/robotoff/robotoff/prediction/object_detection/core.py", line 146, in detect_from_image
response = grpc_stub.ModelInfer(request)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pysetup/.venv/lib/python3.11/site-packages/grpc/_channel.py", line 1161, in __call__
return _end_unary_response_blocking(state, call, False, None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pysetup/.venv/lib/python3.11/site-packages/grpc/_channel.py", line 1004, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.INTERNAL
details = "onnx runtime error 6: Non-zero status code returned while running FusedConv node. Name:'FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/block2/unit_4/bottleneck_v1/conv1/Conv2D' Status Message: /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 2: out of memory ; GPU=0 ; hostname=261fe7d0b952 ; file=/workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size);
"
debug_error_string = "UNKNOWN:Error received from peer ipv4:34.140.4.148:8001 {created_time:"2024-02-14T11:55:55.085875785+00:00", grpc_status:13, grpc_message:"onnx runtime error 6: Non-zero status code returned while running FusedConv node. Name:\'FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/block2/unit_4/bottleneck_v1/conv1/Conv2D\' Status Message: /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 2: out of memory ; GPU=0 ; hostname=261fe7d0b952 ; file=/workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size); \n\n"}"
>
2024-02-14 11:55:56,000 :: MainProcess :: raven-sentry.BackgroundWorker :: WARNING :: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2580)')': /api/1415205/store/
2024-02-14 11:55:56,216 :: MainProcess :: MainThread :: INFO :: robotoff-high-2: robotoff.workers.tasks.import_image.run_nutrition_table_object_detection(image_url='https://static.openfoodfacts.org/images/products/318/467/001/6510/3.jpg', product_id=<Product 3184670016510 | off>) (ee052e87-5cf2-4033-855d-28c4e36d35e8)
2024-02-14 11:55:56,431 :: MainProcess :: MainThread :: INFO :: Running nutrition table object detection for <Product 3184670016510 | off>, image https://static.openfoodfacts.org/images/products/318/467/001/6510/3.jpg
This error is returned by Triton when it runs on GPU. It seems to appear only when concurrent requests hit the same model at the same time; dynamic batching may be the culprit here.
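If concurrent requests plus dynamic batching are indeed the trigger, it should be reproducible outside the RQ workers. Here is a minimal reproduction sketch (not taken from the Robotoff codebase): it assumes Triton's gRPC endpoint on port 8001, and the model name, input name, shape and dtype are placeholders to be replaced with the values from the deployed model's configuration.

```python
# Hypothesis check: fire several inference requests at the same model
# concurrently and see whether Triton returns the CUDA out-of-memory error.
# Model name, input name, shape and dtype below are placeholders.
import concurrent.futures

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException


def infer_once(i: int) -> str:
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    # Placeholder input: adjust name/shape/dtype to the deployed model.
    image = np.zeros((1, 1024, 1024, 3), dtype=np.uint8)
    infer_input = grpcclient.InferInput("inputs", list(image.shape), "UINT8")
    infer_input.set_data_from_numpy(image)
    try:
        client.infer(model_name="universal-logo-detector", inputs=[infer_input])
        return f"request {i}: ok"
    except InferenceServerException as exc:
        return f"request {i}: {exc.status()} {exc.message()}"


# 4 concurrent requests, mirroring the 4 RQ workers hitting the GPU at once.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for outcome in pool.map(infer_once, range(8)):
        print(outcome)
```

If this reproduces the out-of-memory failure, possible mitigations would be disabling dynamic batching or capping the instance count in the model configuration, or serializing GPU inference calls across the workers.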
We already wanted to switch to a more performant (and faster) object detection architecture such as YOLOv8; doing so may be a better solution than investigating this issue (which is probably an ONNX Runtime bug).
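For reference, a rough sketch of what the YOLOv8 path could look like with the ultralytics package; the generic COCO-pretrained weights below are only a stand-in (a logo detector would need its own training), and whether inference would run in-process or still be served through Triton after an ONNX export is an open question.

```python
# Sketch of YOLOv8 inference via ultralytics; "yolov8n.pt" is the generic
# COCO-pretrained checkpoint, used here purely as a placeholder.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
results = model.predict(
    "https://static.openfoodfacts.org/images/products/318/467/001/6510/3.jpg",
    conf=0.25,  # placeholder confidence threshold
)
for box in results[0].boxes:
    print(box.xyxy.tolist(), float(box.conf), int(box.cls))

# The same model could still be served by Triton after an ONNX export:
# model.export(format="onnx")
```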