
random CUDA error when running object detection model on GPU #1315

Open
raphael0202 opened this issue Feb 16, 2024 · 0 comments
raphael0202 commented Feb 16, 2024

During tests with Triton running on GPU (all 4 workers using Triton GPU as backend), the following error appeared in the logs:

2024-02-14 11:55:46,577 :: MainProcess :: MainThread :: INFO :: Running logo object detection for <Product 3184670016510 | off>, image https://static.openfoodfacts.org/images/products/318/467/001/6510/3.jpg
2024-02-14 11:55:55,991 :: MainProcess :: MainThread :: ERROR :: Traceback (most recent call last):
  File "/opt/pysetup/.venv/lib/python3.11/site-packages/rq/worker.py", line 1075, in perform_job
    rv = job.perform()
         ^^^^^^^^^^^^^
  File "/opt/pysetup/.venv/lib/python3.11/site-packages/rq/job.py", line 854, in perform
    self._result = self._execute()
                   ^^^^^^^^^^^^^^^
  File "/opt/pysetup/.venv/lib/python3.11/site-packages/rq/job.py", line 877, in _execute
    result = self.func(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/robotoff/robotoff/workers/tasks/import_image.py", line 473, in run_logo_object_detection
    image_prediction: ImagePrediction = run_object_detection_model(  # type: ignore
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/robotoff/robotoff/insights/extraction.py", line 88, in run_object_detection_model
    results = ObjectDetectionModelRegistry.get(model_name.value).detect_from_image(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/robotoff/robotoff/prediction/object_detection/core.py", line 146, in detect_from_image
    response = grpc_stub.ModelInfer(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pysetup/.venv/lib/python3.11/site-packages/grpc/_channel.py", line 1161, in __call__
    return _end_unary_response_blocking(state, call, False, None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pysetup/.venv/lib/python3.11/site-packages/grpc/_channel.py", line 1004, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "onnx runtime error 6: Non-zero status code returned while running FusedConv node. Name:'FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/block2/unit_4/bottleneck_v1/conv1/Conv2D' Status Message: /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 2: out of memory ; GPU=0 ; hostname=261fe7d0b952 ; file=/workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size);
	
	"
	debug_error_string = "UNKNOWN:Error received from peer ipv4:34.140.4.148:8001 {created_time:"2024-02-14T11:55:55.085875785+00:00", grpc_status:13, grpc_message:"onnx runtime error 6: Non-zero status code returned while running FusedConv node. Name:\'FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/block2/unit_4/bottleneck_v1/conv1/Conv2D\' Status Message: /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 2: out of memory ; GPU=0 ; hostname=261fe7d0b952 ; file=/workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size); \n\n"}"
>
2024-02-14 11:55:56,000 :: MainProcess :: raven-sentry.BackgroundWorker :: WARNING :: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2580)')': /api/1415205/store/
2024-02-14 11:55:56,216 :: MainProcess :: MainThread :: INFO :: robotoff-high-2: robotoff.workers.tasks.import_image.run_nutrition_table_object_detection(image_url='https://static.openfoodfacts.org/images/products/318/467/001/6510/3.jpg', product_id=<Product 3184670016510 | off>) (ee052e87-5cf2-4033-855d-28c4e36d35e8)
2024-02-14 11:55:56,431 :: MainProcess :: MainThread :: INFO :: Running nutrition table object detection for <Product 3184670016510 | off>, image https://static.openfoodfacts.org/images/products/318/467/001/6510/3.jpg

This error is returned by Triton when it runs on GPU. It only seems to appear when concurrent requests are sent to the same model at the same time, so dynamic batching may be the culprit here.
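
If we keep the current setup for now, one possible stopgap is to retry the inference call on the client side when Triton reports this transient CUDA out-of-memory error. A minimal sketch, assuming the same `grpc_stub` and `request` objects used in `detect_from_image`; the helper name and the `max_attempts`/`backoff_s` parameters are hypothetical, not existing code:

```python
import time

import grpc


def model_infer_with_retry(grpc_stub, request, max_attempts=3, backoff_s=1.0):
    """Retry ModelInfer when Triton reports a transient CUDA OOM error.

    Hypothetical helper (not in the current codebase), shown only to
    illustrate a client-side mitigation.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return grpc_stub.ModelInfer(request)
        except grpc.RpcError as exc:
            transient = (
                exc.code() == grpc.StatusCode.INTERNAL
                and "out of memory" in (exc.details() or "")
            )
            if not transient or attempt == max_attempts:
                raise
            # Give the GPU a moment to free memory before retrying.
            time.sleep(backoff_s * attempt)
```

This only papers over the problem, though; reducing concurrent load on the model (or fixing the batching configuration) would be the real fix.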

We wanted to switch to a more performant (and faster) object detection architecture such as YOLOv8 anyway; doing so may be a better option than investigating this issue (which is probably an ONNX Runtime bug).
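
For context, inference with YOLOv8 via the `ultralytics` package would look roughly like the sketch below. The weights file (`yolov8n.pt`) and confidence threshold are placeholders for illustration, not a decision on the model we would actually train or deploy:

```python
from ultralytics import YOLO

# Placeholder weights and threshold, for illustration only.
model = YOLO("yolov8n.pt")
results = model(
    "https://static.openfoodfacts.org/images/products/318/467/001/6510/3.jpg",
    conf=0.5,
)

for box in results[0].boxes:
    # Bounding box coordinates, confidence score and class index.
    print(box.xyxy, box.conf, box.cls)
```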
