
random CUDA error when running object detection model on GPU #1315

Open
raphael0202 opened this issue Feb 16, 2024 · 0 comments
raphael0202 commented Feb 16, 2024

During tests with Triton running on GPU (all 4 workers using Triton GPU as backend), the following error appeared in the logs:

2024-02-14 11:55:46,577 :: MainProcess :: MainThread :: INFO :: Running logo object detection for <Product 3184670016510 | off>, image https://static.openfoodfacts.org/images/products/318/467/001/6510/3.jpg
2024-02-14 11:55:55,991 :: MainProcess :: MainThread :: ERROR :: Traceback (most recent call last):
  File "/opt/pysetup/.venv/lib/python3.11/site-packages/rq/worker.py", line 1075, in perform_job
    rv = job.perform()
         ^^^^^^^^^^^^^
  File "/opt/pysetup/.venv/lib/python3.11/site-packages/rq/job.py", line 854, in perform
    self._result = self._execute()
                   ^^^^^^^^^^^^^^^
  File "/opt/pysetup/.venv/lib/python3.11/site-packages/rq/job.py", line 877, in _execute
    result = self.func(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/robotoff/robotoff/workers/tasks/import_image.py", line 473, in run_logo_object_detection
    image_prediction: ImagePrediction = run_object_detection_model(  # type: ignore
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/robotoff/robotoff/insights/extraction.py", line 88, in run_object_detection_model
    results = ObjectDetectionModelRegistry.get(model_name.value).detect_from_image(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/robotoff/robotoff/prediction/object_detection/core.py", line 146, in detect_from_image
    response = grpc_stub.ModelInfer(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pysetup/.venv/lib/python3.11/site-packages/grpc/_channel.py", line 1161, in __call__
    return _end_unary_response_blocking(state, call, False, None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pysetup/.venv/lib/python3.11/site-packages/grpc/_channel.py", line 1004, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "onnx runtime error 6: Non-zero status code returned while running FusedConv node. Name:'FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/block2/unit_4/bottleneck_v1/conv1/Conv2D' Status Message: /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 2: out of memory ; GPU=0 ; hostname=261fe7d0b952 ; file=/workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size);
	
	"
	debug_error_string = "UNKNOWN:Error received from peer ipv4:34.140.4.148:8001 {created_time:"2024-02-14T11:55:55.085875785+00:00", grpc_status:13, grpc_message:"onnx runtime error 6: Non-zero status code returned while running FusedConv node. Name:\'FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/block2/unit_4/bottleneck_v1/conv1/Conv2D\' Status Message: /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 2: out of memory ; GPU=0 ; hostname=261fe7d0b952 ; file=/workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size); \n\n"}"
>
2024-02-14 11:55:56,000 :: MainProcess :: raven-sentry.BackgroundWorker :: WARNING :: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2580)')': /api/1415205/store/
2024-02-14 11:55:56,216 :: MainProcess :: MainThread :: INFO :: robotoff-high-2: robotoff.workers.tasks.import_image.run_nutrition_table_object_detection(image_url='https://static.openfoodfacts.org/images/products/318/467/001/6510/3.jpg', product_id=<Product 3184670016510 | off>) (ee052e87-5cf2-4033-855d-28c4e36d35e8)
2024-02-14 11:55:56,431 :: MainProcess :: MainThread :: INFO :: Running nutrition table object detection for <Product 3184670016510 | off>, image https://static.openfoodfacts.org/images/products/318/467/001/6510/3.jpg

This error is returned by Triton when it runs on GPU. It only seems to appear when concurrent requests are sent to the same model at the same time, so dynamic batching may be the culprit here.
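
If we keep the current setup for now, one possible stopgap is to retry the inference call on the client side when Triton reports this transient CUDA out-of-memory error. A minimal sketch, assuming the same `grpc_stub` and `request` objects used in `detect_from_image`; the helper name and the `max_attempts`/`backoff_s` parameters are hypothetical, not existing code:

```python
import time

import grpc


def model_infer_with_retry(grpc_stub, request, max_attempts=3, backoff_s=1.0):
    """Retry ModelInfer when Triton reports a transient CUDA OOM error.

    Hypothetical helper (not in the current codebase), shown only to
    illustrate a client-side mitigation.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return grpc_stub.ModelInfer(request)
        except grpc.RpcError as exc:
            transient = (
                exc.code() == grpc.StatusCode.INTERNAL
                and "out of memory" in (exc.details() or "")
            )
            if not transient or attempt == max_attempts:
                raise
            # Give the GPU a moment to free memory before retrying.
            time.sleep(backoff_s * attempt)
```

This only papers over the problem, though; reducing concurrent load on the model (or fixing the batching configuration) would be the real fix.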

We wanted to switch to a more performant (and faster) object detection architecture such as YOLOv8 anyway; doing so may be a better option than investigating this issue (which is probably an ONNX Runtime bug).
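
For context, inference with YOLOv8 via the `ultralytics` package would look roughly like the sketch below. The weights file (`yolov8n.pt`) and confidence threshold are placeholders for illustration, not a decision on the model we would actually train or deploy:

```python
from ultralytics import YOLO

# Placeholder weights and threshold, for illustration only.
model = YOLO("yolov8n.pt")
results = model(
    "https://static.openfoodfacts.org/images/products/318/467/001/6510/3.jpg",
    conf=0.5,
)

for box in results[0].boxes:
    # Bounding box coordinates, confidence score and class index.
    print(box.xyxy, box.conf, box.cls)
```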
