How can I optimize multi-batch and parallel inference in TensorRT for faster performance on high-resolution image patches? #4241

Open
Nizam-Drongo-Space opened this issue Nov 7, 2024 · 2 comments
Labels: Performance (general performance issues), triaged (issue has been triaged by maintainers)


@Nizam-Drongo-Space

Description

I am encountering performance bottlenecks while running multi-threaded inference on high-resolution images using TensorRT. The model involves breaking the image into patches to manage GPU memory, performing inference on each patch, and then merging the results. However, the inference time per patch is still high, even when increasing the batch size. Additionally, loading multiple engines onto the GPU to parallelize the inference does not yield the expected speedup. I am seeking advice on optimizing the inference process for faster execution, either by improving batch processing or enabling better parallelism in TensorRT.
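For context, a simplified sketch of the split/merge logic is shown below; the 256x256 patch size matches the profiling notes further down, while the non-overlapping tiling and NumPy layout are illustrative simplifications, not the exact production code:

import numpy as np

PATCH = 256  # illustrative patch size (the profiling below uses 256x256 patches)

def split_into_patches(image):
    # Pad an HxWxC image up to a multiple of PATCH and cut it into non-overlapping tiles
    h, w, c = image.shape
    ph = -(-h // PATCH) * PATCH
    pw = -(-w // PATCH) * PATCH
    padded = np.zeros((ph, pw, c), dtype=image.dtype)
    padded[:h, :w] = image
    patches, coords = [], []
    for y in range(0, ph, PATCH):
        for x in range(0, pw, PATCH):
            patches.append(padded[y:y + PATCH, x:x + PATCH])
            coords.append((y, x))
    return np.stack(patches), coords, (h, w)

def merge_patches(patches, coords, orig_hw):
    # Stitch per-patch outputs back into a full-resolution result
    h, w = orig_hw
    ph = max(y for y, _ in coords) + PATCH
    pw = max(x for _, x in coords) + PATCH
    out = np.zeros((ph, pw, patches.shape[-1]), dtype=patches.dtype)
    for patch, (y, x) in zip(patches, coords):
        out[y:y + PATCH, x:x + PATCH] = patch
    return out[:h, :w]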

Environment

TensorRT Version: 10.5.0
GPU Type: RTX 3050 Ti (4 GB)
Nvidia Driver Version: 535.183.01
CUDA Version: 12.2
CUDNN Version: N/A
Operating System + Version: Ubuntu 20.04
Python Version: 3.11
TensorFlow Version: N/A
PyTorch Version: N/A
Baremetal or Container (if container, which image + tag): Baremetal

Relevant Files

build_engine.py

import os
import tensorrt as trt

# min_batch, max_batch and shape (the optimal (batch, C, H, W) input shape) are
# expected to be defined at module level before build_engine() is called.

def build_engine(onnx_file_path, engine_file_path):
    logger = trt.Logger(trt.Logger.ERROR)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    profile = builder.create_optimization_profile()
    config = builder.create_builder_config()
    parser = trt.OnnxParser(network, logger)

    if not os.path.exists(onnx_file_path):
        print("Failed finding ONNX file!")
        return
    print("Succeeded finding ONNX file!")

    # Parse the ONNX model into the TensorRT network definition
    with open(onnx_file_path, 'rb') as model:
        if not parser.parse(model.read()):
            print('Failed parsing the ONNX file')
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return
    print('Completed parsing of ONNX file')

    # Configure a dynamic-batch optimization profile: min/opt/max share the same
    # C, H, W but allow batch sizes from min_batch to max_batch
    input_tensor = network.get_input(0)
    profile.set_shape(input_tensor.name,
                      (min_batch, shape[1], shape[2], shape[3]),
                      shape,
                      (max_batch, shape[1], shape[2], shape[3]))
    config.add_optimization_profile(profile)

    # Build and serialize the engine
    engine_string = builder.build_serialized_network(network, config)
    if engine_string is None:
        print("Failed building engine!")
        return
    print("Succeeded building engine!")

    with open(engine_file_path, "wb") as f:
        f.write(engine_string)
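The function above reads min_batch, max_batch and shape as module-level globals; a hedged example of how it might be invoked (the values and file names below are placeholders, not taken from the actual setup):

# Placeholder configuration -- the real batch range and model paths are not shown above.
min_batch, max_batch = 1, 8
shape = (4, 3, 256, 256)  # optimal (batch, C, H, W) used for profile.set_shape()

build_engine("model.onnx", "model.engine")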

inference.py

import numpy as np
import tensorrt as trt
from cuda import cudart  # cuda-python bindings


class TRTModel:
    def __init__(self, trt_path):
        self.trt_path = trt_path
        trt.init_libnvinfer_plugins(None, "")
        self.logger = trt.Logger(trt.Logger.ERROR)
        with open(self.trt_path, "rb") as f:
            engine_data = f.read()
        self.engine = trt.Runtime(self.logger).deserialize_cuda_engine(engine_data)

    def create_execution_context(self):
        return self.engine.create_execution_context()

    def process_async(self, input_data):
        # A new CUDA stream, execution context and device buffers are created on
        # every call; input_data is expected to be a torch tensor (see .cpu() below)
        _, stream = cudart.cudaStreamCreate()
        context = self.create_execution_context()

        # The output buffer is assumed to have the same size as the input
        input_size = input_data.nbytes
        output_size = input_data.nbytes

        input_device = cudart.cudaMallocAsync(input_size, stream)[1]
        output_device = cudart.cudaMallocAsync(output_size, stream)[1]

        input_data_np = input_data.cpu().numpy()

        # Host-to-device copy of the input patch
        cudart.cudaMemcpyAsync(input_device, input_data_np.ctypes.data, input_data.nbytes,
                               cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, stream)

        # Bind the I/O tensors ('images' / 'output') and launch inference
        context.set_tensor_address('images', int(input_device))
        context.set_tensor_address('output', int(output_device))
        context.execute_async_v3(stream_handle=int(stream))

        # Device-to-host copy of the result, then wait for the stream to finish
        output_host = np.empty_like(input_data_np, dtype=np.float32)
        cudart.cudaMemcpyAsync(output_host.ctypes.data, output_device, output_host.nbytes,
                               cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, stream)
        cudart.cudaStreamSynchronize(stream)

        cudart.cudaFree(input_device)
        cudart.cudaFree(output_device)
        cudart.cudaStreamDestroy(stream)

        return output_host
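For reference, a minimal way to drive the class above on a single patch; the engine path is a placeholder, and a torch input is assumed because process_async calls input_data.cpu():

import torch

model = TRTModel("model.engine")  # placeholder engine path

# One float32 patch with an assumed layout of (batch, C, H, W) = (1, 3, 256, 256)
patch = torch.rand(1, 3, 256, 256, dtype=torch.float32)
result = model.process_async(patch)  # returns a NumPy array shaped like the input
print(result.shape, result.dtype)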

Steps To Reproduce

  1. Build the Engine: Use build_engine to convert an ONNX model into a TensorRT engine.
  2. Run Inference: Use TRTModel to perform inference on cropped image patches.
  3. Observed Result: Even when the batch size is increased, the inference time per patch remains high. Running multiple engines for parallel inference also does not improve performance.
  4. Profiling Results:
    • Transfer to device: 0.48 ms
    • Inference time: 784.75 ms
    • Transfer to host: 0.67 ms
    • Total time for a single patch (256x256): 19-22 seconds on average

I am seeking optimization suggestions for improving multi-batch processing or multi-threaded parallel inference in TensorRT.
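A rough sketch of how the total per-patch time above can be measured (wall-clock timing around one call, reusing the model and patch from the earlier example; the per-stage numbers presumably come from similar timers inside process_async):

import time

start = time.perf_counter()
model.process_async(patch)                      # single 256x256 patch, as above
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"total time for one patch: {elapsed_ms:.2f} ms")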

@smrutiranjanmohapatra

I am facing the same issue. If you find a solution, please let us know here too.

@poweiw
Collaborator

poweiw commented Nov 18, 2024

@zerollzeng can you take a look?

@poweiw added the Performance and triaged labels on Nov 18, 2024