A minimalistic example of using CUDA shared memory #84

Closed
sayanmutd opened this issue Oct 1, 2024 · 5 comments

@sayanmutd

Can you please provide a minimal example of using CUDA shared memory from a client application that streams preprocessed PyTorch tensors located on the GPU to the PyTriton server?
The PyTriton server also consumes the same PyTorch tensors through DLPack and runs the inference.
I need examples for both the client and the server.

@sayanmutd
Author

@piotrm-nvidia Can you please suggest a minimal example to get started.

@piotrm-nvidia
Collaborator

piotrm-nvidia commented Oct 5, 2024

Let's start with a simple Linear model that takes a single input tensor and returns its negation:

import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import Tensor
from pytriton.triton import Triton

@batch
def infer_fn(data):
    result = data * np.array([[-1]], dtype=np.float32)  # Negate every element of the input batch
    return [result]

triton = Triton()
triton.bind(
    model_name="Linear",
    infer_func=infer_fn,
    inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
    outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
)
triton.run()  # Starts the server without blocking; use triton.serve() to block in a standalone script

This code will create a simple Triton model that takes a single input tensor named data and returns the negative of the input tensor as the output tensor named result. The infer_fn function processes the input data and produces the output result. The @batch decorator indicates that the model supports batching.
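
Since your question mentions running the inference with PyTorch on the server side, here is a minimal sketch of an alternative infer_fn body that wraps the incoming NumPy batch in a torch tensor. This assumes PyTorch is installed; the GPU transfer is optional and the negation stands in for your real model's forward pass.

import numpy as np
import torch

from pytriton.decorators import batch

@batch
def infer_fn(data):
    # PyTriton passes the batch as a NumPy array, so wrap it in a torch tensor,
    # optionally move it to the GPU, and convert the result back to NumPy.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tensor = torch.from_numpy(data).to(device)
    result = torch.neg(tensor)  # stand-in for your model's forward pass
    return [result.cpu().numpy()]

Either version can be bound with the same triton.bind call shown above.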

You can test this model using the following code:

import numpy as np
from pytriton.client import ModelClient

client = ModelClient("localhost", "Linear")
data = np.array([1, 2, ], dtype=np.float32)
print(client.infer_sample(data=data))

The ModelClient class is a simple client for interacting with Triton models. If you need more advanced features, you can use the Triton client library directly. It provides a more flexible and powerful interface for working with Triton models and also supports shared memory for input and output data.

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm
from tritonclient import utils

# Configuration
url = "localhost:8000"
model_name = "Linear"
input_shape = (2, 1)
input_dtype = np.float32
data = np.array([1, 2], dtype=np.float32)
input_byte_size = int(np.prod(input_shape) * input_dtype().itemsize)

# Expected output shape and type
output_shape = (2, 1)
output_dtype = np.float32

# Create Triton client
try:
    triton_client = httpclient.InferenceServerClient(url=url, verbose=True)
except Exception as e:
    print("Channel creation failed: " + str(e))
    raise e

# Ensure no shared memory regions are registered with the server
triton_client.unregister_system_shared_memory()
triton_client.unregister_cuda_shared_memory()

# Create shared memory regions for input and output
shm_ip_handle = shm.create_shared_memory_region("input_data", "/input_simple", input_byte_size)
output_byte_size = int(np.prod(output_shape) * output_dtype().itemsize)
shm_op_handle = shm.create_shared_memory_region("output_data", "/output_simple", output_byte_size)

# Register the shared memory regions with the Triton server
triton_client.register_system_shared_memory("input_data", "/input_simple", input_byte_size)
triton_client.register_system_shared_memory("output_data", "/output_simple", output_byte_size)

# Put input data into shared memory
shm.set_shared_memory_region(shm_ip_handle, [data])

# Set up the inputs and outputs to use shared memory
inputs = []
inputs.append(httpclient.InferInput("data", [2, 1], "FP32"))
inputs[-1].set_shared_memory("input_data", input_byte_size)

outputs = []
outputs.append(httpclient.InferRequestedOutput("result", binary_data=True))
outputs[-1].set_shared_memory("output_data", output_byte_size)

# Perform inference
results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)

# Read results from the shared memory
output = results.get_output("result")
output_data = shm.get_contents_as_numpy(
    shm_op_handle,
    output_dtype,
    output_shape,
)

# Print the results
print(f"Input data: {data}")
print(f"Output data: {output_data}")

# Clean up
triton_client.unregister_system_shared_memory()
shm.destroy_shared_memory_region(shm_ip_handle)
shm.destroy_shared_memory_region(shm_op_handle)

This example demonstrates how to use system shared memory with the Triton client library.
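
The CUDA shared memory API follows the same pattern. Below is a minimal sketch, assuming the tritonclient package was installed with CUDA shared memory support and that the client and the Triton server can both see GPU 0. Note that it still stages the data through a NumPy array on the host; writing a GPU-resident PyTorch tensor into the region directly is a separate concern (see the follow-up comment below).

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

url = "localhost:8000"
model_name = "Linear"
data = np.array([1, 2], dtype=np.float32)
byte_size = data.nbytes

triton_client = httpclient.InferenceServerClient(url=url)
triton_client.unregister_cuda_shared_memory()

# Allocate CUDA shared memory regions on device 0 and register them with the server
shm_ip_handle = cudashm.create_shared_memory_region("input_data", byte_size, 0)
shm_op_handle = cudashm.create_shared_memory_region("output_data", byte_size, 0)
triton_client.register_cuda_shared_memory("input_data", cudashm.get_raw_handle(shm_ip_handle), 0, byte_size)
triton_client.register_cuda_shared_memory("output_data", cudashm.get_raw_handle(shm_op_handle), 0, byte_size)

# Copy the input into the CUDA region (host-to-device copy under the hood)
cudashm.set_shared_memory_region(shm_ip_handle, [data])

inputs = [httpclient.InferInput("data", [2, 1], "FP32")]
inputs[0].set_shared_memory("input_data", byte_size)
outputs = [httpclient.InferRequestedOutput("result", binary_data=True)]
outputs[0].set_shared_memory("output_data", byte_size)

results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)

# Read the result back from the CUDA region
output_data = cudashm.get_contents_as_numpy(shm_op_handle, np.float32, (2, 1))
print(f"Output data: {output_data}")

# Clean up
triton_client.unregister_cuda_shared_memory()
cudashm.destroy_shared_memory_region(shm_ip_handle)
cudashm.destroy_shared_memory_region(shm_op_handle)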

While using shared memory can lead to performance improvements by reducing data transfer overhead, it's important to consider the following:

  1. Complexity: The use of shared memory introduces additional complexity into the code.
  2. Local Machine: Shared memory is only applicable on the same machine, so if your tensors are already in the process memory, calling your Python code directly might be simpler and more efficient.
  3. Performance: The internal implementation of PyTriton doesn't use shared memory, so using shared memory with the Triton Inference Server might not provide significant performance gains in this context.

Don't hesitate to ask if you have any questions or need further assistance.

@piotrm-nvidia
Collaborator

I'm afraid that PyTorch support needs some fixes:

triton-inference-server/client#789

piotrm-nvidia self-assigned this Oct 7, 2024

This issue is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the Stale label Oct 29, 2024

github-actions bot commented Nov 6, 2024

This issue was closed because it has been stalled for 7 days with no activity.

github-actions bot closed this as not planned on Nov 6, 2024