A minimalistic example of using CUDA shared memory #84
@piotrm-nvidia Can you please suggest a minimal example to get started?
Let's start with a simple Linear model, which takes a single input tensor and returns the negative of the input tensor:

```python
import numpy as np

from pytriton.decorators import batch


@batch
def infer_fn(data):
    result = data * np.array([[-1]], dtype=np.float32)  # Process inputs and produce result
    return [result]
```
```python
from pytriton.model_config import Tensor
from pytriton.triton import Triton

triton = Triton()
triton.bind(
    model_name="Linear",
    infer_func=infer_fn,
    inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
    outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
)
triton.run()
```

This code creates a simple Triton model that takes a single input tensor named `data` and returns the negative of the input in an output tensor named `result`. You can test this model using the following code:

```python
import numpy as np
from pytriton.client import ModelClient
client = ModelClient("localhost", "Linear")
data = np.array([1, 2, ], dtype=np.float32)
print(client.infer_sample(data=data))
```

The snippet above uses pytriton's `ModelClient` to send the request. The following example talks to the same model with the `tritonclient` library directly and passes the input and output through system shared memory:

```python
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm
from tritonclient import utils
# Configuration
url = "localhost:8000"
model_name = "Linear"
input_shape = (2, 1)
input_dtype = np.float32
data = np.array([1, 2], dtype=np.float32)
input_byte_size = 8  # 2 float32 values * 4 bytes each
output_byte_size = int(np.prod(input_shape) * input_dtype().itemsize)
# Expected output shape and type
output_shape = (2, 1)
output_dtype = np.float32
# Create Triton client
try:
    triton_client = httpclient.InferenceServerClient(url=url, verbose=True)
except Exception as e:
    print("Channel creation failed: " + str(e))
    raise e
# Ensure no shared memory regions are registered with the server
triton_client.unregister_system_shared_memory()
triton_client.unregister_cuda_shared_memory()
# Create shared memory regions for input and output
shm_ip_handle = shm.create_shared_memory_region("input_data", "/input_simple", input_byte_size)
output_byte_size = int(np.prod(output_shape) * output_dtype().itemsize)
shm_op_handle = shm.create_shared_memory_region("output_data", "/output_simple", output_byte_size)
# Register the shared memory regions with the Triton server
triton_client.register_system_shared_memory("input_data", "/input_simple", input_byte_size)
triton_client.register_system_shared_memory("output_data", "/output_simple", output_byte_size)
# Put input data into shared memory
shm.set_shared_memory_region(shm_ip_handle, [data])
# Set up the inputs and outputs to use shared memory
inputs = []
inputs.append(httpclient.InferInput("data", [1, 2], "FP32"))
inputs[-1].set_shared_memory("input_data", input_byte_size)
outputs = []
outputs.append(httpclient.InferRequestedOutput("result", binary_data=True))
outputs[-1].set_shared_memory("output_data", output_byte_size)
# Perform inference
results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)
# Read results from the shared memory
output = results.get_output("result")
output_data = shm.get_contents_as_numpy(
    shm_op_handle,
    output_dtype,
    output_shape,
)
# Print the results
print(f"Input data: {data}")
print(f"Output data: {output_data}")
# Clean up
triton_client.unregister_system_shared_memory()
shm.destroy_shared_memory_region(shm_ip_handle)
shm.destroy_shared_memory_region(shm_op_handle)
```

This example demonstrates how to use system shared memory. While shared memory can improve performance by reducing data transfer overhead, it also makes you responsible for sizing, registering, and cleaning up the shared memory regions yourself.
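Since the issue title asks about CUDA shared memory and the snippet above uses system (CPU) shared memory, here is a rough, untested sketch of the same request going through CUDA shared memory instead. It assumes the `tritonclient.utils.cuda_shared_memory` utilities are available, the data lives on GPU device 0, and the client runs on the same machine as the server (CUDA IPC handles do not cross machines):

```python
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

url = "localhost:8000"
model_name = "Linear"
data = np.array([1, 2], dtype=np.float32)
byte_size = int(data.size * data.itemsize)

triton_client = httpclient.InferenceServerClient(url=url, verbose=True)
triton_client.unregister_cuda_shared_memory()

# Allocate CUDA shared memory regions on GPU 0 for input and output.
shm_ip_handle = cudashm.create_shared_memory_region("input_data", byte_size, 0)
shm_op_handle = cudashm.create_shared_memory_region("output_data", byte_size, 0)

# Register the regions with the server using their raw CUDA IPC handles.
triton_client.register_cuda_shared_memory(
    "input_data", cudashm.get_raw_handle(shm_ip_handle), 0, byte_size
)
triton_client.register_cuda_shared_memory(
    "output_data", cudashm.get_raw_handle(shm_op_handle), 0, byte_size
)

# Copy the input into the CUDA region and point the request at it.
cudashm.set_shared_memory_region(shm_ip_handle, [data])
inputs = [httpclient.InferInput("data", [1, 2], "FP32")]
inputs[0].set_shared_memory("input_data", byte_size)
outputs = [httpclient.InferRequestedOutput("result", binary_data=True)]
outputs[0].set_shared_memory("output_data", byte_size)

results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)

# Read the result back from the CUDA region into host memory.
output_data = cudashm.get_contents_as_numpy(shm_op_handle, np.float32, (1, 2))
print(output_data)

# Clean up.
triton_client.unregister_cuda_shared_memory()
cudashm.destroy_shared_memory_region(shm_ip_handle)
cudashm.destroy_shared_memory_region(shm_op_handle)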
```

Note that `cudashm.set_shared_memory_region` copies host data into the CUDA region, so tensors that already live on the GPU would still involve a copy unless your tritonclient version offers DLPack-based helpers for filling the region.
Don't hesitate to ask if you have any questions or need further assistance.
I'm afraid that PyTorch support needs some fixes.
Can you please provide a minimalistic example of using CUDA shared memory from a client application which streams preprocessed PyTorch tensors located on the GPU to the pytriton server?
The pytriton server also uses the same PyTorch tensors through DLPack and runs inference there.
I need examples for both the client and the server.
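A client-side sketch using CUDA shared memory appears earlier in this thread. For the server side of that workflow, a hedged sketch of a pytriton `infer_fn` that runs the computation on the GPU with PyTorch could look like the following. pytriton hands the function numpy arrays, so this copies the batch onto the GPU rather than receiving it zero-copy through DLPack; the model name `LinearTorch` is a placeholder used only for this sketch, and PyTorch with a CUDA device is assumed to be installed:

```python
import numpy as np
import torch  # assumption: PyTorch with a CUDA-capable device is available

from pytriton.decorators import batch
from pytriton.model_config import Tensor
from pytriton.triton import Triton


@batch
def infer_fn(data):
    # pytriton delivers the batch as a numpy array; move it to the GPU,
    # compute there, then copy the result back to host memory for the response.
    gpu_data = torch.from_numpy(data).to("cuda")
    result = (-gpu_data).cpu().numpy()
    return [result]


triton = Triton()
triton.bind(
    model_name="LinearTorch",  # hypothetical name used only for this sketch
    infer_func=infer_fn,
    inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
    outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
)
triton.run()
```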