This tutorial will specify a minimal config.pbtxt file and associated Python code to stand up a simple machine translation service. For this basic service, we will create two different deployment packages:
- fasttext-language-identification: Identify the language of text sent to it
- seamless-m4t-v2-large: Translate the input text from the source language to a specified target language
The basic structure of a model repository is:
<model-repository-path>/
    <model-1-name>/
        config.pbtxt
        1/                  # Version 1
            model.py
        2/                  # Version 2
            model.py
    <model-2-name>/
        config.pbtxt
        1/
            model.py
        ...
    ...
We will review the contents of the config.pbtxt files and the Python code needed. After that we will start the Triton Inference Server and show how to send some HTTP requests to the running inference endpoints. Lastly, we will use the Perf Analyzer command line tool to get some baseline throughput metrics.
The config.pbtxt file is a ModelConfig protobuf that provides required and optional information about a model in a model repository. The minimal required fields in the config.pbtxt file are the name, platform, max_batch_size, input, and output, which provide essential information about the model.
See here for more details.
It is possible to have some of this information generated automatically, but that is not covered in this tutorial.
The input and output portions each must specify a name, datatype, and shape. The name of the input or output must match the name expected by the model in your .py file.
In this tutorial, we will also leverage the optional configs/ directory in the model repository for each model. This allows us to specify alternative configurations that can be loaded when we launch the Triton Inference Server container. Normally, this seems to be done for different hardware setups; for example, running on 8xH100s might use one configuration file while running on a single A10 might use a different one. For us, it is a way to cleanly keep the tutorials separated from each other as we increase the complexity of our deployment. To start, however, we will use the default config.pbtxt file to specify the model configurations for this first tutorial.
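For illustration, a model directory that uses the optional configs/ layout described above might look like the tree below. The tutorial2.pbtxt name is just a hypothetical placeholder for a later tutorial's alternative configuration, not a file in this tutorial's repo.
<model-name>/
    config.pbtxt            # default configuration, used in this tutorial
    configs/
        tutorial2.pbtxt     # hypothetical alternative, loaded with --model-config-name=tutorial2
    1/
        <model-name>.py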
The model expects a single string to be sent to it and returns two strings: the identified language and the script. For Seamless, we only need the language ID, but other use cases may need both. This model is tiny, fast, and doesn't need a GPU.
Let's review the different parts of the file. We start with the required name and backend fields. For this whole tutorial, we use the Python backend. We set max_batch_size to zero, which turns off the dynamic batching feature of Triton Inference Server; we will explore that feature in the next tutorial. We also take advantage of an optional field, default_model_filename, to give the file containing the code that serves this model a more descriptive name than just model.py.
name: "fasttext-language-identification
backend: "python"
max_batch_size: 0
default_model_filename: "fasttext-language-identification.py"
For the input/output sections, there seems to be a convention of using ALL CAPS to name the inputs and outputs, so we use INPUT_TEXT as the name for our input. For input and output data, Triton Inference Server has its own Triton Tensors. You can see how their data types get mapped to the different backends here. Fortunately, the Triton Tensor class has an as_numpy() method to map to a more familiar numpy array; you will see it used in the .py files to come. Since we will be sending in a string, we specify TYPE_STRING, but be forewarned: the chart states that this will be treated as bytes. This means we will need to treat the incoming data as bytes in our code. Also, notice that when we map the Triton Tensor to a numpy array, it will have a dtype of np.object_.
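As a quick illustration of that bytes behavior (this is plain numpy, not Triton-specific code), here is roughly what a TYPE_STRING element looks like once it reaches your Python code:
import numpy as np

# What a TYPE_STRING input roughly looks like after as_numpy():
# an object-dtype array whose elements are bytes, not str
arr = np.array(["¿Dónde está la biblioteca?".encode("utf-8")], dtype=np.object_)
print(arr.dtype)               # object
print(type(arr[0]))            # <class 'bytes'>
print(arr[0].decode("utf-8"))  # back to a Python str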
The last thing we need to specify is the dims section, which gives the shape of the input tensor. For this model, we are going to send a 1-d array containing a single element: the entire contents of the string whose language is to be identified.
input [
    {
        name: "INPUT_TEXT"
        data_type: TYPE_STRING
        dims: [1]
    }
]
The output field mirrors the input field since the model will send back just a single answer, that is, its best guess as to the language ID and script of the submitted text.
output [
    {
        name: "SRC_LANG"
        data_type: TYPE_STRING
        dims: [1]
    },
    {
        name: "SRC_SCRIPT"
        data_type: TYPE_STRING
        dims: [1]
    }
]
Lastly, we add some optional fields to the config.pbtxt that specify the conda environment and that we want to run this model on the CPU. In addition, we pin the deployment to a specific version of the model, the first version.
parameters: {
    key: "EXECUTION_ENV_PATH",
    value: {string_value: "$$TRITON_MODEL_DIRECTORY/fasttext-language-identification.tar.gz"}
}
instance_group [{kind: KIND_CPU}]
version_policy: { specific: { versions: [1]}}
This is the deployed model that performs the translation. Like many translation models, SeamlessM4Tv2 needs to have the input text chunked into reasonably sized pieces. Typically this is done at the sentence level, but in practice you could possibly group a few short sentences together. It seems that the total sequence length the model can handle is 1024 tokens; if you provide more text than that, it just uses the last 1024 tokens.
We begin as we did before by specifying:
name: "seamless-m4t-v2-large"
backend: "python"
max_batch_size: 0
default_model_filename: "seamless-m4t-v2-large.py"
The inputs and outputs look like what we had before as well, with an INPUT_TEXT that is a TYPE_STRING holding a 1-d array with a single element. In addition to the INPUT_TEXT, we need to specify the starting language, SRC_LANG, and the language we want to translate to, TGT_LANG. These have the same shape: 1-d arrays of TYPE_STRING with a single element each.
input [
    {
        name: "INPUT_TEXT"
        data_type: TYPE_STRING
        dims: [1]
    },
    {
        name: "SRC_LANG"
        data_type: TYPE_STRING
        dims: [1]
    },
    {
        name: "TGT_LANG"
        data_type: TYPE_STRING
        dims: [1]
    }
]
output [
    {
        name: "TRANSLATED_TEXT"
        data_type: TYPE_STRING
        dims: [1]
    }
]
Lastly, we specify the conda pack to use, and for this deployment we want it to run on a GPU. To explicitly require that it run on a GPU we would set kind: KIND_GPU, but if we use kind: KIND_AUTO it will run on a GPU if one is available and otherwise fall back to the CPU. This is nice if you are developing on a machine without a GPU but want to deploy on one. Just remember to also do a similar check in the code itself when we load the model from disk.
parameters: {
    key: "EXECUTION_ENV_PATH",
    value: {string_value: "$$TRITON_MODEL_DIRECTORY/seamless-m4t-v2-large.tar.gz"}
}
instance_group [{ kind: KIND_AUTO }]
version_policy: { specific: { versions: [1]}}
Since we are leveraging the Python backend, we need to create a Python file that has the following minimum structure. See here for more details.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Use this to load your model into memory
        self.model = load_model()

    def execute(self, requests):
        """
        Must be implemented. Process incoming requests and return responses.
        There must be a response for each request, but you can return an
        InferenceResponse that contains a TritonError back to the client if
        something went wrong. This way the client gets the error message.

        Parameters
        ----------
        requests : List[pb_utils.InferenceRequest]

        Returns
        -------
        List[pb_utils.InferenceResponse]
        """
        responses = []
        for request in requests:
            # Get input data out of the request
            output = self.model(input)
            # Make InferenceResponse
            response = ...
            responses.append(response)
        return responses
With that, we need to discuss the single biggest drawback that I can see. The library we import, triton_python_backend_utils, only exists within the Triton Inference Server containers. There is no library for you to install and, worse, there is seemingly no documentation you can reference. This is a serious drawback that makes using the Python backend challenging. The best I've found so far is to go looking at the source code, which is written in C++. The only saving grace is that we don't need to use too many features from this undocumented, hidden Python package.
Given this is just the first tutorial, we will be working with version 1 of this model, found in model_repository/fasttext-language-identification/1/fasttext-language-identification.py.
As expected, initialize() loads the FastText model into memory. In addition, we use pb_utils.get_output_config_by_name() and pb_utils.triton_string_to_numpy() together to save off the corresponding numpy dtype (np.object_ in this case) for the SRC_LANG & SRC_SCRIPT outputs. In the config.pbtxt, we had stated the data type was TYPE_STRING; saving the numpy dtype is just a convenience for when we go to make the InferenceResponse.
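As a rough sketch of that method (the FastText model filename and the exact attribute names here are illustrative assumptions, not necessarily what is in the repo):
import json
import re

import fasttext
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is the config.pbtxt serialized as JSON
        model_config = json.loads(args["model_config"])

        # Save the numpy dtype (np.object_) for each output so execute()
        # can build output tensors with the right dtype later
        src_lang_config = pb_utils.get_output_config_by_name(model_config, "SRC_LANG")
        self.src_lang_dtype = pb_utils.triton_string_to_numpy(src_lang_config["data_type"])
        src_script_config = pb_utils.get_output_config_by_name(model_config, "SRC_SCRIPT")
        self.src_script_dtype = pb_utils.triton_string_to_numpy(src_script_config["data_type"])

        # FastText breaks on newlines, so keep a compiled regex handy
        self.REMOVE_NEWLINE = re.compile(r"\n")

        # Load the FastText language-identification model from this version's
        # directory (the filename is an assumption for illustration)
        self.model = fasttext.load_model(
            f"{args['model_repository']}/{args['model_version']}/lid_model.bin"
        )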
In the execute() method, we loop through the list of requests and handle them one at a time. This shows the basic pattern that we will use for nearly any model. NOTE: this may not be the most optimized approach, but it is certainly the simplest. For each request, we start by getting the data that was sent. We put this inside a try/except, which enables sending an input-data error message back through the response.
try:
    input_text_tt = pb_utils.get_input_tensor_by_name(request, "INPUT_TEXT")
except Exception as exc:
    response = pb_utils.InferenceResponse(
        error=pb_utils.TritonError(f"{exc}", pb_utils.TritonError.INVALID_ARG)
    )
    responses.append(response)
    continue
Notice that we retrieve the needed data by name, the name given in the config.pbtxt file. The input_text_tt is a pb_utils.Tensor, so we next convert it to a more amenable type. The pb_utils.Tensor has the as_numpy() method to easily convert to a standard numpy array. Since the config.pbtxt specifies that INPUT_TEXT has dims: [1], we get a 1-d numpy array with just a single element. Given that the config.pbtxt for INPUT_TEXT set data_type: TYPE_STRING, you would think the resulting numpy array holds strings. It does not. The array contains bytes and must therefore be decoded.
input_text = input_text_tt.as_numpy()[0].decode("utf-8")
# Replace newlines with ' '. FastText breaks on \n
input_text_cleaned = self.REMOVE_NEWLINE.sub(" ", input_text)
With the data in a format/type that the FastText model can handle, we call the model to predict the language and script of the input text.
try:
    output_labels, _ = self.model.predict(input_text_cleaned, k=1)
except Exception as exc:
    response = pb_utils.InferenceResponse(
        error=pb_utils.TritonError(f"{exc}")
    )
    responses.append(response)
    continue

src_lang, src_script = output_labels[0].replace("__label__", "").split("_")
Once again we wrap this in a try/except. If we encounter any error, we pass that error back to the client via the InferenceResponse. We then take the resulting output and pull out the top predicted source language and script.
Finally, we need to wrap the model's outputs into pb_utils.Tensor objects and put them into the pb_utils.InferenceResponse to be sent back to the client.
src_lang_tt = pb_utils.Tensor(
    "SRC_LANG",
    np.array([src_lang], dtype=self.src_lang_dtype),
)
src_script_tt = pb_utils.Tensor(
    "SRC_SCRIPT",
    np.array([src_script], dtype=self.src_script_dtype),
)
response = pb_utils.InferenceResponse(
    output_tensors=[src_lang_tt, src_script_tt],
)
One of the unspoken, undocumented details is how the Triton Inference Server knows which response to send back to which request. As far as I can tell, this is why the documentation states that the input list of requests and the output list of responses need to be the same length. Presumably they need to be in the same order too.
Based upon the associated config.pbtxt, the SeamlessM4Tv2Large deployment also expects to be sent a 1-d array with one element containing the text to be translated (INPUT_TEXT), a 1-d array with one element specifying the source language (SRC_LANG), and a 1-d array with one element specifying the target language (TGT_LANG).
The initialize() method looks similar to the FastText deployment, but we do take some care with loading the model onto a GPU. For this model, we need to run both a processor (which handles converting text to PyTorch tokens and back) and the model itself. Only the model goes onto the GPU. Notice that we are using a modified version of the SeamlessM4T classes that can handle batches of text where each element in the batch can have a different source language and be translated to a different target language. The Transformers versions of these classes do not support that for some inexplicable reason, though they are pretty straightforward to fix. For the curious, the code is in the seamless_fix.py file alongside seamless-m4t-v2-large.py.
if torch.cuda.is_available():
    self.device = torch.device("cuda")
    torch_dtype = torch.float16
else:
    self.device = torch.device("cpu")
    torch_dtype = torch.float32  # CPUs can't handle float16

self.model = SeamlessM4Tv2ForTextToTextMulti.from_pretrained(
    "facebook/seamless-m4t-v2-large",
    device_map="auto",
    torch_dtype=torch_dtype,
    cache_dir="/hub",
    local_files_only=True,
)
self.processor = SeamlessM4TProcessorMulti.from_pretrained(
    "facebook/seamless-m4t-v2-large",
    cache_dir="/hub",
    local_files_only=True,
)
The execute() method looks very similar as well. It receives a list of requests, with each request containing the text to be translated, the source language of that text, and the target language for the translation. These are retrieved as pb_utils.Tensor objects and then converted to lists of Python strings. If we get an error retrieving the data from the request, we pass that error back in the pb_utils.InferenceResponse.
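A sketch of that retrieval step, assuming the same pattern as the FastText model (the tensor names come from the config.pbtxt; the variable names are illustrative):
try:
    input_text_tt = pb_utils.get_input_tensor_by_name(request, "INPUT_TEXT")
    src_lang_tt = pb_utils.get_input_tensor_by_name(request, "SRC_LANG")
    tgt_lang_tt = pb_utils.get_input_tensor_by_name(request, "TGT_LANG")
except Exception as exc:
    response = pb_utils.InferenceResponse(
        error=pb_utils.TritonError(f"{exc}", pb_utils.TritonError.INVALID_ARG)
    )
    responses.append(response)
    continue

# TYPE_STRING tensors arrive as bytes, so decode each element into a str
input_texts = [b.decode("utf-8") for b in input_text_tt.as_numpy()]
src_langs = [b.decode("utf-8") for b in src_lang_tt.as_numpy()]
tgt_langs = [b.decode("utf-8") for b in tgt_lang_tt.as_numpy()]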
For the translation part, we do that in three stages. If there is an error at any one of the stages, we pass that error back to the requestor. A sketch of the three stages together follows the list.
- Use processor() to tokenize the input text
- Use model.generate() to get output tokens of the translated text
- Use processor.batch_decode() to decode the output tokens into the translated text
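Stitched together, the three stages might look roughly like this. It is a sketch that assumes the modified Multi classes accept per-element lists for src_lang and tgt_lang, as described above; the actual code may also handle errors separately at each stage.
try:
    # Stage 1: tokenize the input text (per-element source languages)
    inputs = self.processor(
        text=input_texts, src_lang=src_langs, return_tensors="pt", padding=True
    ).to(self.device)

    # Stage 2: generate output tokens in the per-element target languages
    output_tokens = self.model.generate(**inputs, tgt_lang=tgt_langs)

    # Stage 3: decode the output tokens into translated text
    translated_text = self.processor.batch_decode(
        output_tokens, skip_special_tokens=True
    )
except Exception as exc:
    response = pb_utils.InferenceResponse(error=pb_utils.TritonError(f"{exc}"))
    responses.append(response)
    continue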
Finally, the translated text is converted back into a pb_utils.Tensor and then used to create the pb_utils.InferenceResponse object that is appended to the list of responses, which gets returned at the end of execute(). Each response corresponds to a request and contains either the translated text or an error message.
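The packaging mirrors the FastText model. A sketch, assuming a self.translated_text_dtype was saved in initialize() the same way the output dtypes were saved before:
translated_text_tt = pb_utils.Tensor(
    "TRANSLATED_TEXT",
    np.array(translated_text, dtype=self.translated_text_dtype),
)
response = pb_utils.InferenceResponse(output_tensors=[translated_text_tt])
responses.append(response)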
We launch the service from the parent directory using Docker Compose. We will leverage the --model-config-name option when launching tritonserver, and we will set it via an environment variable, CONFIG_NAME, when calling docker compose. The observant student will notice that there is no file named configs/tutorial1.pbtxt, which is the file tritonserver will be looking for. But if no matching file is found, tritonserver falls back to loading the default config.pbtxt file, which is exactly what we want here.
$ CONFIG_NAME=tutorial1 docker-compose up
This mounts two volumes into the Triton Inference Server container. The first is the model repository which contains all the work we just described. The second is the location of our local huggingface hub cache directory which we use to load the models.
If we are successful, you should see:
Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
Started Metrics Service at 0.0.0.0:8002
Take a look at examples/client_requests.py to see how to structure your requests. Each request has you specify the input data you are sending, which includes the following fields (a complete example request is sketched below):
- name - must match what's in the config.pbtxt file
- shape - the shape of this request
- datatype - since we are using strings, this is "BYTES" (shrug)
- data - array of the input data
"name": "SRC_LANG",
"shape": [1],
"datatype": "BYTES",
"data": ["spa"],
The Triton Performance Analyzer is a CLI tool which can help you measure and optimize the inference performance of your Triton Inference Server. It is available in the SDK container we already pulled. The documentation says you can try pip install tritonclient, which should include it, but warns that you will likely be missing necessary system libraries. They weren't wrong; I tried on my Ubuntu machine and had no luck.
$ docker run --rm -it --net host -v ./data:/workspace/data nvcr.io/nvidia/tritonserver:24.05-py3-sdk
This starts up the container and mounts the repo's /data directory, which contains the load-testing data we will use. There are a bunch of features in this tool, but we will try to keep to this tutorial's ethos: keep it as simple as possible.
From inside the SDK container, let's give the following command a try. We will kick off the command and then talk through its pieces while we wait.
sdk:/workspace# perf_analyzer \
-m seamless-m4t-v2-large \
--input-data data/spanish-news-seamless-one.json \
--measurement-mode=count_windows \
--measurement-request-count=266 \
--request-rate-range=1.0:4.0:0.1 \
--latency-threshold=5000 \
--max-threads=300 \
--binary-search \
-v \
--stability-percentage=25
As you can see, perf_analyzer takes many arguments. Here are the ones that I'm using:
- -m : Specifies the deployment to analyze
- --input-data : You can provide your own data; otherwise it will generate random data
- --measurement-mode : Measure for a set number of requests (you could also measure for a set amount of time)
- --measurement-request-count : Minimum number of requests in the count_window
- --request-rate-range : Gives the min:max:step that will be used in the binary search
- --latency-threshold : Maximum time in ms a request should take
- --max-threads : Number of threads to use to send the requests
- --binary-search : Perform a binary search within the request-rate-range specified
- -v : Verbose mode gives more info
- --stability-percentage : Max. % difference between high & low latency for the last 3 trials
This will use a binary search to find what request rate the translate deployment can handle while keeping the latency below 5 seconds.
For the output, we get the following:
Request Rate: 2.5 inference requests per seconds
Pass [1] throughput: 2.48574 infer/sec. Avg latency: 670880 usec (std 104532 usec).
Pass [2] throughput: 2.50443 infer/sec. Avg latency: 663555 usec (std 162182 usec).
Pass [3] throughput: 2.47199 infer/sec. Avg latency: 570518 usec (std 256319 usec).
Client:
Request count: 801
Throughput: 2.48734 infer/sec
Avg client overhead: 0.00%
Avg latency: 634975 usec (standard deviation 115212 usec)
p50 latency: 424363 usec
p90 latency: 1247050 usec
p95 latency: 1337742 usec
p99 latency: 1519219 usec
Avg HTTP time: 634960 usec (send 101 usec + response wait 634859 usec + receive 0 usec)
Server:
Inference count: 801
Execution count: 801
Successful request count: 801
Avg request latency: 634382 usec (overhead 6 usec + queue 280831 usec + compute input 35 usec + compute infer 353468 usec + compute output 41 usec)
Inferences/Second vs. Client Average Batch Latency
Request Rate: 1.00, throughput: 1.00 infer/sec, latency 345138 usec
Request Rate: 2.50, throughput: 2.49 infer/sec, latency 634975 usec
So, it looks like our target to beat is 2.50 infer/sec for the seamless-m4t-v2-large deployment.
For comparison, the fasttext-language-identification deployment can support about 2,100 infer/sec.
In the next tutorial, we will explore leveraging Triton Inference Server's dynamic batching capability which promises to improve that throughput.