Description
Question
The code in launch_triton_server.py:
def get_cmd(world_size, tritonserver, grpc_port, http_port, metrics_port,
            model_repo, log, log_file, tensorrt_llm_model_name):
    cmd = ['mpirun', '--allow-run-as-root']
    for i in range(world_size):
        cmd += ['-n', '1', tritonserver, f'--model-repository={model_repo}']
        if log and (i == 0):
            cmd += ['--log-verbose=3', f'--log-file={log_file}']
        # If rank is not 0, skip loading of models other than `tensorrt_llm_model_name`
        if (i != 0):
            cmd += ['--model-control-mode=explicit']
            model_names = tensorrt_llm_model_name.split(',')
            for name in model_names:
                cmd += [f'--load-model={name}']
        cmd += [
            f'--grpc-port={grpc_port}', f'--http-port={http_port}',
            f'--metrics-port={metrics_port}', '--disable-auto-complete-config',
            f'--backend-config=python,shm-region-prefix-name=prefix{i}_', ':'
        ]
    return cmd
With world_size = 2, for example, two Triton servers are launched with the same gRPC port (e.g., 8001).
How is this possible?
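For concreteness, the command this function builds can be reproduced directly. A minimal sketch, assuming launch_triton_server.py is importable from the working directory (the port numbers and model name are illustrative):

# Print the command launch_triton_server.py would run for world_size = 2.
from launch_triton_server import get_cmd

cmd = get_cmd(world_size=2,
              tritonserver='/opt/tritonserver/bin/tritonserver',
              grpc_port=8001, http_port=8000, metrics_port=8002,
              model_repo='./model_repository',
              log=False, log_file='',
              tensorrt_llm_model_name='llama2_7b')
print(' '.join(cmd))
# Note that both `-n 1` contexts receive the same --grpc-port=8001,
# --http-port=8000, and --metrics-port=8002 flags.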
When I tried to do something similar, I got the following error while launching the second server:
I0513 03:43:28.353306 21205 grpc_server.cc:2466] Started GRPCInferenceService at 0.0.0.0:8001
I0513 03:43:28.353458 21205 http_server.cc:4636] Started HTTPService at 0.0.0.0:8000
E0513 03:43:28.353559006 21206 chttp2_server.cc:1080] UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:8001' {created_time:"2024-05-13T03:43:28.353510541+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2024-05-13T03:43:28.353503146+00:00", children:[UNKNOWN:Address family not supported by protocol {target_address:"[::]:8001", syscall:"socket", os_error:"Address family not supported by protocol", errno:97, created_time:"2024-05-13T03:43:28.353465612+00:00"}, UNKNOWN:Unable to configure socket {fd:6, created_time:"2024-05-13T03:43:28.353493367+00:00", children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-05-13T03:43:28.353488259+00:00"}]}]}]}
E0513 03:43:28.353650 21206 main.cc:245] failed to start GRPC service: Unavailable - Socket '0.0.0.0:8001' already in use
Background
I've been developing my own Triton backend, drawing on https://github.com/triton-inference-server/tensorrtllm_backend.
I have already built two engines (tensor parallel, tp_size = 2) for the llama2-7b model.
Running something like mpirun -np 2 python3.8 run.py works fine:
it loads the two engines, runs tensor-parallel inference, and returns correct results.
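Conceptually, each rank loads its own engine shard. A minimal sketch of that rank-aware loading, assuming mpi4py is available (the engine path layout and the load_engine helper are hypothetical, not the actual run.py):

from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()              # 0 or 1 under `mpirun -np 2`
engine_path = f'./engines/rank{rank}.engine'  # hypothetical: one TP shard per rank
print(f'rank {rank} loads {engine_path}')
# engine = load_engine(engine_path)           # hypothetical loading helper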
My goal now is to serve the same two engines through Triton Server.
I have already implemented the run.py logic in my Python backend's model.py
(in its initialize() and execute() functions).
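The skeleton looks roughly like this (a sketch of the Python backend entry points, not my actual implementation; the tensor names and the inference step are illustrative placeholders):

import triton_python_backend_utils as pb_utils

class TritonPythonModel:

    def initialize(self, args):
        # The run.py engine-loading logic goes here; this runs once per
        # server process when the model is loaded.
        self.model_name = args['model_name']

    def execute(self, requests):
        responses = []
        for request in requests:
            input_ids = pb_utils.get_input_tensor_by_name(request, 'input_ids')
            # ... tensor-parallel inference on the loaded engine ...
            output = pb_utils.Tensor('output_ids', input_ids.as_numpy())  # placeholder
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[output]))
        return responses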
Following launch_triton_server.py, I tried the following command line:
mpirun --allow-run-as-root \
    -n 1 /opt/tritonserver/bin/tritonserver \
        --model-repository=./model_repository \
        --grpc-port=8001 --http-port=8000 --metrics-port=8002 \
        --disable-auto-complete-config \
        --backend-config=python,shm-region-prefix-name=prefix0_ : \
    -n 1 /opt/tritonserver/bin/tritonserver \
        --model-repository=./model_repository \
        --model-control-mode=explicit --load-model=llama2_7b \
        --grpc-port=8001 --http-port=8000 --metrics-port=8002 \
        --disable-auto-complete-config \
        --backend-config=python,shm-region-prefix-name=prefix1_ :
Then I got the error shown above.
Could you please tell me what I did wrong and how I can fix the error? Thanks a lot!