[tensorrt-llm backend] A question about launch_triton_server.py #455

Open
@victorsoda

Description

Question

The code in launch_triton_server.py:

def get_cmd(world_size, tritonserver, grpc_port, http_port, metrics_port,
            model_repo, log, log_file, tensorrt_llm_model_name):
    cmd = ['mpirun', '--allow-run-as-root']
    for i in range(world_size):
        cmd += ['-n', '1', tritonserver, f'--model-repository={model_repo}']
        if log and (i == 0):
            cmd += ['--log-verbose=3', f'--log-file={log_file}']
        # If rank is not 0, skip loading of models other than `tensorrt_llm_model_name`
        if (i != 0):
            cmd += ['--model-control-mode=explicit']
            model_names = tensorrt_llm_model_name.split(',')
            for name in model_names:
                cmd += [f'--load-model={name}']
        cmd += [
            f'--grpc-port={grpc_port}', f'--http-port={http_port}',
            f'--metrics-port={metrics_port}', '--disable-auto-complete-config',
            f'--backend-config=python,shm-region-prefix-name=prefix{i}_', ':'
        ]
    return cmd
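
For context, get_cmd assembles a single MPMD-style mpirun command in which each ':' separates a per-rank application context, so all ranks are launched from one command line. A minimal sketch of what it produces (the argument values below are illustrative, not defaults from the script):

# Hypothetical invocation of get_cmd; all argument values here are
# examples for illustration, not taken from launch_triton_server.py.
cmd = get_cmd(world_size=2,
              tritonserver='/opt/tritonserver/bin/tritonserver',
              grpc_port=8001,
              http_port=8000,
              metrics_port=8002,
              model_repo='./model_repository',
              log=False,
              log_file='',
              tensorrt_llm_model_name='tensorrt_llm')
print(' '.join(cmd))
# -> mpirun --allow-run-as-root
#      -n 1 /opt/tritonserver/bin/tritonserver --model-repository=./model_repository
#        --grpc-port=8001 --http-port=8000 --metrics-port=8002
#        --disable-auto-complete-config
#        --backend-config=python,shm-region-prefix-name=prefix0_ :
#      -n 1 /opt/tritonserver/bin/tritonserver --model-repository=./model_repository
#        --model-control-mode=explicit --load-model=tensorrt_llm
#        --grpc-port=8001 --http-port=8000 --metrics-port=8002
#        --disable-auto-complete-config
#        --backend-config=python,shm-region-prefix-name=prefix1_ :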

When world_size = 2, for example, two Triton servers are launched with the same gRPC port (e.g., 8001).
But how can that work?
When I tried something similar, the second server failed to launch with the following error:

I0513 03:43:28.353306 21205 grpc_server.cc:2466] Started GRPCInferenceService at 0.0.0.0:8001
I0513 03:43:28.353458 21205 http_server.cc:4636] Started HTTPService at 0.0.0.0:8000
E0513 03:43:28.353559006   21206 chttp2_server.cc:1080]      UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:8001' {created_time:"2024-05-13T03:43:28.353510541+00:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2024-05-13T03:43:28.353503146+00:00", children:[UNKNOWN:Address family not supported by protocol {target_address:"[::]:8001", syscall:"socket", os_error:"Address family not supported by protocol", errno:97, created_time:"2024-05-13T03:43:28.353465612+00:00"}, UNKNOWN:Unable to configure socket {fd:6, created_time:"2024-05-13T03:43:28.353493367+00:00", children:[UNKNOWN:Address already in use {syscall:"bind", os_error:"Address already in use", errno:98, created_time:"2024-05-13T03:43:28.353488259+00:00"}]}]}]}
E0513 03:43:28.353650 21206 main.cc:245] failed to start GRPC service: Unavailable - Socket '0.0.0.0:8001' already in use
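
(For what it's worth, errno 98 is the ordinary EADDRINUSE you get from binding the same port twice, which is easy to reproduce outside Triton; port 8001 below is just an example:)

import socket

# Bind a listener on the port, then try to bind a second socket to it.
# The second bind fails with errno 98, the same "Address already in use"
# that appears in the gRPC log above.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(('0.0.0.0', 8001))
first.listen()

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(('0.0.0.0', 8001))
except OSError as err:
    print(err.errno, err.strerror)  # 98 Address already in use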

Background

I've been developing my own Triton backend, drawing on https://github.com/triton-inference-server/tensorrtllm_backend.

I have already built two engines (tensor parallel, tp_size = 2) for the llama2-7b model.
Running something like mpirun -np 2 python3.8 run.py loads the two engines, runs tensor-parallel inference, and produces correct results.
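
(Conceptually, each MPI rank loads its own engine shard; a minimal mpi4py sketch of that pattern, where the rank{N}.engine file layout is an assumption for illustration, not necessarily what run.py uses:)

from mpi4py import MPI

# Each rank selects its own engine shard by MPI rank; launched as
# mpirun -np 2 python3.8 this_script.py, ranks 0 and 1 each load one shard.
rank = MPI.COMM_WORLD.Get_rank()
engine_path = f'./engines/llama2_7b_tp2/rank{rank}.engine'  # hypothetical layout
print(f'rank {rank} loads {engine_path}')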

My goal now is to serve the same two engines with the Triton server.

I have already implemented the run.py logic in model.py (the initialize() and execute() functions) of my Python backend.
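
(For context, model.py follows the standard Triton Python backend structure; a stripped-down sketch, with the tensor names as placeholders:)

import triton_python_backend_utils as pb_utils

class TritonPythonModel:

    def initialize(self, args):
        # The engine-loading logic from run.py lives here;
        # args['model_config'] holds config.pbtxt as a JSON string.
        pass

    def execute(self, requests):
        responses = []
        for request in requests:
            # 'INPUT0'/'OUTPUT0' are placeholder tensor names.
            in0 = pb_utils.get_input_tensor_by_name(request, 'INPUT0')
            out0 = pb_utils.Tensor('OUTPUT0', in0.as_numpy())
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses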

Following launch_triton_server.py, I tried the following command line:

mpirun --allow-run-as-root \
    -n 1 /opt/tritonserver/bin/tritonserver --model-repository=./model_repository \
        --grpc-port=8001 --http-port=8000 --metrics-port=8002 \
        --disable-auto-complete-config \
        --backend-config=python,shm-region-prefix-name=prefix0_ : \
    -n 1 /opt/tritonserver/bin/tritonserver --model-repository=./model_repository \
        --model-control-mode=explicit --load-model=llama2_7b \
        --grpc-port=8001 --http-port=8000 --metrics-port=8002 \
        --disable-auto-complete-config \
        --backend-config=python,shm-region-prefix-name=prefix1_ :

Then I got the error shown above.

Could you please tell me what I did wrong and how I can fix the error? Thanks a lot!


Labels

question (Further information is requested), triaged (Issue has been triaged by maintainers)
