Serving_Configure_EN.md

Serving Configuration & Startup Params

(简体中文|English)

Overview

This guide focuses on Paddle C++ Serving and Python Pipeline configuration:

  • Model Configuration: Generated automatically when a model is converted. It specifies the model inputs and outputs.
  • C++ Serving: For high-performance scenarios. Covers quick start and starting with a user-defined configuration.
  • Python Pipeline: For scenarios that combine multiple models.

Model Configuration

The model configuration files are generated when a model is converted for Paddle Serving and are named serving_client_conf.prototxt and serving_server_conf.prototxt. They describe the model inputs and outputs so that users can fill in request parameters easily. The model configuration files should not be modified. See the Saving guide for model conversion. The files must conform to core/configure/proto/general_model_config.proto.

Example:

feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "concat_1.tmp_0"
  alias_name: "concat_1.tmp_0"
  is_lod_tensor: false
  fetch_type: 1
  shape: 3
  shape: 640
  shape: 640
}
  • feed_var: model input
  • fetch_var: model output
  • name: node name
  • alias_name: alias name
  • is_lod_tensor: whether the variable is a LoD tensor; refer to Lod Introduction
  • feed_type/fetch_type: data type, encoded as follows

| feed_type | Data type |
|---|---|
| 0 | INT64 |
| 1 | FLOAT32 |
| 2 | INT32 |
| 3 | FP64 |
| 4 | INT16 |
| 5 | FP16 |
| 6 | BF16 |
| 7 | UINT8 |
| 8 | INT8 |
| 20 | STRING |

  • shape: tensor shape
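
To make the fields above concrete, here is a minimal sketch that reads the example configuration and reports each variable's alias, data type and shape. The helper `summarize_vars` and the regex-based parsing are illustrative assumptions, not part of Paddle Serving; a real deployment would more likely parse the file with the protobuf text-format tooling.

```python
import re

# feed_type / fetch_type codes as listed in the table above.
DATA_TYPES = {
    0: "INT64", 1: "FLOAT32", 2: "INT32", 3: "FP64", 4: "INT16",
    5: "FP16", 6: "BF16", 7: "UINT8", 8: "INT8", 20: "STRING",
}


def summarize_vars(prototxt_text):
    """Return one dict per feed_var/fetch_var block (hypothetical helper)."""
    summaries = []
    # Each block looks like `feed_var { key: value ... }` with no nesting.
    for kind, body in re.findall(r"(feed_var|fetch_var)\s*\{([^}]*)\}", prototxt_text):
        alias = re.search(r'\balias_name:\s*"([^"]*)"', body).group(1)
        type_code = int(re.search(r"\b(?:feed|fetch)_type:\s*(\d+)", body).group(1))
        # `shape` is a repeated field: one entry per dimension.
        shape = [int(dim) for dim in re.findall(r"\bshape:\s*(-?\d+)", body)]
        summaries.append({"kind": kind, "alias_name": alias,
                          "dtype": DATA_TYPES[type_code], "shape": shape})
    return summaries


EXAMPLE = '''
feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "concat_1.tmp_0"
  alias_name: "concat_1.tmp_0"
  is_lod_tensor: false
  fetch_type: 1
  shape: 3
  shape: 640
  shape: 640
}
'''

for entry in summarize_vars(EXAMPLE):
    print(entry)
```

For the example file, the input `x` resolves to a FLOAT32 tensor of shape [13] and the output to a FLOAT32 tensor of shape [3, 640, 640].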

C++ Serving

1. Quick start and stop

The easiest way to start C++ Serving is to provide the --model and --port flags.

Example starting c++ serving:

python3 -m paddle_serving_server.serve --model serving_model --port 9393

This command generates the server configuration files in the workdir_9393 directory:

workdir_9393
├── general_infer_0
│   ├── fluid_time_file
│   ├── general_model.prototxt
│   └── model_toolkit.prototxt
├── infer_service.prototxt
├── resource.prototxt
└── workflow.prototxt

More flags:

| Argument | Type | Default | Description |
|---|---|---|---|
| --thread | int | 2 | Number of brpc service threads |
| --runtime_thread_num | int[] | 0 | Number of threads for each model in asynchronous mode |
| --batch_infer_size | int[] | 32 | Batch size for each model in asynchronous mode |
| --gpu_ids | str[] | "-1" | GPU card IDs for each model |
| --port | int | 9292 | Port exposed to users by the current service |
| --model | str[] | "" | Path of the Paddle model directory to be served |
| --mem_optim_off | - | - | Disable memory / GPU memory optimization |
| --ir_optim | bool | False | Enable analysis and optimization of the computation graph |
| --use_mkl (only for CPU version) | - | - | Run inference with MKL. Requires ir_optim. |
| --use_trt (only for TRT version) | - | - | Run inference with TensorRT. Requires ir_optim. |
| --use_lite (only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference. Requires ir_optim. |
| --use_xpu | - | - | Run PaddleLite inference with Baidu Kunlun XPU. Requires ir_optim. |
| --precision | str | FP32 | Precision mode; supports FP32, FP16, INT8 |
| --use_calib | bool | False | Use TRT int8 calibration |
| --gpu_multi_stream | bool | False | Enable GPU multi-stream mode (EnableGpuMultiStream) for higher QPS |
| --use_ascend_cl | bool | False | Enable for Ascend910; use together with use_lite for Ascend310 |
| --request_cache_size | int | 0 | Size of the request cache in bytes. The cache is disabled by default |
| --enable_prometheus | bool | False | Use Prometheus |
| --prometheus_port | int | 19393 | Port of Prometheus |
| --use_dist_model | bool | False | Whether to use a distributed model |
| --dist_carrier_id | str | "" | Carrier ID of the distributed model |
| --dist_cfg_file | str | "" | Config file of the distributed model |
| --dist_endpoints | str | "" | Endpoints of the distributed model, separated by commas |
| --dist_nranks | int | 0 | Number of ranks in the distributed model |
| --dist_subgraph_index | int | -1 | Subgraph index of the distributed model |
| --dist_master_serving | bool | False | Whether this is the master serving of distributed inference |
| --min_subgraph_size | str | "" | Minimum subgraph size |
| --gpu_memory_mb | int | 50 | Initial GPU memory allocation, in MB |
| --cpu_math_thread_num | int | 1 | Initial number of CPU computing threads |
| --trt_workspace_size | int | 33554432 | Initial GPU memory allocated for TensorRT, in bytes (1 << 25) |
| --trt_use_static | bool | False | Initialize TRT with static data |

Serving a model on multiple GPUs:

python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2

Serving two models:

python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
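
When the service is launched from a script, it can help to assemble the command from the flags listed above rather than concatenating strings by hand. The sketch below only builds and prints the argument list; the helper name `build_serve_command` is hypothetical, and it uses just the --model, --port, --thread and --gpu_ids flags from the table.

```python
import shlex


def build_serve_command(models, port=9292, thread=2, gpu_ids=None):
    """Assemble the argv list for `python3 -m paddle_serving_server.serve`."""
    argv = ["python3", "-m", "paddle_serving_server.serve",
            "--model", *models,
            "--port", str(port),
            "--thread", str(thread)]
    if gpu_ids:
        # Same comma-separated form as the multi-GPU example above.
        argv += ["--gpu_ids", ",".join(str(card) for card in gpu_ids)]
    return argv


cmd = build_serve_command(["serving_model"], port=9292, thread=10, gpu_ids=[0, 1, 2])
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.Popen` as is; keeping it as a list avoids shell quoting problems with model paths that contain spaces.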

Stop Serving (run the following command in the directory where the service was started, or in the path set by the SERVING_HOME environment variable):

python3 -m paddle_serving_server.serve stop

stop sends SIGINT to C++ Serving. If kill is given instead, SIGKILL is sent to C++ Serving.

2. Starting with user-defined Configuration

In most cases the flags above are sufficient. For finer control, you can edit the server configuration files directly: service.prototxt, workflow.prototxt, resource.prototxt, model_toolkit.prototxt and proj.conf.

Example of starting with a user-defined configuration:

/bin/serving --flagfile=proj.conf

2.1 proj.conf

You can set many flags in proj.conf:

# for paddle inference
--precision=fp32
--use_calib=False
--reload_interval_s=10
# for brpc
--max_concurrency=0
--num_threads=10
--bthread_concurrency=10
--max_body_size=536870912
# default path
--inferservice_path=conf
--inferservice_file=infer_service.prototxt
--resource_path=conf
--resource_file=resource.prototxt
--workflow_path=conf
--workflow_file=workflow.prototxt

The table below describes each flag:

| Name | Default | Description |
|---|---|---|
| precision | "fp32" | Precision mode; supports FP32, FP16, INT8 |
| use_calib | False | Only for deployment with TensorRT |
| reload_interval_s | 10 | Reload interval, in seconds |
| max_concurrency | 0 | Limit on requests processed in parallel; 0 means unlimited |
| num_threads | 10 | Number of brpc service threads |
| bthread_concurrency | 10 | Number of bthreads |
| max_body_size | 536870912 | Maximum size of a brpc message, in bytes |
| inferservice_path | "conf" | Path of the inferservice conf |
| inferservice_file | "infer_service.prototxt" | Filename of the inferservice conf |
| resource_path | "conf" | Path of the resource conf |
| resource_file | "resource.prototxt" | Filename of the resource conf |
| workflow_path | "conf" | Path of the workflow conf |
| workflow_file | "workflow.prototxt" | Filename of the workflow conf |
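
As the example proj.conf shows, each line of the flag file has the form `--name=value`. A minimal sketch of generating such a file from a dictionary follows; `render_flagfile` is a hypothetical helper, and the flag names are taken from the table above.

```python
def render_flagfile(flags):
    """Render a dict of flag names and values as `--name=value` lines."""
    return "\n".join(f"--{name}={value}" for name, value in flags.items()) + "\n"


brpc_and_paths = {
    "num_threads": 10,
    "max_body_size": 512 * 1024 * 1024,  # 536870912 bytes, as in the example
    "inferservice_path": "conf",
    "inferservice_file": "infer_service.prototxt",
}

# Print the text; writing it to proj.conf is left to the caller.
print(render_flagfile(brpc_and_paths), end="")
```

Writing the sizes as arithmetic (512 * 1024 * 1024) keeps the intent readable while still producing the byte count the flag expects.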

2.2 service.prototxt

To set the listening port, modify service.prototxt. Use --inferservice_path and --inferservice_file to tell the server where to find service.prototxt. The file must conform to the InferServiceConf message in core/configure/server_configure.proto.

port: 8010
services {
  name: "GeneralModelService"
  workflows: "workflow1"
}
  • port: The listening port.
  • services: Does not need to be modified. workflow1 is defined in workflow.prototxt.

2.3 workflow.prototxt

To serve user-defined OPs, modify workflow.prototxt. Use --workflow_path and --workflow_file to tell the server where to find workflow.prototxt. The file must conform to the Workflow message in core/configure/server_configure.proto.

In the example below, the model is served with 3 OPs. GeneralReaderOp converts the input data to tensors. GeneralInferOp, which depends on the output of GeneralReaderOp, runs prediction on the tensors. GeneralResponseOp returns the output data.

workflows {
  name: "workflow1"
  workflow_type: "Sequence"
  nodes {
    name: "general_reader_0"
    type: "GeneralReaderOp"
  }
  nodes {
    name: "general_infer_0"
    type: "GeneralInferOp"
    dependencies {
      name: "general_reader_0"
      mode: "RO"
    }
  }
  nodes {
    name: "general_response_0"
    type: "GeneralResponseOp"
    dependencies {
      name: "general_infer_0"
      mode: "RO"
    }
  }
}
  • name: The name of the workflow.
  • workflow_type: "Sequence"
  • nodes: The nodes that make up the workflow.
  • node.name: The name of the node, which corresponds to the node type. Refer to python/paddle_serving_server/dag.py.
  • node.type: The operator bound to the node. Refer to the OPs in serving/op.
  • node.dependencies: The list of upstream operators the node depends on.
  • node.dependencies.name: The name of the upstream operator.
  • node.dependencies.mode: RO (read only) or RW (read write).
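
Before starting the server with a hand-edited workflow, a quick consistency check of the dependency chain can catch typos in node names. The sketch below is an assumption-laden illustration: the node list is a plain Python mirror of the example above, `check_sequence_workflow` is a hypothetical helper, and the rule that a node may only depend on nodes listed before it is our own sanity check, not a documented server requirement.

```python
def check_sequence_workflow(nodes):
    """Return a list of problems; an empty list means the chain is consistent."""
    problems = []
    seen = set()
    for node in nodes:
        for dep in node.get("dependencies", []):
            if dep["name"] not in seen:
                problems.append(
                    f'{node["name"]} depends on undefined or later node {dep["name"]}')
            if dep["mode"] not in ("RO", "RW"):
                problems.append(f'{node["name"]} uses unknown mode {dep["mode"]}')
        seen.add(node["name"])
    return problems


# Plain-Python mirror of the workflow1 example above.
workflow1 = [
    {"name": "general_reader_0", "type": "GeneralReaderOp"},
    {"name": "general_infer_0", "type": "GeneralInferOp",
     "dependencies": [{"name": "general_reader_0", "mode": "RO"}]},
    {"name": "general_response_0", "type": "GeneralResponseOp",
     "dependencies": [{"name": "general_infer_0", "mode": "RO"}]},
]

print(check_sequence_workflow(workflow1))
```

For the three-node example the check reports no problems; misspelling a dependency name, or listing a node before the one it depends on, produces a message for that node.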

2.4 resource.prototxt

To set the path of the model files, modify resource.prototxt. Use --resource_path and --resource_file to tell the server where to find resource.prototxt. The file must conform to the ResourceConf message in core/configure/server_configure.proto.

model_toolkit_path: "conf"
model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
general_model_path: "conf"
general_model_file: "general_infer_0/general_model.prototxt"
  • model_toolkit_path: The directory path of model_toolkit.prototxt.
  • model_toolkit_file: The file name of model_toolkit.prototxt.
  • general_model_path: The directory path of general_model.prototxt.
  • general_model_file: The file name of general_model.prototxt.

2.5 model_toolkit.prototxt

model_toolkit.prototxt specifies the parameters of the predictor engines. The file must conform to the ModelToolkitConf message in core/configure/server_configure.proto.

Example using the CPU engine:

engines {
  name: "general_infer_0"
  type: "PADDLE_INFER"
  reloadable_meta: "uci_housing_model/fluid_time_file"
  reloadable_type: "timestamp_ne"
  model_dir: "uci_housing_model"
  gpu_ids: -1
  enable_memory_optimization: true
  enable_ir_optimization: false
  use_trt: false
  use_lite: false
  use_xpu: false
  use_gpu: false
  combined_model: false
  gpu_multi_stream: false
  use_ascend_cl: false
  runtime_thread_num: 0
  batch_infer_size: 32
  enable_overrun: false
  allow_split_request: true
}
  • name: The name of the engine, corresponding to the node name in workflow.prototxt.
  • type: Only "PADDLE_INFER" is supported.
  • reloadable_meta: The marker file used to trigger a reload.
  • reloadable_type: Supports timestamp_ne/timestamp_gt/md5sum/revision/none

| reloadable_type | Description |
|---|---|
| timestamp_ne | Reload when the mtime of the reloadable_meta file has changed |
| timestamp_gt | Reload when the mtime of the reloadable_meta file is greater than the last recorded value |
| md5sum | Not used |
| revision | Not used |

  • model_dir: The path of the model files.
  • gpu_ids: The GPU IDs to use. Multiple device IDs are supported:
# GPU0,1,2
gpu_ids: 0
gpu_ids: 1
gpu_ids: 2
  • enable_memory_optimization: Enable memory optimization.
  • enable_ir_optimization: Enable IR optimization.
  • use_trt: Enable TensorRT. Requires use_gpu to be on.
  • use_lite: Enable PaddleLite.
  • use_xpu: Enable Kunlun XPU.
  • use_gpu: Enable GPU.
  • combined_model: Enable combined model.
  • gpu_multi_stream: Enable GPU multi-stream mode.
  • use_ascend_cl: Enable Ascend. Use on its own for Ascend910; use together with use_lite for Ascend310.
  • runtime_thread_num: Async mode is enabled when this value is greater than 0, and the corresponding predictors are created.
  • batch_infer_size: The maximum batch size in Async mode.
  • enable_overrun: Enable overrun in Async mode, which means putting the whole task into the task queue.
  • allow_split_request: Allow a request task to be split in Async mode.
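
The difference between timestamp_ne and timestamp_gt is easiest to see as a comparison of the marker file's recorded and current mtime. The function below is a sketch derived only from the descriptions in the table above; `should_reload` is a hypothetical name and this is not the server's actual implementation.

```python
def should_reload(reloadable_type, last_mtime, current_mtime):
    """Decide whether a reload is due, following the table's descriptions."""
    if reloadable_type == "timestamp_ne":
        # Any change of the marker file's mtime triggers a reload.
        return current_mtime != last_mtime
    if reloadable_type == "timestamp_gt":
        # Only a newer mtime triggers a reload.
        return current_mtime > last_mtime
    # md5sum, revision and none never trigger a reload in this sketch.
    return False


# A marker file restored from an older backup: the mtime moves backwards.
print(should_reload("timestamp_ne", last_mtime=200, current_mtime=100))
print(should_reload("timestamp_gt", last_mtime=200, current_mtime=100))
```

Under these assumptions, a marker file whose mtime moves backwards triggers a reload with timestamp_ne but not with timestamp_gt.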

2.6 general_model.prototxt

The content of general_model.prototxt is the same as that of serving_server_conf.prototxt.

feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "fc_0.tmp_1"
  alias_name: "price"
  is_lod_tensor: false
  fetch_type: 1
  shape: 1
}

Python Pipeline

Quick start and stop

Example starting Pipeline Serving:

python3 web_service.py

Stop Serving (run the following command in the directory where Pipeline Serving was started, or in the path set by the SERVING_HOME environment variable):

python3 -m paddle_serving_server.serve stop

stop sends SIGINT to Pipeline Serving. If kill is given instead, SIGKILL is sent to Pipeline Serving.

yml Configuration

Python Pipeline provides a user-friendly programming framework for multi-model composite services.

Example of config.yaml:

#RPC port. The RPC port and the HTTP port cannot both be empty. If the RPC port is empty and the HTTP port is not, the RPC port is automatically set to HTTP port + 1.
rpc_port: 18090

#HTTP port. The RPC port and the HTTP port cannot both be empty. If the RPC port is set and the HTTP port is empty, no HTTP port is generated automatically.
http_port: 9999

#worker_num, the maximum concurrency.
#When build_dag_each_worker=True, the server creates processes, each with its own gRPC server and DAG.
#When build_dag_each_worker=False, worker_num sets the thread pool size of the gRPC server.
worker_num: 20

#build_dag_each_worker. False: one DAG is created within the process; True: multiple independent DAGs are created, one per process.
build_dag_each_worker: false

dag:
    #True: thread model; False: process model
    is_thread_op: False

    #retry times
    retry: 1

    #True: generate Timeline data; False: do not generate it
    use_profile: false
    tracer:
        interval_s: 10

    #client type: brpc, grpc or local_predictor
    #client_type: local_predictor

    #max channel size, default 0
    #channel_size: 0

    #For distributed large-model scenarios with tensor parallelism: keep the first result received and discard the others to improve speed
    #channel_recv_frist_arrive: False

op:
    det:
        #concurrency: number of threads when is_thread_op=True, otherwise number of processes
        concurrency: 6

        #Serving IPs
        #server_endpoints: ["127.0.0.1:9393"]

        #Fetch data list
        #fetch_list: ["concat_1.tmp_0"]

        #det client config
        #client_config: serving_client_conf.prototxt

        #Serving timeout, ms
        #timeout: 3000

        #Serving retry times
        #retry: 1

        #Default 1. When batch_size > 1, auto_batching_timeout should also be set
        #batch_size: 2

        #Batching timeout, used together with batch_size
        #auto_batching_timeout: 2000

        #Load the local service configuration when server_endpoints is not set
        local_service_conf:
            #client type: brpc, grpc or local_predictor
            client_type: local_predictor

            #det model path
            model_config: ocr_det_model

            #Fetch data list
            fetch_list: ["concat_1.tmp_0"]

            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu, 5=arm ascend310, 6=arm ascend910
            device_type: 0

            #Device ID
            devices: ""

            #use_mkldnn: when running with MKLDNN, ir_optim must be True
            #use_mkldnn: True

            #ir_optim: when running with TensorRT, ir_optim must be True
            ir_optim: True

            #Number of CPU computing threads; enabling this in CPU scenarios reduces the response time of a single request
            #thread_num: 10

            #precision: lowering the precision can increase speed
            #GPU supports: "fp32"(default), "fp16", "int8"
            #CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
            precision: "fp32"

            #mem_optim: memory / GPU memory optimization
            #mem_optim: True

            #use_calib: use TRT int8 calibration
            #use_calib: False

            #use_mkldnn: use MKLDNN on CPU
            #use_mkldnn: False

            #Cache capacity for different input shapes with MKLDNN
            #mkldnn_cache_capacity: 0

            #mkldnn_op_list: ops accelerated with MKLDNN, None by default
            #mkldnn_op_list: []

            #mkldnn_bf16_op_list: ops accelerated with MKLDNN bf16, None by default
            #mkldnn_bf16_op_list: []

            #min_subgraph_size: the minimum subgraph size for TensorRT optimization, 3 by default
            #min_subgraph_size: 3
    rec:
        #concurrency: number of threads when is_thread_op=True, otherwise number of processes
        concurrency: 3

        #timeout, ms
        timeout: -1

        #retry times
        retry: 1

        #Load the local service configuration when server_endpoints is not set
        local_service_conf:

            #client type: brpc, grpc or local_predictor
            client_type: local_predictor

            #rec model path
            model_config: ocr_rec_model

            #Fetch data list
            fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]

            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu, 5=arm ascend310, 6=arm ascend910
            device_type: 0

            #Device ID
            devices: ""

            #use_mkldnn: when running with MKLDNN, ir_optim must be True
            #use_mkldnn: True

            #ir_optim: when running with TensorRT, ir_optim must be True
            ir_optim: True

            #Number of CPU computing threads; enabling this in CPU scenarios reduces the response time of a single request
            #thread_num: 10

            #precision: lowering the precision can increase speed
            #GPU supports: "fp32"(default), "fp16", "int8"
            #CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
            precision: "fp32"
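
The port rule stated in the comments at the top of config.yaml can be written out as a small function. This is a sketch of the rule as the comments describe it, with `None` standing in for an empty port; `resolve_ports` is a hypothetical helper and not the framework's own code.

```python
def resolve_ports(rpc_port=None, http_port=None):
    """Apply the rpc_port/http_port rule described in the config comments."""
    if rpc_port is None and http_port is None:
        raise ValueError("rpc_port and http_port cannot both be empty")
    if rpc_port is None:
        # Empty RPC port: derive it from the HTTP port.
        rpc_port = http_port + 1
    # An empty HTTP port stays empty; it is not generated automatically.
    return rpc_port, http_port


print(resolve_ports(rpc_port=18090, http_port=9999))
print(resolve_ports(http_port=9999))
print(resolve_ports(rpc_port=18090))
```

The asymmetry is the point to remember: leaving out rpc_port still yields an RPC endpoint, while leaving out http_port means no HTTP endpoint at all.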

Single-machine and multi-card inference

Single-machine multi-card inference can be abstracted as M OP processes bound to N GPU cards. It is controlled by three parameters in config.yml: is_thread_op selects the process model, concurrency sets the number of processes, and devices lists the GPU card IDs. Processes are bound by cycling through the GPU card IDs as they start. For example, if 7 OP processes are started with devices: "0,1,2" in config.yml, the first, fourth and seventh processes are bound to card 0, the second and fifth to card 1, and the third and sixth to card 2.

Reference config.yaml:

#True, thread model;False,process model
is_thread_op: False

#concurrency,is_thread_op=True,thread otherwise process
concurrency: 7

devices: "0,1,2"
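
The binding described above amounts to a round-robin over the card IDs. The following sketch reproduces the 7-process, 3-card example; `bind_processes` is a hypothetical helper that illustrates the rule, not the framework's code.

```python
def bind_processes(concurrency, devices):
    """Map each 1-based process index to a GPU card ID, round-robin."""
    card_ids = [card.strip() for card in devices.split(",") if card.strip()]
    if not card_ids:
        raise ValueError("devices must list at least one GPU card ID")
    return {index + 1: card_ids[index % len(card_ids)]
            for index in range(concurrency)}


binding = bind_processes(concurrency=7, devices="0,1,2")
for process, card in binding.items():
    print(f"process {process} -> card {card}")
```

With 7 processes and 3 cards, card 0 ends up with three processes and cards 1 and 2 with two each, so choosing a concurrency that is a multiple of the card count gives an even load.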

Heterogeneous Devices

In addition to CPU and GPU, Pipeline supports deployment on a variety of heterogeneous hardware, configured through device_type and devices in config.yml. device_type takes priority; when it is not set, the device type is inferred from devices. The device_type values are as follows:

  • CPU(Intel) : 0
  • GPU : 1
  • TensorRT : 2
  • CPU(Arm) : 3
  • XPU : 4
  • Ascend310(Arm) : 5
  • Ascend910(Arm) : 6

Reference config.yaml:

# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu, 5=arm ascend310, 6=arm ascend910
device_type: 0
devices: "" # "0,1"
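
The precedence between device_type and devices can be sketched as follows. The code table comes from the list above; the fallback rule (an empty devices string means CPU, anything else means GPU) is an assumption about how devices is interpreted when device_type is absent, and `resolve_device` is a hypothetical helper.

```python
# device_type codes as listed above.
DEVICE_TYPES = {
    0: "cpu", 1: "gpu", 2: "tensorrt", 3: "arm cpu",
    4: "kunlun xpu", 5: "arm ascend310", 6: "arm ascend910",
}


def resolve_device(device_type=None, devices=""):
    """Pick the device kind: device_type first, then fall back to devices."""
    if device_type is not None:
        return DEVICE_TYPES[device_type]
    # Assumed fallback: empty devices -> CPU, otherwise GPU.
    return "gpu" if devices.strip() else "cpu"


print(resolve_device(device_type=2, devices="0"))
print(resolve_device(devices="0,1"))
print(resolve_device(devices=""))
```

Setting device_type explicitly is the safer choice for anything other than plain CPU or GPU, since the fallback cannot distinguish TensorRT, XPU or Ascend targets.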

Low precision inference

Python Pipeline supports low-precision inference. The precision types supported by CPU, GPU and TensorRT are listed below:

  • CPU
    • fp32(default)
    • fp16
    • bf16(mkldnn)
  • GPU
    • fp32(default)
    • fp16 (takes effect with TensorRT)
    • int8
  • Tensor RT
    • fp32(default)
    • fp16
    • int8
#precision
#GPU support: "fp32"(default), "fp16"(TensorRT), "int8";
#CPU support: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
precision: "fp32"

#use_calib: enable it when using int8
use_calib: True
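
A configuration can be checked against the support list above before the service starts. The sketch below encodes that list as a lookup table; `check_precision` and `needs_calibration` are hypothetical helpers, and the calibration rule simply mirrors the comment in the reference config.

```python
# Precision types per device, taken from the list above.
SUPPORTED_PRECISION = {
    "cpu": ("fp32", "fp16", "bf16"),
    "gpu": ("fp32", "fp16", "int8"),
    "tensorrt": ("fp32", "fp16", "int8"),
}


def check_precision(device, precision):
    """Return True when the list above names this precision for the device."""
    return precision.lower() in SUPPORTED_PRECISION[device.lower()]


def needs_calibration(precision):
    """The reference config enables use_calib when int8 is used."""
    return precision.lower() == "int8"


for device, precision in [("cpu", "int8"), ("gpu", "int8"), ("cpu", "bf16")]:
    print(device, precision, check_precision(device, precision),
          needs_calibration(precision))
```

A check like this fails fast on combinations such as int8 on CPU, which the list marks as unsupported.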