Add and upgrade to Milvus2.0 support for neural search (PaddlePaddle#2945)

* Add Milvus2.0 support for neural search

* Add milvus search file and remove unused blanks and code

* Add milvus_util.py
w5688414 authored Aug 9, 2022
1 parent 145333e commit 085ac53
Showing 17 changed files with 409 additions and 574 deletions.
Binary file added applications/neural_search/img/attu.png
4 changes: 2 additions & 2 deletions applications/neural_search/recall/in_batch_negative/README.md
@@ -66,8 +66,8 @@ Recall@K is the recall within the top-K predictions (top-K means taking the K highest-scoring results from the final ranking…
GPU is recommended for training; either CPU or GPU can be used at prediction time.
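In other words, Recall@K = (number of relevant results among the top K returned) / (total number of relevant results).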

**Environment dependencies**
-* python >= 3.6
-* paddlepaddle >= 2.1.3
+* python >= 3.6.2
+* paddlepaddle >= 2.2.3
* paddlenlp >= 2.2
* [hnswlib](https://github.com/nmslib/hnswlib) >= 0.5.2
* visualdl >= 2.2.2
@@ -34,22 +34,20 @@
# yapf: enable

if __name__ == "__main__":
-    # If you want to use the ernie-1.0 model, please uncomment the following code
    output_emb_size = 256

-    pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh")
-    tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
+    pretrained_model = AutoModel.from_pretrained("ernie-1.0")
+    tokenizer = AutoTokenizer.from_pretrained('ernie-1.0')
    model = SemanticIndexBaseStatic(pretrained_model,
                                    output_emb_size=output_emb_size)

    if args.params_path and os.path.isfile(args.params_path):
        state_dict = paddle.load(args.params_path)
        model.set_dict(state_dict)
        print("Loaded parameters from %s" % args.params_path)
    else:
        raise ValueError(
            "Please set --params_path with correct pretrained model file")

    model.eval()

    # Convert to static graph with specific input description
    model = paddle.jit.to_static(
        model,
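The diff is truncated at this point. For reference, a PaddleNLP static-graph export typically finishes along the lines of the sketch below; the exact `input_spec` and save path used by this script are not visible in the diff, so both are assumptions:

```
# Hypothetical completion of the truncated export call above; the actual
# input_spec and output path in the script may differ.
import paddle

model = paddle.jit.to_static(
    model,
    input_spec=[
        paddle.static.InputSpec(shape=[None, None], dtype="int64"),  # input_ids
        paddle.static.InputSpec(shape=[None, None], dtype="int64"),  # token_type_ids
    ])
# Save the static-graph model for use by the inference predictor.
paddle.jit.save(model, "output/inference")
```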
100 changes: 55 additions & 45 deletions applications/neural_search/recall/milvus/README.md
@@ -32,11 +32,11 @@
## 2. Environment Dependencies and Installation Notes

**Environment dependencies**
-* python >= 3.6
+* python >= 3.6.2
* paddlepaddle >= 2.2
* paddlenlp >= 2.2
-* milvus >= 1.1.1
-* pymilvus >= 1.1.2
+* milvus >= 2.1.0
+* pymilvus >= 2.1.0

<a name="代码结构"></a>

@@ -47,17 +47,15 @@
```
|—— scripts
|—— feature_extract.sh bash script for extracting feature vectors
+|—— search.sh bash script for vector insertion and retrieval
├── base_model.py # base class of the semantic indexing model
├── config.py # milvus configuration file
├── data.py # data processing functions
-├── embedding_insert.py # insert vectors
-├── embedding_recall.py # retrieve top-K similar results / ANN
+├── milvus_ann_search.py # script for vector insertion and retrieval
├── inference.py # vector extraction script for the dynamic-graph model
├── feature_extract.py # batch vector extraction script
-├── milvus_insert.py # utility class for inserting vectors
-├── milvus_recall.py # utility class for vector recall
-├── README.md
-└── server_config.yml # milvus config file, the configuration used by this project
+├── milvus_util.py # milvus utility class
+└── README.md
```
<a name="数据准备"></a>

@@ -97,13 +95,14 @@

## 5. Vector Retrieval

### 5.1 Building a Milvus-Based Vector Retrieval System

-Once the data is ready, we build the Milvus semantic retrieval engine for fast semantic-vector search. We use the open-source [Milvus](https://milvus.io/) tool for recall; for setup, follow the official guide: [Milvus installation tutorial](https://milvus.io/cn/docs/v1.1.1/milvus_docker-cpu.md). This project uses the Milvus 1.1.1 CPU version; the official Docker installation is recommended, as it is simple and quick.
+Once the data is ready, we build the Milvus semantic retrieval engine for fast semantic-vector search. We use the open-source [Milvus](https://milvus.io/) tool for recall; for setup, follow the official guide: [Milvus installation tutorial](https://milvus.io/docs/v2.1.x/install_standalone-docker.md). This project uses Milvus 2.1; the official Docker installation is recommended, as it is simple and quick.
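Once the container is up, a quick connectivity check with pymilvus 2.x might look like the following sketch; `localhost:19530` is the default standalone port and is an assumption here (this project's `config.py` points at its own host and port 8530 instead):

```
# Minimal connectivity check for a Milvus 2.1 standalone instance;
# host and port are assumptions, config.py uses different values.
from pymilvus import connections, utility

connections.connect(alias="default", host="localhost", port="19530")
print(utility.get_server_version())  # e.g. "v2.1.x"
```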

Once the Milvus system is set up, you can insert and retrieve vectors. First generate the embeddings: each sample produces a 256-dimensional vector, extracted here on a 32 GB V100 GPU:

```
-CUDA_VISIBLE_DEVICES=2 python feature_extract.py \
+CUDA_VISIBLE_DEVICES=0 python feature_extract.py \
--model_dir=./output \
--corpus_file "data/milvus_data.csv"
```
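The embeddings end up in `corpus_embedding.npy` (the file passed to `--embedding_path` in the commands below; that it is written by `feature_extract.py` is an inference from those commands). A quick sanity check might look like this sketch:

```
# Load the extracted embeddings and verify their shape; the file name is
# taken from the --embedding_path argument used in the commands below.
import numpy as np

embeddings = np.load("corpus_embedding.npy")
print(embeddings.shape)  # expected: (num_samples, 256)
print(embeddings.dtype)  # typically float32
```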
@@ -127,57 +126,60 @@ MILVUS_PORT = 8530
Then run the following command to insert the vectors into the Milvus collection:

```
-python3 embedding_insert.py
+python milvus_ann_search.py --data_path milvus/milvus_data.csv \
+                            --embedding_path corpus_embedding.npy \
+                            --batch_size 100000 \
+                            --insert
```
Parameter descriptions:

* `data_path`: path to the data
* `embedding_path`: path to the embeddings corresponding to the data
* `index`: index of the vector used as the retrieval query
* `insert`: whether to insert vectors
* `search`: whether to search vectors
* `batch_size`: number of vectors inserted per batch


| Data size | Time |
| ------------ | ------------ |
-|10 million entries|12min24s|
+|10 million entries|21min12s|
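For orientation, here is a minimal sketch of what an insert along these lines does with pymilvus 2.x, using the collection and field names from `config.py` (`literature_search`, `embeddings`); the real script also stores a text field, which is omitted here, and the exact schema is an assumption:

```
# Sketch of the pymilvus 2.x insert flow; schema details are assumptions.
import numpy as np
from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                      connections)

connections.connect(host="10.21.226.175", port="8530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=256),
]
collection = Collection(name="literature_search", schema=CollectionSchema(fields))

embeddings = np.load("corpus_embedding.npy")
batch_size = 100000
for start in range(0, len(embeddings), batch_size):
    chunk = embeddings[start:start + batch_size]
    ids = list(range(start, start + len(chunk)))
    collection.insert([ids, chunk.tolist()])
collection.flush()  # make the inserted segments persistent and searchable
```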

-In addition, Milvus provides a visual management UI that makes it easy to inspect the data; it is available at [Attu](https://github.com/zilliztech/attu).

+In addition, Milvus provides a visual management UI that makes it easy to inspect the data; it is available at [Milvus Enterprise Manager](https://github.com/zilliztech/attu):
![](../../img/attu.png)


Run the recall script:

```
-python3 embedding_recall.py
+python milvus_ann_search.py --data_path milvus/milvus_data.csv \
+                            --embedding_path corpus_embedding.npy \
+                            --batch_size 100000 \
+                            --index 18 \
+                            --search
```
-The result shows the recalled ids and the distances computed against the current query:

+The output after running is:

```
-10000000
-time cost 0.5410025119781494 s
-Status(code=0, message='Search vectors successfully!')
-[
-[
-(id:1, distance:0.0),
-(id:7109733, distance:0.832247257232666),
-(id:6770053, distance:0.8488889932632446),
-(id:2653227, distance:0.9032443761825562),
+hit: (distance: 0.0, id: 18), text field: 吉林铁合金集团资产管理现状分析及对策资产管理;资金控制;应收帐款风险;造价控制;集中化财务控制
+hit: (distance: 0.45325806736946106, id: 7611689), text field: 哈药集团应收账款分析应收账款,流动资产,财务报告
+hit: (distance: 0.5440893769264221, id: 4297885), text field: 宝钢集团负债经营风险控制策略研究钢铁行业;负债经营;风险控制
+hit: (distance: 0.5455711483955383, id: 5661135), text field: 浅谈电网企业固定资产风险管理大数据,固定资产,风险管理
...
```
The returned results include the vector distance, the vector id, and the corresponding text.

The first search takes roughly 18 s because the data has to be loaded from disk into memory; subsequent searches are much faster. Measured speed:

| Data size | Time |
| ------------ | ------------ |
|100 entries|0.15351247787475586 s|
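A corresponding sketch of the search path with pymilvus 2.x follows; reading `--index 18` as "use the 18th corpus embedding as the query vector" is an interpretation of the flag, and the search parameters are illustrative rather than the script's exact values:

```
# Sketch of the pymilvus 2.x search flow; flag semantics and parameter
# values are assumptions.
import numpy as np
from pymilvus import Collection, connections

connections.connect(host="10.21.226.175", port="8530")
collection = Collection(name="literature_search")
collection.load()  # the first load pulls data into memory, hence the ~18 s

embeddings = np.load("corpus_embedding.npy")
query = [embeddings[18].tolist()]

results = collection.search(data=query, anns_field="embeddings",
                            param={"metric_type": "L2",
                                   "params": {"nprobe": 20}},
                            limit=10)
for hit in results[0]:
    print(f"hit: (distance: {hit.distance}, id: {hit.id})")
```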

-If the measured speed is too slow, you can tune the cache parameters in the Milvus configuration:
+Alternatively, the whole process above can be run with a single command:

```
-cache:
-  cache_size: 32GB
-  insert_buffer_size: 8GB
-preload_collection:
+sh scripts/search.sh
```
-The larger cache_size and insert_buffer_size are set, the faster it runs; restart Milvus after changing them.

### 5.2 Text Retrieval

-Modify the model path and sample in the code
+First, modify the model path and sample in the code:

```
params_path='checkpoints/model_40/model_state.pdparams'
```

@@ -194,12 +196,20 @@ python3 inference.py

```
[1, 256]
-[[ 0.06374735 -0.08051944  0.05118101 -0.05855767 -0.06969483  0.05318566
-   0.079629    0.02667932 -0.04501902 -0.01187392  0.09590752 -0.05831281
+Tensor(shape=[1, 256], dtype=float32, place=Place(gpu:0), stop_gradient=True,
+       [[ 0.07830613, -0.14036864,  0.03433795, -0.14967985, -0.03386058,
+          0.06630671,  0.01357946,  0.03531205,  0.02411086,  0.02000865,
+          0.05724005, -0.08119474,  0.06286906,  0.06509133,  0.07193415,
....
-5677638 国有股权参股对家族企业创新投入的影响混合所有制改革,国有股权,家族企业,创新投入 0.5417419672012329
-1321645 高管政治联系对民营企业创新绩效的影响——董事会治理行为的非线性中介效应高管政治联系,创新绩效,民营上市公司,董事会治理行为,中介效应 0.5445536375045776
-1340319 国有控股上市公司资产并购重组风险探讨国有控股上市公司,并购重组,防范对策 0.5515031218528748
+hit: (distance: 0.40141725540161133, id: 2742485), text field: 完善国有企业技术创新投入机制的探讨--基于经济责任审计实践国有企业,技术创新,投入机制
+hit: (distance: 0.40258315205574036, id: 1472893), text field: 企业技术创新与组织冗余--基于国有企业与非国有企业的情境研究
+hit: (distance: 0.4121206998825073, id: 51831), text field: 企业创新影响对外直接投资决策—基于中国制造业上市公司的研究企业创新;对外直接投资;制造业;上市公司
+hit: (distance: 0.42234909534454346, id: 8682312), text field: 政治关联对企业创新绩效的影响——国有企业与民营企业的对比政治关联,创新绩效,国有企业,民营企业,双重差分
+hit: (distance: 0.46187296509742737, id: 9324797), text field: 财务杠杆、股权激励与企业创新——基于中国A股制造业经验数据制造业;上市公司;股权激励;财务杠杆;企业创新
....
```
## FAQ
31 changes: 18 additions & 13 deletions applications/neural_search/recall/milvus/config.py
@@ -12,20 +12,25 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import os
-from milvus import MetricType, IndexType

-MILVUS_HOST = '10.21.226.173'
+MILVUS_HOST = '10.21.226.175'
MILVUS_PORT = 8530
+data_dim = 256
+top_k = 100
+collection_name = 'literature_search'
+partition_tag = 'partition_2'
+embedding_name = 'embeddings'

-collection_param = {
-    'dimension': 256,
-    'index_file_size': 256,
-    'metric_type': MetricType.L2
-}
+index_config = {
+    "index_type": "IVF_FLAT",
+    "metric_type": "L2",
+    "params": {
+        "nlist": 1000
+    },
+}

-index_type = IndexType.IVF_FLAT
-index_param = {'nlist': 1000}
-
-top_k = 100
-search_param = {'nprobe': 20}
+search_params = {
+    "metric_type": "L2",
+    "params": {
+        "nprobe": top_k
+    },
+}
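As context for these settings, here is a minimal sketch of how they are typically consumed by pymilvus 2.x, assuming the collection already exists and is populated:

```
# Sketch: build the IVF_FLAT index with index_config, then load the
# collection into memory so searches can use search_params.
from pymilvus import Collection, connections

from config import (MILVUS_HOST, MILVUS_PORT, collection_name,
                    embedding_name, index_config)

connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)
collection = Collection(name=collection_name)
collection.create_index(field_name=embedding_name, index_params=index_config)
collection.load()
```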
38 changes: 0 additions & 38 deletions applications/neural_search/recall/milvus/embedding_insert.py

This file was deleted.

42 changes: 0 additions & 42 deletions applications/neural_search/recall/milvus/embedding_recall.py

This file was deleted.

10 changes: 2 additions & 8 deletions applications/neural_search/recall/milvus/feature_extract.py
@@ -17,7 +17,6 @@
import sys
from tqdm import tqdm
import numpy as np
-from scipy.special import softmax

import paddle
from paddle import inference
@@ -34,23 +33,19 @@
parser = argparse.ArgumentParser()
parser.add_argument("--model_dir", type=str, required=True,
help="The directory to static model.")

parser.add_argument("--corpus_file", type=str, required=True,
help="The corpus_file path.")

parser.add_argument("--max_seq_length", default=64, type=int,
help="The maximum total input sequence length after tokenization. Sequences "
"longer than this will be truncated, sequences shorter will be padded.")
parser.add_argument("--batch_size", default=32, type=int,
help="Batch size per GPU/CPU for training.")
parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu",
help="Select which device to train model, defaults to gpu.")

parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False],
help='Enable to use tensorrt to speed up.')
parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"],
help='The tensorrt precision.')

parser.add_argument('--cpu_threads', default=10, type=int,
help='Number of threads to predict when using cpu.')
parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False],
@@ -131,10 +126,9 @@ def predict(self, data, tokenizer):
Returns:
results(obj:`dict`): All the predictions labels.
"""

        batchify_fn = lambda samples, fn=Tuple(
            Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"),  # input
-           Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"
+           Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"
            ),  # segment
        ): fn(samples)
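The change above fixes the padding of the segment (token type) ids: they are now padded with `pad_token_type_id` rather than the token `pad_token_id`.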

@@ -179,7 +173,7 @@ def read_text(file_path):
args.batch_size, args.use_tensorrt, args.precision,
args.cpu_threads, args.enable_mkldnn)

-    tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
+    tokenizer = AutoTokenizer.from_pretrained('ernie-1.0')
id2corpus = read_text(args.corpus_file)

corpus_list = [{idx: text} for idx, text in id2corpus.items()]