Add and upgrade to Milvus2.0 support for neural search (PaddlePaddle#2945)

* Add Milvus2.0 support for neural search

* Add milvus search file and remove unused blanks and code

* Add milvus_util.py
w5688414 authored Aug 9, 2022
1 parent 145333e commit 085ac53
Showing 17 changed files with 409 additions and 574 deletions.
Binary file added applications/neural_search/img/attu.png
4 changes: 2 additions & 2 deletions applications/neural_search/recall/in_batch_negative/README.md
@@ -66,8 +66,8 @@ Recall@K is the recall within the top-K predictions (top-K means taking the K highest-scoring results from the final ranking…
GPU is recommended for training; either CPU or GPU can be used at prediction time.
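In other words, Recall@K = (number of relevant results among the top K returned) / (total number of relevant results).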

**Environment dependencies**
-* python >= 3.6
-* paddlepaddle >= 2.1.3
+* python >= 3.6.2
+* paddlepaddle >= 2.2.3
* paddlenlp >= 2.2
* [hnswlib](https://github.com/nmslib/hnswlib) >= 0.5.2
* visualdl >= 2.2.2
@@ -34,22 +34,20 @@
# yapf: enable

if __name__ == "__main__":
-    # If you want to use the ernie-1.0 model, please uncomment the following code
    output_emb_size = 256

-    pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh")
-    tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
+    pretrained_model = AutoModel.from_pretrained("ernie-1.0")
+    tokenizer = AutoTokenizer.from_pretrained('ernie-1.0')
    model = SemanticIndexBaseStatic(pretrained_model,
                                    output_emb_size=output_emb_size)

    if args.params_path and os.path.isfile(args.params_path):
        state_dict = paddle.load(args.params_path)
        model.set_dict(state_dict)
        print("Loaded parameters from %s" % args.params_path)
    else:
        raise ValueError(
            "Please set --params_path with correct pretrained model file")

    model.eval()

    # Convert to static graph with specific input description
    model = paddle.jit.to_static(
        model,
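The diff is truncated at this point. For reference, a PaddleNLP static-graph export typically finishes along the lines of the sketch below; the exact `input_spec` and save path used by this script are not visible in the diff, so both are assumptions:

```
# Hypothetical completion of the truncated export call above; the actual
# input_spec and output path in the script may differ.
import paddle

model = paddle.jit.to_static(
    model,
    input_spec=[
        paddle.static.InputSpec(shape=[None, None], dtype="int64"),  # input_ids
        paddle.static.InputSpec(shape=[None, None], dtype="int64"),  # token_type_ids
    ])
# Save the static-graph model for use by the inference predictor.
paddle.jit.save(model, "output/inference")
```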
100 changes: 55 additions & 45 deletions applications/neural_search/recall/milvus/README.md
@@ -32,11 +32,11 @@
## 2. Environment Dependencies and Installation Notes

**Environment dependencies**
-* python >= 3.6
+* python >= 3.6.2
* paddlepaddle >= 2.2
* paddlenlp >= 2.2
-* milvus >= 1.1.1
-* pymilvus >= 1.1.2
+* milvus >= 2.1.0
+* pymilvus >= 2.1.0

<a name="代码结构"></a>

@@ -47,17 +47,15 @@
```
|—— scripts
|—— feature_extract.sh bash script for extracting feature vectors
+|—— search.sh bash script for vector insertion and retrieval
├── base_model.py # base class of the semantic indexing model
├── config.py # milvus configuration file
├── data.py # data processing functions
-├── embedding_insert.py # insert vectors
-├── embedding_recall.py # retrieve top-K similar results / ANN
+├── milvus_ann_search.py # script for vector insertion and retrieval
├── inference.py # vector extraction script for the dynamic-graph model
├── feature_extract.py # batch vector extraction script
-├── milvus_insert.py # utility class for inserting vectors
-├── milvus_recall.py # utility class for vector recall
-├── README.md
-└── server_config.yml # milvus config file, the configuration used by this project
+├── milvus_util.py # milvus utility class
+└── README.md
```
<a name="数据准备"></a>

@@ -97,13 +95,14 @@

## 5. Vector Retrieval

### 5.1 Building a Milvus-Based Vector Retrieval System

-Once the data is ready, we build the Milvus semantic retrieval engine for fast semantic-vector search. We use the open-source [Milvus](https://milvus.io/) tool for recall; for setup, follow the official guide: [Milvus installation tutorial](https://milvus.io/cn/docs/v1.1.1/milvus_docker-cpu.md). This project uses the Milvus 1.1.1 CPU version; the official Docker installation is recommended, as it is simple and quick.
+Once the data is ready, we build the Milvus semantic retrieval engine for fast semantic-vector search. We use the open-source [Milvus](https://milvus.io/) tool for recall; for setup, follow the official guide: [Milvus installation tutorial](https://milvus.io/docs/v2.1.x/install_standalone-docker.md). This project uses Milvus 2.1; the official Docker installation is recommended, as it is simple and quick.
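Once the container is up, a quick connectivity check with pymilvus 2.x might look like the following sketch; `localhost:19530` is the default standalone port and is an assumption here (this project's `config.py` points at its own host and port 8530 instead):

```
# Minimal connectivity check for a Milvus 2.1 standalone instance;
# host and port are assumptions, config.py uses different values.
from pymilvus import connections, utility

connections.connect(alias="default", host="localhost", port="19530")
print(utility.get_server_version())  # e.g. "v2.1.x"
```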

Once the Milvus system is set up, you can insert and retrieve vectors. First generate the embeddings: each sample produces a 256-dimensional vector, extracted here on a 32 GB V100 GPU:

```
-CUDA_VISIBLE_DEVICES=2 python feature_extract.py \
+CUDA_VISIBLE_DEVICES=0 python feature_extract.py \
--model_dir=./output \
--corpus_file "data/milvus_data.csv"
```
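The embeddings end up in `corpus_embedding.npy` (the file passed to `--embedding_path` in the commands below; that it is written by `feature_extract.py` is an inference from those commands). A quick sanity check might look like this sketch:

```
# Load the extracted embeddings and verify their shape; the file name is
# taken from the --embedding_path argument used in the commands below.
import numpy as np

embeddings = np.load("corpus_embedding.npy")
print(embeddings.shape)  # expected: (num_samples, 256)
print(embeddings.dtype)  # typically float32
```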
@@ -127,57 +126,60 @@ MILVUS_PORT = 8530
Then run the following command to insert the vectors into the Milvus collection:

```
-python3 embedding_insert.py
+python milvus_ann_search.py --data_path milvus/milvus_data.csv \
+                            --embedding_path corpus_embedding.npy \
+                            --batch_size 100000 \
+                            --insert
```
Parameter descriptions:

* `data_path`: path to the data
* `embedding_path`: path to the embeddings corresponding to the data
* `index`: index of the vector used as the retrieval query
* `insert`: whether to insert vectors
* `search`: whether to search vectors
* `batch_size`: number of vectors inserted per batch


| Data size | Time |
| ------------ | ------------ |
-|10 million entries|12min24s|
+|10 million entries|21min12s|
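For orientation, here is a minimal sketch of what an insert along these lines does with pymilvus 2.x, using the collection and field names from `config.py` (`literature_search`, `embeddings`); the real script also stores a text field, which is omitted here, and the exact schema is an assumption:

```
# Sketch of the pymilvus 2.x insert flow; schema details are assumptions.
import numpy as np
from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                      connections)

connections.connect(host="10.21.226.175", port="8530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=256),
]
collection = Collection(name="literature_search", schema=CollectionSchema(fields))

embeddings = np.load("corpus_embedding.npy")
batch_size = 100000
for start in range(0, len(embeddings), batch_size):
    chunk = embeddings[start:start + batch_size]
    ids = list(range(start, start + len(chunk)))
    collection.insert([ids, chunk.tolist()])
collection.flush()  # make the inserted segments persistent and searchable
```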

-In addition, Milvus provides a visual management UI that makes it easy to inspect the data; it is available at [Attu](https://github.com/zilliztech/attu).

+In addition, Milvus provides a visual management UI that makes it easy to inspect the data; it is available at [Milvus Enterprise Manager](https://github.com/zilliztech/attu):
![](../../img/attu.png)


Run the recall script:

```
-python3 embedding_recall.py
+python milvus_ann_search.py --data_path milvus/milvus_data.csv \
+                            --embedding_path corpus_embedding.npy \
+                            --batch_size 100000 \
+                            --index 18 \
+                            --search
```
-The result shows the recalled ids and the distances computed against the current query:

+The output after running is:

```
-10000000
-time cost 0.5410025119781494 s
-Status(code=0, message='Search vectors successfully!')
-[
-[
-(id:1, distance:0.0),
-(id:7109733, distance:0.832247257232666),
-(id:6770053, distance:0.8488889932632446),
-(id:2653227, distance:0.9032443761825562),
+hit: (distance: 0.0, id: 18), text field: 吉林铁合金集团资产管理现状分析及对策资产管理;资金控制;应收帐款风险;造价控制;集中化财务控制
+hit: (distance: 0.45325806736946106, id: 7611689), text field: 哈药集团应收账款分析应收账款,流动资产,财务报告
+hit: (distance: 0.5440893769264221, id: 4297885), text field: 宝钢集团负债经营风险控制策略研究钢铁行业;负债经营;风险控制
+hit: (distance: 0.5455711483955383, id: 5661135), text field: 浅谈电网企业固定资产风险管理大数据,固定资产,风险管理
...
```
The returned results include the vector distance, the vector id, and the corresponding text.

The first search takes roughly 18 s because the data has to be loaded from disk into memory; subsequent searches are much faster. Measured speed:

| Data size | Time |
| ------------ | ------------ |
|100 entries|0.15351247787475586 s|
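A corresponding sketch of the search path with pymilvus 2.x follows; reading `--index 18` as "use the 18th corpus embedding as the query vector" is an interpretation of the flag, and the search parameters are illustrative rather than the script's exact values:

```
# Sketch of the pymilvus 2.x search flow; flag semantics and parameter
# values are assumptions.
import numpy as np
from pymilvus import Collection, connections

connections.connect(host="10.21.226.175", port="8530")
collection = Collection(name="literature_search")
collection.load()  # the first load pulls data into memory, hence the ~18 s

embeddings = np.load("corpus_embedding.npy")
query = [embeddings[18].tolist()]

results = collection.search(data=query, anns_field="embeddings",
                            param={"metric_type": "L2",
                                   "params": {"nprobe": 20}},
                            limit=10)
for hit in results[0]:
    print(f"hit: (distance: {hit.distance}, id: {hit.id})")
```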

-If the measured speed is too slow, you can tune the cache parameters in the Milvus configuration:
+Alternatively, the whole process above can be run with a single command:

```
-cache:
-  cache_size: 32GB
-  insert_buffer_size: 8GB
-preload_collection:
+sh scripts/search.sh
```
-The larger cache_size and insert_buffer_size are set, the faster it runs; restart Milvus after changing them.

### 5.2 Text Retrieval

-Modify the model path and sample in the code
+First, modify the model path and sample in the code:

```
params_path='checkpoints/model_40/model_state.pdparams'
```

@@ -194,12 +196,20 @@ python3 inference.py

```
[1, 256]
-[[ 0.06374735 -0.08051944  0.05118101 -0.05855767 -0.06969483  0.05318566
-   0.079629    0.02667932 -0.04501902 -0.01187392  0.09590752 -0.05831281
+Tensor(shape=[1, 256], dtype=float32, place=Place(gpu:0), stop_gradient=True,
+       [[ 0.07830613, -0.14036864,  0.03433795, -0.14967985, -0.03386058,
+          0.06630671,  0.01357946,  0.03531205,  0.02411086,  0.02000865,
+          0.05724005, -0.08119474,  0.06286906,  0.06509133,  0.07193415,
....
-5677638 国有股权参股对家族企业创新投入的影响混合所有制改革,国有股权,家族企业,创新投入 0.5417419672012329
-1321645 高管政治联系对民营企业创新绩效的影响——董事会治理行为的非线性中介效应高管政治联系,创新绩效,民营上市公司,董事会治理行为,中介效应 0.5445536375045776
-1340319 国有控股上市公司资产并购重组风险探讨国有控股上市公司,并购重组,防范对策 0.5515031218528748
+hit: (distance: 0.40141725540161133, id: 2742485), text field: 完善国有企业技术创新投入机制的探讨--基于经济责任审计实践国有企业,技术创新,投入机制
+hit: (distance: 0.40258315205574036, id: 1472893), text field: 企业技术创新与组织冗余--基于国有企业与非国有企业的情境研究
+hit: (distance: 0.4121206998825073, id: 51831), text field: 企业创新影响对外直接投资决策—基于中国制造业上市公司的研究企业创新;对外直接投资;制造业;上市公司
+hit: (distance: 0.42234909534454346, id: 8682312), text field: 政治关联对企业创新绩效的影响——国有企业与民营企业的对比政治关联,创新绩效,国有企业,民营企业,双重差分
+hit: (distance: 0.46187296509742737, id: 9324797), text field: 财务杠杆、股权激励与企业创新——基于中国A股制造业经验数据制造业;上市公司;股权激励;财务杠杆;企业创新
....
```
## FAQ
31 changes: 18 additions & 13 deletions applications/neural_search/recall/milvus/config.py
@@ -12,20 +12,25 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import os
-from milvus import MetricType, IndexType

-MILVUS_HOST = '10.21.226.173'
+MILVUS_HOST = '10.21.226.175'
MILVUS_PORT = 8530
+data_dim = 256
+top_k = 100
+collection_name = 'literature_search'
+partition_tag = 'partition_2'
+embedding_name = 'embeddings'

-collection_param = {
-    'dimension': 256,
-    'index_file_size': 256,
-    'metric_type': MetricType.L2
-}
+index_config = {
+    "index_type": "IVF_FLAT",
+    "metric_type": "L2",
+    "params": {
+        "nlist": 1000
+    },
+}

-index_type = IndexType.IVF_FLAT
-index_param = {'nlist': 1000}
-
-top_k = 100
-search_param = {'nprobe': 20}
+search_params = {
+    "metric_type": "L2",
+    "params": {
+        "nprobe": top_k
+    },
+}
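As context for these settings, here is a minimal sketch of how they are typically consumed by pymilvus 2.x, assuming the collection already exists and is populated:

```
# Sketch: build the IVF_FLAT index with index_config, then load the
# collection into memory so searches can use search_params.
from pymilvus import Collection, connections

from config import (MILVUS_HOST, MILVUS_PORT, collection_name,
                    embedding_name, index_config)

connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)
collection = Collection(name=collection_name)
collection.create_index(field_name=embedding_name, index_params=index_config)
collection.load()
```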
38 changes: 0 additions & 38 deletions applications/neural_search/recall/milvus/embedding_insert.py

This file was deleted.

42 changes: 0 additions & 42 deletions applications/neural_search/recall/milvus/embedding_recall.py

This file was deleted.

10 changes: 2 additions & 8 deletions applications/neural_search/recall/milvus/feature_extract.py
@@ -17,7 +17,6 @@
import sys
from tqdm import tqdm
import numpy as np
-from scipy.special import softmax

import paddle
from paddle import inference
@@ -34,23 +33,19 @@
parser = argparse.ArgumentParser()
parser.add_argument("--model_dir", type=str, required=True,
help="The directory to static model.")

parser.add_argument("--corpus_file", type=str, required=True,
help="The corpus_file path.")

parser.add_argument("--max_seq_length", default=64, type=int,
help="The maximum total input sequence length after tokenization. Sequences "
"longer than this will be truncated, sequences shorter will be padded.")
parser.add_argument("--batch_size", default=32, type=int,
help="Batch size per GPU/CPU for training.")
parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu",
help="Select which device to train model, defaults to gpu.")

parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False],
help='Enable to use tensorrt to speed up.')
parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"],
help='The tensorrt precision.')

parser.add_argument('--cpu_threads', default=10, type=int,
help='Number of threads to predict when using cpu.')
parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False],
@@ -131,10 +126,9 @@ def predict(self, data, tokenizer):
Returns:
results(obj:`dict`): All the predictions labels.
"""

        batchify_fn = lambda samples, fn=Tuple(
            Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"),  # input
-           Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"
+           Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"
            ),  # segment
        ): fn(samples)
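The change above fixes the padding of the segment (token type) ids: they are now padded with `pad_token_type_id` rather than the token `pad_token_id`.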

@@ -179,7 +173,7 @@ def read_text(file_path):
args.batch_size, args.use_tensorrt, args.precision,
args.cpu_threads, args.enable_mkldnn)

-    tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
+    tokenizer = AutoTokenizer.from_pretrained('ernie-1.0')
id2corpus = read_text(args.corpus_file)

corpus_list = [{idx: text} for idx, text in id2corpus.items()]