Commit 326218b (1 parent: 7b010bb). Showing 105 changed files with 27,530 additions and 0 deletions.
@@ -0,0 +1,96 @@
DuRecDial (Towards Conversational Recommendation over Multi-Type Dialogs)
=============================

We identify the task of **conversational recommendation over multi-type dialogs**. To facilitate the study of this task, we create a human-to-human **Rec**ommendation-oriented multi-type Chinese **Dial**og dataset at Bai**Du** (**DuRecDial**). In **DuRecDial**, every dialog session contains several dialog types with natural topic transitions. Moreover, there is rich interaction variability for recommendation. In addition, each seeker has an explicit profile for modeling personalized recommendation, and holds multiple dialogs with the recommender to mimic real-world application scenarios. **DuRecDial** contains **multi-type** dialogs (recommendation dialog, chitchat, task-oriented dialog, and QA), **10.2K** conversations from **7** domains (movie, star, music, news, food, POI, and weather), and **156K** utterances. An example of DuRecDial:

![example](images/Figure1.png)

Our paper (Towards Conversational Recommendation over Multi-Type Dialogs) is available on [arXiv](https://arxiv.org/abs/2005.03954) and [ACL Anthology](https://www.aclweb.org/anthology/2020.acl-main.98/). A Chinese introduction and news article for this paper is available [here](https://mp.weixin.qq.com/s/f3dCOc4Mog9eZTl0k5YQew).

If the corpus is helpful to your research, please kindly cite our paper:

```
@inproceedings{Liu2020TowardsCR,
title={Towards Conversational Recommendation over Multi-Type Dialogs},
author={Z. Liu and H. Wang and Zheng-Yu Niu and Hua Wu and W. Che and Ting Liu},
booktitle={ACL},
year={2020}
}
```

# Dataset
The dataset is available at https://baidu-nlp.bj.bcebos.com/DuRecDial.zip. Each conversation looks like the following:
```python
{"kg":
[["沈阳", "2018-12-24", "晴, 西南风, 最高气温:2℃, 最低气温:-12℃"],
["糖醋排骨", "成分", "猪肋排、姜片、葱、生抽、糖、醋、料酒、八角。"],
["糖醋排骨", "类型", "热菜"],
["晴, 西南风, 最高气温:2℃, 最低气温:-12℃", "适合吃", "糖醋排骨"],
["大清花饺子(十一纬路店)", "特色菜", "糖醋排骨"],
["大清花饺子(十一纬路店)", "评分", "4.8"],
["大清花饺子(十一纬路店)", "人均价格", "50"],
["大清花饺子(十一纬路店)", "地址", "沈河区十一纬路198号(近南二经街)"],
["大清花饺子(十一纬路店)", "订单量", "1405"]],
"user_profile":
{"职业状态": "工作", "同意的新闻": " 何炅 的新闻", "没有接受的音乐": [" 还有我", "心火烧"], "喜欢的音乐": " 另一个自己", "年龄区间": "大于50", "拒绝": " 电影", "喜欢的明星": " 何炅", "接受的音乐": [" 向前奔跑", "思念的距离", "我是大侦探", "希望爱", "现在爱", "再见", "一路走过"], "居住地": "沈阳", "喜欢的poi": " 大清花饺子(十一纬路店)", "姓名": "陈轩奇", "同意的美食": " 糖醋排骨", "性别": "男"},
"conversation":
["[1]今天是什么天气?",
"今天沈阳: 晴, 西南风, 最高气温:2℃, 最低气温:-12℃,天气有点冷,注意保暖。",
"你知道的真多。",
"[2]这种天气温适合吃 『糖醋排骨』了呢。",
"糖醋排骨可是我最喜欢的美食,真想现在就去吃糖醋排骨呢。",
"[3]我正好知道有一家店,推荐您在 『大清花饺子(十一纬路店)』 订糖醋排骨。",
"这家店的地址在哪里?",
"这家店的地址:沈河区十一纬路198号(近南二经街)",
"人均价格是多少?",
"人均价格50元。",
"评分是多少?",
"评分是4.8",
"今天中午12点半我一个人去吃,我预定一下。",
"好的,这就为您预定。",
"[4]先去准备一下,再见",
"好的,再见,祝你生活愉快!"],
"goals":
"[1]问天气(User主动,User问天气,根据给定知识,Bot回复完整的天气信息,User满足并好评)-->[2]美食推荐(Bot主动推荐,这种天气温适合吃 『糖醋排骨』, User接受。需要聊2轮)-->[3]poi推荐(Bot主动,Bot推荐在 『大清花饺子(十一纬路店)』 订 『糖醋排骨』, User问 『大清花饺子(十一纬路店)』 的『人均价格』、『地址』、『评分』,Bot逐一回答后,最终User接受并提供预订信息:『就餐时间』 和 『就餐人数』)-->[4]再见",
"situation":
"聊天时间:2018-12-24 中午12:00,在公司 星期一"
}
```

- `kg` provides all background knowledge related to the dialog as subject-predicate-object (SPO) triples.

- `user_profile` includes the user's personal information, domain preferences, and entity preferences.

- `goals` contains the topic transition path of the dialog session.

- `situation` includes the time, place, and topic of the dialog.

- `conversation` is a list of all the turns in the dialog.

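As a minimal sketch of how these fields might be consumed, the Python 3 snippet below parses one shortened record. It assumes each record is a JSON object shaped like the example above. The helper names `parse_goals` and `group_by_goal` are hypothetical, and reading the `[n]` prefix as "goal n starts at this utterance" is inferred from the example rather than documented.

```python
import json
import re

# A shortened record in the same shape as the example above (illustrative only).
RECORD = json.loads("""
{"kg": [["糖醋排骨", "类型", "热菜"]],
 "conversation": ["[1]今天是什么天气?", "今天沈阳: 晴", "[2]这种天气适合吃糖醋排骨。", "好的。"],
 "goals": "[1]问天气(User主动)-->[2]美食推荐(Bot主动推荐)"}
""")

# Matches a leading goal marker such as "[3]".
GOAL_MARK = re.compile(r"^\[(\d+)\]")


def parse_goals(goals):
    """Split the goal path on '-->' and map each goal number to its description."""
    parsed = {}
    for part in goals.split("-->"):
        part = part.strip()
        match = GOAL_MARK.match(part)
        if match:
            parsed[int(match.group(1))] = part[match.end():]
    return parsed


def group_by_goal(conversation):
    """Attach every utterance to the most recent '[n]' goal marker (assumed meaning)."""
    grouped, current = {}, None
    for utterance in conversation:
        match = GOAL_MARK.match(utterance)
        if match:
            current = int(match.group(1))
            utterance = utterance[match.end():]
        grouped.setdefault(current, []).append(utterance)
    return grouped


goals = parse_goals(RECORD["goals"])
turns = group_by_goal(RECORD["conversation"])
for subject, predicate, obj in RECORD["kg"]:
    print(subject, predicate, obj)
```

Grouping utterances by goal number makes it straightforward to align each stretch of the dialog with the goal type and topic it is supposed to serve.
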
# Model
Due to internal process issues, the model cannot be open-sourced for the time being. As an alternative, you can use the baseline model of the [LIC 2020 competition](https://github.com/PaddlePaddle/Research/tree/master/NLP/Conversational-Recommendation-BASELINE).

# Competitions
We hold ___competitions___ to encourage more researchers to work in this direction.

* [Conversational Recommendation Task](https://aistudio.baidu.com/aistudio/competition/detail/29) in the [2020 Language and Intelligence Challenge](http://lic2020.cipsc.org.cn/).

* [LUGE: Chit-chat Task](https://aistudio.baidu.com/aistudio/competition/detail/48/) in [LUGE (Language Understanding and Generation Evaluation Benchmarks)](https://www.luge.ai/).

* [AISTUDIO LUGE: Multi-skill Dialogue Task](https://aistudio.baidu.com/aistudio/competition/detail/55) and [DF LUGE: Multi-skill Dialogue Task](https://www.datafountain.cn/competitions/470).

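@@ -0,0 +1,192 @@
Conversational Recommendation
===

# Introduction

## Task Overview
Conversational recommendation refers to a human-machine interaction system that integrates a dialog system with a recommender system. The system first collects the user's interests and preferences through question answering or chitchat, and then proactively recommends content the user is likely to be interested in, such as restaurants, food, movies, and news.

Real-world human-machine interaction involves several types of dialog at once, such as question answering, chitchat, and task-oriented dialog. The industry currently tends to study these dialog types separately, which does not reflect real human-machine interaction. Fusing multiple dialog types naturally is an important challenge. To address it, we propose a new task, conversational recommendation over multi-type dialogs, in which the system is expected to proactively and naturally lead the conversation from a non-recommendation dialog (for example, QA) to a recommendation dialog, and then reach the final recommendation goal over multiple interactions, based on the collected user interests and the user's real-time feedback.

## Task Definition
Given all the background knowledge related to the dialog M=f<sub>1</sub>,f<sub>2</sub>,…,f<sub>n</sub> (n is the number of knowledge items), the user profile P, the dialog situation S, the first dialog goal g<sub>1</sub>, the last two dialog goals g<sub>L-1</sub> and g<sub>L</sub>, and the length L of the dialog goal sequence (the number of dialog goals, L≥3), the participating system must first predict the other goals in the goal sequence. It must then output the machine response u<sub>t</sub> (the participating model only needs to play the machine role) that fits the current dialog history H=u<sub>1</sub>,u<sub>2</sub>,…,u<sub>t-1</sub> (1<t≤m, where m is the number of utterances in the dialog) and the current dialog goal sequence G=g<sub>1</sub>,g<sub>2</sub>,g<sub>3</sub>…g<sub>q-1</sub> (1<q≤L, where L is the length of the goal sequence), while keeping the dialog natural, fluent, and informative.

Input/Output:

Input: the first dialog goal g<sub>1</sub>, the second-to-last dialog goal g<sub>L-1</sub>, the knowledge M, the user profile P, the dialog situation S, the length L of the dialog goal sequence, and the dialog history H

Output: the other goals in the goal sequence, g<sub>2</sub>, g<sub>3</sub> … g<sub>L-2</sub>, together with a machine response u<sub>t</sub> that fits both the dialog history and the dialog goal sequence and is natural, fluent, and informative
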
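To make the input and output of the task definition above more concrete, here is a small Python 3 sketch. The `TaskInput` class and the `goals_to_predict` function are hypothetical names introduced for illustration, not part of the baseline code. The function only encodes the index range g<sub>2</sub> … g<sub>L-2</sub> stated in the output description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TaskInput:
    """Hypothetical container mirroring the inputs listed in the task definition."""
    knowledge: List[Tuple[str, str, str]]  # M: SPO triples f_1..f_n
    profile: Dict[str, object]             # P: user profile
    situation: str                         # S: dialog situation
    first_goal: str                        # g_1
    second_to_last_goal: str               # g_{L-1}
    num_goals: int                         # L, with L >= 3
    history: List[str]                     # H: u_1..u_{t-1}


def goals_to_predict(num_goals: int) -> List[int]:
    """Return the 1-based indices g_2..g_{L-2} that the system has to predict."""
    if num_goals < 3:
        raise ValueError("the task definition requires L >= 3")
    return list(range(2, num_goals - 1))


print(goals_to_predict(5))
```

For a sequence of five goals the system would predict the second and third goals. For the minimum length of three, the range is empty and only the response has to be produced.
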
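## Dataset
The data includes the user profile, dialog-related knowledge, the dialog goal sequence, the dialog situation, and the dialog content. The user profile includes the user's personal information, domain preferences, and entity preferences. The knowledge comes from chat-worthy information in the star, movie, music, news, food, POI, and weather domains, such as personal information, representative works, achievements, and reviews for stars, or box office, leading actors, directors, and reviews for movies. It is organized as SPO triples. A dialog goal sequence contains 3 to 5 dialog goals, and each goal has two parts: a dialog type and a dialog topic. The dialog types are QA, recommendation-oriented dialog, task-oriented dialog, and chitchat. Dialog topics are entities from domains such as stars, movies, and music, or chat-worthy knowledge such as news. The dialog situation includes the time, place, and topic of the chat. The training set contains about 100,000 dialog turns, the development set about 15,000 dialog turns, the first test batch about 5,000 samples, and the second test batch about 20,000 samples. Each dialog has 7 to 8 turns on average.

See the [competition website](https://aistudio.baidu.com/aistudio/competition/detail/29) for data samples and their description.

## Baseline System (the generative model is recommended for the competition)
We provide both a retrieval model and a generative model. All models are implemented on Baidu's deep learning framework [PaddlePaddle](http://paddlepaddle.org/). The baseline system has three main components:

1. `goal_planning`: goal planning. Plans the dialog goals from the dialog history, knowledge base, user profile, and dialog goal history.
2. `retrieval_model`: retrieval model. Given the dialog goal planned in step 1, retrieves the response using the dialog history, knowledge base, user profile, and dialog goal history.
3. `generative_model`: generative model. Given the dialog goal planned in step 1, generates the response using the dialog history, knowledge base, user profile, and dialog goal history.

The results of the two models are as follows:

| Baseline system | F1/BLEU2 | DISTINCT2 |
| ------------- | ------------ | ------------ |
| Retrieval model | 34.73/0.230 | 0.189 |
| Generative model | 38.17/0.221 | 0.056 |
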
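For readers unfamiliar with the DISTINCT2 column, the Python 3 sketch below computes a corpus-level distinct-2 score as the number of unique bigrams divided by the total number of bigrams across all responses. This is the commonly used definition and is an assumption here: the competition's official evaluation script may tokenize or aggregate differently. The function name `distinct_n` is hypothetical.

```python
from typing import Iterable, List


def distinct_n(responses: Iterable[List[str]], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams over all tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    # An empty corpus has no n-grams; report 0.0 instead of dividing by zero.
    return len(unique) / total if total else 0.0


# Two tokenized responses that share the bigram ("好", "的").
score = distinct_n([["好", "的", "再见"], ["好", "的", "谢谢"]])
print(score)
```

A lower score means the responses reuse the same word pairs more often, which is one way to read the gap between the retrieval and generative rows of the table.
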
# Quick Start

## Installation
### Requirements
The baseline system has been tested in the following environment:

* OS: CentOS 6.3, CUDA 9.0, cuDNN 7.0
* Python 2.7
* PaddlePaddle 1.6.1

### Installing the code
Clone the toolkit repository to your local machine:

```shell
git clone https://github.com/PaddlePaddle/Research.git
cd Research/NLP/Conversational-Recommendation-BASELINE/
```

### Installing third-party dependencies
```
conda create -n Dialog pip python=2.7
source activate Dialog
pip install -r requirements.txt
```

## Running
### Downloading the dataset
Download the dataset by following the instructions on the [competition website](https://aistudio.baidu.com/aistudio/competition/detail/29).

### goal_planning training and testing

#### Preprocessing the data
Download the data as described on the competition website, place it under `goal_planning/origin_data/resource/`, and then generate the data needed for model training.

```
cd goal_planning/model
python3 process_data_for_goal_planning.py
cd ../data_generator
python3 data_generator.py
python3 train_generator.py
```

#### Training:
```
cd goal_planning/model
python paddle_binary_lstm.py   # judge whether the current goal is complete
python paddle_astar_goal.py    # if the current goal is complete, predict the type of the next goal
python paddle_astar_kg.py      # if the current goal is complete, predict the topic of the next goal
```

#### Testing:

```
cd goal_planning/model
python goal_planning.py        # full goal planning
```

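The three training scripts suggest a two-stage decision at each turn: first judge whether the current goal is complete, and only then predict the type and topic of the next goal. The Python 3 sketch below illustrates that control flow with plain callables standing in for the trained models. `Goal`, `plan_next_goal`, and the three predictor arguments are hypothetical names, not the baseline's API, and conditioning the topic on the predicted type is an assumption.

```python
from typing import Callable, List, NamedTuple


class Goal(NamedTuple):
    goal_type: str
    topic: str


def plan_next_goal(
    history: List[str],
    current: Goal,
    is_finished: Callable[[List[str], Goal], bool],
    next_type: Callable[[List[str], Goal], str],
    next_topic: Callable[[List[str], Goal, str], str],
) -> Goal:
    """Keep the current goal until it is judged complete, then predict the next one."""
    if not is_finished(history, current):
        return current
    goal_type = next_type(history, current)
    return Goal(goal_type, next_topic(history, current, goal_type))


# Toy stand-ins for the three models, for illustration only.
def finished_after_two(history, goal):
    return len(history) >= 2


def always_food(history, goal):
    return "美食推荐"


def always_ribs(history, goal, goal_type):
    return "糖醋排骨"


weather = Goal("问天气", "沈阳")
unchanged = plan_next_goal(["今天是什么天气?"], weather,
                           finished_after_two, always_food, always_ribs)
advanced = plan_next_goal(["今天是什么天气?", "晴"], weather,
                          finished_after_two, always_food, always_ribs)
print(unchanged, advanced)
```

Passing the predictors in as callables keeps the control flow testable without loading any trained model.
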
### retrieval_model training and testing

#### Preprocessing the data

Download the data as described on the competition website, place it under `data/resource/`, and convert it to the same format as `data/resource/train/dev/test.txt`:

```
./data/resource/train.txt
./data/resource/dev.txt
./data/resource/test.txt
```

#### Training the model

```bash
cd retrieval_model
sh run_train.sh match_kn_gene
```

#### Testing the model

```bash
cd retrieval_model
sh run_test.sh match_kn_gene
```

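### generative_model training and testing

#### Preprocessing the data
We have already applied some simple preprocessing, so pulling the code is enough to produce results. To improve the results further, predict better goals with `goal planning` and use them to replace the `goal` attribute in the dataset. Our experiments show that adding the goal improves results noticeably.

Note: `generative_model/data/sgns.weibo.300d.txt` is the embedding file required by the generative model. Because the original file is large, only 100 lines are included so that participants can see the file format. Participants may train better embeddings themselves. `data/resource/train/dev/test.txt` contains only a small amount of data and must be replaced with the full data.

#### Training the model

```bash
cd generative_model
sh run_train.sh
```

#### Testing the model

```bash
cd generative_model
sh run_test.sh
```
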
# Directory Structure

```text
.
├── requirements.txt                    # third-party dependencies
├── README.md                           # this document
└── conversational-recommendation       # source code
    ├── generative_model                # generative model
    │   ├── data                        # data
    │   ├── models                      # default model save path
    │   ├── network.py                  # model configuration, training, and testing
    │   ├── output                      # default output path
    │   ├── run_test.sh                 # test script
    │   ├── run_train.sh                # training script
    │   ├── source                      # model implementation
    │   └── tools                       # tools
    ├── goal_planning                   # dialog goal planning
    │   ├── logs                        # saved logs
    │   ├── data_generater              # generates the data needed for training
    │   ├── process_data                # processed data
    │   ├── model                       # models
    │   ├── model_state                 # default model save path
    │   ├── train_data                  # data converted to the format the model needs
    │   └── origin_data                 # original data
    └── retrieval_model                 # retrieval model
        ├── args.py                     # argument configuration
        ├── data                        # data
        ├── dict                        # dict
        ├── interact.py                 # human evaluation
        ├── models                      # default model save path
        ├── output                      # default output path
        ├── predict.py                  # model testing
        ├── run_predict.sh              # test script
        ├── run_train.sh                # training script
        ├── source                      # model implementation
        ├── tools                       # tools
        └── train.py                    # model training
```

# Other
## How to contribute

We welcome developers to contribute code to the baseline system. If you have developed a new feature or found a bug, you are welcome to submit a pull request or an issue on GitHub.
Binary file added (+2.37 MB, not shown): ...rsational-Recommendation-BASELINE/conversational_recommendation/generative_model/data.zip