- Python 3.5
- TensorFlow 1.3
- Chinese word segmentation: jieba
- datasets: data
- data: preprocessing
- models: models
- weights: checkpoints
- myutils: API
- docs: documentation
- each example is a tuple:
(query_id, question, passage_id, passage, answer, start, end, docAnswerScore, docQuestionScore, overlaps)
- question: word ids
- passage: word ids
- answer: word ids
- start: start position
- end: end position
- docAnswerScore: 1 if the answer appears in the document, otherwise 0
- docQuestionScore: matching score
- overlaps: 1 if the document word appears in the question, otherwise 0
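
A minimal sketch of how the `start`, `end`, `docAnswerScore`, and `overlaps` fields could be derived from word-id sequences. The function names `find_span` and `build_features` are hypothetical, and the conventions used here (inclusive `end`, `-1` when the answer is absent, first occurrence wins) are assumptions rather than the project's confirmed behavior. `docQuestionScore` is left out because its matching function is not specified.

```python
def find_span(passage, answer):
    """Return (start, end) of the first occurrence of `answer` in `passage`.

    Positions are token indices and `end` is inclusive (an assumption).
    Returns (-1, -1) when the answer does not occur.
    """
    n, m = len(passage), len(answer)
    if m == 0:
        return -1, -1
    for i in range(n - m + 1):
        if passage[i:i + m] == answer:
            return i, i + m - 1
    return -1, -1


def build_features(question, passage, answer):
    """Build the span and indicator features for one (question, passage) pair."""
    start, end = find_span(passage, answer)
    question_words = set(question)
    return {
        "start": start,
        "end": end,
        # 1 if the answer appears in the document, otherwise 0
        "docAnswerScore": 1 if start >= 0 else 0,
        # per passage word: 1 if it also appears in the question, otherwise 0
        "overlaps": [1 if w in question_words else 0 for w in passage],
    }


# Toy word ids, for illustration only
features = build_features(question=[1, 2, 3], passage=[5, 1, 7, 8, 3], answer=[7, 8])
print(features["start"], features["end"], features["overlaps"])
```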
- Label
- predict: start, end
- sequence labeling: BIO
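
The two label formats carry the same information, so one can be converted into the other. The sketch below assumes a single answer span per passage with an inclusive `end`; `span_to_bio` and `bio_to_span` are hypothetical helper names, not part of the repository.

```python
def span_to_bio(length, start, end):
    """Turn an inclusive (start, end) span into a list of B/I/O tags."""
    tags = ["O"] * length
    if 0 <= start <= end < length:
        tags[start] = "B"
        for i in range(start + 1, end + 1):
            tags[i] = "I"
    return tags


def bio_to_span(tags):
    """Recover the first tagged span as (start, end), or (-1, -1) if none."""
    for i, tag in enumerate(tags):
        if tag == "B":
            j = i
            while j + 1 < len(tags) and tags[j + 1] == "I":
                j += 1
            return i, j
    return -1, -1


tags = span_to_bio(5, 2, 3)
print(tags)
print(bio_to_span(tags))
```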
q | a |
---|---|
task | Factoid Q&A subtask |
metric | Accuracy, F1 |
data size | 30,000 |
candidates | 10 documents per question |
answer num | 1 |
answer length | <20 |
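
The exact definitions of the two metrics are not given here. As an illustration only, the sketch below assumes accuracy means exact match and F1 means the overlap-based F1 commonly used for extractive QA, computed over characters for strings or over tokens for lists. The names `exact_match` and `answer_f1` are hypothetical.

```python
from collections import Counter


def exact_match(predicted, gold):
    """1.0 if the predicted answer equals the gold answer, else 0.0."""
    return 1.0 if predicted == gold else 0.0


def answer_f1(predicted, gold):
    """Overlap F1 between two sequences (characters of a str, or list items)."""
    if len(predicted) == 0 or len(gold) == 0:
        return 0.0
    # Multiset intersection counts each shared unit at most min(count) times
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(exact_match("塔里木盆地", "塔里木盆地"))
print(answer_f1("塔里木", "塔里木盆地"))
```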
{
"query_id": 10000,
"query": "中国最大的内陆盆地是哪个",
"passages": [
{"passage_id": 1, "url": "https://zhidao.baidu.com/question/713780769091877645", "passage_text": "中国新疆的塔里木盆地,是世界上最大的内陆盆地,东西长约1500公里,南北最宽处约600公里。盆地底部海拔1000米左右,面积53万平方公里。"},
{"passage_id": 2, "url": "http://www.doc88.com/p-093375971649.html", "passage_text": "中国最大的固定、半固定沙漠天山与昆仑山之间又有塔里木盆地,面积 53 万平方公里,是世界最大的内陆盆地。 盆地中部是塔克拉玛干大沙漠,面积 33.7 万平方公里,为世界第二大流动性沙漠。"},
……],
"answer": "塔里木盆地",
"type": "factoid"
}
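
A record of this shape can be flattened into one example per candidate passage. The sketch below is an assumption about how that step might look, not the repository's preprocessing code: `iter_examples` and the `has_answer` field are hypothetical, and the passage texts are shortened excerpts of the sample above.

```python
def iter_examples(record):
    """Yield one flat example per candidate passage of a raw record."""
    for passage in record["passages"]:
        yield {
            "query_id": record["query_id"],
            "question": record["query"],
            "passage_id": passage["passage_id"],
            "passage": passage["passage_text"],
            "answer": record["answer"],
            # Plain substring check on the raw text, before word segmentation
            "has_answer": record["answer"] in passage["passage_text"],
        }


# Shortened excerpts of the sample record, for illustration only
record = {
    "query_id": 10000,
    "query": "中国最大的内陆盆地是哪个",
    "passages": [
        {"passage_id": 1, "passage_text": "中国新疆的塔里木盆地,是世界上最大的内陆盆地"},
        {"passage_id": 2, "passage_text": "面积 53 万平方公里,是世界最大的内陆盆地。"},
    ],
    "answer": "塔里木盆地",
    "type": "factoid",
}

examples = list(iter_examples(record))
for example in examples:
    print(example["passage_id"], example["has_answer"])
```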
time | todo |
---|---|
2017/9/1 | Start |
2017/9/22 | Implement |
2017/10/10 | Data release |
2017/10/16 | Submit |
2017/10/23 | Optimize |
2017/11/1 | Finish |