Chinese NLP: Determining Whether a Sentence Is a Question

For a recent project I needed to determine whether a sentence is a question, so I built a small model using machine-learning methods.

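Data sources

The datasets are stored in the data directory. After processing they are saved as data/all_data.txt, with 74652 samples in total.

1. Chinese reading-comprehension dataset c3_public: questions and non-questions are extracted

2. Chinese reading-comprehension dataset cmrc2018_public: questions are extracted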

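Model selection

The text is first segmented into words, and TF-IDF is then used to extract features. Because most text seen in practice may carry no punctuation, the trailing punctuation of each sample is dropped with probability p.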
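The punctuation-dropping step can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `drop_trailing_punct`, the `TRAILING_PUNCT` character set, and the sample sentences are assumptions made for the example.

```python
import random

# Assumed set of sentence-final punctuation marks (full-width and ASCII);
# the repository may use a different set.
TRAILING_PUNCT = "？?！!。.，,；;…~"


def drop_trailing_punct(text: str, p: float, rng: random.Random) -> str:
    """With probability p, strip all punctuation from the end of text."""
    # rng.random() lies in [0, 1), so p = 0.0 never strips and p = 1.0 always does.
    if rng.random() < p:
        return text.rstrip(TRAILING_PUNCT)
    return text


rng = random.Random(2022)
samples = ["你多大了？", "我没有钱。", "你叫什么名字"]
for p in (0.0, 0.5, 1.0):
    print(p, [drop_trailing_punct(s, p, rng) for s in samples])
```

With p = 0.0 every sample keeps its punctuation, and with p = 1.0 every sample loses it, which corresponds to the two extremes reported in the results.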

1. LinearSVC, default parameters

2. AdaBoost, default parameters

3. DecisionTree, default parameters

4. Ensemble, with both voting='soft' and voting='hard'
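The difference between the two voting modes can be illustrated with a small pure-Python sketch. The helper names `hard_vote` and `soft_vote` and the three classifier outputs are hypothetical; the repository's Ensemble is not reproduced here.

```python
from collections import Counter


def hard_vote(labels):
    """Majority vote over predicted labels (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probas):
    """Average the per-class probabilities, then pick the most probable class."""
    n_classes = len(probas[0])
    avg = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)


# Hypothetical outputs of three classifiers for one sentence (class 1 = question).
labels = [1, 0, 0]
probas = [[0.05, 0.95], [0.55, 0.45], [0.60, 0.40]]

print(hard_vote(labels))  # 0: two of three classifiers vote "not a question"
print(soft_vote(probas))  # 1: the averaged probability of class 1 is 0.6
```

The two modes can disagree when one confident model outweighs two uncertain ones. Soft voting also requires every base model to output class probabilities, which a margin-based classifier such as LinearSVC does not provide directly.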

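Model results

The data is split into training and validation sets at a ratio of roughly 3:1, with the seed set to 2022. The reported numbers are F1 scores on the validation set.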
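A minimal sketch of this evaluation setup is shown below. It assumes a shuffled 3:1 hold-out split and a binary F1 computed for the question class. The helper names are hypothetical, and the exact split function and F1 averaging used by the repository are not stated in the source.

```python
import random


def split_train_val(samples, val_ratio=0.25, seed=2022):
    """Shuffle with a fixed seed and hold out val_ratio of the data (3:1 -> 0.25)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


train, val = split_train_val(range(74652))
print(len(train), len(val))  # 55989 18663

# Toy labels: 2 true positives, 1 false positive, 1 false negative.
print(round(f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]), 4))  # 0.6667
```

Fixing the seed keeps the validation set identical across runs, so the F1 values for different models and max_features settings are comparable.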

p = 0.0

max_features (TF-IDF parameter)	LinearSVC	AdaBoost	DecisionTree	Ensemble (soft)	Ensemble (hard)
10	0.6567	0.7125	0.7105	-	0.7130
100	0.8378	0.8449	0.8446	-	0.8513
500	0.9126	0.9027	0.8971	-	0.9144
1000	0.9246	0.9103	0.9075	-	0.9257
> 1000	0.9435	0.9070	0.9287	-	0.9408

p = 0.3

max_features (TF-IDF parameter)	LinearSVC	AdaBoost	DecisionTree	Ensemble (soft)	Ensemble (hard)
10	0.6567	0.7125	0.7105	-	0.7130
100	0.8378	0.8448	0.8452	-	0.8512
500	0.9105	0.8994	0.8964	-	0.9121
1000	0.9190	0.9040	0.9018	-	0.9190
> 1000	0.9345	0.8924	0.9123	-	0.9291

p = 0.5

max_features (TF-IDF parameter)	LinearSVC	AdaBoost	DecisionTree	Ensemble (soft)	Ensemble (hard)
10	0.6422	0.7055	0.7024	-	0.7052
100	0.8094	0.8077	0.8035	-	0.8156
500	0.8725	0.8452	0.8501	-	0.8696
1000	0.8759	0.8336	0.8414	-	0.8724
> 1000	0.8908	0.8151	0.8383	-	0.8741

p = 1.0

max_features (TF-IDF parameter)	LinearSVC	AdaBoost	DecisionTree	Ensemble (soft)	Ensemble (hard)
10	0.5969	0.6597	0.6607	-	0.6609
100	0.7760	0.7523	0.7694	-	0.7817
500	0.8188	0.7575	0.7925	-	0.8166
1000	0.8223	0.7186	0.7854	-	0.8161
> 1000	0.8372	0.7479	0.7626	-	0.8188

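Conclusions

1. Punctuation is an important feature. Taking LinearSVC with max_features=500 as an example, F1 drops markedly as p increases: 0.9126, 0.9105, 0.8725 and 0.8188 for p = 0.0, 0.3, 0.5 and 1.0.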

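2. When max_features <= 500, the Ensemble is generally the best model, and at small max_features it is clearly better than LinearSVC (for example 0.7130 vs 0.6567 at max_features=10, p = 0.0). The gap narrows by max_features=500, where LinearSVC is slightly ahead for p = 0.5 and p = 1.0.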

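3. When max_features > 500, LinearSVC is generally the best model, with the Ensemble close behind (at max_features=1000 the Ensemble is marginally ahead for p = 0.0 and tied for p = 0.3).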

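4. Taken together, 2 and 3 show that LinearSVC is more sensitive to max_features than the Ensemble. Taking p = 0.3 as an example, plotting the F1 of LinearSVC and of the Ensemble against max_features makes this clear.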
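The sensitivity claim in point 4 can be checked directly from the p = 0.3 table. The short sketch below copies the LinearSVC and Ensemble (hard) columns from that table; the helper name `f1_range` is introduced only for this illustration.

```python
# F1 values copied from the p = 0.3 results table.
max_features = ["10", "100", "500", "1000", ">1000"]
linear_svc = [0.6567, 0.8378, 0.9105, 0.9190, 0.9345]
ensemble_hard = [0.7130, 0.8512, 0.9121, 0.9190, 0.9291]


def f1_range(scores):
    """Spread between the best and worst F1 across max_features settings."""
    return max(scores) - min(scores)


for mf, svc, ens in zip(max_features, linear_svc, ensemble_hard):
    # Last column: Ensemble minus LinearSVC (positive means the Ensemble is ahead).
    print(f"{mf:>6}  {svc:.4f}  {ens:.4f}  {ens - svc:+.4f}")

print(f"LinearSVC range: {f1_range(linear_svc):.4f}")     # 0.2778
print(f"Ensemble range:  {f1_range(ensemble_hard):.4f}")  # 0.2161
```

LinearSVC spans a wider F1 range across the max_features settings than the Ensemble does, mostly because it starts lower at max_features=10.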

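2022.05.06: Added FastText. This Git repository runs the program on the original dataset (all_data.txt); the FastText code follows a referenced Git repository.

FastText was trained and tested with p = 0.5. Its F1 on FastText/data/text.txt is 0.9624, clearly higher than the machine-learning models above.

However, the model does not generalize well.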

Run FastText

cd FastText
python run.py --model FastText --embedding random

Test FastText

cd FastText
python evalute.py

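After running, the examples below can be tested by hand. When trained only on the dataset provided in this Git repository, the model does not classify all of them correctly. After the dataset is augmented and FastText is retrained, the model classifies all of them correctly. To improve generalization, the dataset therefore needs to be expanded beyond what this repository provides.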

"明天明天咋样",
"明天早起不",
"我没有钱",
"我没有钱，你有？",
"你叫什么名字",
"你多大了？",
"你几岁了",
"你平时喜欢干什么？",

TODO

Augment the question sentences with RoFormer V2
Add hard samples that can only be judged with context
Add deep-learning models

cingtiye/Chinese_question_sentence_judgment

Folders and files

Latest commit

History

Repository files navigation

中文NLP判断一个句子是不是疑问句

数据来源

模型选择

模型结果

结论

TODO

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages