7.16 - NLP Overview #5

Open
li-aolong opened this issue Jul 17, 2019 · 0 comments
Labels
NLP (Natural Language Processing)

Comments

@li-aolong
Owner

Processing Pipeline

  • Corpus preprocessing
    • Corpus cleaning: includes manual deduplication, alignment, deletion, annotation, rule-based content extraction, regex matching, entity extraction, etc.
    • Word segmentation: Chinese corpora require segmentation
    • Part-of-speech tagging: rule-based or statistics-based
    • Stop-word removal: not strictly required
  • Feature engineering
    • Bag-of-words model: counts word frequencies
    • Word vectors: convert text into vector/matrix form for computation; the main approaches are the skip-gram model (Skip-Gram) and the continuous bag-of-words model (CBOW). Tools include Word2Vec, Doc2Vec, WordRank, FastText, etc.
  • Feature selection
    • Text features are usually words, which carry semantic information
  • Model training
    • Supervised and unsupervised machine-learning models; deep-learning models, etc.
  • Evaluation metrics
    • Error rate, accuracy, precision, recall, F1 score
    • ROC curve, AUC
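The bag-of-words step above can be sketched in pure Python; the `bag_of_words` helper and the sample token lists below are illustrative, not from any particular library:

```python
from collections import Counter

def bag_of_words(docs):
    """Build a sorted vocabulary over tokenized documents and
    return one term-frequency vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok, cnt in Counter(doc).items():
            vec[index[tok]] = cnt
        vectors.append(vec)
    return vocab, vectors

# Each document is assumed to be already segmented into tokens
docs = [["I", "like", "NLP"], ["I", "like", "I"]]
vocab, vectors = bag_of_words(docs)
```

Each row of `vectors` counts how often each vocabulary word appears in the corresponding document, which is exactly the "count word frequencies" step described above.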
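The evaluation metrics listed above reduce to simple counting over predictions. A minimal sketch for binary classification (function names are my own, not from any library):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Error rate is simply `1 - accuracy`; ROC/AUC additionally require predicted scores rather than hard labels, so they are omitted here.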

jieba

Segmentation Algorithm

  1. Build a directed acyclic graph (DAG) of candidate words from a statistical dictionary
  2. Use dynamic programming over the DAG to find the maximum-probability path, and segment accordingly
  3. For unknown words, use an HMM model for segmentation
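Steps 1 and 2 can be sketched in pure Python. The toy dictionary and its frequencies below are made up for illustration; the real jieba uses its bundled frequency dictionary:

```python
import math

# Hypothetical word frequencies for illustration only
FREQ = {"去": 2, "北京": 10, "大学": 8, "北京大学": 6,
        "北": 1, "京": 1, "大": 1, "学": 1}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """Step 1: for each start index k, list end indices j such that
    sentence[k:j+1] is a dictionary word (single chars as fallback)."""
    dag = {}
    for k in range(len(sentence)):
        ends = [k]  # fall back to a single character
        for j in range(k, len(sentence)):
            if sentence[k:j + 1] in FREQ and j not in ends:
                ends.append(j)
        dag[k] = ends
    return dag

def segment(sentence):
    """Step 2: dynamic programming right-to-left over the DAG,
    maximizing the summed log probability of the chosen words."""
    dag = build_dag(sentence)
    n = len(sentence)
    route = {n: (0.0, 0)}  # index -> (best log prob from here, chosen end)
    for k in range(n - 1, -1, -1):
        route[k] = max(
            (math.log(FREQ.get(sentence[k:j + 1], 1) / TOTAL) + route[j + 1][0], j)
            for j in dag[k]
        )
    words, k = [], 0
    while k < n:
        j = route[k][1]
        words.append(sentence[k:j + 1])
        k = j + 1
    return words
```

With these frequencies, `segment("去北京大学")` keeps "北京大学" as one word because its single log probability beats the sum for "北京" + "大学". Step 3 (HMM decoding for unknown words) is omitted here.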

Related Functions

  • Segmentation
    • Accurate mode: jieba.cut(text, cut_all=False)
    • Full mode: jieba.cut(text, cut_all=True)
    • Search-engine mode: jieba.cut_for_search(text)
    • Return a list: jieba.lcut(text)
    • With POS tags: jieba.posseg.lcut(text)
    • Add custom words and dictionaries: jieba.add_word('text') and jieba.load_userdict('user_dict.txt')
li-aolong added the NLP (Natural Language Processing) label Jul 17, 2019