diff --git "a/docs/\347\254\254\345\215\201\347\253\240/NLP\345\237\272\347\241\200.ipynb" "b/docs/\347\254\254\345\215\201\347\253\240/NLP\345\237\272\347\241\200.ipynb"
new file mode 100644
index 000000000..498882d2c
--- /dev/null
+++ "b/docs/\347\254\254\345\215\201\347\253\240/NLP\345\237\272\347\241\200.ipynb"
@@ -0,0 +1,720 @@
+{
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "119ec186",
+ "metadata": {},
+ "source": [
+ "# Word Embeddings (Concepts)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "f8e5639e",
+ "metadata": {},
+ "source": [
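+ "### Before looking at what word embeddings are, consider how a computer recognizes human input.  \n",
+ "A computer parses its input into binary codes of 0s and 1s, turning human language into a machine representation that it can process.  \n",
+ "We first introduce **one-hot encoding**: for a given dimension, a vector has exactly one entry equal to 1 and every other entry equal to 0, for example the 5-dimensional vector [0,0,0,0,1].  \n",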
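+ "For example, when we learn Chinese in kindergarten or primary school, we start by learning characters and words, which are then stored somewhere in our brains.  \n",
+ "\n",
+ "Suppose a child has just learned four words --> [我 (I)] [特别 (really)] [喜欢 (like)] [学习 (studying)]  \n",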
+ "\n",
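+ "The computer can then set up a one-hot encoding of dimension 4 for the child's vocabulary.  \n",
+ "For Chinese, we first segment the text into words: 我 / 特别 / 喜欢 / 学习  \n",
+ "We can then assign 我->[1 0 0 0]  特别->[0 1 0 0]  喜欢->[0 0 1 0]  学习->[0 0 0 1]  \n",
+ "Given the sentence 我喜欢学习 (I like studying), the computer represents it by summing the one-hot vectors of its words ->[1 0 1 1]  \n",
+ "This raises a few issues:  \n",
+ "1. As the child's vocabulary grows to thousands or tens of thousands of words, vectors built this way become very high-dimensional and sparse.  \n",
+ "2. Chinese has near-synonyms such as 能, 会 and 可以 (all roughly meaning can). Under one-hot encoding their vectors are mutually orthogonal, so the encoding carries no notion of similarity between words and cannot relate them.  \n",
+ "We therefore consider one-hot encoding a poor way to embed words.  \n",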
+ "\n",
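+ "A short worked equation may make the orthogonality point concrete. This is only a sketch: the symbol $e_i$ (the one-hot vector of the $i$-th vocabulary word) is notation introduced here, not part of the original text.\n",
+ "\n",
+ "$$e_i^\\top e_i = 1, \\qquad e_i^\\top e_j = 0 \\;\\; (i \\neq j), \\qquad \\cos(e_i, e_j) = \\frac{e_i^\\top e_j}{\\lVert e_i \\rVert\\,\\lVert e_j \\rVert} = 0 \\;\\; (i \\neq j)$$\n",
+ "\n",
+ "For instance, [1 0 0 0] and [0 1 0 0] have dot product 0, so any two distinct words, synonyms or not, have cosine similarity 0 under one-hot encoding.  \n",
+ "\n",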
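+ "Next we introduce the **dense representation**.  \n",
+ "Like a one-hot encoding, a dense representation is a fixed-length vector, but its entries are real numbers rather than 0s and a single 1, e.g. [0.45,0.65,0.14,1.15,0.97]  "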
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "4db86da3",
+ "metadata": {},
+ "source": [
+ "# Bag of Words Representation"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "44dc9252",
+ "metadata": {},
+ "source": [
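+ "As the name suggests, a bag-of-words representation puts our vocabulary into a bag, and when we want to express something we take words out of it. One way to build the vocabulary for such a bag is shown below."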
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "823f8f2d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "corpus = [\"i like reading\", \"i love drinking\", \"i hate playing\", \"i do nlp\"]  # our corpus\n",
+ "word_list = ' '.join(corpus).split()\n",
+ "word_list = list(sorted(set(word_list)))\n",
+ "word_dict = {w: i for i, w in enumerate(word_list)}\n",
+ "number_dict = {i: w for i, w in enumerate(word_list)}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "8eaeb37d",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'do': 0,\n",
+ " 'drinking': 1,\n",
+ " 'hate': 2,\n",
+ " 'i': 3,\n",
+ " 'like': 4,\n",
+ " 'love': 5,\n",
+ " 'nlp': 6,\n",
+ " 'playing': 7,\n",
+ " 'reading': 8}"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "word_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "2bf380c8",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{0: 'do',\n",
+ " 1: 'drinking',\n",
+ " 2: 'hate',\n",
+ " 3: 'i',\n",
+ " 4: 'like',\n",
+ " 5: 'love',\n",
+ " 6: 'nlp',\n",
+ " 7: 'playing',\n",
+ " 8: 'reading'}"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "number_dict"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "90e0ef43",
+ "metadata": {},
+ "source": [
+ "Based on the mappings above, we can build a one-hot encoding of dimension 9, as follows (besides `np.eye`, this can also be done with scikit-learn)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "9821ed2a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "voc_size = len(word_dict)\n",
+ "bow = []\n",
+ "for name in word_dict:  # one one-hot row per vocabulary word, in index order\n",
+ "    bow.append(np.eye(voc_size)[word_dict[name]])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "03f1f12f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[array([1., 0., 0., 0., 0., 0., 0., 0., 0.]),\n",
+ " array([0., 1., 0., 0., 0., 0., 0., 0., 0.]),\n",
+ " array([0., 0., 1., 0., 0., 0., 0., 0., 0.]),\n",
+ " array([0., 0., 0., 1., 0., 0., 0., 0., 0.]),\n",
+ " array([0., 0., 0., 0., 1., 0., 0., 0., 0.]),\n",
+ " array([0., 0., 0., 0., 0., 1., 0., 0., 0.]),\n",
+ " array([0., 0., 0., 0., 0., 0., 1., 0., 0.]),\n",
+ " array([0., 0., 0., 0., 0., 0., 0., 1., 0.]),\n",
+ " array([0., 0., 0., 0., 0., 0., 0., 0., 1.])]"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "bow"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "086a5fd2",
+ "metadata": {},
+ "source": [
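+ "# N-gram: A Statistical Language Model\n",
+ "An N-gram model is a natural language processing model that uses the correlation between neighbouring words to predict the next word. It models runs of n consecutive words in a text, so the next word is predicted from the n-1 words that precede it. For example, given the consecutive words “the cat sat on”, an N-gram model might predict that the text continues with “the mat” or “a hat”.  \n",
+ "\n",
+ "The accuracy of an N-gram model depends on the quality and quantity of its training text. If the training text contains many grammatical and spelling errors, the model's predictions may be inaccurate. If there is too little training text, the model may fail to capture the complexity of the language.  \n",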
+ "\n",
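+ "The idea above can be written out as a short sketch. The notation $w_t$, $n$ and $C(\\cdot)$ (a count over the training corpus) is introduced here for illustration and is not from the original text. Assuming each word depends only on the $n-1$ words before it, the probability of a sentence $w_1, \\dots, w_T$ is approximated as\n",
+ "\n",
+ "$$P(w_1, \\dots, w_T) \\approx \\prod_{t=1}^{T} P(w_t \\mid w_{t-n+1}, \\dots, w_{t-1})$$\n",
+ "\n",
+ "and each factor is commonly estimated from counts by maximum likelihood:\n",
+ "\n",
+ "$$P(w_t \\mid w_{t-n+1}, \\dots, w_{t-1}) = \\frac{C(w_{t-n+1}, \\dots, w_t)}{C(w_{t-n+1}, \\dots, w_{t-1})}$$\n",
+ "\n",
+ "For a bigram model ($n = 2$) this reduces to $P(w_t \\mid w_{t-1}) = C(w_{t-1}, w_t) / C(w_{t-1})$. With the four-sentence corpus used earlier in this notebook, $C(\\text{i}) = 4$ and $C(\\text{i}, \\text{like}) = 1$, so $P(\\text{like} \\mid \\text{i}) = 1/4$.  \n",
+ "\n",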
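+ "**Advantages of N-gram models:**\n",
+ "\n",
+ "Simple to use: the concept is straightforward and the model is easy to implement.  \n",
+ "Captures local correlations: because it predicts the next word from the consecutive words before it, it captures dependencies between neighbouring words.  \n",
+ "Can be trained on existing corpora: large existing resources, such as Google's N-gram dataset, can be used for training, which can substantially improve accuracy.  \n",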
+ "\n",
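+ "**Disadvantages of N-gram models:**\n",
+ "\n",
+ "Unsuitable for small text datasets: an N-gram model needs a large amount of text for training, so it may not reach high accuracy on small datasets.  \n",
+ "Sensitive to noise and errors: because the model is trained directly on a corpus, many grammatical or spelling errors in that corpus can make its predictions inaccurate.  \n",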
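+ "Cannot capture relationships beyond its window: an N-gram model conditions only on a fixed number of preceding words, whereas language contains complex, longer-range relationships that such a model cannot capture.  "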
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "1f5ad65b",
+ "metadata": {},
+ "source": [
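+ "# NNLM: Feed-Forward Neural Network Language Model\n",
+ "Below, a feed-forward neural network model is used to demonstrate the use of a **sliding window**."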
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "7bddfa77",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Import the required libraries\n",
+ "import numpy as np\n",
+ "import torch\n",
+ "import torch.nn as nn\n",
+ "import torch.optim as optim\n",
+ "from tqdm import tqdm\n",
+ "from torch.autograd import Variable\n",
+ "dtype = torch.FloatTensor"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "29f23588",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['i',\n",
+ " 'like',\n",
+ " 'reading',\n",
+ " 'i',\n",
+ " 'love',\n",
+ " 'drinking',\n",
+ " 'i',\n",
+ " 'hate',\n",
+ " 'playing',\n",
+ " 'i',\n",
+ " 'do',\n",
+ " 'nlp']"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "corpus = [\"i like reading\", \"i love drinking\", \"i hate playing\", \"i do nlp\"]\n",
+ "\n",
+ "word_list = ' '.join(corpus).split()\n",
+ "word_list"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "12b58886",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 1000 cost = 1.010682\n",
+ "epoch: 2000 cost = 0.695155\n",
+ "epoch: 3000 cost = 0.597085\n",
+ "epoch: 4000 cost = 0.531892\n",
+ "epoch: 5000 cost = 0.376044\n",
+ "epoch: 6000 cost = 0.118038\n",
+ "epoch: 7000 cost = 0.077081\n",
+ "epoch: 8000 cost = 0.053636\n",
+ "epoch: 9000 cost = 0.038089\n",
+ "epoch: 10000 cost = 0.027224\n",
+ "[['i', 'like'], ['i', 'love'], ['i', 'hate'], ['i', 'do']] -> ['studying', 'datawhale', 'playing', 'nlp']\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Build the corpus we need\n",
+ "corpus = [\"i like studying\", \"i love datawhale\", \"i hate playing\", \"i do nlp\"]\n",
+ "\n",
+ "word_list = ' '.join(corpus).split()  # split the corpus into words, e.g. ['i', 'like', 'studying', 'i', ..., 'nlp']\n",
+ "word_list = list(sorted(set(word_list)))  # deduplicate with set, then convert to a sorted list\n",
+ "# print(word_list)\n",
+ "\n",
+ "word_dict = {w: i for i, w in enumerate(word_list)}  # word -> index\n",
+ "number_dict = {i: w for i, w in enumerate(word_list)}  # index -> word\n",
+ "# print(word_dict)\n",
+ "# print(number_dict)\n",
+ "\n",
+ "n_class = len(word_dict)  # vocabulary size, used to size the embedding and output layers\n",
+ "\n",
+ "m = 2  # embedding dimension\n",
+ "n_step = 2  # size of the sliding window (number of context words)\n",
+ "n_hidden = 2  # hidden layer dimension\n",
+ "\n",
+ "\n",
+ "def make_batch(sentence):  # the corpus is tiny, so the whole training set forms a single batch\n",
+ "    input_batch = []\n",
+ "    target_batch = []\n",
+ "\n",
+ "    for sen in sentence:\n",
+ "        word = sen.split()\n",
+ "        inputs = [word_dict[n] for n in word[:-1]]  # all words but the last are the context\n",
+ "        target = word_dict[word[-1]]  # the last word is the prediction target\n",
+ "\n",
+ "        input_batch.append(inputs)\n",
+ "        target_batch.append(target)\n",
+ "\n",
+ "    return input_batch, target_batch\n",
+ "\n",
+ "\n",
+ "class NNLM(nn.Module):  # a feed-forward neural network language model\n",
+ "    def __init__(self):\n",
+ "        super(NNLM, self).__init__()\n",
+ "        self.embed = nn.Embedding(n_class, m)\n",
+ "        self.W = nn.Parameter(torch.randn(n_step * m, n_hidden).type(dtype))\n",
+ "        self.d = nn.Parameter(torch.randn(n_hidden).type(dtype))\n",
+ "\n",
+ "        self.U = nn.Parameter(torch.randn(n_hidden, n_class).type(dtype))\n",
+ "        self.b = nn.Parameter(torch.randn(n_class).type(dtype))\n",
+ "\n",
+ "    def forward(self, x):\n",
+ "        x = self.embed(x)  # (batch, n_step, m) = 4 x 2 x 2\n",
+ "        x = x.view(-1, n_step * m)  # concatenate the context embeddings: 4 x 4\n",
+ "        tanh = torch.tanh(self.d + torch.mm(x, self.W))  # (batch, n_hidden) = 4 x 2\n",
+ "        output = self.b + torch.mm(tanh, self.U)  # (batch, n_class) logits\n",
+ "        return output\n",
+ "\n",
+ "model = NNLM()\n",
+ "\n",
+ "criterion = nn.CrossEntropyLoss()  # loss function\n",
+ "optimizer = optim.Adam(model.parameters(), lr=0.001)  # optimizer\n",
+ "\n",
+ "input_batch, target_batch = make_batch(corpus)  # training inputs and labels\n",
+ "input_batch = torch.LongTensor(input_batch)\n",
+ "target_batch = torch.LongTensor(target_batch)\n",
+ "\n",
+ "for epoch in range(10000):  # training loop\n",
+ "    optimizer.zero_grad()\n",
+ "\n",
+ "    output = model(input_batch)  # input: 4 x 2\n",
+ "\n",
+ "    loss = criterion(output, target_batch)\n",
+ "\n",
+ "    if (epoch + 1) % 1000 == 0:\n",
+ "        print('epoch:', '%04d' % (epoch + 1), 'cost = {:.6f}'.format(loss.item()))\n",
+ "\n",
+ "    loss.backward()\n",
+ "    optimizer.step()\n",
+ "\n",
+ "predict = model(input_batch).data.max(1, keepdim=True)[1]  # predicted index = argmax over the logits\n",
+ "\n",
+ "print([sen.split()[:2] for sen in corpus], '->', [number_dict[n.item()] for n in predict.squeeze()])"
+ ]
+ },
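+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "nnlm-forward-sketch",
+ "metadata": {},
+ "source": [
+ "As a sketch, the forward pass of the `NNLM` class above can be summarized in one equation. It reuses the parameter names `W`, `d`, `U` and `b` from the code; the symbols $x$ and $\\hat{y}$ are notation introduced here. With $x$ the row vector obtained by concatenating the embeddings of the `n_step` context words:\n",
+ "\n",
+ "$$\\hat{y} = b + \\tanh(d + xW)\\,U$$\n",
+ "\n",
+ "Here $\\hat{y}$ holds one unnormalized score (logit) per vocabulary word. No softmax appears in `forward` because `nn.CrossEntropyLoss` applies log-softmax internally, and the prediction step simply takes the index of the largest logit."
+ ]
+ },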
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "93d8cd2f",
+ "metadata": {},
+ "source": [
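+ "# Word2Vec: the Skip-gram and CBOW Modes\n",
+ "The dense vector representation (distributed representation) mentioned earlier can be learned by training a Word2Vec model.  \n",
+ "The skip-gram model uses the center word to predict the surrounding words.  \n",
+ "The CBOW (continuous bag-of-words) model uses the surrounding words to predict the center word.  "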
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "id": "066f68a0",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 11%|█ | 10615/100000 [00:02<00:24, 3657.80it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 10000 cost = 1.955088\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 21%|██ | 20729/100000 [00:05<00:21, 3758.47it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 20000 cost = 1.673096\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 30%|███ | 30438/100000 [00:08<00:18, 3710.13it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 30000 cost = 2.247422\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 41%|████ | 40638/100000 [00:11<00:15, 3767.87it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 40000 cost = 2.289902\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 50%|█████ | 50486/100000 [00:13<00:13, 3713.98it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 50000 cost = 2.396217\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 61%|██████ | 60572/100000 [00:16<00:11, 3450.47it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 60000 cost = 1.539688\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 71%|███████ | 70638/100000 [00:19<00:07, 3809.11it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 70000 cost = 1.638879\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 80%|████████ | 80403/100000 [00:21<00:05, 3740.33it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 80000 cost = 2.279797\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 90%|█████████ | 90480/100000 [00:24<00:02, 3680.03it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 90000 cost = 1.992100\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 100000/100000 [00:27<00:00, 3677.35it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 100000 cost = 1.307715\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "