
Commit

Update notebooks
mcleonard committed May 10, 2017
1 parent 4183e47 commit 9976737
Showing 5 changed files with 100 additions and 133 deletions.
114 changes: 51 additions & 63 deletions embeddings/Skip-Gram_word2vec.ipynb
@@ -2,14 +2,11 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"metadata": {},
"source": [
"# Skip-gram word2vec\n",
"\n",
"In this notebook, I'll lead you through using TensorFlow to implement the word2vec algorithm using the skip-gram architecture. By implementing this, you'll learn about embedding words for use in natural language processing. This will come in handy when dealing with things like translations.\n",
"In this notebook, I'll lead you through using TensorFlow to implement the word2vec algorithm using the skip-gram architecture. By implementing this, you'll learn about embedding words for use in natural language processing. This will come in handy when dealing with things like machine translation.\n",
"\n",
"## Readings\n",
"\n",
@@ -23,7 +20,31 @@
"\n",
"## Word embeddings\n",
"\n",
"When you're dealing with language and words, you end up with tens of thousands of classes to predict, one for each word. Trying to one-hot encode these words is massively inefficient, you'll have one element set to 1 and the other 50,000 set to 0. The word2vec algorithm finds much more efficient representations by finding vectors that represent the words. These vectors also contain semantic information about the words. Words that show up in similar contexts, such as \"black\", \"white\", and \"red\" will have vectors near each other. There are two architectures for implementing word2vec, CBOW (Continuous Bag-Of-Words) and Skip-gram.\n",
"When you're dealing with words in text, you end up with tens of thousands of classes to predict, one for each word. Trying to one-hot encode these words is massively inefficient, you'll have one element set to 1 and the other 50,000 set to 0. The matrix multiplication going into the first hidden layer will have almost all of the resulting values be zero. This a huge waste of computation. \n",
"\n",
"![one-hot encodings](assets/one_hot_encoding.png)\n",
"\n",
"To solve this problem and greatly increase the efficiency of our networks, we use what are called embeddings. Embeddings are just a fully connected layer like you've seen before. We call this layer the embedding layer and the weights are embedding weights. We skip the multiplication into the embedding layer by instead directly grabbing the hidden layer values from the weight matrix. We can do this because the multiplication of a one-hot encoded vector with a matrix returns the row of the matrix corresponding the index of the \"on\" input unit.\n",
"\n",
"![lookup](assets/lookup_matrix.png)\n",
"\n",
"Instead of doing the matrix multiplication, we use the weight matrix as a lookup table. We encode the words as integers, for example \"heart\" is encoded as 958, \"mind\" as 18094. Then to get hidden layer values for \"heart\", you just take the 958th row of the embedding matrix. This process is called an **embedding lookup** and the number of hidden units is the **embedding dimension**.\n",
"\n",
"<img src='assets/tokenize_lookup.png' width=500>\n",
" \n",
"There is nothing magical going on here. The embedding lookup table is just a weight matrix. The embedding layer is just a hidden layer. The lookup is just a shortcut for the matrix multiplication. The lookup table is trained just like any weight matrix as well.\n",
"\n",
"Embeddings aren't only used for words of course. You can use them for any model where you have a massive number of classes. A particular type of model called **Word2Vec** uses the embedding layer to find vector representations of words that contain semantic meaning.\n",
"\n"
]
},
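To make the lookup idea concrete, here is a small NumPy sketch added for illustration (it is not part of the notebook, and the vocabulary and embedding sizes are made up):

```python
import numpy as np

# Tiny "embedding matrix": 5-word vocabulary, 3 hidden units (made-up sizes)
embedding = np.random.uniform(-1, 1, size=(5, 3))

# One-hot vector for the word with integer index 2
one_hot = np.zeros(5)
one_hot[2] = 1

# Multiplying the one-hot vector by the matrix picks out a single row...
via_matmul = one_hot @ embedding

# ...which is exactly the row you get by indexing the matrix directly (the lookup)
via_lookup = embedding[2]

assert np.allclose(via_matmul, via_lookup)
```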
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Word2Vec\n",
"\n",
"The word2vec algorithm finds much more efficient representations by finding vectors that represent the words. These vectors also contain semantic information about the words. Words that show up in similar contexts, such as \"black\", \"white\", and \"red\" will have vectors near each other. There are two architectures for implementing word2vec, CBOW (Continuous Bag-Of-Words) and Skip-gram.\n",
"\n",
"<img src=\"assets/word2vec_architectures.png\" width=\"500\">\n",
"\n",
@@ -36,9 +57,7 @@
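To see what the skip-gram architecture trains on, here is a hypothetical toy example (not taken from the notebook) of the (input word, context word) pairs produced with a fixed window size of 2:

```python
# Toy illustration of skip-gram training pairs
sentence = "the quick brown fox jumps".split()
C = 2  # window size

pairs = []
for i, center in enumerate(sentence):
    context = sentence[max(0, i - C):i] + sentence[i + 1:i + C + 1]
    pairs.extend((center, word) for word in context)

print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]
```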
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
"collapsed": true
},
"outputs": [],
"source": [
@@ -61,7 +80,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"collapsed": true
},
"outputs": [],
"source": [
@@ -110,9 +129,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
"collapsed": true
},
"outputs": [],
"source": [
@@ -124,7 +141,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"collapsed": true
},
"outputs": [],
"source": [
@@ -143,9 +160,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
"collapsed": true
},
"outputs": [],
"source": [
@@ -174,7 +189,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"collapsed": true
},
"outputs": [],
"source": [
@@ -191,10 +206,7 @@
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"metadata": {},
"source": [
"Now that our data is in good shape, we need to get it into the proper form to pass it into our network. With the skip-gram architecture, for each word in the text, we want to grab all the words in a window around that word, with size $C$. \n",
"\n",
@@ -209,9 +221,7 @@
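As a rough illustration of that windowing step, here is a minimal sketch. The helper name `get_target` and its exact behavior are assumptions made for this example (the notebook's actual exercise text is collapsed in this diff and may, for instance, sample the window size randomly):

```python
def get_target(words, idx, C=5):
    """Return the words in a window of size C around the word at position idx.

    Simplified sketch for illustration, not the notebook's solution.
    """
    start = max(0, idx - C)
    return words[start:idx] + words[idx + 1:idx + C + 1]

# Example: context words around 'fox' with C = 2
print(get_target("the quick brown fox jumps over".split(), idx=3, C=2))
# ['quick', 'brown', 'jumps', 'over']
```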
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
"collapsed": true
},
"outputs": [],
"source": [
@@ -234,9 +244,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
"collapsed": true
},
"outputs": [],
"source": [
@@ -262,18 +270,16 @@
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"metadata": {},
"source": [
"## Building the graph\n",
"\n",
"From Chris McCormick's blog, we can see the general structure of our network.\n",
"From [Chris McCormick's blog](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/), we can see the general structure of our network.\n",
"![embedding_network](./assets/skip_gram_net_arch.png)\n",
"\n",
"The input words are passed in as one-hot encoded vectors. This will go into a hidden layer of linear units, then into a softmax layer. We'll use the softmax layer to make a prediction like normal.\n",
"The input words are passed in as integers. This will go into a hidden layer of linear units, then into a softmax layer. We'll use the softmax layer to make a prediction like normal.\n",
"\n",
"The idea here is to train the hidden layer weight matrix to find efficient representations for our words. This weight matrix is usually called the embedding matrix or embedding look-up table. We can discard the softmax layer becuase we don't really care about making predictions with this network. We just want the embedding matrix so we can use it in other networks we build from the dataset.\n",
"The idea here is to train the hidden layer weight matrix to find efficient representations for our words. We can discard the softmax layer becuase we don't really care about making predictions with this network. We just want the embedding matrix so we can use it in other networks we build from the dataset.\n",
"\n",
"I'm going to have you build the graph in stages now. First off, creating the `inputs` and `labels` placeholders like normal.\n",
"\n",
@@ -284,7 +290,7 @@
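A minimal sketch of that first step, in the TensorFlow 1.x style the notebook uses, might look like the code below. The graph name and the 2-D shape of `labels` are assumptions; the extra dimension is chosen because sampled losses such as `tf.nn.sampled_softmax_loss` expect a 2-D labels tensor.

```python
import tensorflow as tf  # TensorFlow 1.x

train_graph = tf.Graph()
with train_graph.as_default():
    # Batches of input word tokens (integers) and their target context words
    inputs = tf.placeholder(tf.int32, [None], name='inputs')
    labels = tf.placeholder(tf.int32, [None, None], name='labels')
```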
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"collapsed": true
},
"outputs": [],
"source": [
@@ -308,16 +314,10 @@
"collapsed": true
},
"source": [
"The embedding matrix has a size of the number of words by the number of neurons in the hidden layer. So, if you have 10,000 words and 300 hidden units, the matrix will have size $10,000 \\times 300$. Remember that we're using one-hot encoded vectors for our inputs. When you do the matrix multiplication of the one-hot vector with the embedding matrix, you end up selecting only one row out of the entire matrix:\n",
"The embedding matrix has a size of the number of words by the number of units in the hidden layer. So, if you have 10,000 words and 300 hidden units, the matrix will have size $10,000 \\times 300$. Remember that we're using tokenized data for our inputs, usually as integers, where the number of tokens is the number of words in our vocabulary.\n",
"\n",
"![one-hot matrix multiplication](assets/matrix_mult_w_one_hot.png)\n",
"\n",
"You don't actually need to do the matrix multiplication, you just need to select the row in the embedding matrix that corresponds to the input word. Then, the embedding matrix becomes a lookup table, you're looking up a vector the size of the hidden layer that represents the input word.\n",
"\n",
"<img src=\"assets/word2vec_weight_matrix_lookup_table.png\" width=500>\n",
"\n",
"\n",
"> **Exercise:** Tensorflow provides a convenient function [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) that does this lookup for us. You pass in the embedding matrix and a tensor of integers, then it returns rows in the matrix corresponding to those integers. Below, set the number of embedding features you'll use (200 is a good start), create the embedding matrix variable, and use [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) to get the embedding tensors. For the embedding matrix, I suggest you initialize it with a uniform random numbers between -1 and 1 using [tf.random_uniform](https://www.tensorflow.org/api_docs/python/tf/random_uniform). This [TensorFlow tutorial](https://www.tensorflow.org/tutorials/word2vec) will help if you get stuck."
"> **Exercise:** Tensorflow provides a convenient function [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) that does this lookup for us. You pass in the embedding matrix and a tensor of integers, then it returns rows in the matrix corresponding to those integers. Below, set the number of embedding features you'll use (200 is a good start), create the embedding matrix variable, and use `tf.nn.embedding_lookup` to get the embedding tensors. For the embedding matrix, I suggest you initialize it with a uniform random numbers between -1 and 1 using [tf.random_uniform](https://www.tensorflow.org/api_docs/python/tf/random_uniform)."
]
},
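One possible sketch for this exercise is below (not the notebook's official solution). It builds on the placeholder sketch above, and `int_to_vocab` is assumed to be a vocabulary mapping created in earlier cells that are collapsed in this diff:

```python
n_vocab = len(int_to_vocab)  # assumed vocabulary lookup built earlier in the notebook
n_embedding = 200            # number of embedding features

with train_graph.as_default():
    # Embedding matrix initialized with uniform random values in [-1, 1)
    embedding = tf.Variable(tf.random_uniform((n_vocab, n_embedding), -1, 1))
    # Select the rows of the embedding matrix for the input word tokens
    embed = tf.nn.embedding_lookup(embedding, inputs)
```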
{
@@ -356,9 +356,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
"collapsed": true
},
"outputs": [],
"source": [
@@ -414,9 +412,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
"collapsed": true
},
"outputs": [],
"source": [
@@ -437,9 +433,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
"collapsed": true
},
"outputs": [],
"source": [
@@ -505,7 +499,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"collapsed": true
},
"outputs": [],
"source": [
@@ -530,9 +524,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
"collapsed": true
},
"outputs": [],
"source": [
@@ -547,9 +539,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
"collapsed": true
},
"outputs": [],
"source": [
@@ -562,9 +552,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
"collapsed": true
},
"outputs": [],
"source": [
@@ -591,7 +579,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
"version": "3.6.1"
}
},
"nbformat": 4,
