Rearrange Files

khanhnamle1994 · Dec 5, 2017 · f6e6d59
1 parent 96feda6
commit f6e6d59
Showing 17 changed files with 97,725 additions and 0 deletions.
diff --git a/A-Neural-Probabilistic-Language-Model.pdf → ...A-Neural-Probabilistic-Language-Model.pdf b/A-Neural-Probabilistic-Language-Model.pdf → ...A-Neural-Probabilistic-Language-Model.pdf
diff --git a/...Networks-for-Image-Speech-Time-Series.pdf → ...Networks-for-Image-Speech-Time-Series.pdf b/...Networks-for-Image-Speech-Time-Series.pdf → ...Networks-for-Image-Speech-Time-Series.pdf
diff --git a/...rning-Applied-to-Document-Recognition.pdf → ...rning-Applied-to-Document-Recognition.pdf b/...rning-Applied-to-Document-Recognition.pdf → ...rning-Applied-to-Document-Recognition.pdf
diff --git a/Assignment2/README.txt b/Assignment2/README.txt
@@ -0,0 +1,125 @@
+#######################################
+Neural Networks for Machine Learning
+Programming Assignment 2
+Learning word representations.
+#######################################
+
+In this assignment, you will design a neural net language model that will
+learn to predict the next word, given the previous three words.
+
+The data set consists of 4-grams (A 4-gram is a sequence of 4 adjacent words
+in a sentence). These 4-grams were extracted from a large collection of text.
+The 4-grams are chosen so that all the words involved come
+from a small vocabulary of 250 words. Note that for the purposes of this
+assignment special characters such as commas, full-stops, parentheses etc
+are also considered words. The training set consists of 372,550 4-grams. The
+validation and test sets have 46,568 4-grams each.
+
+### GETTING STARTED. ###
+Look at the file raw_sentences.txt. It contains the raw sentences from which
+these 4-grams were extracted. Take a look at the kind of sentences we are
+dealing with here. They are fairly simple ones.
+
+To load the data set, go to an octave terminal and cd to the directory where the
+downloaded data is located. Type
+
+> load data.mat
+
+This will load a struct called 'data' with 4 fields in it.
+You can see them by typing
+
+> fieldnames(data)
+
+'data.vocab' contains the vocabulary of 250 words. Training, validation and
+test sets are in 'data.trainData', 'data.validData' and 'data.testData'  respectively.
+To see the list of words in the vocabulary, type -
+
+> data.vocab
+
+'data.trainData' is a matrix of size 4 X 372550. This means there are 372550
+training cases and 4 words per training case. Each entry is an integer that is
+the index of a word in the vocabulary. So each column represents a sequence of 4
+words. 'data.validData' and 'data.testData' are similar. They contain
+46,568 4-grams each. All three need to be separated into inputs and targets
+and the training set needs to be split into mini-batches. The file load_data.m
+provides code for doing that. To run it type:
+
+> [train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);
+
+This will load the data, separate it into inputs and targets, and make
+mini-batches of size 100 for the training set.
+
+train.m implements the function that trains a neural net language model.
+To run the training, execute the following -
+
+> model = train(1);
+
+This will train the model for one epoch (one pass through the training set).
+Currently, the training is not implemented and the cross entropy will not
+decrease. You have to fill in parts of the code in fprop.m and train.m.
+Once the code is correctly filled-in, you will see that the cross entropy
+starts decreasing. At this point, try changing the hyperparameters (number
+of epochs, number of hidden units, learning rates, momentum, etc) and see
+what effect that has on the training and validation cross entropy. The
+questions in the assignment will ask you to try out specific values of these.
+
+The training method will output a 'model' (a struct containing weights, biases
+and a list of words). Now it's time to play around with the learned model
+and answer the questions in the assignment.
+
+### DESCRIPTION OF THE NETWORK. ###
+The network consists of an input layer, embedding layer, hidden layer and output
+layer. The input layer consists of three word indices. The same
+'word_embedding_weights' are used to map each index to a distributed feature
+representation. These mapped features constitute the embedding layer. This layer
+is connected to the hidden layer, which in turn is connected to the output
+layer. The output layer is a softmax over the 250 words.
+
+### THINGS YOU SEE WHEN THE MODEL IS TRAINING. ###
+As the model trains it prints out some numbers that tell you how well the
+training is going.
+(1) The model shows the average per-case cross entropy (CE) obtained
+on the training set. The average CE is computed every 100 mini-batches. The
+average CE over the entire training set is reported at the end of every epoch.
+
+(2) After every 1000 mini-batches of training, the model is run on the
+validation set. Recall that the validation set consists of data that is not
+used for training. It is used to see how well the model does on unseen data. The
+cross entropy on the validation set is reported.
+
+(3) At the end of training, the model is run both on the validation set and on
+the test set and the cross entropy on both is reported.
+
+You are welcome to change these numbers (100 and 1000) to see the CE values
+more frequently if you want to.
+
+
+### SOME USEFUL FUNCTIONS. ###
+These functions are meant to be used for analyzing the model after the training
+is done.
+  display_nearest_words.m : This method will display the words closest to a
+    given word in the word representation space.
+  word_distance.m : This method will compute the distance between two given
+    words.
+  predict_next_word.m : This method will produce some predictions for the next
+    word given 3 previous words.
+Take a look at the documentation inside these functions to see how to use them.
+
+
+### THINGS TO TRY. ###
+Choose some words from the vocabulary and make a list. Find the words that
+the model thinks are close to words in this list (for example, find the words
+closest to 'companies', 'president', 'day', 'could', etc). Do the outputs make
+sense?
+
+Pick three words from the vocabulary that go well together (for example,
+'government of united', 'city of new', 'life in the', 'he is the' etc). Use
+the model to predict the next word. Does the model give sensible predictions?
+
+Which words would you expect to be closer together than others? For example,
+'he' should be closer to 'she' than to 'federal', or 'companies' should be
+closer to 'business' than to 'political'. Find the distances using the model.
+Do the distances that the model predicts make sense?
+
+You are welcome to try other things with this model and post any interesting
+observations on the forums!
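
The README refers to the "average per-case cross entropy" without showing how such a number could be computed. The sketch below is one possible reading, not the code in train.m (which is not part of this excerpt). The toy probabilities, the variable names `probs`, `targets` and `tiny`, and the small constant guarding against log(0) are all assumptions made for illustration.

```octave
% Hypothetical toy values: a 4-word vocabulary and 3 cases.
% Each column is a predicted distribution over the vocabulary (sums to 1).
probs = [0.7 0.1 0.25;
         0.1 0.6 0.25;
         0.1 0.2 0.25;
         0.1 0.1 0.25];      % vocab_size X batchsize
targets = [1 2 4];           % index of the correct next word for each case

batchsize = size(probs, 2);

% Linear indices of the entries probs(targets(j), j).
idx = sub2ind(size(probs), targets, 1:batchsize);

% Assumed guard against log(0); the value is illustrative only.
tiny = exp(-30);

% Average negative log-probability assigned to the correct words.
CE = -sum(log(probs(idx) + tiny)) / batchsize;
fprintf(1, 'Average per-case CE: %.4f\n', CE);
```

A lower value means the model puts more probability on the correct next word; a model that always predicted the target with probability 1 would score (approximately) 0.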
diff --git a/Assignment2/data.mat b/Assignment2/data.mat
diff --git a/Assignment2/display_nearest_words.m b/Assignment2/display_nearest_words.m
@@ -0,0 +1,28 @@
+function display_nearest_words(word, model, k)
+% Shows the k-nearest words to the query word.
+% Inputs:
+%   word: The query word as a string.
+%   model: Model returned by the training script.
+%   k: The number of nearest words to display.
+% Example usage:
+%   display_nearest_words('school', model, 10);
+
+word_embedding_weights = model.word_embedding_weights;
+vocab = model.vocab;
+id = strmatch(word, vocab, 'exact');
+if ~any(id)
+  fprintf(1, 'Word ''%s'' not in vocabulary.\n', word);
+  return;
+end
+% Compute distance to every other word.
+vocab_size = size(vocab, 2);
+word_rep = word_embedding_weights(id, :);
+diff = word_embedding_weights - repmat(word_rep, vocab_size, 1);
+distance = sqrt(sum(diff .* diff, 2));
+
+% Sort by distance.
+[d, order] = sort(distance);
+order = order(2:k+1);  % The nearest word is the query word itself, skip that.
+for i = 1:k
+  fprintf('%s %.2f\n', vocab{order(i)}, distance(order(i)));
+end
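
To see what the distance ranking in display_nearest_words.m does without training a model first, the following sketch runs the same steps on a hand-made embedding. The four words and their 2-D coordinates are made up for illustration, and `strcmp` is used for the lookup here instead of `strmatch`; the real model has 250 words and learned weights.

```octave
% Toy stand-in for a trained model (hypothetical values, not from the assignment).
model.vocab = {'he', 'she', 'federal', 'day'};
model.word_embedding_weights = [ 1.0  0.0;    % 'he'
                                 0.9  0.1;    % 'she'
                                -1.0  2.0;    % 'federal'
                                 0.0 -1.0];   % 'day'

query = 'he';
id = find(strcmp(model.vocab, query));        % row index of the query word
vocab_size = numel(model.vocab);

% Euclidean distance from the query row to every row of the embedding.
word_rep = model.word_embedding_weights(id, :);
delta = model.word_embedding_weights - repmat(word_rep, vocab_size, 1);
distance = sqrt(sum(delta .* delta, 2));

% Sort ascending; the first entry is the query itself (distance 0).
[sorted_distance, order] = sort(distance);
nearest = model.vocab{order(2)};
fprintf(1, 'Nearest to %s: %s (%.2f)\n', query, nearest, sorted_distance(2));
```

With these made-up coordinates 'she' is the closest word to 'he', which is the kind of relationship the README suggests checking on the trained model.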
diff --git a/Assignment2/fprop.m b/Assignment2/fprop.m
@@ -0,0 +1,84 @@
+function [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
+  fprop(input_batch, word_embedding_weights, embed_to_hid_weights,...
+  hid_to_output_weights, hid_bias, output_bias)
+% This method forward propagates through a neural network.
+% Inputs:
+%   input_batch: The input data as a matrix of size numwords X batchsize where,
+%     numwords is the number of words, batchsize is the number of data points.
+%     So, if input_batch(i, j) = k then the ith word in data point j is word
+%     index k of the vocabulary.
+%
+%   word_embedding_weights: Word embedding as a matrix of size
+%     vocab_size X numhid1, where vocab_size is the size of the vocabulary and
+%     numhid1 is the dimensionality of the embedding space.
+%
+%   embed_to_hid_weights: Weights between the word embedding layer and hidden
+%     layer as a matrix of size numhid1*numwords X numhid2, numhid2 is the
+%     number of hidden units.
+%
+%   hid_to_output_weights: Weights between the hidden layer and output softmax
+%     unit as a matrix of size numhid2 X vocab_size.
+%
+%   hid_bias: Bias of the hidden layer as a matrix of size numhid2 X 1.
+%
+%   output_bias: Bias of the output layer as a matrix of size vocab_size X 1.
+%
+% Outputs:
+%   embedding_layer_state: State of units in the embedding layer as a matrix of
+%     size numhid1*numwords X batchsize
+%
+%   hidden_layer_state: State of units in the hidden layer as a matrix of size
+%     numhid2 X batchsize
+%
+%   output_layer_state: State of units in the output layer as a matrix of size
+%     vocab_size X batchsize
+%
+
+[numwords, batchsize] = size(input_batch);
+[vocab_size, numhid1] = size(word_embedding_weights);
+numhid2 = size(embed_to_hid_weights, 2);
+
+%% COMPUTE STATE OF WORD EMBEDDING LAYER.
+% Look up the input word indices in the word_embedding_weights matrix.
+embedding_layer_state = reshape(...
+  word_embedding_weights(reshape(input_batch, 1, []),:)',...
+  numhid1 * numwords, []);
+
+%% COMPUTE STATE OF HIDDEN LAYER.
+% Compute inputs to hidden units.
+inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
+  repmat(hid_bias, 1, batchsize);
+
+% Apply logistic activation function.
+% FILL IN CODE. Replace the line below by one of the options.
+hidden_layer_state = zeros(numhid2, batchsize);
+% Options
+% (a) hidden_layer_state = 1 ./ (1 + exp(inputs_to_hidden_units));
+% (b) hidden_layer_state = 1 ./ (1 - exp(-inputs_to_hidden_units));
+% (c) hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units));
+% (d) hidden_layer_state = -1 ./ (1 + exp(-inputs_to_hidden_units));
+
+%% COMPUTE STATE OF OUTPUT LAYER.
+% Compute inputs to softmax.
+% FILL IN CODE. Replace the line below by one of the options.
+inputs_to_softmax = zeros(vocab_size, batchsize);
+% Options
+% (a) inputs_to_softmax = hid_to_output_weights' * hidden_layer_state +  repmat(output_bias, 1, batchsize);
+% (b) inputs_to_softmax = hid_to_output_weights' * hidden_layer_state +  repmat(output_bias, batchsize, 1);
+% (c) inputs_to_softmax = hidden_layer_state * hid_to_output_weights' +  repmat(output_bias, 1, batchsize);
+% (d) inputs_to_softmax = hid_to_output_weights * hidden_layer_state +  repmat(output_bias, batchsize, 1);
+
+% Subtract maximum.
+% Remember that adding or subtracting the same constant from each input to a
+% softmax unit does not affect the outputs. Here we are subtracting the maximum
+% to make all inputs <= 0. This prevents overflows when computing their
+% exponentials.
+inputs_to_softmax = inputs_to_softmax...
+  - repmat(max(inputs_to_softmax), vocab_size, 1);
+
+% Compute exp.
+output_layer_state = exp(inputs_to_softmax);
+
+% Normalize to get probability distribution.
+output_layer_state = output_layer_state ./ repmat(...
+  sum(output_layer_state, 1), vocab_size, 1);
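
The comment in fprop.m about subtracting the maximum before the softmax can be checked on a small example. The sketch below uses made-up inputs (the matrix `z` is hypothetical) and compares a naive softmax with the shifted version that follows the same three steps as the code above: subtract the column maximum, exponentiate, normalize.

```octave
% Toy softmax inputs (hypothetical): 3 "words" X 2 cases.
% The first column has large values that overflow exp() in double precision.
z = [1000 1;
     1001 2;
     1002 3];
vocab_size = size(z, 1);

% Naive softmax: exp(1000) is Inf, so the first column becomes Inf ./ Inf = NaN.
naive = exp(z) ./ repmat(sum(exp(z), 1), vocab_size, 1);

% Shifted softmax: subtract each column's maximum so that all inputs are <= 0.
shifted = z - repmat(max(z), vocab_size, 1);
stable = exp(shifted);
stable = stable ./ repmat(sum(stable, 1), vocab_size, 1);

disp(naive);
disp(stable);
```

Both columns of `z` differ only by a constant offset, so the shifted softmax gives the same distribution for each, while the naive version only succeeds on the second column.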
diff --git a/Assignment2/load_data.m b/Assignment2/load_data.m
@@ -0,0 +1,27 @@
+function [train_input, train_target, valid_input, valid_target, test_input, test_target, vocab] = load_data(N)
+% This method loads the training, validation and test set.
+% It also divides the training set into mini-batches.
+% Inputs:
+%   N: Mini-batch size.
+% Outputs:
+%   train_input: An array of size D X N X M, where
+%                 D: number of input dimensions (in this case, 3).
+%                 N: size of each mini-batch (in this case, 100).
+%                 M: number of minibatches.
+%   train_target: An array of size 1 X N X M.
+%   valid_input, test_input: Arrays of size D X number of points in each set.
+%   valid_target, test_target: Arrays of size 1 X number of points in each set.
+%   vocab: Vocabulary containing index to word mapping.
+
+load data.mat;
+numdims = size(data.trainData, 1);
+D = numdims - 1;
+M = floor(size(data.trainData, 2) / N);
+train_input = reshape(data.trainData(1:D, 1:N * M), D, N, M);
+train_target = reshape(data.trainData(D + 1, 1:N * M), 1, N, M);
+valid_input = data.validData(1:D, :);
+valid_target = data.validData(D + 1, :);
+test_input = data.testData(1:D, :);
+test_target = data.testData(D + 1, :);
+vocab = data.vocab;
+end
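
The reshape calls in load_data.m are easier to follow on a small matrix. The sketch below applies the same indexing to a made-up 4 X 7 array standing in for data.trainData (the values and the batch size of 3 are hypothetical) so the resulting mini-batches can be inspected by eye.

```octave
% Toy stand-in for data.trainData (hypothetical): 4 words X 7 cases.
% Rows 1-3 are the input words, row 4 is the target word.
trainData = [ 1  2  3  4  5  6  7;
             11 12 13 14 15 16 17;
             21 22 23 24 25 26 27;
             31 32 33 34 35 36 37];

N = 3;                                  % mini-batch size
D = size(trainData, 1) - 1;             % 3 input words per case
M = floor(size(trainData, 2) / N);      % 2 full mini-batches; case 7 is dropped

% Same indexing as load_data.m: consecutive blocks of N columns become pages.
train_input  = reshape(trainData(1:D, 1:N * M), D, N, M);
train_target = reshape(trainData(D + 1, 1:N * M), 1, N, M);

% Mini-batch m is train_input(:, :, m) with targets train_target(:, :, m).
second_batch = train_input(:, :, 2);
disp(second_batch);
disp(train_target(:, :, 2));
```

Because reshape fills column by column, cases 4 to 6 end up together in the second page, and any leftover cases that do not fill a whole mini-batch are discarded by the `floor`.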
diff --git a/Assignment2/predict_next_word.m b/Assignment2/predict_next_word.m
@@ -0,0 +1,37 @@
+function predict_next_word(word1, word2, word3, model, k)
+% Predicts the next word.
+% Inputs:
+%   word1: The first word as a string.
+%   word2: The second word as a string.
+%   word3: The third word as a string.
+%   model: Model returned by the training script.
+%   k: The k most probable predictions are shown.
+% Example usage:
+%   predict_next_word('john', 'might', 'be', model, 3);
+%   predict_next_word('life', 'in', 'new', model, 3);
+
+word_embedding_weights = model.word_embedding_weights;
+vocab = model.vocab;
+id1 = strmatch(word1, vocab, 'exact');
+id2 = strmatch(word2, vocab, 'exact');
+id3 = strmatch(word3, vocab, 'exact');
+if ~any(id1)
+  fprintf(1, 'Word ''%s'' not in vocabulary.\n', word1);
+  return;
+end
+if ~any(id2)
+  fprintf(1, 'Word ''%s'' not in vocabulary.\n', word2);
+  return;
+end
+if ~any(id3)
+  fprintf(1, 'Word ''%s'' not in vocabulary.\n', word3);
+  return;
+end
+input = [id1; id2; id3];
+[embedding_layer_state, hidden_layer_state, output_layer_state] = ...
+  fprop(input, model.word_embedding_weights, model.embed_to_hid_weights,...
+        model.hid_to_output_weights, model.hid_bias, model.output_bias);
+[prob, indices] = sort(output_layer_state, 'descend');
+for i = 1:k
+  fprintf(1, '%s %s %s %s Prob: %.5f\n', word1, word2, word3, vocab{indices(i)}, prob(i));
+end
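
The last step of predict_next_word.m, picking the k most probable words from the softmax output, can be tried in isolation. The sketch below assumes a made-up five-word vocabulary and a hand-written probability vector in place of the output of fprop; only the sort-and-print logic mirrors the function above.

```octave
% Hypothetical toy vocabulary and softmax output (entries sum to 1).
vocab = {'the', 'of', 'new', 'york', '.'};
output_layer_state = [0.10; 0.05; 0.20; 0.60; 0.05];   % vocab_size X 1
k = 2;

% Sort probabilities from largest to smallest and keep the word indices.
[prob, indices] = sort(output_layer_state, 'descend');
top_words = vocab(indices(1:k));

for i = 1:k
  fprintf(1, '%s Prob: %.5f\n', top_words{i}, prob(i));
end
```

With these toy numbers the two printed candidates are 'york' and then 'new'. On the trained model the same loop runs over the 250-word vocabulary.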