
Experimental Variables


Experiments to try

Following is a list of possible hyperparameter settings and other "things to try" in the space of possibilities we have mapped out so far. We need to zero in on the most essential paths (from raw text to document-level and observation-level features, on through to classification and evaluation), make sure those paths work and can be accessed through a user interface, and test them on our data. Asterisks mark provisional defaults.

Features (document vectors)

Bag of words

  • Vocabulary size: 5000*, 10000, 20000

tf-idf vectors

  • Vocabulary size
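A minimal sketch of both vectorizations, assuming scikit-learn; `docs` is a hypothetical stand-in for our corpus, and `max_features` is what caps the vocabulary size:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["placeholder document one", "placeholder document two"]  # hypothetical corpus

# Bag of words: max_features keeps only the N most frequent terms.
for vocab_size in (5000, 10000, 20000):
    X_bow = CountVectorizer(max_features=vocab_size).fit_transform(docs)

# tf-idf reweights the same counts by inverse document frequency;
# the vocabulary-size sweep works identically.
X_tfidf = TfidfVectorizer(max_features=5000).fit_transform(docs)
```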

Skip-thoughts

  • Aggregation method: first sentence*; first and last sentences (concatenated); average over all sentences
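A sketch of the three aggregation options, assuming the skip-thought encoder has already produced one vector per sentence; `sent_vecs` is a hypothetical (n_sentences, dim) array standing in for its output:

```python
import numpy as np

sent_vecs = np.random.rand(6, 4800)  # hypothetical; 4800 = combined-skip dimensionality

def aggregate(sent_vecs, method="first"):
    if method == "first":            # first sentence only (default)
        return sent_vecs[0]
    if method == "first_last":       # first and last sentences, concatenated (2 * dim)
        return np.concatenate([sent_vecs[0], sent_vecs[-1]])
    if method == "avg":              # average over all sentences
        return sent_vecs.mean(axis=0)
    raise ValueError(method)

doc_vec = aggregate(sent_vecs, method="first_last")
```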

Latent Dirichlet Allocation

  • K (number of 'topics'): 40, 100, 300, ...
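A minimal sketch of the K sweep, assuming scikit-learn's LDA fit on bag-of-words counts (`docs` is again a hypothetical corpus):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["placeholder document one", "placeholder document two"]  # hypothetical corpus
X_bow = CountVectorizer().fit_transform(docs)

for K in (40, 100, 300):
    lda = LatentDirichletAllocation(n_components=K, random_state=0)
    doc_topics = lda.fit_transform(X_bow)  # (n_docs, K) topic weights = document vectors
```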

One-hot (word)

Represent the document as a matrix of one-hot vectors, one per word.

  • Vocabulary size: 5000, 10000*
  • Document size (if it needs to be fixed): 100 words, 500 words?
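A sketch of the construction, assuming a hypothetical word-to-index `vocab` dict of the chosen size; documents are truncated or zero-padded to a fixed length so every matrix has the same shape:

```python
import numpy as np

vocab = {"cat": 0, "sat": 1, "mat": 2}  # hypothetical word -> index mapping

def one_hot_matrix(tokens, vocab, doc_len=100):
    mat = np.zeros((doc_len, len(vocab)), dtype=np.float32)
    for i, tok in enumerate(tokens[:doc_len]):  # truncate overly long documents
        j = vocab.get(tok)
        if j is not None:                       # skip out-of-vocabulary words
            mat[i, j] = 1.0
    return mat                                  # rows past len(tokens) stay zero (padding)

doc_mat = one_hot_matrix(["the", "cat", "sat"], vocab)
```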

Pretrained word embedding lookup

Represent the document using word2vec/GloVe vectors from pretrained models.

  • Which model: w2v/news, GloVe/news, GloVe/Twitter (plus dimensionality, etc.)
  • How to aggregate: sum/avg; concatenate the top K words by tf-idf (see the sketch below)
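A sketch of the aggregation options; `emb` is a hypothetical word-to-vector lookup standing in for a pretrained word2vec/GloVe model, and `tfidf_weight` for per-word tf-idf scores:

```python
import numpy as np

emb = {"cat": np.random.rand(300), "sat": np.random.rand(300)}  # hypothetical lookup
tfidf_weight = {"cat": 0.9, "sat": 0.1}                         # hypothetical scores

def embed_document(tokens, method="avg", top_k=2):
    vecs = [emb[t] for t in tokens if t in emb]
    if method == "sum":
        return np.sum(vecs, axis=0)
    if method == "avg":
        return np.mean(vecs, axis=0)
    if method == "top_k_concat":  # concatenate vectors of the K highest-tf-idf words
        ranked = sorted({t for t in tokens if t in emb},
                        key=lambda t: -tfidf_weight[t])[:top_k]
        return np.concatenate([emb[t] for t in ranked])
    raise ValueError(method)

doc_vec = embed_document(["the", "cat", "sat"], method="top_k_concat")
```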

Observation-level feature extraction

History/preceding documents

How to represent the documents against which the query is compared?

Options:

  • Average all document vectors
  • Sum all document vectors*
  • Elementwise max over document vectors
  • Nothing
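A sketch of the four options; `history` is a hypothetical (n_docs, dim) array of vectors for the preceding documents:

```python
import numpy as np

history = np.random.rand(8, 300)  # hypothetical: 8 preceding documents, dim 300

def history_vector(history, method="sum"):
    if method == "avg":   # average all document vectors
        return history.mean(axis=0)
    if method == "sum":   # sum all document vectors (default)
        return history.sum(axis=0)
    if method == "max":   # elementwise max over document vectors
        return history.max(axis=0)
    if method == "none":  # no history representation
        return None
    raise ValueError(method)
```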

"Interaction" vectors

Some methods may do better if we explicitly supply a vector representing the interaction between the query and its history.

Options:

  • Difference (query - history): Using BoW as an example, words that occur equally often in the query and the history yield 0, no matter the original frequencies; the sign signals which of the two was larger.
  • Quotient (query/history): We may need to smooth away zeros in the history vector first, but this emphasizes entries where the query loads high and the history is barely represented. Also consider the log difference.
  • Product (query * history): Low when both query and history are low, much higher when both are high.
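A sketch of all three (plus the log-difference variant), with a small `eps` smoothing away the zeros mentioned above:

```python
import numpy as np

query, history = np.random.rand(300), np.random.rand(300)  # hypothetical document vectors

def interaction(query, history, method="diff", eps=1e-8):
    if method == "diff":      # query - history; sign says which side was larger
        return query - history
    if method == "quot":      # query / history, zeros smoothed away
        return query / (history + eps)
    if method == "log_diff":  # log-space variant of the quotient
        return np.log(query + eps) - np.log(history + eps)
    if method == "prod":      # elementwise product
        return query * history
    raise ValueError(method)
```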

Modeling Algorithms

Logistic regression

Tried and true.

Hyperparameters:

  • Regularization strength: 1/lambda = C = 1.0, 0.1, 0.001, etc.
  • Regularization penalty type: L1, L2*
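A minimal sketch of the sweep, assuming scikit-learn (`make_classification` stands in for our real features and labels):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # stand-in data

for C in (1.0, 0.1, 0.001):           # C = 1/lambda: smaller C, stronger regularization
    for penalty in ("l1", "l2"):
        clf = LogisticRegression(C=C, penalty=penalty, solver="liblinear")
        clf.fit(X, y)
```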

Kernel SVM

Because it has "machine" in the name.

Hyperparameters:

  • Regularization strength (C) and kernel scale (e.g. the RBF width gamma)
  • Kernel type: linear (good for BoW); radial basis function (good for w2v/other dense vectors?)
  • Probably other things
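A sketch of the kernel comparison, again assuming scikit-learn; C is the regularization strength and gamma the RBF width:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # stand-in data

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")  # gamma only matters for rbf
    clf.fit(X, y)
```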

Boosted trees, bagged/random forests

Reputed to be very good classifiers.

Hyperparameters (not all apply to all methods):

  • Number of trees grown
  • Maximum tree depth
  • Number of features to consider at each split
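A sketch showing where the three hyperparameters land in scikit-learn (n_estimators, max_depth, max_features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # stand-in data

# Bagged/random forest: many deep trees, a random subset of features per split.
rf = RandomForestClassifier(n_estimators=500, max_depth=None, max_features="sqrt")
rf.fit(X, y)

# Boosted trees: many shallow trees, fit sequentially on residual error.
gbt = GradientBoostingClassifier(n_estimators=200, max_depth=3)
gbt.fit(X, y)
```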

Recurrent classifier (LSTM, GRU)

  • LSTM or GRU
  • Hidden size
  • Pooling/inference method: sum over hidden states; average over hidden states; take the last output/hidden state, then feed into a regression layer
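A minimal sketch, assuming Keras and the last-hidden-state pooling variant; the vocabulary and hidden sizes are placeholders:

```python
from tensorflow.keras import layers, models

# Swap layers.LSTM for layers.GRU to compare cell types; for the averaging
# variant, set return_sequences=True and add layers.GlobalAveragePooling1D().
model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=128),  # word index -> dense vector
    layers.LSTM(256),                                   # hidden size 256, last state only
    layers.Dense(1, activation="sigmoid"),              # logistic-regression head
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```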

Dynamic memory networks

  • Encoding mechanism: word2vec -> GRU (original); one-hot word -> CNN; one-hot word -> bidirectional LSTM
  • With/without the Q module (possibly unnecessary parameter duplication)?
  • With/without multi-task training (e.g. predicting how many upvotes a post got).
  • Can presumably take any input vectorization technique; the original uses word2vec (?).

Summary

That is a lot of paths. I bet we won't do the full Cartesian product of all possibilities, since most of it would be boring and pointless. At a squinty glance, I count anywhere from 50 to 100 meaningful runs. Please let me know what you think.