Experimental Variables
Below is a list of possible hyperparameter settings and other "things to try" in the space of possibilities we have mapped out so far. We need to zero in on the most essential paths (from raw text, to document-level and observation-level features, through classification and evaluation), make sure those paths work and can be accessed through a user interface, and test them on our data.
- Vocabulary size: 5000*, 10000, 20000
- Aggregation method: first sentence*; first and last (concatenated); avg all
- K (number of 'topics'): 40, 100, 300, ...
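The vocabulary-size and K (topics) knobs above can be sketched as follows; scikit-learn is an assumption (the library isn't named in this page), and the corpus and tiny K are stand-ins for the real grid.

```python
# Sketch: bag-of-words with a capped vocabulary, then K "topics" via LDA.
# The library choice (scikit-learn) and the toy corpus are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
    "investors sold stocks after the close",
]

# Vocabulary size: keep only the max_features most frequent terms.
vectorizer = CountVectorizer(max_features=5000)
X = vectorizer.fit_transform(docs)

# K (number of 'topics'); tiny here, 40/100/300 in the real sweep.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # shape: (n_docs, K)

print(doc_topics.shape)  # (4, 2)
```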
Represent document as matrix of one-hot vectors, one per word
- Vocabulary size: 5000, 10000*
- Document size (if it needs to be fixed): 100 words, 500 words?
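A minimal sketch of the one-hot matrix representation with a fixed document size, padding or truncating as needed; the toy vocabulary, the pad token, and the choice to map unknown words to the pad id are all illustrative assumptions.

```python
# Sketch: a document as a (doc_len x vocab_size) matrix of one-hot rows,
# padded/truncated to a fixed length. Vocabulary and sizes are assumptions.
import numpy as np

vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3, "mat": 4}

def one_hot_doc(tokens, vocab, doc_len=8):
    mat = np.zeros((doc_len, len(vocab)), dtype=np.float32)
    for i, tok in enumerate(tokens[:doc_len]):           # truncate long docs
        mat[i, vocab.get(tok, vocab["<pad>"])] = 1.0     # unknowns -> pad id here
    for i in range(min(len(tokens), doc_len), doc_len):  # pad short docs
        mat[i, vocab["<pad>"]] = 1.0
    return mat

m = one_hot_doc(["the", "cat", "sat"], vocab)
print(m.shape)  # (8, 5)
```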
Represent document based on word2vec/GloVe vectors from pretrained models.
- Which model: w2v/news, GloVe/news, GloVe/Twitter (+ dimensionality etc)
- How to aggregate: sum/avg, concatenate top K by tfidf
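The two aggregation options can be sketched like this; the 3-d "pretrained" vectors and tf-idf weights below are made-up stand-ins for real w2v/GloVe lookups.

```python
# Sketch: aggregate pretrained word vectors by (1) averaging everything or
# (2) concatenating the top-K words by tf-idf. All values are toy stand-ins.
import numpy as np

word_vecs = {  # stand-in for w2v/GloVe lookups
    "market": np.array([0.9, 0.1, 0.0]),
    "stocks": np.array([0.8, 0.2, 0.1]),
    "the":    np.array([0.0, 0.0, 0.1]),
}
tfidf = {"market": 2.1, "stocks": 1.8, "the": 0.05}

tokens = ["the", "market", "stocks"]

# Option 1: sum/avg all word vectors.
avg_vec = np.mean([word_vecs[t] for t in tokens], axis=0)

# Option 2: concatenate the K highest-tf-idf words' vectors.
K = 2
top = sorted(tokens, key=lambda t: tfidf[t], reverse=True)[:K]
concat_vec = np.concatenate([word_vecs[t] for t in top])

print(avg_vec.shape, concat_vec.shape)  # (3,) (6,)
```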
How to represent the documents against which the query is compared?
Options:
- Average all document vectors
- Sum all document vectors*
- Elementwise max over document vectors
- Nothing
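The first three options above reduce to one-liners over a stack of document vectors (one row per history document); the data here is illustrative.

```python
# Sketch of the history-aggregation options over document vectors.
import numpy as np

history = np.array([[1.0, 0.0, 2.0],
                    [0.0, 3.0, 1.0],
                    [2.0, 1.0, 0.0]])  # rows = documents

avg_vec = history.mean(axis=0)  # average all document vectors
sum_vec = history.sum(axis=0)   # sum all document vectors (starred option)
max_vec = history.max(axis=0)   # elementwise max over document vectors

print(sum_vec)  # [3. 4. 3.]
```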
Some methods may do better if we explicitly supply a vector representing the interaction between the query and its history.
Options:
- Difference (query - history): Using BoW as an example, words that occur equally often across documents yield 0, no matter the original frequencies. Sign provides signal of which (query or history) was bigger.
- Quotient (query/history): May need to remove zeros from document vectors, but this would emphasize vector entries where the query is loaded high and the history isn't represented. Also consider log difference.
- Product (query * history): Low when both query and history are low, much higher when both are high.
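The three interaction options can be sketched directly; the log-difference variant of the quotient uses a small epsilon to dodge the zero entries mentioned above. The vectors are toy BoW counts.

```python
# Sketch of query/history interaction vectors (difference, product,
# log-difference in place of a raw quotient). Values are illustrative.
import numpy as np

query   = np.array([3.0, 0.0, 2.0, 1.0])
history = np.array([3.0, 1.0, 0.0, 4.0])

diff = query - history   # 0 where frequencies match; sign says which was bigger
prod = query * history   # low when both are low, high when both are high

eps = 1e-6
log_diff = np.log(query + eps) - np.log(history + eps)  # ~log(query/history)

print(diff)  # [ 0. -1.  2. -3.]
```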
Logistic regression
Tried and true.
Hyperparameters:
- Regularization strength: 1/lambda = C = 1.0, 0.1, 0.001, etc.
- Regularization penalty type: L1, L2*
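This grid maps straight onto scikit-learn's `LogisticRegression` (assumed here, since the `1/lambda = C` convention matches it); the toy data is a stand-in.

```python
# Sketch of the logistic-regression grid over C (= 1/lambda) and penalty type.
# scikit-learn and the synthetic data are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

for C in (1.0, 0.1, 0.001):
    for penalty in ("l1", "l2"):
        # the liblinear solver supports both L1 and L2 penalties
        clf = LogisticRegression(C=C, penalty=penalty, solver="liblinear")
        clf.fit(X, y)
        print(C, penalty, round(clf.score(X, y), 3))
```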
Support vector machine
Because it has "machine" in the name.
Hyperparameters:
- Regularization strength (C) and kernel scale (e.g. gamma for RBF)
- Kernel type: linear (good for bow); radial basis function (good for w2v/other dense vectors?)
- Probably other things
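A sketch of the kernel/regularization sweep, pairing kernel type with the input representation as suggested above (linear for BoW-like features, RBF for dense vectors); scikit-learn and the toy data are assumptions.

```python
# Sketch of the SVM grid over kernel type and regularization strength.
# scikit-learn's SVC and the synthetic data are assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

for kernel in ("linear", "rbf"):
    for C in (0.1, 1.0, 10.0):  # regularization strength
        clf = SVC(kernel=kernel, C=C)
        clf.fit(X, y)
        print(kernel, C, round(clf.score(X, y), 3))
```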
Tree ensembles (random forests, gradient boosting)
Reputed to be very good classifiers.
Hyperparameters (not all apply to all methods):
- number of trees grown
- maximum tree depth
- number of features to consider
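The three knobs above map onto scikit-learn's random forest (assumed here as one representative ensemble; `max_features` applies to gradient boosting too). The data is a toy stand-in.

```python
# Sketch of the tree-ensemble hyperparameters on a random forest.
# scikit-learn and the synthetic data are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # non-linear target

clf = RandomForestClassifier(
    n_estimators=100,     # number of trees grown
    max_depth=8,          # maximum tree depth
    max_features="sqrt",  # number of features considered per split
    random_state=0,
)
clf.fit(X, y)
print(round(clf.score(X, y), 3))
```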
Recurrent neural network
- LSTM or GRU
- Hidden size
- Pooling/inference method: sum over hidden vectors, average hidden vectors, or take the last output/hidden state, then feed into regression.
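The pooling options reduce to simple operations over the sequence of hidden states; the states below are random stand-ins for real LSTM/GRU outputs.

```python
# Sketch of pooling over RNN hidden states before the final regression layer.
# The hidden states are random stand-ins for LSTM/GRU outputs.
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=(7, 16))  # (timesteps, hidden_size)

sum_pool = hidden.sum(axis=0)      # sum over hidden vectors
avg_pool = hidden.mean(axis=0)     # average hidden vectors
last     = hidden[-1]              # take last output/hidden state

print(sum_pool.shape, avg_pool.shape, last.shape)  # (16,) (16,) (16,)
```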
- Encoding mechanism: word2vec->GRU (original), one-hot word->CNN, one-hot word->bidir LSTM
- With/without Q module (possibly unnecessary parameter duplication)?
- With/without multi-task training (e.g., also predicting how many upvotes the post got).
- Can take any input vectorization technique, presumably. Uses word2vec (?).
That is a lot of paths. I bet we won't run the full Cartesian product of all possibilities, since most of it would be boring and pointless. At a squinty glance I count anywhere from 50 to 100 meaningful runs. Please let me know what you think.