mission_log.txt
I. Fisher Kernel for CIFAR/STL classification
make a global generative model (rather than patch based). this could
be convolutional S3C with probabilistic pooling, or it could be the
jellyfish
then use Fisher Kernel for classification, like in Rajat's ICML 07
paper
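
A minimal sketch of the Fisher-kernel step, assuming the generative model is already trained. The diagonal gaussian here is only a stand-in for the real model (convolutional S3C / jellyfish), and the function names and the crude L2 normalization (in place of whitening by the Fisher information) are my own placeholders:

    import numpy as np

    def fisher_vector(x, mu, var):
        # gradient of log N(x; mu, diag(var)) with respect to (mu, var),
        # concatenated into one feature vector for x
        d_mu = (x - mu) / var
        d_var = 0.5 * (((x - mu) ** 2) / var ** 2 - 1.0 / var)
        return np.concatenate([d_mu, d_var])

    # toy usage: fit the stand-in model on unlabeled data, then featurize
    X_unlab = np.random.randn(1000, 32)            # placeholder for CIFAR/STL data
    mu, var = X_unlab.mean(axis=0), X_unlab.var(axis=0) + 1e-6
    X_lab = np.random.randn(100, 32)
    F = np.array([fisher_vector(x, mu, var) for x in X_lab])
    F /= np.linalg.norm(F, axis=1, keepdims=True)  # crude normalization
    # F is then fed to any linear classifier (SVM, logistic regression, ...)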
II. analysis of various models
GRBM: prove that >2 collinear equiprobable modes are impossible
GRBM: prove that not a universal approximator on R^n
prove that linear factorial models are not universal approximators
(should include S3C)
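
A starting point for the two GRBM items above (a standard fact, written in one common energy convention with isotropic variance sigma^2):

    p(v) \propto \sum_{h \in \{0,1\}^m}
        \exp\Big( -\tfrac{1}{2\sigma^2}\|v - b\|^2 + c^\top h + \tfrac{1}{\sigma^2} v^\top W h \Big)
      \;\propto\; \sum_{h \in \{0,1\}^m} \pi_h \, \mathcal{N}\big(v;\; b + W h,\; \sigma^2 I\big)

i.e. the GRBM marginal is a mixture of at most 2^m spherical gaussians with one shared covariance sigma^2 I and means restricted to the lattice b + Wh. Both claims (the limit on collinear equiprobable modes, and failure to be a universal approximator on R^n) would presumably be argued from properties of this mixture family.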
III. estimation algorithms
1. KL theory
on binary data
checkpoint = my e-mail w/subject "Binary KL guides features, not pmfs!"
see how much any of this generalizes
another idea: with multilayer models, deep
reconstruction (like my first project with Andrew)
might be a particularly bad idea b/c it
under-constrains the pdf, ie, going to deep models
might be an easy way to find a class where there is so
much capacity that KL only chooses parameters, not
pmfs
on continuous data
still possible we might be able to derive KL from
score matching on continuous inputs
status of currently existing estimators:
smd: implemented and confident it works. see
exploring_estimation_criteria/cifar_grbm_smd.yaml
lnce: implemented and confident it works. not implemented as a
pylearn2 cost.
sm: implemented but slow (must use scan to compute each row
of the hessian and then index out the diagonals of the
hessian-- no good way of computing vector of second
derivatives). could use optimization based on
kevin swersky's paper to make it fast for models that can be
converted to autoencoders (a closed-form sketch for the GRBM special case appears after this list).
don't remember whether I was confident it works or not, could
definitely use more testing
nce: implemented and reasonably confident it works, could use
some more testing.
I was trying to do NCE of Coates/Lee preprocessed
CIFAR patches against a full covariance gaussian
distribution. This doesn't seem to work at all.
The experiments in the NCE paper preprocessed by
subtracting the mean and dividing by the variance,
and using a uniform distribution on the unit sphere
for the noise. I implemented this but haven't
experimented with it enough to know if it works. (a toy 1-D check of the NCE objective appears at the end of this section)
ratio matching: have not implemented.
probably want to consider a version based on
the Bregman divergence paper that samples
which bits to flip
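
Related to the sm item above: for the special case of a unit-variance GRBM, the score and the diagonal second derivatives have closed forms, so the scan over hessian rows isn't needed. A minimal numpy sketch of Hyvarinen's objective under my assumed energy convention, F(v) = 0.5 ||v - b||^2 - sum_j softplus(c_j + W_j^T v):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grbm_score_matching(V, W, b, c):
        # score matching objective for a unit-variance GRBM, averaged over
        # the rows of V.  V: (n_examples, n_vis), W: (n_vis, n_hid)
        H = sigmoid(V.dot(W) + c)              # E[h | v]
        score = -(V - b) + H.dot(W.T)          # psi(v) = -dF/dv
        # diagonal second derivatives: d psi_i / d v_i = -1 + sum_j W_ij^2 h_j (1 - h_j)
        diag = -1.0 + (H * (1.0 - H)).dot((W ** 2).T)
        return np.mean(np.sum(diag + 0.5 * score ** 2, axis=1))

    # quick shape check on random inputs
    print(grbm_score_matching(np.random.randn(10, 6), np.random.randn(6, 4),
                              np.zeros(6), np.zeros(4)))

Note the h(1-h) factor in the second-derivative term; this is the same term that section V below compares against the CAE penalty.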
adding consistent estimators:
try for example score matching + noise contrastive.
LNCE:
Is LNCE really consistent? I think not: the version of
consistency we were using when we wrote that proof was weaker
than Hyvarinen's definitions (need to check this). Also, I think the
Bregman divergence paper's derivation of ratio matching lets
us do a fixed version of LNCE for the binary case (by sampling B
matrices other than single bit flips). Comparing
these may be helpful for deciding whether LNCE is consistent.
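
Related to the nce item above: a toy 1-D check of the NCE objective (data vs. gaussian noise, unnormalized gaussian model), useful as a sanity test before trying the image experiments. The scipy optimizer and the clipping constant are just conveniences of this sketch:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.RandomState(0)
    x = rng.normal(2.0, 1.5, size=2000)                     # data samples
    y = rng.normal(0.0, 3.0, size=2000)                     # noise samples
    log_pn = lambda u: norm.logpdf(u, loc=0.0, scale=3.0)   # known noise log-density

    def nce_loss(theta):
        mu, log_sigma, c = theta    # c stands in for the unknown log partition function
        log_pm = lambda u: -0.5 * ((u - mu) / np.exp(log_sigma)) ** 2 + c
        g_x = np.clip(log_pm(x) - log_pn(x), -30, 30)       # log-ratio on data
        g_y = np.clip(log_pm(y) - log_pn(y), -30, 30)       # log-ratio on noise
        # logistic loss for classifying data (label 1) against noise (label 0)
        return np.mean(np.log1p(np.exp(-g_x))) + np.mean(np.log1p(np.exp(g_y)))

    mu, log_sigma, c = minimize(nce_loss, np.zeros(3), method="Nelder-Mead").x
    # mu should land near 2.0, exp(log_sigma) near 1.5, and c near
    # -log(1.5 * sqrt(2 * pi)), since NCE also recovers the normalizer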
IV. CAE
If the model is a gaussian RBM, then score matching is extremely similar to the
CAE. However, large singular vectors of CAE do not correspond to the
manifold. I sent Yoshua and Salah the formula for the non-varying
subspace of directions. Can this be used to improve visualization or
manifold-based classification?
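
A small sketch of the generic computation behind the visualization question above: the Jacobian of a sigmoid encoder at a point and its SVD, which splits input space into directions the code is sensitive to and an (approximately) non-varying subspace. The formula I sent Yoshua and Salah is not reproduced here; the random W and b below are placeholders for a trained CAE's parameters:

    import numpy as np

    def encoder_jacobian(v, W, b):
        # Jacobian dh/dv of h = sigmoid(W v + b), with W of shape (n_hid, n_vis)
        h = 1.0 / (1.0 + np.exp(-(W.dot(v) + b)))
        return (h * (1.0 - h))[:, None] * W

    J = encoder_jacobian(np.random.randn(64), np.random.randn(100, 64), np.zeros(100))
    U, s, Vt = np.linalg.svd(J, full_matrices=True)
    k = int(np.sum(s > 1e-3))    # with a trained CAE, many singular values are near zero
    sensitive_dirs = Vt[:k]      # directions the representation varies along
    non_varying_dirs = Vt[k:]    # approximate locally non-varying subspace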
V. General theory of mutual information between v and h being useful for
classification
It seems like many algorithms that work well for classification are
loosely based on generative pretraining but are modified in some way
to be biased toward having more mutual information between the visible
units and the hidden units.
-CAE: the CAE is essentially an RBM with two modifications:
1. The contractive penalty is REDUCED. In score
matching, the contractive penalty scales like
h(1-h) while in the CAE it scales like (h(1-h))^2,
which is strictly smaller (see the comparison written out at the end of this section). Also, in the CAE this term
has a cross-validated coefficient, which Yann observes
to usually come out at around .1, in other words, much
smaller than the coefficient of 1 imposed by score
matching.
2. The use of binary cross entropy as a reconstruction
penalty rather than mean squared error
-autoencoders: autoencoders with binary cross entropy loss are
often found to be very effective for classification,
even though binary cross entropy is not a criterion
based on a binary pmf. Rather it is a criterion that
encourages mean field passes in both directions to
preserve information.
-sparse coding beats rbms for a lot of classification tasks,
even though sparse coding is a very poor generative model. It
is however a model that seems designed to have a lot of mutual
information between v and h, especially relative to rbms,
especially when applied to real valued data.
-(not mentioned to Yoshua yet) For GRBMs, the maximum
information can be increased without bound by adding more
hidden units. I'm not sure whether the likelihood would improve or not,
though improving likelihood and MI at the same time is still
beneficial.
see e-mail to Yoshua, "idea regarding feature learning", for two ideas
about how to test this
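
Writing out the comparison from the CAE item above (my conventions: sigmoid hiddens h_j = sigma(W_j^T v + c_j); a unit-variance GRBM on the score matching side, with Omega_SM denoting the data-dependent part of the second-derivative term):

    \Omega_{\mathrm{CAE}}(v) \;=\; \Big\|\tfrac{\partial h}{\partial v}\Big\|_F^2
        \;=\; \sum_j h_j^2 (1 - h_j)^2 \, \|W_j\|^2,
    \qquad
    \Omega_{\mathrm{SM}}(v) \;=\; \sum_{i,j} W_{ij}^2 \, h_j (1 - h_j)
        \;=\; \sum_j h_j (1 - h_j) \, \|W_j\|^2 .

Since 0 < h_j(1 - h_j) <= 1/4, we have h_j^2(1 - h_j)^2 <= (1/4) h_j(1 - h_j): the CAE contraction term is pointwise at least four times smaller, and the cross-validated coefficient (around .1, versus the fixed 1 in score matching) shrinks it further.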
bilinear rbm
figure out conditions needed for learning to do anything useful
Aaron's approach with noise variables in the middle
use slowness
try a PSD version
SFA
run Wiskott lab's code, see if their results are reproducible
did the Wiskott lab ever re-run their own code after fixing the bug I
found?
revisit ICML 11 code, try to get MLPs working better, maybe using
mu-ssRBM or score matching training
Contractive Sparse Coding
Differentiable sparse coding is written up but slow. Haven't added the
contractive terms.
May want to put contraction penalty on LCC as well as SC
Turn PSD into an EBM and sample from it
PMIL
Rejected from UAI.
Redo with flashier experiments, such as running it on video.
Quoc's paper is out. Check whether it is still
state-of-the-art; either way, we could probably easily extend it to
use PMIL and get some increase in performance. "Learning
hierarchical invariant spatio-temporal features for action
recognition with independent subspace analysis"
New competition: look at Hugo's UAI 2011 paper on discriminative RBMs.
Part of it is about MIL with dRBMs.
Making hidden units do more work
Aaron has an idea for showing that Canonical Ridge Analysis
gives rise to Partial Least Squares at one end of a spectrum
and Canonical Correlation Analysis at another. The idea is that
CRA does more generative modeling work with the covariance matrices
on the visible units, and PLS does more modeling work with the
hidden units. (Clarification: it is already known that CCA and PLS
are two ends of the same spectrum, but only CCA has a probabilistic
model interpretation. So we should try to find a probabilistic model
for PLS and show that there is a spectrum of models as well as a
spectrum of algorithms)
Aaron told me about this b/c I said I was interested in regularizers
that make RBMs do more of the modeling work with the hidden units
(i.e., make a GRBM or ssRBM try not to rely too much on the visible
unit variance parameter)
Aaron is also interested in taking Francis Bach's probabilistic model
for CCA and making a spike-and-slab version
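
One standard (Vinod-style canonical ridge) way to write the spectrum mentioned above, under my parameterization:

    \max_{a, b} \;
      \frac{a^\top \Sigma_{xy}\, b}
           {\sqrt{a^\top \big((1-\gamma)\Sigma_{xx} + \gamma I\big) a}\;
            \sqrt{b^\top \big((1-\gamma)\Sigma_{yy} + \gamma I\big) b}},
      \qquad \gamma \in [0, 1],

with gamma = 0 recovering CCA (maximize correlation) and gamma = 1 recovering PLS (maximize covariance). The probabilistic-model question is then which latent-variable model interpolates between Bach's probabilistic CCA at gamma = 0 and a PLS-like model at gamma = 1.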
Activation function idea
Based on the success of OMP-10, would it be possible to make an
activation function that generalizes softmax to a distribution
where k units are always on instead of just 1?
It's possible this would make a better prior on S3C than the current
factorial bernoulli distribution does
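
One concrete way to pose the generalization above: a "k-hot softmax" that puts probability proportional to prod_{i in S} exp(a_i) on every size-k subset S of units. Its unit marginals (the natural mean-field quantities) can be computed exactly with elementary symmetric polynomials; a small sketch, with k = 1 reducing to the ordinary softmax:

    import numpy as np

    def esp(w, k):
        # elementary symmetric polynomials e_0, ..., e_k of the weights w
        e = np.zeros(k + 1)
        e[0] = 1.0
        for wi in w:
            e[1:k + 1] = e[1:k + 1] + wi * e[0:k]
        return e

    def k_hot_marginals(logits, k):
        # p(h_i = 1) = w_i * e_{k-1}(w without i) / e_k(w)
        w = np.exp(logits - logits.max())      # rescaling cancels in the ratio
        e_all = esp(w, k)
        p = np.empty(len(w))
        for i in range(len(w)):
            p[i] = w[i] * esp(np.delete(w, i), k - 1)[k - 1] / e_all[k]
        return p

    print(k_hot_marginals(np.random.randn(8), 3))   # the marginals sum to k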
Learning Transformations for Transformation Equivariant Models
see "Transformation Equivariant Boltzmann Machines" ICANN 2011
this is somewhat similar to James' "One Gabor To Rule Them All" paper
both are equivariant to translation and rotation
I propose learning a model that is equivariant to learned
transformations
the idea is that each weight vector should be generated by applying
a sequence of transformations to one of a set of underlying templates (see the sketch at the end of this section).
two possible sources of statistical strength:
tie transform weights across templates (ie, same transform
must be useful for each template)
different degrees of transform are formed by composing one
incremental transform. ie, if r(x) is rotate 10 degrees, then
r(r(x)) is rotate 20 degrees.
further ideas:
each form of tying gives info about pooling structures that
might work well. for example, pooling over all transforms of
one template, or say we have r (rotate) and t (translate).
we could make a grid of repeated applications of r and t,
then pool down to one softmax unit giving best number of
compositions of r and another giving best number of
compositions of t
could have more than one copy of all transforms for each
template, put them in a directed model so they compete, and
then each copy will hopefully explain a different instance
of the feature described by the template
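
A minimal numpy sketch of the parameterization described in this section, under my notation: one shared incremental transform T, applied k times to each template to generate the full filter bank. In the learned version both the templates and T would be parameters (and T could be constrained, e.g. to be orthogonal):

    import numpy as np

    def transform_bank(templates, T, n_steps):
        # columns are T^k applied to each template, for k = 0 .. n_steps-1
        # templates: (n_vis, n_templates), T: (n_vis, n_vis)
        filters, current = [], templates.copy()
        for k in range(n_steps):
            filters.append(current)      # T^k applied to every template
            current = T.dot(current)     # compose one more incremental transform
        return np.concatenate(filters, axis=1)

    n_vis, n_templates, n_steps = 16, 3, 5
    T = np.eye(n_vis) + 0.01 * np.random.randn(n_vis, n_vis)   # near-identity transform
    W = transform_bank(np.random.randn(n_vis, n_templates), T, n_steps)
    # pooling structures suggested above: reshape W to (n_vis, n_steps, n_templates)
    # and pool over axis 1 (all degrees of one template) or over axis 2 (all
    # templates at one degree)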
Work on PASCAL VOC to get street cred with the hacky vision crowd?
Stuff Aaron is interested in:
S3C on S3C
linear CG with R op for inference of s variables in S3C
S3C with binary or multinomial units
Stuff I don't mind farming out:
temporal S3C
convolutional S3C
some of the stuff Aaron is interested in
Forward looking stuff:
General strategies for learning to learn:
should be able to learn that if a weight vector w_i is useful,
then f(w_i) is too (i.e., discover that some layer of the
representation should be equivariant to in-plane transformations)
if two feature vectors are related by some transformation,
then share statistical power between them. ie, the bias or the
mean/precision term on a hidden unit is a function of both the
identity of the feature and of its transformation coordinates
should be able to learn that h_i and h_j are synonymous for
the purpose of classification (even if not for generation).
could probably do this with a pooling layer-- remember
Yoshua's idea for learning the pooling from 2011?
should be able to learn that if h_i and h_j are synonyms
and h_k = f(h_i) and h_l = f(h_j) then h_k and h_l are
synonyms
More advanced tasks to consider:
highlight an object, ie per pixel labeling. this shows how
much the model really understands
also, should be able to do this without the full
pixel labeling. ie, learn to label objects per-pixel
after just being given centroid labels of the objects.
to do this the model will need to figure out which things
it can explain with existing things it knows about
and which are part of the new object
one shot learning
recognize formations of objects (ie, a square of 7s and a
square of cars are both squares, and can be discriminated
from a triangle of 6s, etc.)
object counting (especially for overlapping objects, where
you need to figure out which eye belongs to which person,
etc)
dealing with extreme occlusion, ie Guillaume's bubbling idea
Dead ends:
Reconstruction SRBM:
The Reconstruction SRBM turns out to just be a directed
model, generally equivalent to sparse coding (though sparse
coding doesn't estimate the model with true maximum
likelihood).
Contractive coding:
All the ways I tried to pose the problem ended up being
differential equations that only had numerical solutions,
even for extremely simple versions with only one hidden unit,
etc.
Deriving binary cross-entropy between data and autoencoder reconstruction
as being some kind of consistent estimator:
This is a dead end specifically for the case of binary input
data. Autoencoder reconstruction is a function only of the
model's score, and the pmf conveys no information about the
score. Thus estimators based on the consistency of the
recovered pmf are not able to influence the autoencoder
reconstruction.
This is not yet proven to be a dead end for the case of
continuous inputs.
Completed missions:
1. S3C