Add notes to unit norm method
alan-cooney committed Nov 5, 2023
1 parent 5fc6ad9 commit 1a6ed4e
Showing 2 changed files with 38 additions and 6 deletions.
1 change: 1 addition & 0 deletions .vscode/cspell.json
@@ -38,6 +38,7 @@
"maxiter",
"miniter",
"monosemantic",
"Monosemanticity",
"multipled",
"nanda",
"ncols",
43 changes: 37 additions & 6 deletions sparse_autoencoder/autoencoder/model.py
@@ -104,14 +104,45 @@ def reset_parameters(self) -> None:
def make_decoder_weights_and_grad_unit_norm(self) -> None:
"""Make the decoder weights and gradients unit norm.
> Recall that we constrain our dictionary vectors to have unit norm. Our first naive
implementation simply reset all vectors to unit norm after each gradient step. This means
any gradient updates modifying the length of our vector are removed, creating a discrepancy
between the gradient used by the Adam optimizer and the true gradient. We find that instead
removing any gradient information parallel to our dictionary vectors before applying the
gradient step results in a small but real reduction in total loss.
Unit norming the dictionary vectors (the columns of the encoding matrix and the rows of the
decoding matrix, given the shapes below) serves a few purposes:
1. It helps with numerical stability, by preventing the dictionary vectors from growing
too large.
2. It acts as a form of regularization, preventing overfitting by not allowing any one
feature to dominate the representation. It limits the capacity of the model by
forcing the dictionary vectors to live on the hypersphere of radius 1.
3. It encourages sparsity. Since the dictionary vectors have a fixed length, the model
must carefully select which features to activate in order to best reconstruct the
input.
Each input vector is a row of size `(1, n_input_features)`. The encoding matrix is then of
shape `(n_input_features, n_learned_features)`. The columns are the dictionary vectors, i.e.
each one projects the input vector onto a basis vector in the learned feature space.
The decoding matrix is of shape `(n_learned_features, n_input_features)`, with the output
vectors as rows of size `(1, n_input_features)`. The rows of the decoding matrix are the
dictionary vectors: each one maps a learned feature's activation back to a direction in the
input space.
Note that the *Towards Monosemanticity: Decomposing Language Models With Dictionary
Learning* paper found that removing the gradient information parallel to the dictionary
vectors before applying the gradient step, rather than resetting the dictionary vectors to
unit norm after each gradient step, [results in a small but real reduction in total
loss](https://transformer-circuits.pub/2023/monosemantic-features/index.html#appendix-autoencoder-optimization).
Approach:
The gradient with respect to the decoder weights has the same shape as the weights,
`(n_learned_features, n_input_features)` (and likewise for the encoder weights). By
subtracting the projection of the gradient onto each dictionary vector, we remove the
component of the gradient that is parallel to that vector and keep only the component
that is orthogonal to it (i.e. tangent to the hypersphere). The resulting gradient points
around the hypersphere, never towards or away from the center. Note this does
mean that we must assume
TODO: Implement this.
TODO: Consider creating a custom module to do this.
"""
raise NotImplementedError
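
The approach described in the docstring could be sketched roughly as follows. This is not the repository's implementation (the method above still raises `NotImplementedError`); it is a minimal illustration that assumes PyTorch, a decoder weight of shape `(n_learned_features, n_input_features)` with one dictionary vector per row, and a plain gradient step instead of Adam. The helper name `remove_parallel_grad_component` and the variable names are hypothetical.

```python
import torch
import torch.nn.functional as F


def remove_parallel_grad_component(weight: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Return `grad` with the component parallel to each row of `weight` removed.

    Assumption: each row of `weight` is one dictionary vector.
    """
    # Unit vectors along each dictionary vector.
    unit = F.normalize(weight, dim=-1)
    # Projection of each gradient row onto its dictionary vector.
    parallel = (grad * unit).sum(dim=-1, keepdim=True) * unit
    # What remains is orthogonal to the dictionary vector (tangent to the hypersphere).
    return grad - parallel


torch.manual_seed(0)
n_learned_features, n_input_features = 4, 3

# Hypothetical decoder weights, starting on the unit hypersphere.
decoder_weight = F.normalize(torch.randn(n_learned_features, n_input_features), dim=-1)
# Stand-in for the gradient that backpropagation would produce.
decoder_grad = torch.randn(n_learned_features, n_input_features)

projected_grad = remove_parallel_grad_component(decoder_weight, decoder_grad)

# Each projected gradient row should be orthogonal to its dictionary vector.
dot_products = (projected_grad * decoder_weight).sum(dim=-1)

# A finite step along the tangent still drifts slightly off the sphere,
# so the rows are renormalised after the (plain SGD) update.
learning_rate = 0.1
updated_weight = F.normalize(decoder_weight - learning_rate * projected_grad, dim=-1)

print(dot_products)
print(updated_weight.norm(dim=-1))
```

In a real training loop the projection would presumably be applied to the `.grad` of the decoder parameter after the backward pass and before the optimizer step, with the renormalisation applied to the weights afterwards.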
