Move from manual sharding to HF dataset builder. #391
Draft
tom-pollak wants to merge 2 commits into jbloomAus:main from tom-pollak:dev/arrow-generator-final
Conversation
Before this change, `CacheActivationConfig` had an inconsistent configuration kept for some interoperability with `LanguageModelSAERunnerConfig`, and it was unclear which parameters were necessary vs. redundant. It is now simplified to the required arguments:

- `hf_dataset_path`: tokenized or untokenized dataset
- `total_training_tokens`
- `model_name`
- `model_batch_size`
- `hook_name`
- `final_hook_layer`
- `d_in`

I think this scheme captures everything you need when caching activations and makes it a lot easier to reason about.

Optional:

```
activation_save_path  # defaults to "activations/{dataset}/{model}/{hook_name}"
shuffle=True
prepend_bos=True
streaming=True
seqpos_slice
buffer_size_gb=2  # size of each buffer; affects memory usage and saving frequency
device="cuda" or "cpu"
dtype="float32"
autocast_lm=False
compile_llm=True
hf_repo_id  # push to HF
model_kwargs  # passed to `run_with_cache`
model_from_pretrained_kwargs
```
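A minimal sketch of how the simplified config might be constructed from the arguments listed above. The import path, constructor signature, and all values are assumptions for illustration, not the PR's confirmed API.

```python
# Hypothetical usage of the simplified config; field names follow the list
# above, but the class location and exact signature are assumed.
from sae_lens import CacheActivationConfig  # import path is an assumption

cfg = CacheActivationConfig(
    # Required
    hf_dataset_path="NeelNanda/pile-10k",  # tokenized or untokenized dataset (illustrative)
    total_training_tokens=1_000_000,
    model_name="gpt2",
    model_batch_size=8,
    hook_name="blocks.5.hook_resid_pre",
    final_hook_layer=5,
    d_in=768,
    # Optional (defaults from the comment above)
    shuffle=True,
    prepend_bos=True,
    streaming=True,
    buffer_size_gb=2,
    device="cuda",
    dtype="float32",
)
```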
Oops, the previous commit wasn't merged yet; this PR is only for the move from manual sharding to the HF dataset builder.
Force-pushed from fc9a460 to a1da04c.
Old vs. new benchmarks: not a huge amount of difference, slightly faster.
Force-pushed from a1da04c to 28ee687.
Description
Depends on #389.

Inspired by: https://opensourcemechanistic.slack.com/archives/C07EHMK3XC7/p1732413633220709

Instead of manually writing the individual Arrow shards, we can create a dataset builder that does this more efficiently. This speeds up saving quite a lot; the old method spent some time calculating the fingerprint of each shard, which was unnecessary and would have required a hack to get around.
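A rough sketch of the dataset-builder idea, assuming the Hugging Face `datasets` generator API (`Dataset.from_generator`) rather than manual Arrow shard writing; the PR may use a different builder class, and `get_activation_batches` is a hypothetical stand-in for the activation-harvesting loop.

```python
# Sketch: build the activations dataset from a generator instead of
# writing Arrow shards by hand. `get_activation_batches` is hypothetical
# and is assumed to yield float32 tensors of shape (n_tokens, d_in).
from datasets import Dataset, Features, Sequence, Value

def activation_examples():
    for batch in get_activation_batches():  # (n_tokens, d_in)
        for row in batch.tolist():          # one flat d_in row per token
            yield {"activations": row}

features = Features({"activations": Sequence(Value("float32"))})
ds = Dataset.from_generator(activation_examples, features=features)
ds.save_to_disk("activations/my_dataset/gpt2/blocks.5.hook_resid_pre")  # illustrative path
```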
Along with this change, I also switched to a 1D activation scheme:

- Previously the dataset was stored as a `(seq_len, d_in)` array.
- Now it is stored as flat `d_in` rows.

The primary reason for this change is shuffling activations. I found that when activations are kept grouped by sequence, they are not properly shuffled. This is a problem with `ActivationCache` too, but there's not a great solution for it there. You can observe this in the loss of the SAE when using small buffer sizes with either the cache or `ActivationStore`.
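To illustrate the shuffling rationale with a toy example (shapes and values are made up): when each row is a flat `d_in` vector, a whole-dataset shuffle mixes tokens across sequences, whereas `(seq_len, d_in)` rows only reorder sequences and keep tokens from the same sequence adjacent.

```python
import torch

# Toy activations: 4 sequences of 128 tokens, d_in = 768.
acts = torch.randn(4, 128, 768)      # (n_seq, seq_len, d_in)

# Old scheme: one (seq_len, d_in) row per sequence, so shuffling rows
# only reorders the 4 sequences.
seq_rows = acts

# New scheme: one d_in row per token, so a row-level shuffle mixes
# tokens from different sequences together.
flat_rows = acts.reshape(-1, 768)    # (4 * 128, 768)
perm = torch.randperm(flat_rows.shape[0])
shuffled = flat_rows[perm]
```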
Fixes # (issue)
Type of change
Please delete options that are not relevant.
- `(d_in)` activations vs `(seq_len, d_in)`
Checklist:
- [ ] You have tested formatting, typing and unit tests (acceptance tests not currently in use)
- [ ] You have run `make check-ci` to check format and linting. (You can run `make format` to format code if needed.)