Move from manual sharding to HF dataset builder. #391

Draft
tom-pollak wants to merge 2 commits into main from dev/arrow-generator-final

Conversation

@tom-pollak (Contributor) commented Nov 25, 2024

Description

Depends on #389.

Inspired by:
https://opensourcemechanistic.slack.com/archives/C07EHMK3XC7/p1732413633220709

Instead of manually writing the individual Arrow shards, we can create a
dataset builder that does this more efficiently. This speeds up saving
quite a lot: the old method spent some time calculating the fingerprint of
each shard, which was unnecessary and would have required a hack to work around.
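
For illustration, a minimal sketch of the generator-based approach using `datasets.Dataset.from_generator` (the names, shapes, and paths here are placeholders, not the PR's actual implementation):

```python
import numpy as np
from datasets import Dataset

D_IN = 512      # hypothetical activation width
N_ROWS = 1_000  # hypothetical number of cached activation rows


def activation_generator():
    # Stand-in for the real generator: the PR's builder would yield rows
    # produced by running the model and capturing activations at hook_name.
    for _ in range(N_ROWS):
        yield {"activations": np.random.randn(D_IN).astype(np.float32)}


# Dataset.from_generator writes the Arrow shards itself, so (per the PR
# description) we avoid the fingerprint computation that manual shard
# writing incurred.
ds = Dataset.from_generator(activation_generator)
ds.save_to_disk("activations/example")
```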

Along with this change, I also switched to a 1D activation scheme.

  • Previously each dataset row was a `(seq_len, d_in)` array.
  • Now each row is a flat `(d_in,)` activation vector.

The primary reason for this change is shuffling activations. I found that
when activations are stored per sequence, they are not properly shuffled.
This is a problem with ActivationCache too, but there's not a great
solution for it there.

You can observe this in the SAE's loss by using small buffer sizes with
either the cache or the ActivationStore.

[Screenshot: SAE loss with a small buffer size of 512 activations]
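
To make the shuffling difference concrete, here is a small sketch (shapes are made up) comparing a row-level shuffle under the two layouts:

```python
import numpy as np

# Hypothetical shapes, just to illustrate the two storage schemes.
n_seqs, seq_len, d_in = 4, 3, 2
acts = np.random.randn(n_seqs, seq_len, d_in).astype(np.float32)

# 2D scheme: each dataset row is a whole (seq_len, d_in) sequence, so a
# row-level shuffle only reorders sequences; tokens from the same sequence
# stay adjacent when the buffer is filled.
seq_rows = acts.copy()
np.random.shuffle(seq_rows)

# 1D scheme (this PR): each row is a single (d_in,) activation vector, so
# the same row-level shuffle mixes tokens across sequences.
flat_rows = acts.reshape(-1, d_in).copy()
np.random.shuffle(flat_rows)
```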

Fixes # (issue)

Type of change


  • Breaking change: activations are now stored as `(d_in,)` rather than `(seq_len, d_in)`

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

You have tested formatting, typing and unit tests (acceptance tests not currently in use)

  • I have run make check-ci to check format and linting. (you can run make format to format code if needed.)

Previously, `CacheActivationConfig` had an inconsistent config for some
interoperability with `LanguageModelSAERunnerConfig`. It was unclear which
parameters were necessary and which were redundant.

Simplified to the required arguments:

- `hf_dataset_path`: Tokenized or untokenized dataset
- `total_training_tokens`
- `model_name`
- `model_batch_size`
- `hook_name`
- `final_hook_layer`
- `d_in`

I think this scheme captures everything you need when attempting to
cache activations and makes it a lot easier to reason about.

Optional:

```
activation_save_path # defaults to "activations/{dataset}/{model}/{hook_name}"
shuffle=True
prepend_bos=True
streaming=True
seqpos_slice
buffer_size_gb=2 # Size of each buffer. Affects memory usage and saving freq
device="cuda" or "cpu"
dtype="float32"
autocast_lm=False
compile_llm=True
hf_repo_id # Push to hf
model_kwargs # `run_with_cache`
model_from_pretrained_kwargs
```
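
As a rough usage sketch (keyword names are taken from the lists above; the actual constructor signature and import path in the PR may differ, and the dataset, model, and hook values are illustrative):

```python
from sae_lens import CacheActivationConfig  # import path assumed

cfg = CacheActivationConfig(
    hf_dataset_path="NeelNanda/pile-10k",  # tokenized or untokenized dataset
    total_training_tokens=1_000_000,
    model_name="gpt2",
    model_batch_size=8,
    hook_name="blocks.6.hook_resid_post",
    final_hook_layer=6,
    d_in=768,
    # optional overrides
    shuffle=True,
    buffer_size_gb=2,
    device="cuda",
    dtype="float32",
)
```
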
@tom-pollak (Contributor, Author) commented Nov 25, 2024

Oops, the previous commit wasn't merged yet; this PR is only for the move from manual sharding to the HF dataset builder.

@tom-pollak force-pushed the dev/arrow-generator-final branch from fc9a460 to a1da04c on November 25, 2024 16:12
@tom-pollak (Contributor, Author) commented:

Old

```
No saving took:   28.3446
HF Dataset took:  31.2326
HF Dataset size:  636.83 MB
Safetensors took: 28.4925
Safetensors size: 635.50 MB
```

New

```
No saving took:   28.3992
HF Dataset took:  28.5294
HF Dataset size:  635.56 MB
Safetensors took: 28.6687
Safetensors size: 635.50 MB
```

Not a huge difference in the benchmarks; the new approach is slightly faster.

@tom-pollak force-pushed the dev/arrow-generator-final branch from a1da04c to 28ee687 on November 25, 2024 16:19