Move from manual sharding to HF dataset builder. #391

Draft
tom-pollak wants to merge 2 commits into main from dev/arrow-generator-final

Conversation

@tom-pollak (Contributor) commented Nov 25, 2024

Description

Depends on #389.

Inspired by:
https://opensourcemechanistic.slack.com/archives/C07EHMK3XC7/p1732413633220709

Instead of manually writing the individual Arrow shards, we can create a
dataset builder that does this more efficiently. This speeds up saving
quite a lot: the old method spent some time calculating the fingerprint of
each shard, which was unnecessary and would have required a hack to work around.
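
For illustration, a minimal sketch of the generator-based approach using `datasets.Dataset.from_generator` (the names, shapes, and paths here are placeholders, not the PR's actual implementation):

```python
import numpy as np
from datasets import Dataset

D_IN = 512      # hypothetical activation width
N_ROWS = 1_000  # hypothetical number of cached activation rows


def activation_generator():
    # Stand-in for the real generator: the PR's builder would yield rows
    # produced by running the model and capturing activations at hook_name.
    for _ in range(N_ROWS):
        yield {"activations": np.random.randn(D_IN).astype(np.float32)}


# Dataset.from_generator writes the Arrow shards itself, so (per the PR
# description) we avoid the fingerprint computation that manual shard
# writing incurred.
ds = Dataset.from_generator(activation_generator)
ds.save_to_disk("activations/example")
```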

Along with this change, I also switched to a 1D activation scheme.

  • Previously each dataset row was a `(seq_len, d_in)` array.
  • Now each row is a flat `(d_in,)` activation vector.

The primary reason for this change is shuffling activations. I found that
when activations are stored per sequence, they are not properly shuffled.
This is a problem with ActivationCache too, but there's not a great
solution for it there.

You can observe this in the SAE's loss by using small buffer sizes with
either the cache or the ActivationStore.

[Screenshot: SAE loss with a small buffer size of 512 activations]
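
To make the shuffling difference concrete, here is a small sketch (shapes are made up) comparing a row-level shuffle under the two layouts:

```python
import numpy as np

# Hypothetical shapes, just to illustrate the two storage schemes.
n_seqs, seq_len, d_in = 4, 3, 2
acts = np.random.randn(n_seqs, seq_len, d_in).astype(np.float32)

# 2D scheme: each dataset row is a whole (seq_len, d_in) sequence, so a
# row-level shuffle only reorders sequences; tokens from the same sequence
# stay adjacent when the buffer is filled.
seq_rows = acts.copy()
np.random.shuffle(seq_rows)

# 1D scheme (this PR): each row is a single (d_in,) activation vector, so
# the same row-level shuffle mixes tokens across sequences.
flat_rows = acts.reshape(-1, d_in).copy()
np.random.shuffle(flat_rows)
```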

Fixes # (issue)

Type of change


  • Breaking change: activations are now stored as `(d_in,)` rather than `(seq_len, d_in)`

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

You have tested formatting, typing and unit tests (acceptance tests not currently in use)

  • I have run make check-ci to check format and linting. (you can run make format to format code if needed.)

Previously, `CacheActivationConfig` had an inconsistent config for some
interoperability with `LanguageModelSAERunnerConfig`. It was unclear which
parameters were necessary and which were redundant.

Simplified to the required arguments:

- `hf_dataset_path`: Tokenized or untokenized dataset
- `total_training_tokens`
- `model_name`
- `model_batch_size`
- `hook_name`
- `final_hook_layer`
- `d_in`

I think this scheme captures everything you need when attempting to
cache activations and makes it a lot easier to reason about.

Optional:

```
activation_save_path # defaults to "activations/{dataset}/{model}/{hook_name}"
shuffle=True
prepend_bos=True
streaming=True
seqpos_slice
buffer_size_gb=2 # Size of each buffer. Affects memory usage and saving freq
device="cuda" or "cpu"
dtype="float32"
autocast_lm=False
compile_llm=True
hf_repo_id # Push to hf
model_kwargs # `run_with_cache`
model_from_pretrained_kwargs
```
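
As a rough usage sketch (keyword names are taken from the lists above; the actual constructor signature and import path in the PR may differ, and the dataset, model, and hook values are illustrative):

```python
from sae_lens import CacheActivationConfig  # import path assumed

cfg = CacheActivationConfig(
    hf_dataset_path="NeelNanda/pile-10k",  # tokenized or untokenized dataset
    total_training_tokens=1_000_000,
    model_name="gpt2",
    model_batch_size=8,
    hook_name="blocks.6.hook_resid_post",
    final_hook_layer=6,
    d_in=768,
    # optional overrides
    shuffle=True,
    buffer_size_gb=2,
    device="cuda",
    dtype="float32",
)
```
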
@tom-pollak (Contributor, Author) commented Nov 25, 2024

Oops, the previous commit wasn't merged yet; this PR is only for the move from manual sharding to the HF dataset builder.

@tom-pollak force-pushed the dev/arrow-generator-final branch from fc9a460 to a1da04c on November 25, 2024 16:12
@tom-pollak (Contributor, Author) commented:

Old

```
No saving took:   28.3446
HF Dataset took:  31.2326
HF Dataset size:  636.83 MB
Safetensors took: 28.4925
Safetensors size: 635.50 MB
```

New

```
No saving took:   28.3992
HF Dataset took:  28.5294
HF Dataset size:  635.56 MB
Safetensors took: 28.6687
Safetensors size: 635.50 MB
```

Not a huge difference in the benchmarks; the new approach is slightly faster.

@tom-pollak force-pushed the dev/arrow-generator-final branch from a1da04c to 28ee687 on November 25, 2024 16:19