Describe the bug
PretokenizeRunner throws a TypeError when streaming=True, because the runner passes num_proc to dataset.map() and IterableDataset.map() does not accept that argument.
Code example
Adding the following test to tests/training/test_pretokenize_runner will raise an error:
def test_pretokenize_runner_streaming_dataset():
    cfg = PretokenizeRunnerConfig(
        tokenizer_name="gpt2",
        context_size=10,
        num_proc=2,
        dataset_path="NeelNanda/c4-10k",
        split="train",
        streaming=True,
    )
    dataset = PretokenizeRunner(cfg).run()
>       tokenized_dataset = dataset.map(
            process_examples,
            batched=True,
            batch_size=cfg.pretokenize_batch_size,
            num_proc=cfg.num_proc,
            remove_columns=dataset.column_names,
        )
E       TypeError: IterableDataset.map() got an unexpected keyword argument 'num_proc'

sae_lens/pretokenize_runner.py:102: TypeError
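One possible direction for a fix, sketched under assumptions: build the keyword arguments for dataset.map() conditionally, so that num_proc is only passed when the dataset is not streaming. The helper name build_map_kwargs and its parameters are hypothetical and are not part of sae_lens; this is not necessarily how the maintainers resolved the issue. The sketch also skips remove_columns when no column names are available, on the assumption that a streaming dataset may not always report them.

```python
from typing import Any, Dict, Optional, Sequence


def build_map_kwargs(
    streaming: bool,
    batch_size: int,
    num_proc: Optional[int],
    column_names: Optional[Sequence[str]],
) -> Dict[str, Any]:
    """Hypothetical helper: keyword arguments to pass to dataset.map().

    num_proc is included only for non-streaming datasets, because
    IterableDataset.map() rejects it (see the traceback above).
    """
    kwargs: Dict[str, Any] = {"batched": True, "batch_size": batch_size}
    # Assumption: column names may be unavailable (None) when streaming.
    if column_names is not None:
        kwargs["remove_columns"] = list(column_names)
    if not streaming:
        kwargs["num_proc"] = num_proc
    return kwargs


streaming_kwargs = build_map_kwargs(True, 1000, 2, ["text"])
regular_kwargs = build_map_kwargs(False, 1000, 2, ["text"])
```

With a helper like this, the call in the runner could become dataset.map(process_examples, **kwargs), leaving the non-streaming path unchanged while the streaming path simply runs without multiprocessing.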
System Info
Describe the characteristic of your environment: