Generation of own dataset with Dolma Tokenizer CLI #225
hey @WenJett! Your command looks correct, so it is strange that it is failing. Is the …
Hi @soldni, I have uploaded the data.json.gz (as above) that I have been testing the pipeline with; it contains only ~10 data points, which resulted in the "unable to mmap an empty file" error. Strangely enough, if I add a few more data points, to ~13 data points (the file below), I get a different error instead. Now I am getting another error after it tries to start the stage 2 training. These are the CLI commands I have been running:

Thanks for any advice you could provide on this matter!
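One quick way to diagnose an "unable to mmap an empty file" error before launching training is to check whether any of the generated .npy shards is zero bytes, since numpy cannot memory-map an empty file. A minimal sketch, assuming the shards sit in the current directory with the part-*.npy naming that appears later in this thread:

```python
import glob
import os

# Inspect every generated shard; np.memmap (and the trainer's mmap)
# fails on any zero-byte file with "cannot mmap an empty file".
for path in sorted(glob.glob("part-*.npy")):
    size = os.path.getsize(path)
    print(f"{path}: {size} bytes ({size // 4} uint32 tokens)")
    if size == 0:
        print(f"  -> {path} is empty and will fail to mmap")
```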
I tried with your file locally on my machine:

```
dolma tokens --documents ./data.json.gz --destination ./ --tokenizer.name_or_path allenai/dolma2-tokenizer --tokenizer.eos_token_id 100257 --tokenizer.pad_token_id 100277 --dtype uint32
```

Checked the output as follows:

```python
import os
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/dolma2-tokenizer")

path = 'part-0-00000.npy'
size = os.path.getsize(path)
# Each token is stored as uint32 (4 bytes), so the token count is size // 4.
data = np.memmap(path, dtype='uint32', mode='r', shape=(size // 4,))
print(tokenizer.decode(data[:50]))
# Hi all, what do you think of the new ...
```

Maybe worth updating the toolkit? I tested this using 1.0.14.post1.
For issues with OLMo code, please open an issue on its repo, referencing this issue. Thank you!
Hi @soldni, I checked my toolkit version with `pip show dolma` and it is the same as yours, 1.0.14.post1. I have also tried updating the toolkit to version 1.1.0, but that does not resolve the issue. I have also included my pip versions below, in case that is relevant: ai2-olmo 0.6.0 …
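As an aside, the installed toolkit version can also be checked from Python using only the standard library; a minimal sketch, assuming the distribution is installed under the PyPI name dolma:

```python
from importlib.metadata import version

# Prints the installed dolma version, e.g. "1.0.14.post1".
print(version("dolma"))
```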
If you run the Python decoding code I shared above, what's your output? Or do you get any error?
My output matches what is in the "text" field, which I pasted below. I do not get any error at all. (Edited to show one full "text" output from your decoding code.) Best Credit Card for Tax/IRAS Payment …
Hi,
Appreciate your work done so far.
With the new release of OLMo 2, the tokenizer used seems to be allenai_dolma2.json, but in prepare_memmap_dataset.py the tokenizer is allenai/eleuther-ai-gpt-neox-20b-pii-special.
I understand that the above Python script has been deprecated, so I have also tried the Dolma tokenizer CLI with the example below.

```
dolma tokens --documents ./data.json.gz --destination ./ --tokenizer.name_or_path allenai/dolma2-tokenizer --tokenizer.eos_token_id 100257 --tokenizer.pad_token_id 100277 --dtype uint32
```
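For reference, `dolma tokens` reads its input as gzipped JSONL; a minimal sketch of writing such a file, assuming the standard Dolma document format with at least an "id" and a "text" field per line (the example texts below are hypothetical placeholders taken from this thread):

```python
import gzip
import json

# One JSON object per line, each with at least "id" and "text" fields.
docs = [
    {"id": "doc-0", "text": "Best Credit Card for Tax/IRAS Payment ..."},
    {"id": "doc-1", "text": "Hi all, what do you think of the new ..."},
]

with gzip.open("data.json.gz", "wt", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```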
Although a .npy file is generated, when I use the generated .npy file with official-1124/OLMo2-7B-stage2-seed42.yaml by modifying the data paths at the bottom, I get an "unable to mmap an empty file" error.
Hence, I was wondering if …
I hope you can provide some guidance for this matter.
Thank you.