
Generation of own dataset with Dolma Tokenizer CLI #225

Open
WenJett opened this issue Jan 26, 2025 · 7 comments

Comments

@WenJett

WenJett commented Jan 26, 2025

Hi,

Appreciate your work done so far.

With the new release of OLMo 2, the tokenizer used appears to be allenai_dolma2.json, but in prepare_memmap_dataset.py the tokenizer is allenai/eleuther-ai-gpt-neox-20b-pii-special.

I understand that the above Python script has been deprecated, so I have also tried the Dolma tokenizer CLI with the example below.

dolma tokens --documents ./data.json.gz --destination ./ --tokenizer.name_or_path allenai/dolma2-tokenizer --tokenizer.eos_token_id 100257 --tokenizer.pad_token_id 100277 --dtype uint32

Although a .npy file is generated, when I use the generated .npy file with official-1124/OLMo2-7B-stage2-seed42.yaml (modifying the data paths at the bottom), I get an "unable to mmap an empty file" error.

Hence, I was wondering:

  1. Is the correct tokenizer allenai/dolma2-tokenizer or allenai/dolma2-tokenizer-sigdig?
  2. Should I include any other flags in the CLI?
  3. My data.json.gz does contain a 'text' field, which I assume is the bare minimum requirement?
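For reference, this is roughly how I build my data.json.gz (a minimal sketch; the `id` field and the exact document schema are my assumptions, `text` is the only field I am sure the tokenizer reads):

```python
import gzip
import json

# Each line is one JSON document. I assume a "text" field is required
# and include an "id" field as well; other metadata fields may apply.
docs = [
    {"id": "doc-0", "text": "Hi all, what do you think of the new credit card?"},
    {"id": "doc-1", "text": "Another short example document."},
]

# Write gzipped JSONL: one JSON object per line.
with gzip.open("data.json.gz", "wt", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```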

I hope you can provide some guidance for this matter.

Thank you.

@soldni
Member

soldni commented Jan 29, 2025

hey @WenJett !

Your command looks correct, so it is strange that it is failing. Is the data.json.gz something you could share?

@WenJett
Author

WenJett commented Jan 29, 2025

Hi @soldni,

data.json.gz

I have uploaded the data.json.gz (above) that I have been testing the pipeline with; it contains only ~10 data points and produces the "unable to mmap an empty file" error.

Strangely enough, if I add a few more data points (~13, in the file below), I get a different error instead.

new_data.json.gz

Now I am getting another error when it tries to start stage 2 training:
CRITICAL [olmo.util:168, rank=1] Uncaught ZeroDivisionError: division by zero

These are the CLI commands I have been running:

  1. dolma tokens --documents ./data.json.gz --destination ./ --tokenizer.name_or_path allenai/dolma2-tokenizer --tokenizer.eos_token_id 100257 --tokenizer.pad_token_id 100277 --dtype uint32

  2. torchrun --nproc_per_node=2 scripts/train.py configs/official-1124/OLMo2-7B-stage2-seed42.yaml (with the data path modified to my local path; with the original data paths the pipeline works)
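Before launching training, I also sanity-check that the generated .npy files are non-empty, since "unable to mmap an empty file" suggests numpy is being pointed at a zero-byte file (a minimal sketch; the helper name is my own):

```python
import os

def assert_nonempty(path: str) -> int:
    """Return the file size in bytes, raising if the tokenized output is empty."""
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError(
            f"{path} is empty -- the trainer will fail with "
            "'unable to mmap an empty file'"
        )
    return size
```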

Thanks for any advice you could provide on this matter!

@soldni
Member

soldni commented Jan 29, 2025

I tried with your file locally on my machine:

dolma tokens --documents ./data.json.gz --destination ./ --tokenizer.name_or_path allenai/dolma2-tokenizer --tokenizer.eos_token_id 100257 --tokenizer.pad_token_id 100277 --dtype uint32

Checked the output as follows:

import os
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/dolma2-tokenizer")
path = 'part-0-00000.npy'

# uint32 tokens are 4 bytes each, so the token count is file size // 4
size = os.path.getsize(path)
data = np.memmap(path, dtype='uint32', mode='r', shape=(size // 4,))
print(tokenizer.decode(data[:50]))

# Hi all, what do you think of the new

Maybe worth updating the toolkit? I tested this using 1.0.14.post1.

@soldni
Member

soldni commented Jan 29, 2025

For issues with OLMo code, please open an issue on its repo, referencing this issue. Thank you!

@WenJett
Author

WenJett commented Jan 29, 2025

Hi @soldni,

I checked my toolkit version; it is the same as yours, 1.0.14.post1. I have also tried updating the toolkit to version 1.1.0, but that does not resolve the issue.

pip show dolma
Name: dolma
Version: 1.0.14.post1
Summary: Data filters
Home-page: https://github.com/allenai/dolma
Author:
Author-email: Allen Institute for Artificial Intelligence [email protected], Luca Soldaini [email protected], Kyle Lo [email protected], Rodney Kinney [email protected], Aakanksha Naik [email protected], Abhilasha Ravichander [email protected], Akshita Bhagia [email protected], Dirk Groeneveld [email protected], Dustin Schwenk [email protected], Ian Magnusson [email protected], Khyathi Chandu [email protected]
License: Apache-2.0
Location: /usr/local/lib/python3.12/site-packages
Requires: anyascii, blingfire, boto3, cached-path, charset-normalizer, fasttext-wheel, fsspec, msgspec, necessary, nltk, numpy, omegaconf, platformdirs, pyyaml, requests, rich, s3fs, smart-open, tokenizers, tqdm, uniseg, zstandard
Required-by:

I have also included my pip package versions below in case they are relevant.

ai2-olmo 0.6.0
ai2-olmo-core 0.1.0
aiobotocore 2.19.0
aiohappyeyeballs 2.4.4
aiohttp 3.11.11
aioitertools 0.12.0
aiosignal 1.3.2
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyascii 0.3.2
attrs 24.3.0
beaker-gantry 1.12.0
beaker-py 1.32.3
black 23.12.1
blingfire 0.1.8
boltons 24.1.0
boto3 1.36.2
botocore 1.36.2
build 1.2.2.post1
cached_path 1.6.7
cachetools 5.5.0
certifi 2024.12.14
cffi 1.17.1
charset-normalizer 3.4.1
click 8.1.8
click-help-colors 0.9.4
cryptography 44.0.0
datasets 3.2.0
dill 0.3.8
docker 7.1.0
docker-pycreds 0.4.0
docutils 0.21.2
dolma 1.0.14.post1
einops 0.8.0
face 24.0.0
fasttext-wheel 0.9.2
filelock 3.16.1
flash-attn 2.7.3
frozenlist 1.5.0
fsspec 2024.12.0
ftfy 6.3.1
gitdb 4.0.12
GitPython 3.1.44
glom 24.11.0
google-api-core 2.24.0
google-auth 2.37.0
google-cloud-core 2.4.1
google-cloud-storage 2.19.0
google-crc32c 1.6.0
google-resumable-media 2.7.2
googleapis-common-protos 1.66.0
huggingface-hub 0.27.1
idna 3.10
importlib_resources 6.5.2
iniconfig 2.0.0
isort 5.12.0
jaraco.classes 3.4.0
jaraco.context 6.0.1
jaraco.functools 4.1.0
jeepney 0.8.0
Jinja2 3.1.5
jmespath 1.0.1
joblib 1.4.2
keyring 25.6.0
lightning-utilities 0.11.9
markdown-it-py 3.0.0
MarkupSafe 3.0.2
mdurl 0.1.2
more-itertools 10.6.0
mpmath 1.3.0
msgspec 0.19.0
multidict 6.1.0
multiprocess 0.70.16
mypy 1.3.0
mypy-extensions 1.0.0
necessary 0.4.3
networkx 3.4.2
nh3 0.2.20
nltk 3.9.1
numpy 1.26.4
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
omegaconf 2.3.0
packaging 24.2
pandas 2.2.3
pathspec 0.12.1
petname 2.6
pip 24.3.1
pkginfo 1.12.0
platformdirs 4.3.6
pluggy 1.5.0
propcache 0.2.1
proto-plus 1.25.0
protobuf 5.29.3
psutil 6.1.1
pyarrow 19.0.0
pyasn1 0.6.1
pyasn1_modules 0.4.1
pybind11 2.13.6
pycparser 2.22
pydantic 2.10.5
pydantic_core 2.27.2
Pygments 2.19.1
pyproject_hooks 1.2.0
pytest 8.3.4
pytest-sphinx 0.6.3
python-dateutil 2.9.0.post0
pytz 2024.2
PyYAML 6.0.2
readme_renderer 44.0
regex 2024.11.6
requests 2.32.3
requests-toolbelt 1.0.0
requirements-parser 0.11.0
rfc3986 2.0.0
rich 13.9.4
rsa 4.9
ruff 0.9.2
s3fs 2024.12.0
s3transfer 0.11.1
safetensors 0.5.2
scikit-learn 1.6.1
scipy 1.15.1
SecretStorage 3.3.3
sentry-sdk 2.20.0
setproctitle 1.3.4
setuptools 75.8.0
six 1.17.0
smart-open 7.1.0
smashed 0.21.5
smmap 5.0.2
sympy 1.13.1
threadpoolctl 3.5.0
tokenizers 0.21.0
torch 2.5.1
torchmetrics 1.6.1
tqdm 4.67.1
transformers 4.48.1
triton 3.1.0
trouting 0.3.3
twine 6.0.1
types-setuptools 75.8.0.20250110
typing_extensions 4.12.2
tzdata 2024.2
uniseg 0.10.0
urllib3 2.3.0
wandb 0.19.4
wcwidth 0.2.13
wheel 0.45.1
wrapt 1.17.2
xxhash 3.5.0
yarl 1.18.3
zstandard 0.23.0

@soldni
Member

soldni commented Jan 29, 2025

If you run the Python decoding code I shared above, what's your output? Or do you get an error?

@WenJett
Author

WenJett commented Jan 29, 2025

My output matches what is in the 'text' field, which I pasted below. I do not get any error at all.

Edited to show one full "text" output from your decoding code.

Best Credit Card for Tax/IRAS Payment
Some users on the HardwareZone forum were discussing the best credit cards for making tax or IRAS payments. One user mentioned that Citi can also be used to pay from the website, although another user was unsure about the exact process and thought it might be through AXS. To verify credit card payments, users can log into the IRAS website, click on "Account Summary" on the left panel, and check the details.
Another user noted that refunds are often a hassle for IRAS, but if the payment is made using an SCB card, there's a high chance of getting a refund. One user confirmed that the refund was credited directly to their bank account, which was a pleasant surprise as they got a "free" $200. Another user asked if the refund would appear in the account summary, and it was clarified that the transaction would indeed show up.
One user paid to IRAS on 18 March and was wondering if they would get rebates, which depend on the card cycle and statement date. Another user mentioned that although the refund was credited to their bank account, it took a bit of time to reflect on the IRAS website. Some users reported that their payments were reflected the next day, while others faced delays, with payments still not showing even after IRAS had called to confirm the payment.
One user noted that with effect from 21 April 2012, payments made to billing organizations via online banking using a credit card no longer earn points or cashback. This change has made using SCB for online payments less attractive. Some users expressed disappointment, stating that they had been taking advantage of the loophole until it was closed. One user even joked that when they figured out the loophole, they knew they would be spending a lot until the next year, but were glad they had used it while it lasted.
Another user pointed out that they didn't see a significant loophole, suggesting that others might have found ways to exploit it. The current loophole, if any, is not related to SCB online banking. Overall, users found the process of making IRAS payments and getting refunds to be somewhat inconsistent, but generally manageable.<|endoftext|>
