@WA225, thanks for reporting this issue.
The SentencePiece-converted tokenizer does not split long texts into smaller segments before applying BPE merges, so encoding a very long text can take a long time even though it does eventually finish. As a workaround, you can process the text sentence by sentence (or record by record) rather than as a single string; see the sketch below. We will add text splitting to the tokenization step to address this issue soon.
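As a concrete illustration, here is a minimal sketch of that workaround applied to the dataset from the report below. The model path is a placeholder, and it assumes tokenizer.encode returns a flat sequence of token ids that can be concatenated:

import onnxruntime_genai as og
from datasets import load_dataset

model = og.Model("path/to/model")  # placeholder: folder of an exported GenAI model
tokenizer = og.Tokenizer(model)

testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

# Encode record by record instead of joining everything into one huge string,
# so the un-split BPE path never sees a very long input.
token_ids = []
for segment in testdata["text"]:
    if segment.strip():  # skip empty records in the dataset
        token_ids.extend(tokenizer.encode(segment))  # assumed: returns a flat sequence of ids

Note that this drops the "\n\n" separators and prevents BPE merges across record boundaries, so the resulting ids are not byte-for-byte identical to encoding the joined string.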
I have made a PR to address this slowness on long texts: microsoft/onnxruntime-extensions#799. After it is integrated into genai, please give it another try.
Describe the bug
Using the OGA tokenizer to encode the wikitext-2-raw-v1 dataset hangs and does not return, but it works fine for wikitext-2-v1.
To Reproduce
Steps to reproduce the behavior:
import onnxruntime_genai as og
from datasets import load_dataset

model = og.Model("path/to/model")  # placeholder: folder containing the GenAI model
tokenizer = og.Tokenizer(model)

testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
tokenizer.encode("\n\n".join(testdata["text"]))  # hangs for wikitext-2-raw-v1
Expected behavior
A list of token ids for the encoded wikitext-2 test set is returned.