
Encoding wikitext-2-raw-v1 using OGA Tokenizer hangs #815

Open
WA225 opened this issue Aug 19, 2024 · 4 comments
WA225 commented Aug 19, 2024

Describe the bug
Using the OGA tokenizer to encode the wikitext-2-raw-v1 dataset hangs and does not return, but it works fine for wikitext-2-v1.

To Reproduce
Steps to reproduce the behavior:
import onnxruntime_genai as og
from datasets import load_dataset

# Load the model first (path is a placeholder for a local OGA model folder)
model = og.Model("path/to/model")
tokenizer = og.Tokenizer(model)
testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
tokenizer.encode("\n\n".join(testdata["text"]))

Expected behavior
Returns the encoded wikitext-2 test set as a list of token IDs.

Desktop (please complete the following information):

  • OS: Windows 11
  • onnxruntime-genai-0.4.0
wenbingl (Member) commented


Considering this is a very large text operation, how long did you wait for the result?

wenbingl (Member) commented

@WA225, thanks for reporting this issue.
The SentencePiece-converted tokenizer does not split long texts into smaller segments before applying BPE merges. This can lead to very long processing times on lengthy inputs, though a result is eventually produced. As a workaround, you can process the text sentence by sentence rather than as a single batch (see the sketch below). We will incorporate text splitting into the tokenization process soon to address this issue.
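A minimal sketch of that workaround, assuming the same tokenizer and testdata objects as in the repro above:

# Workaround sketch: encode each dataset record separately instead of
# joining the whole split into one huge string.
encoded = []
for text in testdata["text"]:
    if not text.strip():
        continue  # wikitext contains many empty filler lines
    encoded.append(tokenizer.encode(text))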

WA225 (Author) commented Sep 4, 2024

Thank you for your help, @wenbingl.

wenbingl (Member) commented Sep 4, 2024

I have made a PR to avoid this slowness on long texts: microsoft/onnxruntime-extensions#799. Once it is integrated into GenAI, you can try it again.
