
Encoding wikitext-2-raw-v1 using OGA Tokenizer hangs #815

Open
WA225 opened this issue Aug 19, 2024 · 4 comments
WA225 commented Aug 19, 2024

Describe the bug
Using the OGA tokenizer to encode the wikitext-2-raw-v1 dataset hangs and does not return, but it works fine for wikitext-2-v1.

To Reproduce
Steps to reproduce the behavior:
import onnxruntime_genai as og
from datasets import load_dataset

# Load the model first (path is a placeholder for a local OGA model folder)
model = og.Model("path/to/model")
tokenizer = og.Tokenizer(model)
testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
tokenizer.encode("\n\n".join(testdata["text"]))

Expected behavior
Returns the encoded wikitext-2 test set as a list of token IDs.

Desktop (please complete the following information):

  • OS: Windows 11
  • onnxruntime-genai-0.4.0
wenbingl (Member) commented


Considering this is a very large text operation, how long did you wait for the result?

wenbingl (Member) commented

@WA225, thanks for reporting this issue.
The SentencePiece-converted tokenizer does not split long texts into smaller segments before applying BPE merges. This can lead to very long processing times on lengthy inputs, though a result is eventually produced. As a workaround, you can process the text sentence by sentence rather than as a single batch (see the sketch below). We will incorporate text splitting into the tokenization process soon to address this issue.
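A minimal sketch of that workaround, assuming the same tokenizer and testdata objects as in the repro above:

# Workaround sketch: encode each dataset record separately instead of
# joining the whole split into one huge string.
encoded = []
for text in testdata["text"]:
    if not text.strip():
        continue  # wikitext contains many empty filler lines
    encoded.append(tokenizer.encode(text))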

WA225 (Author) commented Sep 4, 2024

Thank you for your help, @wenbingl.

wenbingl (Member) commented Sep 4, 2024

I have made a PR to avoid this slowness on long texts: microsoft/onnxruntime-extensions#799. Once it is integrated into GenAI, you can try it again.
