DecodeStream raise error #1705

Open
irexyc opened this issue Dec 26, 2024 · 5 comments

irexyc commented Dec 26, 2024

Hi, I was trying to use the DecodeStream API, but an error occurred.

Below are the reproduction code and the error trace.

from tokenizers.decoders import DecodeStream
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('/mnt/140/llama3/Meta-Llama-3-8B-Instruct')
stream = DecodeStream(skip_special_tokens=False)

prompt = "To modify below `HTML` to `{{ '{KEY}' | transalte }}` obeys below Keys.\nExample:\n{{ 'Text.Action\\_Edit' | transalte }}\nKeys:\nText.Action\\_Edit\nText.Action\\_Log\nText.Action\\_View\nText.Action\\_Clone\nText.Action\\_Credit\nText.Action\\_CancelReward\nHTML:\n \n [View](#)\n[Edit](17.reward-template-setting.html)\n[Clone](#)\n[Credit](#)\n[Cancel Rewards](#)\n[Log](#)"

token_ids = tok(prompt).input_ids
# [1271, 5719, 3770, 1595, 5959, 63, 311, 1595, 3052, 11834, 4889, 11923, 765, 1380, 93420, 3954, 63, 98502, 1065, 3770, 25104, 627, 13617, 512, 3052, 364, 1199, 11614, 76838, 4126, 6, 765, 1380, 93420, 8256, 9026, 512, 1199, 11614, 76838, 4126, 198, 1199, 11614, 76838, 2250, 198, 1199, 11614, 76838, 860, 198, 1199, 11614, 76838, 38777, 198, 1199, 11614, 76838, 34593, 198, 1199, 11614, 76838, 9453, 60722, 198, 5959, 512, 720, 510, 860, 9725, 2, 340, 58, 4126, 9725, 1114, 83480, 34509, 61556, 2628, 340, 58, 38777, 9725, 2, 340, 58, 34593, 9725, 2, 340, 58, 9453, 50868, 9725, 2, 340, 58, 2250, 9725, 2, 8]

for token_id in token_ids:
    stream.step(tok._tokenizer, token_id)
'To'
' modify'
' below'
' `'
'HTML'
'`'
' to'
' `'
'{{'
thread '<unnamed>' panicked at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/alloc/src/vec/mod.rs:2207:36:
slice index starts at 18446744073709551615 but ends at 1
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::slice::index::slice_index_order_fail
   3: tokenizers::decoders::PyDecodeStream::__pymethod_step__
   4: pyo3::impl_::trampoline::trampoline
   5: tokenizers::decoders::<impl pyo3::impl_::pyclass::PyMethods<tokenizers::decoders::PyDecodeStream> for pyo3::impl_::pyclass::PyClassImplCollector<tokenizers::decoders::PyDecodeStream>>::py_methods::ITEMS::trampoline
   6: method_vectorcall_VARARGS_KEYWORDS
             at /usr/local/src/conda/python-3.10.16/Objects/descrobject.c:344:14
   7: _PyObject_VectorcallTstate
             at /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:114:11
   8: PyObject_Vectorcall
             at /usr/local/src/conda/python-3.10.16/Include/cpython/abstract.h:123:12
   9: call_function
             at /usr/local/src/conda/python-3.10.16/Python/ceval.c:5893:13
  10: _PyEval_EvalFrameDefault
             at /usr/local/src/conda/python-3.10.16/Python/ceval.c:4198:23
  11: _PyEval_EvalFrame
             at /usr/local/src/conda/python-3.10.16/Include/internal/pycore_ceval.h:46:12
  12: _PyEval_Vector
             at /usr/local/src/conda/python-3.10.16/Python/ceval.c:5067:24
  13: PyEval_EvalCode
             at /usr/local/src/conda/python-3.10.16/Python/ceval.c:1134:12
  14: run_eval_code_obj
             at /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1291:9
  15: run_mod
             at /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:1312:19
  16: PyRun_InteractiveOneObjectEx
             at /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:277
  17: _PyRun_InteractiveLoopObject
             at /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:148
  18: _PyRun_AnyFileObject
             at /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:84:15
  19: PyRun_AnyFileExFlags
             at /usr/local/src/conda/python-3.10.16/Python/pythonrun.c:116
  20: pymain_run_stdin
             at /usr/local/src/conda/python-3.10.16/Modules/main.c:506:15
  21: pymain_run_python
             at /usr/local/src/conda/python-3.10.16/Modules/main.c:598:21
  22: Py_RunMain
             at /usr/local/src/conda/python-3.10.16/Modules/main.c:674:5
  23: Py_BytesMain
             at /usr/local/src/conda/python-3.10.16/Modules/main.c:1094:12
  24: <unknown>
  25: __libc_start_main
  26: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
pyo3_runtime.PanicException: slice index starts at 18446744073709551615 but ends at 1
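
The start index in the panic message, 18446744073709551615, equals 2^64 - 1, the largest value of an unsigned 64-bit integer. That is the value 0 - 1 wraps to in unsigned 64-bit arithmetic, which would be consistent with an index underflow inside `step`; this is an inference from the number alone, not from the library source. A quick check of the arithmetic in Python (the name `USIZE_BITS` is only illustrative):

```python
# Assumption: a 64-bit platform, where Rust's usize is 64 bits wide.
USIZE_BITS = 64

# Subtracting 1 from 0 and wrapping to the unsigned 64-bit range.
wrapped = (0 - 1) % (1 << USIZE_BITS)

print(wrapped)
print(wrapped == 18446744073709551615)
```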

I tested #1699 and the error seems to be fixed.

I also have another question. The token_ids above are input tokens. If I want to decode tokens during the generation phase, how should I do it?
Should I use a for loop and a stream instance to iterate over input_ids first, or is there a better way to achieve this?

Collaborator

Narsil commented Jan 10, 2025

The bug is fixed on main for now.

Author

irexyc commented Jan 10, 2025

@Narsil

Thanks. I would also like to ask a question about usage. Is there a demo showing how DecodeStream can start from the generated tokens, as in the code below?

tokens = tokenizer("Hello my name is Robert and I work on vLLM.").input_ids
# 'Hello my name is Robert' is the input prompt
# ' and I work on vLLM.' is the generated tokens

# how ?
stream.step(tokenizer, tokens[5]) == " and"

Or should I run the step operations myself for the input prompt, as below? I am not sure whether this approach is efficient on the Python side.

# input
stream.step(tokenizer, tokens[0])
stream.step(tokenizer, tokens[1])
stream.step(tokenizer, tokens[2])
stream.step(tokenizer, tokens[3])
stream.step(tokenizer, tokens[4])

# generate
out = stream.step(tokenizer, tokens[5])  # expected to return ' and'
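
For reference, the manual priming approach above can be wrapped in a small helper. This is only a sketch of that approach, not a recommended or official pattern: the function name `stream_generated_text` is hypothetical, and it assumes `step` returns `None` when a token does not yet produce any complete text.

```python
def stream_generated_text(stream, tokenizer, prompt_ids, generated_ids):
    """Prime `stream` with the prompt, then yield text for generated tokens.

    `stream` is expected to expose `step(tokenizer, token_id)`, as
    DecodeStream does in the snippets above. Because this is a generator,
    the priming loop runs when iteration starts, not when it is called.
    """
    # Feed the prompt tokens so the stream holds the preceding context;
    # their decoded text is discarded.
    for token_id in prompt_ids:
        stream.step(tokenizer, token_id)

    # Decode the generated tokens one by one, skipping steps that
    # return None (assumed to mean "no complete text yet").
    for token_id in generated_ids:
        chunk = stream.step(tokenizer, token_id)
        if chunk is not None:
            yield chunk


# Usage sketch (needs the tokenizers package and a loaded tokenizer):
# from tokenizers.decoders import DecodeStream
# stream = DecodeStream(skip_special_tokens=False)
# for chunk in stream_generated_text(stream, tokenizer, tokens[:5], tokens[5:]):
#     print(chunk, end="")
```

This does not avoid the per-token calls for the prompt; it only keeps the priming loop in one place.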

@Ankita-Phd

I am receiving the following error:
RuntimeError: Failed to import transformers.models.whisper.tokenization_whisper because of the following error (look up to see its traceback):
module 'decoders' has no attribute 'DecodeStream'

Please help me out.

@ArthurZucker
Collaborator

Did you install from source? Which version of Python are you using?
I think I actually have the same issue with Python <= 3.9.
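
One way to gather that information is to print the Python version and check whether the installed `tokenizers` build exposes `DecodeStream`. This is a minimal sketch; the helper name `module_has_attr` is hypothetical.

```python
import importlib
import sys


def module_has_attr(module_name, attr):
    """Return True if `module_name` can be imported and defines `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package is missing or fails to import in this environment.
        return False
    return hasattr(module, attr)


# Report the interpreter version and whether DecodeStream is available.
print(sys.version)
print(module_has_attr("tokenizers.decoders", "DecodeStream"))
```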


Ankita-Phd commented Jan 28, 2025 via email
