LlamaCpp model crashes with multi-token characters #934

Open
knilink opened this issue Jun 30, 2024 · 5 comments

knilink commented Jun 30, 2024

The bug
A string containing certain Unicode characters causes an exception, likely because '歪' is a multi-token character for this tokenizer:

llama3.engine.tokenizer('歪'.encode('utf8')) -> [15722, 103]
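
For context, '歪'.encode('utf8') is the three bytes b'\xe6\xad\xaa', so with two tokens the boundary must fall inside the character, and the bytes on either side of it are not valid UTF-8 on their own. A quick illustration (the exact byte split between the two tokens is my assumption):

'歪'.encode('utf8')         # b'\xe6\xad\xaa' -- 3 bytes, 2 tokens
b'\xe6\xad'.decode('utf8')  # UnicodeDecodeError: unexpected end of data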

I also tested a Transformers model, which seems to work fine.

To Reproduce

from guidance import models, select
llama3 = models.LlamaCpp('./Meta-Llama-3-8B-Instruct.Q4_0.gguf')
llama3 + '歪' + select(['打正着','门邪道'])
terminate called after throwing an instance of 'std::invalid_argument'
  what():  invalid character
Aborted

System info:
Ubuntu 22.04
Python 3.10.12

guidance==0.1.15
llama_cpp_python==0.2.79

@Harsha-Nori (Collaborator)

Hi @knilink , thanks for reporting this! Do you know if this happens if you try to generate with llama-cpp-python directly? Getting the full stack trace here would be very helpful!

@paulbkoch might have thoughts here too

knilink (Author) commented Jul 7, 2024

Hi @Harsha-Nori, I did a bit more investigation and can confirm the error is caused by sending incomplete UTF-8 byte sequences to the llama.cpp tokenizer:

$ printf '\xe6\xad' | ./llama-tokenize -m ./Meta-Llama-3-8B-Instruct.Q8_0.gguf --stdin
terminate called after throwing an instance of 'std::invalid_argument'
  what():  invalid character
Aborted
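
The same crash should be reproducible through llama-cpp-python directly, since guidance ends up calling Llama.tokenize with these raw bytes. An untested sketch (the model path and the vocab_only flag are my assumptions):

from llama_cpp import Llama

llm = Llama('./Meta-Llama-3-8B-Instruct.Q8_0.gguf', vocab_only=True)
# First two of the three UTF-8 bytes of '歪' -- an incomplete character:
llm.tokenize(b'\xe6\xad', add_bos=False, special=True)  # aborts: std::invalid_argument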

After adding byte_string.decode('utf8') before

return self._model_obj.tokenize(byte_string, add_bos=False, special=True)

I got the following stack trace:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[21], line 4
      2 # guidance.models._model.ipython_is_imported = False
      3 llama3 = LlamaCpp('/home/jovyan/cache/Meta-Llama-3-8B-Instruct.Q8_0.gguf', file_name='',chat_template=chat.Llama3ChatTemplate, n_gpu_layers=-1,)
----> 4 llama3 + '歪' + select(['打正着','门邪道']) + gen(stop='。')

File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:1159, in Model.__add__(self, value)
   1157 # run stateless functions (grammar nodes)
   1158 elif isinstance(value, GrammarFunction):
-> 1159     out = lm._run_stateless(value)
   1161 # run stateful functions
   1162 else:
   1163     out = value(lm)

File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:1364, in Model._run_stateless(self, stateless_function, temperature, top_p, n)
   1362 delayed_bytes = b""
   1363 # last_is_generated = False
-> 1364 for chunk in gen_obj:
   1365 
   1366     # we make everything full probability if we are not computing uncertainty
   1367     # if not self.engine.compute_log_probs:
   1368     #     chunk.new_bytes_prob = 1.0
   1369 
   1370     # convert the bytes to a string (delaying if we don't yet have a valid unicode string)
   1371     lm.token_count += chunk.new_token_count
   1372     chunk.new_bytes = delayed_bytes + chunk.new_bytes

File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:732, in Engine.__call__(self, parser, grammar, ensure_bos_token)
    717 def __call__(self, parser, grammar, ensure_bos_token=True):
    718     """Returns a new updated parser state executed through the grammar.
    719 
    720     Parameters
   (...)
    729         This is the grammar we are extending the parser with.
    730     """
--> 732     self.start(parser, grammar, ensure_bos_token)
    734     logits = None
    735     while True:

File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:264, in Engine.start(self, parser, grammar, ensure_bos_token)
    262 # run a simple tokenizer (that does not use a grammar) on the prefix for better performance
    263 self._token_ids, self._token_byte_positions = self._tokenize_prefix(prompt)
--> 264 self._token_ids, self._token_byte_positions = self._cleanup_tokens(
    265     self._token_ids, self._token_byte_positions
    266 )
    267 if len(self._token_byte_positions) > 0:
    268     self._pre_parser_bytes = self._token_byte_positions[-1]

File /opt/conda/lib/python3.11/site-packages/guidance/models/_model.py:808, in Engine._cleanup_tokens(self, token_ids, token_byte_positions)
    805 def _cleanup_tokens(self, token_ids, token_byte_positions):
    806 
    807     # compute a joint tokenization
--> 808     joint_token_ids = self._joint_tokenize(token_ids)
    810     # see if we need to redo the tokenization
    811     redo = False

Cell In[20], line 151, in LlamaCppEngine._joint_tokenize(self, token_ids)
    149 """What a full joint tokenizer would give for a given byte string"""
    150 byte_string = b"".join([self.tokenizer.tokens[t] for t in token_ids])
--> 151 return self.tokenizer(byte_string)

Cell In[20], line 81, in LlamaCppTokenizer.__call__(self, byte_string)
     79 print('[LlamaCppTokenizer] begin', flush=True)
     80 print(byte_string, flush=True)
---> 81 print(byte_string.decode('utf8'), flush=True)
     82 res = self._model_obj.tokenize(byte_string, add_bos=False, special=True)
     83 print('[LlamaCppTokenizer] end', flush=True)

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 17-18: unexpected end of data

The Transformers model doesn't have this issue because its _joint_tokenize doesn't call the tokenizer directly. I didn't do much testing, but copying TransformersEngine._joint_tokenize over to LlamaCppEngine seems to fix the issue.
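
As a stopgap, one could also guard the joint tokenization so llama.cpp never sees a byte string that ends mid-character. This is my own sketch, not the TransformersEngine approach; the attribute names mirror the guidance internals visible in the traceback above and should be treated as assumptions:

def _joint_tokenize(self, token_ids):
    # Rebuild the full byte string for the current token ids.
    byte_string = b"".join(self.tokenizer.tokens[t] for t in token_ids)
    try:
        byte_string.decode('utf8')  # raises on an incomplete trailing character
    except UnicodeDecodeError:
        # Keep the original tokenization instead of crashing the process.
        return token_ids
    return self.tokenizer(byte_string)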

@riedgar-ms (Collaborator)

@knilink, thank you for bringing this up. I've drafted a (very) tentative fix in #962, which works by chopping bytes off the end of the string given to the encode() method until what remains is valid UTF-8. However, I'm really concerned that this is going to cause trouble for us elsewhere.

Have you filed your repro (printf '\xe6\xad' | ./llama-tokenize -m ./Meta-Llama-3-8B-Instruct.Q8_0.gguf --stdin) as a bug with llama.cpp?
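
For reference, the trimming idea amounts to something like this standalone sketch (my illustration, not the actual code in #962):

def split_valid_utf8(byte_string):
    # A UTF-8 character is at most 4 bytes, so at most the last 3 bytes can
    # belong to an incomplete trailing character. Chop bytes off the end
    # until the remainder decodes, returning (valid_prefix, leftover_bytes).
    for cut in range(len(byte_string), max(len(byte_string) - 3, 0) - 1, -1):
        try:
            byte_string[:cut].decode('utf8')
            return byte_string[:cut], byte_string[cut:]
        except UnicodeDecodeError:
            continue
    # Bytes invalid in the middle of the string fall through untrimmed.
    return b'', byte_string

split_valid_utf8('歪'.encode('utf8')[:2])  # (b'', b'\xe6\xad')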

@riedgar-ms (Collaborator)

I have been doing some more prodding based on @knilink's examples, and I've opened a bug on the HF repo from which I grabbed the model (although this does look like something going wrong at the llama.cpp layer):
https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/discussions/9

@riedgar-ms (Collaborator)

Also filed the bug upstream on llama.cpp: ggerganov/llama.cpp#8691
