[WIP][Bug] Issue 934 #962

Open
riedgar-ms wants to merge 12 commits into main
4 changes: 4 additions & 0 deletions .github/workflows/workflow-pr-gate.yml
@@ -67,6 +67,7 @@ jobs:
- "transformers_phi2_cpu"
# - "transformers_mistral_7b_cpu" See Issue 713
- "llamacpp_llama2_7b_cpu"
- "llamacpp_llama3_9b_cpu"
- "llamacpp_mistral_7b_cpu"
- "transformers_phi3_mini_4k_instruct_cpu"
- "llamacpp_phi3_mini_4k_instruct_cpu"
@@ -108,6 +109,7 @@ jobs:
- "transformers_phi2_cpu"
# - "transformers_mistral_7b_cpu" See Issue 713
- "llamacpp_llama2_7b_cpu"
- "llamacpp_llama3_9b_cpu"
- "llamacpp_mistral_7b_cpu"
- "transformers_phi3_mini_4k_instruct_cpu"
- "llamacpp_phi3_mini_4k_instruct_cpu"
@@ -153,6 +155,7 @@ jobs:
- "transformers_phi2_cpu"
# - "transformers_mistral_7b_cpu" See Issue 713
- "llamacpp_llama2_7b_cpu"
- "llamacpp_llama3_9b_cpu"
- "llamacpp_mistral_7b_cpu"
# - "transformers_phi3_mini_4k_instruct_cpu" Gives trouble on MacOS
- "llamacpp_phi3_mini_4k_instruct_cpu"
@@ -193,6 +196,7 @@ jobs:
- "transformers_phi2_cpu"
# - "transformers_mistral_7b_cpu" See Issue 713
- "llamacpp_llama2_7b_cpu"
- "llamacpp_llama3_9b_cpu"
- "llamacpp_mistral_7b_cpu"
- "transformers_phi3_mini_4k_instruct_cpu"
- "llamacpp_phi3_mini_4k_instruct_cpu"
21 changes: 20 additions & 1 deletion guidance/models/llama_cpp/_llama_cpp.py
@@ -89,9 +89,28 @@ def __init__(self, model_obj, chat_template=None):
        )

    def encode(self, byte_string: bytes) -> Sequence[int]:
        # Some models (e.g. Llama3) can produce tokens which are only partial
        # UTF-8 codepoints. The underlying tokenizer can't cope with these
        # and has a tendency to segfault.
        # To address this, shorten the bytes sent to the tokenizer to a valid
        # UTF-8 value.
        # I hope this will not bite us with some subtle other bug in future
riedgar-ms (Collaborator, Author) commented on Jul 24, 2024:

I'm not very optimistic about this. This 'fix' almost certainly breaks recode() when said fix kicks in. And fundamentally, these bytes came from the LLM, so why on earth is its tokeniser refusing to have anything to do with them?

A collaborator commented:

Thanks so much for investigating this, Richard. I agree, this feels like potentially buggy behavior directly in llama-cpp. Do we have a repro for 934 that directly uses the llama-cpp (or maybe llama-cpp-python, if we need to) bindings? Instead of attempting to fix through this route, a lower-level repro might mean we could/should raise an issue on llama-cpp?

riedgar-ms (Collaborator, Author) commented on Jul 25, 2024:

@knilink had found:

printf '\xe6\xad' | ./llama-tokenize -m ./Meta-Llama-3-8B-Instruct.Q8_0.gguf --stdin

I have filed a llama-cpp-python-based issue (rewritten from @knilink's investigation) directly on the HF repo whence I'm grabbing the GGUF.
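(For reference, a minimal llama-cpp-python sketch of that kind of repro might look like the lines below; the local GGUF path and the vocab_only shortcut are assumptions for illustration, not the contents of the filed issue.)

from llama_cpp import Llama

# Load only the vocabulary/tokenizer; no weights are needed for this repro
llm = Llama(model_path="./Meta-Llama-3-8B-Instruct.Q8_0.gguf", vocab_only=True)
# b"\xe6\xad" is the first two bytes of the three-byte UTF-8 encoding of '歪'
tokens = llm.tokenize(b"\xe6\xad", add_bos=False, special=True)
print(tokens)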

riedgar-ms (Collaborator, Author) commented:

And filed the LlamaCpp bug:
ggerganov/llama.cpp#8691

bytes_to_encode = b""
got_bytes_to_encode = False
for i in range(0, 3):
# Recall that a UTF-8 codepoint can be up to 3 bytes long
try:
bytes_to_encode = byte_string[0 : len(byte_string) - i]
_ = bytes_to_encode.decode()
got_bytes_to_encode = True
break
except UnicodeDecodeError:
pass
if not got_bytes_to_encode:
raise ValueError(f"Failed to shorten {byte_string!r} to valid unicode")
# Workaround for the LlamaCpp prepending spaces on encoding
raw_tokens = self._model_obj.tokenize(
self._sentinel_bytes + byte_string, add_bos=False, special=True
self._sentinel_bytes + bytes_to_encode, add_bos=False, special=True
)
assert raw_tokens[: len(self._sentinel_tokens)] == self._sentinel_tokens
return raw_tokens[len(self._sentinel_tokens) :]
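As a standalone illustration of the shortening logic above (a minimal sketch, independent of guidance and llama-cpp; the helper name is invented for illustration):

def shorten_to_valid_utf8(byte_string: bytes) -> bytes:
    # Drop up to 3 trailing bytes until the remainder decodes as UTF-8
    for i in range(0, 4):
        candidate = byte_string[0 : len(byte_string) - i]
        try:
            candidate.decode()
            return candidate
        except UnicodeDecodeError:
            pass
    raise ValueError(f"Failed to shorten {byte_string!r} to valid unicode")

# "歪打" encodes to b'\xe6\xad\xaa\xe6\x89\x93'; chopping the final byte leaves a
# dangling partial codepoint, which gets trimmed back to "歪" (b'\xe6\xad\xaa')
assert shorten_to_valid_utf8("歪打".encode()[:-1]) == "歪".encode()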
4 changes: 4 additions & 0 deletions tests/_llms_for_testing.py
@@ -63,6 +63,10 @@
name="transformers:meta-llama/Meta-Llama-3-8B-Instruct",
kwargs={"trust_remote_code": True, "torch_dtype": torch.bfloat16, "device_map": "cuda:0"},
),
"llamacpp_llama3_9b_cpu": dict(
name="huggingface_hubllama:bartowski/Meta-Llama-3-8B-Instruct-GGUF:Meta-Llama-3-8B-Instruct-IQ3_S.gguf",
kwargs={"verbose": True, "n_ctx": 4096},
),
}

_MISTRAL = {
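For context, loading the GGUF from the new llamacpp_llama3_9b_cpu entry above outside the test harness might look roughly like this (a sketch; the hf_hub_download step and constructor call are assumptions, not what the test fixtures do internally):

from huggingface_hub import hf_hub_download
import guidance

gguf_path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3-8B-Instruct-GGUF",
    filename="Meta-Llama-3-8B-Instruct-IQ3_S.gguf",
)
# verbose and n_ctx mirror the kwargs in the entry above
lm = guidance.models.LlamaCpp(gguf_path, verbose=True, n_ctx=4096)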
1 change: 1 addition & 0 deletions tests/conftest.py
@@ -66,6 +66,7 @@ def llamacpp_model(selected_model, selected_model_name):
    if selected_model_name in [
        "llamacpp_llama2_7b_cpu",
        "llamacpp_llama2_7b_gpu",
        "llamacpp_llama3_9b_cpu",
        "llamacpp_gemma2_9b_cpu",
        "llamacpp_phi3_mini_4k_instruct_cpu",
        "llamacpp_mistral_7b_cpu",
11 changes: 8 additions & 3 deletions tests/model_integration/test_model.py
@@ -21,6 +21,13 @@ def my_function(lm):
    assert str(lm) in ["this is a test another item1", "this is a test another item2"]


def test_with_multitokenchars(selected_model: guidance.models.Model):
    # Taken from https://github.com/guidance-ai/guidance/issues/934
    lm = selected_model
    lm += "歪" + select(["打正着", "门邪道"])
riedgar-ms (Collaborator, Author) commented on Jul 24, 2024:

For the record:

>>> a
'打'
>>> b
'门'
>>> a.encode()
b'\xe6\x89\x93'
>>> b.encode()
b'\xe9\x97\xa8'

So the two leading characters in the select() do not share any common bytes here.

    assert str(lm) == "歪打正着" or str(lm) == "歪门邪道"


def test_token_count(selected_model):
    lm = selected_model
    lm2 = lm + " 1 1 1 1 1" + gen(max_tokens=9) + gen(max_tokens=9)
@@ -36,9 +43,7 @@ def test_token_healing(selected_model):
    if model_type != "GPT2LMHeadModel":
        pytest.skip("Test for GPT2 bug only")
    gpt2 = selected_model
-    lm = gpt2 + (
-        "This is a story of 10 or 5 or " + zero_or_more(byte_range(b"0", b"9"))
-    )
+    lm = gpt2 + ("This is a story of 10 or 5 or " + zero_or_more(byte_range(b"0", b"9")))
    assert len(lm) > len("This is a story of 10 or 5 or ")

