Add unit tests #9

Open: snowyu wants to merge 1 commit into master
Conversation

@snowyu commented Feb 6, 2024

Defaults to bge-large-zh-v1.5 for testing; see TEST_MODEL_NAME in tests/CMakeLists.txt to change it.

The test prompts are in test_prompts.txt.

Run the unit tests with: make -C build test

It will:

  1. Generate HF's token ids for comparison in hf_tokenized_ids.txt (requires Python with AutoTokenizer installed); a sketch of this step is shown below
  2. Test the tokens generated by bert_cpp against hf_tokenized_ids.txt
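
For context, a minimal sketch of what the HF reference step can look like (the actual script in this PR may differ; model and file names follow the description above):

```python
# Sketch of the HF side of the comparison: tokenize each prompt with
# transformers' AutoTokenizer and write the ids for the C++ side to check.
from transformers import AutoTokenizer

MODEL_NAME = "BAAI/bge-large-zh-v1.5"  # matches the default TEST_MODEL_NAME

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

with open("test_prompts.txt", encoding="utf-8") as f:
    prompts = [line.rstrip("\n") for line in f if line.strip()]

with open("hf_tokenized_ids.txt", "w", encoding="utf-8") as out:
    for prompt in prompts:
        ids = tokenizer.encode(prompt)  # includes special tokens ([CLS]/[SEP])
        out.write(" ".join(map(str, ids)) + "\n")
```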

@iamlemec (Owner) commented Feb 6, 2024

@snowyu Sorry, I was just putting in the new testing infrastructure! Check it out in tests. I think it'll make it a lot easier to fill out the test suite. Feel free to add any additional strings or tests you think are needed.

This is all done through the Python API. I'm not too worried about it being different from the underlying C++; it's a really thin wrapper, for tokenization especially. And if something breaks, it'll show up as a difference from Hugging Face anyway.

Incidentally, these tests all pass for bge-base-en-v1.5 and bge-base-zh-v1.5, but they fail for bge-m3, so I guess something is off with pre-tokenization or vocab loading there.

@snowyu (Author) commented Feb 6, 2024

I thought that converting text to tokens is only necessary when embedding (bert_forward). If you are using Python, JS, or Rust, you can use the official tokenizer directly; it is not needed in C++. This situation will continue until llama.cpp fully implements HF's dynamic tokenizer.

The current problem is that llama.cpp's GGUF format lacks the configuration the tokenizer requires. See: ggerganov/llama.cpp#5143

Regarding bge-m3's tokenizer: it does special processing in the normalizer, and there is further processing in the post_processor afterward; see its tokenizer.json.

@iamlemec (Owner) commented Feb 6, 2024

We still need to support tokenization on the cpp side of things. People aren't always going to want to use the Python interface, for instance if they're doing inference on edge devices. In principle, we could add a use_hf option to the Python tokenizer interface, since we already require the library for GGUF conversion; a sketch of what that could look like is below.
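
A hypothetical sketch of such a use_hf fallback (class and method names here are illustrative, not the project's actual API):

```python
# Hypothetical shape of the suggested use_hf option: fall back to the HF
# tokenizer (already a dependency for GGUF conversion) instead of the C++
# tokenizer exposed through the bindings.
from transformers import AutoTokenizer

class TokenizerShim:
    def __init__(self, cpp_model, hf_repo=None, use_hf=False):
        self.cpp_model = cpp_model  # object wrapping the C++ tokenizer
        self.use_hf = use_hf
        if use_hf:
            self.hf_tokenizer = AutoTokenizer.from_pretrained(hf_repo)

    def tokenize(self, text: str) -> list[int]:
        if self.use_hf:
            return self.hf_tokenizer.encode(text)
        return self.cpp_model.tokenize(text)  # hypothetical binding call
```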

I'm not sure putting the full tokenizer config in the GGUF file is the best option here. In the end, the difficult part is implementing the tokenizer logic in C++. As we get more coverage, we can include tokenizer params in the GGUF that reflect it and try to do as much pre-processing as possible in Python land at conversion time. Realistically, there aren't that many tokenizer types out there. I think we should err on the side of keeping things as simple as possible. That said, if there were a straightforward tokenizer.cpp library out there, I'd be happy to make it a dependency!
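
As a hedged illustration of that conversion-time approach, extra tokenizer parameters could be written as GGUF KV pairs with gguf-py; the key names beyond the standard tokenizer model key are made up for illustration, not an agreed-upon schema:

```python
# Sketch of doing tokenizer pre-processing at conversion time: write
# whatever parameters the C++ tokenizer will need as GGUF KV pairs.
from gguf import GGUFWriter

writer = GGUFWriter("model.gguf", arch="bert")
writer.add_tokenizer_model("bert")                      # standard tokenizer key
writer.add_bool("tokenizer.ggml.lowercase", True)       # illustrative extra key
writer.add_string("tokenizer.ggml.normalizer", "nfkc")  # illustrative extra key

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```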

As for bge-m3, yeah, it has a "precompiled_charsmap" in there. Looking at transformers.js, I think they actually hard-code a default behavior there and don't use the actual charsmap. But the Rust implementation in HF tokenizers doesn't seem too bad. It would be nice to know whether the approach used in transformers.js works in almost all cases, particularly for bge-m3.
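
For reference, a loose Python approximation of that hard-coded behavior, assuming it roughly amounts to NFKC plus control-character and whitespace cleanup (an assumption for illustration; the real charsmap can map some characters differently):

```python
# Loose approximation of the SentencePiece precompiled_charsmap fallback:
# NFKC normalization plus control-character and whitespace cleanup.
import re
import unicodedata

def approx_precompiled_normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[\x00-\x08\x0b\x0e-\x1f\x7f]", "", text)  # drop control chars
    text = re.sub(r"\s+", " ", text)  # fold whitespace runs into single spaces
    return text
```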

@snowyu (Author) commented Feb 7, 2024

> We still need to support tokenization on the cpp side of things.

Yes, indeed. But we need a workaround for now.

> I'm not sure putting the full tokenizer config in the GGUF file is the best option here.

What's the best option, then? We need a way to move forward now. If GGUF does not embed the tokenizer-related parameters, how do we proceed? Whether we implement the tokenizer in C++ or use a tokenizer in another language, those parameters are required. If the necessary parameters are absent and we still need to use GGUF, the only (ugly) way is to copy tokenizer.json and tokenizer_config.json into the directory where the GGUF file is located, as sketched below.
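
A small sketch of that sidecar workaround, assuming the HF tokenizers library is available (load_sidecar_tokenizer is a hypothetical helper):

```python
# Sketch of the sidecar workaround described above: load tokenizer.json
# from the same directory as the GGUF file instead of from GGUF metadata.
from pathlib import Path
from tokenizers import Tokenizer  # HF tokenizers library

def load_sidecar_tokenizer(gguf_path: str) -> Tokenizer:
    sidecar = Path(gguf_path).with_name("tokenizer.json")
    if not sidecar.exists():
        raise FileNotFoundError(f"expected {sidecar} next to the GGUF file")
    return Tokenizer.from_file(str(sidecar))
```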

Therefore, I think letting the tokenizer parameters live in GGUF is the first step, and implementing the tokenizer is the second step. In fact, some of the parameters are already in the GGUF file, and you are already using them.

Going back to bge-m3, you will find that parameters are still missing and it cannot be tokenized.
If you use sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, you will find that the pre_tokenizer array parameter is missing too.

I don't understand why pre-writing tokenizer parameters that are temporarily unused would be a problem.

Btw, transformers.js has the Precompiled normalizer, and bge-m3 works well on transformers.js. See https://github.com/xenova/transformers.js/blob/41f98b761f183b1cd27d6cf461e5843e0af8caf7/src/tokenizers.js#L2234
