-
Notifications
You must be signed in to change notification settings - Fork 10
Add voyage3 model #62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds basic Unicode normalization support and introduces the new voyage3 model, along with test and benchmark updates to accommodate optional Tiktoken tokenizers and fixed equivalence tests.
- Introduce
Normalizable
/NormalizedString
and wire NFC normalization into theTokenizer
API - Add
voyage3_base
BPE model (build script, library, benchmarks) and make Tiktoken optional - Fix equivalence tests by disabling Hugging Face pretokenizer and adding a byte-char mapping
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
crates/bpe/benchmarks/performance.rs | Guard Tiktoken benchmarks behind Option check |
crates/bpe/benchmarks/lib.rs | Make Tiktoken optional and add voyage3 entry |
crates/bpe/benchmarks/equivalence.rs | Implement char_to_byte , fix equivalence tests |
crates/bpe/benchmarks/Cargo.toml | Enable rand and tiktoken features for bpe |
crates/bpe-openai/src/normalizer.rs | New NormalizedString type and Normalizable trait |
crates/bpe-openai/src/lib.rs | Wire in NFC, add voyage3_base , update Tokenizer |
crates/bpe-openai/build.rs | Serialize voyage3_base in build script |
crates/bpe-openai/Cargo.toml | Add unicode-normalization dependency |
Comments suppressed due to low confidence (3)
crates/bpe-openai/build.rs:25
- The build script directive is incorrect. It should be
cargo:rerun-if-changed=build.rs
(single colon) to inform Cargo properly.
println!("cargo::rerun-if-changed=build.rs");
crates/bpe/benchmarks/performance.rs:166
- [nitpick] Shadowing the outer
tiktoken
variable may reduce clarity. Consider renaming the inner binding (e.g.if let Some(tok)
) to avoid confusion.
if let Some(tiktoken) = tiktoken {
crates/bpe-openai/src/lib.rs:266
- The new
voyage3_base
model isn’t covered by this test. Consider adding assertions forvoyage3_base().count_till_limit(...)
to ensure it behaves correctly.
fn test_count_till_limit() {
I pushed a commit which fixed a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
use unicode_normalization::UnicodeNormalization; | ||
|
||
/// Type which represents a normalized string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Type which represents a normalized string. | |
/// Type which represents an NFC normalized string. |
@aneubeck This is a breaking change, so it requires a version bump, right? |
Co-authored-by: Jason Orendorff <[email protected]>
Co-authored-by: Jason Orendorff <[email protected]>
voyage3 requires unicode normalization for which I introduced basic support.
I also fixed the broken equivalence tests in the benchmarks folder.
It is important to note that huggingface uses some byte to char conversion, so that it can run the BPE algorithm on unicode strings. In the test, the pretokenizer were not properly disabled.
We could consider adding features for the different models.