Skip to content

Add voyage3 model #62

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 7, 2025
Merged

Add voyage3 model #62

merged 9 commits into from
May 7, 2025

Conversation

aneubeck
Copy link
Collaborator

@aneubeck aneubeck commented May 5, 2025

voyage3 requires unicode normalization for which I introduced basic support.
I also fixed the broken equivalence tests in the benchmarks folder.
It is important to note that huggingface uses some byte to char conversion, so that it can run the BPE algorithm on unicode strings. In the test, the pretokenizer were not properly disabled.

We could consider adding features for the different models.

@Copilot Copilot AI review requested due to automatic review settings May 5, 2025 12:08
@aneubeck aneubeck requested a review from a team as a code owner May 5, 2025 12:08
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds basic Unicode normalization support and introduces the new voyage3 model, along with test and benchmark updates to accommodate optional Tiktoken tokenizers and fixed equivalence tests.

  • Introduce Normalizable/NormalizedString and wire NFC normalization into the Tokenizer API
  • Add voyage3_base BPE model (build script, library, benchmarks) and make Tiktoken optional
  • Fix equivalence tests by disabling Hugging Face pretokenizer and adding a byte-char mapping

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
crates/bpe/benchmarks/performance.rs Guard Tiktoken benchmarks behind Option check
crates/bpe/benchmarks/lib.rs Make Tiktoken optional and add voyage3 entry
crates/bpe/benchmarks/equivalence.rs Implement char_to_byte, fix equivalence tests
crates/bpe/benchmarks/Cargo.toml Enable rand and tiktoken features for bpe
crates/bpe-openai/src/normalizer.rs New NormalizedString type and Normalizable trait
crates/bpe-openai/src/lib.rs Wire in NFC, add voyage3_base, update Tokenizer
crates/bpe-openai/build.rs Serialize voyage3_base in build script
crates/bpe-openai/Cargo.toml Add unicode-normalization dependency
Comments suppressed due to low confidence (3)

crates/bpe-openai/build.rs:25

  • The build script directive is incorrect. It should be cargo:rerun-if-changed=build.rs (single colon) to inform Cargo properly.
println!("cargo::rerun-if-changed=build.rs");

crates/bpe/benchmarks/performance.rs:166

  • [nitpick] Shadowing the outer tiktoken variable may reduce clarity. Consider renaming the inner binding (e.g. if let Some(tok) ) to avoid confusion.
if let Some(tiktoken) = tiktoken {

crates/bpe-openai/src/lib.rs:266

  • The new voyage3_base model isn’t covered by this test. Consider adding assertions for voyage3_base().count_till_limit(...) to ensure it behaves correctly.
fn test_count_till_limit() {

@CleanCut
Copy link
Contributor

CleanCut commented May 6, 2025

I pushed a commit which fixed a cargo fmt difference in Rust 1.86.

Copy link
Contributor

@jorendorff jorendorff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


use unicode_normalization::UnicodeNormalization;

/// Type which represents a normalized string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Type which represents a normalized string.
/// Type which represents an NFC normalized string.

@jorendorff
Copy link
Contributor

@aneubeck This is a breaking change, so it requires a version bump, right?

@aneubeck aneubeck enabled auto-merge May 7, 2025 09:47
@aneubeck aneubeck merged commit bcb4204 into main May 7, 2025
7 checks passed
@aneubeck aneubeck deleted the aneubeck/voyage branch May 7, 2025 10:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants