Can tokenizers.fromPretrained cache directory be specified #29
Comments
What do you mean by this? Can you paste your repro for the error?
There was no error, @daulet. I only found that every time I ran it, it gave the message:
And if I put the tokenizer command ( Just to let you know, I am using this inside a Docker environment, where I am building a chunking service in Go and wanted your function to check the token limit for any specified model (via .env). Subsequently I created a singleton as follows; now it gives the message only once, as designed, as below:
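The singleton itself is not shown above. A minimal sketch of that pattern in Go, assuming a `sync.Once` guard around the load call, might look like the following. The `Tokenizer` type, `loadTokenizer`, `GetTokenizer`, and the model ID are hypothetical stand-ins used only for illustration; in a real service `loadTokenizer` would wrap the library's own loading call.

```go
package main

import (
	"fmt"
	"sync"
)

// Tokenizer is a hypothetical stand-in for the library's tokenizer handle.
type Tokenizer struct {
	ModelID string
}

var (
	once      sync.Once
	instance  *Tokenizer
	loadErr   error
	loadCalls int // counts how often the loader actually ran
)

// loadTokenizer is a hypothetical placeholder for the real loading call.
// It performs no download here; it only records that it was invoked.
func loadTokenizer(modelID string) (*Tokenizer, error) {
	loadCalls++
	if modelID == "" {
		return nil, fmt.Errorf("model ID must not be empty")
	}
	return &Tokenizer{ModelID: modelID}, nil
}

// GetTokenizer loads the tokenizer on first use and returns the same
// instance (or the same error) on every later call.
func GetTokenizer(modelID string) (*Tokenizer, error) {
	once.Do(func() {
		instance, loadErr = loadTokenizer(modelID)
	})
	return instance, loadErr
}

func main() {
	a, _ := GetTokenizer("bert-base-uncased") // example model ID, an assumption
	b, _ := GetTokenizer("bert-base-uncased")
	fmt.Println(a == b, loadCalls)
}
```

One consequence of this design is that the model ID passed on later calls is ignored, since the loader runs only once per process. A service that needs several models would need one guard per model ID instead.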
Thanks for making this convenient library in Go.
I am able to get the number of tokens as desired when I run using
Is an HF token required? Though I wonder how it was able to download it when run locally. The following is my Dockerfile:
And the following is fetch.sh:

#!/bin/bash
set -e
# Define where the libraries should be placed within your project
LIB_DIR="./libs/tokenizers"
mkdir -p "$LIB_DIR"
# Detect platform
PLATFORM=$(uname -s | tr '[:upper:]' '[:lower:]')
ARCH=$(uname -m)
# Determine which library to download
if [ "$PLATFORM" = "linux" ] && [ "$ARCH" = "x86_64" ]; then
    LIB_URL="https://github.com/daulet/tokenizers/releases/latest/download/libtokenizers.linux-amd64.tar.gz"
elif [ "$PLATFORM" = "linux" ] && [ "$ARCH" = "aarch64" ]; then
    LIB_URL="https://github.com/daulet/tokenizers/releases/latest/download/libtokenizers.linux-arm64.tar.gz"
elif [ "$PLATFORM" = "darwin" ] && [ "$ARCH" = "arm64" ]; then
    LIB_URL="https://github.com/daulet/tokenizers/releases/latest/download/libtokenizers.darwin-arm64.tar.gz"
else
    echo "Unsupported platform: $PLATFORM-$ARCH"
    exit 1
fi
# Download and extract the pre-built library
echo "Downloading tokenizer library for $PLATFORM-$ARCH..."
curl -L -o "$LIB_DIR/libtokenizers.tar.gz" "$LIB_URL"
# Extract the library into the target directory
echo "Extracting the library..."
tar -xzf "$LIB_DIR/libtokenizers.tar.gz" -C "$LIB_DIR"
# Clean up
rm "$LIB_DIR/libtokenizers.tar.gz"
echo "Library downloaded and extracted successfully to $LIB_DIR"

Thanks
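For a Go service, the platform check in fetch.sh could also be expressed in Go. The sketch below mirrors the same three platform cases and release URLs from the script; the function name `libURL` and the constant `releaseBase` are hypothetical. Note that Go reports architectures as `amd64` and `arm64`, whereas `uname -m` reports `x86_64` and `aarch64` on Linux.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// releaseBase is the download prefix used by fetch.sh.
const releaseBase = "https://github.com/daulet/tokenizers/releases/latest/download/"

// libURL maps a Go OS/architecture pair to the pre-built library archive,
// covering the same three platforms that fetch.sh supports.
func libURL(goos, goarch string) (string, error) {
	switch goos + "/" + goarch {
	case "linux/amd64":
		return releaseBase + "libtokenizers.linux-amd64.tar.gz", nil
	case "linux/arm64":
		return releaseBase + "libtokenizers.linux-arm64.tar.gz", nil
	case "darwin/arm64":
		return releaseBase + "libtokenizers.darwin-arm64.tar.gz", nil
	default:
		return "", fmt.Errorf("unsupported platform: %s-%s", goos, goarch)
	}
}

func main() {
	url, err := libURL(runtime.GOOS, runtime.GOARCH)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(url)
}
```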
Thanks for this convenient library for using HF tokenizers. It seems that the downloaded JSON file does get cached, but I could not determine its location. Is there a way to specify the cache directory via TokenizerConfig?
Thanks.
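In a Docker setup like the one described in the comments, one way to make the cache location predictable is to resolve it from the environment and create it up front, so the same path can be mounted as a volume. The sketch below is only an illustration using the Go standard library: `resolveCacheDir` and the `TOKENIZER_CACHE_DIR` variable are hypothetical names, and passing the resulting path to the library depends on whatever cache-directory option it exposes, which this thread does not confirm.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolveCacheDir returns an absolute cache directory, creating it if needed.
// It prefers the value of envVar and uses fallback when the variable is unset.
func resolveCacheDir(envVar, fallback string) (string, error) {
	dir := os.Getenv(envVar)
	if dir == "" {
		dir = fallback
	}
	abs, err := filepath.Abs(dir)
	if err != nil {
		return "", fmt.Errorf("resolve cache dir %q: %w", dir, err)
	}
	if err := os.MkdirAll(abs, 0o755); err != nil {
		return "", fmt.Errorf("create cache dir %q: %w", abs, err)
	}
	return abs, nil
}

func main() {
	// TOKENIZER_CACHE_DIR is a hypothetical variable name (assumption).
	fallback := filepath.Join(os.TempDir(), "tokenizer-cache")
	dir, err := resolveCacheDir("TOKENIZER_CACHE_DIR", fallback)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// The resolved path would then be handed to the library's
	// cache-directory setting, if it provides one (assumption).
	fmt.Println("tokenizer cache directory:", dir)
}
```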