Can tokenizers.fromPretrained cache directory be specified #29

Open
rmrbytes opened this issue Nov 16, 2024 · 3 comments

@rmrbytes

Thanks for this convenient library for using HF tokenizers. It seems that the downloaded JSON file does get cached, but I could not determine its location. Is there a way to specify the cache directory via TokenizerConfig?

Thanks.

@daulet
Owner

daulet commented Nov 18, 2024

> does get cached but could not determine its location

What do you mean by this? Can you paste your repro for the error?

@rmrbytes
Author

There was no error, @daulet. I only found that every time I ran it, it printed this message:

Successfully downloaded /var/folders/xb/50fkm1vj7mj_mb14nvc18h5r0000gn/T/huggingface-tokenizer-432552157/tokenizer.json

And if I put the tokenizer call (tokenizers.FromPretrained("google-bert/bert-base-uncased")) inside a loop, it printed the message once per iteration. Hence I wondered about the caching and whether we are supposed to specify a directory as an option.

Just to let you know, I am using this inside a Docker environment where I am building a chunking service in Go, and I wanted to use your function to check the token limit for any specified model (configured via .env).

Subsequently I created a singleton as follows; now it prints the message only once, as intended:

package splitters

import (
	"log"
	"sync"

	"github.com/daulet/tokenizers"
)

// Global tokenizer variable
var (
	tokenizerInstance *tokenizers.Tokenizer
	once              sync.Once
)

// initTokenizer initializes the tokenizer only once
func initTokenizer() {
	var err error
	tokenizerInstance, err = tokenizers.FromPretrained("google-bert/bert-base-uncased")
	if err != nil {
		log.Fatalf("Failed to load tokenizer: %v", err)
	}
}

// GetTokenizerInstance provides access to the tokenizer instance
func GetTokenizerInstance() *tokenizers.Tokenizer {
	// Ensure the tokenizer is loaded only once using sync.Once
	once.Do(initTokenizer)
	return tokenizerInstance
}

// getTokenLength calculates the number of tokens in the given text
func getTokenLength(input string) int {
	// Get the tokenizer instance
	tokenizer := GetTokenizerInstance()

	// Encode the input text; the first return value holds the token IDs,
	// the second (token strings) is not needed here
	encodings, _ := tokenizer.Encode(input, true)

	// Return the number of tokens
	return len(encodings)
}
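One variation on the singleton above, sketched here with the standard library only: keep the error from the one-time load and return it to the caller instead of calling log.Fatalf inside the initializer. The lazyLoader type and the string "handle" are stand-ins invented for this example; in the real service the load function would wrap the tokenizers.FromPretrained call.

```go
package main

import (
	"fmt"
	"sync"
)

// lazyLoader runs load at most once and remembers both the value and
// the error, so every caller sees the same result.
type lazyLoader struct {
	once  sync.Once
	load  func() (string, error)
	value string
	err   error
}

// Get triggers the load on first use and returns the cached result afterwards.
func (l *lazyLoader) Get() (string, error) {
	l.once.Do(func() {
		l.value, l.err = l.load()
	})
	return l.value, l.err
}

func main() {
	calls := 0
	loader := &lazyLoader{load: func() (string, error) {
		calls++ // counts how often the expensive load actually runs
		return "tokenizer-handle", nil
	}}

	for i := 0; i < 3; i++ {
		v, err := loader.Get()
		fmt.Println(v, err, "load calls:", calls)
	}
}
```

Returning the error lets an HTTP handler or worker decide how to react to a failed download, rather than terminating the whole process.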

Thanks for making this convenient library in Go

@rmrbytes
Author

I am able to get the number of tokens as desired when I run it with go run main.go, but when I try the same via the Dockerfile it gives me the following error:

Failed to load tokenizer: failed to download mandatory file tokenizer.json: failed to download from https://huggingface.co/google-bert/bert-base-uncased/resolve/main/tokenizer.json: Get "https://huggingface.co/google-bert/bert-base-uncased/resolve/main/tokenizer.json": tls: failed to verify certificate: x509: certificate signed by unknown authority

Is an HF token required? Though I wonder how it was able to download the file when run locally.

The following is my Dockerfile:

# Stage 1: Build Go Application
FROM golang:1.23.2-bullseye AS builder

WORKDIR /app

# Copy Go modules manifests and download dependencies
COPY go.mod go.sum ./
RUN go mod download

# Copy the source code and fetch script
COPY . .
RUN chmod +x fetch_tokenizer_library.sh && ./fetch_tokenizer_library.sh

# Set environment variables for CGO to link to the downloaded library
ENV CGO_ENABLED=1
ENV CGO_LDFLAGS="-L/app/libs/tokenizers -ltokenizers"  
ENV CGO_CXXFLAGS="--std=c++11"

# Build the Go application
RUN go build -o my-app .

# Stage 2: Create Runtime Image
FROM debian:bullseye-slim

# Copy the Go application binary from the builder stage
COPY --from=builder /app/my-app .

# Create data directory directly in the runtime stage
RUN mkdir -p /data

# Expose the port (if needed)
EXPOSE 8080

# Run the application
CMD ["./my-app"]
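A possible explanation for the x509 error, offered as a hypothesis rather than a confirmed diagnosis: the runtime stage above starts from debian:bullseye-slim, and slim Debian images do not include a CA bundle unless the ca-certificates package is installed, so Go has no root certificates to verify huggingface.co against. The sketch below is a small diagnostic that checks whether a bundle exists at a few commonly used paths; the path list is an assumption based on typical Linux layouts and is not exhaustive.

```go
package main

import (
	"fmt"
	"os"
)

// candidateBundles lists CA bundle locations used by common Linux
// distributions. This list is illustrative, not exhaustive.
var candidateBundles = []string{
	"/etc/ssl/certs/ca-certificates.crt", // Debian, Ubuntu
	"/etc/pki/tls/certs/ca-bundle.crt",   // Fedora, RHEL
	"/etc/ssl/cert.pem",                  // Alpine
}

// findCABundle returns the first candidate path that exists as a
// non-empty regular file, and false when none does.
func findCABundle(candidates []string) (string, bool) {
	for _, p := range candidates {
		info, err := os.Stat(p)
		if err == nil && info.Mode().IsRegular() && info.Size() > 0 {
			return p, true
		}
	}
	return "", false
}

func main() {
	if p, ok := findCABundle(candidateBundles); ok {
		fmt.Println("CA bundle found at", p)
		return
	}
	fmt.Println("no CA bundle found; HTTPS certificate verification will likely fail")
}
```

If the check reports no bundle inside the container, installing the ca-certificates package in the runtime stage is the usual remedy for Debian-based images.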

And the following is the fetch script (fetch_tokenizer_library.sh):

#!/bin/bash

set -e

# Define where the libraries should be placed within your project
LIB_DIR="./libs/tokenizers"
mkdir -p "$LIB_DIR"

# Detect platform
PLATFORM=$(uname -s | tr '[:upper:]' '[:lower:]')
ARCH=$(uname -m)

# Determine which library to download
if [ "$PLATFORM" = "linux" ] && [ "$ARCH" = "x86_64" ]; then
  LIB_URL="https://github.com/daulet/tokenizers/releases/latest/download/libtokenizers.linux-amd64.tar.gz"
elif [ "$PLATFORM" = "linux" ] && [ "$ARCH" = "aarch64" ]; then
  LIB_URL="https://github.com/daulet/tokenizers/releases/latest/download/libtokenizers.linux-arm64.tar.gz"
elif [ "$PLATFORM" = "darwin" ] && [ "$ARCH" = "arm64" ]; then
  LIB_URL="https://github.com/daulet/tokenizers/releases/latest/download/libtokenizers.darwin-arm64.tar.gz"
else
  echo "Unsupported platform: $PLATFORM-$ARCH"
  exit 1
fi

# Download and extract the pre-built library
echo "Downloading tokenizer library for $PLATFORM-$ARCH..."
curl -fL -o "$LIB_DIR/libtokenizers.tar.gz" "$LIB_URL"

# Extract the library into the target directory
echo "Extracting the library..."
tar -xzf "$LIB_DIR/libtokenizers.tar.gz" -C "$LIB_DIR"

# Clean up
rm "$LIB_DIR/libtokenizers.tar.gz"

echo "Library downloaded and extracted successfully to $LIB_DIR"

Thanks
