Please update tokenizers and transformers version #607

Open
ominfowave opened this issue Apr 23, 2022 · 16 comments

Comments

@ominfowave

Please add tokenizers version 0.11.1; it is a requirement of some recent Python packages such as indic-punct.

@mhsmith
Member

mhsmith commented Apr 23, 2022

We currently offer the following tokenizers versions:

  • tokenizers 0.10.3 (compatible with transformers==4.15.0)
  • tokenizers 0.7.0 (compatible with transformers==2.11.0)

Both of these are currently only available for Python 3.8. To change the Python version of your app, see here.

As you can see from its setup.py file, indic-punct pins all of its requirements to specific versions. With packages that do this, it's sometimes possible to get them working by specifying whatever is the closest version available in the Chaquopy repository:

                install "indic-punct"
                install "torch==1.8.1"
                install "torchvision==0.9.1"
                install "transformers==4.15.0"
                install "tokenizers==0.10.3"

In this case I've used the closest newer version of each requirement, but sometimes you might need to use the closest older one.
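
The "closest available version" rule can be sketched in a few lines of Python. This is only an illustration: the helper names `parse` and `closest_available` are made up for this example, and it handles plain numeric release strings only, not the full version syntax that pip understands.

```python
def parse(version):
    """Turn a dotted release string such as '0.10.3' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def closest_available(wanted, available, prefer="newer"):
    """Pick the available version nearest to `wanted`.

    prefer="newer" tries the smallest version >= wanted first and falls back
    to the largest older one; prefer="older" does the opposite.
    """
    target = parse(wanted)
    newer = sorted((v for v in available if parse(v) >= target), key=parse)
    older = sorted((v for v in available if parse(v) < target), key=parse)
    if prefer == "newer":
        first, second = newer[:1], older[-1:]
    else:
        first, second = older[-1:], newer[:1]
    candidates = first or second
    return candidates[0] if candidates else None


# tokenizers versions listed as available earlier in this thread.
TOKENIZERS = ["0.7.0", "0.10.3"]

# indic-punct asks for 0.11.1; no newer build exists, so the helper
# falls back to the closest older one.
print(closest_available("0.11.1", TOKENIZERS))
```

Whether the package actually works with the substituted version still has to be tested; the helper only picks a candidate.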

@mhsmith
Member

mhsmith commented Apr 25, 2022

Unfortunately, the current version of indic-punct (2.1.4) also has a native requirement which Chaquopy doesn't support at all (pynini). It's possible that one of the older versions of indic-punct doesn't have this requirement, but the release history is confusing (8 releases in one day, and no tags on GitHub), so that's something you'd have to look into by yourself.

See also #608.

@mhsmith mhsmith changed the title Please add tokenizers version 0.11.1 indic-punct: please add tokenizers version 0.11.1 Apr 25, 2022
@mhsmith mhsmith closed this as completed Apr 29, 2022
@mhsmith mhsmith changed the title indic-punct: please add tokenizers version 0.11.1 Please update tokenizers version Apr 29, 2022
@mhsmith
Member

mhsmith commented Apr 29, 2022

We're not planning to update this package in the near future, but if you'd like to try building the new version yourself, follow the instructions here. However, our package build tool doesn't currently have working support for Rust – see #1030 for details.

If anyone else needs a newer version of tokenizers, please click the thumbs up button above, and post a comment explaining why you need it.

@Benoit-W

Benoit-W commented Jul 20, 2023

Hello,
I am trying to use some recent models from transformers which require a more recent tokenizers version (transformers 4.23.1 or higher, which requires tokenizers!=0.11.3,<0.14,>=0.11.1), but as I saw in #608, it seems to be a bit complicated because of Rust.
I would like to know whether an update of the tokenizers library is planned soon.
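
To make the constraint concrete, here is a rough sketch of how the specifier quoted above filters versions. It is a simplification for plain numeric releases; real tools such as pip rely on the `packaging` library for this, and the function name `satisfies_transformers_pin` is invented for this example.

```python
def parse(version):
    """Turn a dotted release string such as '0.13.3' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def satisfies_transformers_pin(version):
    """Check `version` against tokenizers!=0.11.3,<0.14,>=0.11.1."""
    v = parse(version)
    return parse("0.11.1") <= v < parse("0.14") and v != parse("0.11.3")


# The first two are the versions listed as available in this thread;
# the rest are versions mentioned by other commenters.
for candidate in ["0.7.0", "0.10.3", "0.11.1", "0.11.3", "0.13.3", "0.14.1"]:
    print(candidate, satisfies_transformers_pin(candidate))
```

Neither of the two available versions falls inside the allowed range, which is why pip tries to build tokenizers from source instead.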

@mhsmith
Member

mhsmith commented Jul 20, 2023

Sorry, we have no update planned in the near future. But if you'd like to try updating it yourself, see the links in my previous comment.

Our current tokenizers versions are listed in my comment above. If none of those would work for your project, please post a comment explaining why.

@melink14

Looks like I also need an updated version of the tokenizers package to work with manga-ocr (requires transformers >= 4.25.0):

Failed to install tokenizers<0.15,>=0.14 from https://files.pythonhosted.org/packages/b2/b9/bf025d763bbdd333cb88bedb23426f932c5b4a6ce6f033c498517fad5b90/tokenizers-0.14.1.tar.gz#sha256=ea3b3f8908a9a5b9d6fc632b5f012ece7240031c44c6d4764809f33736534166 (from transformers>=4.25.0->manga-ocr).

I've added my thumbs up and might look at the instructions to build it myself later if I have time.

@mhsmith
Member

mhsmith commented Nov 6, 2023

Thanks – I haven't checked, but you may be able to work around this by using an older version of manga-ocr.

@pcrwebdesign

In my case I need version 0.13.3 because it is a requirement of faster-whisper.
In case it helps others I have made some progress updating it myself by:

  1. Building my own versions of OpenSSL for each ABI (mimicking cryptography's approach) and setting OPENSSL_LIB_DIR and OPENSSL_INCLUDE_DIR to the resulting directories.
  2. Setting RUSTUP_TOOLCHAIN to 1.72.1 to avoid errors caused by the stricter newer Rust compiler. See stackoverflow answer
  3. Modifying the generated Cargo.toml to lower the version of the clap dependency (to 4.4.18), because the existing one requires a higher version of rustc (see point 2)
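
The environment variables from steps 1 and 2 could be assembled like this before starting the build. This is a sketch only: the function name `rust_build_env` and the `/opt/openssl/arm64-v8a` path are hypothetical, and it assumes each per-ABI OpenSSL build has `lib/` and `include/` subdirectories.

```python
import os


def rust_build_env(openssl_prefix, toolchain="1.72.1", base=None):
    """Return a copy of the environment with the variables from steps 1 and 2.

    `openssl_prefix` is a hypothetical directory holding one per-ABI OpenSSL
    build; the lib/ and include/ layout is an assumption.
    """
    env = dict(os.environ if base is None else base)
    env["OPENSSL_LIB_DIR"] = os.path.join(openssl_prefix, "lib")
    env["OPENSSL_INCLUDE_DIR"] = os.path.join(openssl_prefix, "include")
    env["RUSTUP_TOOLCHAIN"] = toolchain  # selects the Rust toolchain rustup runs
    return env


# Start from an empty base here so that only the added variables are printed.
env = rust_build_env("/opt/openssl/arm64-v8a", base={})
for name in sorted(env):
    print(f"{name}={env[name]}")
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the build.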

However, I am blocked because the build-wheel.py script sets
env["_PYTHON_HOST_PLATFORM"] = f"linux_{ABIS[self.abi].uname_machine}"
which makes sysconfig.get_platform() return a value without a dash, causing setuptools_rust.build.get_dylib_ext_path to crash.

I wonder if someone knows the reasoning for setting that env variable and/or the consequences of unsetting it or setting it to a different value that conforms to the usual {osname}-{release}-{machine}.

@mhsmith
Member

mhsmith commented Apr 10, 2024

I don't remember exactly why we added that variable; you can probably find out from the Git history. Going by the sysconfig.get_platform documentation, I agree it should use a dash rather than an underscore, though without a version number on Linux.
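
A small sketch of why the separator matters. The function `machine_from_platform` is made up for this example; it assumes the consumer extracts the machine name by splitting on the last dash, which is a guess at the kind of parsing involved and is not copied from setuptools_rust.

```python
def machine_from_platform(tag):
    """Return the part of a platform tag after the last dash.

    Assumption: a consumer of sysconfig.get_platform() parses the value
    roughly like this. With no dash in the tag, the split yields a single
    element and indexing [1] raises IndexError.
    """
    return tag.rsplit("-", 1)[1]


for tag in ["linux-aarch64", "linux_aarch64"]:
    try:
        print(tag, "->", machine_from_platform(tag))
    except IndexError:
        print(tag, "-> no dash to split on")
```

Under that assumption, changing the value to the form `linux-{machine}` would let the parse succeed, but any other code that depends on the underscore form would need checking first.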

@choyuansu

I needed a module in a more recent version of transformers, which requires tokenizers>=0.14.

I tried building a wheel for tokenizers==0.15.2 following this README and met this error:

Error log
warning: esaxx-rs@0.1.10: src/esaxx.cpp:620:10: fatal error: 'cstdint' file not found
warning: esaxx-rs@0.1.10: #include <cstdint>
warning: esaxx-rs@0.1.10:          ^~~~~~~~~
warning: esaxx-rs@0.1.10: 1 error generated.

error: failed to run custom build command for `esaxx-rs v0.1.10`

Caused by:
  process didn't exit successfully: `/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/src/bindings/python/target/release/build/esaxx-rs-43d93e9b64a75770/build-script-build` (exit status: 1)
  --- stdout
  TARGET = Some("x86_64-unknown-linux-gnu")
  OPT_LEVEL = Some("3")
  HOST = Some("x86_64-unknown-linux-gnu")
  cargo:rerun-if-env-changed=CXX_x86_64-unknown-linux-gnu
  CXX_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CXX_x86_64_unknown_linux_gnu
  CXX_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CXX
  HOST_CXX = None
  cargo:rerun-if-env-changed=CXX
  CXX = Some("/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/wrappers/aarch64-linux-android21-clang++")
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  DEBUG = Some("false")
  cargo:rerun-if-env-changed=CXXFLAGS_x86_64-unknown-linux-gnu
  CXXFLAGS_x86_64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CXXFLAGS_x86_64_unknown_linux_gnu
  CXXFLAGS_x86_64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CXXFLAGS
  HOST_CXXFLAGS = None
  cargo:rerun-if-env-changed=CXXFLAGS
  CXXFLAGS = Some("")
  running: "/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/wrappers/aarch64-linux-android21-clang++" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "--target=x86_64-unknown-linux-gnu" "-I" "src" "-std=c++11" "-o" "/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/src/bindings/python/target/release/build/esaxx-rs-5858a4f309d526f4/out/src/esaxx.o" "-c" "src/esaxx.cpp"
  cargo:warning=src/esaxx.cpp:620:10: fatal error: 'cstdint' file not found

  cargo:warning=#include <cstdint>

  cargo:warning=         ^~~~~~~~~

  cargo:warning=1 error generated.

  exit status: 1

  --- stderr


  error occurred: Command "/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/wrappers/aarch64-linux-android21-clang++" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "--target=x86_64-unknown-linux-gnu" "-I" "src" "-std=c++11" "-o" "/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/src/bindings/python/target/release/build/esaxx-rs-5858a4f309d526f4/out/src/esaxx.o" "-c" "src/esaxx.cpp" with args "aarch64-linux-android21-clang++" did not execute successfully (status code exit status: 1).


warning: build failed, waiting for other jobs to finish...
💥 maturin failed
  Caused by: Failed to build a native library through cargo
  Caused by: Cargo build finished with "exit status: 101": `env -u CARGO PYO3_ENVIRONMENT_SIGNATURE="cpython-3.8-64bit" PYO3_PYTHON="/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/env/bin/python" PYTHON_SYS_EXECUTABLE="/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/env/bin/python" "cargo" "rustc" "--features" "pyo3/extension-module" "--message-format" "json-render-diagnostics" "--manifest-path" "/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/src/bindings/python/Cargo.toml" "--release" "--lib"`
Error: command ['maturin', 'pep517', 'build-wheel', '-i', '/home/alex/Documents/chaquopy/server/pypi/packages/tokenizers/build/0.15.2/cp38-cp38-android_21_arm64_v8a/env/bin/python', '--compatibility', 'off'] returned non-zero exit status 1
build-wheel: Error: Backend subprocess exited when trying to invoke build_wheel

Not sure how to proceed from here. Any help is appreciated.

@mhsmith
Member

mhsmith commented Apr 15, 2024

This appears to be caused by the --target option, which is unnecessary because the target is already encoded into the compiler launcher. You'd have to examine the build system to work out how to remove the option, but unfortunately I don't know any more than that.
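
The mismatch can be spotted mechanically in the `running:` line of the log. The sketch below is a diagnostic aid only, not a fix: the function name `find_target_mismatch` is invented, and treating `<triplet>-clang++` as the wrapper naming pattern is an assumption based on the paths in the log above.

```python
import os
import re


def find_target_mismatch(argv):
    """Compare the triplet in a wrapper's file name with any --target= argument.

    Returns (wrapper_triplet, target) when both are present and differ,
    otherwise None.
    """
    match = re.match(r"(.+)-clang(\+\+)?$", os.path.basename(argv[0]))
    targets = [a.split("=", 1)[1] for a in argv[1:] if a.startswith("--target=")]
    if not match or not targets:
        return None
    wrapper_triplet, target = match.group(1), targets[-1]
    return None if wrapper_triplet == target else (wrapper_triplet, target)


# Shortened version of the failing command from the log above.
argv = [
    "wrappers/aarch64-linux-android21-clang++",
    "-O3",
    "--target=x86_64-unknown-linux-gnu",
    "-c",
    "src/esaxx.cpp",
]
print(find_target_mismatch(argv))
```

Here the wrapper is an Android arm64 compiler, while the `--target` value is the Linux build host.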

@choyuansu

@mhsmith
Thanks for the hint. I now switched to building inside a docker container, and I'm getting a different error: build-wheel: Error: /workdir/chaquopy/server/pypi/packages/tokenizers/build/0.14.1/cp38-cp38-android_21_arm64_v8a/fix_wheel/tokenizers/tokenizers.so is linked against unknown library 'libstdc++.so.6'.

Here's the Dockerfile and docker-compose.yaml I used, and some other changes to help reproduce the error:

Dockerfile
FROM python:3.8.18-slim-bookworm

RUN apt update && apt install -y \
    patch \
    patchelf \
    unzip \
    curl \
    build-essential \
    wget

WORKDIR /workdir

COPY server/pypi/requirements.txt /workdir
RUN pip install -r requirements.txt
RUN curl https://sh.rustup.rs -sSf | bash -s -- -y

ENV ANDROID_HOME=/workdir/chaquopy/server/pypi/android-sdk

docker-compose.yaml
services:
  build-wheel:
    build:
      context: .
      dockerfile: Dockerfile
    volumes:
      - .:/workdir/chaquopy
    command:
      - bash
      - -ecl
      - |
        # download target if not exist
        cd /workdir/chaquopy
        if [ ! -d /workdir/chaquopy/maven/com/chaquo/python/target/3.8.18-0 ]; then
          target/download-target.sh maven/com/chaquo/python/target/3.8.18-0
        fi

        # build wheel
        cd /workdir/chaquopy/server/pypi
        ./build-wheel.py --python 3.8 --abi arm64-v8a tokenizers

Other changes
diff --git a/.dockerignore b/.dockerignore
index 775ef4ae..10b58a74 100644
--- a/.dockerignore
+++ b/.dockerignore
@@ -12,6 +12,7 @@
 !server/pypi/pkgtest
 !server/pypi/dist
 !server/pypi/piptest
+!server/pypi/requirements.txt
 
 **/.gradle/
 **/.idea/
diff --git a/server/pypi/packages/tokenizers/meta.yaml b/server/pypi/packages/tokenizers/meta.yaml
index 9d4b96f8..3932e56f 100644
--- a/server/pypi/packages/tokenizers/meta.yaml
+++ b/server/pypi/packages/tokenizers/meta.yaml
@@ -1,7 +1,7 @@
 package:
   name: tokenizers
-  version: "0.10.3"
+  version: "0.15.2"
 
 requirements:
   build:
-    - setuptools-rust 0.11.6
\ No newline at end of file
+    - setuptools-rust 0.11.6
diff --git a/server/pypi/packages/tokenizers/patches/chaquopy.patch b/server/pypi/packages/tokenizers/patches/chaquopy.patch
deleted file mode 100644
index 50b3601b..00000000
--- a/server/pypi/packages/tokenizers/patches/chaquopy.patch
+++ /dev/null
@@ -1,51 +0,0 @@
---- src-original/setup.py	2020-04-17 16:57:37.000000000 +0000
-+++ src/setup.py	2021-01-12 23:57:10.005615920 +0000
-@@ -1,6 +1,38 @@
- from setuptools import setup
- from setuptools_rust import Binding, RustExtension
- 
-+
-+# BEGIN Chaquopy additions
-+import os
-+from os.path import abspath, dirname, exists
-+from subprocess import check_call
-+import sys
-+
-+triplet = os.environ["CHAQUOPY_TRIPLET"]
-+rust_toolchain = open("rust-toolchain").read().strip()
-+check_call(["rustup", "toolchain", "install", rust_toolchain])
-+check_call(["rustup", "target", "add", "--toolchain", rust_toolchain, triplet])
-+
-+os.environ["CARGO_BUILD_TARGET"] = triplet
-+sysroot = abspath(f"{dirname(os.environ['CC'])}/../sysroot")
-+py_version = "{}.{}".format(*sys.version_info[:2])
-+os.environ["PYO3_CROSS_INCLUDE_DIR"] = f"{sysroot}/usr/include/python{py_version}"
-+os.environ["PYO3_CROSS_LIB_DIR"] = f"{sysroot}/usr/lib"
-+
-+os.makedirs(".cargo", exist_ok=True)
-+config_filename = ".cargo/config.toml"
-+config = f"""\
-+[target.{triplet}]
-+ar = "{os.environ['AR']}"
-+linker = "{os.environ['CC']}"
-+"""
-+if exists(config_filename) and open(config_filename).read() != config:
-+    raise Exception(f"{config_filename} exists with different content")
-+with open(config_filename, "w") as config_file:
-+    config_file.write(config)
-+# END Chaquopy additions
-+
-+
- extras = {}
- extras["testing"] = ["pytest"]
- 
-@@ -15,7 +47,8 @@
-     author_email="[email protected]",
-     url="https://github.com/huggingface/tokenizers",
-     license="Apache License 2.0",
--    rust_extensions=[RustExtension("tokenizers.tokenizers", binding=Binding.PyO3, debug=False)],
-+    rust_extensions=[RustExtension("tokenizers.tokenizers", binding=Binding.PyO3,
-+                     rustc_flags=[f"-lpython{py_version}"])],  # Chaquopy
-     extras_require=extras,
-     classifiers=[
-         "Development Status :: 5 - Production/Stable",

I also found this comment suggesting adding the -stdlib=libstdc++ option, but I'm not sure where to add that.

Hope you can help me solve this error. Thanks!

@mhsmith
Member

mhsmith commented Apr 16, 2024

Sorry, I don't have time to look into this in any detail. But libstdc++.so.6 is a Linux library name which should never appear in an Android build, so this is probably caused by the build using a mixture of Android and Linux elements.
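
As a rough heuristic for spotting this kind of mix-up, versioned sonames such as `libstdc++.so.6` can be flagged by name. This sketch assumes that Android libraries normally use unversioned names ending in `.so`; the function name `looks_like_linux_soname` and the sample list are illustrative only.

```python
import re

# A versioned soname such as "libstdc++.so.6" ends in ".so.<number>".
VERSIONED_SONAME = re.compile(r"\.so\.\d+(\.\d+)*$")


def looks_like_linux_soname(name):
    """Heuristic: flag versioned sonames, which are unusual on Android."""
    return VERSIONED_SONAME.search(name) is not None


for lib in ["libstdc++.so.6", "libc++_shared.so", "libpython3.8.so"]:
    print(lib, looks_like_linux_soname(lib))
```

A flagged name suggests that some step of the build used the host's Linux toolchain rather than the Android one.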

@mhsmith mhsmith changed the title Please update tokenizers version Please update tokenizers and transformers version May 11, 2024
@divyanshluthra

Hey, hope you are doing well. I am facing issues while trying to pip install anthropic, which depends on tokenizers>=0.13. I tried with version 0.13, but I get the attached errors. Could you please advise how we can work around this issue?
Regards
Divyansh
tokenizer.log

@mhsmith
Member

mhsmith commented May 16, 2024

You could try using an older version of anthropic. Looking back through the blame of anthropic's pyproject.toml, the last version which didn't require such a new version of tokenizers was anthropic 0.2.10. That came out less than a year ago, but this is obviously a fast-moving package, so I don't know if that would be acceptable for you.

@divyanshluthra

> You could try using an older version of anthropic. Looking back through the blame of anthropic's pyproject.toml, the last version which didn't require such a new version of tokenizers was anthropic 0.2.10. That came out less than a year ago, but this is obviously a fast-moving package, so I don't know if that would be acceptable for you.

Luckily, tokenizers version 0.10.3 has worked with the latest anthropic package so far. I decided to test it regardless of the incompatibility error reported during the build, and it worked. Older anthropic versions are not available to newer users, as per their API docs, because of the major changes and improvements in their latest offering, "opus". So far so good.
