Reduce package size #251

Open
Goldziher opened this issue Nov 24, 2024 · 5 comments

@Goldziher

Goldziher commented Nov 24, 2024

Hi and thanks for this great library.

When using KeyBERT I am running into a problem that is common with ML libraries: the package is very large because of its dependencies. For example, torch (pulled in by the sentence-transformers library) is gigantic, and scikit-learn is also very large. This makes it very difficult to use this library in a serverless context due to cloud function size limitations and cold start issues.

I would like to suggest working to reduce the package size. This can be done by making some dependencies optional and adding guards against them.

@MaartenGr
Owner

Thank you for the suggestion! Reducing package size would be helpful but I'm not quite sure what the suggested implementation would look like. Take scikit-learn for example, when I look through the source code I cannot find any way to make it optional as it is a necessary dependency. The same can be said for sentence-transformers as most users are using that as a backend.

How would you suggest removing those packages while still keeping the functionality of KeyBERT? Also, if these packages are needed but for some reason need a separate installation, how would you suggest making it possible that pip install keybert remains unchanged? For instance, pip install keybert[minimum] is not supported by pip.

@Goldziher
Author

Goldziher commented Nov 25, 2024

I'm glad you are looking positively on this suggestion.

To make dependencies optional, there are a few elements that can be used:

  1. import blocks
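try:
    # Prefer the optional, faster parser when it is installed.
    from fast_query_parsers import parse_query_string as parse_qsl
except ImportError:
    # Otherwise bind the standard-library parser to the same name.
    from urllib.parse import parse_qsl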

In this example (see source here) we try to import an optional dependency. If an ImportError is raised, a fallback is assigned instead.

This could be used for example to implement alternative logic in utils etc.
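As a rough illustration of how such a guard could be packaged for reuse, here is a minimal sketch. The helper names `optional_import` and `pick_implementation`, and the module name `hypothetical_fast_parser`, are made up for this example and are not part of KeyBERT or any other library.

```python
import importlib
from types import ModuleType
from typing import Callable, Optional
from urllib.parse import parse_qsl as std_parse_qsl


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def pick_implementation(module_name: str, attr: str, fallback: Callable) -> Callable:
    """Use `attr` from the optional module if available, else the fallback."""
    module = optional_import(module_name)
    if module is None:
        return fallback
    return getattr(module, attr, fallback)


# "hypothetical_fast_parser" is a placeholder for an optional dependency.
# When it is not installed, the standard-library parser is used instead.
parse = pick_implementation("hypothetical_fast_parser", "parse_qsl", std_parse_qsl)
```

The rest of the code only ever calls `parse`, so it does not need to know which implementation was picked.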

  2. validation:
  • The simplest approach is to validate that at least one backend is installed on the library load. I.e. runtime validation.
  • Another approach is to raise an error during installation using a setup.py post-install script. See for example this StackOverflow thread.
  3. how to allow backend selection?

The answer is to switch to runtime backend selection. This is a breaking change, and thus it will need to be implemented in a v1.0.0 to work. Basically, the user has to install a backend, either using an extra dependency group or by installing it separately.
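Runtime validation and runtime backend selection could look roughly like the sketch below. It is only an illustration: the `KNOWN_BACKENDS` mapping and the function names are assumptions made for this example, not existing KeyBERT code. It relies on `importlib.util.find_spec`, which checks whether a top-level module can be found without importing it.

```python
import importlib.util
from typing import Dict, List, Optional

# Illustrative mapping: backend name -> module that must be importable.
# (Assumed for this sketch; not taken from the KeyBERT source.)
KNOWN_BACKENDS: Dict[str, str] = {
    "sentence-transformers": "sentence_transformers",
    "flair": "flair",
    "spacy": "spacy",
}


def available_backends(known: Dict[str, str] = KNOWN_BACKENDS) -> List[str]:
    """List the backends whose module is installed, without importing them."""
    return [
        name
        for name, module in known.items()
        if importlib.util.find_spec(module) is not None
    ]


def select_backend(
    preferred: Optional[str] = None,
    known: Dict[str, str] = KNOWN_BACKENDS,
) -> str:
    """Validate at runtime that a backend exists and return its name."""
    found = available_backends(known)
    if not found:
        raise ImportError(
            "No embedding backend is installed. Install one of: "
            + ", ".join(known)
        )
    if preferred is None:
        return found[0]  # first installed backend in declaration order
    if preferred not in found:
        raise ImportError(f"Backend {preferred!r} is not installed.")
    return preferred
```

Calling `select_backend()` once at library load would give the runtime validation described above, while passing `preferred` would let the user choose among the installed backends.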

@MaartenGr
Owner

In this example (see source here) we try to import an optional dependency. If an ImportError is raised, a fallback is assigned instead.

How would something like this be relevant for KeyBERT? I believe this is already implemented. There are many optional installations that you can do outside of the main package for different backends: https://maartengr.github.io/KeyBERT/guides/embeddings.html

validation:
The simplest approach is to validate that at least one backend is installed on the library load. I.e. runtime validation.
Another approach is to raise an error during installation using a setup.py post-install script. See for example this StackOverflow thread.

The thing is, nearly all users will make use of sentence-transformers, as that is typically not only the most performant backend but also something that I'm sure 99% of users will use as a backend. I believe it's the industry standard.

how to allow backend selection?
The answer is to switch to runtime backend selection. This is a breaking change, and thus it will need to be implemented in a v1.0.0 to work. Basically, the user has to install a backend, either using an extra dependency group or by installing it separately.

I'm not sure whether this is ideal as this would mean that pip install keybert will result in a package that cannot be used since it does not come with a necessary backend. You would always have to run something like pip install keybert[sbert]. One of the most important components to my packages is ease of use, and I believe adding more steps would make the user experience less pleasant.

Let me rephrase my initial question. I believe you cannot remove sentence-transformers or scikit-learn since the former is a backend that almost all users will use and the package really cannot work without the latter. Thus, how do you propose reducing the package size when these dependencies are necessary?

If these packages are not necessary for the functionality of KeyBERT, could you explain why?

@Goldziher
Author

Let me rephrase my initial question. I believe you cannot remove sentence-transformers or scikit-learn since the former is a backend that almost all users will use and the package really cannot work without the latter. Thus, how do you propose reducing the package size when these dependencies are necessary?

In this case it is impossible.

This, though, makes it difficult to use this library in contexts where the size of the library is an issue.

@MaartenGr
Owner

In this case it is impossible.

That's too bad. I had hoped that, since you specifically mentioned sentence-transformers and scikit-learn, you knew of a specific way to remove these dependencies that relates to the internals of KeyBERT.


2 participants