Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support user specified language only installation #97

Merged
merged 29 commits into from
Apr 2, 2024

Conversation

Vela-zz
Copy link
Contributor

@Vela-zz Vela-zz commented Mar 7, 2024

split the dependices into several optional dependencies accroding to which language use it.
now user can install like

pip install langcheck #equals to install [en]
pip install langcheck[en]
pip install langcheck[ja]
pip install langcheck[all]
...

TODOs (added by Kenny):

  • Test recursive optional dependencies with an older version of pip
  • Decide on lazy loading or not
  • Decide what to do about [optional]
  • Decide what to do about optional pytests
  • Update installation documentation and README
  • Add custom error message (not possible)

@Vela-zz Vela-zz marked this pull request as draft March 7, 2024 10:05
@Vela-zz
Copy link
Contributor Author

Vela-zz commented Mar 7, 2024

#88
@kennysong please check if this update align with the requirements and update the GH Action.

@Vela-zz
Copy link
Contributor Author

Vela-zz commented Mar 9, 2024

It seems we should update init in metrics first. langcheck init all language at once but not check which language was installed.

@kennysong
Copy link
Contributor

Yeah, I had the exact same conclusion. I'm wondering if we should lazy load the language sub-packages so they're still visible at the package level, but won't actually import any dependencies until a function is called.

I'll explore this a bit more tomorrow

@kennysong
Copy link
Contributor

Okay, I've implemented lazy loading of language-specific packages in langcheck.augment and langcheck.metrics.

Without lazy loading, the user is required to explicitly load each language, such as import langcheck.metrics.ja. Otherwise, they'll see this error which IMO is a bit mysterious:

>>> import langcheck
>>> langcheck.metrics.ja
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'langcheck.metrics' has no attribute 'ja'

With lazy loading, the user will see the exact same behavior as current LangCheck:

>>> import langcheck
>>> langcheck.metrics.ja
<module 'langcheck.metrics.ja' from '/usr/local/lib/python3.8/site-packages/langcheck/metrics/ja/__init__.py'>

If the user didn't install Japanese dependencies with pip install langcheck[ja], then they'll see:

>>> import langcheck  # The import error doesn't happen here due to lazy loading! 
>>> langcheck.metrics.ja
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<frozen importlib._bootstrap>", line 271, in _module_repr
  File "/usr/local/lib/python3.8/importlib/util.py", line 245, in __getattribute__
    self.__spec__.loader.exec_module(self)
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.8/site-packages/langcheck/metrics/ja/__init__.py", line 1, in <module>
    from langcheck.metrics.ja._tokenizers import JanomeTokenizer, MeCabTokenizer
  File "/usr/local/lib/python3.8/site-packages/langcheck/metrics/ja/_tokenizers.py", line 6, in <module>
    from janome.tokenizer import Tokenizer
ModuleNotFoundError: No module named 'janome'

I think this is the best UX, but it also adds some complexity to metrics/__init__.py and augment/__init__.py. I've never used lazy loading before so I don't know if it'll cause unexpected problems.

Also, since this is a non-trivial change, I think we should merge this PR after cutting the 0.5.0 release so we have extra time to test it.

Would appreciate your thoughts @Vela-zz and @liwii!

@Vela-zz
Copy link
Contributor Author

Vela-zz commented Mar 10, 2024

Okay, I've implemented lazy loading of language-specific packages in langcheck.augment and langcheck.metrics.

Without lazy loading, the user is required to explicitly load each language, such as import langcheck.metrics.ja. Otherwise, they'll see this error which IMO is a bit mysterious:

>>> import langcheck
>>> langcheck.metrics.ja
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'langcheck.metrics' has no attribute 'ja'

With lazy loading, the user will see the exact same behavior as current LangCheck:

>>> import langcheck
>>> langcheck.metrics.ja
<module 'langcheck.metrics.ja' from '/usr/local/lib/python3.8/site-packages/langcheck/metrics/ja/__init__.py'>

If the user didn't install Japanese dependencies with pip install langcheck[ja], then they'll see:

>>> import langcheck  # The import error doesn't happen here due to lazy loading! 
>>> langcheck.metrics.ja
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<frozen importlib._bootstrap>", line 271, in _module_repr
  File "/usr/local/lib/python3.8/importlib/util.py", line 245, in __getattribute__
    self.__spec__.loader.exec_module(self)
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.8/site-packages/langcheck/metrics/ja/__init__.py", line 1, in <module>
    from langcheck.metrics.ja._tokenizers import JanomeTokenizer, MeCabTokenizer
  File "/usr/local/lib/python3.8/site-packages/langcheck/metrics/ja/_tokenizers.py", line 6, in <module>
    from janome.tokenizer import Tokenizer
ModuleNotFoundError: No module named 'janome'

I think this is the best UX, but it also adds some complexity to metrics/__init__.py and augment/__init__.py. I've never used lazy loading before so I don't know if it'll cause unexpected problems.

Also, since this is a non-trivial change, I think we should merge this PR after cutting the 0.5.0 release so we have extra time to test it.

Would appreciate your thoughts @Vela-zz and @liwii!

FYI @kennysong SPEC1 mentioned about lazy loading and the latent risk by using lazy_loader library.
I followed the instructions in SPEC1 and change all attrs in metrics into lazy loading mode, but loading time saved is not noticable.

@kennysong
Copy link
Contributor

kennysong commented Mar 11, 2024

FYI @kennysong SPEC1 mentioned about lazy loading and the latent risk by using lazy_loader library.
I followed the instructions in SPEC1 and change all attrs in metrics into lazy loading mode, but loading time saved is not noticable.

Ah nice find, I hadn't heard of SPEC1 before!

The SPEC1 lazy_loader highlights an important difference with the standard library's LazyLoader (which I used).

The stdlib LazyLoader doesn't import the actual contents of the package, so while langcheck.metrics.ja is visible, langcheck.metrics.ja.factual_consistency() is not visible.

This means that code completion doesn't work:

CleanShot 2024-03-11 at 21 31 44@2x

And more importantly, type checking doesn't work:

CleanShot 2024-03-11 at 22 22 58@2x

Note that type checks do work if you import the full path:

CleanShot 2024-03-11 at 22 26 28@2x

Given this, I think that the current implementation with LazyLoader isn't a good solution. We have two alternatives:

  1. Don't import language-specific packages in langcheck.metrics and langcheck.augment
  2. Use a more sophisticated and complex option like SPEC1's lazy_loader

However, option 2 is fairly complex, not under active development (only 96 stars, last release in July 2023), and IMO beyond our team's current bandwidth to properly understand and maintain.

I think we should proceed with option 1. This will require users to explicitly import the full path like import langcheck.metrics.ja, which is annoying and a breaking change, but I think it's the only reasonable option right now.

Let me know what you think!

@Vela-zz
Copy link
Contributor Author

Vela-zz commented Mar 11, 2024

FYI @kennysong SPEC1 mentioned about lazy loading and the latent risk by using lazy_loader library.
I followed the instructions in SPEC1 and change all attrs in metrics into lazy loading mode, but loading time saved is not noticable.

Ah nice find, I hadn't heard of SPEC1 before!

The SPEC1 lazy_loader highlights an important difference with the standard library's LazyLoader (which I used).

The stdlib LazyLoader doesn't import the actual contents of the package, so while langcheck.metrics.ja is visible, langcheck.metrics.ja.factual_consistency() is not visible.

This means that code completion doesn't work:

CleanShot 2024-03-11 at 21 31 44@2x

And more importantly, type checking doesn't work:

CleanShot 2024-03-11 at 22 22 58@2x

Note that type checks do work if you import the full path:

CleanShot 2024-03-11 at 22 26 28@2x

Given this, I think that the current implementation with LazyLoader isn't a good solution. We have two alternatives:

  1. Don't import language-specific packages in langcheck.metrics and langcheck.augment
  2. Use a more sophisticated and complex option like SPEC1's lazy_loader

However, option 2 is fairly complex, not under active development (only 96 stars, last release in July 2023), and IMO beyond our team's current bandwidth to properly understand and maintain.

I think we should proceed with option 1. This will require users to explicitly import the full path like import langcheck.metrics.ja, which is annoying and a breaking change, but I think it's the only reasonable option right now.

Let me know what you think!

I upload my experiment (it can support type checking/ static check). I think through the example below, it can show that the lazy loader used in SPEC1 is quite flexible. they can mix eager import with lazy import, just eager import the ja subpackage but make func under ja package lazy import...so if abused, when times goes on... no one would know which one is lazy import and which one is eager.
so I make a slight change to two options,

  1. Don't import language-specific packages in langcheck.metrics and langcheck.augment
  2. Apply lazy import to all func under language package.

As PEP690 was rejected, it seems python would not have a standard way to do lazy import in recent future.
So, I think option 1 seems more reasonable, it only requires the user import language package explicitly, may influence backward compatibility (would not be a problem as the version is only 0.4.0 ?).

image

@kennysong
Copy link
Contributor

kennysong commented Mar 13, 2024

Thanks for the analysis! I agree that lazy_loader is more flexible than LazyLoader but hard to maintain. Let's go with option 1.

@Vela-zz could you then remove all of the LazyLoader code and fix tests if needed?

I can work on updating all of the documentation to reflect the new explicit imports.

@Vela-zz Vela-zz marked this pull request as ready for review March 21, 2024 14:56
@kennysong kennysong mentioned this pull request Mar 21, 2024
@kennysong
Copy link
Contributor

Thanks @Vela-zz! I think this PR is good to go. I was playing around with a bunch of tweaks, and was able to get import langcheck working the same as before by wrapping the language imports in a try/except. This way users don't need to import langcheck.metrics.ja anymore.

@kennysong
Copy link
Contributor

kennysong commented Mar 22, 2024

@yosukehigashi or @liwii – Since this is a big change, could one of you could test this PR before merging?

The key workflow should look like:

# Open a clean Python environment
docker run -it -v "$(pwd)":/langcheck python:3.8 bash -c "cd /langcheck && bash"

# Install english support
pip install .

# Test LangCheck
python
>>> import langcheck
>>> langcheck.metrics.toxicity  # Works
>>> langcheck.metrics.en.toxicity  # Works
>>> langcheck.metrics.ja.toxicity  # Doesn't work
>>> import langcheck.metrics  # Works
>>> import langcheck.metrics.en  # Works
>>> import langcheck.metrics.ja  # Doesn't work

# Install Japanese support
pip install .[ja]

# Test LangCheck
python
>>> import langcheck
>>> langcheck.metrics.toxicity  # Works
>>> langcheck.metrics.en.toxicity  # Works
>>> langcheck.metrics.ja.toxicity  # Works
>>> import langcheck.metrics  # Works
>>> import langcheck.metrics.en  # Works
>>> import langcheck.metrics.ja  # Works

Also, test in VSCode. Install only one language in a venv. Enable that venv in VSCode. Then create a test.py file in the root directory and make sure that VSCode's static analysis features still work for all languages.

CleanShot 2024-03-22 at 15 57 24@2x

CleanShot 2024-03-22 at 15 57 46@2x

CleanShot 2024-03-22 at 15 57 57@2x

@Vela-zz
Copy link
Contributor Author

Vela-zz commented Mar 23, 2024

Thanks @Vela-zz! I think this PR is good to go. I was playing around with a bunch of tweaks, and was able to get import langcheck working the same as before by wrapping the language imports in a try/except. This way users don't need to import langcheck.metrics.ja anymore.

Thanks, learn a lot.

@liwii
Copy link
Contributor

liwii commented Apr 2, 2024

Confirmed that

  • In a clean docker environment, import langcheck.metrics.ja works if and only if langcheck[ja] is installed
  • In a venv without langcheck[ja] installed, VSCode extensions properly recognizes the langcheck.metrics.ja module.

@kennysong
Copy link
Contributor

Thanks for testing! I'll merge this now. The only test that's failing is because the HanLP server is down and is unrelated.

@kennysong kennysong merged commit c750d2d into citadel-ai:main Apr 2, 2024
37 of 38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants