Turkish language package changed. #332

Open
simjanos-dev opened this issue Aug 19, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@simjanos-dev
Owner

https://huggingface.co/turkish-nlp-suite/tr_core_news_md/tree/main

The Turkish language model was renamed. It also seems to require spaCy >=3.4.2,<3.5.0, a constraint that was present even before the name change. When I change the URL and try to install, the tokenizer.py script dies and messes up the model directory with a spaCy 3.4 version. The script also stops functioning after an attempted Turkish install followed by a restart.
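To make the mismatch concrete, here is a minimal sketch of checking an installed spaCy version against a model's requirement string before attempting an install. The helper names `parse_version` and `satisfies` are hypothetical and not part of LinguaCafe or spaCy, and the parser only handles simple numeric clauses like the one quoted above. In real code, `packaging.specifiers.SpecifierSet` is likely the better tool.

```python
import operator
import re

# Comparison operators supported by this simplified checker.
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text):
    """Turn '3.7.5' into (3, 7, 5), padding missing parts with zeros."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def satisfies(installed, requirement):
    """Return True if `installed` meets every comma-separated clause."""
    for clause in requirement.split(","):
        match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*([\d.]+)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound = match.groups()
        if not _OPS[op](parse_version(installed), parse_version(bound)):
            return False
    return True


# The Turkish model's requirement versus the spaCy version in the image.
turkish_ok = satisfies("3.7.5", ">=3.4.2,<3.5.0")
print(turkish_ok)
```

A check like this could run before the download, so an incompatible model is refused instead of leaving the model directory in a mixed state.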

@sergiolaverde0 I don't know yet how to fix this issue. I tagged you in case you are interested, and have an idea.

@simjanos-dev simjanos-dev added the bug Something isn't working label Aug 19, 2024
@sergiolaverde0
Contributor

I think pinning the version when installing the packages in DockerfilePython should solve this, but I won't be able to write a fix until tomorrow or the day after.

@simjanos-dev
Owner Author

> I think pinning the version when installing the packages in DockerfilePython should solve this, but I won't be able to write a fix until tomorrow or the day after.

I use the latest v13.0 image for personal use, and it has spaCy 3.7.5. I would be surprised if we went from 3.4 to 3.7.5 just by not pinning a version number.

Pinning it to an older version would solve this, but I'm not sure we should use an older spaCy version. Also, the other installable packages use spaCy 3.7.0, based on their URLs. I think this change could also mess up the model folder for people who already have models installed.

I think maybe we should also host these files on the LinguaCafe GitHub if possible.

Thank you so much for your help with it! Also please take your time, it is not urgent.

@simjanos-dev
Owner Author

If it's something we cannot fix reasonably simply, maybe we could solve it by replacing it with Stanza, if Turkish is available there.

@sergiolaverde0
Contributor

Well, I have tried the simple solution of just updating the link, and installing Turkish does indeed downgrade spaCy, which triggers #323. I'm actually ashamed I didn't notice this before; it is quite big.

I have a "hotfix" that enables you to install Turkish at the expense of breaking every other language, but we need a better solution. For the time being this should be announced as a known issue so people know it happens, know if it already affected them, and can decide whether to use Turkish anyway or not.
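One possible stopgap, sketched under assumptions rather than taken from this thread: install the model with a pip constraints file that pins spaCy to the version already in the image. With the pin in place, pip should report a dependency conflict instead of silently downgrading. The function name `build_guarded_install`, the constraints file name, and the example URL are all hypothetical placeholders; only `pip install --constraint` is an existing pip option.

```python
import os
import sys
import tempfile


def build_guarded_install(model_url, installed_spacy_version, workdir):
    """Write a constraints file pinning spaCy and return the pip command.

    The command is only built here, not executed, so the caller decides
    when (and whether) to run it, e.g. with subprocess.run(..., check=True).
    """
    constraints_path = os.path.join(workdir, "spacy-constraints.txt")
    with open(constraints_path, "w", encoding="utf-8") as handle:
        # Pin spaCy to the version that the other models depend on.
        handle.write(f"spacy=={installed_spacy_version}\n")
    return [
        sys.executable, "-m", "pip", "install",
        "--constraint", constraints_path,
        model_url,
    ]


# Placeholder URL: substitute the real model wheel location.
workdir = tempfile.mkdtemp()
command = build_guarded_install(
    "https://example.invalid/tr_model.whl", "3.7.5", workdir
)
print(command)
```

This does not make Turkish installable. It only turns the downgrade into a visible install failure, which keeps the other languages working.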

> Pinning it to an older version would solve this, but I'm not sure we should use an older spaCy version. Also, the other installable packages use spaCy 3.7.0, based on their URLs.

In theory we could downgrade those packages so that they are compatible with the older spaCy, but I don't like the idea, and it will 100% break every other extra package, which is probably worse.

> I think maybe we should also host these files on the LinguaCafe GitHub if possible.

I thought about that at some point but was unsure given their size. At this point it is probably worth giving it a second chance.

> we could solve it by replacing it with Stanza, if Turkish is available there

Actually not a bad idea, but I need to go back to the open PR first. Cobbling something together that fits our use case is far easier than making something fit for upstream, and it will also be useful for the prior point. However, it will still take some time.

As a side note, lxml[html_clean] had a breaking change, which I have already addressed, in case you try a dev build and it fails for you.
