RFC: Use stanza model for Finnish #255
base: dev
Conversation
Oh wow, this looks great! I didn't know about this. I would love to add it. We actually have a language install system, so the image size would not increase; it would only take up space for users who actually use this language. Does this require a GPU? Can you please test what the size would be without the NVIDIA driver? My only problem with it would be GPU dependence, plus my laptop is probably too weak to test this. After adding the 2 missing Spacy languages, my plan was to use different tokenizers, so it would be VERY useful if I could keep using Spacy for more languages. Thank you so much for working on this! @sergiolaverde0 You may be interested in this.
@simjanos-dev I am so glad you liked it! Yes, it works without a GPU: I just added installation of the CPU version of PyTorch.
What are my next actions? Fix the documentation (add references to stanza in all places where spacy is mentioned, write proper commit and PR messages) and undraft the PR, or is there something else that needs to be done?
```diff
@@ -8,6 +8,8 @@ RUN apt-get update -y \
     && apt-get clean \
     && rm -rf /var/lib/apt/lists/*
+
+RUN pip install torch --index-url https://download.pytorch.org/whl/cpu
```
A comment about stanza and NVIDIA drivers is needed here.
Looking at the URL for the PyTorch install, this doesn't need a GPU, since it uses CPU as the computing platform. I heard we can reduce the size of that install by compiling PyTorch from source without the unnecessary features, but I haven't done it before and I don't know by how much we can cut it. If the accuracy increase is noticeable for enough languages, maybe we should consider the possibility of making it the default. I'm concerned with performance when using CPU, so that's another thing to check. I see a future here, but it will take effort.
My enthusiasm has dropped a lot; I thought it would be much smaller. The model size is still huge compared to the 20-50 MB models we used before. A few more questions:
I don't think I want to do that. Some users already had issues with RAM. I've seen attempted installs on Raspberry Pis, small free-tier hosted servers and old laptops. I myself have an old laptop, and I also want to host LinguaCafe on a VPS in the future and try to optimize it. I would rather make LinguaCafe smaller by default if possible. However, I definitely want to add these models as an option as well.
I'm not sure; I will need some time to figure out what I would like to do. I will more than likely have a problem with testing this myself. Since this is only needed for lemmas (except for languages that have no spaces or have readings), what if we used a huge amount of text and generated a list of lemmas that we would use for LinguaCafe? For most languages, that is the only value added by using another model or tokenizer instead of the multilingual Spacy one. Two other options would be: adding them as extra installable languages like "Finnish (large)", or adding an API that lets people use other tokenizers. It would be easy to copy the current Python container, modify it and add different models. What do you think?
Well, seeing how we have already more or less frozen the features for v12.0, and since I have an assignment for this weekend, I suggest giving it some time. Next week I will try to compile PyTorch from source and see what the absolute minimum size would be, so we can make a better informed decision. For the time being, the option for larger models is my favourite. @rominf sorry if we take our time, but Simjanos is right to be concerned about the accessibility of the hardware requirements.
Implementing this will definitely take a lot of time. I want to add everything to LinguaCafe that I can, but I can't do it at the rate requests are coming in. It's been insane progress in the last 4 months since release.
Please take your time! I will post my results, so that you have some food for thought meanwhile. You are right about the importance of accessible hardware requirements: my mistake, I was not thoughtful about this. I will write about Finnish only, since I have not tried lemmatization in other languages. Stanza language support is split into multiple models; for lemmatization, only a subset of them is needed. My PC info:
Here are the results of lemmatization on the Universal Dependencies treebank:
I also measured RAM usage during lemmatization of Alice in Wonderland in Finnish using scalene. Here is the script:
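A minimal sketch of such a script, assuming a local plain-text copy of the book (the file name is invented) and the no-pos processor set described below:

```python
# lemmatize_alice.py - profile with: scalene lemmatize_alice.py
import stanza

# Lemmatization-only pipeline: tokenize, mwt, lemma (no pos processor,
# matching the "without pos" setup this benchmark describes).
nlp = stanza.Pipeline("fi", processors="tokenize,mwt,lemma", verbose=False)

with open("alice_in_wonderland_fi.txt", encoding="utf-8") as f:
    text = f.read()

doc = nlp(text)
lemmas = [word.lemma for sent in doc.sentences for word in sent.words]
print(f"{len(lemmas)} lemmas produced")
```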
Results:
As you can notice from the script, I don't use the pos processor. This is the size of the image now (without pos):
To sum up: stanza without the pos processor is a bit more accurate on Finnish than spacy and takes significantly less disk space and RAM, but is much slower. Stanza with the pos processor is much more accurate on Finnish than spacy, but takes significantly more disk space and RAM and is tremendously slower. The proposal about having multiple variants of a language is my favorite as well! Do you want me to do a benchmark of spacy vs stanza for other languages? UPD: added results for Python 3.12 for the Alice in Wonderland test.
This is a really detailed test report, thank you so much!
Wow. I have an i5-8250U and 8 GB of RAM.
I think we should go with that as well to provide the best experience possible. At first I was thinking about it the wrong way. My first idea was to have multiple languages for different tokenizers, but I realized it would be extremely difficult to implement, since language names are used in a ton of places. It is, however, reasonably simple to switch tokenizers. So we can just make the tokenizer selectable on the admin page without separating them into their own language.
I'm mostly interested in whether, if we add multiple languages, the additional disk space required would decrease due to shared dependencies. I think the latest 2 GB disk size you commented is very reasonable to be added as an option. But if the models themselves are so small, is there any way to decrease the disk space further? Can we remove Spacy and use Stanza by itself to save space? I know it returns a different format, but I can write a different tokenizer function for it. The tokenizer is quite a bit slower, but the PHP side of processing the text takes time as well, so it might not be that much of an issue; plus, users can decide which one they want to use. I would like to help implement this, but I won't be able to provide testing, or support for users who will have issues with it, because my laptop would die trying to run this. I will think about how to implement a tokenizer selector. We should probably rebrand installable languages to installable packages or something.
What if we extend your idea about installable lemmatizers even further? Since some people want to run LinguaCafe in constrained environments, what if:
This can be done! My proposal is to preinstall simplemma instead of spacy, so that the image is minimal. It has a low footprint and runs very fast (as can be seen from the table in my previous message – I added simplemma there) – a good fit for a Raspberry Pi. If the user selects enhanced models, spacy or stanza is installed using uv, which installs stanza in just a few seconds (5 seconds on my machine)! This is just one extra call to uv to install the package in the venv. I created four venvs using uv: empty, simplemma, spacy, and stanza. Here is what I got: pytorch takes the most space, as @sergiolaverde0 expected. Here is the showcase of how fast uv is:
Docker image building with uv becomes much, much faster, and here is the footprint:
PS: please have a look at "UPD" in my previous message: stanza on Python 3.12 is quite a bit faster than on Python 3.9.
Hi, I have a few questions, hoping not to derail this too much:
About using uv: I see. I also want to remark:
I will comment on it more later; just a few quick comments from my phone.
I want to check other non-spacy tokenizers as well and compare the sizes. I think Spacy is a good default option based on its size, and if there's another, smaller tokenizer for Vietnamese, I would prefer that instead of Stanza. There is also the option of using the Spacy multilingual model and a simple lemmatizer together (see the sketch after this message); it would be a really good and easy solution for Czech and Latin lemma support. We could replace spacy for most languages with a simple lemmatizer, but there are 3 points to keep in mind:
I am thinking about it. I have no strong opinions about it, but I feel like using Spacy is a good default option when available. The importance of tokenization speed will decrease in the future, because I want to make a queue for tokenization, and users will be able to start reading after the first chapter is finished.
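A minimal sketch of that "Spacy multilingual model plus simple lemmatizer" combination, using simplemma with Czech as an example (the sentence and language code are just illustrations):

```python
import simplemma
import spacy

# Tokenize with spacy's blank multilingual pipeline (no lemmatizer),
# then lemmatize each token with simplemma.
nlp = spacy.blank("xx")
doc = nlp("Kočky sedí na okenní římse.")
for token in doc:
    print(token.text, "->", simplemma.lemmatize(token.text, lang="cs"))
```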
Thank you! I can do testing and support users for this feature. Also, I do not think
I'll try it out sometime.
In that case I am open to adding Stanza as an additional option, for at least Finnish. If it goes well, I think we can add more languages and Stanza tokenizers. I will do everything on the Laravel and front-end side, and can also do Python if needed. (Honestly, I am a bit worried about having parts of the code that I don't test/support completely.)
Currently I think the only thing needed on the Python/Docker side is to make it installable like the other language packages. I will experiment with simplemma for Czech, Latin, Ukrainian and Welsh in the future. It also has Hindi, which was a requested language. I've wanted to split up tokenizer.py for a while, because it keeps growing. Now it will be kind of necessary. Currently it should have 3 files: tokenizer, import and models (I'm not sure if this one can be separated). I will probably do it for v0.13 or v0.14. It might take a while for me to do my part; I will be working a bit less on LinguaCafe, and will work on parts of it that I want to use, because I feel a bit burned out. And thank you so much for working on this! Both Stanza and Simplemma are great tools for tokenizing; I didn't even know about them.
I did some really quick mockups last night and was able to reduce it to 1.81 GB by changing the base image. I will add the first change for v0.13 regardless of what happens with the tokenizers, because there's no reason not to. While doing this I realized we can use Python 3.12 regardless of anything, so sorry for wasting your time with that pointless inquiry. Later I will test how the size evolves as I replace more and more languages with the Stanza variants, and check if I can shrink PyTorch more.
Did you mean replacing them to test the image size, or did you mean you will replace all spacy packages with stanza? Edit: I think it was the former. I'm a bit slow today and was confused.
```diff
@@ -22,6 +24,8 @@ RUN pip install -U --no-cache-dir \
     bottle \
     #spacy
     spacy \
+    #stanza integration for spacy
+    stanza \
```
spacy_stanza should be installed as well.
I propose to remove it, since there are at least two issues with the spacy_stanza library that I bumped into:
- Multi-word token expansion issue, misaligned tokens --> failed NER (German): explosion/spacy-stanza#70. As I understand it, this affects quality because of imprecise tokenization. Also, it generates verbose output I could not suppress.
- The pinned stanza version is not the latest.
Also, doing lemmatization using stanza directly is straightforward, see #255 (comment).
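For illustration, a minimal sketch of the direct-stanza approach (the sentence is just an example; processors are limited to what lemmatization needs):

```python
import stanza

stanza.download("fi", processors="tokenize,mwt,lemma")
nlp = stanza.Pipeline("fi", processors="tokenize,mwt,lemma")
for sent in nlp("Kissat istuivat ikkunalla.").sentences:
    for word in sent.words:
        print(word.text, "->", word.lemma)
```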
The code in this PR should be changed a bit to make it work. Currently it is broken, since I wanted to check the image size and did not care about a usable LinguaCafe at this stage.
Replace where possible, to see the image size I would end up with, and also because the easy way to map the models to a language for testing was to ditch the Spacy counterparts anyway. And after doing so, to see if space savings from shared dependencies could shrink this image, I found that:
I'm now about to test that languages other than Finnish actually work, as in, check that they actually load and can tokenize a paragraph. I will be grateful for any help, so I have built a test on my fork; pull it with

If the decrease in performance is not that big of a deal, if the rest of the languages can be worked around to be used, and if they all follow the pattern of using less RAM than their Spacy counterparts, I can vouch for this to be our new default. But those ifs are doing quite the heavy lifting.

Edit: And yes, I'm shying away from compiling PyTorch until we exhaust alternatives; today I saw their
A few things to keep in mind regarding replacing default tokenizers with stanza:
Thank you for the tests! This is a VERY detailed list. I'll do a few other things this week, but on the weekend or next week I am ready to do my part adding more tokenizers to LinguaCafe, starting with Finnish.
You are welcome! It was fun and educational to work on this benchmark. Here is the release with results in CSV format: https://github.com/rominf/spacy-vs-stanza/releases/tag/v0.1.0. Feel free to ask any questions, but please note that from 14:00 UTC today until the morning of May 27th (so, ~one week), I will be offline.
I did some rough DA and in summary:
I have been thinking about this and I'm still unsure how to go about it. I think switching any of the existing tokenizers is technically a breaking change, since the same phrase could get tokenized differently if there was a switch between importing two texts, and I don't think the database would be happy with that. Re-tokenizing everything when "upgrading" to a bigger model is not a good idea either, so the potential use case for Simplemma as a test bed for people to mess around with until they install the "main" model is probably better discarded. For Finnish specifically, we could make it work as an extra language and remove it from the base image, but then it would be a gargantuan extra model of 2 GB. And there is still the matter of the languages exclusive to Stanza. Thoughts?
It probably wouldn't be a problem, except for Japanese, Chinese and Thai.
I think the best option is to keep the current spacy languages as default, and add Stanza for Finnish as an optional installable language, probably with the option to choose between Spacy and Stanza on the admin page, so already existing users won't have to install a 2 GB extra package. For new languages that spacy doesn't have, I think having stanza as an option for 20+ extra languages is great! Other tokenizers would probably take up a lot of space as well (I have no idea, just guessing). (Currently requested languages that I want to add next: Vietnamese, Tagalog, Swahili and Hindi.) If we decide to add switchable tokenizers, maybe we should also rethink installable languages, and rename them to installable packages or something similar. I've got some tools linked like #280 and some TTS libraries that I don't think we should add to the main package either. (I don't plan on adding any of them anytime soon.) (Sorry if I wrote something confusing, I wrote this message very late.)
I am currently working on the importing system. After that (2-3 weeks) I want to work on stanza. Would someone like to do the Python side? (I would still do it myself if not.)
I can do the Python side; I will have a bit more spare time starting this week. Adding the option to install the extra-extra-large Finnish model with PyTorch I can probably implement in one hour.
I think we should integrate it into the current API calls. If I remember correctly, I currently just receive a simple array of the installed languages. I think we should change it to look like this:
This way users can install multiple tokenizers. For the install function we could use/pass a language and a tokenizer POST variable. I am currently working on the tokenizer.py file. I'll rewrite the tokenizers themselves if you want, or you can do it after I am finished. I'm changing a lot of things, so there would probably be a ton of conflicts. But I'm not touching the model functions, so you could work on those, save them somehow, and after I'm done you could just apply/copy-paste the functions into the latest tokenizer.py file. I probably won't have time to work on this from Monday to Friday, and I think I will be finished with tokenizer.py today or tomorrow.
Is that really practical, given that the spacy model for Finnish would always be present? Currently we only check for the extra languages, so that a default install returns an empty object.
Sorry, I made a mistake, this is what I meant:
We should only handle extra languages in Python; I'll handle the selection between stanza and spacy on the webserver side. But there will be languages where multiple packages other than tokenizers can be installed in the future, which I think should be handled with the same function. So maybe it would look like this:
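A hypothetical sketch of that structure, as the Python side might return it (every language and package name here is invented for illustration):

```python
# Maps each extra language to the packages installed for it.
installed_packages = {
    "finnish": ["stanza"],
    "czech": ["simplemma"],
    "japanese": ["text-to-speech"],
}
```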
We could use the current ["plain", "array"] format, but it would be more complicated on the webserver side to combine the array I get from Python with the config file of already installed languages and tokenizers, because for selecting tokenizers I will have to use an array structure like the one above. Please feel free to ask anything or suggest another method. I'm not sure if I explained it correctly.
@sergiolaverde0 I think I am done working on the tokenizer.py file (99%). The latest version is in the feature/websockets-vue branch. I will merge this into dev probably next week, but maybe even today.
I've mostly finished working on job queues. There are a few small tasks left; I plan to work on stanza next weekend.
Soo, I am done with adding job queues, and I am ready to test and implement stanza (while also working on smaller features). What image/repo should I test?
I have noticed a non-trivial issue, which is that
EDIT: Never mind: while writing this I found the documentation to override that location. I haven't been able to do much more; the whole process has taken nearly two hours now. Sometimes I hate the Python dependency resolving system.
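For reference, a sketch of that override, using stanza's model_dir/dir options (the directory path is just an example):

```python
import stanza

# Download into, and later load from, a custom directory instead of
# the default ~/stanza_resources.
stanza.download("fi", model_dir="/srv/linguacafe/stanza_resources")
nlp = stanza.Pipeline("fi", dir="/srv/linguacafe/stanza_resources")
```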
That's fine. Please take your time. It's okay even if it takes months, or if you end up stopping work on it. You have helped so much with this project; I don't want anyone to feel bad about working on LinguaCafe. I'm not sure if I will start stanza tomorrow; maybe I'll just write the manual for the queue and some small things.
Alright, we will have to do some hoop jumping for the Stanza part specifically. For some reason doing
Since the process is so different from our existing packages, the code will lose a bit of readability and most of it won't be reused, but it is also cleaner than anything else I can think of right now. Thoughts?
Sounds good; at least I cannot think of a better alternative. I think after it's done, we should split the code up into multiple parts.
@sergiolaverde0 Have you tried downloading some other Stanza models? I downloaded dozens of Stanza models and never ran into the issue of
There are no bug reports about slow downloads in the issues repository: https://github.com/stanfordnlp/stanza/issues. I do not know why it took two hours on your machine, but I suppose it might be something unrelated. Is it possible to replace the download mechanism altogether? Spacy distributes model wheels via GitHub releases. If Stanza models were packaged as wheels, they wouldn't fit into PyPI: https://pypi.org/help/#file-size-limit, https://pypi.org/help/#project-size-limit. PyTorch uses its own PyPI index. Apparently, there is no automation for uploading Stanza models to HuggingFace. It is possible to create a GitHub repository similar to https://github.com/explosion/spacy-models, with GitHub Actions that would periodically check for updates in the https://github.com/stanfordnlp/stanza-resources repository. In case some file was updated, the script running in GitHub Actions would download the model from HuggingFace, build the wheel, and create a release with the wheel in the same GitHub repository. Then, a PyPI index would be updated and served via GitHub Pages (see https://github.com/astariul/github-hosted-pypi, which partially implements what I mean). Here is a proof of concept:
See https://github.com/rominf/test-github-hosted-pypi and https://github.com/rominf/test-stanza-models/releases/tag/stanza-fi-1.8.0. Dependencies between wheels can be modeled by processing the Stanza resources metadata. Later, this might be contributed back to the main Stanza organization. Please ping me if you want to see this solution implemented or if I can contribute in some other way.
I will probably file a bug report, seeing how neither my network nor my CPU were actually stressed at any point yet:
This would give us a clean
It would be nice if upstream took it, but I'm not sure if this allows packaging all the different combinations of models they provide, since some languages have multiple pipelines. For our partial redistribution, however, it is just perfect.
@sergiolaverde0 I will answer in two parts for better structuring and a faster response.
It looks very strange to me. I suggest putting the same command into a script and executing it with some profiler to understand why it takes so long. I have had a great experience with Scalene: https://github.com/plasma-umass/scalene (it is easy to use and has low overhead). Maybe the reason will become evident; otherwise it will probably be helpful for filing an issue anyway. Also, it is probably enough to keep just the tokenizer model; maybe it will still reproduce the bug, but work a bit faster.
@sergiolaverde0 Now, regarding distributing models as wheels... Firstly, regarding your second question, about whether it is possible to package all the different combinations of models provided by Stanza: I did some manual packaging work to understand how it should look. You can take a look at https://github.com/rominf/test-stanza-models and https://rominf.github.io/test-github-hosted-pypi/. To install the default package for Finnish (excluding depparse and ner), run:
To install a specific model, run something like:
Packages might be packaged as
Unfortunately, the current
For now, to load
In case Stanza won't accept models as wheel distributions, these arguments can be generated with a helper function (see the sketch after this message). Secondly, speaking of the maintenance burden, I would not worry about it much: if everything is automated (which is what I want), it does not matter how many Stanza languages need to be packaged or how often they update. Maintenance would be required for bug fixes, for merging Dependabot updates in case the project has some affected dependencies, and in case Stanza dramatically changes its packaging (like changing the file structure of the models).
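A sketch of what such a helper might look like, assuming models ship inside wheels under package names following the scheme discussed here (the package and file names are hypothetical; stanza's per-processor model-path options and download_method exist, but this exact combination is untested):

```python
from importlib import resources

import stanza

def wheel_model_path(package: str, filename: str) -> str:
    # Resolve the on-disk path of a model file shipped inside an
    # installed wheel (package and file names are hypothetical).
    return str(resources.files(package) / filename)

nlp = stanza.Pipeline(
    lang="fi",
    processors="tokenize,lemma",
    download_method=None,  # never touch the network
    tokenize_model_path=wheel_model_path("rominf_stanza_fi_tokenize_tdt", "tdt.pt"),
    lemma_model_path=wheel_model_path("rominf_stanza_fi_lemma_tdt_nocharlm", "tdt_nocharlm.pt"),
)
```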
I wonder if these could be merged together, so that the second is simply the first one plus some optional dependencies, along the lines of the sketch below:
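A hypothetical setuptools sketch of that merged scheme (package names invented, following the rominf-stanza-fi naming used in this thread; a bare install would pull only the tokenizer, with extras adding the rest):

```python
from setuptools import setup

setup(
    name="rominf-stanza-fi",
    version="1.8.0",
    # Bare install: only the tokenizer model.
    install_requires=["rominf-stanza-fi-tokenize-tdt"],
    extras_require={
        # Default pipeline on top of the tokenizer.
        "default": [
            "rominf-stanza-fi-mwt-tdt",
            "rominf-stanza-fi-lemma-tdt-nocharlm",
            "rominf-stanza-fi-pos-tdt-charlm",
            "rominf-stanza-fi-depparse-tdt-charlm",
        ],
        # Alternative FTB treebank pipeline.
        "ftb": [
            "rominf-stanza-fi-tokenize-ftb",
            "rominf-stanza-fi-mwt-ftb",
            "rominf-stanza-fi-pos-ftb-charlm",
            "rominf-stanza-fi-lemma-ftb-nocharlm",
        ],
    },
)
```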
This will keep most of the flexibility of the current system, while improving on the packaging. Different treebanks for the same language will probably require separate packages, so Italian, for example, would have:
I don't think this can be dodged, but it is not too big of a deal.
I don't think this is that big of a deal; we can probably download it if it is missing, as a temporary solution that could become permanent. At least that is what I intended to do before. For upstream, this would be a breaking change, but they might deem it worth it because of the improved release workflow. One little regression I can think of is that models couldn't be so easily isolated like they are nowadays by setting
Here is the packaging scheme I propose, visually explained. Note that I group models by packages as in Stanza, not by treebanks (these concepts are different):

```mermaid
graph LR;
FI_MODEL_BACKWARD_CHARLM_CONNLL17_PIP[pip install rominf-stanza-fi-model-backward-charlm-conll17]-.-FI_MODEL_BACKWARD_CHARLM_CONNLL17_FILE[~/stanza_resources/fi/backward_charlm/conll17.pt];
FI_MODEL_DEPPARSE_FTB_CHARLM_PIP[pip install rominf-stanza-fi-model-depparse-ftb-charlm]-.-FI_MODEL_DEPPARSE_FTB_CHARLM_FILE[~/stanza_resources/fi/depparse/ftb_charlm.pt];
FI_MODEL_DEPPARSE_TDT_CHARLM_PIP[pip install rominf-stanza-fi-model-depparse-tdt-charlm]-.-FI_MODEL_DEPPARSE_TDT_CHARLM_FILE[~/stanza_resources/fi/depparse/tdt_charlm.pt];
FI_MODEL_DEPPARSE_TDT_NOCHARLM_PIP[pip install rominf-stanza-fi-model-depparse-tdt-nocharlm]-.-FI_MODEL_DEPPARSE_TDT_NOCHARLM_FILE[~/stanza_resources/fi/depparse/tdt_nocharlm.pt];
FI_MODEL_DEPPARSE_TDT_NOCHARLM_PIP-->FI_MODEL_PRETRAIN_CONNLL17_PIP;
FI_MODEL_FORWARD_CHARLM_CONNLL17_PIP[pip install rominf-stanza-fi-model-forward-charlm-conll17]-.-FI_MODEL_FORWARD_CHARLM_CONNLL17_FILE[~/stanza_resources/fi/forward_charlm/conll17.pt];
FI_MODEL_LEMMA_FTB_NOCHARLM_PIP[pip install rominf-stanza-fi-model-lemma-ftb-nocharlm]-.-FI_MODEL_LEMMA_FTB_NOCHARLM_FILE[~/stanza_resources/fi/lemma/ftb_nocharlm.pt];
FI_MODEL_LEMMA_TDT_NOCHARLM_PIP[pip install rominf-stanza-fi-model-lemma-tdt-nocharlm]-.-FI_MODEL_LEMMA_TDT_NOCHARLM_FILE[~/stanza_resources/fi/lemma/tdt_nocharlm.pt];
FI_MODEL_MWT_TDT_PIP[pip install rominf-stanza-fi-model-mwt-tdt]-.-FI_MODEL_MWT_TDT_FILE[~/stanza_resources/fi/mwt/tdt.pt];
FI_MODEL_MWT_FTB_PIP[pip install rominf-stanza-fi-model-mwt-ftb]-.-FI_MODEL_MWT_FTB_FILE[~/stanza_resources/fi/mwt/ftb.pt];
FI_MODEL_NER_TURKU_PIP[pip install rominf-stanza-fi-model-ner-turku]-.-FI_MODEL_NER_TURKU_FILE[~/stanza_resources/fi/ner/turku.pt];
FI_MODEL_POS_FTB_CHARLM_PIP[pip install rominf-stanza-fi-model-pos-ftb-charlm]-.-FI_MODEL_POS_FTB_CHARLM_FILE[~/stanza_resources/fi/pos/ftb_charlm.pt];
FI_MODEL_POS_FTB_CHARLM_PIP-->FI_MODEL_PRETRAIN_CONNLL17_PIP;
FI_MODEL_POS_FTB_CHARLM_PIP-->FI_MODEL_FORWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_POS_FTB_CHARLM_PIP-->FI_MODEL_BACKWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_POS_TDT_NOCHARLM_PIP[pip install rominf-stanza-fi-model-pos-tdt-nocharlm]-.-FI_MODEL_POS_TDT_NOCHARLM_FILE[~/stanza_resources/fi/pos/tdt_nocharlm.pt];
FI_MODEL_POS_TDT_NOCHARLM_PIP-->FI_MODEL_PRETRAIN_CONNLL17_PIP;
FI_MODEL_POS_TDT_CHARLM_PIP[pip install rominf-stanza-fi-model-pos-tdt-charlm]-.-FI_MODEL_POS_TDT_CHARLM_FILE[~/stanza_resources/fi/pos/tdt_charlm.pt];
FI_MODEL_POS_TDT_CHARLM_PIP-->FI_MODEL_PRETRAIN_CONNLL17_PIP;
FI_MODEL_POS_TDT_CHARLM_PIP-->FI_MODEL_FORWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_POS_TDT_CHARLM_PIP-->FI_MODEL_BACKWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_PRETRAIN_CONNLL17_PIP[pip install rominf-stanza-fi-model-pretrain-conll17]-.-FI_MODEL_PRETRAIN_CONNLL17_FILE[~/stanza_resources/fi/pretrain/conll17.pt];
FI_MODEL_TOKENIZE_TDT_PIP[pip install rominf-stanza-fi-model-tokenize-tdt]-.-FI_MODEL_TOKENIZE_TDT_FILE[~/stanza_resources/fi/tokenize/tdt.pt];
FI_MODEL_TOKENIZE_FTB_PIP[pip install rominf-stanza-fi-model-tokenize-ftb]-.-FI_MODEL_TOKENIZE_FTB_FILE[~/stanza_resources/fi/tokenize/ftb.pt];
FI_PACKAGE_DEFAULT_PIP[pip install rominf-stanza-fi-package-default]-->FI_MODEL_TOKENIZE_TDT_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_MWT_TDT_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_LEMMA_TDT_NOCHARLM_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_POS_TDT_CHARLM_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_DEPPARSE_TDT_CHARLM_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_NER_TURKU_PIP;
FI_PACKAGE_DEFAULT_FAST_PIP[pip install rominf-stanza-fi-package-default-fast]-->FI_MODEL_TOKENIZE_TDT_PIP;
FI_PACKAGE_DEFAULT_FAST_PIP-->FI_MODEL_MWT_TDT_PIP;
FI_PACKAGE_DEFAULT_FAST_PIP-->FI_MODEL_LEMMA_TDT_NOCHARLM_PIP;
FI_PACKAGE_DEFAULT_FAST_PIP-->FI_MODEL_POS_TDT_NOCHARLM_PIP;
FI_PACKAGE_DEFAULT_FAST_PIP-->FI_MODEL_DEPPARSE_TDT_NOCHARLM_PIP;
FI_PACKAGE_DEFAULT_FAST_PIP-->FI_MODEL_NER_TURKU_PIP;
FI_PACKAGE_FTB_PIP[pip install rominf-stanza-fi-package-ftb]-->FI_MODEL_TOKENIZE_FTB_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_MWT_FTB_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_POS_FTB_CHARLM_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_LEMMA_FTB_NOCHARLM_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_DEPPARSE_FTB_CHARLM_PIP;
FI_PIP[pip install rominf-stanza-fi]-->FI_PACKAGE_DEFAULT_PIP;
```
It looks like you are proposing using extras. I thought about leveraging them as well, though I was thinking about a different packaging scheme (I intended them to specify Stanza packages):

```mermaid
graph LR;
FI_MODEL_BACKWARD_CHARLM_CONNLL17_PIP[pip install rominf-stanza-fi-backward-charlm-conll17]-.-FI_MODEL_BACKWARD_CHARLM_CONNLL17_FILE[~/stanza_resources/fi/backward_charlm/conll17.pt];
FI_MODEL_DEPPARSE_FTB_CHARLM_PIP[pip install rominf-stanza-fi-depparse-ftb-charlm]-.-FI_MODEL_DEPPARSE_FTB_CHARLM_FILE[~/stanza_resources/fi/depparse/ftb_charlm.pt];
FI_MODEL_DEPPARSE_TDT_CHARLM_PIP[pip install rominf-stanza-fi-depparse-tdt-charlm]-.-FI_MODEL_DEPPARSE_TDT_CHARLM_FILE[~/stanza_resources/fi/depparse/tdt_charlm.pt];
FI_MODEL_FORWARD_CHARLM_CONNLL17_PIP[pip install rominf-stanza-fi-forward-charlm-conll17]-.-FI_MODEL_FORWARD_CHARLM_CONNLL17_FILE[~/stanza_resources/fi/forward_charlm/conll17.pt];
FI_MODEL_LEMMA_FTB_NOCHARLM_PIP[pip install rominf-stanza-fi-lemma-ftb-nocharlm]-.-FI_MODEL_LEMMA_FTB_NOCHARLM_FILE[~/stanza_resources/fi/lemma/ftb_nocharlm.pt];
FI_MODEL_LEMMA_TDT_NOCHARLM_PIP[pip install rominf-stanza-fi-lemma-tdt-nocharlm]-.-FI_MODEL_LEMMA_TDT_NOCHARLM_FILE[~/stanza_resources/fi/lemma/tdt_nocharlm.pt];
FI_MODEL_MWT_TDT_PIP[pip install rominf-stanza-fi-mwt-tdt]-.-FI_MODEL_MWT_TDT_FILE[~/stanza_resources/fi/mwt/tdt.pt];
FI_MODEL_MWT_FTB_PIP[pip install rominf-stanza-fi-mwt-ftb]-.-FI_MODEL_MWT_FTB_FILE[~/stanza_resources/fi/mwt/ftb.pt];
FI_MODEL_NER_TURKU_PIP[pip install rominf-stanza-fi-ner-turku]-.-FI_MODEL_NER_TURKU_FILE[~/stanza_resources/fi/ner/turku.pt];
FI_MODEL_POS_FTB_CHARLM_PIP[pip install rominf-stanza-fi-pos-ftb-charlm]-.-FI_MODEL_POS_FTB_CHARLM_FILE[~/stanza_resources/fi/pos/ftb_charlm.pt];
FI_MODEL_POS_FTB_CHARLM_PIP-->FI_MODEL_PRETRAIN_CONNLL17_PIP;
FI_MODEL_POS_FTB_CHARLM_PIP-->FI_MODEL_FORWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_POS_FTB_CHARLM_PIP-->FI_MODEL_BACKWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_POS_TDT_CHARLM_PIP[pip install rominf-stanza-fi-pos-tdt-charlm]-.-FI_MODEL_POS_TDT_CHARLM_FILE[~/stanza_resources/fi/pos/tdt_charlm.pt];
FI_MODEL_POS_TDT_CHARLM_PIP-->FI_MODEL_PRETRAIN_CONNLL17_PIP;
FI_MODEL_POS_TDT_CHARLM_PIP-->FI_MODEL_FORWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_POS_TDT_CHARLM_PIP-->FI_MODEL_BACKWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_PRETRAIN_CONNLL17_PIP[pip install rominf-stanza-fi-pretrain-conll17]-.-FI_MODEL_PRETRAIN_CONNLL17_FILE[~/stanza_resources/fi/pretrain/conll17.pt];
FI_MODEL_TOKENIZE_TDT_PIP[pip install rominf-stanza-fi-tokenize-tdt]-.-FI_MODEL_TOKENIZE_TDT_FILE[~/stanza_resources/fi/tokenize/tdt.pt];
FI_MODEL_TOKENIZE_FTB_PIP[pip install rominf-stanza-fi-tokenize-ftb]-.-FI_MODEL_TOKENIZE_FTB_FILE[~/stanza_resources/fi/tokenize/ftb.pt];
FI_PACKAGE_DEFAULT_PIP["pip install rominf-stanza-fi[default]"]-->FI_MODEL_TOKENIZE_TDT_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_MWT_TDT_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_LEMMA_TDT_NOCHARLM_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_POS_TDT_CHARLM_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_DEPPARSE_TDT_CHARLM_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_NER_TURKU_PIP;
FI_PACKAGE_FTB_PIP["pip install rominf-stanza-fi[ftb]"]-->FI_MODEL_TOKENIZE_FTB_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_MWT_FTB_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_POS_FTB_CHARLM_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_LEMMA_FTB_NOCHARLM_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_DEPPARSE_FTB_CHARLM_PIP;
FI_PIP[pip install rominf-stanza-fi]-->FI_PACKAGE_DEFAULT_PIP;
```
Unfortunately, this does not work since it is not possible to make
As for downloading: I found a way to not require any Stanza changes to support wheel-packaged models at all. From a user perspective:
If model packages are installed with the --target option:
Please let me know what you think.
Looks solid, but not very pretty. Upstream might complain about the names. I wonder if a hybrid solution is possible, where each "base" package uses normal naming but down the line we still use extras, ending with something like:
Otherwise, this is as ready for upstream as it will get without their feedback on it.
Thanks for your answer. Since the user might omit specifying extras, what will:
install? As a user, I would probably expect
install then? Variations of this question apply to all usages of extras. Also, is the last dash intentional here: |
That part is what I haven't decided on yet. At first I assumed that no extras would mean only the tokenizer, since it makes sense as a minimalist default for our use case. But I am not so sure that applies to upstream too, and you are right that it is counterintuitive with the current naming scheme. And the last dash was a typo; don't think too hard about it.

Edit: I just remembered why I took the tokenizer as a sane default: it is the only processor without other requirements, and it is instead a requirement for the rest.
Hi! I've just reread the last part of this thread after months, and it seems like it would be way too complicated for me to handle. Can you please help me with a few questions:
Hope you are doing well, and having a nice holiday!
Hi @simjanos-dev, Looking back, I see that I was stuck on an over-engineered solution. Feel free to discard everything related to Python packaging. Your suggestion makes total sense. Happy holidays!
This PR is a request for comments about using a stanza model for Finnish and is not meant to be merged in its current state, hence it is a draft.
Unfortunately, Finnish lemmatization is not very accurate. I ran a slightly updated benchmark, https://github.com/aajanki/finnish-pos-accuracy, and found that the spacy lemmatization model used in LinguaCafe has F1=0.842, whereas the default stanza model for Finnish gives F1=0.958. I tried to use stanza with the https://github.com/explosion/spacy-stanza adapter (see the PR code). It works. Also, the code changes are generalizable to other languages (stanza supports over 70 languages).
There is a huge downside though: the size of the resulting docker image, which is mostly because of the NVIDIA drivers, I guess, which are automatically downloaded with the pytorch installation.
In conclusion, it is possible to significantly increase accuracy for Finnish (and probably some other languages) without increasing code complexity, at the cost of image size.
What do you think about Finnish lemmatization accuracy and introducing stanza?
Before (lemma is whole word – incorrect):
After (lemma is correct):