RFC: Use stanza model for Finnish #255
base: dev
Conversation
Oh wow, this looks great! I didn't know about this. I would love to add it. We actually have a language install system, so the image size would not increase; it would only take up space for users who actually use this language. Does this require a GPU? Can you please test what the size would be without the NVIDIA driver? My only problem with it would be GPU dependence, plus my laptop is probably too weak to test this. After adding the 2 missing Spacy languages, my plan was to use different tokenizers, so it would be VERY useful if I could keep using Spacy for more languages. Thank you so much for working on this! @sergiolaverde0 You may be interested in this.
@simjanos-dev I am so glad you liked it! Yes, it works without a GPU: I just added installation of the CPU version of PyTorch.
What are my next actions? Fix the documentation (add references to stanza in all places where spacy is mentioned, write proper commit and PR messages) and undraft the PR, or is there something else that needs to be done?
```diff
@@ -8,6 +8,8 @@ RUN apt-get update -y \
     && apt-get clean \
     && rm -rf /var/lib/apt/lists/*
+
+RUN pip install torch --index-url https://download.pytorch.org/whl/cpu
```
A comment about stanza and NVIDIA drivers is needed here.
Looking at the URL for the PyTorch install, this doesn't need a GPU, since it uses CPU as the computing platform. I heard we can reduce the size of that install by compiling PyTorch from source without the unnecessary features, but I haven't done it before and I don't know by how much we can cut it. If the accuracy increase is noticeable for enough languages, maybe we should consider the possibility of making it the default. I'm concerned with performance when using CPU, so that's another thing to check. I see a future here, but it will take effort.
My enthusiasm has dropped a lot; I thought it would be much smaller. The model size is still huge compared to the 20-50 MB models we used before. A few more questions:
I don't think I want to do that. Some users already had issues with RAM. I've seen attempted installs on Raspberry Pis, small free-tier hosted servers and old laptops. I myself have an old laptop, and I also want to host LinguaCafe on a VPS in the future and try to optimize it. I would rather make LinguaCafe smaller by default if possible. However, I definitely want to add these models as an option as well.
I'm not sure; I will need some time to figure out what I would like to do. I will more than likely have a problem with testing this myself. Since this is only needed for lemmas (except for languages that have no spaces or have readings), what if we used a huge amount of text and generated a list of lemmas that we would use for LinguaCafe? For most languages, that is the only value added by using another model or tokenizer instead of the multilingual Spacy one. Two other options would be: adding them as extra installable languages like "Finnish (large)", or adding an API that lets people use other tokenizers. It would be easy to copy the current Python container, modify it and add different models. What do you think?
Well, seeing how we have already more or less frozen the features for v12.0, and since I have an assignment for this weekend, I suggest giving it some time. Next week I will try to compile PyTorch from source and see what the absolute minimum size would be, so we can make a better informed decision. For the time being, the option for larger models is my favourite. @rominf sorry if we take our time, but Simjanos is right to be concerned about the accessibility of the hardware requirements.
Implementing this will definitely take a lot of time. I want to add everything to LinguaCafe that I can, but I can't do it at the rate requests are coming in. It's been insane progress in the last 4 months since release.
Please take your time! I will post my results, so that you have some food for thought meanwhile. You are right about the importance of accessible hardware requirements: my mistake, I was not thoughtful about this. I will write about Finnish only, since I have not tried lemmatization in other languages. Stanza language support is split into multiple models; for lemmatization, only a subset of them is needed. My PC info:
Here are the results of lemmatization on the Universal Dependencies treebank:
I also measured RAM usage during lemmatization of Alice in Wonderland in Finnish using scalene. Here is the script:
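A minimal sketch of such a script, assuming a local plain-text copy of the book (the file name is invented) and the no-pos processor set described below:

```python
# lemmatize_alice.py - profile with: scalene lemmatize_alice.py
import stanza

# Lemmatization-only pipeline: tokenize, mwt, lemma (no pos processor,
# matching the "without pos" setup this benchmark describes).
nlp = stanza.Pipeline("fi", processors="tokenize,mwt,lemma", verbose=False)

with open("alice_in_wonderland_fi.txt", encoding="utf-8") as f:
    text = f.read()

doc = nlp(text)
lemmas = [word.lemma for sent in doc.sentences for word in sent.words]
print(f"{len(lemmas)} lemmas produced")
```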
Results:
As you can notice from the script, I don't use the pos processor. This is the size of the image now (without pos):
To sum up: stanza without the pos processor is a bit more accurate on Finnish than spacy and takes significantly less disk space and RAM, but is much slower. Stanza with the pos processor is much more accurate on Finnish than spacy, but takes significantly more disk space and RAM and is tremendously slower. The proposal about having multiple variants of a language is my favorite as well! Do you want me to do a benchmark of spacy vs stanza for other languages? UPD: added results for Python 3.12 for the Alice in Wonderland test.
This is a really detailed test report, thank you so much!
Wow. I have an i5-8250U and 8 GB of RAM.
I think we should go with that as well to provide the best experience possible. At first I was thinking about it the wrong way. My first idea was to have multiple languages for different tokenizers, but I realized it would be extremely difficult to implement, since language names are used in a ton of places. It is, however, reasonably simple to switch tokenizers. So we can just make the tokenizer selectable on the admin page without separating them into their own language.
I'm mostly interested in whether, if we add multiple languages, the additional disk space required would decrease due to shared dependencies. I think the latest 2 GB disk size you commented is very reasonable to be added as an option. But if the models themselves are so small, is there any way to decrease the disk space further? Can we remove Spacy and use Stanza by itself to save space? I know it returns a different format, but I can write a different tokenizer function for it. The tokenizer is quite a bit slower, but the PHP side of processing the text takes time as well, so it might not be that much of an issue; plus, users can decide which one they want to use. I would like to help implement this, but I won't be able to provide testing, or support for users who will have issues with it, because my laptop would die trying to run this. I will think about how to implement a tokenizer selector. We should probably rebrand installable languages to installable packages or something.
What if we extend your idea about installable lemmatizers even further? Since some people want to run LinguaCafe in constrained environments, what if:
This can be done! My proposal is to preinstall simplemma instead of spacy, so that the image is minimal. It has a low footprint and runs very fast (as can be seen from the table in my previous message – I added simplemma there) – a good fit for a Raspberry Pi. If the user selects enhanced models, spacy or stanza is installed using uv, which installs stanza in just a few seconds (5 seconds on my machine)! This is just one extra call to uv to install the package in the venv. I created four venvs using uv: empty, simplemma, spacy, and stanza. Here is what I got: pytorch takes the most space, as @sergiolaverde0 expected. Here is the showcase of how fast uv is:
Docker image building with uv becomes much, much faster, and here is the footprint:
PS: please have a look at "UPD" in my previous message: stanza on Python 3.12 is quite a bit faster than on Python 3.9.
Hi, I have a few questions, hoping not to derail this too much:
About using uv: I see. I also want to remark:
I will comment on it more later; just a few quick comments from my phone.
I want to check other non-spacy tokenizers as well and compare the sizes. I think Spacy is a good default option based on its size, and if there's another, smaller tokenizer for Vietnamese, I would prefer that instead of Stanza. There is also the option of using the Spacy multilingual model and a simple lemmatizer together (see the sketch after this message); it would be a really good and easy solution for Czech and Latin lemma support. We could replace spacy for most languages with a simple lemmatizer, but there are 3 points to keep in mind:
I am thinking about it. I have no strong opinions about it, but I feel like using Spacy is a good default option when available. The importance of tokenization speed will decrease in the future, because I want to make a queue for tokenization, and users will be able to start reading after the first chapter is finished.
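A minimal sketch of that "Spacy multilingual model plus simple lemmatizer" combination, using simplemma with Czech as an example (the sentence and language code are just illustrations):

```python
import simplemma
import spacy

# Tokenize with spacy's blank multilingual pipeline (no lemmatizer),
# then lemmatize each token with simplemma.
nlp = spacy.blank("xx")
doc = nlp("Kočky sedí na okenní římse.")
for token in doc:
    print(token.text, "->", simplemma.lemmatize(token.text, lang="cs"))
```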
Thank you! I can do testing and support users for this feature. Also, I do not think
I'll try it out sometime.
In that case I am open to adding Stanza as an additional option, for at least Finnish. If it goes well, I think we can add more languages and Stanza tokenizers. I will do everything on the Laravel and front-end side, and can also do Python if needed. (Honestly, I am a bit worried about having parts of the code that I don't test/support completely.)
Currently I think the only thing needed on the Python/Docker side is to make it installable like the other language packages. I will experiment with simplemma for Czech, Latin, Ukrainian and Welsh in the future. It also has Hindi, which was a requested language. I've wanted to split up tokenizer.py for a while, because it keeps growing. Now it will be kind of necessary. Currently it should have 3 files: tokenizer, import and models (I'm not sure if this one can be separated). I will probably do it for v0.13 or v0.14. It might take a while for me to do my part; I will be working a bit less on LinguaCafe, and will work on parts of it that I want to use, because I feel a bit burned out. And thank you so much for working on this! Both Stanza and Simplemma are great tools for tokenizing; I didn't even know about them.
I did some really quick mockups last night and was able to reduce it to 1.81 GB by changing the base image. I will add the first change for v0.13 regardless of what happens with the tokenizers, because there's no reason not to. While doing this I realized we can use Python 3.12 regardless of anything, so sorry for wasting your time with that pointless inquiry. Later I will test how the size evolves as I replace more and more languages with the Stanza variants, and check if I can shrink PyTorch more.
Did you mean replacing them to test the image size, or did you mean you will replace all spacy packages with stanza? Edit: I think it was the former. I'm a bit slow today and was confused.
```diff
@@ -22,6 +24,8 @@ RUN pip install -U --no-cache-dir \
     bottle \
     #spacy
     spacy \
+    #stanza integration for spacy
+    stanza \
```
spacy_stanza should be installed as well.
I propose to remove it, since there are at least two issues with the spacy_stanza library that I bumped into:
- Multi-word token expansion issue, misaligned tokens --> failed NER (German): explosion/spacy-stanza#70. As I understand it, this affects quality because of imprecise tokenization. Also, it generates verbose output I could not suppress.
- The pinned stanza version is not the latest.
Also, doing lemmatization using stanza directly is straightforward, see #255 (comment).
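For illustration, a minimal sketch of the direct-stanza approach (the sentence is just an example; processors are limited to what lemmatization needs):

```python
import stanza

stanza.download("fi", processors="tokenize,mwt,lemma")
nlp = stanza.Pipeline("fi", processors="tokenize,mwt,lemma")
for sent in nlp("Kissat istuivat ikkunalla.").sentences:
    for word in sent.words:
        print(word.text, "->", word.lemma)
```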
The code in this PR should be changed a bit to make it work. Currently it is broken, since I wanted to check the image size and did not care about a usable LinguaCafe at this stage.
Replace where possible, to see the image size I would end up with, and also because the easy way to map the models to a language for testing was to ditch the Spacy counterparts anyway. And after doing so, to see if space savings from shared dependencies could shrink this image, I found that:
I'm now about to test that languages other than Finnish actually work, as in, check that they actually load and can tokenize a paragraph. I will be grateful for any help, so I have built a test on my fork; pull it with

If the decrease in performance is not that big of a deal, if the rest of the languages can be worked around to be used, and if they all follow the pattern of using less RAM than their Spacy counterparts, I can vouch for this to be our new default. But those ifs are doing quite the heavy lifting.

Edit: And yes, I'm shying away from compiling PyTorch until we exhaust alternatives; today I saw their
A few things to keep in mind regarding replacing default tokenizers with stanza:
Thank you for the tests! This is a VERY detailed list. I'll do a few other things this week, but on the weekend or next week I am ready to do my part adding more tokenizers to LinguaCafe, starting with Finnish.
You are welcome! It was fun and educational to work on this benchmark. Here is the release with results in CSV format: https://github.com/rominf/spacy-vs-stanza/releases/tag/v0.1.0. Feel free to ask any questions, but please note that from 14:00 UTC today until the morning of May 27th (so, ~one week), I will be offline.
I did some rough DA and in summary:
I have been thinking about this and I'm still unsure how to go about it. I think switching any of the existing tokenizers is technically a breaking change, since the same phrase could get tokenized differently if there was a switch between importing two texts, and I don't think the database would be happy with that. Re-tokenizing everything when "upgrading" to a bigger model is not a good idea either, so the potential use case for Simplemma as a test bed for people to mess around with until they install the "main" model is probably better discarded. For Finnish specifically, we could make it work as an extra language and remove it from the base image, but then it would be a gargantuan extra model of 2 GB. And there is still the matter of the languages exclusive to Stanza. Thoughts?
It probably wouldn't be a problem, except for Japanese, Chinese and Thai.
I think the best option is to keep the current spacy languages as default, and add Stanza for Finnish as an optional installable language, probably with the option to choose between Spacy and Stanza on the admin page, so already existing users won't have to install a 2 GB extra package. For new languages that spacy doesn't have, I think having stanza as an option for 20+ extra languages is great! Other tokenizers would probably take up a lot of space as well (I have no idea, just guessing). (Currently requested languages that I want to add next: Vietnamese, Tagalog, Swahili and Hindi.) If we decide to add switchable tokenizers, maybe we should also rethink installable languages, and rename them to installable packages or something similar. I've got some tools linked like #280 and some TTS libraries that I don't think we should add to the main package either. (I don't plan on adding any of them anytime soon.) (Sorry if I wrote something confusing, I wrote this message very late.)
I am currently working on the importing system. After that (2-3 weeks) I want to work on stanza. Would someone like to do the Python side? (I would still do it myself if not.)
I can do the Python side; I will have a bit more spare time starting this week. Adding the option to install the extra-extra-large Finnish model with PyTorch I can probably implement in one hour.
I think we should integrate it into the current API calls. If I remember correctly, I currently just receive a simple array of the installed languages. I think we should change it to look like this:
This way users can install multiple tokenizers. For the install function we could use/pass a language and a tokenizer POST variable. I am currently working on the tokenizer.py file. I'll rewrite the tokenizers themselves if you want, or you can do it after I am finished. I'm changing a lot of things, so there would probably be a ton of conflicts. But I'm not touching the model functions, so you could work on those, save them somehow, and after I'm done you could just apply/copy-paste the functions into the latest tokenizer.py file. I probably won't have time to work on this from Monday to Friday, and I think I will be finished with tokenizer.py today or tomorrow.
Is that really practical, given that the spacy model for Finnish would always be present? Currently we only check for the extra languages, so that a default install returns an empty object.
Sorry, I made a mistake, this is what I meant:
We should only handle extra languages in Python; I'll handle the selection between stanza and spacy on the webserver side. But there will be languages where multiple packages other than tokenizers can be installed in the future, which I think should be handled with the same function. So maybe it would look like this:
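A hypothetical sketch of that structure, as the Python side might return it (every language and package name here is invented for illustration):

```python
# Maps each extra language to the packages installed for it.
installed_packages = {
    "finnish": ["stanza"],
    "czech": ["simplemma"],
    "japanese": ["text-to-speech"],
}
```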
We could use the current ["plain", "array"] format, but it would be more complicated on the webserver side to combine the array I get from Python with the config file of already installed languages and tokenizers, because for selecting tokenizers I will have to use an array structure like the one above. Please feel free to ask anything or suggest another method. I'm not sure if I explained it correctly.
@sergiolaverde0 I think I am done working on the tokenizer.py file (99%). The latest version is in the feature/websockets-vue branch. I will merge this into dev probably next week, but maybe even today.
I've mostly finished working on job queues. There are a few small tasks left; I plan to work on stanza next weekend.
Soo, I am done with adding job queues, and I am ready to test and implement stanza (while also working on smaller features). What image/repo should I test?
I have noticed a non-trivial issue, which is that
EDIT: Never mind: while writing this I found the documentation to override that location. I haven't been able to do much more; the whole process has taken nearly two hours now. Sometimes I hate the Python dependency resolving system.
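For reference, a sketch of that override, using stanza's model_dir/dir options (the directory path is just an example):

```python
import stanza

# Download into, and later load from, a custom directory instead of
# the default ~/stanza_resources.
stanza.download("fi", model_dir="/srv/linguacafe/stanza_resources")
nlp = stanza.Pipeline("fi", dir="/srv/linguacafe/stanza_resources")
```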
That's fine. Please take your time. It's okay even if it takes months, or if you end up stopping work on it. You have helped so much with this project; I don't want anyone to feel bad about working on LinguaCafe. I'm not sure if I will start stanza tomorrow; maybe I'll just write the manual for the queue and some small things.
Alright, we will have to do some hoop jumping for the Stanza part specifically. For some reason doing
Since the process is so different from our existing packages, the code will lose a bit of readability and most of it won't be reused, but it is also cleaner than anything else I can think of right now. Thoughts?
Sounds good; at least I cannot think of a better alternative. I think after it's done, we should split the code up into multiple parts.
@sergiolaverde0 Have you tried downloading some other Stanza models? I downloaded dozens of Stanza models and never ran into the issue of
There are no bug reports about slow downloads in the issues repository: https://github.com/stanfordnlp/stanza/issues. I do not know why it took two hours on your machine, but I suppose it might be something unrelated. Is it possible to replace the download mechanism altogether? Spacy distributes model wheels via GitHub releases. If Stanza models were packaged as wheels, they wouldn't fit into PyPI: https://pypi.org/help/#file-size-limit, https://pypi.org/help/#project-size-limit. PyTorch uses its own PyPI index. Apparently, there is no automation for uploading Stanza models to HuggingFace. It is possible to create a GitHub repository similar to https://github.com/explosion/spacy-models, with GitHub Actions that would periodically check for updates in the https://github.com/stanfordnlp/stanza-resources repository. In case some file was updated, the script running in GitHub Actions would download the model from HuggingFace, build the wheel, and create a release with the wheel in the same GitHub repository. Then, a PyPI index would be updated and served via GitHub Pages (see https://github.com/astariul/github-hosted-pypi, which partially implements what I mean). Here is a proof of concept:
See https://github.com/rominf/test-github-hosted-pypi and https://github.com/rominf/test-stanza-models/releases/tag/stanza-fi-1.8.0. Dependencies between wheels can be modeled by processing the Stanza resources metadata. Later, this might be contributed back to the main Stanza organization. Please ping me if you want to see this solution implemented or if I can contribute in some other way.
I will probably file a bug report, seeing how neither my network nor my CPU were actually stressed at any point yet:
This would give us a clean
It would be nice if upstream took it, but I'm not sure if this allows packaging all the different combinations of models they provide, since some languages have multiple pipelines. For our partial redistribution, however, it is just perfect.
@sergiolaverde0 I will answer in two parts for better structuring and a faster response.
It looks very strange to me. I suggest putting the same command into a script and executing it with some profiler to understand why it takes so long. I have had a great experience with Scalene: https://github.com/plasma-umass/scalene (it is easy to use and has low overhead). Maybe the reason will become evident; otherwise it will probably be helpful for filing an issue anyway. Also, it is probably enough to keep just the tokenizer model; maybe it will still reproduce the bug, but work a bit faster.
@sergiolaverde0 Now, regarding distributing models as wheels... Firstly, regarding your second question, about whether it is possible to package all the different combinations of models provided by Stanza: I did some manual packaging work to understand how it should look. You can take a look at https://github.com/rominf/test-stanza-models and https://rominf.github.io/test-github-hosted-pypi/. To install the default package for Finnish (excluding depparse and ner), run:
To install a specific model, run something like:
Packages might be packaged as
Unfortunately, the current
For now, to load
In case Stanza won't accept models as wheel distributions, these arguments can be generated with a helper function (see the sketch after this message). Secondly, speaking of the maintenance burden, I would not worry about it much: if everything is automated (which is what I want), it does not matter how many Stanza languages need to be packaged or how often they update. Maintenance would be required for bug fixes, for merging Dependabot updates in case the project has some affected dependencies, and in case Stanza dramatically changes its packaging (like changing the file structure of the models).
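A sketch of what such a helper might look like, assuming models ship inside wheels under package names following the scheme discussed here (the package and file names are hypothetical; stanza's per-processor model-path options and download_method exist, but this exact combination is untested):

```python
from importlib import resources

import stanza

def wheel_model_path(package: str, filename: str) -> str:
    # Resolve the on-disk path of a model file shipped inside an
    # installed wheel (package and file names are hypothetical).
    return str(resources.files(package) / filename)

nlp = stanza.Pipeline(
    lang="fi",
    processors="tokenize,lemma",
    download_method=None,  # never touch the network
    tokenize_model_path=wheel_model_path("rominf_stanza_fi_tokenize_tdt", "tdt.pt"),
    lemma_model_path=wheel_model_path("rominf_stanza_fi_lemma_tdt_nocharlm", "tdt_nocharlm.pt"),
)
```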
I wonder if these could be merged together, so that the second is simply the first one plus some optional dependencies, along the lines of the sketch below:
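A hypothetical setuptools sketch of that merged scheme (package names invented, following the rominf-stanza-fi naming used in this thread; a bare install would pull only the tokenizer, with extras adding the rest):

```python
from setuptools import setup

setup(
    name="rominf-stanza-fi",
    version="1.8.0",
    # Bare install: only the tokenizer model.
    install_requires=["rominf-stanza-fi-tokenize-tdt"],
    extras_require={
        # Default pipeline on top of the tokenizer.
        "default": [
            "rominf-stanza-fi-mwt-tdt",
            "rominf-stanza-fi-lemma-tdt-nocharlm",
            "rominf-stanza-fi-pos-tdt-charlm",
            "rominf-stanza-fi-depparse-tdt-charlm",
        ],
        # Alternative FTB treebank pipeline.
        "ftb": [
            "rominf-stanza-fi-tokenize-ftb",
            "rominf-stanza-fi-mwt-ftb",
            "rominf-stanza-fi-pos-ftb-charlm",
            "rominf-stanza-fi-lemma-ftb-nocharlm",
        ],
    },
)
```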
This will keep most of the flexibility of the current system, while improving on the packaging. Different treebanks for the same language will probably require separate packages, so Italian, for example, would have:
I don't think this can be dodged, but it is not too big of a deal.
I don't think this is that big of a deal; we can probably download it if it is missing, as a temporary solution that could become permanent. At least that is what I intended to do before. For upstream, this would be a breaking change, but they might deem it worth it because of the improved release workflow. One little regression I can think of is that models couldn't be so easily isolated like they are nowadays by setting
Here is the packaging scheme I propose, visually explained. Note that I group models by packages as in Stanza, not by treebanks (these concepts are different):

```mermaid
graph LR;
FI_MODEL_BACKWARD_CHARLM_CONNLL17_PIP[pip install rominf-stanza-fi-model-backward-charlm-conll17]-.-FI_MODEL_BACKWARD_CHARLM_CONNLL17_FILE[~/stanza_resources/fi/backward_charlm/conll17.pt];
FI_MODEL_DEPPARSE_FTB_CHARLM_PIP[pip install rominf-stanza-fi-model-depparse-ftb-charlm]-.-FI_MODEL_DEPPARSE_FTB_CHARLM_FILE[~/stanza_resources/fi/depparse/ftb_charlm.pt];
FI_MODEL_DEPPARSE_TDT_CHARLM_PIP[pip install rominf-stanza-fi-model-depparse-tdt-charlm]-.-FI_MODEL_DEPPARSE_TDT_CHARLM_FILE[~/stanza_resources/fi/depparse/tdt_charlm.pt];
FI_MODEL_DEPPARSE_TDT_NOCHARLM_PIP[pip install rominf-stanza-fi-model-depparse-tdt-nocharlm]-.-FI_MODEL_DEPPARSE_TDT_NOCHARLM_FILE[~/stanza_resources/fi/depparse/tdt_nocharlm.pt];
FI_MODEL_DEPPARSE_TDT_NOCHARLM_PIP-->FI_MODEL_PRETRAIN_CONNLL17_PIP;
FI_MODEL_FORWARD_CHARLM_CONNLL17_PIP[pip install rominf-stanza-fi-model-forward-charlm-conll17]-.-FI_MODEL_FORWARD_CHARLM_CONNLL17_FILE[~/stanza_resources/fi/forward_charlm/conll17.pt];
FI_MODEL_LEMMA_FTB_NOCHARLM_PIP[pip install rominf-stanza-fi-model-lemma-ftb-nocharlm]-.-FI_MODEL_LEMMA_FTB_NOCHARLM_FILE[~/stanza_resources/fi/lemma/ftb_nocharlm.pt];
FI_MODEL_LEMMA_TDT_NOCHARLM_PIP[pip install rominf-stanza-fi-model-lemma-tdt-nocharlm]-.-FI_MODEL_LEMMA_TDT_NOCHARLM_FILE[~/stanza_resources/fi/lemma/tdt_nocharlm.pt];
FI_MODEL_MWT_TDT_PIP[pip install rominf-stanza-fi-model-mwt-tdt]-.-FI_MODEL_MWT_TDT_FILE[~/stanza_resources/fi/mwt/tdt.pt];
FI_MODEL_MWT_FTB_PIP[pip install rominf-stanza-fi-model-mwt-ftb]-.-FI_MODEL_MWT_FTB_FILE[~/stanza_resources/fi/mwt/ftb.pt];
FI_MODEL_NER_TURKU_PIP[pip install rominf-stanza-fi-model-ner-turku]-.-FI_MODEL_NER_TURKU_FILE[~/stanza_resources/fi/ner/turku.pt];
FI_MODEL_POS_FTB_CHARLM_PIP[pip install rominf-stanza-fi-model-pos-ftb-charlm]-.-FI_MODEL_POS_FTB_CHARLM_FILE[~/stanza_resources/fi/pos/ftb_charlm.pt];
FI_MODEL_POS_FTB_CHARLM_PIP-->FI_MODEL_PRETRAIN_CONNLL17_PIP;
FI_MODEL_POS_FTB_CHARLM_PIP-->FI_MODEL_FORWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_POS_FTB_CHARLM_PIP-->FI_MODEL_BACKWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_POS_TDT_NOCHARLM_PIP[pip install rominf-stanza-fi-model-pos-tdt-nocharlm]-.-FI_MODEL_POS_TDT_NOCHARLM_FILE[~/stanza_resources/fi/pos/tdt_nocharlm.pt];
FI_MODEL_POS_TDT_NOCHARLM_PIP-->FI_MODEL_PRETRAIN_CONNLL17_PIP;
FI_MODEL_POS_TDT_CHARLM_PIP[pip install rominf-stanza-fi-model-pos-tdt-charlm]-.-FI_MODEL_POS_TDT_CHARLM_FILE[~/stanza_resources/fi/pos/tdt_charlm.pt];
FI_MODEL_POS_TDT_CHARLM_PIP-->FI_MODEL_PRETRAIN_CONNLL17_PIP;
FI_MODEL_POS_TDT_CHARLM_PIP-->FI_MODEL_FORWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_POS_TDT_CHARLM_PIP-->FI_MODEL_BACKWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_PRETRAIN_CONNLL17_PIP[pip install rominf-stanza-fi-model-pretrain-conll17]-.-FI_MODEL_PRETRAIN_CONNLL17_FILE[~/stanza_resources/fi/pretrain/conll17.pt];
FI_MODEL_TOKENIZE_TDT_PIP[pip install rominf-stanza-fi-model-tokenize-tdt]-.-FI_MODEL_TOKENIZE_TDT_FILE[~/stanza_resources/fi/tokenize/tdt.pt];
FI_MODEL_TOKENIZE_FTB_PIP[pip install rominf-stanza-fi-model-tokenize-ftb]-.-FI_MODEL_TOKENIZE_FTB_FILE[~/stanza_resources/fi/tokenize/ftb.pt];
FI_PACKAGE_DEFAULT_PIP[pip install rominf-stanza-fi-package-default]-->FI_MODEL_TOKENIZE_TDT_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_MWT_TDT_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_LEMMA_TDT_NOCHARLM_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_POS_TDT_CHARLM_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_DEPPARSE_TDT_CHARLM_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_NER_TURKU_PIP;
FI_PACKAGE_DEFAULT_FAST_PIP[pip install rominf-stanza-fi-package-default-fast]-->FI_MODEL_TOKENIZE_TDT_PIP;
FI_PACKAGE_DEFAULT_FAST_PIP-->FI_MODEL_MWT_TDT_PIP;
FI_PACKAGE_DEFAULT_FAST_PIP-->FI_MODEL_LEMMA_TDT_NOCHARLM_PIP;
FI_PACKAGE_DEFAULT_FAST_PIP-->FI_MODEL_POS_TDT_NOCHARLM_PIP;
FI_PACKAGE_DEFAULT_FAST_PIP-->FI_MODEL_DEPPARSE_TDT_NOCHARLM_PIP;
FI_PACKAGE_DEFAULT_FAST_PIP-->FI_MODEL_NER_TURKU_PIP;
FI_PACKAGE_FTB_PIP[pip install rominf-stanza-fi-package-ftb]-->FI_MODEL_TOKENIZE_FTB_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_MWT_FTB_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_POS_FTB_CHARLM_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_LEMMA_FTB_NOCHARLM_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_DEPPARSE_FTB_CHARLM_PIP;
FI_PIP[pip install rominf-stanza-fi]-->FI_PACKAGE_DEFAULT_PIP;
```
It looks like you are proposing using extras. I thought about leveraging them as well, though I was thinking about a different packaging scheme (I intended them to specify Stanza packages):

```mermaid
graph LR;
FI_MODEL_BACKWARD_CHARLM_CONNLL17_PIP[pip install rominf-stanza-fi-backward-charlm-conll17]-.-FI_MODEL_BACKWARD_CHARLM_CONNLL17_FILE[~/stanza_resources/fi/backward_charlm/conll17.pt];
FI_MODEL_DEPPARSE_FTB_CHARLM_PIP[pip install rominf-stanza-fi-depparse-ftb-charlm]-.-FI_MODEL_DEPPARSE_FTB_CHARLM_FILE[~/stanza_resources/fi/depparse/ftb_charlm.pt];
FI_MODEL_DEPPARSE_TDT_CHARLM_PIP[pip install rominf-stanza-fi-depparse-tdt-charlm]-.-FI_MODEL_DEPPARSE_TDT_CHARLM_FILE[~/stanza_resources/fi/depparse/tdt_charlm.pt];
FI_MODEL_FORWARD_CHARLM_CONNLL17_PIP[pip install rominf-stanza-fi-forward-charlm-conll17]-.-FI_MODEL_FORWARD_CHARLM_CONNLL17_FILE[~/stanza_resources/fi/forward_charlm/conll17.pt];
FI_MODEL_LEMMA_FTB_NOCHARLM_PIP[pip install rominf-stanza-fi-lemma-ftb-nocharlm]-.-FI_MODEL_LEMMA_FTB_NOCHARLM_FILE[~/stanza_resources/fi/lemma/ftb_nocharlm.pt];
FI_MODEL_LEMMA_TDT_NOCHARLM_PIP[pip install rominf-stanza-fi-lemma-tdt-nocharlm]-.-FI_MODEL_LEMMA_TDT_NOCHARLM_FILE[~/stanza_resources/fi/lemma/tdt_nocharlm.pt];
FI_MODEL_MWT_TDT_PIP[pip install rominf-stanza-fi-mwt-tdt]-.-FI_MODEL_MWT_TDT_FILE[~/stanza_resources/fi/mwt/tdt.pt];
FI_MODEL_MWT_FTB_PIP[pip install rominf-stanza-fi-mwt-ftb]-.-FI_MODEL_MWT_FTB_FILE[~/stanza_resources/fi/mwt/ftb.pt];
FI_MODEL_NER_TURKU_PIP[pip install rominf-stanza-fi-ner-turku]-.-FI_MODEL_NER_TURKU_FILE[~/stanza_resources/fi/ner/turku.pt];
FI_MODEL_POS_FTB_CHARLM_PIP[pip install rominf-stanza-fi-pos-ftb-charlm]-.-FI_MODEL_POS_FTB_CHARLM_FILE[~/stanza_resources/fi/pos/ftb_charlm.pt];
FI_MODEL_POS_FTB_CHARLM_PIP-->FI_MODEL_PRETRAIN_CONNLL17_PIP;
FI_MODEL_POS_FTB_CHARLM_PIP-->FI_MODEL_FORWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_POS_FTB_CHARLM_PIP-->FI_MODEL_BACKWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_POS_TDT_CHARLM_PIP[pip install rominf-stanza-fi-pos-tdt-charlm]-.-FI_MODEL_POS_TDT_CHARLM_FILE[~/stanza_resources/fi/pos/tdt_charlm.pt];
FI_MODEL_POS_TDT_CHARLM_PIP-->FI_MODEL_PRETRAIN_CONNLL17_PIP;
FI_MODEL_POS_TDT_CHARLM_PIP-->FI_MODEL_FORWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_POS_TDT_CHARLM_PIP-->FI_MODEL_BACKWARD_CHARLM_CONNLL17_PIP;
FI_MODEL_PRETRAIN_CONNLL17_PIP[pip install rominf-stanza-fi-pretrain-conll17]-.-FI_MODEL_PRETRAIN_CONNLL17_FILE[~/stanza_resources/fi/pretrain/conll17.pt];
FI_MODEL_TOKENIZE_TDT_PIP[pip install rominf-stanza-fi-tokenize-tdt]-.-FI_MODEL_TOKENIZE_TDT_FILE[~/stanza_resources/fi/tokenize/tdt.pt];
FI_MODEL_TOKENIZE_FTB_PIP[pip install rominf-stanza-fi-tokenize-ftb]-.-FI_MODEL_TOKENIZE_FTB_FILE[~/stanza_resources/fi/tokenize/ftb.pt];
FI_PACKAGE_DEFAULT_PIP["pip install rominf-stanza-fi[default]"]-->FI_MODEL_TOKENIZE_TDT_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_MWT_TDT_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_LEMMA_TDT_NOCHARLM_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_POS_TDT_CHARLM_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_DEPPARSE_TDT_CHARLM_PIP;
FI_PACKAGE_DEFAULT_PIP-->FI_MODEL_NER_TURKU_PIP;
FI_PACKAGE_FTB_PIP["pip install rominf-stanza-fi[ftb]"]-->FI_MODEL_TOKENIZE_FTB_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_MWT_FTB_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_POS_FTB_CHARLM_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_LEMMA_FTB_NOCHARLM_PIP;
FI_PACKAGE_FTB_PIP-->FI_MODEL_DEPPARSE_FTB_CHARLM_PIP;
FI_PIP[pip install rominf-stanza-fi]-->FI_PACKAGE_DEFAULT_PIP;
```
Unfortunately, this does not work since it is not possible to make
As for downloading: I found a way to not require any Stanza changes to support wheel-packaged models at all. From a user perspective:
If model packages are installed with the --target option:
Please let me know what you think.
Looks solid, but not very pretty. Upstream might complain about the names. I wonder if a hybrid solution is possible, where each "base" package uses normal naming but down the line we still use extras, ending with something like:
Otherwise, this is as ready for upstream as it will get without their feedback on it.
Thanks for your answer. Since the user might omit specifying extras, what will:
install? As a user, I would probably expect
install then? Variations of this question apply to all usages of extras. Also, is the last dash intentional here: |
That part is what I haven't decided on yet. At first I assumed that no extras would mean only the tokenizer, since it makes sense as a minimalist default for our use case. But I am not so sure that applies to upstream too, and you are right that it is counterintuitive with the current naming scheme. And the last dash was a typo; don't think too hard about it.

Edit: I just remembered why I took the tokenizer as a sane default: it is the only processor without other requirements, and it is instead a requirement for the rest.
Hi! I've just reread the last part of this thread after months, and it seems like it would be way too complicated for me to handle. Can you please help me with a few questions:
Hope you are doing well, and having a nice holiday!
Hi @simjanos-dev, Looking back, I see that I was stuck on an over-engineered solution. Feel free to discard everything related to Python packaging. Your suggestion makes total sense. Happy holidays!
This PR is a request for comments about using a stanza model for Finnish and is not meant to be merged in its current state, hence it is a draft.
Unfortunately, Finnish lemmatization is not very accurate. I ran a slightly updated benchmark, https://github.com/aajanki/finnish-pos-accuracy, and found that the spacy lemmatization model used in LinguaCafe has F1=0.842, whereas the default stanza model for Finnish gives F1=0.958. I tried to use stanza with the https://github.com/explosion/spacy-stanza adapter (see the PR code). It works. Also, the code changes are generalizable to other languages (stanza supports over 70 languages).
There is a huge downside though: the size of the resulting docker image, which is mostly because of the NVIDIA drivers, I guess, which are automatically downloaded with the pytorch installation.
In conclusion, it is possible to significantly increase accuracy for Finnish (and probably some other languages) without increasing code complexity, at the cost of image size.
What do you think about Finnish lemmatization accuracy and introducing stanza?
Before (lemma is whole word – incorrect):
After (lemma is correct):