[Feature request] Document how to set fasttext model #106

Open
chris-ha458 opened this issue Jul 13, 2023 · 2 comments
@chris-ha458

Is your feature request related to a problem? Please describe.
There are multiple fastText models available, and in principle one could train one's own.
Besides the one indicated by the README.md (lid.176.bin), the official page lists lid.176.ftz.
On Hugging Face there is lid218e, and there is also a recent independent lib201 model.

Describe the solution you'd like

  1. I would like the README.md to mention that there are other models available
  2. I would like the code to provide a way to select a model through configuration.
  3. I would like the README.md to reflect how 2. would be implemented

Describe alternatives you've considered
I can download my own model and rename it to lid.176.bin, but this is confusing and unsatisfactory.

Additional context

  • There seem to be some unexposed options to achieve this.
    It would be useful to make them modular and document them.
  • The code in model.rs also seems to default to Path::new("lid.176.bin"), but when that file is absent it tries to fall back to lid.208a.bin, and it is unclear where that model can be obtained. Clearly some work was done behind the scenes, so I am hesitant to implement a solution on my own.
  • Since lid.176.bin is the most publicly available model, it could be the fallback, while the user could provide or select their own model.
    [Feature request] Train a classifier to better classify languages #21 might be fixed with this change.
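
To make the proposal concrete, here is a minimal sketch of the selection logic being asked for: a user-supplied model path takes priority, and lid.176.bin is used only when nothing is provided. The function names `resolve_model_path` and `check_model_exists` are hypothetical and are not taken from ungoliant's model.rs; they only illustrate one possible shape for the behaviour.

```rust
use std::path::{Path, PathBuf};

/// Default model file name, the one the README points to.
const DEFAULT_MODEL: &str = "lid.176.bin";

/// Hypothetical helper (not ungoliant's API): choose which model file to load.
/// A user-supplied path always wins; otherwise use the default file name.
fn resolve_model_path(user_choice: Option<&Path>) -> PathBuf {
    match user_choice {
        Some(path) => path.to_path_buf(),
        None => PathBuf::from(DEFAULT_MODEL),
    }
}

/// Hypothetical check: report a missing model file explicitly
/// instead of silently switching to a different model.
fn check_model_exists(path: &Path) -> Result<(), String> {
    if path.is_file() {
        Ok(())
    } else {
        Err(format!("fastText model not found: {}", path.display()))
    }
}
```

With this shape, `resolve_model_path(None)` yields lid.176.bin, and a missing file produces an error message naming the path rather than a fallback to a model the user may not have.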
@chris-ha458 chris-ha458 added the enhancement New feature or request label Jul 13, 2023
@Uinelj
Member

Uinelj commented Jul 13, 2023

Hello @chris-ha458, looking into it as we speak!

@Uinelj
Member

Uinelj commented Jul 13, 2023

I would like the README.md to mention that there are other models available

I agree, we need to add some info about that, esp. since there are way more fasttext models now.

I would like the code to provide a way to select a model through configuration.

Is using ungoliant pipeline ... --lid-path <path to lid> not sufficient? I have used other fasttext models this way and it works!
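
For readers unfamiliar with how such a flag is typically consumed, the following is a std-only sketch of pulling the value of `--lid-path` out of a list of arguments. It is an assumption for illustration: the thread does not show how ungoliant parses its arguments, and `lid_path_from_args` is a hypothetical name.

```rust
use std::path::PathBuf;

/// Hypothetical, std-only sketch: return the value that follows `--lid-path`,
/// or `None` when the flag is absent or has no value after it.
fn lid_path_from_args<I>(args: I) -> Option<PathBuf>
where
    I: IntoIterator<Item = String>,
{
    let mut iter = args.into_iter();
    while let Some(arg) = iter.next() {
        if arg == "--lid-path" {
            // The next argument, if any, is taken as the model path.
            return iter.next().map(PathBuf::from);
        }
    }
    None
}
```

In a real binary this could be called as `lid_path_from_args(std::env::args())`; a `None` result would then mean the default model should be used.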

I would like the README.md to reflect how 2. would be implemented

This can also be added to the readme.

The code in model.rs also seems to default to Path::new("lid.176.bin"), but when that file is absent it tries to fall back to lid.208a.bin, and it is unclear where that model can be obtained. Clearly some work was done behind the scenes, so I am hesitant to implement a solution on my own.

This is weird behaviour and it should be fixed. I'll replace lid.218a.bin with lid.176.bin, since it's readily available.
