From 27784157d2a0127832fdcc937296b18817b14682 Mon Sep 17 00:00:00 2001
From: "Julien \"uj\" Abadji"
Date: Thu, 13 Jul 2023 09:55:08 +0200
Subject: [PATCH] add lid info (#106)

---
 README.md | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 81cbed3..a1a552a 100644
--- a/README.md
+++ b/README.md
@@ -33,9 +33,18 @@ apt install -y libboost-all-dev libeigen3-dev
 
 and use `cargo install ungoliant --feature kenlm` or `cargo b --features kenlm` if you're building from source.
 
-### Getting the language identification file (for fastText):
+### Getting a language identification file (for fastText):
+
+By default, `ungoliant` expects the `lid.176.bin` model by meta.
+Use `curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o lid.176.bin` to get it.
+
+However, you can use the model you want: just point to its path using `ungoliant download --lid-path `.
+
+Other options include:
+
+- NLLB model (https://huggingface.co/facebook/fasttext-language-identification)
+- OpenLID model (https://github.com/laurieburchell/open-lid-dataset)
 
-Use `curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o lid.176.bin`.
 
 ## Usage
 