This repository has been archived by the owner on Apr 23, 2024. It is now read-only.

Using YouTokenToMe with pre-defined vocab and embeddings #84

Open
alexbalandi opened this issue Feb 16, 2021 · 2 comments

Comments

@alexbalandi

I want to use YouTokenToMe for fast id encoding, but I need to do it with the embeddings from https://nlp.h-its.org/bpemb/ , which come with a pre-defined vocab. Right now I don't see an out-of-the-box way to pair a YouTokenToMe model with a pre-defined vocab.
Are there any plans to implement something like a build_from_vocab classmethod? If not, could I get some starting points on how to do it myself? Right now the model file looks a bit obscure to me, so I can't easily get started on building my own model file from the vocab I have.

@kefirski
Contributor

Hi @alexbalandi!

Right now, you can't use an external vocab to define your BPE model.
We plan to support converting different subword formats into the yttm format in the future, but it seems to be somewhat hard to implement.

@alexbalandi
Author

> Hi @alexbalandi!
>
> Right now, you can't use an external vocab to define your BPE model.
> We plan to support converting different subword formats into the yttm format in the future, but it seems to be somewhat hard to implement.

Thank you for the quick answer!
Can I at least get some pointers on where to look so I could try to make an ad hoc solution myself? For example, what does each line in the .model file from your tutorial mean? I could try to read the source code, but I'm not proficient in C++, and honestly, any write-up with a code-free (or at least pseudo-code) explanation of how your model gets loaded from a file and works would help.
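For context, an ad hoc solution along these lines usually amounts to greedy BPE encoding against a pre-defined merge list and vocab. The sketch below is a toy illustration of that idea only; the merge rules, vocab, and `<unk>` handling are made-up assumptions, and this is not yttm's actual .model file format.

```python
# Toy sketch: greedy BPE encoding with a pre-defined merge list and vocab.
# The merges/vocab below are illustrative assumptions, not yttm's format.

def bpe_encode(word, merges, vocab):
    """Split `word` into characters, then repeatedly apply the
    highest-priority merge (lowest index in `merges`) until none applies,
    and finally map the resulting subwords to ids."""
    tokens = list(word)
    while len(tokens) > 1:
        # Rank every mergeable adjacent pair by its position in `merges`.
        candidates = [(merges.index(pair), i)
                      for i, pair in enumerate(zip(tokens, tokens[1:]))
                      if pair in merges]
        if not candidates:
            break
        _, i = min(candidates)                      # best-ranked pair wins
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    # Fall back to an assumed <unk> id for out-of-vocab subwords.
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

merges = [("l", "o"), ("lo", "w")]   # learned merge rules, in priority order
vocab = {"<unk>": 0, "l": 1, "o": 2, "w": 3, "lo": 4, "low": 5}

print(bpe_encode("low", merges, vocab))   # -> [5]
```

A real converter would still need to serialize such merges and vocab into whatever layout yttm's loader expects, which is the part the maintainers would have to document.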
