Hello, if I am interested in tokenizing something other than natural language, like a programming language, what would I need to change? Is there some way to add my own tokens easily?

-
Hi @l-olmos The short answer is that nothing needs to change after you switch to a different tokenizer. Seq2SeqSharp processes already-tokenized data sets, so you can use or train any tokenizer you like and simply send the tokenized data to Seq2SeqSharp for training or testing.
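For custom tokens (e.g. for a programming language), one option is to reserve them when training the tokenizer itself. A minimal sketch with the SentencePiece Python package; the file names, vocabulary size, and symbol names below are hypothetical illustrations, not part of Seq2SeqSharp:

```python
import sentencepiece as spm

# Train a subword tokenizer on source code instead of natural language.
# user_defined_symbols reserves custom tokens that are never split.
spm.SentencePieceTrainer.train(
    input='train.code.txt',       # hypothetical corpus, one sample per line
    model_prefix='code_tok',      # writes code_tok.model / code_tok.vocab
    vocab_size=8000,
    user_defined_symbols=['<INDENT>', '<DEDENT>', '<NEWLINE>'],
)

# Tokenize raw text into pieces; the space-joined pieces are what you
# would write into the data files given to Seq2SeqSharp.
sp = spm.SentencePieceProcessor(model_file='code_tok.model')
pieces = sp.encode('def add(a, b): return a + b', out_type=str)
print(' '.join(pieces))
```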
-
Can you please show a few examples here, including the input, the output, and the command line?
The input and output tokens are tokenized. But if your tokenizer was trained with SentencePiece, you can specify your SentencePiece model on the command line and send raw text to the tool as input.
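For illustration only (hypothetical contents): a tokenized data set is plain text with one sentence per line and tokens separated by spaces, so a line of the English side might look like `▁How ▁are ▁you ?` with the parallel Chinese line `▁你 好 吗 ？`; each whitespace-separated piece is then one token to the model.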
Quoted message from l-olmos (July 25, 2024):
Never mind, I understand it now. When calling Test, is that supposed to act as a call to the model? For example, if a model was trained to translate English to Chinese, should calling Test on a file of English sentences output the Chinese translations? If so, I am having trouble: the tokens, specifically " ", show up in my output file. The model is also not performing well, even with a large training set. I am using a different tokenizer; could that be the issue? I had the same problem without my own tokenizer as well.
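If raw subword pieces appear in the Test output, decoding the output with the same SentencePiece model that produced the training data usually restores plain text. A minimal sketch, assuming hypothetical file and model names:

```python
import sentencepiece as spm

# Load the same model that was used to tokenize the training data.
sp = spm.SentencePieceProcessor(model_file='enu.model')  # hypothetical path

# Decode each line of space-separated pieces back into plain text.
with open('output.tok.txt', encoding='utf-8') as fin, \
     open('output.detok.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(sp.decode(line.split()) + '\n')
```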
-
Hi @l-olmos