Hello, if I am interested in tokenizing something other than natural language, like a programming language, what would I need to change? Is there some way to add my own tokens easily?

-
Hi @l-olmos The short answer is that nothing needs to change after you switch to a different tokenizer. Seq2SeqSharp processes already-tokenized data sets, so you can use or train any tokenizer you like and simply send the tokenized data to Seq2SeqSharp for training or testing.
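For custom tokens (e.g. for a programming language), one option is to reserve them when training the tokenizer itself. A minimal sketch with the SentencePiece Python package; the file names, vocabulary size, and symbol names below are hypothetical illustrations, not part of Seq2SeqSharp:

```python
import sentencepiece as spm

# Train a subword tokenizer on source code instead of natural language.
# user_defined_symbols reserves custom tokens that are never split.
spm.SentencePieceTrainer.train(
    input='train.code.txt',       # hypothetical corpus, one sample per line
    model_prefix='code_tok',      # writes code_tok.model / code_tok.vocab
    vocab_size=8000,
    user_defined_symbols=['<INDENT>', '<DEDENT>', '<NEWLINE>'],
)

# Tokenize raw text into pieces; the space-joined pieces are what you
# would write into the data files given to Seq2SeqSharp.
sp = spm.SentencePieceProcessor(model_file='code_tok.model')
pieces = sp.encode('def add(a, b): return a + b', out_type=str)
print(' '.join(pieces))
```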
-
Can you please show a few examples here, including the input, the output, and the command line?
The input and output tokens are tokenized. But if your tokenizer was trained with SentencePiece, you can specify your SentencePiece model on the command line and send raw text to the tool as input.
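For illustration only (hypothetical contents): a tokenized data set is plain text with one sentence per line and tokens separated by spaces, so a line of the English side might look like `▁How ▁are ▁you ?` with the parallel Chinese line `▁你 好 吗 ？`; each whitespace-separated piece is then one token to the model.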
Quoted message from l-olmos (July 25, 2024):
Never mind, I understand it now. When calling Test, is that supposed to act as a call to the model? For example, if a model was trained to translate English to Chinese, should calling Test on a file of English sentences output the Chinese translations? If so, I am having trouble: the tokens, specifically " ", show up in my output file. The model is also not performing well, even with a large training set. I am using a different tokenizer; could that be the issue? I had the same problem without my own tokenizer as well.
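If raw subword pieces appear in the Test output, decoding the output with the same SentencePiece model that produced the training data usually restores plain text. A minimal sketch, assuming hypothetical file and model names:

```python
import sentencepiece as spm

# Load the same model that was used to tokenize the training data.
sp = spm.SentencePieceProcessor(model_file='enu.model')  # hypothetical path

# Decode each line of space-separated pieces back into plain text.
with open('output.tok.txt', encoding='utf-8') as fin, \
     open('output.detok.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(sp.decode(line.split()) + '\n')
```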
-
Hi @l-olmos