Question about training code #30
Comments
Hello, the training process is as follows.
Then write the training data to a txt or JSON file for training.
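A minimal sketch of that last step, assuming the media have already been converted to discrete token ids offline. Everything here is hypothetical and is not the AnyGPT code: the function names, the `<image_12>`-style token strings, the record fields, and the `encode` callable (a stand-in for a real tokenizer such as SEED or SpeechTokenizer) are placeholders for illustration only.

```python
import json
from typing import Callable, Dict, Iterable, List


def tokens_to_text(token_ids: List[int], modality: str) -> str:
    """Render discrete token ids as placeholder strings (hypothetical format)."""
    return "".join(f"<{modality}_{i}>" for i in token_ids)


def write_jsonl(
    samples: Iterable[Dict[str, str]],
    encode: Callable[[str], List[int]],
    out_path: str,
) -> int:
    """Tokenize each sample with `encode` and write one JSON record per line.

    `encode` is an assumed stand-in for a real modality tokenizer: it takes a
    file path and returns a list of integer token ids.
    Returns the number of records written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for sample in samples:
            ids = encode(sample["path"])
            record = {
                "caption": sample["caption"],
                "tokens": tokens_to_text(ids, sample["modality"]),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

The training script can then read this file as ordinary text data, which would explain why no tokenizer call appears in the training loop itself.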
Can you please elaborate on the steps to do the following
This involves some trivial data-processing code that we have not organized yet, but you can find the core content in our code. For example, for how to tokenize images, you can refer to
Hello, we provide some training data samples and related descriptions, please refer to https://github.com/OpenMOSS/AnyGPT?tab=readme-ov-file#pretraining-and-sft
Thanks for your contributions to the open-source community. I am confused about part of the training code. In anygpt/src/stage1_pretrain.py, I only see the image/speech/music data being loaded, not tokenized by the corresponding tokenizers (such as SEED or SpeechTokenizer). Where are these tokenizers applied to the data during pretraining?