
Question about training code #30

Open
Gaffey opened this issue Jul 16, 2024 · 4 comments

Comments

@Gaffey

Gaffey commented Jul 16, 2024

Thanks for your contributions to the open-source community. I have some confusion about the training code. In anygpt/src/stage1_pretrain.py, I can only see that the image/speech/music data is loaded, but it is never tokenized by the corresponding tokenizers (such as SEED or SpeechTokenizer). Where do you use them to tokenize these data during pretraining?

@JunZhan2000
Collaborator

Hello, the training process is as follows.

  1. First, use the multimodal tokenizers to discretize the image, speech, and music data into token sequences.

  2. Then call the functions in https://github.com/OpenMOSS/AnyGPT/blob/main/anygpt/src/m_utils/anything2token.py to convert those tokens into the corresponding strings (which makes training with transformers convenient).

  3. Finally, as mentioned in the paper, https://github.com/OpenMOSS/AnyGPT/blob/main/anygpt/src/m_utils/prompter.py#L128 provides a function that concatenates a non-text modal content X with its paired text into a complete training sentence.

Then write the training data into a txt or json file for training; a minimal sketch of steps 2 and 3 is shown below.
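For concreteness, here is a minimal sketch of steps 2 and 3. The helper names and special tokens (`<sosp>`/`<eosp>`, `<soim>`/`<eoim>`, `<speech_N>`) are illustrative assumptions; the project's actual vocabulary and templates live in anything2token.py and prompter.py and may differ:

```python
import json

# Illustrative special tokens; the real vocabulary is defined in
# anygpt/src/m_utils/anything2token.py and may differ.
SPECIAL = {"speech": ("<sosp>", "<eosp>"), "image": ("<soim>", "<eoim>")}

def modality_tokens_to_string(token_ids, modality="speech"):
    """Step 2: wrap discrete token ids in a modality-specific string,
    e.g. <sosp><speech_12><speech_845>...<eosp>."""
    start, end = SPECIAL[modality]
    return start + "".join(f"<{modality}_{i}>" for i in token_ids) + end

def build_training_sentence(text, modality_string):
    """Step 3: concatenate the non-text content X with its paired text
    (prompter.py#L128 applies the project's own template instead)."""
    return f"{modality_string} {text}"

# token_ids would come from step 1 (e.g. SpeechTokenizer / SEED output)
token_ids = [12, 845, 3, 77]
sample = build_training_sentence("a short transcription",
                                 modality_tokens_to_string(token_ids))

# Write one JSON object per line, ready for a text-only LM trainer
with open("train.jsonl", "a") as f:
    f.write(json.dumps({"text": sample}) + "\n")
```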

@sambalshikhar

sambalshikhar commented Jul 23, 2024

Can you please elaborate on the steps to do the following:
"First, you need to use the multimodal tokenizer to discretize the image, speech, and music to obtain a token sequence."
It's a bit confusing where and what to run.

@JunZhan2000
Collaborator

> Can you please elaborate on the steps to do the following: "First, you need to use the multimodal tokenizer to discretize the image, speech, and music to obtain a token sequence." It's a bit confusing where and what to run.

This involves some trivial data-processing code that we have not organized for release, but you can find the core pieces in our codebase. For example, for how to tokenize images, you can refer to
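The image-tokenization pointer above appears to have been dropped from the thread. As a substitute for step 1, here is a minimal sketch for the speech modality, assuming the public SpeechTokenizer package (fnlp/SpeechTokenizer) and placeholder checkpoint paths; the usage follows that repo's README:

```python
import torch
import torchaudio
from speechtokenizer import SpeechTokenizer  # pip install speechtokenizer

# Placeholder paths for a downloaded SpeechTokenizer checkpoint
config_path = "/path/to/config.json"
ckpt_path = "/path/to/SpeechTokenizer.pt"
model = SpeechTokenizer.load_from_checkpoint(config_path, ckpt_path)
model.eval()

wav, sr = torchaudio.load("/path/to/speech.wav")
if wav.shape[0] > 1:          # the model expects mono audio
    wav = wav[:1, :]
if sr != model.sample_rate:   # resample to the model's rate
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)
wav = wav.unsqueeze(0)        # (batch, channel, time)

with torch.no_grad():
    codes = model.encode(wav)  # (n_q, batch, time): RVQ codebook indices

# The first RVQ layer carries the semantic content; these discrete ids
# are what get converted to strings in step 2.
semantic_tokens = codes[0, 0, :].tolist()
```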

@JunZhan2000
Collaborator

Hello, we provide some training data samples and related descriptions; please refer to https://github.com/OpenMOSS/AnyGPT?tab=readme-ov-file#pretraining-and-sft
