Can long text be split into short texts? #655

Open
CoinCheung opened this issue Jul 12, 2024 · 0 comments
Labels
type/question An issue that's a question

Comments

@CoinCheung
❓ The question

I generate training samples with dolma, and I found that some of the texts are very long, up to 8k tokens, while my max_seq_len is only 2k. In this case, will the OLMo dataset split the 8k sample into 4 parts (each 2k tokens long), or will it keep only the first 2k tokens and drop the rest?
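
To make the two possibilities concrete, here is a minimal Python sketch of what I mean. The function names `split_into_chunks` and `truncate` are hypothetical and made up for illustration; this is not OLMo's or dolma's actual code, and it does not claim to show which behavior the library implements.

```python
from typing import List


def split_into_chunks(token_ids: List[int], max_seq_len: int) -> List[List[int]]:
    """Option 1 (hypothetical): keep every token by cutting the document
    into consecutive chunks of at most max_seq_len tokens."""
    return [
        token_ids[start:start + max_seq_len]
        for start in range(0, len(token_ids), max_seq_len)
    ]


def truncate(token_ids: List[int], max_seq_len: int) -> List[int]:
    """Option 2 (hypothetical): keep only the first max_seq_len tokens
    and drop everything after them."""
    return token_ids[:max_seq_len]


# A stand-in for one tokenized 8k document; real token IDs would come from a tokenizer.
doc = list(range(8192))
max_seq_len = 2048

chunks = split_into_chunks(doc, max_seq_len)
kept = truncate(doc, max_seq_len)

print(len(chunks), [len(c) for c in chunks])
print(len(kept))
```

With option 1, all 8192 tokens end up in some training sample; with option 2, the last 6144 tokens of the document are never seen.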

@CoinCheung CoinCheung added the type/question An issue that's a question label Jul 12, 2024