Skip to content
This repository has been archived by the owner on Nov 1, 2024. It is now read-only.

add no_break_image break mode: no partial image at the beginning of a… #740

Merged
merged 8 commits into from
Jul 18, 2023

Conversation

violet-zct
Copy link

…n examples

Patch Description
Describe your changes

Testing steps
Describe how you tested your changes

metaseq/data/document_to_sequence.py Outdated Show resolved Hide resolved
metaseq/data/document_to_sequence.py Outdated Show resolved Hide resolved
metaseq/data/document_to_sequence.py Outdated Show resolved Hide resolved
metaseq/tasks/streaming_language_modeling.py Outdated Show resolved Hide resolved
@ArmenAg
Copy link
Contributor

ArmenAg commented Jul 12, 2023

@violet-zct Could we add some tests that cover the behavior we expect? Similar to https://github.com/facebookresearch/metaseq/blob/main/cpu_tests/test_streaming_token_block_dataset.py

…token of an image 3. fix get_document_boundaries()
@violet-zct
Copy link
Author

  1. I pushed a cpu test for no_image_break: https://github.com/facebookresearch/metaseq/blob/cm3v2/metaseq/data/document_to_sequence.py#L295
    During my tests, I fix two corner cases in get_document_boundaries() of the cm3 dataset: https://github.com/facebookresearch/metaseq/blob/cm3v2_chunting_dev/metaseq/data/cm3_dataset.py#L208 and #L211, we probably can hit these corner cases when the document is very long.

  2. In addition, I found that your get_document_boundaries() can't pass the test: https://github.com/facebookresearch/metaseq/blob/cm3v2_chunting_dev/cpu_tests/test_cm3_dataset.py#L97, unless you remove + 1 in this line: https://github.com/facebookresearch/metaseq/blob/cm3v2_chunting_dev/cpu_tests/test_cm3_dataset.py#L97

  3. I fixed image spans by also including the special token at the end of a image.

@ArmenAg
Copy link
Contributor

ArmenAg commented Jul 17, 2023

Tests look great to me. Behavior is as expected. Let's land!

@violet-zct violet-zct merged commit 929cf1a into cm3v2 Jul 18, 2023
@violet-zct violet-zct deleted the cm3v2_chunting_dev branch July 18, 2023 05:37
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants