The usage of ShareCaptioner-Video #34

Open
huangyf530 opened this issue Aug 3, 2024 · 1 comment

Comments

@huangyf530

While using ShareCaptioner-Video, I have some questions about the code.

  1. The model in captioner/fast_captioner_lmdeploy.py and captioner/slide_captioner_lmdeploy.py is built with the ChatTemplateConfig of internlm-xcomposer2-4khd. However, there is no registered template named internlm-xcomposer2-4khd, so lmdeploy falls back to BaseChatTemplate. I am not sure this is correct, and the template used for caption generation is not listed in your paper. I also tried the internlm-xcomposer2 template, which gives better results, especially in the slide-caption setting. So my first question is: which ChatTemplateConfig should I use? FYI, my lmdeploy version is 0.5.2.post1, which is the newest.
    model = pipeline(args.model_name, chat_template_config=ChatTemplateConfig(model_name='internlm-xcomposer2-4khd'))
  2. The prompt in captioner/slide_captioner_lmdeploy.py for differential sliding-window captioning uses a hard-coded map from frame index to video timestamp: the first frame corresponds to 0 seconds, the second to 2 seconds, and so on. However, your paper says a Semantic-aware Key-frame Extraction is used to avoid redundant frames, so some frames may be dropped from the video and the hard-coded map can be wrong. For example, a 10-second video yields 6 frames corresponding to timestamps [0, 2, 4, 6, 8, 10]. Suppose that after Semantic-aware Key-frame Extraction only 5 frames remain, corresponding to timestamps [0, 4, 6, 8, 10]. Then, during slide captioning, the hard-coded timestamps will be [0, 2, 4, 6, 8], which contain mistakes. Is this acceptable, or should I not combine Semantic-aware Key-frame Extraction with slide captioning?
    query = "Here are the Video frame {} at {}.00 Second(s) and Video frame {} at {}.00 Second(s) of a video, describe what happend between them. What happend before is: {}".format(
                self.frame_ptr, int(self.frame_ptr * 2), self.frame_ptr + 1, int((self.frame_ptr + 1) * 2), self.caption_list[-1])
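To illustrate the fix I have in mind, here is a minimal sketch assuming the key-frame extraction step can return the list of timestamps it actually kept (the function and variable names below are hypothetical, not from the repo). The prompt would then be formatted from the real timestamps instead of `frame_ptr * 2`:

```python
def build_slide_query(frame_ptr, kept_timestamps, prev_caption):
    """Format the differential sliding-window prompt for one frame pair.

    kept_timestamps: seconds of the frames that survived key-frame
    extraction (hypothetical output of that step), indexed by frame_ptr.
    """
    t_cur = kept_timestamps[frame_ptr]       # actual time of the current frame
    t_next = kept_timestamps[frame_ptr + 1]  # actual time of the next frame
    return (
        "Here are the Video frame {} at {}.00 Second(s) and Video frame {} "
        "at {}.00 Second(s) of a video, describe what happened between them. "
        "What happened before is: {}"
    ).format(frame_ptr, int(t_cur), frame_ptr + 1, int(t_next), prev_caption)
```

With the example above, `build_slide_query(0, [0, 4, 6, 8, 10], ...)` would label the second frame as being at 4 seconds rather than the incorrect 2 seconds.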
@wisdomikezogwo

+1
