While using ShareCaptioner-Video, I have some questions about the code.
The models in `captioner/fast_captioner_lmdeploy.py` and `captioner/slide_captioner_lmdeploy.py` use the `ChatTemplateConfig` of `internlm-xcomposer2-4khd`. However, there is no registered template named `internlm-xcomposer2-4khd`, so the `BaseChatTemplate` is used as a fallback. I am not sure whether this is correct, and the template used for caption generation is not listed in your paper. Additionally, I tried the `internlm-xcomposer2` template, which gives better results, especially in the slide-caption setting. So, the first question is: which `ChatTemplateConfig` should I use? FYI, my `lmdeploy` version is `0.5.2.post1`, the newest at the time of writing.
I noticed the prompt in your `captioner/slide_captioner_lmdeploy.py` for differential sliding-window captioning uses a hard-coded mapping from frame index to video timestamp: the first frame corresponds to 0 seconds, the second to 2 seconds, and so on. However, you mention in your paper that Semantic-aware Key-frame Extraction is used to avoid redundant frames, so some frames may be dropped from the video and the hard-coded mapping can be wrong. For example, a 10-second video yields 6 frames corresponding to the timestamps [0, 2, 4, 6, 8, 10]. Suppose that after Semantic-aware Key-frame Extraction only 5 frames remain, corresponding to the timestamps [0, 4, 6, 8, 10]. Then, during slide captioning, the hard-coded timestamps will be [0, 2, 4, 6, 8], which contain mistakes. Is this acceptable, or should Semantic-aware Key-frame Extraction not be combined with slide captioning?
```python
query = "Here are the Video frame {} at {}.00 Second(s) and Video frame {} at {}.00 Second(s) of a video, describe what happend between them. What happend before is: {}".format(
    self.frame_ptr, int(self.frame_ptr * 2), self.frame_ptr + 1, int((self.frame_ptr + 1) * 2), self.caption_list[-1])
```
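One way the mismatch could be avoided is to carry the retained frames' original indices through to the prompt and compute the true timestamps from them, instead of assuming frame `i` sits at `i * 2` seconds. A minimal sketch (the `build_query` helper, the `kept_indices` argument, and the 2-second base interval are my assumptions, not the repository's API):

```python
# Seconds between frames before key-frame extraction (assumed sampling rate).
BASE_INTERVAL = 2

def build_query(frame_ptr, kept_indices, prev_caption):
    """Format the sliding-window prompt using the original timestamps
    of the two frames in the current window.

    kept_indices: original frame indices that survived key-frame
    extraction, e.g. [0, 2, 3, 4, 5] for timestamps [0, 4, 6, 8, 10].
    """
    t_cur = kept_indices[frame_ptr] * BASE_INTERVAL
    t_next = kept_indices[frame_ptr + 1] * BASE_INTERVAL
    return (
        "Here are the Video frame {} at {}.00 Second(s) and Video frame {} "
        "at {}.00 Second(s) of a video, describe what happend between them. "
        "What happend before is: {}".format(
            frame_ptr, t_cur, frame_ptr + 1, t_next, prev_caption
        )
    )
```

With `kept_indices = [0, 2, 3, 4, 5]`, the first window would then correctly report 0 and 4 seconds rather than 0 and 2.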