The usage of ShareCaptioner-Video #34

Open
huangyf530 opened this issue Aug 3, 2024 · 1 comment

Comments

@huangyf530

While using ShareCaptioner-Video, I have some questions about the code.

  1. The model in captioner/fast_captioner_lmdeploy.py and captioner/slide_captioner_lmdeploy.py is built with the ChatTemplateConfig of internlm-xcomposer2-4khd. However, there is no registered template named internlm-xcomposer2-4khd, so lmdeploy falls back to BaseChatTemplate. I am not sure this is correct, and the template used for caption generation is not listed in your paper. I also tried the internlm-xcomposer2 template, which gives better results, especially in the slide-caption setting. So my first question is: which ChatTemplateConfig should I use? FYI, my lmdeploy version is 0.5.2.post1, which is the newest.
    model = pipeline(args.model_name, chat_template_config=ChatTemplateConfig(model_name='internlm-xcomposer2-4khd'))
  2. The prompt in captioner/slide_captioner_lmdeploy.py for differential sliding-window captioning uses a hard-coded map from frame index to video timestamp: the first frame corresponds to 0 seconds, the second to 2 seconds, and so on. However, your paper says a Semantic-aware Key-frame Extraction is used to avoid redundant frames, so some frames may be dropped from the video and the hard-coded map can be wrong. For example, a 10-second video yields 6 frames corresponding to timestamps [0, 2, 4, 6, 8, 10]. Suppose that after Semantic-aware Key-frame Extraction only 5 frames remain, corresponding to timestamps [0, 4, 6, 8, 10]. Then, during slide captioning, the hard-coded timestamps will be [0, 2, 4, 6, 8], which contain mistakes. Is this acceptable, or should I not combine Semantic-aware Key-frame Extraction with slide captioning?
    query = "Here are the Video frame {} at {}.00 Second(s) and Video frame {} at {}.00 Second(s) of a video, describe what happend between them. What happend before is: {}".format(
                self.frame_ptr, int(self.frame_ptr * 2), self.frame_ptr + 1, int((self.frame_ptr + 1) * 2), self.caption_list[-1])
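To illustrate the fix I have in mind, here is a minimal sketch assuming the key-frame extraction step can return the list of timestamps it actually kept (the function and variable names below are hypothetical, not from the repo). The prompt would then be formatted from the real timestamps instead of `frame_ptr * 2`:

```python
def build_slide_query(frame_ptr, kept_timestamps, prev_caption):
    """Format the differential sliding-window prompt for one frame pair.

    kept_timestamps: seconds of the frames that survived key-frame
    extraction (hypothetical output of that step), indexed by frame_ptr.
    """
    t_cur = kept_timestamps[frame_ptr]       # actual time of the current frame
    t_next = kept_timestamps[frame_ptr + 1]  # actual time of the next frame
    return (
        "Here are the Video frame {} at {}.00 Second(s) and Video frame {} "
        "at {}.00 Second(s) of a video, describe what happened between them. "
        "What happened before is: {}"
    ).format(frame_ptr, int(t_cur), frame_ptr + 1, int(t_next), prev_caption)
```

With the example above, `build_slide_query(0, [0, 4, 6, 8, 10], ...)` would label the second frame as being at 4 seconds rather than the incorrect 2 seconds.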
@wisdomikezogwo

+1
