
VideoRoPE: What Makes for Good Video Rotary Position Embedding?

🚀🚀🚀 Official implementation of VideoRoPE: What Makes for Good Video Rotary Position Embedding?

💡 Highlights

  • 🔥 Four Key Properties for Video RoPE: We present an analysis of four key properties essential for RoPE when applied to video. Motivated by this analysis, we propose VideoRoPE, which combines Low-frequency Temporal Allocation (LTA), Diagonal Layout (DL), and Adjustable Temporal Spacing (ATS) to satisfy all four properties (see the illustrative sketch after this list).
  • 🔥 A Challenging Video Haystack Retrieval Benchmark: We introduce the challenging V-NIAH-D task to expose the weaknesses of current position-embedding designs in frequency allocation. Our findings reveal that existing video LLMs are easily misled by frequency-based distractors.
  • 🔥 Excellent Performance: Extensive experiments demonstrate that VideoRoPE consistently outperforms other RoPE variants. For example, VideoRoPE surpasses the previous M-RoPE on long-video retrieval (+12.4 on V-NIAH, +12.4 on V-NIAH-D), video understanding (+2.9 on LongVideoBench, +4.5 on MLVU, +1.7 on Video-MME), and hallucination (+11.9 on VideoHallucer) benchmarks.
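
The three components can be pictured with a short sketch. This is purely illustrative and not the official code: the function names, the exact channel split, and the centering of spatial indices are our assumptions; only the underlying ideas (lowest rotary frequencies for time in LTA, vision tokens placed on the diagonal relative to text in DL, and a tunable temporal stride in ATS) come from the paper.

    import torch

    def rope_inv_freq(head_dim: int, base: float = 10000.0) -> torch.Tensor:
        # Standard RoPE inverse frequencies, ordered from highest to lowest.
        return 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)

    def allocate_lta(inv_freq: torch.Tensor):
        # Low-frequency Temporal Allocation (LTA): hand the *lowest*-frequency
        # channels to the temporal axis so long-range temporal order is kept,
        # and the higher-frequency channels to the two spatial axes.
        # (The even three-way split here is an assumption for illustration.)
        n = inv_freq.numel()
        x_freq = inv_freq[: n // 3]              # high frequencies -> width
        y_freq = inv_freq[n // 3 : 2 * n // 3]   # mid frequencies  -> height
        t_freq = inv_freq[2 * n // 3 :]          # low frequencies  -> time
        return t_freq, x_freq, y_freq

    def video_positions(num_frames: int, H: int, W: int, scale_factor: float = 2.0):
        # Diagonal Layout (DL) + Adjustable Temporal Spacing (ATS), sketched:
        # the temporal index advances by `scale_factor` per frame, and spatial
        # indices are centered on it so vision tokens stay on the diagonal
        # relative to the surrounding text tokens.
        pos = []
        for tau in range(num_frames):
            t = scale_factor * tau
            for h in range(H):
                for w in range(W):
                    pos.append((t, t + h - H / 2, t + w - W / 2))
        return torch.tensor(pos)  # (num_frames * H * W, 3): one (t, x, y) per token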

📜 News

[2025/3/7] The V-NIAH-D benchmark and training data have been released on Hugging Face.

[2025/3/7] The training code has been added to the repository; please check it out.

[2025/2/14] Code and Project Page are released!

🛠️ Usage

  • The VideoRoPE implementation is marked with #! comments, so you can locate it quickly by searching for #! (Ctrl+F).
  • For transformers inference:
    with torch.inference_mode():
        # which_rope selects the position-embedding variant (e.g. VideoRoPE
        # vs. M-RoPE); scale_factor sets the temporal spacing (the ATS stride).
        generated_ids = model.generate(
            ...,
            which_rope=which_rope,
            scale_factor=scale_factor,
        )
        # Drop the prompt tokens from each sequence before decoding.
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        generated_text = output_text[0]
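
    The snippet above assumes model, processor, and inputs are already prepared. A minimal setup might look as follows; the checkpoint name is a placeholder, the which_rope value is an assumption, and this repo's modified transformers must be installed so that the extra generate kwargs are recognized:

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

    # Hypothetical checkpoint; use a Qwen2-VL model fine-tuned with VideoRoPE.
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
    which_rope, scale_factor = "videorope", 2.0  # assumed option name and spacing
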
  • For vLLM inference:
    # Route the RoPE variant and temporal spacing through the multimodal inputs.
    mm_data['which_rope'] = which_rope
    mm_data['scale_factor'] = scale_factor
    llm_inputs = {
        "prompt": prompt,
        "multi_modal_data": mm_data,
    }
    with torch.no_grad():
        outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
    generated_text = outputs[0].outputs[0].text
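
    Here, llm and sampling_params come from the usual vLLM setup. A minimal sketch, assuming this repo's patched vLLM (which reads which_rope and scale_factor out of the multimodal data) and a placeholder checkpoint:

    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", limit_mm_per_prompt={"video": 1})
    sampling_params = SamplingParams(temperature=0.0, max_tokens=512)
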

Train

To verify the superiority of VideoRoPE, we use the diverse and high-quality video dataset LLaVA-Video-178K for video fine-tuning. To balance training efficiency and long-video comprehension, we randomly select 136K videos with durations under 2 minutes and 18K videos with durations between 2 and 3 minutes.

Once the data is prepared, you can fine-tune the model following the training data format of LLaMA-Factory:

cd LLaMA-Factory
sh multi_gpu_sft_slurm.sh

Note that, to align with the training format of Qwen2-VL, our adjustments are mainly in LLaMA-Factory/src/llamafactory/data/mm_plugin.py.
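
For reference, a single video SFT record in that format looks roughly like the sketch below; the field names follow LLaMA-Factory's multimodal demo data, and the paths and texts are placeholders:

    import json

    record = {
        "messages": [
            {"role": "user", "content": "<video>Describe what happens in this video."},
            {"role": "assistant", "content": "..."},  # target response (placeholder)
        ],
        "videos": ["path/to/video.mp4"],  # one path per <video> tag in the prompt
    }
    with open("videorope_sft.json", "w") as f:
        json.dump([record], f, indent=2)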

✒️ Citation

If you find our work helpful for your research, please consider giving us a star ⭐ and a citation 📝

@article{wei2025videorope,
  title={VideoRoPE: What Makes for Good Video Rotary Position Embedding?},
  author={Wei, Xilin and Liu, Xiaoran and Zang, Yuhang and Dong, Xiaoyi and Zhang, Pan and Cao, Yuhang and Tong, Jian and Duan, Haodong and Guo, Qipeng and Wang, Jiaqi and others},
  journal={arXiv preprint arXiv:2502.05173},
  year={2025}
}

❤️ Acknowledgments

  • transformers: the codebase we built upon. Thanks for their wonderful work.
  • vLLM: an excellent open-source codebase for high-throughput and memory-efficient inference. Thanks for their wonderful work.
  • Qwen2-VL: the amazing open-source multimodal large language model!
  • LLaMA-Factory: a wonderful framework that makes LLM & VLM training easy.
