🚀🚀🚀 Official implementation of VideoRoPE: What Makes for Good Video Rotary Position Embedding?
- Authors: Xilin Wei*, Xiaoran Liu*, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin
- Institutes: Fudan University; Shanghai AI Laboratory; Shanghai Innovation Institute
- Resources: [📖Paper] [🏠Project Page] [🤗Huggingface]
- 🔥 Four Key Properties for Positional Encoding: We present an analysis of four key properties essential for RoPE when applied to video. Motivated by this analysis, we propose VideoRoPE, which combines Low-frequency Temporal Allocation (LTA), Diagonal Layout (DL), and Adjustable Temporal Spacing (ATS) to satisfy all four properties (see the illustrative sketch after this list).
- 🔥 A Challenging Video Haystack Retrieval Benchmark: We introduce the challenging V-NIAH-D task to expose the drawbacks of current position embedding designs regarding frequency allocation. Our findings reveal that existing Video LLMs are easily misled by frequency-based distractors.
- 🔥 Excellent Performance: Extensive experiments demonstrate that VideoRoPE consistently outperforms other RoPE variants. For example, VideoRoPE surpasses the previous M-RoPE on long-video retrieval (+12.4 on V-NIAH, +12.4 on V-NIAH-D), video understanding (+2.9 on LongVideoBench, +4.5 on MLVU, +1.7 on Video-MME), and hallucination (+11.9 on VideoHallucer) benchmarks.
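To make these three components concrete, here is a minimal, illustrative sketch of how per-token 3D (temporal, x, y) position indices could be assembled. The function name, the diagonal centering, the temporal offset, and the default scale factor are simplifying assumptions for illustration, not the exact formulation used in the paper or this codebase; LTA happens inside the RoPE frequency allocation and is therefore only noted in a comment.

```python
import torch

def videorope_style_positions(num_frames, h, w, text_len, scale_factor=2.0):
    """Illustrative sketch (not the repo's exact code): build (t, x, y) position
    indices for text tokens followed by a video of `num_frames` frames of h*w patches.
    - ATS: temporal indices are stretched by `scale_factor`.
    - DL:  spatial indices are offset along the diagonal so each frame's center
           stays aligned with its (scaled) temporal position.
    LTA is a frequency-allocation choice inside the RoPE kernel (temporal channels
    use the low frequencies) and is not visible at the index level.
    """
    # Text tokens: t = x = y = token index (standard 1D behaviour).
    text_pos = torch.arange(text_len).unsqueeze(0).repeat(3, 1)  # (3, text_len)

    vision_pos = []
    for f in range(num_frames):
        t = text_len + f * scale_factor                      # ATS: scaled temporal index
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        # DL: center the spatial grid on the frame's temporal position.
        x_idx = t + (xs - (w - 1) / 2).flatten()
        y_idx = t + (ys - (h - 1) / 2).flatten()
        t_idx = torch.full_like(x_idx, t)
        vision_pos.append(torch.stack([t_idx, x_idx, y_idx]))  # (3, h*w)

    return torch.cat([text_pos.float()] + vision_pos, dim=1)   # (3, total_tokens)

# Example: 4 frames of 6x6 patches after a 10-token text prompt.
pos = videorope_style_positions(num_frames=4, h=6, w=6, text_len=10)
print(pos.shape)  # torch.Size([3, 154])
```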
[2025/3/7] The V-NIAH-D benchmark and training data have been released on Huggingface.
[2025/3/7] The training code has been added to the repository; please check it out.
[2025/2/14] Code and Project Page are released!
- The implementation of VideoRoPE is marked with #! comments; you can quickly locate it by searching for #! (Ctrl + F).
- For Transformers inference:
```python
with torch.inference_mode():
    generated_ids = model.generate(
        ...,  # usual generation inputs/kwargs (kept elided here)
        which_rope=which_rope,
        scale_factor=scale_factor,
    )

# Strip the prompt tokens and decode only the newly generated part.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
generated_text = output_text[0]
```
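For reference, a minimal setup sketch for the snippet above. The checkpoint path is a placeholder, and the which_rope / scale_factor values are assumptions; check the repo's inference scripts for the options actually accepted.

```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Placeholder checkpoint path; use the VideoRoPE-finetuned weights released with this repo.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "path/to/videorope-qwen2-vl", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("path/to/videorope-qwen2-vl")

# Assumed values: the RoPE variant name and the ATS scale factor passed through
# to the patched `generate`; the exact option names may differ in this repo.
which_rope = "videorope"
scale_factor = 2.0
```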
- For vLLM inference:
```python
# Pass the RoPE variant and temporal scale factor alongside the multi-modal data.
mm_data['which_rope'] = which_rope
mm_data['scale_factor'] = scale_factor

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

with torch.no_grad():
    outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
```
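Similarly, a hedged setup sketch for the vLLM path. The model path, sampling parameters, and multi-modal payload are placeholders; the which_rope and scale_factor entries are consumed by this repo's patched Qwen2-VL model in vLLM, not by stock vLLM.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint path; point this at the VideoRoPE-finetuned Qwen2-VL weights.
llm = LLM(model="path/to/videorope-qwen2-vl", limit_mm_per_prompt={"video": 1})
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

prompt = "..."  # build with the Qwen2-VL chat template via the processor
mm_data = {"video": "path/to/video.mp4"}  # placeholder payload; this repo's loader decides the accepted type
which_rope, scale_factor = "videorope", 2.0  # assumed option name and value
```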
To verify the superiority of VideoRoPE, we use the diverse and high-quality video dataset LLaVA-Video-178K for video fine-tuning. To balance training efficiency and long-video comprehension, we randomly select 136K videos with durations under 2 minutes and 18K videos with durations between 2 and 3 minutes.
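As an illustration of this duration-based split, a small filtering sketch is shown below; the metadata file name and the duration field are hypothetical, so adapt them to your copy of the LLaVA-Video-178K annotations.

```python
import json
import random

# Hypothetical metadata file: one record per video with a "duration" field in seconds.
with open("llava_video_178k_metadata.json") as f:
    records = json.load(f)

short = [r for r in records if r["duration"] < 120]            # under 2 minutes
medium = [r for r in records if 120 <= r["duration"] <= 180]   # between 2 and 3 minutes

random.seed(0)
subset = random.sample(short, 136_000) + random.sample(medium, 18_000)

with open("videorope_sft_subset.json", "w") as f:
    json.dump(subset, f)
```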
Once the data is prepared, one can fine-tune the model following the training data format of LLaMA-Factory (a sketch of the expected record format follows the commands below):
```bash
cd LLaMA-Factory
sh multi_gpu_sft_slurm.sh
```
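For orientation, a hedged sketch of what a single training record might look like in LLaMA-Factory's multimodal (sharegpt-style) format; the file names, paths, and exact field layout are assumptions, so treat LLaMA-Factory's own video demo data and dataset_info.json as the authoritative reference.

```python
import json

# Hypothetical single record in LLaMA-Factory's video chat format.
record = {
    "messages": [
        {"role": "user", "content": "<video>Describe what happens in this clip."},
        {"role": "assistant", "content": "A person walks into the kitchen and pours a glass of water."},
    ],
    "videos": ["data/videos/sample_0001.mp4"],
}

with open("data/videorope_sft_demo.json", "w") as f:
    json.dump([record], f, indent=2)
# The new file also needs an entry in LLaMA-Factory's data/dataset_info.json.
```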
Note that, in order to align with the training format of Qwen2-VL, we mainly made adjustments to LLaMA-Factory/src/llamafactory/data/mm_plugin.py.
If you find our work helpful for your research, please consider giving us a star ⭐ and a citation 📝.
@article{wei2025videorope,
title={VideoRoPE: What Makes for Good Video Rotary Position Embedding?},
author={Wei, Xilin and Liu, Xiaoran and Zang, Yuhang and Dong, Xiaoyi and Zhang, Pan and Cao, Yuhang and Tong, Jian and Duan, Haodong and Guo, Qipeng and Wang, Jiaqi and others},
journal={arXiv preprint arXiv:2502.05173},
year={2025}
}
- transformers: the codebase we built upon. Thanks for their wonderful work.
- vLLM: an excellent open-source codebase for high-throughput and memory-efficient inference. Thanks for their wonderful work.
- Qwen2-VL: the amazing open-source multimodal large language model!
- LLaMA-Factory: wonderful work on facilitating LLM and VLM training.