🚀🚀🚀 Official implementation of VideoRoPE: What Makes for Good Video Rotary Position Embedding?
- Authors: Xilin Wei*, Xiaoran Liu*, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin
- Institutes: Fudan University; Shanghai AI Laboratory; Shanghai Innovation Institute
- Resources: [📖Paper] [🏠Project Page] [🤗Huggingface]
- 🔥 Four Key Properties for Positional Encoding: We present an analysis of four key properties essential for RoPE when applied to video. Motivated by this analysis, we propose VideoRoPE, which combines Low-frequency Temporal Allocation (LTA), Diagonal Layout (DL), and Adjustable Temporal Spacing (ATS) to satisfy all four properties (see the illustrative sketch after this list).
- 🔥 A Challenging Video Haystack Retrieval Benchmark: We introduce the challenging V-NIAH-D task to expose the drawbacks of current position embedding designs regarding frequency allocation. Our findings reveal that existing Video LLMs are easily misled by frequency-based distractors.
- 🔥 Excellent Performance: Extensive experiments demonstrate that VideoRoPE consistently outperforms other RoPE variants. For example, VideoRoPE surpasses the previous M-RoPE on long-video retrieval (+12.4 on V-NIAH, +12.4 on V-NIAH-D), video understanding (+2.9 on LongVideoBench, +4.5 on MLVU, +1.7 on Video-MME), and hallucination (+11.9 on VideoHallucer) benchmarks.
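To make these three components concrete, here is a minimal, illustrative sketch of how per-token 3D (temporal, x, y) position indices could be assembled. The function name, the diagonal centering, the temporal offset, and the default scale factor are simplifying assumptions for illustration, not the exact formulation used in the paper or this codebase; LTA happens inside the RoPE frequency allocation and is therefore only noted in a comment.

```python
import torch

def videorope_style_positions(num_frames, h, w, text_len, scale_factor=2.0):
    """Illustrative sketch (not the repo's exact code): build (t, x, y) position
    indices for text tokens followed by a video of `num_frames` frames of h*w patches.
    - ATS: temporal indices are stretched by `scale_factor`.
    - DL:  spatial indices are offset along the diagonal so each frame's center
           stays aligned with its (scaled) temporal position.
    LTA is a frequency-allocation choice inside the RoPE kernel (temporal channels
    use the low frequencies) and is not visible at the index level.
    """
    # Text tokens: t = x = y = token index (standard 1D behaviour).
    text_pos = torch.arange(text_len).unsqueeze(0).repeat(3, 1)  # (3, text_len)

    vision_pos = []
    for f in range(num_frames):
        t = text_len + f * scale_factor                      # ATS: scaled temporal index
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        # DL: center the spatial grid on the frame's temporal position.
        x_idx = t + (xs - (w - 1) / 2).flatten()
        y_idx = t + (ys - (h - 1) / 2).flatten()
        t_idx = torch.full_like(x_idx, t)
        vision_pos.append(torch.stack([t_idx, x_idx, y_idx]))  # (3, h*w)

    return torch.cat([text_pos.float()] + vision_pos, dim=1)   # (3, total_tokens)

# Example: 4 frames of 6x6 patches after a 10-token text prompt.
pos = videorope_style_positions(num_frames=4, h=6, w=6, text_len=10)
print(pos.shape)  # torch.Size([3, 154])
```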
[2025/3/7] The V-NIAH-D benchmark and training data have been released on Huggingface.
[2025/3/7] The training code has been added to the repository; please check it out.
[2025/2/14] Code and Project Page are released!
- The implementation of VideoRoPE is marked with #! comments; you can quickly locate it by searching for #! (Ctrl + F).
- For Transformers inference:
```python
with torch.inference_mode():
    generated_ids = model.generate(
        ...,  # usual generation inputs/kwargs (kept elided here)
        which_rope=which_rope,
        scale_factor=scale_factor,
    )

# Strip the prompt tokens and decode only the newly generated part.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
generated_text = output_text[0]
```
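For reference, a minimal setup sketch for the snippet above. The checkpoint path is a placeholder, and the which_rope / scale_factor values are assumptions; check the repo's inference scripts for the options actually accepted.

```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Placeholder checkpoint path; use the VideoRoPE-finetuned weights released with this repo.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "path/to/videorope-qwen2-vl", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("path/to/videorope-qwen2-vl")

# Assumed values: the RoPE variant name and the ATS scale factor passed through
# to the patched `generate`; the exact option names may differ in this repo.
which_rope = "videorope"
scale_factor = 2.0
```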
- For vLLM inference:
```python
# Pass the RoPE variant and temporal scale factor alongside the multi-modal data.
mm_data['which_rope'] = which_rope
mm_data['scale_factor'] = scale_factor

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

with torch.no_grad():
    outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text
```
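Similarly, a hedged setup sketch for the vLLM path. The model path, sampling parameters, and multi-modal payload are placeholders; the which_rope and scale_factor entries are consumed by this repo's patched Qwen2-VL model in vLLM, not by stock vLLM.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint path; point this at the VideoRoPE-finetuned Qwen2-VL weights.
llm = LLM(model="path/to/videorope-qwen2-vl", limit_mm_per_prompt={"video": 1})
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

prompt = "..."  # build with the Qwen2-VL chat template via the processor
mm_data = {"video": "path/to/video.mp4"}  # placeholder payload; this repo's loader decides the accepted type
which_rope, scale_factor = "videorope", 2.0  # assumed option name and value
```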
To verify the superiority of VideoRoPE, we use the diverse and high-quality video dataset LLaVA-Video-178K for video fine-tuning. To balance training efficiency and long-video comprehension, we randomly select 136K videos with durations under 2 minutes and 18K videos with durations between 2 and 3 minutes.
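As an illustration of this duration-based split, a small filtering sketch is shown below; the metadata file name and the duration field are hypothetical, so adapt them to your copy of the LLaVA-Video-178K annotations.

```python
import json
import random

# Hypothetical metadata file: one record per video with a "duration" field in seconds.
with open("llava_video_178k_metadata.json") as f:
    records = json.load(f)

short = [r for r in records if r["duration"] < 120]            # under 2 minutes
medium = [r for r in records if 120 <= r["duration"] <= 180]   # between 2 and 3 minutes

random.seed(0)
subset = random.sample(short, 136_000) + random.sample(medium, 18_000)

with open("videorope_sft_subset.json", "w") as f:
    json.dump(subset, f)
```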
Once the data is prepared, one can fine-tune the model following the training data format of LLaMA-Factory (a sketch of the expected record format follows the commands below):
```bash
cd LLaMA-Factory
sh multi_gpu_sft_slurm.sh
```
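For orientation, a hedged sketch of what a single training record might look like in LLaMA-Factory's multimodal (sharegpt-style) format; the file names, paths, and exact field layout are assumptions, so treat LLaMA-Factory's own video demo data and dataset_info.json as the authoritative reference.

```python
import json

# Hypothetical single record in LLaMA-Factory's video chat format.
record = {
    "messages": [
        {"role": "user", "content": "<video>Describe what happens in this clip."},
        {"role": "assistant", "content": "A person walks into the kitchen and pours a glass of water."},
    ],
    "videos": ["data/videos/sample_0001.mp4"],
}

with open("data/videorope_sft_demo.json", "w") as f:
    json.dump([record], f, indent=2)
# The new file also needs an entry in LLaMA-Factory's data/dataset_info.json.
```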
Note that, in order to align with the training format of Qwen2-VL, we mainly made adjustments to LLaMA-Factory/src/llamafactory/data/mm_plugin.py.
If you find our work helpful for your research, please consider giving us a star ⭐ and a citation 📝.
@article{wei2025videorope,
title={VideoRoPE: What Makes for Good Video Rotary Position Embedding?},
author={Wei, Xilin and Liu, Xiaoran and Zang, Yuhang and Dong, Xiaoyi and Zhang, Pan and Cao, Yuhang and Tong, Jian and Duan, Haodong and Guo, Qipeng and Wang, Jiaqi and others},
journal={arXiv preprint arXiv:2502.05173},
year={2025}
}
- transformers: the codebase we built upon. Thanks for their wonderful work.
- vLLM: an excellent open-source codebase for high-throughput and memory-efficient inference. Thanks for their wonderful work.
- Qwen2-VL: the amazing open-source multimodal large language model!
- LLaMA-Factory: wonderful work on facilitating LLM and VLM training.