CRAVE - Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency
Current video generation models such as Sora offer a substantial improvement in generation quality over previous models, characterized by rich details and content. They support longer text control (often over 200 characters) and longer durations (often over 5 seconds at 24 fps). Compared with earlier AIGC videos, their outputs rarely exhibit the flicker issues common in previous models, and typically involve more intricate prompts, more complex motion patterns, and richer details. To evaluate this new generation of video generation models, we introduce CRAVE.
CRAVE evaluates AIGC video quality from three perspectives: traditional natural video quality assessment aspects such as aesthetics and distortion; text-video semantic alignment via Multi-granularity Text-Temporal (MTT) fusion; and the dynamic distortions specific to AIGC videos via the Sparse-Dense Motion-aware (SDM) module. The overall framework is illustrated in Figure 3. For traditional natural video quality assessment, we use DOVER to assess individual videos from the aesthetic and technical perspectives, given its proven success. For the other aspects, we design an effective sparse-dense motion-aware module for video dynamics modeling, and a multi-granularity text-temporal fusion module that aligns textual semantics with the complex spatio-temporal relationships in video clips.
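The three-branch design above can be sketched as follows. This is a minimal illustrative fusion, assuming each branch (DOVER aesthetic/technical, MTT alignment, SDM motion) has already produced a scalar score; the branch names, the `BranchScores` container, and the equal weighting are hypothetical, not the repo's actual API.

```python
from dataclasses import dataclass

# Hypothetical container for per-branch scores. In the real pipeline these
# would be predicted from video frames and the text prompt.
@dataclass
class BranchScores:
    aesthetic: float   # DOVER aesthetic perspective
    technical: float   # DOVER technical (distortion) perspective
    alignment: float   # MTT text-video semantic alignment
    motion: float      # SDM motion-aware temporal consistency

def crave_score(s: BranchScores,
                weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Fuse the per-branch scores into one overall quality score.

    The equal weighting is an illustrative assumption; the actual
    fusion used by CRAVE may differ.
    """
    parts = (s.aesthetic, s.technical, s.alignment, s.motion)
    return sum(w * p for w, p in zip(weights, parts))

# Example: a weighted mean of four branch scores.
print(crave_score(BranchScores(0.8, 0.7, 0.9, 0.6)))  # -> 0.75
```

Keeping the branches separate until a final fusion step makes each perspective (aesthetics, distortion, alignment, motion) individually inspectable when diagnosing a low-scoring video.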
```shell
pip install -r requirements.txt
```
Download the pre-trained models here and put them into `ckpts`.
Download the model weights here and put them into `pretrained_weights`.
To run the demo, additionally install VideoGenEval and prepare the related data.
```shell
python infer.py
```
This repo is built on BLIP and DOVER. We thank the authors for their nice work.