ETVA: Evaluating Text-to-Video Alignment with Fine-Grained Question Generation and Answering

Machine Learning


Accurately evaluating the semantic alignment between a text prompt and the generated video remains a challenge for text-to-video (T2V) generation. Existing text-to-video alignment metrics, such as CLIPScore, produce only coarse-grained scores without fine-grained alignment details, and therefore fail to reflect human preferences. To address this limitation, we propose ETVA, a new method for evaluating text-to-video alignment via fine-grained question generation and answering. First, a multi-agent system parses each prompt into a semantic scene graph and generates atomic questions from it. An auxiliary LLM then retrieves relevant commonsense knowledge (e.g., physical laws), after which a video LLM answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments show that ETVA achieves a Spearman correlation coefficient of 58.47 with human judgments, far higher than existing metrics, which reach only 31.0. We also build a comprehensive benchmark specifically designed for text-to-video alignment assessment, with 2K diverse prompts and 12K atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key strengths and limitations, paving the way for next-generation T2V generation. All code and datasets will be released soon.
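The abstract describes a pipeline in which atomic yes/no questions derived from a prompt's scene graph are answered by a video LLM, and the alignment score aggregates those answers. The sketch below illustrates that scoring idea in Python; all names (`AtomicQuestion`, `etva_style_score`, `answer_fn`) are hypothetical placeholders, not the paper's actual API, and the stub answerer stands in for the multi-stage video-LLM reasoning step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AtomicQuestion:
    # A yes/no question probing one element of the prompt's scene graph,
    # e.g. "Is there a dog in the video?" (illustrative only).
    text: str


def etva_style_score(
    questions: List[AtomicQuestion],
    answer_fn: Callable[[str], bool],
) -> float:
    """Score a video as the fraction of atomic questions answered 'yes'.

    `answer_fn` is a stand-in for the video-LLM answering stage; a real
    system would also condition on retrieved commonsense knowledge.
    """
    if not questions:
        return 0.0
    affirmed = sum(1 for q in questions if answer_fn(q.text))
    return affirmed / len(questions)


# Toy usage with a stub answerer that only "verifies" one question.
qs = [
    AtomicQuestion("Is there a dog?"),
    AtomicQuestion("Is the dog running?"),
]
stub = lambda q: "dog?" in q
print(etva_style_score(qs, stub))  # 0.5 with this stub
```

Averaging per-question answers like this is what gives the metric its fine-grained character: each failed question pinpoints which part of the prompt the video missed, rather than collapsing everything into one opaque similarity score.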



