How we tested and evaluated AI-generated dance videos

AI Video & Visuals


Written by Mohamed Al Erew, Kari Johnson, and Levi Sumagaisai, CalMatters

"a
Zion Harris (center) rehearses for the monthly dance showcase Jetée at Heart WeHo in West Hollywood, Los Angeles, on September 19, 2024. Photo by Alisha Jusevich, CalMatters

This article was originally published by CalMatters. Sign up for our newsletter.

Artificial intelligence models can generate lifelike video footage with simple text prompts. However, these tools still struggle to produce realistic videos of complex natural movements such as human dance.

When CalMatters and The Markup asked dancers and choreographers whether AI could disrupt their industry, most concluded that human dancers cannot be replaced.

Read our story: Our video tests prove that generative AI is still bad at dancing. See for yourself.

It turns out they were right most of the time. We tested nine different cultural, contemporary, and popular dance styles using four commercially available generative AI video models, generating a total of 36 videos. The latest commercial AI video models can produce convincingly lifelike footage of people dancing, but none of them produced a person performing the specific dance it was instructed to generate.

In about one-third of the generated videos, the subjects appeared inconsistently from frame to frame, with abnormalities in their movements and limbs. Still, the frequency and severity of these issues have decreased significantly since our first test in late 2024.

Methodology

Defining the task

CalMatters and The Markup tested four commercial video generation models created by leading technology companies to create video clips of traditional and popular dances.

We limited our testing to consumer-grade, closed-source generative video tools, because these are the easiest for everyday users to access and tend to perform better than open-source models. We tested OpenAI’s Sora 2, Google’s Veo 3.1, Kuaishou’s Kling 2.5, and MiniMax’s Hailuo 2.3.

Preparing the prompts

We created nine video prompts that test different dances in a variety of environments, including dance floors, stages, bedrooms, studios, cultural events, public squares, and classrooms. We tested popular contemporary and traditional cultural dance styles, including the Macarena, the Mashed Potato, Folklorico, and popular TikTok dances. See the appendix for more information.

We varied the level of specificity to test whether identifying dances by name is sufficient to generate a video of the desired motion, or whether explicitly specifying the exact body movements would improve the output.

Before finalizing the list of prompts, we submitted them to ChatGPT for editing based on the Sora 2 prompt guide. See the prompt optimization discussion in the limitations section for more information.

Sending prompts for video generation

Each prompt was sent once using each model’s default settings to produce a landscape video. Three of the prompts sent to Sora 2 were edited to remove words that triggered OpenAI’s filters, which block prompts that may violate “guardrails regarding similarity to third-party content.” For example, Sora 2 flagged prompts that mentioned specific years, popular music artists, or banned words. One of the blocked prompts asked for a video of a politician dancing the Macarena; for that prompt, replacing “politician in a suit” with “man in a suit” got around the guardrail. In Veo 3.1, similar prompts were flagged when sent via Gemini or Flow, but not when sent directly to the Veo 3.1 API.
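The kind of phrase substitution described above can be sketched as a simple preprocessing step. This is an illustrative sketch, not the authors’ actual tooling; the substitution table and function name are hypothetical, with the one example swap taken from the article.

```python
# Hypothetical table of phrases a model's content filter flagged, mapped
# to neutral replacements. The single entry below is the example from the
# article; a real table would be built up by trial and error.
SUBSTITUTIONS = {
    "politician in a suit": "man in a suit",
}

def soften_prompt(prompt: str, substitutions: dict[str, str]) -> str:
    """Return the prompt with each flagged phrase replaced by its neutral form."""
    for flagged, neutral in substitutions.items():
        prompt = prompt.replace(flagged, neutral)
    return prompt

original = "A politician in a suit dances the Macarena in a public square."
print(soften_prompt(original, SUBSTITUTIONS))
# -> A man in a suit dances the Macarena in a public square.
```

A lookup table like this keeps the rest of the prompt identical across models, so any difference in output can be attributed to the one substituted phrase rather than to a rewritten prompt.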

Evaluating the generated videos

We evaluated the generated videos based on six different criteria regarding prompt alignment and video consistency.

  1. Did the main subject dance?
  2. Did the main subject perform the specific dance we requested?
  3. Did the main subject maintain the same appearance throughout the video?
  4. Did the main subject’s movements look realistic given human physiology?
  5. Did the scene and setting match the prompt?
  6. Did the camera match the specified angle and position?

Each of the above criteria was rated as pass or fail by one reviewer, with assistance from a second reviewer if necessary. The generated cultural dance videos were reviewed for accuracy by dancers familiar with the dance.

Results

Of the 36 videos generated, all but one featured dancing. One video produced by Kling 2.5 did not show the subject dancing; instead, it showed the subject’s lower half doing a side lunge.

None of the videos depicted the specific dance we requested. Regarding the Cahuilla Band of Indians’ bird dances, tribal member Emily Clark said, “None of these depictions come close to bird dances, in my opinion.” Although the Horton dance video did not show the specific moves we requested, choreographer Emma Andre said she felt the Veo 3.1 depiction was “surprisingly lifelike.”

For the remaining pop culture dances, we compared the generated videos to videos found on YouTube to assess whether the dances were accurate.

Eleven of the 36 videos showed inconsistent movement or appearance. These included sudden changes in the structure of clothing, hair, and limbs, such as a head rotating on an axis separate from the body or limbs liquefying and reconfiguring.

See the appendix for full results and videos.

Limitations

Image-to-video generation

We did not provide images as input to the models. Image-to-video generation involves uploading a static image along with a text prompt and generating a dynamic video from both. It is an advertised use case for models that generate dance videos from user-submitted images.

Dance videos with multiple subjects

We did not request videos featuring multiple dancers, even though some dances are often performed in groups. To avoid ambiguity as to whether the evaluation failure was due to problems in generating complex human movements or realistic multi-subject videos, we restricted the video prompt to showcasing a single dancer.

Prompt optimization

We did not optimize the prompts for each model. Each company publishes its own prompting guide. (See the guides for Veo 3.1, Hailuo 2.3, Kling 2.5, and Sora 2.) Instead, we used ChatGPT 5 to standardize prompts across models to align with the Sora 2 prompt guide. Optimizing the prompts per model by following these guides could have yielded more accurate results.

We also attempted to improve video quality by providing detailed step-by-step instructions for each dance. However, these instructions did not produce more accurate videos than simpler prompts did.

Human motion generation model

We did not test generative models that focused on human movement generation. These models are used to generate and capture natural human movement in animation and video games. Researchers are using large datasets, including popular dance videos on TikTok, to train the most advanced academic models in the field. These models may perform better than the consumer models we tested, but they require technical expertise and significant computational resources to run.

Sample size

Our evaluation is limited to videos generated from nine prompts; it is not a comprehensive evaluation of the models we used. Some video generation benchmarks, such as one from Tencent’s AI Lab, use hundreds of prompts to test features such as complex motion, multiple subjects, and creative styles.

Acknowledgments

We would like to thank Yuhang Yang (University of Science and Technology of China) and Xiaodong Cun (University of Great Bay) for reviewing early drafts of this methodology.

Appendix

View the evaluation by prompt or the evaluation by model.

Evaluation by prompt




