Imagine a figure approaching from a distance. Before you see their face or hear their voice, you must instantly decide whether they are a friend or a threat. While humans effortlessly read subtle body language to fuel this survival instinct, artificial intelligence (AI) continues to struggle. Historically, AI has focused on recognizing basic emotions (such as happiness) and physical behaviors (such as walking), ignoring social intentions, which are social signals directed toward others. For service robots and AI agents, knowing whether a person poses a threat is much more important than simply identifying a person’s emotions.
Now, researchers have established a new benchmark for ‘embodied social intent’, uncovered how humans signal threats, and identified a critical ‘alignment gap’ between human cognition and AI.
To study how humans transmit these signals, researchers at Tohoku University recorded 160 motion-capture performances by 80 performers from Japan and Taiwan. The cast had to rely on purely nonverbal body language to communicate friendly or hostile intentions to “imaginary aliens” who had just landed on Earth and had no knowledge of human culture or language.
Common friendly behaviors directed at the aliens included bowing to show politeness and humility, and opening up the body with arms outstretched in greeting. To signal hostility, performers took threatening actions such as throwing objects to scare the aliens away.
The researchers also enlisted the help of 77 observers from Japan, Taiwan, and China, who watched all 160 videos and judged whether each one was friendly or hostile. Interestingly, Taiwanese performers tended to make large, forceful movements to show hostility. Their fast, physically powerful movements made their hostile intent easy for all viewers to understand. Japanese performers, however, acted differently.
Their hostile movements were smaller and more controlled, and contained a tenth of the movement energy of the Taiwanese clips. Japanese viewers detected these subtle signals significantly better (76% accuracy) than Taiwanese and Chinese viewers (69% and 65%, respectively).
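To make the idea of "movement energy" concrete, here is a minimal sketch in Python. The article does not say how the researchers computed this quantity, so the definition used here (mean squared joint speed across a clip) is an assumption, and the function name `movement_energy`, the frame layout, and the `fps` parameter are hypothetical.

```python
def movement_energy(frames, fps=30.0):
    """Mean squared joint speed over a clip.

    Assumed definition for illustration only; not the paper's metric.
    `frames` is a list of frames, each a list of (x, y, z) joint positions.
    """
    total = 0.0
    count = 0
    # Compare each frame with the next one to estimate joint velocities.
    for prev, curr in zip(frames, frames[1:]):
        for (x0, y0, z0), (x1, y1, z1) in zip(prev, curr):
            vx = (x1 - x0) * fps
            vy = (y1 - y0) * fps
            vz = (z1 - z0) * fps
            total += vx * vx + vy * vy + vz * vz
            count += 1
    # A clip with fewer than two frames carries no measurable motion.
    return total / count if count else 0.0


# Hypothetical one-joint clips: a large, fast gesture and a small, controlled one.
forceful = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
subtle = [[(0.0, 0.0, 0.0)], [(0.5, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
ratio = movement_energy(subtle, fps=1.0) / movement_energy(forceful, fps=1.0)
```

Under this assumed definition, halving the speed of a gesture cuts its energy to a quarter, which shows how a modest-looking difference in movement size can produce a large difference in energy.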
When the researchers tested an AI model (ST-GCN, a spatial-temporal graph convolutional network), they discovered a significant blind spot. Although the AI achieved 69% accuracy, it still could not “think” like a human (Figure 2). Human observers across the three cultures (Figure 3) showed high agreement with one another (correlation >0.79), whereas the AI’s judgments barely matched human perceptions (correlation of only 0.26). Humans use cognitive “counter planning” to infer the hidden mental goals behind actions. The AI, by contrast, merely matched physical patterns and failed to recognize the strong social meaning carried by subtle, passive-aggressive movements. For example, suppose someone is standing still with their arms folded tightly and their body turned slightly away. The AI registers almost no movement and treats the pose as harmless, whereas humans instantly read it as “retreat.” Simply put, the movements that confuse human observers and the movements that confuse the AI are completely different.
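The agreement figures above are correlations between sets of judgments. As a rough sketch of how such a figure can be computed, the Python below implements a Pearson correlation over per-clip scores. The article does not state which correlation measure or which per-clip quantity the researchers used, so both are assumptions here, and the example scores are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of per-clip scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-clip scores (share of "hostile" judgments); not the study's data.
group_a = [0.9, 0.8, 0.3, 0.2, 0.6]
group_b = [0.8, 0.9, 0.4, 0.1, 0.5]
model = [0.9, 0.2, 0.6, 0.3, 0.4]

human_human = pearson(group_a, group_b)
human_model = pearson(group_a, model)
```

Two observers can have similar overall accuracy and still correlate weakly if they make their mistakes on different clips, which is the pattern the article describes between humans and the AI.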
This “alignment gap” poses a safety risk to human-machine interaction. A system that accurately categorizes high-energy threats but fails to notice low-energy hostility may be unable to defuse sensitive conflicts. Bridging this gap will require AI that is not only accurate but also perceptually consistent with human social cognition, capable of interpreting not just how people move, but what those movements mean.
- Publication details:
Title: Enemy or ally? Benchmarking Human Perception and ST-GCN Decoding of Embodied Social Intentions
Authors: Miao Chen, Zhang Dai, Victor Schneider, Kanta Ozawa, Tsai Yangyang, Ken Fujiwara, Yoshifumi Kitamura, Jiahui Zeng
Conference: 2026 International Conference on Automatic Facial and Gesture Recognition (FG)
