This paper was accepted at the “Principled Design for Trustworthy AI — Interpretability, Robustness, and Safety across Modalities Workshop” at ICLR 2026.
What makes a particular image unsafe? Systematically distinguishing between benign and problematic images is a difficult problem, as subtle changes to an image, such as a derogatory gesture or symbol, can drastically change its safety implications. However, existing image safety datasets are coarse and ambiguous, providing only broad safety labels without isolating the specific features that cause these differences. We introduce SafetyPairs, a scalable framework for generating counterfactual pairs of images that differ only in the features relevant to a given safety policy, thus flipping their safety label. By leveraging image editing models, we make targeted changes to images that alter their safety labels while leaving safety-irrelevant details unchanged. Using SafetyPairs, we construct a new safety benchmark, which serves as a powerful source of evaluation data that highlights weaknesses in vision-language models’ ability to distinguish between subtly different images. Beyond evaluation, we find that our pipeline serves as an effective data augmentation strategy that improves the sample efficiency of training lightweight guard models. We release a benchmark containing over 3,020 SafetyPair images spanning a diverse taxonomy of nine safety categories, providing the first systematic resource for studying fine-grained image safety distinctions.
- † Georgia Institute of Technology, USA
- ** Work done while at Apple
- ‡ Equal senior authorship
