SafetyPairs: Separation of safety-critical image features through counterfactual image generation

This paper was accepted at the “Principled Design for Trustworthy AI — Interpretability, Robustness, and Safety across Modalities Workshop” at ICLR 2026.

What makes a particular image unsafe? Systematically distinguishing between benign and problematic images is difficult, because subtle changes to an image, such as a derogatory gesture or symbol, can drastically change its safety implications. However, existing image safety datasets are coarse and ambiguous: they provide only broad safety labels without isolating the specific features that drive these differences. We introduce SafetyPairs, a scalable framework for generating counterfactual pairs of images that differ only in the features relevant to a given safety policy, so that the two images in a pair carry opposite safety labels. We leverage image editing models to make targeted changes that flip an image's safety label while leaving safety-irrelevant details unchanged. Using SafetyPairs, we construct a new safety benchmark, which serves as a powerful source of evaluation data and highlights weaknesses in vision-language models' ability to distinguish between subtly different images. Beyond evaluation, we find that our pipeline is an effective data augmentation strategy that improves the sample efficiency of training lightweight guard models. We release a benchmark containing over 3,020 SafetyPair images spanning a diverse taxonomy of nine safety categories, providing the first systematic resource for studying fine-grained image safety distinctions.
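Paired counterfactuals make it possible to score a classifier on whether it separates the two members of a pair, not just on how many individual images it labels correctly. The sketch below is a hypothetical illustration of that idea: the `SafetyPair` record, its field names, and the `image_accuracy` and `pair_accuracy` metrics are assumptions made for this example, not the released benchmark's schema or the paper's evaluation protocol. It assumes a binary classifier that returns `True` for images it judges unsafe.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


# Hypothetical record for one counterfactual pair; field names are illustrative,
# not the schema of the released SafetyPairs benchmark.
@dataclass(frozen=True)
class SafetyPair:
    unsafe_image: str  # identifier of the image that violates the policy
    safe_image: str    # identifier of its counterfactual, differing only in policy-relevant features
    category: str      # safety category the pair targets (illustrative label)


# Assumed classifier interface: returns True if it judges the image unsafe.
Classifier = Callable[[str], bool]


def image_accuracy(pairs: Sequence[SafetyPair], is_unsafe: Classifier) -> float:
    """Accuracy over individual images, ignoring which images are paired."""
    if not pairs:
        raise ValueError("need at least one pair")
    hits = sum(
        int(is_unsafe(p.unsafe_image)) + int(not is_unsafe(p.safe_image))
        for p in pairs
    )
    return hits / (2 * len(pairs))


def pair_accuracy(pairs: Sequence[SafetyPair], is_unsafe: Classifier) -> float:
    """A pair counts as correct only if both of its images are labeled correctly."""
    if not pairs:
        raise ValueError("need at least one pair")
    hits = sum(
        1 for p in pairs
        if is_unsafe(p.unsafe_image) and not is_unsafe(p.safe_image)
    )
    return hits / len(pairs)


if __name__ == "__main__":
    pairs = [
        SafetyPair("a_unsafe.png", "a_safe.png", "example_category"),
        SafetyPair("b_unsafe.png", "b_safe.png", "example_category"),
    ]
    # A degenerate guard that flags every image as unsafe.
    flag_everything: Classifier = lambda image: True
    print(image_accuracy(pairs, flag_everything))  # 0.5
    print(pair_accuracy(pairs, flag_everything))   # 0.0
```

In this toy case, a guard that flags everything still gets half of the individual images right but separates none of the pairs. That gap is the kind of failure a paired benchmark is designed to surface and a coarse per-image label can hide.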

† Georgia Institute of Technology, USA
** Work done while at Apple
‡ Equal senior authorship
