
Computer vision aims to enable machines to interpret and understand visual information from the world. It covers tasks such as image recognition, object detection, and visual search, with the goal of building models that process and analyze visual data effectively. These models are often trained on large datasets with noisy labels and uneven data quality. Despite their capabilities, they may not produce results that match human aesthetic preferences, including visual appeal, style, and cultural background. This mismatch can lead to a suboptimal user experience, especially in visual search systems, where the quality of the retrieved images is critical.
A major challenge in computer vision is aligning visual models with human aesthetic preferences. Although visual models are powerful, they often fail to produce visually appealing results that meet users' expectations for aesthetics, style, and cultural context. State-of-the-art visual models such as CLIP and LDM, trained on large image-text pair datasets, perform well in semantic matching but may prioritize images that do not match the user's intent. For example, a model may return images that perfectly match the search query but lack aesthetic appeal, or return harmful results that violate the principles of responsible AI. Existing benchmarks for search systems often pay too little attention to aesthetics and responsible AI assessment.
Advanced search systems incorporate multi-stage aesthetic models as re-rankers and filters. These models mainly focus on low-level features such as saturation and often miss high-level style and cultural context. Large and noisy datasets further complicate achieving consistent aesthetic alignment. In industrial applications such as Google and Bing search, these issues are mitigated using multi-stage approaches. However, these methods introduce additional latency and model bias, and they require more maintenance resources. Integrating human preferences into model features and simplifying search into an end-to-end system is a worthwhile research goal, especially for on-device applications and large-scale API services.
Researchers from Southeast University, Tsinghua University, Fudan University, and Microsoft introduced a preference-based reinforcement learning method to fine-tune visual models. The approach combines the reasoning capabilities of large language models (LLMs) with aesthetic models to better match human aesthetic sensibilities. The method uses LLMs to rephrase search queries so that the aesthetic expectations embedded in them become explicit. The rephrased query is then used, along with a publicly available aesthetic model, to re-rank retrieved images. Combining high-level conceptual understanding with low-level visual appeal yields more aesthetically pleasing image sequences.
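The rephrase-then-re-rank idea described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Candidate` fields, the prompt wording, the `llm` callable, and the weighted blend of scores are hypothetical and are not taken from the researchers' code, which may combine the signals differently.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    image_id: str
    semantic_score: float   # query-image similarity, e.g. from a CLIP-style model
    aesthetic_score: float  # output of an aesthetic predictor, assumed to lie in [0, 1]


def rephrase_query(query: str, llm: Callable[[str], str]) -> str:
    """Ask an LLM (any callable str -> str) to make aesthetic expectations explicit."""
    prompt = (
        "Rewrite this image search query so that it spells out the visual "
        f"style and quality a user would likely expect: {query}"
    )
    return llm(prompt)


def rerank(candidates: List[Candidate], aesthetic_weight: float = 0.5) -> List[Candidate]:
    """Sort candidates by a weighted blend of semantic and aesthetic scores.

    The linear blend is an illustrative assumption, not the published method.
    """
    def blended(c: Candidate) -> float:
        return (1.0 - aesthetic_weight) * c.semantic_score + aesthetic_weight * c.aesthetic_score

    return sorted(candidates, key=blended, reverse=True)
```

In this sketch, setting `aesthetic_weight` to 0 reproduces a purely semantic ranking, which makes it easy to compare the two orderings for the same candidate set.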
The approach has several steps. First, the researchers use the reasoning capabilities of LLMs to augment search queries with implicit aesthetic expectations. According to the researchers, this rephrasing significantly improves the aesthetic quality of search results. Second, they use public aesthetic models to re-rank the images retrieved by the vision model. Finally, they fine-tune the vision model using a preference-based reinforcement learning method adapted from DPO (Direct Preference Optimization). This step aligns the model with the aesthetically ranked sequences so that the retrieved images better meet human aesthetic standards. To evaluate performance, the researchers developed a new dataset, HPIR, to benchmark alignment with human aesthetic sensibilities. They also used GPT-4V as a judge to simulate user preferences and validate the robustness of their model.
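For readers unfamiliar with DPO, the following sketch shows the standard pairwise DPO loss for a single preferred/rejected pair. It is not the researchers' adapted objective, which the article says is modified from DPO to work with ranked image sequences. The function names, the `beta` default, and the idea of treating a retrieval model's normalized scores as log-probabilities are assumptions for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_pair_loss(
    policy_logp_win: float,
    policy_logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    The *_logp_* arguments are log-probabilities assigned to the preferred
    ("win") and rejected ("lose") item by the model being trained (policy)
    and by a frozen reference copy of it.
    """
    margin = beta * (
        (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    )
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)
```

The loss equals log 2 when the policy and the reference agree, and it falls as the policy assigns relatively more probability to the preferred item than the reference does.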
Experiments showed that the aesthetic alignment of visual models improved significantly. The researchers used the HPIR dataset to benchmark the effectiveness of their method. The results showed better aesthetic behavior across a range of criteria, outperforming existing baselines. For example, the model's aesthetic alignment accuracy improved by 10% compared to the baseline. The researchers also tested their method on traditional benchmarks, including ImageNet1K, MSCOCO, and Flickr30K, and reported competitive results. Although their model performed slightly worse than state-of-the-art models on some benchmarks, it significantly improved the aesthetic quality of search results, making it a valuable contribution to the field.
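An accuracy figure of this kind can be read as the share of comparisons in which the model's choice agrees with the human-preferred choice. The sketch below computes that agreement rate under the assumption that each comparison is reduced to a single label such as "A" or "B". The actual HPIR format and scoring protocol are not described in this article, so the function name and data layout are hypothetical.

```python
from typing import Sequence


def alignment_accuracy(model_choices: Sequence[str], human_choices: Sequence[str]) -> float:
    """Fraction of comparisons where the model picks the human-preferred option."""
    if len(model_choices) != len(human_choices):
        raise ValueError("model_choices and human_choices must have the same length")
    if not human_choices:
        raise ValueError("at least one comparison is required")
    matches = sum(m == h for m, h in zip(model_choices, human_choices))
    return matches / len(human_choices)
```

For example, if the model agrees with the human label on three of four comparisons, the function returns 0.75.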

In conclusion, this work addresses the problem of aligning visual models with human aesthetic preferences through a preference-based reinforcement learning approach. By combining LLM reasoning with aesthetic models and fine-tuning the visual model on the resulting preferences, the researchers improved the aesthetic alignment of search models. The approach improves the quality of retrieved images and aims to bring them closer to human values and preferences, making it a promising direction for future work in computer vision and visual search systems.

Sana Hassan, a Consulting Intern at Marktechpost and a dual degree student at Indian Institute of Technology Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, she brings a fresh perspective to the intersection of AI and real-world solutions.
