Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes often rely on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLMs on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this limitation, we propose a two-stage post-training strategy that extends the usage of short answer data for enhanced CoT reasoning. First, we augment short answers with CoT reasoning generated by GPT-4o, and enhance the VLM's CoT capabilities through fine-tuning. Second, we leverage short answers as outcome rewards for reinforcement learning. Specifically, short answers serve as correctness indicators for constructing positive (correct) and negative (incorrect) pairs from model-generated reasoning chains. These pairs are then used to calibrate the model's reasoning via direct preference optimization (DPO). Our experiments show significant improvements in CoT reasoning on benchmark datasets, along with enhanced generalization to direct answer prediction. This work provides an important data resource for VLM CoT training and demonstrates the effectiveness of outcome rewards for multimodal models in post-training.
- †Work done at Apple
- ‡Carnegie Mellon University
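
A minimal sketch of the second stage described above: using short ground-truth answers as an outcome reward to turn model-generated reasoning chains into DPO preference pairs. This is an illustrative assumption of how such pairs could be constructed, not the authors' released code; the names `ReasoningSample`, `extract_final_answer`, and `build_preference_pairs` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
import re


@dataclass
class ReasoningSample:
    question: str            # the (image-grounded) question text
    chain: str               # model-generated chain-of-thought ending in a final answer
    gold_short_answer: str   # short annotation from the original dataset


def extract_final_answer(chain: str) -> Optional[str]:
    """Pull the final answer out of a generated chain, e.g. '... The answer is: cat'."""
    match = re.search(r"answer is[:\s]+(.+)", chain, flags=re.IGNORECASE)
    return match.group(1).strip().rstrip(".").lower() if match else None


def is_correct(sample: ReasoningSample) -> bool:
    """Outcome reward: a chain is positive iff its final answer matches the short label."""
    predicted = extract_final_answer(sample.chain)
    return predicted is not None and predicted == sample.gold_short_answer.strip().lower()


def build_preference_pairs(
    samples_per_question: List[List[ReasoningSample]],
) -> List[Tuple[str, str, str]]:
    """For each question, pair correct chains (chosen) with incorrect ones (rejected)."""
    pairs = []
    for samples in samples_per_question:
        chosen = [s for s in samples if is_correct(s)]
        rejected = [s for s in samples if not is_correct(s)]
        for pos, neg in zip(chosen, rejected):
            pairs.append((pos.question, pos.chain, neg.chain))
    return pairs
```

The resulting (prompt, chosen, rejected) triples could then be passed to any standard DPO training loop to calibrate the model's reasoning.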