Visual language models show widespread visual deficits on neuropsychological tests


  • Moravec, H. P. Mind Children: The Future of Robot and Human Intelligence (Harvard Univ. Press, 1988).

  • Vaswani, A. et al. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS 2017) https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (2017).

  • Binz, M. et al. A foundation model to predict and capture human cognition. Nature 644, 1002–1009 (2025).

  • Cherian, A. et al. Evaluating large vision-and-language models on children’s mathematical Olympiads. In 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks https://proceedings.neurips.cc/paper_files/paper/2024/file/1cc12fb3d4033ad72d33a51f1d0ab5d0-Paper-Datasets_and_Benchmarks_Track.pdf (2024).

  • Ichien, N., Ivanova, A., Webb, T. W., Griffiths, T. L. & Binz, M. Higher cognition in large language models. In Proc. Annual Meeting of the Cognitive Science Society https://escholarship.org/uc/item/3d81x7j8 (2024).

  • Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf (2022).

  • Lu, J., Batra, D., Parikh, D. & Lee, S. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) https://papers.nips.cc/paper_files/paper/2019/file/c74d97b01eae257e44aa9d5bade97baf-Paper.pdf (2019).

  • Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning 8748–8763 (PMLR, 2021).

  • Binz, M. & Schulz, E. Using cognitive psychology to understand GPT-3. Proc. Natl Acad. Sci. USA 120, e2218523120 (2023).

  • Wu, W. et al. GPT4Vis: What can GPT-4 do for zero-shot visual recognition? Preprint at https://arxiv.org/abs/2311.15732 (2024).

  • Yue, X. et al. MMMU: a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/CVPR52733.2024.00913 (2024).

  • Buschoff, L. M., Akata, E., Bethge, M. & Schulz, E. Visual cognition in multimodal large language models. Nat. Mach. Intell. 7, 96–106 (2025).

  • Rahmanzadehgervi, P., Bolton, L., Taesiri, M. R. & Nguyen, A. T. Vision language models are blind. In Computer Vision – ACCV 2024: 17th Asian Conference on Computer Vision https://doi.org/10.1007/978-981-96-0917-8_17 (Springer, 2024).

  • Huang, K.-H. et al. Why vision language models struggle with visual arithmetic? Towards enhanced chart and geometry understanding. Preprint at https://arxiv.org/abs/2502.11492 (2025).

  • Wang, Z. et al. Visually descriptive language model for vector graphics reasoning. Preprint at https://arxiv.org/abs/2404.06479 (2024).

  • Qiu, W. & Di, X. OCC-MLLM: empowering multimodal large language model for the understanding of occluded objects. Preprint at https://arxiv.org/abs/2410.01261 (2024).

  • Yang, S. & Di, X. OCC-MLLM-Alpha: empowering multi-modal large language model for the understanding of occluded objects with self-supervised test-time learning. Preprint at https://arxiv.org/abs/2410.01861 (2024).

  • Tong, S. et al. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/CVPR52733.2024.00914 (IEEE, 2024).

  • Deng, N. et al. Tables as texts or images: evaluating the table reasoning ability of LLMs and MLLMs. In Findings of the Association for Computational Linguistics: ACL 2024 (eds Ku, L.-W., Martins, A. & Srikumar, V.) 407–426 (Association for Computational Linguistics, 2024).

  • Masry, A., Long, D., Tan, J. Q., Joty, S. & Hoque, E. ChartQA: a benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022 (eds Muresan, S. et al.) 2263–2279 (Association for Computational Linguistics, 2022).

  • Mathew, M., Karatzas, D. & Jawahar, C. V. DocVQA: a dataset for VQA on document images. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV) 2199–2208 (IEEE, 2021).

  • Bowers, J. S. et al. Deep problems with neural network models of human vision. Behav. Brain Sci. 46, e385 (2023).

  • Humphreys, G. W. & Riddoch, J. M. Birmingham Object Recognition Battery. APA PsycNet https://doi.org/10.1037/t13731-000 (1993).

  • Warrington, E. K. & James, M. Visual Object and Space Perception Battery. Pearson Clinical UK https://www.pearsonclinical.co.uk/store/ukassessments/en/Store/Professional-Assessments/Cognition-%26-Neuro/Visual-Object-and-Space-Perception-Battery/p/P100009236.html (1991).

  • Torfs, K., Vancleef, K., Lafosse, C., Wagemans, J. & de-Wit, L. The Leuven Perceptual Organization Screening Test (L-POST), an online test to assess mid-level visual perception. Behav. Res. Methods 46, 472–487 (2014).

  • Biscione, V. et al. MindSet: Vision. A toolbox for testing DNNs on key psychological experiments. Preprint at https://arxiv.org/abs/2404.05290 (2024).

  • Duan, H. et al. VLMEvalKit: an open-source toolkit for evaluating large multi-modality models. In Proc. 32nd ACM International Conference on Multimedia 11198–11201 (Association for Computing Machinery, 2024).

  • Hooper, H. Hooper Visual Organization Test Manual (Western Psychological Services, 1983).

  • de-Wit, L., Huygelier, H., Van der Hallen, R., Chamberlain, R. & Wagemans, J. Developing the Leuven embedded figures test (L-EFT): testing the stimulus features that influence embedding. PeerJ 5, e2862 (2017).

  • Dalrymple, K. A., Elison, J. T. & Duchaine, B. Face-specific and domain-general visual processing deficits in children with developmental prosopagnosia. Q. J. Exp. Psychol. 2006 70, 259–275 (2017).

  • Jacobson, N. S. & Truax, P. Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. J. Consult. Clin. Psychol. 59, 12–19 (1991).

  • Kendall, P. C. & Grove, W. M. Normative comparisons in therapy outcome. Behav. Assess. 10, 147–158 (1988).

  • Daw, N. W. in Visual Development (ed Daw, N. W.) Ch. 1 (Springer, 2014).

  • Bosking, W. H., Crowley, J. C. & Fitzpatrick, D. Spatial coding of position and orientation in primary visual cortex. Nat. Neurosci. 5, 874–882 (2002).

  • Kamitani, Y. & Tong, F. Decoding the visual and subjective contents of the human brain. Nat. Neurosci. 8, 679–685 (2005).

  • Schwarzkopf, D. S. & Rees, G. Subjective size perception depends on central visual cortical magnification in human V1. PLoS One 8, e60550 (2013).

  • Sperandio, I. & Chouinard, P. A. The mechanisms of size constancy. Multisens. Res. https://doi.org/10.1163/22134808-00002483 (2015).

  • Victor, J. D., Purpura, K., Katz, E. & Mao, B. Population encoding of spatial frequency, orientation, and color in macaque V1. J. Neurophysiol. 72, 2151–2166 (1994).

  • Anderson, B. L. Mid-level vision. Curr. Biol. 30, R105–R109 (2020).

  • Koffka, K. Principles of Gestalt Psychology (Harcourt Brace and Company, 1935).

  • Persike, M. & Meinhardt, G. Contour integration with corners. Vision Res. 127, 132–140 (2016).

  • Cox, D. D. Do we understand high-level vision? Curr. Opin. Neurobiol. 25, 187–193 (2014).

  • Schwartz, J. H., Kandel, E. R., Jessell, T. M., Siegelbaum, S. A. & Hudspeth, A. J. Principles of Neural Science (Elsevier, 1991).

  • Ullman, S. High-Level Vision: Object Recognition and Visual Cognition (MIT, 1996).

  • Amir, O., Biederman, I. & Hayworth, K. J. Sensitivity to nonaccidental properties across various shape dimensions. Vision Res. 62, 35–43 (2012).

  • Amir, O., Biederman, I., Herald, S. B., Shah, M. P. & Mintz, T. H. Greater sensitivity to nonaccidental than metric shape properties in preschool children. Vision Res. 97, 83–88 (2014).

  • Azad, S., Jain, Y., Garg, R., Vineet, V. & Rawat, Y. Understanding depth and height perception in large visual-language models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 3650–3659 (2025).

  • Wang, C., Jia, R., Liu, X. & Song, D. Benchmarking zero-shot robustness of multimodal foundation models: a pilot study. Preprint at https://arxiv.org/abs/2403.10499 (2024).

  • Yang, Z. et al. The dawn of LMMs: preliminary explorations with GPT-4V(ision). Preprint at https://arxiv.org/abs/2309.17421v2 (2023).

  • Bordes, F. et al. An introduction to vision-language modeling. Preprint at https://arxiv.org/abs/2405.17247 (2024).

  • Jung, K.-H. Uncover this tech term: foundation model. Korean J. Radiol. 24, 1038–1041 (2023).

  • Tsimpoukelli, M. et al. Multimodal few-shot learning with frozen language models. In 35th Conference on Neural Information Processing Systems (NeurIPS 2021) https://proceedings.neurips.cc/paper/2021/file/01b7575c38dac42f3cfb7d500438b875-Paper.pdf (2021).

  • Groen, I. I. A., Silson, E. H. & Baker, C. I. Contributions of low- and high-level properties to neural processing of visual scenes in the human brain. Philos. Trans. R. Soc. B Biol. Sci. 372, 20160102 (2017).

  • Zhai, X., Mustafa, B., Kolesnikov, A. & Beyer, L. Sigmoid loss for language image pre-training. In Proc. IEEE/CVF International Conference on Computer Vision 11975–11986 (2023).

  • Campbell, D. et al. Understanding the limits of vision language models through the lens of the binding problem. In 38th Conference on Neural Information Processing Systems (NeurIPS 2024) https://proceedings.neurips.cc/paper_files/paper/2024/file/cdcc6d47c1627350014a3076112ab824-Paper-Conference.pdf (2024).

  • Greff, K., van Steenkiste, S. & Schmidhuber, J. On the binding problem in artificial neural networks. Preprint at https://arxiv.org/abs/2012.05208 (2020).

  • Treisman, A. The binding problem. Curr. Opin. Neurobiol. 6, 171–178 (1996).

  • Frankland, S., Webb, T., Lewis, R. & Cohen, J. No coincidence, George: processing limits in cognitive function reflect the curse of generalization. Preprint at OSF https://osf.io/preprints/psyarxiv/cjuxb_v2 (2021).

  • Houlsby, N. et al. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning 2790–2799 (PMLR, 2019).

  • Hyeon-Woo, N., Ye-Bin, M., Choi, W., Hyun, L. & Oh, T.-H. VLM’s eye examination: instruct and inspect visual competency of vision language models. Preprint at https://arxiv.org/abs/2409.14759 (2024).

  • Zanella, M. & Ben Ayed, I. Low-rank few-shot adaptation of vision-language models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 1593–1603 (2024).

  • Zhang, R. et al. Tip-Adapter: training-free adaption of CLIP for few-shot classification. In European Conference on Computer Vision 493–510 (Springer, 2022).

  • Jiang, D. et al. From CLIP to DINO: visual encoders shout in multi-modal large language models. Preprint at https://arxiv.org/abs/2310.08825 (2024).

  • Jiao, Q., Chen, D., Huang, Y., Li, Y. & Shen, Y. From training-free to adaptive: empirical insights into MLLMs’ understanding of detection information. Preprint at https://arxiv.org/abs/2401.17981 (2024).

  • Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proc. IEEE/CVF International Conference on Computer Vision 9650–9660 (IEEE, 2021).

  • Oquab, M. et al. DINOv2: learning robust visual features without supervision. Trans. Mach. Learn. Res. https://doi.org/10.48550/arxiv.2304.07193 (2024).

  • Wicherts, J. M. et al. Degrees of freedom in planning, running, analyzing, and reporting psychological studies: a checklist to avoid p-hacking. Front. Psychol. 7, 1832 (2016).

  • He, J. et al. Does prompt formatting have any impact on LLM performance? Preprint at https://arxiv.org/abs/2411.10541 (2024).

  • Vatsal, S. & Dubey, H. A survey of prompt engineering methods in large language models for different NLP tasks. Preprint at https://arxiv.org/abs/2407.12994 (2024).

  • Jiang, N., Kachinthaya, A., Petryk, S. & Gandelsman, Y. Interpreting and editing vision-language representations to mitigate hallucinations. Preprint at https://arxiv.org/abs/2410.02762 (2024).

  • Petsiuk, V., Das, A. & Saenko, K. RISE: randomized input sampling for explanation of black-box models. Preprint at https://arxiv.org/abs/1806.07421 (2018).

  • OpenAI. Introducing OpenAI o1. https://openai.com/o1/ (2024).

  • Shao, H. et al. Visual CoT: advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. In 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks https://proceedings.neurips.cc/paper_files/paper/2024/file/0ff38d72a2e0aa6dbe42de83a17b2223-Paper-Datasets_and_Benchmarks_Track.pdf (2024).

  • Lindsey, J. et al. On the biology of a large language model. Transformer Circuits Thread https://transformer-circuits.pub/2025/attribution-graphs/biology.html (2025).

  • Sheybani, S., Maini, S. S., Dendukuri, A., Tiganj, Z. & Smith, L. B. ModelVsBaby: a developmentally motivated benchmark of out-of-distribution object recognition. Preprint at https://osf.io/preprints/psyarxiv/83gae_v1 (2024).

  • Zhang, J., Hu, J., Khayatkhoei, M., Ilievski, F. & Sun, M. Exploring perceptual limitation of multimodal large language models. Preprint at https://arxiv.org/abs/2402.07384v1 (2024).

  • Ward, E. J. Exploring perceptual illusions in deep neural networks. Vis. Sci. Soc. Annu. Meet. Abstr. 19, 34b (2019).

  • Zhang, H. & Yoshida, S. Exploring deep neural networks in simulating human vision through five optical illusions. Appl. Sci. 14, 3429 (2024).

  • Zhang, Y., Pan, J., Zhou, Y., Pan, R. & Chai, J. Grounding visual illusions in language: do vision-language models perceive illusions like humans? In Proc. 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H. et al.) 5718–5728 (Association for Computational Linguistics, 2023).

  • Duchaine, B. & Nakayama, K. The Cambridge face memory test: results for neurologically intact individuals and an investigation of its validity using inverted face stimuli and prosopagnosic participants. Neuropsychologia 44, 576–585 (2006).

  • Kingdom, F. A. A. & Prins, N. in Psychophysics 2nd edn (eds Kingdom, F. A. A. & Prins, N.) Ch. 3 (Academic, 2016).

  • Peirce, J. et al. PsychoPy2: Experiments in behavior made easy. Behav. Res. Methods 51, 195–203 (2019).

  • Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

  • Tangtartharakul, G. Gene-Tangt/neuropsych_vlm_bench: First release. Zenodo https://doi.org/10.5281/ZENODO.16809513 (2025).


