Visual language models show widespread visual deficits on neuropsychological tests

Moravec, H. P. Mind Children: The Future of Robot and Human Intelligence (Harvard Univ. Press, 1988).

Vaswani, A. et al. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS 2017) https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (2017).

Binz, M. et al. A foundation model to predict and capture human cognition. Nature 644, 1002–1009 (2025).

Cherian, A. et al. Evaluating large vision-and-language models on children’s mathematical Olympiads. In 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks https://proceedings.neurips.cc/paper_files/paper/2024/file/1cc12fb3d4033ad72d33a51f1d0ab5d0-Paper-Datasets_and_Benchmarks_Track.pdf (2024).

Ichien, N., Ivanova, A., Webb, T. W., Griffiths, T. L. & Binz, M. Higher cognition in large language models. In Proc. Annual Meeting of the Cognitive Science Society https://escholarship.org/uc/item/3d81x7j8 (2024).

Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf (2022).

Lu, J., Batra, D., Parikh, D. & Lee, S. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) https://papers.nips.cc/paper_files/paper/2019/file/c74d97b01eae257e44aa9d5bade97baf-Paper.pdf (2019).

Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning 8748–8763 (PMLR, 2021).

Binz, M. & Schulz, E. Using cognitive psychology to understand GPT-3. Proc. Natl Acad. Sci. USA 120, e2218523120 (2023).

Wu, W. et al. GPT4Vis: What can GPT-4 do for zero-shot visual recognition? Preprint at https://arxiv.org/abs/2311.15732 (2024).

Yue, X. et al. MMMU: a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/CVPR52733.2024.00913 (2024).

Buschoff, L. M., Akata, E., Bethge, M. & Schulz, E. Visual cognition in multimodal large language models. Nat. Mach. Intell. 7, 96–106 (2025).

Rahmanzadehgervi, P., Bolton, L., Taesiri, M. R. & Nguyen, A. T. Vision language models are blind. In Computer Vision – ACCV 2024: 17th Asian Conference on Computer Vision https://doi.org/10.1007/978-981-96-0917-8_17 (Springer, 2024).

Huang, K.-H. et al. Why vision language models struggle with visual arithmetic? Towards enhanced chart and geometry understanding. Preprint at https://arxiv.org/abs/2502.11492 (2025).

Wang, Z. et al. Visually descriptive language model for vector graphics reasoning. Preprint at https://arxiv.org/abs/2404.06479 (2024).

Qiu, W. & Di, X. OCC-MLLM: empowering multimodal large language model for the understanding of occluded objects. Preprint at https://arxiv.org/abs/2410.01261 (2024).

Yang, S. & Di, X. OCC-MLLM-Alpha: empowering multi-modal large language model for the understanding of occluded objects with self-supervised test-time learning. Preprint at https://arxiv.org/abs/2410.01861 (2024).

Tong, S. et al. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/CVPR52733.2024.00914 (IEEE, 2024).

Deng, N. et al. Tables as texts or images: evaluating the table reasoning ability of LLMs and MLLMs. In Findings of the Association for Computational Linguistics: ACL 2024 (eds Ku, L.-W., Martins, A. & Srikumar, V.) 407–426 (Association for Computational Linguistics, 2024).

Masry, A., Long, D., Tan, J. Q., Joty, S. & Hoque, E. ChartQA: a benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022 (eds Muresan, S. et al.) 2263–2279 (Association for Computational Linguistics, 2022).

Mathew, M., Karatzas, D. & Jawahar, C. V. DocVQA: a dataset for VQA on document images. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV) 2199–2208 (IEEE, 2021).

Bowers, J. S. et al. Deep problems with neural network models of human vision. Behav. Brain Sci. 46, e385 (2023).

Humphreys, G. W. & Riddoch, J. M. Birmingham Object Recognition Battery. APA PsycNet https://doi.org/10.1037/t13731-000 (1993).

Warrington, E. K. & James, M. Visual Object and Space Perception Battery. Pearson Clinical UK https://www.pearsonclinical.co.uk/store/ukassessments/en/Store/Professional-Assessments/Cognition-%26-Neuro/Visual-Object-and-Space-Perception-Battery/p/P100009236.html (1991).

Torfs, K., Vancleef, K., Lafosse, C., Wagemans, J. & de-Wit, L. The Leuven Perceptual Organization Screening Test (L-POST), an online test to assess mid-level visual perception. Behav. Res. Methods 46, 472–487 (2014).

Biscione, V. et al. MindSet: Vision. A toolbox for testing DNNs on key psychological experiments. Preprint at https://arxiv.org/abs/2404.05290 (2024).

Duan, H. et al. VLMEvalKit: an open-source toolkit for evaluating large multi-modality models. In Proc. 32nd ACM International Conference on Multimedia 11198–11201 (Association for Computing Machinery, 2024).

Hooper, H. Hooper Visual Organization Test Manual (Western Psychological Services, 1983).

de-Wit, L., Huygelier, H., Hallen, R. V. der, Chamberlain, R. & Wagemans, J. Developing the Leuven embedded figures test (L-EFT): testing the stimulus features that influence embedding. PeerJ 5, e2862 (2017).

Dalrymple, K. A., Elison, J. T. & Duchaine, B. Face-specific and domain-general visual processing deficits in children with developmental prosopagnosia. Q. J. Exp. Psychol. 70, 259–275 (2017).

Jacobson, N. S. & Truax, P. Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. J. Consult. Clin. Psychol. 59, 12–19 (1991).

Kendall, P. C. & Grove, W. M. Normative comparisons in therapy outcome. Behav. Assess. 10, 147–158 (1988).

Daw, N. W. in Visual Development (ed Daw, N. W.) Ch. 1 (Springer, 2014).

Bosking, W. H., Crowley, J. C. & Fitzpatrick, D. Spatial coding of position and orientation in primary visual cortex. Nat. Neurosci. 5, 874–882 (2002).

Kamitani, Y. & Tong, F. Decoding the visual and subjective contents of the human brain. Nat. Neurosci. 8, 679–685 (2005).

Schwarzkopf, D. S. & Rees, G. Subjective size perception depends on central visual cortical magnification in human V1. PLoS One 8, e60550 (2013).

Sperandio, I. & Chouinard, P. A. The mechanisms of size constancy. Multisens. Res. https://doi.org/10.1163/22134808-00002483 (2015).

Victor, J. D., Purpura, K., Katz, E. & Mao, B. Population encoding of spatial frequency, orientation, and color in macaque V1. J. Neurophysiol. 72, 2151–2166 (1994).

Anderson, B. L. Mid-level vision. Curr. Biol. 30, R105–R109 (2020).

Koffka, K. Principles of Gestalt Psychology (Harcourt Brace and Company, 1935).

Persike, M. & Meinhardt, G. Contour integration with corners. Vision Res. 127, 132–140 (2016).

Cox, D. D. Do we understand high-level vision? Curr. Opin. Neurobiol. 25, 187–193 (2014).

Schwartz, J. H., Kandel, E. R., Jessell, T. M., Siegelbaum, S. A. & Hudspeth, A. J. Principles of Neural Science (Elsevier, 1991).

Ullman, S. High-Level Vision: Object Recognition and Visual Cognition (MIT, 1996).

Amir, O., Biederman, I. & Hayworth, K. J. Sensitivity to nonaccidental properties across various shape dimensions. Vision Res. 62, 35–43 (2012).

Amir, O., Biederman, I., Herald, S. B., Shah, M. P. & Mintz, T. H. Greater sensitivity to nonaccidental than metric shape properties in preschool children. Vision Res. 97, 83–88 (2014).

Azad, S., Jain, Y., Garg, R., Vineet, V. & Rawat, Y. Understanding depth and height perception in large visual-language models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 3650–3659 (2025).

Wang, C., Jia, R., Liu, X. & Song, D. Benchmarking zero-shot robustness of multimodal foundation models: a pilot study. Preprint at https://arxiv.org/abs/2403.10499 (2024).

Yang, Z. et al. The dawn of LMMs: preliminary explorations with GPT-4V(ision). Preprint at https://arxiv.org/abs/2309.17421v2 (2023).

Bordes, F. et al. An introduction to vision-language modeling. Preprint at https://arxiv.org/abs/2405.17247 (2024).

Jung, K.-H. Uncover this tech term: foundation model. Korean J. Radiol. 24, 1038–1041 (2023).

Tsimpoukelli, M. et al. Multimodal few-shot learning with frozen language models. In 35th Conference on Neural Information Processing Systems (NeurIPS 2021) https://proceedings.neurips.cc/paper/2021/file/01b7575c38dac42f3cfb7d500438b875-Paper.pdf (2021).

Groen, I. I. A., Silson, E. H. & Baker, C. I. Contributions of low- and high-level properties to neural processing of visual scenes in the human brain. Philos. Trans. R. Soc. B Biol. Sci. 372, 20160102 (2017).

Zhai, X., Mustafa, B., Kolesnikov, A. & Beyer, L. Sigmoid loss for language image pre-training. In Proc. IEEE/CVF International Conference on Computer Vision 11975–11986 (2023).

Campbell, D. et al. Understanding the limits of vision language models through the lens of the binding problem. In 38th Conference on Neural Information Processing Systems (NeurIPS 2024) https://proceedings.neurips.cc/paper_files/paper/2024/file/cdcc6d47c1627350014a3076112ab824-Paper-Conference.pdf (2024).

Greff, K., Steenkiste, S. van & Schmidhuber, J. On the binding problem in artificial neural networks. Preprint at https://arxiv.org/abs/2012.05208 (2020).

Treisman, A. The binding problem. Curr. Opin. Neurobiol. 6, 171–178 (1996).

Frankland, S., Webb, T., Lewis, R. & Cohen, J. No coincidence, George: processing limits in cognitive function reflect the curse of generalization. Preprint at OSF https://osf.io/preprints/psyarxiv/cjuxb_v2 (2021).

Houlsby, N. et al. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning 2790–2799 (PMLR, 2019).

Hyeon-Woo, N., Ye-Bin, M., Choi, W., Hyun, L. & Oh, T.-H. VLM’s eye examination: instruct and inspect visual competency of vision language models. Preprint at https://arxiv.org/abs/2409.14759 (2024).

Zanella, M. & Ben Ayed, I. Low-rank few-shot adaptation of vision-language models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 1593–1603 (2024).

Zhang, R. et al. Tip-Adapter: training-free adaption of CLIP for few-shot classification. In European Conference on Computer Vision 493–510 (Springer, 2022).

Jiang, D. et al. From CLIP to DINO: visual encoders shout in multi-modal large language models. Preprint at https://arxiv.org/abs/2310.08825 (2024).

Jiao, Q., Chen, D., Huang, Y., Li, Y. & Shen, Y. From training-free to adaptive: empirical insights into MLLMs’ understanding of detection information. Preprint at https://arxiv.org/abs/2401.17981 (2024).

Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proc. IEEE/CVF International Conference on Computer Vision 9650–9660 (IEEE, 2021).

Oquab, M. et al. DINOv2: learning robust visual features without supervision. Trans. Mach. Learn. Res. https://doi.org/10.48550/arxiv.2304.07193 (2024).

Wicherts, J. M. et al. Degrees of freedom in planning, running, analyzing, and reporting psychological studies: a checklist to avoid p-hacking. Front. Psychol. 7, 1832 (2016).

He, J. et al. Does prompt formatting have any impact on LLM performance? Preprint at https://arxiv.org/abs/2411.10541 (2024).

Vatsal, S. & Dubey, H. A survey of prompt engineering methods in large language models for different NLP tasks. Preprint at https://arxiv.org/abs/2407.12994 (2024).

Jiang, N., Kachinthaya, A., Petryk, S. & Gandelsman, Y. Interpreting and editing vision-language representations to mitigate hallucinations. Preprint at https://arxiv.org/abs/2410.02762 (2024).

Petsiuk, V., Das, A. & Saenko, K. RISE: randomized input sampling for explanation of black-box models. Preprint at https://arxiv.org/abs/1806.07421 (2018).

OpenAI. Introducing OpenAI o1. https://openai.com/o1/ (2024).

Shao, H. et al. Visual CoT: advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. In 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks https://proceedings.neurips.cc/paper_files/paper/2024/file/0ff38d72a2e0aa6dbe42de83a17b2223-Paper-Datasets_and_Benchmarks_Track.pdf (2024).

Lindsey, J. et al. On the biology of a large language model. Transformer Circuits Thread https://transformer-circuits.pub/2025/attribution-graphs/biology.html (2025).

Sheybani, S., Maini, S. S., Dendukuri, A., Tiganj, Z. & Smith, L. B. ModelVsBaby: a developmentally motivated benchmark of out-of-distribution object recognition. Preprint at https://osf.io/preprints/psyarxiv/83gae_v1 (2024).

Zhang, J., Hu, J., Khayatkhoei, M., Ilievski, F. & Sun, M. Exploring perceptual limitation of multimodal large language models. Preprint at https://arxiv.org/abs/2402.07384v1 (2024).

Ward, E. J. Exploring perceptual illusions in deep neural networks. Vis. Sci. Soc. Annu. Meet. Abstr. 19, 34b (2019).

Zhang, H. & Yoshida, S. Exploring deep neural networks in simulating human vision through five optical illusions. Appl. Sci. 14, 3429 (2024).

Zhang, Y., Pan, J., Zhou, Y., Pan, R. & Chai, J. Grounding visual illusions in language: do vision-language models perceive illusions like humans? In Proc. 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H. et al.) 5718–5728 (Association for Computational Linguistics, 2023).

Duchaine, B. & Nakayama, K. The Cambridge face memory test: results for neurologically intact individuals and an investigation of its validity using inverted face stimuli and prosopagnosic participants. Neuropsychologia 44, 576–585 (2006).

Kingdom, F. A. A. & Prins, N. in Psychophysics 2nd edn (eds Kingdom, F. A. A. & Prins, N.) Ch. 3 (Academic, 2016).

Peirce, J. et al. PsychoPy2: Experiments in behavior made easy. Behav. Res. Methods 51, 195–203 (2019).

Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

Tangtartharakul, G. Gene-Tangt/neuropsych_vlm_bench: First release. Zenodo https://doi.org/10.5281/ZENODO.16809513 (2025).
