Moravec, H. P. Mind Children: The Future of Robot and Human Intelligence (Harvard Univ. Press, 1988).
Vaswani, A. et al. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS 2017) https://papers.nips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (2017).
Binz, M. et al. A foundation model to predict and capture human cognition. Nature 644, 1002–1009 (2025).
Cherian, A. et al. Evaluating large vision-and-language models on children’s mathematical Olympiads. In 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks https://proceedings.neurips.cc/paper_files/paper/2024/file/1cc12fb3d4033ad72d33a51f1d0ab5d0-Paper-Datasets_and_Benchmarks_Track.pdf (2024).
Ichien, N., Ivanova, A., Webb, T. W., Griffiths, T. L. & Binz, M. Higher cognition in large language models. In Proc. Annual Meeting of the Cognitive Science Society https://escholarship.org/uc/item/3d81x7j8 (2024).
Alayrac, J.-B. et al. Flamingo: a visual language model for few-shot learning. In 36th Conference on Neural Information Processing Systems (NeurIPS 2022) https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf (2022).
Lu, J., Batra, D., Parikh, D. & Lee, S. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) https://papers.nips.cc/paper_files/paper/2019/file/c74d97b01eae257e44aa9d5bade97baf-Paper.pdf (2019).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning 8748–8763 (PMLR, 2021).
Binz, M. & Schulz, E. Using cognitive psychology to understand GPT-3. Proc. Natl Acad. Sci. USA 120, e2218523120 (2023).
Wu, W. et al. GPT4Vis: What can GPT-4 do for zero-shot visual recognition? Preprint at https://arxiv.org/abs/2311.15732 (2024).
Yue, X. et al. MMMU: a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/CVPR52733.2024.00913 (2024).
Buschoff, L. M., Akata, E., Bethge, M. & Schulz, E. Visual cognition in multimodal large language models. Nat. Mach. Intell. 7, 96–106 (2025).
Rahmanzadehgervi, P., Bolton, L., Taesiri, M. R. & Nguyen, A. T. Vision language models are blind. In Computer Vision – ACCV 2024: 17th Asian Conference on Computer Vision https://doi.org/10.1007/978-981-96-0917-8_17 (Springer, 2024).
Huang, K.-H. et al. Why vision language models struggle with visual arithmetic? Towards enhanced chart and geometry understanding. Preprint at https://arxiv.org/abs/2502.11492 (2025).
Wang, Z. et al. Visually descriptive language model for vector graphics reasoning. Preprint at https://arxiv.org/abs/2404.06479 (2024).
Qiu, W. & Di, X. OCC-MLLM: empowering multimodal large language model for the understanding of occluded objects. Preprint at https://arxiv.org/abs/2410.01261 (2024).
Yang, S. & Di, X. OCC-MLLM-Alpha: empowering multi-modal large language model for the understanding of occluded objects with self-supervised test-time learning. Preprint at https://arxiv.org/abs/2410.01861 (2024).
Tong, S. et al. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/CVPR52733.2024.00914 (IEEE, 2024).
Deng, N. et al. Tables as texts or images: evaluating the table reasoning ability of LLMs and MLLMs. In Findings of the Association for Computational Linguistics: ACL 2024 (eds Ku, L.-W., Martins, A. & Srikumar, V.) 407–426 (Association for Computational Linguistics, 2024).
Masry, A., Long, D., Tan, J. Q., Joty, S. & Hoque, E. ChartQA: a benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022 (eds Muresan, S. et al.) 2263–2279 (Association for Computational Linguistics, 2022).
Mathew, M., Karatzas, D. & Jawahar, C. V. DocVQA: a dataset for VQA on document images. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV) 2199–2208 (IEEE, 2021).
Bowers, J. S. et al. Deep problems with neural network models of human vision. Behav. Brain Sci. 46, e385 (2023).
Humphreys, G. W. & Riddoch, J. M. Birmingham Object Recognition Battery. APA PsycNet https://doi.org/10.1037/t13731-000 (1993).
Warrington, E. K. & James, M. Visual Object and Space Perception Battery. Pearson Clinical UK https://www.pearsonclinical.co.uk/store/ukassessments/en/Store/Professional-Assessments/Cognition-%26-Neuro/Visual-Object-and-Space-Perception-Battery/p/P100009236.html (1991).
Torfs, K., Vancleef, K., Lafosse, C., Wagemans, J. & de-Wit, L. The Leuven Perceptual Organization Screening Test (L-POST), an online test to assess mid-level visual perception. Behav. Res. Methods 46, 472–487 (2014).
Biscione, V. et al. MindSet: Vision. A toolbox for testing DNNs on key psychological experiments. Preprint at https://arxiv.org/abs/2404.05290 (2024).
Duan, H. et al. VLMEvalKit: an open-source toolKit for evaluating large multi-modality models. In Proc. 32nd ACM International Conference on Multimedia 11198–11201 (Association for Computing Machinery, 2024).
Hooper, H. Hooper Visual Organization Test Manual (Western Psychological Services, 1983).
de-Wit, L., Huygelier, H., Hallen, R. V. der, Chamberlain, R. & Wagemans, J. Developing the Leuven embedded figures test (L-EFT): testing the stimulus features that influence embedding. PeerJ 5, e2862 (2017).
Dalrymple, K. A., Elison, J. T. & Duchaine, B. Face-specific and domain-general visual processing deficits in children with developmental prosopagnosia. Q. J. Exp. Psychol. 70, 259–275 (2017).
Jacobson, N. S. & Truax, P. Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. J. Consult. Clin. Psychol. 59, 12–19 (1991).
Kendall, P. C. & Grove, W. M. Normative comparisons in therapy outcome. Behav. Assess. 10, 147–158 (1988).
Daw, N. W. in Visual Development (ed Daw, N. W.) Ch. 1 (Springer, 2014).
Bosking, W. H., Crowley, J. C. & Fitzpatrick, D. Spatial coding of position and orientation in primary visual cortex. Nat. Neurosci. 5, 874–882 (2002).
Kamitani, Y. & Tong, F. Decoding the visual and subjective contents of the human brain. Nat. Neurosci. 8, 679–685 (2005).
Schwarzkopf, D. S. & Rees, G. Subjective size perception depends on central visual cortical magnification in human V1. PLoS One 8, e60550 (2013).
Sperandio, I. & Chouinard, P. A. The mechanisms of size constancy. Multisens. Res. https://doi.org/10.1163/22134808-00002483 (2015).
Victor, J. D., Purpura, K., Katz, E. & Mao, B. Population encoding of spatial frequency, orientation, and color in macaque V1. J. Neurophysiol. 72, 2151–2166 (1994).
Anderson, B. L. Mid-level vision. Curr. Biol. 30, R105–R109 (2020).
Koffka, K. Principles of Gestalt Psychology (Harcourt Brace and Company, 1935).
Persike, M. & Meinhardt, G. Contour integration with corners. Vision Res. 127, 132–140 (2016).
Cox, D. D. Do we understand high-level vision? Curr. Opin. Neurobiol. 25, 187–193 (2014).
Schwartz, J. H., Kandel, E. R., Jessell, T. M., Siegelbaum, S. A. & Hudspeth, A. J. Principles of Neural Science (Elsevier, 1991).
Ullman, S. High-Level Vision: Object Recognition and Visual Cognition (MIT, 1996).
Amir, O., Biederman, I. & Hayworth, K. J. Sensitivity to nonaccidental properties across various shape dimensions. Vision Res. 62, 35–43 (2012).
Amir, O., Biederman, I., Herald, S. B., Shah, M. P. & Mintz, T. H. Greater sensitivity to nonaccidental than metric shape properties in preschool children. Vision Res. 97, 83–88 (2014).
Azad, S., Jain, Y., Garg, R., Vineet, V. & Rawat, Y. Understanding depth and height perception in large visual-language models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 3650–3659 (2025).
Wang, C., Jia, R., Liu, X. & Song, D. Benchmarking zero-shot robustness of multimodal foundation models: a pilot study. Preprint at https://arxiv.org/abs/2403.10499 (2024).
Yang, Z. et al. The dawn of LMMs: preliminary explorations with GPT-4V(ision). Preprint at https://arxiv.org/abs/2309.17421v2 (2023).
Bordes, F. et al. An introduction to vision-language modeling. Preprint at https://arxiv.org/abs/2405.17247 (2024).
Jung, K.-H. Uncover this tech term: foundation model. Korean J. Radiol. 24, 1038–1041 (2023).
Tsimpoukelli, M. et al. Multimodal few-shot learning with frozen language models. In 35th Conference on Neural Information Processing Systems (NeurIPS 2021) https://proceedings.neurips.cc/paper/2021/file/01b7575c38dac42f3cfb7d500438b875-Paper.pdf (2021).
Groen, I. I. A., Silson, E. H. & Baker, C. I. Contributions of low- and high-level properties to neural processing of visual scenes in the human brain. Philos. Trans. R. Soc. B Biol. Sci. 372, 20160102 (2017).
Zhai, X., Mustafa, B., Kolesnikov, A. & Beyer, L. Sigmoid loss for language image pre-training. In Proc. IEEE/CVF International Conference on Computer Vision 11975–11986 (2023).
Campbell, D. et al. Understanding the limits of vision language models through the lens of the binding problem. In 38th Conference on Neural Information Processing Systems (NeurIPS 2024) https://proceedings.neurips.cc/paper_files/paper/2024/file/cdcc6d47c1627350014a3076112ab824-Paper-Conference.pdf (2024).
Greff, K., Steenkiste, S. van & Schmidhuber, J. On the binding problem in artificial neural networks. Preprint at https://arxiv.org/abs/2012.05208 (2020).
Treisman, A. The binding problem. Curr. Opin. Neurobiol. 6, 171–178 (1996).
Frankland, S., Webb, T., Lewis, R. & Cohen, J. No coincidence, George: processing limits in cognitive function reflect the curse of generalization. Preprint at OSF https://osf.io/preprints/psyarxiv/cjuxb_v2 (2021).
Houlsby, N. et al. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning 2790–2799 (PMLR, 2019).
Hyeon-Woo, N., Ye-Bin, M., Choi, W., Hyun, L. & Oh, T.-H. VLM’s eye examination: instruct and inspect visual competency of vision language models. Preprint at https://arxiv.org/abs/2409.14759 (2024).
Zanella, M. & Ben Ayed, I. Low-rank few-shot adaptation of vision-language models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 1593–1603 (2024).
Zhang, R. et al. Tip-Adapter: training-free adaption of CLIP for few-shot classification. In European Conference on Computer Vision 493–510 (Springer, 2022).
Jiang, D. et al. From CLIP to DINO: visual encoders shout in multi-modal large language models. Preprint at https://arxiv.org/abs/2310.08825 (2024).
Jiao, Q., Chen, D., Huang, Y., Li, Y. & Shen, Y. From training-free to adaptive: empirical insights into MLLMs’ understanding of detection information. Preprint at https://arxiv.org/abs/2401.17981 (2024).
Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proc. IEEE/CVF International Conference on Computer Vision 9650–9660 (IEEE, 2021).
Oquab, M. et al. DINOv2: learning robust visual features without supervision. Trans. Mach. Learn. Res. https://doi.org/10.48550/arxiv.2304.07193 (2024).
Wicherts, J. M. et al. Degrees of freedom in planning, running, analyzing, and reporting psychological studies: a checklist to avoid p-hacking. Front. Psychol. 7, 1832 (2016).
He, J. et al. Does prompt formatting have any impact on LLM performance? Preprint at https://arxiv.org/abs/2411.10541 (2024).
Vatsal, S. & Dubey, H. A survey of prompt engineering methods in large language models for different NLP tasks. Preprint at https://arxiv.org/abs/2407.12994 (2024).
Jiang, N., Kachinthaya, A., Petryk, S. & Gandelsman, Y. Interpreting and editing vision-language representations to mitigate hallucinations. Preprint at https://arxiv.org/abs/2410.02762 (2024).
Petsiuk, V., Das, A. & Saenko, K. RISE: randomized input sampling for explanation of black-box models. Preprint at https://arxiv.org/abs/1806.07421 (2018).
OpenAI. Introducing OpenAI o1. https://openai.com/o1/ (2024).
Shao, H. et al. Visual CoT: advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. In 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks https://proceedings.neurips.cc/paper_files/paper/2024/file/0ff38d72a2e0aa6dbe42de83a17b2223-Paper-Datasets_and_Benchmarks_Track.pdf (2024).
Lindsey, J. et al. On the biology of a large language model. Transformer Circuits Thread https://transformer-circuits.pub/2025/attribution-graphs/biology.html (2025).
Sheybani, S., Maini, S. S., Dendukuri, A., Tiganj, Z. & Smith, L. B. ModelVsBaby: a developmentally motivated benchmark of out-of-distribution object recognition. Preprint at https://osf.io/preprints/psyarxiv/83gae_v1 (2024).
Zhang, J., Hu, J., Khayatkhoei, M., Ilievski, F. & Sun, M. Exploring perceptual limitation of multimodal large language models. Preprint at https://arxiv.org/abs/2402.07384v1 (2024).
Ward, E. J. Exploring perceptual illusions in deep neural networks. Vis. Sci. Soc. Annu. Meet. Abstr. 19, 34b (2019).
Zhang, H. & Yoshida, S. Exploring deep neural networks in simulating human vision through five optical illusions. Appl. Sci. 14, 3429 (2024).
Zhang, Y., Pan, J., Zhou, Y., Pan, R. & Chai, J. Grounding visual illusions in language: do vision-language models perceive illusions like humans? In Proc. 2023 Conference on Empirical Methods in Natural Language Processing (eds Bouamor, H. et al.) 5718–5728 (Association for Computational Linguistics, 2023).
Duchaine, B. & Nakayama, K. The Cambridge face memory test: results for neurologically intact individuals and an investigation of its validity using inverted face stimuli and prosopagnosic participants. Neuropsychologia 44, 576–585 (2006).
Kingdom, F. A. A. & Prins, N. in Psychophysics 2nd edn (eds Kingdom, F. A. A. & Prins, N.) Ch. 3 (Academic, 2016).
Peirce, J. et al. PsychoPy2: experiments in behavior made easy. Behav. Res. Methods 51, 195–203 (2019).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Tangtartharakul, G. Gene-Tangt/neuropsych_vlm_bench: First release. Zenodo https://doi.org/10.5281/ZENODO.16809513 (2025).
