Do Multimodal Large Language Models See Like Humans?

City University of Hong Kong
*Joint first authors
Do multimodal large language models (MLLMs) perceive the world in the same way as humans do? Our research reveals a significant gap between current MLLMs and the human visual system (HVS) across multiple aspects.


Abstract

Multimodal Large Language Models (MLLMs) have achieved impressive results on various vision tasks, leveraging recent advancements in large language models. However, a critical question remains unaddressed: do MLLMs perceive visual information similarly to humans? Current benchmarks lack the ability to evaluate MLLMs from this perspective. To address this challenge, we introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision. HVSBench curates over 85K multimodal samples, spanning 13 categories and 5 fields of the HVS: Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. Extensive experiments demonstrate the effectiveness of our benchmark in providing a comprehensive evaluation of MLLMs. Specifically, we evaluate 13 MLLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. Our experiments reveal that HVSBench presents a new and significant challenge for cutting-edge MLLMs. We believe that HVSBench will facilitate research on human-aligned and explainable MLLMs, marking a key step in understanding how MLLMs perceive and process visual information.

TL;DR

Current benchmarks cannot assess the alignment between MLLMs and the HVS. We are the first to systematically study and evaluate MLLMs from this perspective, by introducing diverse questions grounded in the HVS.

(a) We propose the large-scale and comprehensive HVSBench, with a robust evaluation protocol. (b) We evaluate 13 leading MLLMs on HVSBench across 5 fields, revealing room for improvement and offering insights for developing HVS-aligned MLLMs. The shown predictions are from GPT-4o. For brevity, only 2 of the 10 human scanpaths are visualized. Questions and predictions are simplified.

HVSBench

HVSBench curates over 85K multimodal samples, spanning 5 distinct fields of the HVS:
1. Prominence. Test whether the regions MLLMs focus on align with those that are prominent to human perception.
2. Subitizing. Test whether the number of visually prominent objects for MLLMs matches human perception.
3. Prioritizing. Assess if the order of importance assigned by MLLMs to objects reflects human viewing priorities.
4. Free-Viewing. Check if MLLMs can mirror the human attention shift (i.e., the sequence of locations that the HVS attends to) in an image during free viewing.
5. Searching. Test if MLLMs can follow a similar sequence of gazes as humans when searching for a specific object in an image.

To ensure both quality and variability, we build our benchmark on a curated collection of large-scale, high-quality datasets on the HVS derived from real-world human studies, with additional human verification.
Samples of the 13 question types in HVSBench, with simplified text. Ground-truth (GT) ranks and scanpath plots are shown for better visualization.
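As a rough illustration of how one such benchmark sample might be represented and scored, here is a minimal Python sketch. The record fields, question format, and accuracy scoring below are illustrative assumptions, not HVSBench's actual data schema or evaluation protocol.

from dataclasses import dataclass

@dataclass
class HVSBenchSample:
    """Hypothetical record for one multiple-choice HVSBench question."""
    field: str          # e.g. "Prominence", "Subitizing", "Prioritizing", ...
    image_path: str     # path to the stimulus image
    question: str       # question posed to the MLLM
    options: list[str]  # candidate answers (for multiple-choice questions)
    answer: str         # ground truth derived from human-study data

def accuracy(samples, predict):
    """Fraction of samples where the model's choice matches the ground truth.

    `predict` is any callable mapping a sample to one of its options,
    e.g. a wrapper around an MLLM API call.
    """
    correct = sum(predict(s) == s.answer for s in samples)
    return correct / len(samples) if samples else 0.0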

Results

Still significant room for improvement

Extensive experiments demonstrate the effectiveness of our benchmark in providing a comprehensive evaluation of MLLMs. Specifically, we evaluate 13 MLLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results.

HVSBench Leaderboard, highlighting that current open-source and proprietary MLLMs still underperform in aligning with the HVS. For example, GPT-4o falls short on several metrics, such as Prominence, Prioritizing, and Searching accuracy, where it performs worse than several open-source models such as Qwen2, demonstrating the potential of recent open-source MLLMs. In Free-viewing, only LLaVA-OV and Gemini-1.5-Flash outperform random guessing in terms of accuracy. The situation improves slightly in the Searching field: both LLaVA-OV and Qwen2, as well as two proprietary MLLMs, greatly outperform random guessing in accuracy. However, almost all methods perform poorly on the scanpath similarity metrics ("M-Dir" and "M-Pos" in Free-viewing (Q9) and Searching (Q12)), underperforming random guessing, except for Qwen2 in Searching.
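For intuition on the scanpath similarity metrics, the sketch below shows one common way to compare a predicted scanpath against a human one in terms of saccade direction and fixation position, loosely following the MultiMatch family of measures. It is an assumed approximation for illustration only, not the benchmark's exact implementation (which, among other things, handles scanpaths of different lengths).

import math

def scanpath_similarity(pred, gt, image_diag):
    """Direction and position similarity between two scanpaths.

    pred, gt: lists of (x, y) fixation coordinates, assumed here to have
    the same length and at least two fixations each.
    image_diag: image diagonal in pixels, used to normalize distances.
    Returns (dir_sim, pos_sim), both in [0, 1]; higher is better.
    """
    # Position: 1 - mean fixation distance, normalized by the image diagonal.
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    pos_sim = 1.0 - sum(dists) / (len(dists) * image_diag)

    # Direction: 1 - mean angular difference between successive saccade
    # vectors, normalized by pi.
    def angles(path):
        return [math.atan2(y2 - y1, x2 - x1)
                for (x1, y1), (x2, y2) in zip(path, path[1:])]
    diffs = [abs(a - b) for a, b in zip(angles(pred), angles(gt))]
    diffs = [min(d, 2 * math.pi - d) for d in diffs]  # wrap into [0, pi]
    dir_sim = 1.0 - sum(diffs) / (len(diffs) * math.pi)
    return dir_sim, pos_sim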




MLLMs' explanations of their choices

To understand why GPT-4o errs on the prioritization task (bottom left) and to identify its focus of attention, we asked it to explain its reasoning. Its explanation highlights an over-reliance on physical attributes such as object size and position, overly associating these factors with salience. It also reveals a limitation in GPT-4o's attention mechanism: it does not account for social or human-centric cues, which might elevate the prominence of smaller yet contextually significant objects or people.
GPT-4o considers: "The van is large and occupies a significant portion of the image. Its size and central positioning make it visually dominant. The person near the van is secondary in prominence, as they are close to the van but are relatively smaller in size. The person standing on the right is smaller and less centrally located, reducing its visual dominance. The person on the far right is further isolated, diminishing their salience. Based on size, position, and visual weight, the van appears to be the most salient object in this image."




Larger models align better

Larger MLLMs generally outperform smaller ones across all metrics, suggesting that increasing model size leads to better alignment with the HVS.
PO, SU, PI, FV, and SE denote Prominence, Subitizing, Prioritizing, Free-viewing, and Searching, respectively.




Do human captions or descriptions improve alignment?

It is possible that certain annotations, such as human-generated captions and detailed descriptions, implicitly reflect the HVS. We therefore evaluated whether adding detailed descriptions ("Detail.") or short captions ("Cap.") could improve the performance of the base GPT-4o model. The results suggest that providing more context through human captions or detailed descriptions can improve performance on some evaluation criteria; however, aligning MLLMs with the HVS cannot be achieved merely by integrating human-generated captions and summaries, especially in the Prioritizing, Free-viewing, and Searching fields.
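For concreteness, attaching such context to a query might look like the minimal sketch below; the prompt wording, the `caption`/`detail` arguments, and the commented-out `ask_mllm` wrapper are illustrative assumptions, not the exact protocol used in the evaluation.

def build_query(question, caption=None, detail=None):
    """Prepend optional human-written context ("Cap." / "Detail.") to a question."""
    parts = []
    if caption:
        parts.append(f"Image caption: {caption}")
    if detail:
        parts.append(f"Detailed description: {detail}")
    parts.append(question)
    return "\n".join(parts)

# Example usage with a hypothetical MLLM wrapper:
# answer = ask_mllm(image, build_query(q, caption="A van parked on a street."))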




Do field-specific hints improve alignment?

Since definitions and prior knowledge of the fields also provide cues about the HVS, we examine whether adding Field-Specific Hints can enhance model performance. For example, the hint for the Prominence field is: "The detection of salient objects aims to simulate the human visual perception system by identifying and localizing the most visually striking object(s) in a scene. Previous research suggests that the most salient object is the one that attracts the highest proportion of fixations, as indicated by the agreement between fixation patterns and saliency judgments. In general, there are two primary priors: objects closer to the viewer are perceived as more salient, and salient objects often appear near the center of the scene. Additionally, cues such as color contrast, spatial bias, and depth contrast also influence saliency." Our results show an improvement in the Prominence, Subitizing, and Prioritizing fields but a noticeable decrease in performance for Free-viewing and Searching. This decline suggests that too much contextual information may hinder the model's focus on the raw visual features needed for these gaze-prediction tasks. A possible explanation is that prior knowledge might cause the model to focus immediately on specific parts of the image, disrupting the natural temporal sequence of the human gaze. This could lead to fixation sequences that are less representative of human free-viewing or searching.
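As with captions and descriptions, the hint is simply prepended to the query. A minimal sketch, assuming a per-field hint lookup, is shown below; only the Prominence hint quoted above is included, truncated for brevity.

# Hypothetical field-specific hints, keyed by field name.
FIELD_HINTS = {
    "Prominence": (
        "The detection of salient objects aims to simulate the human visual "
        "perception system by identifying and localizing the most visually "
        "striking object(s) in a scene. ..."  # remainder of the hint omitted
    ),
}

def add_field_hint(question, field):
    """Prepend the field's hint to the question, if one is defined."""
    hint = FIELD_HINTS.get(field)
    return f"{hint}\n\n{question}" if hint else question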




Possible applications

We design a Cropping-Based Prominence Enhancement task to demonstrate one application enabled by a better HVS-aligned MLLM in the Prominence field. Specifically, we examine how GPT-4o crops an image to enhance the prominence of one object: a photo. With a task-specific hint, GPT-4o generates a reasonable analysis and successfully crops the image to highlight the photo, compared with the result without the hint, demonstrating better alignment with the HVS. This can be directly applied to automated design, context-aware content generation, and visual storytelling. Such improvements in alignment with the HVS are crucial, not only for enhancing the performance of vision models, but also for facilitating emerging applications with broad societal and technological impact, e.g., perception in autonomous vehicles and robots, human-robot collaboration, assistive tools for the visually impaired, and even tasks like saliency ranking.
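A simple programmatic version of such a crop, given a bounding box for the target object, might look like the sketch below. The Pillow-based implementation and the margin parameter are illustrative assumptions, not the procedure GPT-4o itself follows.

from PIL import Image

def crop_to_enhance_prominence(image_path, box, margin=0.15):
    """Crop around a target object's bounding box so it dominates the frame.

    box: (left, top, right, bottom) of the target object in pixels.
    margin: fraction of the box size kept as surrounding context.
    """
    img = Image.open(image_path)
    left, top, right, bottom = box
    dx = (right - left) * margin
    dy = (bottom - top) * margin
    crop_box = (
        int(max(0, left - dx)),
        int(max(0, top - dy)),
        int(min(img.width, right + dx)),
        int(min(img.height, bottom + dy)),
    )
    return img.crop(crop_box)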

BibTeX

@InProceedings{Lin_HVSBench,
  author    = {Lin, Jiaying and Ye, Shuquan and Lau, Rynson W.H.},
  title     = {Do Multimodal Large Language Models See Like Humans?},
  booktitle = {arXiv},
  year      = {2024},
}