This newsletter covers recent multimodal research: new models, benchmarks, and training strategies for image and text understanding. The papers range from sentiment analysis and scientific reasoning to pixel-level grounding, intra-modal alignment in CLIP, low-resolution robustness, and omni-modal modeling, and each one builds on multimodal foundation models in a different way.
LLaVAC: Fine-tuning LLaVA as a Multimodal Sentiment Classifier by T. Chay-intr, Y. Chen, K. Viriyayudhakorn, T. Theeramunkong https://arxiv.org/abs/2502.02938
Caption: This image illustrates a simplified example of multimodal sentiment analysis. It presents an image (smiley face), text ("I am happy! #happyday"), and the predicted sentiment label ("Positive"), showcasing the type of data used to train and evaluate models like LLaVAC. This visual representation aligns with LLaVAC's approach of using structured prompts with image, text, and sentiment labels for fine-tuning.
Researchers have introduced LLaVAC, a novel method that leverages the Large Language and Vision Assistant (LLaVA) for multimodal sentiment analysis (MSA). Unlike traditional MSA approaches that rely on complex feature fusion from pre-trained models like BERT and CLIP, LLaVAC simplifies the process by fine-tuning LLaVA with a structured prompt incorporating unimodal (image and text) and multimodal sentiment labels. This structured prompting guides the model to classify sentiment effectively. The key innovation lies in using unimodal labels as context for multimodal label prediction, enhancing the model's understanding of sentiment across different modalities.
The methodology involves designing a prompt that presents image, text, and multimodal data alongside their corresponding sentiment labels (positive, negative, neutral). This structured format constrains LLaVA's output to these predefined sentiment polarities, effectively transforming it into a classifier. The model is then fine-tuned using this prompt-response structure, enabling it to learn the relationships between image, text, and their combined sentiment. This approach simplifies the MSA pipeline, eliminating the need for complex feature engineering and fusion strategies typically employed in previous works.
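To make the prompt-response structure concrete, here is a minimal sketch of how such a training example could be assembled and how a generated answer could be mapped back to a class. The field names, the `<image>` placeholder, and the exact wording are assumptions for illustration, not the template used in the paper.

```python
from typing import Optional

# The three sentiment polarities named in the paper summary.
LABELS = ("positive", "negative", "neutral")


def build_prompt(text: str) -> str:
    """Hypothetical instruction; the image itself is passed to the model separately."""
    return (
        "<image>\n"
        f"Text: {text}\n"
        "Give the sentiment (positive, negative, or neutral) of the image, "
        "the text, and the image-text pair."
    )


def build_response(image_label: str, text_label: str, multimodal_label: str) -> str:
    """Target response: unimodal labels come first and act as context
    for the multimodal label (assumed layout)."""
    for label in (image_label, text_label, multimodal_label):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
    return (
        f"Image sentiment: {image_label}\n"
        f"Text sentiment: {text_label}\n"
        f"Multimodal sentiment: {multimodal_label}"
    )


def parse_multimodal_label(generated: str) -> Optional[str]:
    """Map generated text back to one of the fixed classes, or None if it does not fit."""
    for line in generated.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "multimodal sentiment":
            label = value.strip().lower()
            return label if label in LABELS else None
    return None
```

Restricting the parsed value to `LABELS` is what turns free-form generation into a classifier in this sketch; anything outside the three polarities is treated as an invalid prediction.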
Evaluated on the MVSA-Single dataset under three different data split methods, LLaVAC achieved the highest accuracy and weighted F1-score on all three splits, outperforming the baseline methods it was compared against. On the random split, for instance, it reached 79.46% accuracy and a 79.00% weighted F1-score. An ablation study further confirmed the importance of incorporating unimodal labels, revealing a performance drop when only the multimodal label was used for fine-tuning. This study highlights the potential of MLLMs like LLaVA for classification tasks in MSA, offering a practical and accessible solution for this complex task.
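For readers less familiar with the two reported metrics, the sketch below computes accuracy and support-weighted F1 from scratch. The label strings and the tiny example are made up for illustration; they are not data from the paper.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Illustrative labels only.
y_true = ["positive", "positive", "negative", "neutral"]
y_pred = ["positive", "negative", "negative", "neutral"]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Weighting by support matters on imbalanced sentiment data, where a plain macro average would give rare classes the same influence as common ones.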
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning by Yibo Yan, Shen Wang, Jiahao Huo, Jingheng Ye, Zhendong Chu, Xuming Hu, Philip S. Yu, Carla Gomes, Bart Selman, Qingsong Wen https://arxiv.org/abs/2502.02871
Caption: This figure illustrates a roadmap for developing Multimodal Large Language Models (MLLMs) for scientific reasoning, progressing through four stages: Broad Knowledge & Recognition, Analogical Reasoning & Generalization, Insightful Inference, and Creative Hypothesis Generation. It connects these reasoning capabilities with current paradigms in MLLM development, such as Data Integration, Knowledge Retrieval, Contextual Understanding, Pattern Recognition, and Simulation & Hypothesis Testing, highlighting the path towards Artificial General Intelligence (AGI).
Scientific reasoning, the process of applying logic and evidence to explore scientific phenomena, is crucial for knowledge advancement. While current models show promise, they struggle with generalization and multimodal perception. This position paper argues that Multimodal Large Language Models (MLLMs), capable of integrating text, images, and other modalities, offer a significant opportunity to advance scientific reasoning across diverse fields like mathematics, physics, chemistry, and biology. The authors propose that MLLMs' ability to process complex multimodal data unlocks unprecedented opportunities for scientific discovery.
The paper outlines a four-stage research roadmap for scientific reasoning capabilities: Broad Knowledge and Recognition (foundational understanding through retrieval-based reasoning), Analogical Reasoning and Generalization (drawing connections across domains using relational and analogical thinking), Insightful Inference (deducing complex outcomes from minimal data using predictive reasoning), and Creative Hypothesis Generation (generating innovative hypotheses and exploring uncharted territories using generative reasoning). This roadmap aims to guide the development of MLLMs towards achieving Artificial General Intelligence (AGI).
The authors delve into how MLLMs are currently applied in scientific reasoning, highlighting key paradigms: Data Integration, Knowledge Retrieval, Contextual Understanding, Pattern Recognition, and Simulation and Hypothesis Testing. Despite these advancements, challenges remain, including data diversity across domains, achieving reasoning depth in complex tasks, mitigating error propagation across modalities, managing hallucinations (both harmful and potentially beneficial), and addressing ethical and interpretability concerns. To address these, the paper proposes several perspectives for future research, including developing unified scientific MLLMs, improving multimodal datasets, integrating expert knowledge and explainability, expanding scientific reasoning to generative tasks, exploring the potential of hallucinations as a creative tool, building interactive feedback systems, fostering agent-based collaboration, and evolving reasoning schemes.
PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? by Mennatullah Siam https://arxiv.org/abs/2502.04192
Caption: This diagram illustrates the PixFoundation method for extracting pixel-level grounding information from vanilla Multi-Modal Large Language Models (MLLMs). It shows how the model processes a series of images, automatically selects relevant masks based on textual prompts, and ultimately identifies the image with the best highlighted feature (in this case, the match flame). This approach avoids explicit pixel-level training, instead leveraging the MLLM's inherent reasoning capabilities to achieve competitive grounding performance.
While multimodal large language models (MLLMs) show promise in various vision-related tasks, their pixel-level understanding remains a challenge. Current research focuses on training MLLMs with pixel-level grounding supervision on massive labelled datasets. This paper questions this approach, arguing that such training may hinder performance in other areas, like visual question answering (VQA). The author introduces two new benchmarks, PixMMVP and PixCV-Bench, built by augmenting existing datasets with pixel-level annotations and referring expressions, to rigorously evaluate pixel-level MLLMs.
Evaluating several state-of-the-art pixel-level MLLMs on these benchmarks, including LISA, GLaMM, OMG-LLaVA, and LLaVA-G, the author finds that these models often underperform in VQA compared to vanilla MLLMs (those not trained with pixel-level grounding). Surprisingly, some pixel-level MLLMs even show degraded grounding performance compared to simpler methods. To address these limitations, the author proposes PixFoundation, a method for extracting pixel-level grounding information from vanilla MLLMs. PixFoundation leverages the observation that grounding information often emerges not in the exact noun phrase of the object, but in related output tokens describing its appearance, location, or context. An oracle upper bound, PixFoundation†, based on selecting the attention map with the highest Intersection over Union (IoU) with the ground truth mask, is also proposed.
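The oracle selection step can be sketched in a few lines. This is a simplified illustration, assuming each attention map has already been converted to a binary mask and that masks are represented as sets of `(row, col)` pixel coordinates; both choices are assumptions made here, not details from the paper.

```python
def iou(mask_a, mask_b):
    """Intersection over Union between two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 0.0


def oracle_select(candidate_masks, gt_mask):
    """Return (index, IoU) of the candidate that best matches the ground truth.

    This needs the ground-truth mask, so it is an upper bound, not a deployable method.
    """
    scores = [iou(mask, gt_mask) for mask in candidate_masks]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy example: a 2x2 ground-truth object and three candidate masks.
gt = {(0, 0), (0, 1), (1, 0), (1, 1)}
candidates = [
    {(0, 0)},                  # covers one of four pixels
    {(0, 0), (0, 1), (1, 0)},  # covers three of four pixels
    {(2, 2), (2, 3)},          # misses the object entirely
]
print(oracle_select(candidates, gt))
```

Because the oracle peeks at the ground truth, its score indicates how much grounding information is present in the candidate maps, while the automatic selection has to find the right map without that signal.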
Results show that PixFoundation, without any pixel-level training, achieves competitive performance in visual grounding, even outperforming some pixel-level MLLMs. The evaluation metric is the harmonic mean $S = \frac{2}{\frac{1}{\max(A, A^\dagger)} + \frac{1}{\max(M, M^\dagger)}}$, where $A$ and $A^\dagger$ represent VQA accuracy with and without explicit segmentation output, and $M$ and $M^\dagger$ represent mean IoU with and without explicit segmentation prompting. Further analysis reveals that grounding often emerges in the last 40% of the output text in LLaVA 1.5 variants and the last 10% in Cambrian-1, suggesting that the model's reasoning process influences when grounding occurs. These findings challenge the current paradigm of pixel-level MLLM training and suggest that focusing on improving the reasoning capabilities of vanilla MLLMs, combined with methods like PixFoundation, might be a more effective path towards robust pixel-level understanding.
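A small sketch of the combined score, reading it as the harmonic mean of the best VQA accuracy and the best mean IoU across the two evaluation settings. The function name and the zero-handling convention are assumptions added here; inputs are taken as fractions in [0, 1].

```python
def pix_score(acc, acc_dagger, miou, miou_dagger):
    """Harmonic mean of the best accuracy and the best mean IoU.

    A model only scores well if it is strong at both VQA and grounding.
    """
    best_acc = max(acc, acc_dagger)
    best_miou = max(miou, miou_dagger)
    if best_acc == 0 or best_miou == 0:
        return 0.0  # assumed convention: avoid division by zero
    return 2.0 / (1.0 / best_acc + 1.0 / best_miou)


# Illustrative numbers only: strong VQA cannot compensate for weak grounding.
print(pix_score(0.8, 0.6, 0.2, 0.1))
```

The harmonic mean is pulled towards the smaller of its two inputs, which is why it suits a benchmark that asks whether grounding ability comes at the cost of question answering.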
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion by Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Andrew D. Bagdanov https://arxiv.org/abs/2502.04263
Caption: This figure illustrates the concept of intra-modal misalignment in Vision-Language Models (VLMs), where similarities within a single modality are inaccurate. The left side depicts this misalignment, while the right side contrasts a standard intra-modal approach with the proposed inter-modal approach, which leverages modality inversion (OTI/OVI) to improve intra-modal similarity calculations by transforming the task into an inter-modal one. The bar graphs demonstrate the improved similarity scores achieved by the inter-modal approach.
Vision-Language Models (VLMs) like CLIP excel at aligning image and text embeddings. However, using these models for intra-modal tasks (e.g., image-to-image retrieval) by individually leveraging their encoders might be suboptimal. This paper argues that the inter-modal contrastive loss, central to VLM training, neglects intra-modal relationships, leading to "intra-modal misalignment"—inaccurate similarities within a single modality (image-image or text-text).
The authors propose transforming intra-modal tasks into inter-modal ones using modality inversion, adapting Optimization-based Textual Inversion (OTI) and introducing Optimization-based Visual Inversion (OVI). These techniques map features from their native modality to the complementary one without requiring external data or training a separate mapping network. The idea is to leverage CLIP's strong inter-modal alignment to improve intra-modal similarity calculations.
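The optimization idea behind textual inversion can be illustrated with a toy example. Everything here is a stand-in: the "text encoder" is a fixed linear map rather than CLIP, the pseudo-token is a 3-dimensional vector, and gradients are estimated by finite differences. The sketch only shows the mechanics of optimizing an input in the complementary modality until its encoded feature aligns with a given image feature.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def toy_text_encoder(token, weights):
    """Hypothetical frozen 'text encoder': a fixed linear map (not CLIP)."""
    return [sum(w * x for w, x in zip(row, token)) for row in weights]


def invert_to_text_token(image_feature, weights, steps=200, lr=0.5, eps=1e-5):
    """Gradient ascent on cosine similarity, using finite-difference gradients.

    Only the pseudo-token is updated; the encoder weights stay frozen.
    """
    token = [1.0] * len(weights[0])
    for _ in range(steps):
        base = cosine(toy_text_encoder(token, weights), image_feature)
        grad = []
        for i in range(len(token)):
            bumped = list(token)
            bumped[i] += eps
            bumped_sim = cosine(toy_text_encoder(bumped, weights), image_feature)
            grad.append((bumped_sim - base) / eps)
        token = [t + lr * g for t, g in zip(token, grad)]
    return token


# Toy setup: invert one "image feature" into the text side.
WEIGHTS = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.5, 0.0, 1.0]]
query_image_feature = [0.2, 0.9, -0.4]
token = invert_to_text_token(query_image_feature, WEIGHTS)
print(cosine(toy_text_encoder(token, WEIGHTS), query_image_feature))
```

In the real setting the inverted text-side feature of a query image would then be compared against gallery image features, so an image-to-image retrieval problem is scored with image-text similarities. This toy map is invertible, so it cannot show that benefit; it only demonstrates the per-sample optimization that requires no external data or mapping network.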
Experiments across numerous datasets demonstrate the effectiveness of this inter-modal approach, consistently outperforming intra-modal baselines. Applying modality inversion to a natively inter-modal task like zero-shot image classification decreases performance, reinforcing that the benefit comes from leveraging inter-modal alignment for intra-modal problems. The paper also investigates the role of intra-modal constraints and the modality gap. Using SLIP, a VLM trained with both inter- and intra-modal losses, the authors show that incorporating an intra-modal term during pre-training reduces the performance boost from modality inversion. Similarly, fine-tuning CLIP to close the modality gap, achieved by increasing the temperature parameter $\tau$ in the contrastive loss, also diminishes the impact of modality inversion, linking the modality gap to intra-modal misalignment. The contrastive loss in question is $\mathcal{L}_{\text{CLIP}} = \frac{1}{N} \sum_{n=1}^{N} \left( -\log \frac{\exp(c(\psi_n, \psi_n^+)/\tau)}{\sum_{m=1}^{N} \exp(c(\psi_n, \psi_m^+)/\tau)} - \log \frac{\exp(c(\psi_n^+, \psi_n)/\tau)}{\sum_{m=1}^{N} \exp(c(\psi_n^+, \psi_m)/\tau)} \right)$, where $(\psi_n, \psi_n^+)$ is the $n$-th matched pair in a batch of $N$ and $c$ is the similarity function.
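A symmetric contrastive loss of this form can be written out in plain Python to show where the temperature enters. This is a minimal sketch that assumes cosine similarity for $c$ and uses tiny hand-made features; it is not the training code of CLIP or of the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity, used here as the similarity function c."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))


def clip_loss(image_feats, text_feats, tau):
    """Symmetric image-to-text and text-to-image contrastive loss.

    Note that only image-text similarities appear: no image-image or
    text-text term constrains the geometry within a single modality.
    """
    n = len(image_feats)
    sim = [[cosine(img, txt) / tau for txt in text_feats] for img in image_feats]
    total = 0.0
    for k in range(n):
        row = sim[k]                         # image k against every text
        col = [sim[m][k] for m in range(n)]  # text k against every image
        total += -(row[k] - logsumexp(row)) - (col[k] - logsumexp(col))
    return total / n


images = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(clip_loss(images, aligned, tau=0.1), clip_loss(images, shuffled, tau=0.1))
```

A smaller `tau` sharpens the softmax over the batch, while a larger one flattens it, which is the knob the fine-tuning experiment varies.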
LR0.FM: Low-Resolution Zero-shot Classification Benchmark For Foundation Models by Priyank Pathak, Shyam Marjit, Shruti Vyas, Yogesh S Rawat https://arxiv.org/abs/2502.03950
Caption: This image illustrates the LR-TK0 (LR-Zero-Shot Tokens) architecture for enhancing the low-resolution robustness of visual-language foundation models. Trainable LR-specific tokens are added to the spatial tokens of a frozen foundation model (teacher), and the student model is trained via self-supervised distillation on synthetic high-resolution data, comparing its output with the teacher's output on both high- and low-resolution inputs using a CLIP loss.
While visual-language foundation models (FMs) excel in zero-shot generalization, their performance on low-resolution (LR) images remains underexplored. This paper introduces LR0.FM, a benchmark evaluating the impact of low resolution on zero-shot classification performance across various FMs, backbones, and datasets. The authors also propose a new metric, Weighted Aggregated Robustness (WAR), to better evaluate performance across resolutions and datasets. WAR is calculated as $\text{WAR-}n = \frac{\sum_{D} |\Gamma^{D}_{n}| \times w_{D}}{\sum_{D} |w_{D}|}$, where the sums run over datasets $D$, $w_{D}$ is the dataset weight, and $\Gamma^{D}_{n}$ is the dataset-specific improved robustness score for resolution $n \times n$. The improved relative robustness is calculated as $\Gamma^{D}_{n} = \gamma^{D}_{n} \times \left(1 - e^{-\alpha (E_{D})^{2}}\right)$, where $\gamma^{D}_{n}$ is the traditional relative robustness, $E_{D}$ is the accuracy gap between high-resolution and random prediction accuracy, and $\alpha$ is a hyperparameter.
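The two formulas translate directly into code. In this sketch the traditional relative robustness values are taken as given inputs, and the value of `alpha`, the dataset weights, and the use of fractions in [0, 1] for the accuracy gap are all illustrative assumptions rather than settings from the paper.

```python
import math


def improved_robustness(gamma, acc_gap, alpha):
    """Gamma = gamma * (1 - exp(-alpha * E^2)).

    The second factor shrinks the score when the model is barely better
    than random guessing even at high resolution (small gap E).
    """
    return gamma * (1.0 - math.exp(-alpha * acc_gap ** 2))


def war(gammas, acc_gaps, weights, alpha):
    """Weighted Aggregated Robustness over datasets at one resolution."""
    numerator = sum(
        abs(improved_robustness(g, e, alpha)) * w
        for g, e, w in zip(gammas, acc_gaps, weights)
    )
    return numerator / sum(abs(w) for w in weights)


# Illustrative values for two datasets at a single resolution.
print(war(gammas=[0.8, 0.5], acc_gaps=[0.6, 0.02], weights=[1.0, 3.0], alpha=200.0))
```

In the example, the second dataset has a gap of only 0.02 over random prediction, so its relative robustness of 0.5 contributes much less than its weight alone would suggest.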
Key findings from LR0.FM reveal that model size correlates positively with robustness to resolution degradation, dataset quality is more important than size, and fine-tuned/higher-resolution models are less robust against LR. Interestingly, FMs often make semantically reasonable predictions even at LR, suggesting that the lack of detail primarily affects initial layers. Based on this, the authors introduce LR-TK0 (LR-Zero-Shot Tokens), adding trainable LR-specific tokens to the spatial tokens of the frozen FM. This bridges the gap between high-resolution (HR) and LR domains through self-supervised distillation on a synthetic HR dataset.
LR-TK0 demonstrates consistent improvements in robustness at low resolutions, particularly for MetaCLIP, with minimal impact on high-resolution performance. Compared to super-resolution methods, LR-TK0 offers a more effective and efficient solution.
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment by Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao https://arxiv.org/abs/2502.04328
Caption: This diagram illustrates Ola's progressive modality alignment strategy, starting with text-image training, then text-video training, and finally vision-audio bridging. This staged approach allows Ola to efficiently learn cross-modal representations and achieve strong performance in image, video, and audio understanding tasks. The orange arrows and "Ola" boxes with flame icons represent the progressive integration and alignment of each modality within the model.
Ola is a new omni-modal large language model (LLM) designed for competitive performance across image, video, and audio understanding tasks. It addresses the performance gap between open-source omni-modal solutions and specialized LLMs through a progressive modality alignment strategy and architectural innovations. Training begins with image and text, then progressively incorporates speech data (linking language and audio) and video data (connecting all modalities). This staged approach allows for efficient use of cross-modal alignment data. Ola also uses a dual encoder approach for audio (Whisper-v3 for speech, BEATs for music) and a Local-Global Attention Pooling layer to efficiently downsample visual features: it computes downsampled features $f_{\text{global}}$ and combines them with the original features $f$ using a learned importance score $\pi$, with $f = \text{Concat}[f, f_{\text{global}}]$ and $\pi = \text{Softmax}(\text{MLP}(f))$. Sentence-wise streaming decoding for text and speech generation enhances real-time interaction.
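The pooling idea can be sketched for a single local window of visual features. Several simplifications are assumed here and are not taken from the paper: the global feature is a plain mean over the window, the MLP is replaced by one linear scoring vector (`score_weights`, a hypothetical parameter), and the softmax runs over the positions inside the window.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pool_region(region, score_weights):
    """Downsample one window of feature vectors into a single vector.

    region: list of feature vectors (one per spatial position in the window).
    score_weights: linear stand-in for the MLP, applied to Concat[f, f_global].
    """
    dim = len(region[0])
    # Global summary of the window (mean pooling as a stand-in for downsampling).
    f_global = [sum(f[d] for f in region) / len(region) for d in range(dim)]
    # Score each position from its own feature concatenated with the global one.
    logits = [sum(w * x for w, x in zip(score_weights, f + f_global)) for f in region]
    pi = softmax(logits)
    # Importance-weighted sum keeps more of the positions the scorer prefers.
    return [sum(p * f[d] for p, f in zip(pi, region)) for d in range(dim)]


window = [[1.0, 0.0], [3.0, 0.0], [1.0, 2.0], [3.0, 2.0]]
print(pool_region(window, [0.0, 0.0, 0.0, 0.0]))   # uniform weights: plain average
print(pool_region(window, [10.0, 0.0, 0.0, 0.0]))  # favors large first coordinate
```

With all-zero scoring weights the layer reduces to average pooling, which makes clear what the learned importance score adds: the ability to keep informative positions instead of blurring the whole window.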
Evaluated on various benchmarks, Ola demonstrates impressive performance across modalities, outperforming existing open omni-modal LLMs and achieving competitive results against specialized models. Ablation studies confirm the benefits of progressive training and cross-modal video-audio data.
This newsletter highlighted several key advancements in multimodal image and text foundation models. We've seen innovative training strategies like LLaVAC's structured prompting for sentiment analysis and Ola's progressive modality alignment for omni-modal understanding. New benchmarks like LR0.FM expose the limitations of current models in handling low-resolution images, while clever techniques like PixFoundation and modality inversion offer promising solutions to improve performance in pixel-level grounding and intra-modal tasks. These diverse approaches demonstrate the ongoing evolution of the field, pushing towards more robust, versatile, and efficient multimodal AI models capable of tackling increasingly complex real-world challenges.