This newsletter explores recent breakthroughs and challenges in the world of multimodal image and text foundation models. From leveraging Large Language Models (LLMs) for semantic segmentation to understanding the intricacies of visual attention, we'll cover key developments shaping the future of this rapidly evolving field. We'll also examine novel techniques for fine-tuning these models in low-data scenarios, highlighting the delicate balance between specialization and generalization.
MASTER: Multimodal Segmentation with Text Prompts by Fuyang Liu, Shun Lu, Jilin Mei, Yu Hu https://arxiv.org/abs/2503.04199
Autonomous driving relies heavily on precise scene understanding, a task made challenging by fluctuating weather and lighting. RGB-Thermal fusion, combining data from both modalities, offers a robust solution. However, traditional methods often rely on complex, inflexible fusion modules. This paper introduces MASTER (Multimodal Segmentation with TExt Prompts), a novel architecture using the power of LLMs for a streamlined and adaptable approach to RGB-Thermal fusion.
MASTER employs a dual-path Vision Transformer (ViT) to extract features from RGB and thermal images. These features are then projected into a language feature space, aligning them with embeddings generated from user-defined text prompts. The key innovation lies in the use of an LLM as the fusion module. The aligned image features and text embeddings are fed into the LLM, which generates learned codebook tokens, C<sub>out</sub>, encoding rich semantic information about the scene. This process can be represented as:
C<sub>out</sub> = M(p<sub>v→T</sub>(F<sub>rgb</sub>), p<sub>v→T</sub>(F<sub>thr</sub>), X<sub>txt</sub>, C<sub>in</sub>)
where M is the LLM, p<sub>v→T</sub> is the vision-to-language projection, F<sub>rgb</sub> and F<sub>thr</sub> are the image features, X<sub>txt</sub> are the text prompt tokens, and C<sub>in</sub> are initialized codebook tokens. A lightweight decoder, similar to that of the Segment Anything Model (SAM), uses these enriched tokens to produce the final segmentation masks.
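To make the data flow in the fusion equation concrete, here is a minimal plain-Python sketch. Everything in it is an assumption made for illustration and not the authors' implementation: `project` is a single linear map standing in for p<sub>v→T</sub>, `toy_llm` is a trivial token mixer standing in for M, and the concatenation order of the inputs is a guess.

```python
import random

def project(features, weight):
    """Stand-in for p_{v->T}: map each visual token (d_vis) to d_txt dims."""
    return [[sum(f * w for f, w in zip(token, col)) for col in zip(*weight)]
            for token in features]

def toy_llm(sequence):
    """Stand-in for the LLM M: blend every token with the sequence mean."""
    dim = len(sequence[0])
    mean = [sum(tok[d] for tok in sequence) / len(sequence) for d in range(dim)]
    return [[0.5 * tok[d] + 0.5 * mean[d] for d in range(dim)] for tok in sequence]

def fuse(f_rgb, f_thr, x_txt, c_in, weight):
    """C_out = M(p(F_rgb), p(F_thr), X_txt, C_in), read off the last positions."""
    sequence = project(f_rgb, weight) + project(f_thr, weight) + x_txt + c_in
    hidden = toy_llm(sequence)
    return hidden[-len(c_in):]  # outputs at the codebook token positions

# Toy shapes: 5 RGB tokens, 5 thermal tokens, 2 text tokens, 3 codebook tokens.
rng = random.Random(0)
d_vis, d_txt = 4, 3

def rand_tokens(n, d):
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]

weight = rand_tokens(d_vis, d_txt)  # projection matrix of shape d_vis x d_txt
c_out = fuse(rand_tokens(5, d_vis), rand_tokens(5, d_vis),
             rand_tokens(2, d_txt), rand_tokens(3, d_txt), weight)
print(len(c_out), len(c_out[0]))
```

The point of the sketch is only that both modalities pass through the same kind of projection into the language space, and that the codebook tokens come out with one vector per input codebook token, ready for a mask decoder.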
Evaluated on the MFNet benchmark, MASTER achieved a state-of-the-art mean Intersection over Union (mIoU) of 62.5%, surpassing existing methods. It was particularly strong on small objects such as guardrails and bumps, with IoU improvements of +14.07% and +3.36%, respectively. Qualitative analysis further showed that MASTER produces more continuous segmentations, captures small object shapes better, and reduces noise compared to prior methods such as CMNeXt. These improvements suggest that the LLM effectively integrates multimodal information guided by text prompts, resulting in more accurate and robust scene understanding.
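For readers less familiar with the metric, the following sketch shows how per-class IoU and mIoU can be computed from flattened label maps. It is a simplified illustration: the function names are hypothetical, classes absent from both prediction and ground truth are skipped (an assumption), and benchmark code typically accumulates a confusion matrix over the whole test set instead.

```python
def per_class_iou(pred, target, num_classes):
    """IoU for each class that appears in the prediction or the ground truth."""
    ious = {}
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither map
            ious[c] = inter / union
    return ious

def mean_iou(pred, target, num_classes):
    """Unweighted mean of the per-class IoU values."""
    ious = per_class_iou(pred, target, num_classes)
    return sum(ious.values()) / len(ious)

# Six pixels, three classes; the labels are made up for the example.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(per_class_iou(pred, target, 3))
print(mean_iou(pred, target, 3))
```

Because mIoU averages over classes and not over pixels, a gain on a rare small-object class such as guardrails moves the score as much as the same gain on a dominant class.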
The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights by Yufang Liu, Yao Du, Tao Ji, Jianing Wang, Yang Liu, Yuanbin Wu, Aimin Zhou, Mengdi Zhang, Xunliang Cai https://arxiv.org/abs/2503.04167
Caption: Two geometry problems, identical except for the label positions of vertices G and H, demonstrate how subtle visual changes can alter the correct answer. This example highlights the need for multimodal math models to genuinely "see" and interpret visual information, a challenge addressed by the newly introduced HC-M3D dataset, which features such image variations to test models' visual reliance. The original image, with H at the bottom right, has a correct answer of 97 degrees while the revised image, with G at the bottom right, has a correct answer of 107 degrees.
While Large Vision-Language Models (LVLMs) hold promise for multimodal mathematical reasoning, this paper questions the actual role of visual information in these models. The authors argue that the contribution of visual patterns is often minimal, with models relying heavily on textual cues and answer options. Experiments with shuffling or removing images from training datasets of prominent mathematical LVLMs like G-LLaVA, MathLLaVA, MAVIS, and MultiMath revealed negligible impact on performance, with some models even improving when visual input was perturbed or absent. This suggests an over-reliance on text and underutilization of visual information.
Existing benchmarks therefore overestimate how much these models rely on images, which the authors attribute to two factors: overly informative text and answer options that leak the solution. To address this, they introduce the HC-M3D dataset, comprising samples designed to necessitate visual reliance. A key feature is the inclusion of similar yet distinct images that alter the correct answer.
Leading models evaluated on HC-M3D showed a significant inability to detect crucial subtle visual differences. In over half the cases, predictions remained unchanged after image alteration, indicating a failure to effectively use visual information. Furthermore, enhancing general Visual Question Answering (VQA) capabilities by combining image encoders like CLIP, SigLIP, and DINO did not improve mathematical reasoning, and often decreased it even as general VQA improved. This discrepancy highlights the unique challenges of mathematical reasoning and suggests that the monochromatic, low-information-density images in math problems contribute to the difficulty. These results point to the need for datasets emphasizing visual information, improved image encoders, and better loss functions to enhance visual reliance in mathematical reasoning models.
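The "predictions remained unchanged" statistic can be computed with a few lines of code. The sketch below is a hypothetical illustration, not the benchmark's official scoring script: the function names are invented, and apart from the 97/107 pair taken from the caption above, the answers are made-up placeholders.

```python
def unchanged_rate(preds_original, preds_altered):
    """Fraction of image pairs where the prediction did not change."""
    if len(preds_original) != len(preds_altered):
        raise ValueError("prediction lists must have the same length")
    same = sum(a == b for a, b in zip(preds_original, preds_altered))
    return same / len(preds_original)

def both_correct_rate(preds_original, preds_altered, gold_original, gold_altered):
    """Fraction of pairs answered correctly on both image variants."""
    quads = zip(preds_original, preds_altered, gold_original, gold_altered)
    hits = sum(po == go and pa == ga for po, pa, go, ga in quads)
    return hits / len(preds_original)

# Three image pairs; the first mirrors the 97 vs. 107 degree example.
gold_original = [97, 40, 12]
gold_altered = [107, 55, 30]
preds_original = [97, 40, 12]
preds_altered = [97, 55, 12]  # the model ignores the change in pairs 1 and 3

print(unchanged_rate(preds_original, preds_altered))
print(both_correct_rate(preds_original, preds_altered, gold_original, gold_altered))
```

Since every altered image in such a pair changes the correct answer, a high unchanged rate is direct evidence that the model is answering from the text alone.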
Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation by Aishik Konwer, Zhijian Yang, Erhan Bas, Cao Xiao, Prateek Prasanna, Parminder Bhatia, Taha Kass-Hout https://arxiv.org/abs/2503.04639
Caption: This diagram illustrates a novel framework for enhancing the Segment Anything Model (SAM) for medical image segmentation using unsupervised prompts generated from BiomedCLIP, MedVInT, and GPT-4. It incorporates a preference-based alignment module inspired by Direct Preference Optimization (DPO) to simulate human feedback and refine segmentation quality without extensive annotation. This approach significantly improves performance in low-data scenarios, as demonstrated by the comparison with ground truth segmentations and subsequent ratings used for model optimization.
While SAM shows promise in medical image segmentation, its reliance on supervised prompts limits its application in low-annotation scenarios. This paper presents an enhanced SAM framework utilizing annotation-efficient, unsupervised prompts generated through contrastive language-image pretraining (CLIP) and visual question answering (VQA). This approach combines semantic, locational, and disease/organ information, improving segmentation without expert intervention. The framework leverages BiomedCLIP and MedVInT for visual and textual feature extraction, respectively, with GPT-4 providing disease information. These features are fed into a prompt encoder, and then, along with image embeddings, into a mask decoder.
To address limited annotated data, the researchers incorporated a preference-based alignment module using Direct Preference Optimization (DPO). This module simulates human feedback by generating segmentation candidates and rating them based on ground truth overlap. The model is trained using a DPO-inspired loss function:
L<sub>DPO</sub>(π<sub>ψ</sub>; π<sub>fine</sub>) = -E<sub>(I,Y1,Y2,Y3,Y4)~D</sub> [β<sub>1</sub> log (π<sub>ψ</sub>(Y<sub>2</sub>|I)/π<sub>fine</sub>(Y<sub>2</sub>|I)) + β<sub>2</sub> log (π<sub>ψ</sub>(Y<sub>1</sub>|I)/π<sub>fine</sub>(Y<sub>1</sub>|I)) - β<sub>2</sub> log (π<sub>ψ</sub>(Y<sub>3</sub>|I)/π<sub>fine</sub>(Y<sub>3</sub>|I)) - β<sub>1</sub> log (π<sub>ψ</sub>(Y<sub>4</sub>|I)/π<sub>fine</sub>(Y<sub>4</sub>|I))]
where π<sub>ψ</sub> is the model being optimized, π<sub>fine</sub> is the initially fine-tuned model that serves as the reference, I is the image-prompt pair, Y<sub>i</sub> are the ranked segmentation candidates, and β<sub>i</sub> are weights. This allows learning from simulated preferences without explicit reward functions or extensive domain knowledge.
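A per-sample version of this objective can be sketched as follows. This is an illustrative reading of the formula as printed here, not the authors' code: masks are flat 0/1 lists, candidates are rated by Dice overlap with the ground truth (the overlap measure is an assumption), Y<sub>1</sub> is assumed to be the best-rated candidate, and the log-likelihood values are placeholders.

```python
def dice(mask_a, mask_b):
    """Dice overlap between two binary masks given as flat 0/1 lists."""
    inter = sum(a and b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    return 1.0 if total == 0 else 2.0 * inter / total

def rank_candidates(candidates, ground_truth):
    """Simulated rating: order candidates from best to worst overlap."""
    return sorted(candidates, key=lambda m: dice(m, ground_truth), reverse=True)

def dpo_style_loss(logp_policy, logp_ref, beta1, beta2):
    """Loss for one sample; both inputs hold log-likelihoods for [Y1, Y2, Y3, Y4]."""
    # log(pi_psi(Y_i | I) / pi_fine(Y_i | I)) for each ranked candidate
    y1, y2, y3, y4 = [p - r for p, r in zip(logp_policy, logp_ref)]
    # Signs and beta pairing follow the formula as written above.
    return -(beta1 * y2 + beta2 * y1 - beta2 * y3 - beta1 * y4)

ground_truth = [1, 1, 0, 0]
candidates = [[0, 0, 1, 1], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]]
ranked = rank_candidates(candidates, ground_truth)
print(ranked[0])

# Placeholder log-likelihoods: the policy favors Y1/Y2 more than the reference.
loss = dpo_style_loss([-1.0, -1.0, -3.0, -3.0], [-2.0, -2.0, -2.0, -2.0],
                      beta1=1.0, beta2=0.5)
print(loss)
```

The loss falls when the optimized model raises the likelihood of the higher-rated candidates relative to the reference and lowers it for the lower-rated ones; the expectation in the formula is the average of this quantity over the dataset.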
Evaluated on lung, breast tumor, and abdominal organ segmentation datasets, the method consistently outperformed state-of-the-art methods in limited data settings. For instance, on the Chest X-ray dataset with 20% of the data, it achieved a Dice score of 78.87, significantly outperforming other methods. This demonstrates the effectiveness of unsupervised prompting and preference alignment.
See What You Are Told: Visual Attention Sink in Large Multimodal Models by Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang https://arxiv.org/abs/2503.03321
Caption: The image illustrates the Visual Attention Redistribution (VAR) method. (a) shows the selection of image-centric heads based on the visual non-sink ratio (r). (b) depicts how VAR redistributes attention weights from visual sink tokens (marked with stars) to other image tokens within the selected heads, effectively utilizing the "attention budget".
Large Multimodal Models (LMMs) use attention mechanisms to link text tokens to visual information. However, they often fixate on irrelevant visual tokens, dubbed visual sink tokens, arising from high activation in specific hidden state dimensions. Masking these tokens has minimal impact on output, suggesting wasted "surplus attention." Researchers developed Visual Attention Redistribution (VAR) to redirect this attention.
VAR identifies "image-centric heads" using the visual non-sink ratio (r):
r<sup>l,h</sup> = (Σ<sub>j∈I<sub>vis</sub>\Ž<sup>vis</sup></sub> α<sup>l,h</sup><sub>i,j</sub>) / (Σ<sub>j∈I<sub>vis</sub></sub> α<sup>l,h</sup><sub>i,j</sub>)
where α<sup>l,h</sup><sub>i,j</sub> is the attention weight that text token i assigns to visual token j in layer l and head h. I<sub>vis</sub> is the set of visual tokens, and Ž<sup>vis</sup> is the set of visual sink tokens. VAR redistributes attention from sink tokens to relevant tokens within these heads.
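The ratio and the redistribution step can be sketched for a single attention row, that is, one text token in one head. This is a simplified illustration under stated assumptions and not the authors' implementation: the threshold `rho`, the fraction `portion`, and the rule of handing the freed attention to non-sink visual tokens in proportion to their current weights are all choices made for this example.

```python
def non_sink_ratio(attn_row, visual_idx, sink_idx):
    """r = attention on non-sink visual tokens / attention on all visual tokens."""
    sinks = set(sink_idx)
    total = sum(attn_row[j] for j in visual_idx)
    non_sink = sum(attn_row[j] for j in visual_idx if j not in sinks)
    return non_sink / total if total > 0 else 0.0

def select_heads(ratios, rho):
    """Indices of heads whose ratio reaches the (hypothetical) threshold rho."""
    return [h for h, r in enumerate(ratios) if r >= rho]

def redistribute(attn_row, visual_idx, sink_idx, portion=0.5):
    """Move a fraction of sink attention to the non-sink visual tokens."""
    sinks = set(sink_idx)
    targets = [j for j in visual_idx if j not in sinks]
    target_mass = sum(attn_row[j] for j in targets)
    new_row = list(attn_row)
    if target_mass == 0:
        return new_row  # nothing to redistribute to
    budget = 0.0
    for j in sinks:
        budget += portion * attn_row[j]
        new_row[j] = (1.0 - portion) * attn_row[j]
    for j in targets:
        new_row[j] += budget * attn_row[j] / target_mass
    return new_row

# Positions 1-3 are visual tokens, position 1 is a sink; weights are made up.
row = [0.1, 0.5, 0.2, 0.1, 0.1]
print(non_sink_ratio(row, [1, 2, 3], [1]))
new_row = redistribute(row, [1, 2, 3], [1])
print(non_sink_ratio(new_row, [1, 2, 3], [1]))
```

Because the freed attention is handed over in full, each row still sums to one after redistribution, so no renormalization is needed and text-token attention is left untouched.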
VAR consistently improved performance across various vision-language tasks. On the GQA benchmark, a smaller LLaVA model with VAR outperformed a larger model without VAR, suggesting that improving attention allocation significantly boosts capabilities. This opens up exciting avenues for future research into attention mechanisms within LMMs.
Variance-Aware Loss Scheduling for Multimodal Alignment in Low-Data Settings by Sneh Pillai https://arxiv.org/abs/2503.03202
Caption: This graph compares the retrieval performance (Recall@1, Recall@5, Recall@10) of different loss strategies for multimodal alignment in a low-data setting. The variance-based strategy significantly outperforms fixed, entropy-based, and cosine-spread strategies, demonstrating its effectiveness in improving image-text retrieval accuracy. The consistently high R@10 across all strategies suggests a potential upper limit imposed by the dataset size or model capacity.
Training vision-language models in data-scarce scenarios often leads to overfitting and unstable training. This paper introduces a variance-aware loss scheduling approach, dynamically adjusting the contrastive loss weight based on the variability of alignment predictions. This allows the model to focus on uncertain areas.
The method weights the two directions of a symmetric contrastive loss:
L<sub>total</sub>(t) = w<sub>I</sub>(t) L<sub>I2T</sub> + w<sub>T</sub>(t) L<sub>T2I</sub>
where L<sub>I2T</sub> and L<sub>T2I</sub> are the image-to-text and text-to-image losses, and w<sub>I</sub>(t) and w<sub>T</sub>(t) are dynamically adjusted weights based on the variance of cosine similarity scores between matched pairs.
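The weighted loss can be sketched in plain Python as below. The exact rule that maps variance to a weight is not spelled out in this summary, so `variance_weight` (a base value plus a scaled variance) is purely a hypothetical placeholder, as are the temperature and the similarity values; only the overall structure, two directional contrastive terms combined with variance-dependent weights, follows the description above.

```python
import math

def info_nce(sim, temperature=0.07):
    """Mean cross-entropy over rows; the positive for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(sim)

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_weight(matched_sims, base=1.0, gamma=1.0):
    """Hypothetical rule: the weight grows with the spread of matched-pair similarity."""
    return base + gamma * variance(matched_sims)

def total_loss(sim, w_i, w_t):
    """L_total = w_I * L_I2T + w_T * L_T2I for an image-by-text similarity matrix."""
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image view
    return w_i * info_nce(sim) + w_t * info_nce(sim_t)

# Rows are images, columns are captions; the diagonal holds the matched pairs.
sim = [[0.9, 0.1, 0.0],
       [0.2, 0.4, 0.1],
       [0.0, 0.3, 0.8]]
matched = [sim[i][i] for i in range(len(sim))]
w = variance_weight(matched)
print(w, total_loss(sim, w_i=w, w_t=w))
```

In this toy rule both directions receive the same weight because they share the same matched pairs; a schedule that tracks each direction separately over training steps t would pass different values for `w_i` and `w_t`.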
Evaluated on Flickr8k in a low-data setting, variance-aware scheduling significantly improved image-text retrieval accuracy compared to fixed-weight and other adaptive weighting strategies. It achieved 12% and 10% relative improvements in Recall@1 for image-to-text and text-to-image retrieval, respectively. t-SNE visualizations showed more distinct multimodal embeddings. In robustness tests with noisy data, the variance-guided loss maintained higher recall, demonstrating its effectiveness in low-data regimes.
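For reference, Recall@K and a relative improvement can be computed as in the sketch below. It assumes exactly one correct caption per image, stored on the diagonal of the similarity matrix, which is a simplification (Flickr8k provides several captions per image); the function names and all numbers are made up for illustration.

```python
def recall_at_k(sim, k):
    """Share of queries whose matching item (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

def relative_improvement(baseline, improved):
    """Relative gain, e.g. 0.12 means a 12% relative improvement."""
    return (improved - baseline) / baseline

# Rows are image queries, columns are captions.
sim = [[0.9, 0.1, 0.2],
       [0.8, 0.3, 0.1],
       [0.1, 0.2, 0.7]]
print(recall_at_k(sim, 1), recall_at_k(sim, 2))
print(relative_improvement(0.25, 0.28))
```

Transposing the matrix before calling `recall_at_k` gives the text-to-image direction. Note that a relative improvement is measured against the baseline score, so it is not the same as a gain in percentage points.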
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model by Wenke Huang, Jian Liang, Xianda Guo, Yiyang Fang, Guancheng Wan, Xuankun Rong, Chi Wen, Zekun Shi, Qingyun Li, Didi Zhu, Yanbiao Ma, Ke Liang, Bin Yang, He Li, Jiawei Shao, Mang Ye, Bo Du https://arxiv.org/abs/2503.04543
Caption: This image presents a taxonomy of MLLM fine-tuning strategies, categorized into Selective, Additive, and Reparameterization methods. It visually illustrates how each approach modifies the MLLM architecture and parameters, and provides examples of downstream tasks like VQA and diagram understanding used for evaluating specialization and generalization performance. The image also highlights the trade-offs between specialization and generalization stability associated with different tuning methods.
Fine-tuning MLLMs for specific tasks presents two challenges: Task-Expert Specialization (adapting to new distributions) and Open-World Stabilization (preventing catastrophic forgetting). This paper reviews MLLM tuning methodologies, classifying them into Selective Tuning, Additive Tuning, and Reparameterization Tuning.
Benchmarking these strategies across tasks and architectures like LLaVA-OV and VILA using metrics like Specialization Improvement (E) and Stabilization Forgetting (F) revealed key findings. Full Layer Selective Tuning achieved high specialization but suffered from overfitting and forgetting. LoRA mitigated forgetting but showed limited adaptation. Selective Tuning of Top and Last layers offered a balance, with top-layer tuning excelling in preserving generalization.
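To see why LoRA touches so few parameters, it helps to write out the standard low-rank update, W + (alpha / r) · B · A, with the pretrained weight W frozen. The sketch below uses plain Python lists; it illustrates the general LoRA formulation and not the configuration used in this benchmark, and the matrix sizes are example values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha):
    """W_eff = W + (alpha / r) * B @ A, with A of shape r x d_in and B of shape d_out x r."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, rank):
    """Trainable parameter counts for full tuning vs. LoRA on one weight matrix."""
    return {"full": d_out * d_in, "lora": rank * (d_in + d_out)}

w = [[1.0, 2.0], [3.0, 4.0]]  # frozen pretrained weight
a = [[1.0, 1.0]]              # rank-1 factor A (r x d_in)
b_init = [[0.0], [0.0]]       # B starts at zero, so training begins exactly at W
print(lora_effective_weight(w, a, b_init, alpha=2.0))
print(trainable_params(4096, 4096, rank=8))
```

The update is confined to a rank-r subspace and starts from zero, which fits the finding above that LoRA forgets less but also adapts less than tuning the full weight matrices.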
These findings led to several tuning principles. Full tuning presents a trade-off between specialization and generalization. LoRA offers better stabilization but limited adaptation. Top LLM layers encode vision-text interaction, while final layers handle output style. Vision projector adaptation is crucial for visual domain shifts. The paper further suggests future research directions like Federated MLLM Tuning and Large and Small MLLM Collaboration.
This newsletter has showcased a diverse range of advancements in multimodal image and text foundation models. From novel architectures like MASTER leveraging LLMs for enhanced segmentation to the nuanced understanding of visual attention sinks and the development of efficient fine-tuning strategies like variance-aware loss scheduling and selective tuning, the field is rapidly progressing. The challenges of balancing specialization and generalization in downstream tasks remain a key focus, driving innovation in training methodologies and architectural designs. The development of specialized datasets like HC-M3D further highlights the need for more robust evaluation and a deeper understanding of the interplay between visual and textual information in these powerful models. The future of multimodal AI promises exciting developments, with ongoing research paving the way for more sophisticated and adaptable models capable of tackling complex real-world problems.