This newsletter dives into the latest advancements in multimodal image and text foundation models. We'll explore how these powerful models are being applied across diverse domains, from remote sensing and scientific poster summarization to cultural understanding and industrial defect detection. We'll also examine the challenges they face, including generalization, data bias, and the need for robust evaluation metrics. Get ready to explore the cutting edge of multimodal AI.
PromptMID: Modal Invariant Descriptors Based on Diffusion and Vision Foundation Models for Optical-SAR Image Matching by Han Nie, Bin Luo, Jun Liu, Zhitao Fu, Huan Zhou, Shuo Zhang, Weixing Liu https://arxiv.org/abs/2502.18104
Caption: The PromptMID architecture leverages pre-trained diffusion and visual foundation models to generate modality-invariant descriptors for optical and SAR images. It incorporates text prompts based on land use classification to guide the diffusion model and employs a multi-scale aware aggregation module (MSAA) to fuse features, ultimately improving cross-modal matching performance. The architecture processes SAR and optical images separately, extracting features using diffusion models guided by text prompts and visual foundation models, respectively, before combining these features for matching.
Matching optical and synthetic aperture radar (SAR) images is a critical task in remote sensing, but existing methods often struggle with generalization to unseen domains due to the inherent differences in imaging modalities. These differences manifest as geometric distortions and non-linear radiation differences (NRDs), posing significant challenges for traditional handcrafted methods, which often lack robustness to complex transformations. While learning-based methods offer greater potential, they frequently overfit to training data, limiting their applicability to new domains. Furthermore, directly applying powerful foundation models trained on natural imagery to remote sensing data often yields suboptimal results due to the inherent domain shift.
PromptMID introduces a novel approach to this challenge, leveraging pre-trained diffusion and visual foundation models (VFMs) to construct modality-invariant descriptors. The key innovation is incorporating land use classification information as prior knowledge through text prompts, which guide the diffusion model toward modality-invariant representations. Specifically, the intermediate features of the diffusion decoder, captured while mapping from text prompts and SAR images to optical images, serve as multi-scale latent diffusion features. For optical images, PromptMID extracts coarse-grained features from frozen VFMs and fine-grained features from learnable VGG models, integrating both to construct multi-scale representations.
A key component of PromptMID is the multi-scale aware aggregation module (MSAA), which fuses features across different scales to strengthen the global representation. A Convolutional Block Attention Module (CBAM) further refines the features along both channel and spatial dimensions, suppressing irrelevant information. The descriptors for optical (D<sup>O</sup>) and SAR (D<sup>S</sup>) images are represented as:
D<sup>O</sup> = {d<sub>i</sub> = MSAA(VFMs(p<sub>i</sub>, OPT))}, i = 1, ..., M
D<sup>S</sup> = {d<sub>j</sub> = MSAA(D(p<sub>j</sub>, SAR, Prompts))}, j = 1, ..., N
where D and VFMs represent the pre-trained diffusion model and visual foundation models, respectively; MSAA denotes the multi-scale aware aggregation module; and Prompts denotes the text prompts based on land use classification.
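To make the fusion step more concrete, below is a minimal PyTorch sketch of multi-scale aggregation with CBAM-style channel and spatial attention. The module names, channel widths, and the toy inputs standing in for diffusion-decoder and VFM features are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSpatialAttention(nn.Module):
    """CBAM-style refinement: channel gating followed by spatial gating."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention from globally pooled statistics.
        w = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * w
        # Spatial attention from per-pixel mean/max across channels.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))

class MultiScaleAggregation(nn.Module):
    """Project feature maps from several scales to a common width, resample to the
    finest grid, sum, refine with attention, and L2-normalize the descriptors."""
    def __init__(self, in_channels, dim=128):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, kernel_size=1) for c in in_channels])
        self.attn = ChannelSpatialAttention(dim)

    def forward(self, feats):
        target = feats[0].shape[-2:]
        fused = sum(F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
                    for p, f in zip(self.proj, feats))
        return F.normalize(self.attn(fused), dim=1)

# Toy multi-scale features standing in for diffusion-decoder / VFM outputs.
feats = [torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32), torch.randn(1, 1024, 16, 16)]
descriptors = MultiScaleAggregation([256, 512, 1024])(feats)
print(descriptors.shape)  # torch.Size([1, 128, 64, 64])
```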
Extensive experiments on four distinct optical-SAR datasets, including seen and unseen domains, demonstrate PromptMID's superior performance. Achieving near-perfect matching success rates across all datasets, PromptMID significantly outperforms state-of-the-art methods, showcasing its strong cross-domain generalization capabilities. Ablation studies further confirm the effectiveness of each component, highlighting the synergy between diffusion models, VFMs, and the feature aggregation module. PromptMID represents a significant step towards more robust and reliable multimodal image matching in remote sensing.
PosterSum: A Multimodal Benchmark for Scientific Poster Summarization by Rohit Saxena, Pasquale Minervini, Frank Keller https://arxiv.org/abs/2502.17540
Caption: This image illustrates the SEGMENT & SUMMARIZE approach for scientific poster summarization. It shows a poster being segmented, with each segment then summarized by a multimodal large language model. These individual summaries are then combined to create a final abstract of the poster's content.
Scientific posters are a rich medium for conveying complex research findings, but effectively summarizing these information-dense visuals remains a challenge. POSTERSUM introduces a new benchmark dataset specifically designed to advance multimodal vision-language models for scientific poster summarization. This large-scale dataset, comprising 16,305 scientific posters paired with their corresponding research paper abstracts, offers a robust platform for evaluating and improving model performance. The posters, sourced from prominent machine learning conferences, exhibit diverse visual complexities, including tables, charts, equations, and dense text, mirroring the real-world challenges of scientific communication.
Benchmarking state-of-the-art Multimodal Large Language Models (MLLMs) on POSTERSUM reveals significant limitations in their current ability to process and summarize these intricate visuals. Even leading closed-source models struggle with this task, highlighting the need for innovative approaches. In response, the researchers propose SEGMENT & SUMMARIZE, a hierarchical method that first segments the poster into coherent regions. A multimodal large language model then generates localized summaries for each segment. Finally, a text-based large language model combines these individual summaries into a cohesive abstract, mirroring the structure of a research paper abstract. This divide-and-conquer strategy allows the model to focus on fine-grained details within each region before integrating the information into a comprehensive summary.
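The divide-and-conquer structure is simple to express as a pipeline. The sketch below uses placeholder callables for the segmenter, the multimodal summarizer, and the text-only combiner; it shows the control flow only, not the specific models or prompts used in the paper.

```python
from typing import Callable, Sequence

def segment_and_summarize(
    poster,
    segment: Callable[[object], Sequence[object]],   # poster image -> list of region crops
    summarize_region: Callable[[object], str],       # MLLM: region crop -> localized summary
    combine: Callable[[Sequence[str]], str],         # LLM: localized summaries -> abstract
) -> str:
    """Hierarchical poster summarization: segment, summarize each region, then merge."""
    regions = segment(poster)                                  # 1. split into coherent regions
    local_summaries = [summarize_region(r) for r in regions]   # 2. one summary per region
    return combine(local_summaries)                            # 3. fuse into an abstract-style summary
```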
The SEGMENT & SUMMARIZE method significantly outperforms existing MLLMs on automated metrics, demonstrating the effectiveness of this hierarchical approach in capturing and integrating information from diverse visual elements. Analysis further reveals the crucial role of processing non-textual elements like figures and tables in generating accurate and comprehensive summaries. The research also highlights the ongoing challenge of evaluating factuality in generated summaries, emphasizing the need for specialized metrics tailored to scientific text. POSTERSUM and the SEGMENT & SUMMARIZE method provide a valuable foundation for future research in multimodal scientific poster understanding, paving the way for more sophisticated models that can effectively process and summarize complex scientific content.
What are Foundation Models Cooking in the Post-Soviet World? by Anton Lavrouk, Tarek Naous, Alan Ritter, Wei Xu https://arxiv.org/abs/2502.18583
Caption: This figure displays the performance of two large language models, Llama and Qwen, on identifying the origin countries of various dishes from the Post-Soviet region using text and visual question answering (QA and VQA) in both Russian and Ukrainian. The heatmaps represent the probability of the model predicting a specific country given a dish, revealing a tendency to over-predict countries associated with the question's language, particularly Russia, and highlighting the challenges these models face in understanding nuanced cultural knowledge.
This study explores the cultural awareness of foundation models, specifically their understanding of Post-Soviet food culture. The researchers created BORSH (Benchmark Of Regional dishes), a multimodal dataset encompassing over 1,900 dishes from the region, spanning Russian and Ukrainian languages, along with images and origin countries. Food, a cornerstone of cultural identity, serves as a lens to examine how well foundation models represent the diverse culinary traditions of the Post-Soviet states.
Evaluating leading models on text-based and visual question answering (QA and VQA), the researchers found a significant bias: models often over-predict countries linked to the language of the question. For instance, when asked in Russian about a Moldovan dish, models frequently and incorrectly attribute it to Russia. This bias stems from misleading dish-country co-occurrences in the training data, compounded by linguistic complexities like Russian-Ukrainian code-mixing (surzhyk). This highlights the impact of data biases and linguistic nuances on model performance.
Expanding beyond QA, the study also evaluated models on dish description generation. Models generated textual descriptions, which were then used to generate images using a text-to-image model. Comparing these generated images with real dish images offered a unique perspective on cultural understanding. Interestingly, performance on this task correlated weakly with QA/VQA accuracy, suggesting that QA alone is insufficient for evaluating cultural understanding. BORSH introduces a valuable resource for future research, underscoring the need for more nuanced evaluation methods that move beyond simple QA towards a more holistic assessment of cultural knowledge. This work highlights the ongoing challenge of ensuring foundation models accurately reflect the diversity of global cultures.
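As a rough illustration of that describe-then-render evaluation loop, here is one way the comparison could be wired up. The scoring function (cosine similarity in a shared image-embedding space) and the callable names are assumptions made for illustration, not the metric reported in the paper.

```python
import numpy as np

def describe_render_compare(dishes, describe, render, embed, real_images):
    """For each dish: a foundation model writes a description, a text-to-image model
    renders it, and the rendering is scored against a real photo of the dish."""
    scores = {}
    for dish in dishes:
        description = describe(dish)      # textual description from the model under test
        generated = render(description)   # image from a text-to-image model
        a, b = embed(generated), embed(real_images[dish])
        scores[dish] = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return scores
```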
ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models by Danae Sánchez Villegas, Ingo Ziegler, Desmond Elliott https://arxiv.org/abs/2502.19409
Caption: This figure compares the SimRate (semantic similarity rate) of different models on the next-scene description task across varying context lengths (number of scenes). ImageChain consistently outperforms baseline models (MLLM-FT, VisualContext, FinalScene) across context lengths C2, C3, and C4-7, demonstrating the effectiveness of its sequential modeling approach.
While MLLMs excel at understanding individual images, they often falter when processing image sequences, failing to grasp the temporal relationships between frames. IMAGECHAIN addresses this limitation by modeling visual sequences as multi-turn conversations. By interleaving images with textual descriptions, IMAGECHAIN creates a structured dialogue that explicitly captures temporal dependencies and narrative flow. This framework is optimized for next-scene description, where the model predicts the subsequent scene based on preceding visual and textual context.
The IMAGECHAIN methodology transforms each image in a sequence into a fixed-size representation, pairs it with its scene description, and structures the result as a multi-turn conversation. The model is fine-tuned using standard next-token prediction over this conversational context, minimizing the cross-entropy loss L<sub>IC</sub> = −Σ<sup>N</sup><sub>i=1</sub> log p(w<sub>i</sub> | w<sub>1:i-1</sub>, {V<sub>τ</sub>}<sup>T</sup><sub>τ=1</sub>), where w<sub>i</sub> are text tokens, V<sub>τ</sub> are the visual embeddings of the turns, N is the total number of tokens, and T is the number of turns.
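A minimal sketch of how a scene sequence might be packed into this multi-turn format is shown below; the chat message schema and prompt wording are illustrative, not the exact template used by IMAGECHAIN.

```python
def build_imagechain_conversation(scenes):
    """Interleave scene images and their descriptions as a multi-turn conversation.
    `scenes` is a list of (image, description) pairs in temporal order; the final
    assistant turn is what the model learns to predict for next-scene description."""
    messages = []
    for t, (image, description) in enumerate(scenes, start=1):
        messages.append({"role": "user",
                         "content": [{"type": "image", "image": image},
                                     {"type": "text", "text": f"Describe scene {t}."}]})
        messages.append({"role": "assistant", "content": description})
    return messages
```

Fine-tuning then applies the usual causal language-modeling loss over the assistant turns, conditioned on all earlier turns and their visual embeddings.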
Introducing a new dataset, StoryFrames, derived from StoryBench, the researchers provide a valuable resource for evaluating sequential image-text reasoning. Experiments demonstrate IMAGECHAIN's substantial improvement over standard MLLMs on next-scene description, measured by SimRate, a semantic similarity metric. Furthermore, IMAGECHAIN exhibits robust zero-shot out-of-domain performance in diverse applications like comics and robotics. This work highlights the importance of explicit sequential modeling for enhancing temporal reasoning in MLLMs.
A Comprehensive Survey on Composed Image Retrieval by Xuemeng Song, Haoqiang Lin, Haokun Wen, Bohan Hou, Mingzhu Xu, Liqiang Nie https://arxiv.org/abs/2502.18495
Caption: This diagram outlines the structure of a survey on Composed Image Retrieval (CIR). It categorizes CIR methods into supervised, zero-shot, and related tasks, further detailing key components within each category, such as feature extraction and image-text fusion techniques, and covers benchmark datasets and experimental results. Finally, it addresses future research directions in the field.
Composed Image Retrieval (CIR) allows users to search using a multimodal query comprising a reference image and modification text. This survey provides a comprehensive overview of the rapidly evolving CIR field, categorizing methods into supervised and zero-shot learning paradigms. Supervised methods leverage annotated triplets (<reference image, modification text, target image>), while zero-shot methods utilize readily available data or employ training-free modular combinations.
Supervised CIR models typically involve four components: feature extraction, image-text fusion, target matching, and data augmentation. These components employ various techniques, ranging from traditional encoders to VLP models like CLIP and BLIP, and utilize diverse fusion and matching strategies. Zero-shot CIR (ZS-CIR) methods circumvent the need for annotated triplets, leveraging textual inversion, pseudo-triplet generation, or training-free methods using LLMs and VLPs. Some methods employ spherical linear interpolation (Slerp) for image-text fusion: Slerp(v, t; α) = (sin((1 − α)θ) / sin(θ)) * v + (sin(αθ) / sin(θ)) * t, where v and t are image and text embeddings, θ is the angle between them, and α is a balancing scalar.
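For reference, Slerp is straightforward to compute on L2-normalized embeddings. This small NumPy sketch follows the formula above; the linear-interpolation fallback for nearly parallel vectors is an implementation detail added here, not taken from the survey.

```python
import numpy as np

def slerp(v, t, alpha):
    """Spherical linear interpolation between an image embedding v and a text
    embedding t; alpha balances the contribution of the two modalities."""
    v, t = v / np.linalg.norm(v), t / np.linalg.norm(t)
    theta = np.arccos(np.clip(np.dot(v, t), -1.0, 1.0))  # angle between the embeddings
    if np.isclose(theta, 0.0):                           # nearly parallel: fall back to lerp
        return (1 - alpha) * v + alpha * t
    return (np.sin((1 - alpha) * theta) * v + np.sin(alpha * theta) * t) / np.sin(theta)

# e.g. fused_query = slerp(clip_image_embedding, clip_text_embedding, alpha=0.5)
```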
Experimental results demonstrate the effectiveness of VLP-based supervised methods and the potential of zero-shot methods, particularly those based on pseudo-triplets. The survey highlights future research directions including the development of larger datasets, exploring LLM-based fusion, and improving robustness and efficiency.
A Survey on Foundation-Model-Based Industrial Defect Detection by Tianle Yang, Luyao Chang, Jiadong Yan, Juntao Li, Zhi Wang, Ke Zhang https://arxiv.org/abs/2502.19106
Caption: This diagram illustrates the two main approaches to industrial defect detection: Foundation Model (FM) based methods and Non-Foundation Model (NFM) approaches. FMs leverage pre-trained models like SAM, CLIP, and GPT, fine-tuning them on industrial image data to locate and segment defects, while NFMs utilize task-specific model designs trained on preprocessed image data. Both approaches aim to identify anomaly regions and categorize defects within industrial images.
Industrial defect detection is undergoing a transformation with the advent of foundation models (FMs). This survey compares FM-based methods with traditional non-foundation model (NFM) approaches, highlighting the strengths and challenges of each. FMs, pre-trained on massive datasets, bring rich prior knowledge to the task, enabling few-shot and zero-shot learning. NFMs, while less powerful in data-scarce scenarios, offer computational efficiency, making them suitable for resource-constrained environments.
The survey categorizes FM methods based on the underlying model: SAM-based methods for segmentation, CLIP-based methods for image-text matching, and GPT-based methods for natural language understanding and adaptive learning. A comparative analysis reveals that FMs excel in few-shot and zero-shot scenarios, aligning well with practical industrial needs. NFMs, however, currently achieve higher accuracy on some benchmarks, highlighting the ongoing need for performance improvements in FM-based methods.
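To give a flavor of the CLIP-based family, here is a minimal image-level zero-shot check using the Hugging Face transformers CLIP wrapper. The prompt wording and the inspection image path are hypothetical, and the surveyed methods typically go further (patch-level anomaly maps, prompt ensembles, and fine-tuning on industrial data).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Score an inspection image against "normal" vs. "defective" text prompts.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a flawless metal surface",
           "a photo of a metal surface with a scratch or crack"]
image = Image.open("part.png")  # hypothetical inspection image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print({"normal": float(probs[0, 0]), "defective": float(probs[0, 1])})
```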
The survey also identifies key challenges and future directions, including improving accuracy on single-scene datasets, increasing inference speed, and enhancing 3D detection performance. Addressing these challenges will unlock the full potential of FMs for industrial defect detection, paving the way for more robust and efficient quality control processes.
This newsletter has showcased the diverse applications and ongoing challenges in the field of multimodal image and text foundation models. From revolutionizing image matching and scientific poster summarization to exploring cultural understanding and industrial defect detection, these models are transforming how we interact with and extract information from multimodal data. While challenges remain, including data bias, evaluation metrics, and computational efficiency, the rapid pace of innovation suggests a bright future for multimodal AI. The development of specialized datasets like POSTERSUM and BORSH, coupled with innovative frameworks like PromptMID and IMAGECHAIN, demonstrates the continuous push towards more robust, accurate, and culturally aware multimodal models. The focus on zero-shot learning and efficient model architectures further underscores the drive towards practical real-world applications.