This newsletter explores the cutting edge of multimodal image and text foundation models, covering new benchmarks, innovative training methods, and strategies for overcoming limitations like language bias and temporal reasoning. We'll delve into research addressing the challenges of applying these powerful models to specialized domains like medical imaging and ensuring their safe deployment in real-world applications. Get ready for a deep dive into the latest breakthroughs and persistent challenges in this exciting field.
KPL: Training-Free Medical Knowledge Mining of Vision-Language Models by Jiaxiang Liu, Tianxiang Hu, Jiawei Du, Ruiyuan Zhang, Joey Tianyi Zhou, Zuozhu Liu https://arxiv.org/abs/2501.11231
Caption: This diagram illustrates the Knowledge Proxy Learning (KPL) framework for zero-shot image classification. It shows how KPL leverages a knowledge base and CLIP encoders to generate text and multimodal proxies, ultimately producing similarity logits for classification. The process involves encoding input image data, retrieving relevant knowledge, and optimizing proxies to bridge the gap between visual and textual representations.
Vision-Language Models (VLMs) like CLIP show promise in image recognition, but their zero-shot classification performance often falls short in specialized domains like medical imaging. This paper introduces Knowledge Proxy Learning (KPL), a novel training-free method to mine knowledge from CLIP and enhance its zero-shot classification abilities. KPL tackles two key challenges: the inadequacy of representing image classes with single names and the modal gap between CLIP's visual and text encoders. Existing methods, while attempting to enrich class descriptions using Large Language Models (LLMs), often lack specific domain knowledge, resulting in suboptimal performance. Furthermore, current proxy learning methods for zero-shot image classification are unstable when applied to medical datasets.
KPL addresses these issues with a two-step proxy optimization process. First, Text Proxy Optimization retrieves image-relevant knowledge descriptions from a knowledge-enhanced base constructed using LLMs. This enriches the semantic text proxies, moving beyond single class names to more comprehensive descriptions. Second, Multimodal Proxy Learning leverages these enriched descriptions and the input images, encoded via CLIP, to generate multimodal proxies. Critically, instead of relying on the Sinkhorn algorithm, KPL utilizes a Stable Greenkhorn (SG) algorithm for refining pseudo labels, addressing the instability observed in previous methods. The optimization problem for Multimodal Proxy Learning is defined as:
W<sub>KPL</sub>(D, C) = arg min<sub>W</sub> E<sub>x∼D</sub>[d(Q<sub>F,G</sub>(x; D, C), P<sub>F</sub>(x, W))],
where W<sub>KPL</sub> represents the learned multimodal proxies, D is the image dataset, C is the set of class names, Q<sub>F,G</sub> generates pseudo labels using CLIP and the data, P<sub>F</sub> calculates class distributions based on the learned proxies, and d is the KL divergence.
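To make the objective concrete, here is a minimal PyTorch sketch of the multimodal proxy learning step. It assumes L2-normalized CLIP embeddings, forms the pseudo labels Q with a plain softmax (the paper's Stable Greenkhorn refinement is omitted for brevity), and the function name, temperature, and optimizer settings are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def learn_multimodal_proxies(image_feats, text_proxies, temperature=0.01, lr=1e-3, steps=100):
    """Sketch of KPL's Multimodal Proxy Learning objective.

    image_feats:  (N, D) L2-normalized CLIP image embeddings for dataset D
    text_proxies: (K, D) L2-normalized enriched text proxies, one per class in C
    Returns W_KPL, the learned multimodal proxies of shape (K, D).
    """
    with torch.no_grad():
        # Pseudo labels Q_{F,G}(x; D, C): here a simple softmax over CLIP similarities;
        # KPL itself refines these with the Stable Greenkhorn (SG) algorithm.
        Q = (image_feats @ text_proxies.T / temperature).softmax(dim=-1)   # (N, K)

    W = text_proxies.clone().requires_grad_(True)   # initialize proxies from the text proxies
    opt = torch.optim.Adam([W], lr=lr)
    for _ in range(steps):
        # Class distribution P_F(x, W) induced by the current proxies
        P_log = (image_feats @ F.normalize(W, dim=-1).T / temperature).log_softmax(dim=-1)
        loss = F.kl_div(P_log, Q, reduction="batchmean")   # d(Q, P) as KL divergence
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(W.detach(), dim=-1)
```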
Evaluated on five medical image datasets (Shenzhen, IDRID, MalariaCell, Cataract, and Montgomery) and four natural image datasets (CUB, Places365, Oxford Pets, and ImageNet), KPL consistently outperformed all baselines across all datasets and backbones. On the MalariaCell dataset, KPL achieved remarkable results with a ViT-L@336px backbone, reaching 80.86% accuracy—a significant 50.83% improvement over baseline CLIP accuracy. These findings highlight the potential of mining knowledge from CLIP for medical image classification and other specialized domains. The effectiveness of KPL with domain-specific CLIP models like BioMedCLIP further underscores its adaptability and broader applicability.
MSTS: A Multimodal Safety Test Suite for Vision-Language Models by Paul Röttger, et al. https://arxiv.org/abs/2501.10057
Caption: This image displays the 40 fine-grained hazard categories used in the Multimodal Safety Test Suite (MSTS) to evaluate the safety of vision-language models (VLMs). These categories, spanning violent crimes, non-violent crimes, sex-related crimes, suicide & self-harm, and other harmful content, are used to create multimodal prompts combining images and text to assess VLM responses to potentially unsafe scenarios. This structured categorization allows for a nuanced evaluation of VLM safety across diverse hazard types.
As VLMs become increasingly integrated into consumer applications, ensuring their safety is paramount. This paper introduces the Multimodal Safety Test Suite (MSTS), a crucial benchmark designed to evaluate the safety of VLMs, specifically addressing the unique risks posed by multimodal inputs. MSTS comprises 400 prompts across 40 fine-grained hazard categories, meticulously designed to assess VLMs' responses to potentially harmful scenarios. These prompts are crafted with simple and explicit language, reflecting real-world interactions with chat assistants.
Evaluating ten state-of-the-art VLMs, including both open-source and commercial models, the study revealed significant safety issues in several open-source VLMs, with some exhibiting unsafe responses to up to 14% of prompts. Interestingly, some open-source models appeared "safe by accident," providing safe responses only due to their inability to comprehend the prompts. Commercial VLMs, in contrast, demonstrated significantly higher safety levels. The study also highlighted the impact of multimodality, showing that models were generally safer when presented with text-only prompts, underscoring the challenges posed by multimodal inputs.
Further emphasizing the importance of multilingual safety, MSTS was translated into ten languages. Results indicated that open-source models were generally less safe in non-English languages, with unsafe responses increasing significantly. Finally, the study explored the automation of VLM safety assessments. However, even the best safety classifiers exhibited limitations, emphasizing the continued need for human expertise in this critical area. MSTS provides a standardized and comprehensive benchmark for evaluating VLM safety across languages and modalities, highlighting the need for careful design and rigorous testing to mitigate potential risks.
LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection by Pengcheng Zhao, Zhixian He, Fuwei Zhang, Shujin Lin, Fan Zhou https://arxiv.org/abs/2501.10787
Caption: The LD-DETR model architecture processes video and text input through separate encoders and aligns them using Distill Align. The aligned features are then fused by the Convolutional Fuser and decoded by the Loop Decoder, ultimately producing predictions for moment retrieval (video segments and confidence scores) and highlight detection (relevance scores).
Video moment retrieval and highlight detection aim to locate specific video content based on text queries. Existing models often struggle with overlapping semantic information, inefficient local feature extraction, and inadequate decoding of multimodal features. LD-DETR, a novel Transformer model, addresses these challenges through three key innovations: Distill Align, Convolutional Fuser, and Loop Decoder.
Distill Align mitigates the impact of overlapping semantic information by distilling the similarity matrix into an identity matrix during contrastive learning. The formula is: S<sub>v2t</sub> = αS<sub>v2tm</sub> + (1 - α)I, where S<sub>v2t</sub> is the video-to-text similarity matrix, S<sub>v2tm</sub> is the momentum video-to-text similarity matrix, α is the distillation coefficient, and I is the identity matrix. Convolutional Fuser improves local feature extraction by employing stacked convolutional layers to process multimodal information. Loop Decoder enhances the decoding process by feeding the Transformer Decoder's output back into itself as the query, allowing for more thorough information decoding without overfitting.
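As a concrete illustration, the Distill Align target can be formed in a few lines of PyTorch; the value of α below is a placeholder, and the resulting soft matrix would serve as the target of the video-to-text contrastive loss (the symmetric text-to-video case is analogous).

```python
import torch

def distill_align_target(sim_v2t_momentum: torch.Tensor, alpha: float = 0.4) -> torch.Tensor:
    """Blend the momentum video-to-text similarity matrix with the identity matrix,
    S_v2t = alpha * S_v2tm + (1 - alpha) * I, so that overlapping semantics between
    different video-text pairs are down-weighted in the soft contrastive target."""
    identity = torch.eye(sim_v2t_momentum.size(0), device=sim_v2t_momentum.device)
    return alpha * sim_v2t_momentum + (1.0 - alpha) * identity
```

Pulling the target partway toward the identity matrix keeps each positive pair dominant while still letting genuinely similar video-text pairs share some probability mass, which is how the method reduces the impact of overlapping semantic information.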
Evaluated on four benchmark datasets (QVHighlight, Charades-STA, TACOS, and TVSum), LD-DETR outperformed state-of-the-art methods. Ablation studies further validated the effectiveness of each component, demonstrating significant performance gains. Distill Align improved multimodal alignment, Convolutional Fuser effectively captured local information, and Loop Decoder enhanced the decoding of multimodal information without overfitting. LD-DETR represents a notable advancement in video moment retrieval and highlight detection.
ComposeAnyone: Controllable Layout-to-Human Generation with Decoupled Multimodal Conditions by Shiyue Zhang, et al. https://arxiv.org/abs/2501.12173
Caption: The image illustrates the architecture of ComposeAnyone, a novel multimodal human image generation method. It shows the data flow from input conditions (reference image, hand-drawn layout, and fine-grained text prompt) through encoders (VAE and CLIP), a denoising U-Net, and a decoder to the final generated image. The lower portion details the cross-attention mechanism that aligns features from the layout (L) and the different human components (K) to generate attention maps (A) for each component.
Generating realistic human images is crucial for various applications. ComposeAnyone introduces a novel, controllable layout-to-human generation method that integrates decoupled multimodal conditions, offering enhanced control over the generation process. The key innovation lies in using "hand-drawn layouts," allowing users to define spatial arrangements of human components with simple geometric shapes, providing an intuitive and flexible control mechanism. ComposeAnyone supports flexible multimodal input, allowing users to describe each component with either text or reference images, facilitating non-paired inputs and enabling pixel-level fusion of multiple reference images.
The method employs a data-decoupled pipeline integrating text, reference images, and hand-drawn layouts. By spatially aligning latent features extracted using VAE and CLIP encoders, the model generates human images that adhere closely to the provided multimodal conditions. Attention modulation during inference further enhances spatial coherence and textual consistency. The newly introduced ComposeHuman dataset supports this framework, providing a multimodal dataset containing human images, hand-drawn layouts, textual descriptions, and component assemblies.
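The summary above does not specify the exact modulation rule, but the general idea of layout-guided attention modulation can be sketched as follows; the re-weighting scheme, the `boost` factor, and the tensor shapes are illustrative assumptions rather than ComposeAnyone's actual procedure.

```python
import torch

def modulate_component_attention(attn_maps: torch.Tensor, layout_masks: torch.Tensor,
                                 boost: float = 2.0) -> torch.Tensor:
    """Encourage each human component's cross-attention map A to concentrate inside
    its hand-drawn layout region L by re-weighting attention inside the region.

    attn_maps:    (K, H, W) cross-attention maps, one per human component
    layout_masks: (K, H, W) binary masks rasterized from the hand-drawn layout
    """
    weighted = attn_maps * (1.0 + (boost - 1.0) * layout_masks)
    # Renormalize each map so it still sums to 1 over spatial locations.
    return weighted / weighted.sum(dim=(-2, -1), keepdim=True).clamp_min(1e-8)
```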
Extensive experiments demonstrate ComposeAnyone's superiority over existing methods in both layout-guided text-to-human generation and subject-driven human generation tasks, achieving better results on VLM rate, SSIM, FID, and CLIP score metrics. Ablation studies confirmed the importance of cross-attention modulation and classifier-free guidance for enhancing performance. ComposeAnyone represents a significant advance in human image generation, offering a powerful tool for creating realistic and customizable images, though limitations related to training data accuracy and potential biases from pre-trained models remain.
FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization by Zhaopeng Gu, et al. https://arxiv.org/abs/2501.10067
Caption: FiLo++ enhances zero-/few-shot anomaly detection by combining Fused Fine-Grained Descriptions (FusDes) and Deformable Localization (DefLoc). FusDes uses LLMs and runtime prompt filtering to generate detailed anomaly descriptions, while DefLoc utilizes Grounding DINO and a Multi-scale Deformable Cross-modal Interaction (MDCI) module for precise anomaly localization, integrating positional information and leveraging a memory bank for few-shot learning. This architecture allows FiLo++ to achieve state-of-the-art performance on MVTec-AD and VisA datasets for both image-level and pixel-level anomaly detection.
Zero-/few-shot anomaly detection is crucial for scenarios requiring rapid adaptation. FiLo++ addresses limitations of existing multimodal methods by introducing two key components: Fused Fine-Grained Descriptions (FusDes) and Deformable Localization (DefLoc). FusDes leverages LLMs to generate detailed anomaly descriptions for specific object categories, going beyond generic "normal" vs. "abnormal" labels, and employs runtime prompt filtering for better image-text alignment. DefLoc tackles precise localization using Grounding DINO, positional information integration, and a Multi-scale Deformable Cross-modal Interaction (MDCI) module for enhanced accuracy with various anomaly shapes and sizes. A position-enhanced patch matching approach further refines few-shot performance.
FiLo++ calculates a global anomaly score using the formula:
S<sub>global</sub> = softmax(G· [T<sub>n</sub>, T<sub>a</sub>]<sup>T</sup>) + max(M),
where G is the global image feature, T<sub>n</sub> and T<sub>a</sub> are the filtered normal and abnormal text features, and M is the anomaly map from DefLoc.
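A minimal PyTorch sketch of this score is given below; it assumes the two-way softmax over the (normal, abnormal) text matches is reduced to the abnormal-class probability, and the temperature is an illustrative placeholder.

```python
import torch

def filo_global_score(global_feat, text_normal, text_abnormal, anomaly_map, temperature=0.07):
    """S_global = softmax(G · [T_n, T_a]^T) + max(M), taking the abnormal entry of the softmax.

    global_feat:   (D,) global image feature G
    text_normal:   (D,) filtered normal text feature T_n
    text_abnormal: (D,) filtered abnormal text feature T_a
    anomaly_map:   (H, W) anomaly map M produced by DefLoc
    """
    logits = torch.stack([global_feat @ text_normal, global_feat @ text_abnormal]) / temperature
    p_abnormal = logits.softmax(dim=0)[1]
    return p_abnormal + anomaly_map.max()
```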
Experimental results on MVTec-AD and VisA datasets demonstrate FiLo++'s superior performance in both zero-shot and few-shot settings, achieving state-of-the-art AUC scores for image-level and pixel-level detection. Ablation studies confirm the individual contributions of FusDes, position enhancement, and MDCI, highlighting their effectiveness in improving detection and localization.
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding by Yilun Zhao, et al. https://arxiv.org/abs/2501.12380
Caption: This bar chart compares the performance of 32 multimodal foundation models on the MMVU benchmark, measuring their ability to answer expert-level questions about specialized-domain videos. It shows both Chain-of-Thought (dark blue) and Direct Answer (light blue) accuracy, highlighting the general effectiveness of CoT while also revealing significant performance gaps compared to human experts. Models like GPT-4o and Gemini 2.0 Flash demonstrate high performance, but still fall short of human capabilities in this challenging video understanding task.
MMVU (Measuring Expert-Level Multi-Discipline Video Understanding) introduces a challenging benchmark to evaluate foundation models' ability to comprehend and reason about specialized-domain videos. Unlike benchmarks focused on general video comprehension, MMVU requires models to apply domain-specific knowledge and perform expert-level reasoning across disciplines like healthcare, engineering, and scientific research.
Comprising 3,000 expert-annotated question-answer examples based on 1,529 specialized videos, MMVU spans 27 subjects across four core disciplines. The rigorous annotation process, guided by authoritative textbooks, ensures the benchmark's quality and complexity. Evaluation of 32 frontier multimodal foundation models revealed a significant gap between model performance and human expertise. While models like o1 showed strong performance, they still fell short of human capabilities. Chain-of-Thought (CoT) reasoning generally improved performance, but its impact varied across models. System-2 capable models, such as o1 and Gemini 2.0 Flash Thinking, showcased significant advantages, demonstrating the potential of increased test-time compute and long CoT.
Qualitative analysis of model errors revealed limitations in visual perception, domain knowledge application, over-reliance on textual information, and logical reasoning. These findings underscore the need for future research to focus on improving multimodal reasoning by effectively integrating visual information and domain-specific knowledge.
Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! by Mohamed Fazli Imam, Chenyang Lyu, Alham Fikri Aji https://arxiv.org/abs/2501.10674
This paper introduces the TemporalVQA benchmark, designed to evaluate multimodal large language models (MLLMs) on temporal understanding, a crucial aspect of real-world comprehension. The benchmark focuses on Temporal Order Understanding (determining the sequence of events in video frames) and Time-lapse Estimation (estimating the time difference between images).
Evaluations of advanced MLLMs, including GPT-4o and Gemini-1.5-Pro, revealed significant challenges. GPT-4o achieved only 43.8% average consistent accuracy in temporal order tasks and 70% in time-lapse estimation. Open-source models performed even less effectively. These results highlight limitations in visual temporal understanding and reasoning, emphasizing the need for improvement. The study's limitations include the dataset size and the focus on only two types of temporal reasoning tasks. Future work could expand the dataset and incorporate more diverse temporal reasoning challenges.
MASS: Overcoming Language Bias in Image-Text Matching by Jiwan Chung, Seungwon Lim, Sangkyu Lee, Youngjae Yu https://arxiv.org/abs/2501.11469
Caption: This image illustrates the core concept of the Multimodal Association Score (MASS) framework for debiasing image-text matching. It shows two images, each paired with text descriptions, and the process of calculating PMI by comparing the image-conditioned text likelihood with the text-only likelihood. This comparison helps to highlight the true association between image and text, reducing the influence of language priors.
While pretrained visual-language models excel in multimodal tasks like image-text retrieval, language bias remains a significant challenge. Multimodal ASsociation Score (MASS) offers a solution by reducing reliance on language priors without requiring additional training. MASS leverages image-conditional and text-only likelihoods from pretrained models, calculating the pointwise mutual information (PMI) between image and text tokens to create a debiased similarity score. The formula for MASS is:
S<sub>MASS</sub>(c, x) ≈ (1/l) Σ<sub>t≤l</sub> log [ p<sub>θ</sub>(x<sub>t</sub> | x<sub>&lt;t</sub>, c) / p<sub>θ</sub>(x<sub>t</sub> | x<sub>&lt;t</sub>, c<sub>∅</sub>) ],
where c is the image, x is the text, l is the text length, p<sub>θ</sub> represents the model's probability distribution, and c<sub>∅</sub> is a null image.
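In code, the score is an average log-likelihood ratio over text tokens. The sketch below assumes a Hugging Face-style tokenizer and a captioning VLM that returns per-token logits for an (image, text) pair; the `model(image, tokens)` interface is hypothetical, and the one-position shift needed for a strictly autoregressive decoder is omitted for brevity.

```python
import torch

@torch.no_grad()
def mass_score(model, tokenizer, image, text, null_image):
    """Pointwise mutual information between image and text tokens:
    (1/l) * sum_t [ log p(x_t | x_<t, c) - log p(x_t | x_<t, c_null) ]."""
    tokens = tokenizer(text, return_tensors="pt").input_ids            # (1, l)
    logp_cond = model(image, tokens).log_softmax(dim=-1)               # (1, l, V), conditioned on image c
    logp_null = model(null_image, tokens).log_softmax(dim=-1)          # (1, l, V), conditioned on null image
    tok_cond = logp_cond.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # log-prob of each observed token
    tok_null = logp_null.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    return (tok_cond - tok_null).mean().item()
```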
Evaluated on benchmarks designed to expose language bias, MASS consistently outperformed CLIP and raw token likelihood, significantly reducing bias while maintaining strong performance on linguistic compositionality benchmarks. MASS offers a promising solution for enhancing image-text matching accuracy and fairness, though other forms of bias, like visual bias, warrant further investigation.
This newsletter has highlighted the rapid advancements and persistent challenges in the field of multimodal image and text foundation models. From novel training-free approaches like KPL for medical imaging to the development of crucial safety benchmarks like MSTS, the field is actively addressing the complexities of deploying these powerful models responsibly. The introduction of LD-DETR for enhanced video understanding, ComposeAnyone for controllable human image generation, and FiLo++ for anomaly detection showcases the innovative approaches being pursued. However, the limitations revealed by benchmarks like MMVU and TemporalVQA underscore the need for continued research into robust temporal reasoning, domain-specific knowledge integration, and debiasing strategies like MASS. The ongoing development in this field promises exciting future breakthroughs.