Hello Elman,
This newsletter dives into the latest advancements and challenges in the world of multimodal image and text foundation models. We'll explore new architectures, training paradigms, and benchmarks designed to enhance the capabilities of these powerful models, while also addressing critical issues like negation handling, reasoning quality, and the fight against misinformation. Prepare for a deep dive into the cutting edge of multimodal AI research.
From No to Know: Taxonomy, Challenges, and Opportunities for Negation Understanding in Multimodal Foundation Models by Mayank Vatsa, Aparna Bharati, Surbhi Mittal, Richa Singh https://arxiv.org/abs/2502.09645
Caption: This image presents a taxonomy of negation constructs, categorized into Syntactic, Morphological, Lexical/Semantic, and Non-verbal types, to illustrate the diverse ways negation is expressed. Each category is further exemplified with concrete examples across different modalities (image, text), highlighting the challenges these variations pose for multimodal foundation models. This framework aims to guide future research in developing more robust AI systems capable of accurately interpreting negation.
Multimodal foundation models, despite their prowess in tasks like translation and image captioning, consistently stumble over the concept of negation. These models, trained by associating concepts across modalities like text, images, and audio, often misinterpret negation, particularly in multilingual and cross-cultural settings. This weakness has significant implications for real-world applications, potentially leading to errors in medical image analysis, legal document processing, and even contributing to the spread of misinformation in chatbot responses. The diverse and nuanced ways negation is expressed across different languages further exacerbates this challenge.
This paper proposes a comprehensive taxonomy of negation constructs to better understand the challenges they pose to foundation models. This taxonomy categorizes negations into four main types: Syntactic (e.g., double negation, negative concord), Morphological (e.g., affixal negation, negative prefixes), Lexical and Semantic (e.g., negative pronouns, negative polarity items), and Prosodic, Paralinguistic, and Pragmatic (e.g., non-verbal cues, tonal shifts). This framework highlights the wide range of linguistic and cultural variations in expressing negation, emphasizing the need for more nuanced approaches in model training and evaluation. Furthermore, it identifies key research challenges, questioning how language-specific structures, cultural contexts, and different types of negation (double, nested, idiomatic) affect model performance. It also explores the potential of leveraging multimodal inputs and transfer learning to improve negation handling in under-resourced languages.
Crucially, the authors advocate for developing specialized benchmarks to rigorously evaluate a model's ability to handle negation across various modalities and languages, arguing that current approaches are insufficient to address the complexities of negation. These benchmarks should include tasks that reflect real-world applications, allowing for a practical assessment of model performance. They also suggest incorporating language-specific tokenization, fine-grained attention mechanisms, and multimodal hybrid architectures to enhance a model's ability to capture and interpret negation accurately. Improving negation understanding is not only crucial for enhanced model performance but also essential for building more reliable and trustworthy AI systems.
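To make the benchmark idea concrete, here is a minimal probe in the spirit the authors advocate: scoring an image against a caption and its negated counterpart with an off-the-shelf CLIP model. A negation-aware model should not assign near-identical scores to both. The image path and captions are placeholders for illustration.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("street_scene.jpg")  # hypothetical test image
captions = [
    "a photo of a street full of cars",
    "a photo of a street with no cars",  # negated counterpart
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
# If the image contains cars, a negation-aware model should strongly
# prefer the first caption; in practice the two scores are often close.
print(probs)
```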
Insect-Foundation: A Foundation Model and Large Multimodal Dataset for Vision-Language Insect Understanding by Thanh-Dat Truong, Hoang-Quan Nguyen, Xuan-Bac Nguyen, Ashley Dowling, Xin Li, Khoa Luu https://arxiv.org/abs/2502.09906
Caption: The image depicts the architecture of Insect-LLaVA, a multimodal conversational model for insect understanding. An input image of an insect is processed by a vision encoder, and the resulting visual tokens are projected into the same embedding space as the language tokens from the instruction. These combined tokens are then fed into a large language model to generate a response, such as identifying the insect's genus.
While multimodal conversational generative AI excels in various vision and language tasks, current models often lack specific domain knowledge, such as understanding insects. This gap is particularly significant for precision agriculture, where insect identification is crucial for sustainable development. This paper introduces Insect-LLaVA, a novel multimodal conversational model designed to address this gap and promote visual understanding in the insect domain. Central to this work is a new large-scale Multimodal Insect Dataset, which includes Visual Insect Instruction Data. This dataset significantly expands upon the authors' previous Insect-1M dataset, providing a wealth of information for training foundation models.
The Multimodal Insect Dataset contains one million densely labeled insect images, spanning the entire taxonomic hierarchy from Class and Order to Genus and Species, each paired with a detailed textual description. The dataset also includes visual instruction data tailored for training conversational models. This makes it significantly larger and more diverse than previous insect datasets, offering a richer training resource. The Insect-LLaVA model leverages the architecture of LLaVA, incorporating a pre-trained vision encoder, a multi-layer perceptron (MLP) projection connector, and a large language model (Vicuna). The vision encoder extracts meaningful features from insect images. To overcome the limitations of existing vision encoders trained on general image datasets, the authors introduce a novel Insect Foundation Model. This model employs self-supervised contrastive learning with a new Patch-wise Relevant Attention mechanism, focusing on learning the micro-features that distinguish insect species and addressing the challenge of subtle visual differences. A Description Consistency loss further enhances learning from the detailed textual descriptions.
The training objective for the conversational generative model is formulated as an auto-regressive task, maximizing the log-likelihood of generating the correct answer given the image and instruction history: $\theta^* = \arg\max_{\theta} \mathbb{E}_{X_a, I, X_{\text{instruct}}} \log p(X_a \mid I, X_{\text{instruct}}) = \arg\max_{\theta} \mathbb{E}_{X_a, I, X_{\text{instruct}}} \sum_{i=1}^{L} \log p_{\theta}(x_i \mid I, X_{\text{instruct}, <i}, X_{a, <i})$, where $X_a$ is the answer, $I$ is the image, $X_{\text{instruct}}$ is the instruction, $\theta$ are the model parameters, and $L$ is the sequence length. The training process involves two stages: pre-training to align visual and language features, and fine-tuning on the insect instruction data.
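In practice, this objective is the familiar next-token cross-entropy restricted to the answer tokens, with the image and instruction tokens serving only as context. A minimal PyTorch sketch (tensor names are ours, not the paper's):

```python
import torch
import torch.nn.functional as F

def autoregressive_answer_loss(logits, input_ids, answer_mask):
    """Negative log-likelihood over answer tokens only.
    logits:      (B, L, V) LLM outputs over the image+instruction+answer sequence
    input_ids:   (B, L)    token ids for the same sequence
    answer_mask: (B, L)    True where a token belongs to the answer X_a
    """
    # Token i is predicted from positions < i: shift logits and labels.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Ignore image/instruction context positions in the loss.
    shift_labels[~answer_mask[:, 1:]] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```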
Insect-LLaVA was evaluated on new Visual Insect Question Answering (Insect-VQA) benchmarks, achieving state-of-the-art performance and significantly outperforming existing models like LLaVA. The Insect Foundation Model also demonstrated superior performance on insect classification and detection tasks. This work represents a significant advance in applying AI to precision agriculture, enabling more nuanced and accurate visual insect understanding for more sustainable agricultural practices.
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation by Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, Ehsaneddin Asgari https://arxiv.org/abs/2502.08826
Caption: This diagram illustrates the architecture of a Multimodal Retrieval-Augmented Generation (RAG) system. It depicts the flow from an input query, through multimodal encoding and retrieval strategies (including MIPS variants), fusion mechanisms, and finally to the generation of a response by a multimodal LLM, evaluated against ground truth. The system leverages cross-attention and various training strategies to align and integrate information from different modalities, aiming to enhance LLM capabilities with external knowledge.
Large Language Models (LLMs), while impressive, are limited by their static knowledge, leading to hallucinations and outdated information. Retrieval-Augmented Generation (RAG) addresses these limitations by incorporating external knowledge. This survey explores Multimodal RAG, which expands this concept by integrating diverse modalities like text, images, audio, and video, enabling richer context and improved reasoning for multimodal LLMs (MLLMs). While traditional RAG focuses on text, Multimodal RAG grapples with the complexities of cross-modal alignment and reasoning, presenting both opportunities and unique challenges.
The survey provides a thorough analysis of Multimodal RAG, encompassing datasets, benchmarks, evaluation metrics, and key innovations. It highlights relevant datasets like MS-COCO, Flickr30K, and LAION-400M for image-text tasks, and benchmarks like M²RAG and VisDoMBench for evaluating visual reasoning and dynamic retrieval. It also discusses the various evaluation metrics used, including CIDEr, SPICE, BERTScore, and CLIP Score, emphasizing the need for combined metrics from vision-language models, generative AI, and retrieval systems to comprehensively capture Multimodal RAG performance.
Key innovations in retrieval strategies are examined, including efficient search methods like Maximum Inner Product Search (MIPS) and modality-centric retrieval approaches. The survey also delves into fusion mechanisms, such as score fusion, attention-based methods, and unified frameworks, which aim to align and integrate information from different modalities. Augmentation techniques like context enrichment and adaptive retrieval are discussed, highlighting their role in refining retrieved data. The survey also examines generation techniques, including in-context learning, reasoning strategies like Chain-of-Thought, and instruction tuning, which enhance the model's ability to generate coherent and contextually relevant outputs.
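As a point of reference, exact MIPS is just a dot-product ranking over the corpus embeddings; production systems replace this brute-force scan with approximate indexes (e.g., FAISS), but the interface is the same. A self-contained NumPy sketch:

```python
import numpy as np

def mips(query, corpus, k=5):
    """Exact Maximum Inner Product Search: rank corpus embeddings
    by inner product with the query and return the top-k indices."""
    scores = corpus @ query                    # (N,) inner products
    topk = np.argpartition(-scores, k)[:k]     # k best, unordered
    return topk[np.argsort(-scores[topk])]     # k best, ordered

# Toy usage: 10k 512-d multimodal embeddings, one query embedding.
corpus = np.random.randn(10_000, 512).astype(np.float32)
query = np.random.randn(512).astype(np.float32)
print(mips(query, corpus, k=5))
```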
Furthermore, the survey addresses training strategies, focusing on alignment through contrastive learning with losses like InfoNCE: $L_{\text{InfoNCE}} = -\log\frac{\exp(\mathrm{sim}(z_i, z_j)/T)}{\sum_{k=1}^{K} \exp(\mathrm{sim}(z_i, z_k)/T)}$, and robustness enhancements to mitigate noise and modality biases. It explores the diverse applications of Multimodal RAG across various domains, highlighting examples like MMED-RAG and RULE in healthcare. Finally, the survey identifies open challenges and future directions, including improving generalization and explainability, enhancing reasoning and retrieval performance, addressing retrieval biases, and developing more robust evaluation methods. The integration of knowledge graphs and the development of unified embedding spaces are also highlighted as promising future research avenues.
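The InfoNCE loss above is usually computed symmetrically over a batch, with each matched pair as the positive and every other in-batch pairing as a negative. A minimal PyTorch version (the CLIP-style formulation; details vary across the surveyed systems):

```python
import torch
import torch.nn.functional as F

def info_nce(z_text, z_image, temperature=0.07):
    """Symmetric batch InfoNCE: matched (text_i, image_i) pairs are
    positives; all other in-batch pairings act as negatives."""
    z_text = F.normalize(z_text, dim=-1)
    z_image = F.normalize(z_image, dim=-1)
    logits = z_text @ z_image.t() / temperature        # (B, B) similarities
    targets = torch.arange(z_text.size(0), device=z_text.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```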
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency by Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li https://arxiv.org/abs/2502.09621
Caption: This infographic details the MME-CoT benchmark for evaluating Chain-of-Thought (CoT) reasoning in Large Multimodal Models (LMMs). It showcases example questions across six domains (math, science, OCR, logic, spatial-temporal, and general scenes) and illustrates the three key evaluation aspects: CoT Quality, Robustness, and Efficiency. The infographic also visually represents the scoring of essential steps within the benchmark and highlights the importance of analyzing the CoT process itself, rather than just the final answer.
While Large Multimodal Models (LMMs) excel in various visual tasks, their reasoning capabilities, particularly with Chain-of-Thought (CoT) prompting, remain under-scrutinized. This paper introduces MME-CoT, a comprehensive benchmark designed to evaluate CoT reasoning in LMMs across six domains: math, science, OCR, logic, spatial-temporal reasoning, and general scenes. Unlike outcome-focused benchmarks, MME-CoT analyzes the fine-grained CoT process, assessing the quality, robustness, and efficiency of the reasoning steps through a novel evaluation suite incorporating three key aspects: CoT Quality (Recall and Precision), CoT Robustness (Stability and Efficacy), and CoT Efficiency (Relevance Rate and Reflection Quality).
The MME-CoT dataset consists of 1,130 questions with carefully curated key step annotations and reference image captions. CoT Quality is assessed using Recall (proportion of ground-truth solution steps present) and Precision (accuracy of generated steps). CoT Robustness is evaluated by comparing performance on perception and reasoning tasks using both direct and CoT prompts. CoT Efficiency is measured by Relevance Rate (proportion of relevant content in the reasoning) and Reflection Quality (validity and contribution of reflection steps).
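Once a judge has labeled each generated step and each key step, the Quality metrics reduce to simple ratios. A toy illustration (in the benchmark itself, the per-step judgments come from an LLM evaluator, not hand-coded booleans):

```python
def cot_quality(step_is_correct, key_step_is_covered):
    """Toy MME-CoT-style CoT Quality metrics.
    step_is_correct:     per generated step, True if judged valid
    key_step_is_covered: per ground-truth key step, True if it appears
                         in the model's reasoning
    """
    precision = sum(step_is_correct) / len(step_is_correct)
    recall = sum(key_step_is_covered) / len(key_step_is_covered)
    return {"recall": recall, "precision": precision}

# Example: 3 of 4 generated steps valid, 2 of 3 key steps covered.
print(cot_quality([True, True, True, False], [True, True, False]))
```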
Evaluation results reveal that models with reflection mechanisms, such as Kimi k1.5, exhibit superior CoT quality. Surprisingly, CoT prompting often degrades performance on perception-heavy tasks, suggesting harmful "overthinking." While reflection enhances quality, a significant portion of reflection steps are ineffective, hindering efficiency. A deeper analysis reveals four dominant error types: Ineffective Reflection, Incompleteness, Repetition, and Interference, with Ineffective Reflection being the most prevalent. These findings highlight the need for more focused and efficient reflection mechanisms in LMMs and provide valuable insights for future research.
From Visuals to Vocabulary: Establishing Equivalence Between Image and Text Token Through Autoregressive Pre-training in MLLMs by Mingxiao Li, Fang Qu, Zhanpeng Chen, Na Su, Zhizhou Zhong, Ziyang Chen, Nan Du, Xiaolong Li https://arxiv.org/abs/2502.09093
Caption: The diagram illustrates the Vision Dynamic Embedding-Guided Pre-training (VDEP) architecture, showcasing the flow of image and text data through the model. It highlights the key components, including the vision encoder, MLP, language model, and the dual loss functions (Image L2 Loss and Text Cross-Entropy Loss) used for multimodal alignment. The dynamic embeddings, generated by the MLP after the vision encoder, supervise the image hidden states within the language model, facilitating a more balanced and effective learning process.
MLLMs, while proficient in perceptual tasks, often lack precise multimodal alignment. This paper introduces Vision Dynamic Embedding-Guided Pre-training (VDEP), a hybrid autoregressive training paradigm designed to enhance alignment without architectural changes. VDEP reinterprets multimodal alignment as an information recovery process, focusing on reconstructing detailed visual features. It leverages dynamic embeddings from the MLP following the visual encoder to supervise image hidden states and integrates image tokens into autoregressive training, balancing the attention distribution between image and text tokens. The core idea is to maximize the mutual information $I(X_i; \hat{X})$ between image representations ($X_i$) and predicted text representations ($\hat{X}$), formulated as $I(X_i; \hat{X}) = H(X_i) - H(X_i \mid \hat{X})$. By minimizing the L2 loss between image embeddings and the LLM-generated hidden vector, $L_i = \|X_i - \hat{X}\|_2$, VDEP effectively reconstructs semantic information from the image.
The training process combines VDEP mode with the original LLaVA mode in a hybrid strategy, as sketched below. Experiments across 13 benchmarks demonstrate state-of-the-art performance, with significant improvements on datasets like RealWorldQA, VizWizQA, and OK-VQA. These results highlight VDEP's effectiveness in handling noisy images and knowledge-based questions. One limitation: VDEP relies on a hyperparameter α to weight the image loss, and its optimal value must currently be tuned by hand rather than determined automatically. Future work will address this limitation and further refine the training strategy.
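A minimal sketch of that hybrid objective, using our own variable names (the paper's exact implementation may differ):

```python
import torch
import torch.nn.functional as F

def vdep_hybrid_loss(text_logits, text_labels,
                     image_hidden, image_embeddings, alpha):
    """Hybrid VDEP-style objective: next-token cross-entropy on text,
    plus an L2 term pulling the LLM's hidden states at image positions
    toward the dynamic embeddings from the vision encoder + MLP."""
    ce = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_labels.reshape(-1),
        ignore_index=-100,
    )
    l2 = F.mse_loss(image_hidden, image_embeddings)  # recover X_i from X_hat
    return ce + alpha * l2                           # alpha weights the image loss
```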
Large Language Models and Provenance Metadata for Determining the Relevance of Images and Videos in News Stories by Tomas Peterka, Matyas Bohacek https://arxiv.org/abs/2502.09689
Caption: This prototype web interface allows users to evaluate the relevance and credibility of images and videos in news articles using LLMs and provenance metadata. Users can input full articles or URLs, along with image/video paths, captions, and files, to receive an LLM-driven assessment of media relevance. This tool represents a novel approach to combating misinformation by analyzing the interplay between text, visuals, and provenance data.
Misinformation campaigns often utilize out-of-context or fabricated images and videos. This paper proposes a system leveraging Large Language Models (LLMs) and provenance metadata to determine the relevance of media within news stories. This approach analyzes article text, visual captions, and provenance metadata (details about origin, creation, and modifications) to assess authenticity and context. The LLM uses this information to determine if the media's origin and edits are relevant to the story, providing an overall assessment and reasoning, and answering follow-up questions. A prototype web interface, using the C2PA standard for provenance and the Phi-3 LLM, has been open-sourced.
The interface allows users to input article data and receive a relevance assessment, categorized by location, source, and tampering. A chat interface enables follow-up questions. While promising, limitations exist, including the stochastic nature of LLMs, nascent provenance metadata adoption, and potential LLM biases. The lack of datasets with provenance metadata for news articles hinders benchmark evaluations, highlighting the need for future research. Future work should also focus on mitigating LLM biases and addressing edge cases. Despite these limitations, this research offers a promising new approach to combating misinformation by combining LLMs and provenance metadata.
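As a rough illustration of how such a system assembles its LLM input, here is a hedged sketch of a prompt builder over the signals the paper combines; the C2PA field names below are illustrative, not the authors' exact schema:

```python
import json

def build_relevance_prompt(article_text, caption, c2pa_manifest):
    """Assemble an LLM prompt from article text, media caption, and
    C2PA provenance metadata (field names are illustrative)."""
    provenance = {
        "claim_generator": c2pa_manifest.get("claim_generator"),
        "capture_device": c2pa_manifest.get("capture_device"),
        "edit_actions": c2pa_manifest.get("actions", []),
    }
    return (
        "You are assessing whether a piece of media belongs in a news story.\n\n"
        f"Article:\n{article_text}\n\n"
        f"Media caption: {caption}\n\n"
        f"Provenance metadata (C2PA):\n{json.dumps(provenance, indent=2)}\n\n"
        "Judge whether the media's origin, location, and edit history are "
        "consistent with the story. Give an overall assessment and your reasoning."
    )
```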
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model by Guoqing Ma et al. https://arxiv.org/abs/2502.10248
Step-Video-T2V is a state-of-the-art text-to-video model with 30 billion parameters, capable of generating videos up to 204 frames long. Its core innovation is the Video-VAE, a deep compression Variational Autoencoder achieving high compression ratios while maintaining reconstruction quality. The model utilizes two bilingual text encoders, Hunyuan-CLIP and Step-LLM, for English and Chinese prompts. A Diffusion Transformer (DiT) with 3D full attention, trained with Flow Matching, denoises input into latent frames. Video-DPO (Direct Preference Optimization) enhances visual quality. The training process involves cascaded stages: text-to-image pre-training, text-to-video/image pre-training at increasing resolutions, supervised fine-tuning, and Video-DPO incorporating human feedback. The DiT training objective minimizes the difference between the predicted and true velocity: $\mathcal{L} = \mathbb{E}_{t, X_0, X_1, y}\left[\| u(X_t, y, t; \theta) - V_t \|^2\right]$.
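For intuition, here is a minimal Flow Matching training step under a linear interpolation path, one common choice; this is a generic sketch, not the report's exact recipe:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, x1, y):
    """One Flow Matching step with a linear path between noise x0
    and data latents x1, conditioned on text embeddings y."""
    t = torch.rand(x0.size(0), device=x0.device)   # t ~ U[0, 1]
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))       # broadcast over latent dims
    xt = (1 - t_) * x0 + t_ * x1                   # point on the path, X_t
    vt = x1 - x0                                   # true velocity V_t
    pred = model(xt, y, t)                         # u(X_t, y, t; theta)
    return F.mse_loss(pred, vt)                    # || u - V_t ||^2
```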
Evaluations on the Step-Video-T2V-Eval benchmark demonstrate state-of-the-art performance, surpassing open-source and commercial engines. Despite advancements, challenges remain, including generating complex actions, adhering to physics, composing multiple concepts, and efficiently generating long, high-resolution videos. Future work will explore new paradigms, improve 3D attention efficiency, and enhance human feedback incorporation.
This newsletter has showcased a range of exciting developments in multimodal image and text foundation models. From specialized models for insect identification to novel training paradigms for enhanced alignment and sophisticated benchmarks for evaluating reasoning, the field is rapidly evolving. However, significant challenges remain, particularly in handling negation, improving reasoning efficiency, and combating misinformation. The ongoing development of new datasets, benchmarks, and architectures, as highlighted in this newsletter, paves the way for more robust, reliable, and impactful multimodal AI systems in the future.