This newsletter dives into the cutting edge of multimodal image and text foundation models, exploring the latest research on enhancing their capabilities for tasks like visual question answering, style transfer, sarcasm detection, and medical image analysis. We'll unpack novel architectures, training methodologies, and benchmark datasets that are pushing the boundaries of what's possible in this rapidly evolving field.
Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant by Abhirama Subramanyam Penamakuri, Anand Mishra https://arxiv.org/abs/2410.19144
Caption: The diagram illustrates the KaLMA framework for knowledge-aware visual question answering. It shows how VisTEL links visual text from an input image (e.g., "Domino's") to a knowledge base, which is then integrated with the question and image features into an instruction prompt for the Large Language Model (LLaMA). This allows KaLMA to generate more accurate and fact-grounded answers compared to standard LMMs.
Researchers are tackling the complex challenge of knowledge-aware text-based visual question answering (Text-KVQA). This task requires systems not only to interpret text within images but also to leverage external knowledge to answer questions accurately. While Large Multimodal Models (LMMs) offer powerful capabilities, they often suffer from hallucinations, especially in Text-KVQA where precise reasoning about entities and their associated knowledge is crucial. This paper introduces KaLMA (Knowledge-aware Large Multimodal Assistant), a novel framework designed to significantly improve Text-KVQA performance by addressing these limitations.
At the heart of KaLMA lies VisTEL (Visual Text Entity Linker), a specialized module that accurately links visual text entities in images to a knowledge base. Unlike traditional methods relying on simple text similarity, VisTEL leverages both visual and textual context within an LMM framework. It begins by extracting text from the image using a visual text recognition engine and then retrieves candidate entities based on edit distance. Using an instruction prompt containing the OCRed text, candidate entities, and image features, VisTEL employs an LMM to predict the most appropriate entity. This approach overcomes challenges like noisy OCR, abbreviations, and homonyms, thereby enhancing entity linking accuracy.
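To make the two-step linking flow concrete, here is a minimal sketch assuming a generic OCR output and a toy knowledge base; the helper names, the similarity scoring, and the prompt template are illustrative stand-ins rather than the authors' implementation.

```python
from difflib import SequenceMatcher

# Toy knowledge base of entity names (the real KB is far larger).
KNOWLEDGE_BASE = ["Domino's Pizza", "Dunkin' Donuts", "Dollar General"]

def retrieve_candidates(ocr_text: str, k: int = 3) -> list[str]:
    """Rank KB entities by string similarity to the OCRed text
    (a simple stand-in for the edit-distance retrieval step)."""
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda e: SequenceMatcher(None, ocr_text.lower(), e.lower()).ratio(),
        reverse=True,
    )
    return scored[:k]

def build_linking_prompt(ocr_text: str, candidates: list[str]) -> str:
    """Assemble the instruction prompt given to the LMM alongside image features."""
    options = "\n".join(f"- {c}" for c in candidates)
    return (
        f"The image contains the visual text: '{ocr_text}'.\n"
        f"Candidate entities:\n{options}\n"
        "Which entity does the visual text refer to?"
    )

candidates = retrieve_candidates("Dominos")           # noisy OCR, missing apostrophe
prompt = build_linking_prompt("Dominos", candidates)  # fed to the LMM with image features
```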
KaLMA seamlessly integrates VisTEL with an LMM (specifically LLaVA) to provide knowledge-aware question answering. After VisTEL links the visual text to an entity, KaLMA retrieves relevant knowledge from the knowledge base. An instruction prompt is then formulated, incorporating the question, retrieved knowledge, and image features. This prompt is fed to the LMM, which generates the answer and a supporting fact, promoting chain-of-thought reasoning and attribution. The training loss function is a generative language modeling loss: L<sub>ans_gen</sub>(θ) = - Σ<sub>t=1</sub><sup>|A|</sup> log p(A<sub>t</sub>|X<sub>I</sub>, X<sub>TQ:K</sub>, A<sub><t</sub>; θ), where A represents the answer and supporting fact, A<sub>t</sub> its t-th token, X<sub>I</sub> denotes the image features, X<sub>TQ:K</sub> represents the text features of the question and retrieved knowledge, and θ are the trainable parameters.
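In code, this objective is standard teacher-forced cross-entropy restricted to the answer and supporting-fact tokens. The sketch below assumes a Hugging-Face-style causal LM where prompt positions (image, question, knowledge) are masked with a label of -100; it illustrates the loss, not KaLMA's actual training code.

```python
import torch
import torch.nn.functional as F

def answer_generation_loss(logits: torch.Tensor,
                           labels: torch.Tensor) -> torch.Tensor:
    """Generative loss L_ans_gen: negative log-likelihood of each answer token
    A_t given the image features, the question/knowledge text, and A_<t.
    `labels` is -100 at prompt positions, so only answer and supporting-fact
    tokens contribute to the sum."""
    # Shift so position t predicts token t+1, as in standard causal LM training.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,   # skip masked prompt positions
    )
```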
Evaluated on the Text-KVQA dataset across scene, book, and movie splits, KaLMA achieves state-of-the-art results, outperforming the previous best approach by a substantial margin. Specifically, KaLMA improves accuracy by 18.2% on scene images, 19.6% on book covers, and 32.2% on movie posters. Ablation studies confirm the crucial role of both VisTEL and explicit knowledge integration in mitigating the hallucination issues of LMMs in Text-KVQA. While computationally more demanding than traditional methods, the significant performance gains highlight the potential of this approach.
Towards Visual Text Design Transfer Across Languages by Yejin Choi, Jiwan Chung, Sumin Shim, Giyeong Oh, Youngjae Yu https://arxiv.org/abs/2410.18823
Caption: The figure illustrates the architecture of SIGIL, a framework for multilingual visual text design transfer. (a) The Generator uses a diffusion model guided by glyph and style latents to produce stylized text images. (b) The Corrector employs an OCR model and reinforcement learning (PPO) to refine the generated images, ensuring legibility and style integrity.
Designing visually appealing text, especially across different languages and writing systems, is a challenging task for AI. While existing text-to-image models can generate visual text, they often struggle to capture the nuances of design and style. This paper introduces MuST-Bench (Multimodal Style Translation Benchmark), a novel dataset designed to evaluate the effectiveness of visual text translation across various languages. MuST-Bench focuses on description-free, few-shot style transfer, pushing models to go beyond simply replicating fonts and instead capture the essence of visual design. The benchmark includes artistic typography from film posters, translated into five languages with diverse writing systems: Chinese, Korean, Thai, Russian, and Arabic. It also includes human-annotated character-level bounding boxes for precise evaluation.
Current visual text generation models struggle with MuST-Bench due to the limitations of textual descriptions in conveying complex visual styles. To address this, the authors propose SIGIL (Style Integrity and Glyph Incentive Learning), a framework for multimodal style translation. SIGIL incorporates three key innovations: a glyph latent for multilingual settings, pretrained VAEs for stable style guidance, and an OCR model with reinforcement learning feedback for optimizing readable character generation. The glyph latent allows comparison of glyphs across languages within a shared latent space, improving style fidelity. The loss function combines a diffusion loss (Ldiff = ||ε - ε̂||²) with a glyph loss (Lglyph = ||ε̂ - zg||²) in the latent space, where ε is the Gaussian noise, ε̂ is the estimated noise, and zg is the latent vector of the glyph. A dynamic coefficient balances these losses during training. An OCR model provides rewards during training to improve the readability of the generated text.
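A compact sketch of how the two latent-space losses could be combined in a training step, assuming precomputed noise and glyph latents; the dynamic coefficient `lam` is a placeholder for whatever schedule the authors use.

```python
import torch

def sigil_training_loss(eps: torch.Tensor,       # Gaussian noise added to the latent
                        eps_hat: torch.Tensor,   # noise predicted by the diffusion model
                        z_glyph: torch.Tensor,   # latent vector of the target glyph
                        lam: float) -> torch.Tensor:
    """Combine the diffusion loss with the glyph loss in latent space.
    `lam` stands in for the paper's dynamic balancing coefficient."""
    l_diff = torch.mean((eps - eps_hat) ** 2)       # L_diff  = ||ε - ε̂||²
    l_glyph = torch.mean((eps_hat - z_glyph) ** 2)  # L_glyph = ||ε̂ - z_g||²
    return l_diff + lam * l_glyph
```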
Evaluated against models like DALL-E 3 and AnyText on MuST-Bench using OCR accuracy, CLIP-I similarity, and MLLM evaluations, SIGIL significantly outperforms these baselines. Achieving OCR accuracies of 0.7163, 0.7481, and 0.6577 for English, Chinese, and Korean respectively, SIGIL demonstrates superior legibility. While AnyText achieves slightly higher CLIP-I scores in some cases, this is attributed to its inclusion of background elements in the similarity calculation. MLLM evaluations and user studies further confirm SIGIL's superior style fidelity, achieving scores of 3.89, 3.50, and 4.00 on GPT-4V for English, Chinese, and Korean, respectively.
A Survey of Multimodal Sarcasm Detection by Shafkat Farabi, Tharindu Ranasinghe, Diptesh Kanojia, Yu Kong, Marcos Zampieri https://arxiv.org/abs/2410.18882
Caption: This bar graph illustrates the growing number of publications focused on sarcasm detection, with a notable increase in publications addressing multimodal sarcasm detection, particularly from 2020 onwards. This rise coincides with the increasing prevalence of multimodal communication and the development of more sophisticated models capable of processing combined visual and textual information, as highlighted in the accompanying overview.
Sarcasm detection is a complex NLP task where the intended meaning often contradicts the literal words used. This survey provides a comprehensive overview of multimodal sarcasm detection (MSD), encompassing datasets, models, and future directions. The survey categorizes MSD into two main types: visuo-textual and audio-visual & textual.
Visuo-textual sarcasm detection involves identifying sarcasm in text paired with images, where the incongruity between modalities is key. Datasets like MMSD and MMSD 2.0, along with silver- and gold-standard datasets, are valuable resources. Approaches range from traditional deep learning models using separate encoders for image and text to multimodal transformers like VisualBERT, LXMERT, and ViLBERT, and, more recently, LLM-based approaches using prompt engineering.
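As a concrete illustration of the separate-encoder family, here is a minimal late-fusion classifier sketch in PyTorch; the encoders are assumed to be external (e.g., frozen image and text backbones), and the dimensions and architecture are illustrative rather than drawn from any specific surveyed system.

```python
import torch
import torch.nn as nn

class LateFusionSarcasmClassifier(nn.Module):
    """Encode image and text separately, concatenate, and classify.
    The cross-modal incongruity must be learned by the fusion MLP."""
    def __init__(self, img_dim: int = 512, txt_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),   # sarcastic vs. not sarcastic
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # img_feat and txt_feat come from separate (frozen or fine-tuned) encoders.
        return self.fusion(torch.cat([img_feat, txt_feat], dim=-1))
```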
Audio-visual & textual sarcasm detection focuses on identifying sarcasm in dialogues, where facial expressions, tone, and gestures are crucial. Datasets like MUSTARD, SE-MUSTARD, and MUSTARD++ provide annotated video clips with contextual information. Methods include traditional deep learning approaches fusing multimodal features, approaches using multimodal attention mechanisms, and multi-task learning frameworks. The use of gaze features and emojis has also been explored.
The survey highlights the increasing use of multimodal transformers for MSD, due to their ability to capture intra and inter-modal dependencies. The use of LLMs with prompt engineering also shows promise. The survey also identifies key challenges, including the need for multilingual datasets, incorporating perspectivism in annotations, and exploring inter-task dependencies with related phenomena like humor and offensive language.
Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks by Lehan Wang, Haonan Wang, Honglong Yang, Jiaji Mao, Zehong Yang, Jun Shen, Xiaomeng Li https://arxiv.org/abs/2410.18387
Caption: This image illustrates the architecture of MedRegA, a region-aware medical Multimodal Large Language Model (MLLM). It highlights the three Region-Centric tasks: Region-to-Text Identification, Text-to-Region Detection, and Grounded Report Generation, each with an example medical image and instruction. The diagram also showcases the model's tokenization process, including image, text, coordinate, <ref>, and <box> tokens, which enable region-specific analysis.
Existing medical MLLMs often operate in a region-agnostic manner, treating medical images holistically. This contrasts with clinical practice where doctors focus on specific regions for detailed analysis. This limitation can hinder accuracy and interpretability. This paper introduces MedRegA, a region-aware medical MLLM designed to handle both image-level and region-level vision-language tasks across various medical modalities.
Key to MedRegA is the introduction of Region-Centric tasks and the creation of the MedRegInstruct dataset. This dataset includes Region-to-Text Identification, Text-to-Region Detection, and Grounded Report Generation. MedRegInstruct combines approximately 25,000 Chinese scan-report pairs with other multimodal medical corpora, enabling bilingual capabilities. A Region-Aligned evaluation framework was also introduced.
MedRegA's training involves alignment training and instruction tuning. A novel Regional Chain-of-Thought (CoT) strategy is used during inference, requiring the model to first detect critical regions and then generate text based on these regions. The results demonstrate MedRegA's superior performance across various tasks, including visual question answering, report generation, and medical image classification. It also shows strong performance on the Region-Centric tasks, highlighting the importance of regional information for improved performance and interpretability in medical MLLMs.
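To illustrate the idea, a Regional Chain-of-Thought inference pass could be structured as two calls: localize first, then answer conditioned on the detected regions. The `model.generate` interface and prompt wording below are assumptions, not MedRegA's actual API.

```python
def regional_cot_inference(model, image, question: str) -> str:
    """Two-stage inference: first ask the model to localize critical regions,
    then condition the final answer on those region coordinates."""
    # Stage 1: detect critical regions as <ref>/<box> tokens.
    region_prompt = (
        "Identify the critical regions in this scan relevant to the question "
        f"'{question}'. Answer with <ref>region</ref><box>coordinates</box> tokens."
    )
    regions = model.generate(image=image, prompt=region_prompt)

    # Stage 2: answer the question grounded in the detected regions.
    answer_prompt = (
        f"Focusing on the regions {regions}, answer the question: {question}"
    )
    return model.generate(image=image, prompt=answer_prompt)
```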
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data by Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, Zhenchong Hu, Bo-Wen Zhang, Jijie Li, Dong Liang, Yingli Zhao, Yulong Ao, Yaoqi Liu, Fangxiang Feng, Guang Liu https://arxiv.org/abs/2410.18558
Caption: This graph showcases the performance of Aquila-VL-2B on an unspecified benchmark as it's trained on progressively larger portions of the Infinity-MM dataset, demonstrating performance gains with increasing data size. The horizontal lines represent the performance of InternVL2-2B and Qwen2VL-2B, while the vertical dashed lines demarcate the four training stages with increasing data size and complexity.
Open-source VLMs have lagged behind closed-source models due to limitations in training data. This paper introduces Infinity-MM, a massive multimodal instruction dataset with 40 million samples, curated and augmented with synthetic data. A key innovation is the synthetic data generation method using existing open-source VLMs, leveraging detailed image annotations and diverse question generation.
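The annotation-conditioned generation idea can be sketched roughly as follows, assuming a generic open-source VLM wrapper; the `vlm.chat` call, the question styles, and the prompts are placeholders rather than the paper's actual pipeline.

```python
# Illustrative question styles used to diversify synthetic instructions.
QUESTION_STYLES = [
    "a factual question about the main object",
    "a counting question",
    "a question about spatial relations",
]

def synthesize_instructions(vlm, image, annotation: str) -> list[dict]:
    """Use an open-source VLM to turn a detailed image annotation into
    diverse instruction-response pairs."""
    samples = []
    for style in QUESTION_STYLES:
        q = vlm.chat(image=image,
                     prompt=f"Given this annotation: '{annotation}', write {style}.")
        a = vlm.chat(image=image, prompt=q)   # answer the generated question
        samples.append({"question": q, "answer": a})
    return samples
```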
Using Infinity-MM, the authors trained Aquila-VL-2B, a 2-billion-parameter VLM. A curriculum learning approach and custom data loader optimized training. Aquila-VL-2B achieves state-of-the-art performance for open-source models of similar scale across various benchmarks, including MMBench-1.1, MMStar, HallucinationBench, MMVet, MMMU, MathVista, and AI2D. The results highlight the effectiveness of scaling instruction data and synthetic data generation for improving open-source VLM performance.
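Under the staged-training setup described in the caption above, a curriculum-style loop might look like the following; the stage names and the trainer/loader interfaces are purely illustrative.

```python
# Hypothetical curriculum: each stage draws a larger, more complex slice of the data.
STAGES = ["stage_1", "stage_2", "stage_3", "stage_4"]

def train_with_curriculum(trainer, load_stage_data):
    """Train stage by stage, feeding progressively larger and harder subsets."""
    for stage in STAGES:
        data = load_stage_data(stage)   # progressively larger / harder subset
        trainer.fit(data)               # continue from the previous stage's checkpoint
```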
This newsletter showcases a clear trend towards enhancing multimodal models with richer contextual information, whether it be external knowledge, visual design elements, or regional details within images. The development of specialized modules like VisTEL for entity linking, innovative training frameworks like SIGIL for style transfer, and the creation of large-scale datasets like Infinity-MM and MedRegInstruct are all contributing to more robust and interpretable multimodal AI systems. The exploration of regional awareness in medical MLLMs like MedRegA and the increasing use of chain-of-thought reasoning are particularly promising directions, paving the way for more nuanced and reliable applications of multimodal AI in critical domains. The continued focus on addressing challenges like hallucination and improving the legibility of generated text further underscores the commitment to developing truly practical and impactful multimodal systems.