This newsletter dives into the cutting edge of multimodal AI, exploring the latest developments in image and text foundation models. We’ll examine novel architectures, robust benchmarks, and the persistent challenges of distribution shifts and data incompleteness. From enhancing clinical decision support to deciphering the inner workings of complex models, this week's papers offer a rich tapestry of insights for the multimodal expert.
Evaluating Computational Pathology Foundation Models for Prostate Cancer Grading under Distribution Shifts by Fredrik K. Gustafsson, Mattias Rantalainen https://arxiv.org/abs/2410.06723
Caption: This image depicts three different weakly supervised Whole Slide Image (WSI) classification pipelines for prostate cancer grading. Each pipeline uses pre-trained foundation models (or a baseline ResNet) to extract patch-level features, which are then aggregated into a WSI-level feature vector using either attention-based multiple instance learning (ABMIL), simple averaging (Mean Feature), or k-nearest neighbors (kNN) before final grade prediction. These pipelines were used to evaluate the robustness of foundation models to distribution shifts in WSI data and label distributions.
Foundation models have revolutionized computational pathology, promising versatility as general-purpose feature extractors. However, their robustness against real-world data variability remains a critical question. This study evaluates two leading pathology foundation models, UNI and CONCH, in prostate cancer grading, specifically examining their resilience against distribution shifts.
The researchers employed a weakly supervised approach, using the foundation models as frozen patch-level feature extractors. Three different ISUP grade classification models—ABMIL (attention-based multiple instance learning), Mean Feature (simplified ABMIL), and kNN (k-nearest neighbors)—were used to assess performance. The PANDA dataset of prostate biopsy WSIs served as the testing ground, with Resnet-IN (trained on natural images) acting as a baseline.
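To make the aggregation step concrete, here is a minimal PyTorch sketch of ABMIL-style pooling over frozen patch features. The dimensions, six-class output, and module layout are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ABMILHead(nn.Module):
    """Attention-based MIL: weight frozen patch features, then classify the slide."""
    def __init__(self, feat_dim=1024, hidden_dim=256, n_classes=6):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, patch_feats):            # (n_patches, feat_dim), frozen features
        scores = self.attention(patch_feats)   # (n_patches, 1)
        weights = torch.softmax(scores, dim=0) # attention over patches
        slide_feat = (weights * patch_feats).sum(dim=0)  # WSI-level feature vector
        return self.classifier(slide_feat)     # grade-group logits

# The "Mean Feature" variant simply replaces the learned weights with a uniform average.
patch_feats = torch.randn(500, 1024)           # e.g. 500 patches from one WSI
logits = ABMILHead()(patch_feats)
print(logits.shape)                            # torch.Size([6])
```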
Two types of distribution shifts were examined: variations in WSI image data (due to differences in staining and scanning procedures) and shifts in ISUP grade group label distributions. While UNI and CONCH outperformed the Resnet-IN baseline, their performance was far from perfect, particularly when confronted with distribution shifts. UNI's performance plummeted when trained on data from one site (Radboud) and tested on another (Karolinska), achieving a kappa score of only 0.247 ± 0.138 compared to 0.888 ± 0.013 when trained and tested on the full PANDA dataset. CONCH proved even more sensitive, often underperforming even the Resnet-IN baseline. Interestingly, shifts in the label distribution had a less significant impact.
This study underscores a crucial point: simply training foundation models on massive datasets doesn't guarantee robustness. The quality and diversity of the data used to train downstream models remain paramount. While pathology-specific models like UNI hold promise, their practical application necessitates careful consideration of distribution shifts and the inclusion of diverse data during downstream training. The superior performance of UNI over CONCH in this specific task, despite CONCH's strong general performance, highlights the importance of model selection based on the specific downstream task.
Utility of Multimodal Large Language Models in Analyzing Chest X-ray with Incomplete Contextual Information by Choonghan Kim, Seonhee Cho, Joo Heung Yoon https://arxiv.org/abs/2410.07111
The increasing use of LLMs in clinical settings faces a significant challenge: handling incomplete data, a frequent issue with radiology reports. This study investigated whether multimodal LLMs, incorporating both text and images, could enhance the accuracy and reliability of chest x-ray analysis, especially with incomplete reports.
Three LLMs – OpenFlamingo, MedFlamingo (clinically fine-tuned), and IDEFICS – were evaluated on 300 image-report pairs from the MIMIC-CXR database, in both text-only and multimodal formats. Simulated incompleteness was introduced by randomly deleting words or phrases from reports at 20%, 50%, and 80% rates. Performance was measured using ROUGE-L, F1RadGraph, and F1CheXbert, with statistical significance assessed via the Wilcoxon test.
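The corruption step is easy to picture in code; the snippet below is an illustrative approximation of word-level deletion at a fixed rate, not the authors' exact procedure.

```python
import random

def corrupt_report(report: str, drop_rate: float, seed: int = 0) -> str:
    """Randomly delete a fraction of words to simulate an incomplete report."""
    rng = random.Random(seed)
    words = report.split()
    keep = [w for w in words if rng.random() >= drop_rate]
    return " ".join(keep)

report = "No acute cardiopulmonary process. Heart size is normal."
for rate in (0.2, 0.5, 0.8):
    print(rate, "->", corrupt_report(report, rate))
```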
As expected, text-only model performance declined with increasing data corruption. OpenFlamingo performed best with complete text (ROUGE-L: 0.39, F1RadGraph: 0.34, F1CheXbert: 0.53). However, the addition of images dramatically boosted MedFlamingo and IDEFICS performance, often matching or surpassing OpenFlamingo, even with incomplete data. For example, with 50% data corruption, MedFlamingo's ROUGE-L score jumped from 0.22 to 0.32, F1RadGraph from 0.12 to 0.29, and F1CheXbert from 0.29 to 0.46. Similar improvements were observed with IDEFICS.
This research highlights the limitations of text-only LLMs with incomplete data, a common real-world clinical scenario. Multimodal LLMs offer a promising remedy, substantially improving robustness and reliability even under heavy data corruption, which could strengthen clinical decision support and patient care. The study also points to the potential cost-effectiveness of multimodal models like IDEFICS compared with their unimodal counterparts. While focused on chest x-rays, these findings could extend to other imaging modalities and to clinical text analysis tasks where data incompleteness is a concern.

PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling by Xudong Xie, Liang Yin, Hao Yan, Yang Liu, Jing Ding, Minghui Liao, Yuliang Liu, Wei Chen, Xiang Bai https://arxiv.org/abs/2410.05970
Caption: This image illustrates three different approaches to processing PDF documents for question answering. (a) shows a plain text solution, (b) a purely visual solution, and (c) the proposed method, which uses a sparse sampler to select relevant text and image segments for input to the LLM. This approach allows for efficient processing of long, image-rich PDFs.
Existing LLMs often struggle with long, image-rich PDFs like academic papers, relying on inefficient methods that focus solely on text or treat pages as individual images. PDF-WuKong, a new multimodal large language model (MLLM), tackles this challenge with an innovative end-to-end sparse sampling technique.
The key insight is that user queries typically relate to only a small portion of a document. PDF-WuKong's sparse sampler, operating on both text and image representations, efficiently identifies and extracts the most relevant paragraphs or diagrams, streamlining input to the language model and significantly boosting efficiency. Integrated with the MLLM's image encoder, this sampler enables seamless end-to-end training and inference, optimizing performance and providing valuable interpretability for question answering.
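Conceptually, the sparse sampling amounts to scoring every text or image chunk against the query and forwarding only the top few to the language model. The sketch below is illustrative only (cosine similarity over precomputed embeddings), not PDF-WuKong's actual trained sampler.

```python
import torch
import torch.nn.functional as F

def sparse_sample(query_emb, chunk_embs, k=5):
    """Keep only the k document chunks (paragraphs or figures) most similar to the query.

    query_emb:  (d,) embedding of the user question
    chunk_embs: (n_chunks, d) embeddings of text paragraphs and figure crops
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), chunk_embs)  # (n_chunks,)
    topk = sims.topk(min(k, chunk_embs.size(0)))
    return topk.indices, topk.values

# Toy usage: 200 chunks from a long PDF, only 5 reach the language model.
query_emb = torch.randn(512)
chunk_embs = torch.randn(200, 512)
idx, scores = sparse_sample(query_emb, chunk_embs, k=5)
print(idx.tolist())
```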
The researchers created PaperPDF, a new dataset of academic papers sourced from arXiv, containing 1 million automatically generated question-answer pairs with corresponding evidence sources. This rich dataset allows for robust training and evaluation, pushing the boundaries of multimodal PDF understanding.
Experimental results on PaperPDF showcase PDF-WuKong's superiority. It surpasses existing open-source models and even outperforms proprietary document understanding products by an average of 8.6% on F1. The model maintains impressive accuracy and efficiency even with increasing document length, demonstrating the effectiveness of sparse sampling. Its competitive performance on established document-oriented VQA datasets further confirms its versatility.
Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond by Soyeon Caren Han, Feiqi Cao, Josiah Poon, Roberto Navigli https://arxiv.org/abs/2410.05608
A half-day tutorial at ACM Multimedia 2024 will provide a comprehensive overview of the rapidly evolving field of multimodal pretrained models and multimodal large language models (MLLMs). Going beyond previous tutorials focused on vision-language models, it will cover a wider range of modalities, including audio, video, and time-series data, alongside vision and language. Practical challenges such as computational cost and potential misuse will also be addressed.
The tutorial will start with an introduction to multimodality, covering its history, core tasks, and key technical challenges. It will then explore multimodal pretrained models, focusing on vision-language datasets and models as well as models incorporating other modalities. The evolution to MLLMs such as BLIP-2, LLaVA, and GPT-4V will be central, examining their architectures, pretraining strategies, and downstream applications. Multimodal instruction tuning techniques, covering models like InstructPix2Pix, Instruct-BLIP, and VideoLLaMA, along with domain-specific tuning and efficient fine-tuning strategies like LoRA and QLoRA, will be a significant focus.
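As a flavor of the parameter-efficient tuning methods on the agenda, the snippet below is a generic LoRA-style adapter around a frozen linear layer; it is a textbook sketch, not material from the tutorial itself.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank trainable update (W + B @ A)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze the pretrained weights
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 16, 768))           # only A and B receive gradients
print(out.shape)                               # torch.Size([2, 16, 768])
```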
Hands-on labs will provide practical experience with state-of-the-art MLLMs. One lab will focus on downstream vision-language tasks like Visual Storytelling (VST) and Visual Question Answering (VQA) using pretrained models, while another will demonstrate instruction tuning techniques for MLLMs in various domains.
The tutorial will conclude with a summary and discussion of future directions and trends, addressing the limitations and potential misuse of current MLLMs and emphasizing responsible development. This tutorial caters to a diverse audience, from experienced researchers and practitioners to newcomers seeking a comprehensive introduction to the field.
VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models by Harshit, Tolga Tasdizen https://arxiv.org/abs/2410.04609
Caption: The images showcase visualizations of attention maps from Vision-Language Models (VLMs) overlaid on various images, including a cat, pizza, dog, and tennis player. These visualizations, part of the VISTA dataset, are used to compare VLM attention with human eye-tracking data, aiming to improve VLM interpretability and alignment with human visual understanding. The colored overlays represent the model's focus, with warmer colors indicating higher attention.
While Vision-Language Models (VLMs) have achieved impressive performance, their internal workings remain largely mysterious. The VISTA (Visual and Textual Attention) dataset aims to enhance VLM interpretability by connecting human visual attention to model processing. VISTA addresses the crucial question of which image regions correspond to specific text segments and how to decipher these associations, which is vital for improving model transparency, interpretability, and trustworthiness.
VISTA was created using eye tracking and verbal descriptions: participants described scenes while their eye movements were recorded, yielding a synchronized dataset of fixation data and transcribed descriptions. This enables direct comparison between human attention patterns and VLM internal activations. The evaluation focuses on two key tasks: image-text alignment (comparing model attention weights with human eye-tracking data) and text-guided image segmentation (assessing segmentation map accuracy given textual descriptions). Normalized Cross-Correlation (NCC) and Area Under the Curve (AUC) quantify the alignment between human attention and model predictions. NCC is given by NCC(I₁, I₂) = 1/(P-1) Σ<sub>p</sub> (I₁(p) - μ<sub>I₁</sub>)(I₂(p) - μ<sub>I₂</sub>) / (σ<sub>I₁</sub>σ<sub>I₂</sub>), where I₁ and I₂ are the two images being compared, p indexes pixels, P is the total number of pixels, μ is the mean, and σ is the standard deviation.
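In code, NCC reduces to a normalized dot product of the mean-centered maps; the NumPy rendering below follows the formula above (map shapes are illustrative).

```python
import numpy as np

def ncc(i1: np.ndarray, i2: np.ndarray) -> float:
    """Normalized cross-correlation between two equally sized maps."""
    a, b = i1.ravel().astype(float), i2.ravel().astype(float)
    p = a.size
    a_c, b_c = a - a.mean(), b - b.mean()
    return float((a_c * b_c).sum() / ((p - 1) * a.std() * b.std()))

human_map = np.random.rand(224, 224)   # e.g. an eye-tracking fixation density map
model_map = np.random.rand(224, 224)   # e.g. a VLM attention map
print(ncc(human_map, model_map))
```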
Several state-of-the-art VLMs (CLIP, ViLT, BLIP, ALBEF) were evaluated on image-text alignment. BLIP-Base performed best (NCC: 0.24, AUC: 0.63), showing closer alignment with human attention than other models. CLIP also performed reasonably well (NCC: 0.13, AUC: 0.57), while ViLT exhibited the weakest alignment (NCC: -0.02, AUC: 0.49). For text-guided segmentation, CLIP-Seg achieved the highest scores (NCC: 0.31, AUC: 0.67), followed by OV-Seg and ODISE. VISTA represents a substantial step towards demystifying VLMs.
Temporal Image Caption Retrieval Competition -- Description and Results by Jakub Pokrywka, Piotr Wierzchoń, Kornel Weryszko, Krzysztof Jassem https://arxiv.org/abs/2410.06314
Caption: THE NORTH ATLANTIC SQUADRON IN CRUISING ORDER.
While multimodal models combining image and text are flourishing, incorporating temporal information remains relatively unexplored. The Temporal Image Caption Retrieval Competition (TICRC) addresses this gap with a novel task: retrieving relevant captions for historical images given the image and its publication date. Leveraging the Chronicling America and Challenging America projects, the competition uses a vast collection of digitized American newspapers spanning 274 years. The inclusion of temporal data adds complexity, as language and factual information are date-dependent.
The TICRC task involves retrieving the most relevant caption from a set for a given historical image and its publication date. The dataset comprises 3902 instances (image, caption, timestamp) extracted from digitized newspapers dating back to 1853, split into training, development, and testing sets. Mean Reciprocal Rank (MRR) is the evaluation metric, calculated as: MRR = (1/Q) * Σ(1/rank<sub>i</sub>), where Q is the number of queries and rank<sub>i</sub> is the relevant document's rank for the i-th query.
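MRR itself is a one-liner once the rank of the correct caption is known for each query; a quick sketch (the example ranks are made up):

```python
def mean_reciprocal_rank(ranks):
    """MRR over a list of 1-based ranks of the correct caption per query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# e.g. the correct caption was ranked 1st, 3rd, and 10th for three test images
print(mean_reciprocal_rank([1, 3, 10]))   # ~0.478
```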
Five teams participated, with three outperforming baseline models. The winning solution (test-B MRR: 0.3444) used the EVA02_CLIP_E_psz14_plus_s9B model without fine-tuning. Baseline models (transformer-based clip-ViT-B-32 and randomized caption order) achieved test-B MRR scores of 0.1710 and 0.0193, respectively. These results highlight the potential for improvement in this novel task. The TICRC provides a valuable benchmark for evaluating temporal image caption retrieval models, presenting unique challenges due to the historical nature of the data.
Pixtral 12B by Pravesh Agrawal, et al. https://arxiv.org/abs/2410.07073
Caption: This scatter plot visualizes the performance and cost (measured by number of parameters) of several multimodal LLMs on the MM-MT-Bench. Pixtral 12B demonstrates superior performance (6.05) compared to models like Qwen-2-VL and Llama-3.2, while maintaining a competitive cost, placing it in the "best performance/cost ratio" region. The chart highlights Pixtral 12B's efficient architecture and strong performance on this newly introduced benchmark.
Mistral AI has released Pixtral 12B, an open-source multimodal LLM excelling in both image and text understanding. Unlike many open-source models that compromise text-only performance for multimodality, Pixtral 12B achieves leading performance on various benchmarks, surpassing larger models. Its novel architecture combines a 400M parameter vision encoder trained from scratch with a 12B parameter multimodal decoder based on Mistral Nemo 12B. This allows processing images at native resolution and aspect ratio, offering flexibility in token usage and efficient handling of multiple images within its 128K context window.
A key innovation is the new vision encoder, Pixtral-ViT, which utilizes ROPE-2D. This enables processing variable image sizes and aspect ratios using relative, rotary position encodings in self-attention layers. The ROPE-2D transform is defined as ROPE-2D(x(i,j), Θ) = M(i,j) x(i,j), where x(i,j) is the patch vector at position (i,j), and M(i,j) is the rotation matrix derived from the position and the frequencies Θ. This contrasts with traditional methods that interpolate learned position embeddings, often sacrificing performance with varying image sizes. The architecture also includes break tokens to differentiate images with the same area but different aspect ratios, gating in the feedforward network, and sequence packing for efficient batch processing.
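The sketch below illustrates the idea of ROPE-2D on a single patch vector, assuming the common convention of devoting half the feature dimensions to the row index and half to the column index; Pixtral's exact frequency layout, and how the rotation is applied to queries and keys inside attention, may differ in detail.

```python
import torch

def rope_2d(x, row, col, base=10000.0):
    """Rotate a patch vector by angles derived from its (row, col) grid position.

    x: (..., d) with d divisible by 4; the first d/2 dims encode the row
    position and the last d/2 the column position (an illustrative split).
    """
    half = x.shape[-1] // 2

    def rotate(v, pos):
        dim = v.shape[-1]
        freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
        angles = pos * freqs                      # one angle per dimension pair
        cos, sin = angles.cos(), angles.sin()
        v1, v2 = v[..., 0::2], v[..., 1::2]       # pair up dimensions
        out = torch.empty_like(v)
        out[..., 0::2] = v1 * cos - v2 * sin
        out[..., 1::2] = v1 * sin + v2 * cos
        return out

    return torch.cat([rotate(x[..., :half], row),
                      rotate(x[..., half:], col)], dim=-1)

patch = torch.randn(64)                           # one patch embedding
rotated = rope_2d(patch, row=3, col=7)
print(rotated.shape)                              # torch.Size([64])
```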
Addressing the lack of standardization in multimodal LLM evaluation, the researchers introduce MM-MT-Bench, an open-source benchmark using an LLM judge for grading model responses (1-10) based on correctness and completeness. "Explicit" prompts, clearly specifying desired output format, are advocated to address ambiguity in existing benchmarks. Pixtral 12B significantly outperforms comparable open-source models on MM-MT-Bench and ranks highly on the LMSys Vision Leaderboard. It also shows strong performance on other benchmarks (MathVista, MMMU, ChartQA, DocVQA, VQAv2) and maintains strong text-only performance.
This newsletter has showcased a diverse range of advancements in multimodal image and text foundation models. We've seen the challenges of real-world deployment in medical imaging, the innovative use of sparse sampling for efficient PDF comprehension, and the development of robust benchmarks for evaluating multimodal LLMs. The emergence of models like Pixtral 12B highlights the ongoing push for open-source solutions that excel in both image and text understanding. As the field continues to evolve, addressing issues like distribution shifts, data incompleteness, and model interpretability will remain crucial for unlocking the full potential of multimodal AI. The research presented here underlines the dynamic nature of the field and the exciting possibilities that lie ahead.