This newsletter explores the cutting edge of multimodal image and text foundation models, covering novel architectures, training strategies, and benchmark datasets. We'll delve into the challenges of integrating visual and textual information, the quest for efficient model design, and the ethical implications of data memorization and privacy. From universal embeddings for pathology to compact vision-language models, this newsletter offers a comprehensive overview of the latest developments in this rapidly evolving field.
MLLM4PUE: Toward Universal Embeddings in Computational Pathology through Multimodal LLMs by Qifeng Zhou, Thao M. Dang, Wenliang Zhong, Yuzhi Guo, Hehuan Ma, Saiyang Na, Junzhou Huang https://arxiv.org/abs/2502.07221
Caption: This diagram illustrates the MLLM4PUE framework for generating universal multimodal embeddings in computational pathology. It shows the pretraining process, zero-shot classification, and zero-shot composed retrieval, highlighting the use of prompts to guide the MLLM and the integration of image and text data. The framework aims to improve performance and adaptability across various pathology tasks by leveraging the power of Multimodal Large Language Models.
The field of pathology relies heavily on both visual and textual data, yet current AI models often struggle to integrate these modalities effectively. Existing approaches, primarily based on the CLIP architecture, employ separate encoders for images and text, limiting their ability to capture complex multimodal relationships. Moreover, these models are often task-specific and evaluated on disparate datasets, hindering reproducibility and comparison. This underscores the need for universal multimodal embeddings that can support a variety of downstream tasks.
Researchers have introduced MLLM4PUE, a novel framework leveraging Multimodal Large Language Models (MLLMs) to generate precisely these universal multimodal embeddings for pathology. Unlike CLIP, MLLM4PUE utilizes a transformer-based architecture that fully integrates image and text modalities, allowing it to learn from complex inter-relationships. This approach significantly enhances adaptability and effectiveness across a range of pathology tasks, including classification, retrieval, and the novel task of composed retrieval. A key innovation of MLLM4PUE is the use of prompts to guide the MLLM in distilling multimodal information into a single-word embedding.
The model is trained using contrastive learning on a combined dataset of over 590,000 image-text pairs from OpenPath, PathCap, and Quilt-1M, optimizing an InfoNCE contrastive loss:
L = -\left( \log \frac{e^{\cos(h_{v_i}, h_{t_i})/\tau}}{\sum_{j=1}^{n} e^{\cos(h_{v_i}, h_{t_j})/\tau}} + \log \frac{e^{\cos(h_{t_i}, h_{v_i})/\tau}}{\sum_{j=1}^{n} e^{\cos(h_{t_i}, h_{v_j})/\tau}} \right)
where τ is a temperature parameter, h_{v_i} is the visual embedding of image i, and h_{t_i} is the textual embedding of text i. This loss function encourages the model to learn embeddings where similar image-text pairs are close together, and dissimilar pairs are far apart.
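The symmetric contrastive objective above can be sketched in a few lines of NumPy. This is an illustrative implementation of the standard InfoNCE form, not the authors' code; the batch size, embedding width, and temperature are assumed for demonstration.

```python
import numpy as np

def info_nce_loss(h_v: np.ndarray, h_t: np.ndarray, tau: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of n image/text embeddings.

    Matching image-text pairs lie on the diagonal of the similarity matrix;
    the loss pulls them together and pushes mismatched pairs apart.
    """
    # Normalize so dot products equal cosine similarities.
    h_v = h_v / np.linalg.norm(h_v, axis=1, keepdims=True)
    h_t = h_t / np.linalg.norm(h_t, axis=1, keepdims=True)
    logits = h_v @ h_t.T / tau  # (n, n) similarity matrix scaled by temperature

    def log_softmax(x: np.ndarray) -> np.ndarray:
        x = x - x.max(axis=1, keepdims=True)  # numerical stability
        return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

    n = logits.shape[0]
    idx = np.arange(n)
    loss_v2t = -log_softmax(logits)[idx, idx].mean()    # image -> text direction
    loss_t2v = -log_softmax(logits.T)[idx, idx].mean()  # text -> image direction
    return float(loss_v2t + loss_t2v)
```

As a sanity check, perfectly aligned embeddings (identical image and text vectors) should score much lower than randomly paired ones.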
To address the lack of standardized evaluation in this domain, the researchers also introduced the Pathology Multimodal Embedding Benchmark (PMEB). This comprehensive benchmark comprises 15 tasks drawn from 14 datasets, categorized into retrieval, classification, and composed retrieval. The composed retrieval task, novel to this benchmark, evaluates the model's ability to process combined image-text queries, a crucial capability for real-world clinical scenarios. Experimental results on PMEB demonstrate the superiority of MLLM4PUE across all tasks, showcasing the potential of MLLM-based models for unifying and streamlining the development of models for diverse pathology tasks.
NanoVLMs: How small can we go and still make coherent Vision Language Models? by Mukund Agarwalla, Himanshu Kumar, Raj Dandekar, Rajat Dandekar, Sreedath Panat https://arxiv.org/abs/2502.07838
Caption: This diagram illustrates the architecture of a NanoVLM, a compact Vision-Language Model. It shows the flow of information from the image input through the visual encoder, to the visual-textual connector where it's combined with text embeddings, and finally to the language decoder which generates the textual output. This streamlined design enables efficient processing and generation of text descriptions from images.
While large Vision-Language Models (VLMs) like GPT-4V and Llama 3.2 Vision have demonstrated impressive capabilities, their size and computational demands limit accessibility. This paper explores the lower limits of VLM size while maintaining coherent text generation, drawing inspiration from how young children learn language through visual cues. The authors introduce NanoVLMs, a family of lightweight VLMs up to 10x smaller than state-of-the-art small VLMs. This remarkable size reduction is achieved by prioritizing efficient visual encoding and refined cross-modal alignment.
Central to this work are two novel datasets, ShortDesc (concise image descriptions) and LongDesc (detailed image descriptions), created using GPT-4o and image-caption pairs from the COCO dataset. The text in these datasets mimics the simple vocabulary and syntax typically used by young children, aligning with the research's inspiration. NanoVLMs employ a transformer-based architecture with three core components: a visual encoder (ViT-inspired), a visual-textual connector (linear projection and GELU), and a language decoder (transformer with causal self-attention). The visual encoder processes images as 16x16 pixel patches, transforming them into token embeddings. The visual-textual connector aligns the visual and textual embeddings, concatenating them for input to the decoder. The decoder then generates text, utilizing causal self-attention to maintain coherence.
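The data flow through those three components can be illustrated with a shape-level NumPy sketch: 16x16 patchification, a linear-plus-GELU connector, and concatenation of visual tokens with text-token embeddings before the causal decoder. The embedding width, weight matrices, and prompt length here are stand-ins, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
P, D = 16, 64  # patch size (from the paper) and embedding width (assumed)

def patchify(image: np.ndarray) -> np.ndarray:
    """Split an HxWxC image into flattened PxPxC patches."""
    H, W, C = image.shape
    patches = image.reshape(H // P, P, W // P, P, C).swapaxes(1, 2)
    return patches.reshape(-1, P * P * C)  # (num_patches, P*P*C)

def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of the GELU activation used by the connector."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

image = rng.random((224, 224, 3))
W_enc = rng.normal(scale=0.02, size=(P * P * 3, D))   # stand-in for the ViT encoder
W_conn = rng.normal(scale=0.02, size=(D, D))          # visual-textual connector weights
visual_tokens = gelu(patchify(image) @ W_enc @ W_conn)  # (196, D) visual tokens
text_tokens = rng.normal(size=(10, D))                  # embedded text prompt (assumed length)
decoder_input = np.concatenate([visual_tokens, text_tokens], axis=0)  # fed to the decoder
```

A 224x224 image yields 14x14 = 196 visual tokens, which are prepended to the text tokens as a single sequence for the causal language decoder.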
A unique aspect of this work is the evaluation methodology. GPT-40 is employed to grade the generated text on various criteria, including grammar, consistency, creativity, meaningfulness, and plot. This nuanced evaluation provides a richer understanding of model capabilities compared to traditional benchmarks. The results demonstrate that despite their significantly smaller size, NanoVLMs achieve competitive performance against much larger models. This work pushes the boundaries of efficient VLM design, opening exciting avenues for future exploration in resource-constrained environments and real-time applications.
MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification by Anh-Tien Nguyen, Duy Minh Ho Nguyen, Nghiem Tuong Diep, Trung Quoc Nguyen, Nhat Ho, Jacqueline Michelle Metsch, Miriam Cindy Maurer, Daniel Sonntag, Hanibal Bohnenberger, Anne-Christin Hauschild https://arxiv.org/abs/2502.07409
Caption: This figure contrasts prior work with MGPATH for few-shot WSI classification. Prior work uses patch-level attention with fixed prompts, while MGPATH employs multi-granular attention with learnable prompts at varying resolutions, capturing hierarchical tissue information for improved performance. Both methods process patches extracted from WSIs, encode them using a vision encoder, and then use a classifier to generate predictions.
Whole slide image (WSI) analysis is critical for cancer diagnosis, but the massive size of these images and the scarcity of annotations present significant challenges for model generalization. MGPATH addresses these challenges by introducing a novel Vision-Language Model (VLM) that leverages multi-granular prompt learning and optimal transport for enhanced WSI classification.
MGPATH adapts the Prov-GigaPath vision foundation model, pre-trained on a vast dataset of 1.3 billion pathology image patches, into a VLM. This adaptation is achieved through contrastive learning with a pre-trained text encoder and additional image-text pairs. The model then utilizes multi-granular prompt learning for few-shot WSI tasks. Visual embeddings and descriptive text prompts are generated for image patches at different resolutions. Unlike traditional methods, MGPATH integrates learnable prompts with frozen visual features at both fine- and coarse-grained levels using a novel multi-granular attention mechanism. This approach effectively captures hierarchical information by representing image patches as a spatial graph and directing attention across patch and region levels.
To further enhance robustness, MGPATH employs optimal transport (OT) to measure the distance between the prompt-fused visual embedding and multiple text prompts. This offers greater flexibility in aligning heterogeneous data distributions and robustness against data augmentation perturbations. The objective is to minimize the optimal transport distance d<sub>OT</sub>(μ, ν) = ⟨T<sup>*</sup>, C⟩, the inner product between the optimal transport plan T<sup>*</sup> and the cost matrix C. This formulation ensures that the model learns a robust mapping between visual and textual representations.
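One common way to approximate the transport plan T<sup>*</sup> is entropic regularization via Sinkhorn iterations, sketched below. This is a hedged illustration of the OT distance, not the paper's exact solver; the regularization strength and iteration count are assumptions.

```python
import numpy as np

def sinkhorn_ot(C: np.ndarray, mu: np.ndarray, nu: np.ndarray,
                eps: float = 0.05, n_iters: int = 200) -> float:
    """Approximate the OT distance <T*, C> between marginals mu and nu.

    C is the pairwise cost matrix (e.g., 1 - cosine similarity between
    visual and text embeddings); eps is the entropic regularization.
    """
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):      # alternate projections onto the marginals
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    T = u[:, None] * K * v[None, :]  # approximate transport plan T*
    return float((T * C).sum())      # <T*, C>
```

With a cost matrix that is zero on the diagonal and one elsewhere (each visual embedding has one cheap text match), nearly all mass is transported along the diagonal and the distance is close to zero.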
Empirical experiments on various pathology datasets demonstrate MGPATH's superior performance compared to state-of-the-art MIL and VLM competitors. The results highlight the power of combining large-scale domain-specific models with multi-granular prompt learning and optimal transport for few-shot learning in pathology. The multi-granular attention mechanism effectively captures hierarchical tissue details, while OT ensures robustness against data augmentation, paving the way for more accurate and reliable WSI classification.
The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation by Raman Dutt https://arxiv.org/abs/2502.07516
Caption: This image demonstrates the ineffectiveness of current de-identification mitigation strategies in synthetic chest X-ray generation. It shows generated X-rays from prompts with original de-identification traces, replaced traces (with a random word and number), and removed traces, illustrating that these methods fail to reduce memorization as the generated images remain visually similar. This highlights the unintended consequence of current de-identification practices in exacerbating memorization risks in generative models.
While generative text-to-image (T2I) models offer promising applications in medical image analysis, they are susceptible to memorizing training data, raising serious privacy concerns. This study investigates memorization risks in synthetic chest X-ray generation using the MIMIC-CXR dataset, adopting a data-driven approach to identify specific prompts and text tokens that contribute most significantly to memorization. The study employs a memorization detection metric based on text-conditional noise: d<sub>mem</sub> = (1/T) Σ<sub>t=1</sub><sup>T</sup> ||ε<sub>θ</sub>(x<sub>t</sub>, e<sub>p</sub>) − ε<sub>θ</sub>(x<sub>t</sub>, e<sub>∅</sub>)||<sub>2</sub>, where ε<sub>θ</sub> is the noise predictor, T is the number of timesteps, e<sub>p</sub> is the embedding of prompt p, and e<sub>∅</sub> is the embedding of an empty string. A higher d<sub>mem</sub> indicates stronger memorization.
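The metric can be sketched directly from its definition. Below, a toy stand-in replaces the diffusion model's noise predictor ε<sub>θ</sub> (in practice, a text-conditioned UNet); the stand-in and all inputs are illustrative assumptions.

```python
import numpy as np

def d_mem(eps_theta, x_noisy: list, e_p: np.ndarray, e_null: np.ndarray) -> float:
    """Average L2 gap between prompt-conditioned and unconditioned noise predictions.

    A larger gap means the prompt embedding e_p steers the denoiser strongly,
    which the paper uses as a signal of memorization.
    """
    total = 0.0
    for x_t in x_noisy:  # one noisy latent per timestep t = 1..T
        diff = eps_theta(x_t, e_p) - eps_theta(x_t, e_null)
        total += np.linalg.norm(diff)
    return total / len(x_noisy)

# Toy stand-in for the noise predictor: linear in the conditioning embedding.
def toy_eps(x_t: np.ndarray, e: np.ndarray) -> np.ndarray:
    return 0.1 * x_t + e

rng = np.random.default_rng(0)
xs = [rng.normal(size=4) for _ in range(5)]     # noisy samples across 5 timesteps
e_prompt, e_empty = np.ones(4), np.zeros(4)     # prompt vs empty-string embeddings
score = d_mem(toy_eps, xs, e_prompt, e_empty)   # higher => stronger memorization signal
```

With the linear toy predictor the per-timestep gap is constant, so the score reduces to the norm of the embedding difference, which makes the metric's behavior easy to verify.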
A surprising finding is that prompts containing traces of de-identification procedures are among the most memorized. Specifically, the de-identification marker ("___") used to replace Protected Health Information (PHI) contributes significantly to memorization. This is attributed to the marker's uniqueness within the MIMIC-CXR text corpus and its high frequency, leading the model to learn spurious correlations and memorize specific samples. Furthermore, existing inference-time mitigation strategies, such as random word/number addition and complete marker removal, prove ineffective in reducing memorization.
This unexpected finding highlights the unintended consequences of current de-identification practices in exacerbating memorization risks. The study suggests that these practices, intended to protect patient privacy, may inadvertently increase the risk of data leakage. This underscores the need for a paradigm shift in de-identification methodologies to prevent unintended memorization and ensure patient privacy in the age of generative AI. The study proposes actionable strategies for dataset curators and model developers to mitigate these risks, including the use of rule-based de-identification with randomized marker symbols and pre-processing techniques like recaptioning to enhance caption diversity.
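The proposed rule-based mitigation can be sketched as follows: replace the uniform "___" marker with symbols drawn at random from a pool, so that no single rare token dominates the corpus. The pool contents and marker pattern here are hypothetical, not taken from the paper.

```python
import random
import re

# Hypothetical pool of replacement symbols; the paper proposes randomized
# markers but the specific strings below are assumed for illustration.
POOL = ["[REDACTED]", "[PHI]", "[HIDDEN]", "[MASKED]"]

def randomize_markers(report: str, rng: random.Random) -> str:
    """Replace each run of underscores (a de-identification trace) with a
    randomly drawn symbol, breaking the one-token-fits-all pattern that
    the study links to memorization."""
    return re.sub(r"_{2,}", lambda _: rng.choice(POOL), report)
```

Applied to a caption like "Patient ___ was seen on ___.", each trace is replaced independently, diversifying the text the generative model sees during training.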
UniCoRN: Unified Commented Retrieval Network with LMMs by Maximilian Jaritz, Matthieu Guillaumin, Sabine Sternig, Loris Bazzani https://arxiv.org/abs/2502.08254