This newsletter explores the cutting edge of multimodal image and text foundation models, showcasing exciting new research and breakthroughs. We'll delve into novel architectures, training methodologies, and applications across diverse domains, from computational pathology to wildlife conservation. Prepare to uncover how these powerful models are transforming our ability to analyze and interpret complex visual and textual data.
CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology by Yuxuan Sun, Yixuan Si, Chenglu Zhu, Xuan Gong, Kai Zhang, Pingyi Chen, Ye Zhang, Zhongyi Shui, Tao Lin, Lin Yang https://arxiv.org/abs/2412.12077
Caption: CPath-Omni is a new large multimodal model designed to unify patch and whole-slide image analysis in computational pathology. Trained on a diverse dataset, it excels at tasks like classification, visual question answering, and captioning, even surpassing human performance on some benchmarks. This unified approach simplifies workflows and improves diagnostic accuracy, representing a significant advancement in the field.
Computational pathology has witnessed remarkable progress, with models traditionally focusing on either patch-level or whole-slide image (WSI) analysis. This separation, however, limits knowledge integration and creates redundancy. Researchers introduce CPath-Omni, a 15-billion parameter large multimodal model (LMM) designed to unify both patch and WSI analysis within a single, powerful framework. This "one-for-all" model handles a diverse range of tasks, including classification, visual question answering (VQA), captioning, and visual referring prompting, effectively streamlining the field of computational pathology.
At the heart of CPath-Omni lies CPath-CLIP, a novel pathology-specific visual processor. CPath-CLIP uniquely integrates the self-supervised vision model Virchow2 with the original CLIP-L model and, for the first time, incorporates a large language model (LLM), Qwen2-1.5B, as the text encoder. This enhanced design improves alignment with LLM world knowledge and allows the model to capture fine-grained details crucial for accurate pathological analysis. CPath-Omni's training is a multi-stage process, encompassing patch-based pretraining, fine-tuning, WSI-based pretraining, and finally, mixed patch-WSI training. This approach enables the model to leverage the unique strengths of both patch-level and WSI-level data, resulting in seamless processing and enabling a wide array of downstream tasks.
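To make the design concrete, here is a minimal sketch (in PyTorch, not the authors' code) of how a CPath-CLIP-style visual processor might fuse two vision backbones with an LLM-based text encoder for contrastive alignment. The placeholder modules, dimensions, and symmetric InfoNCE loss below stand in for Virchow2, CLIP-L, and Qwen2-1.5B and are illustrative assumptions only.

```python
# Minimal sketch of a CPath-CLIP-style dual-stream vision encoder paired with
# an LLM text encoder. All modules are placeholders for the real backbones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPathCLIPSketch(nn.Module):
    def __init__(self, ssl_dim=1280, clip_dim=1024, text_dim=1536, embed_dim=768):
        super().__init__()
        self.ssl_vision = nn.Linear(3 * 224 * 224, ssl_dim)    # stands in for Virchow2
        self.clip_vision = nn.Linear(3 * 224 * 224, clip_dim)  # stands in for CLIP-L
        self.text_encoder = nn.Linear(512, text_dim)           # stands in for Qwen2-1.5B features
        # Fuse the two visual feature streams, then project both modalities
        # into a shared embedding space for contrastive alignment.
        self.vision_proj = nn.Linear(ssl_dim + clip_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))   # ~log(1/0.07), CLIP-style

    def forward(self, images, text_feats):
        flat = images.flatten(1)
        v = torch.cat([self.ssl_vision(flat), self.clip_vision(flat)], dim=-1)
        v = F.normalize(self.vision_proj(v), dim=-1)
        t = F.normalize(self.text_proj(self.text_encoder(text_feats)), dim=-1)
        return self.logit_scale.exp() * v @ t.t()  # image-text similarity logits

# Symmetric contrastive loss over a batch of matched image-caption pairs.
model = CPathCLIPSketch()
logits = model(torch.randn(4, 3, 224, 224), torch.randn(4, 512))
labels = torch.arange(4)
loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```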
The training dataset used for CPath-Omni is the largest and most diverse curated for LMMs in pathology, spanning seven tasks across 42 datasets. This includes a newly curated patch image-caption dataset, CPath-PatchCaption, containing 700,145 pairs. WSI training leverages existing datasets, augmented with additional processing and pathologist-annotated high-resolution images. Extensive experiments demonstrate CPath-Omni's superior performance. CPath-CLIP achieves state-of-the-art (SOTA) results on nine zero-shot and four few-shot classification datasets, outperforming existing pathology CLIP models. CPath-Omni itself achieves SOTA performance on 39 out of 42 datasets across the seven diverse tasks, surpassing or matching the performance of task-specific models trained individually. Impressively, on PathMMU, the largest pathology-specific VQA dataset, CPath-Omni surpasses human performance by 0.6 percentage points, achieving 72.4% accuracy compared to 71.8% for human pathologists. It also significantly outperforms both general-purpose and pathology-specific LMMs, showcasing its ability to learn generalized knowledge across diverse medical fields.
CPath-Omni's unified approach represents a paradigm shift in computational pathology. Its ability to handle both patch and WSI analysis within a single model simplifies workflows and significantly enhances performance across diverse tasks. The model's exceptional performance, even exceeding human-level accuracy on certain tasks, underscores the potential of LMMs to become indispensable tools for pathologists, streamlining diagnoses and ultimately improving patient care. This groundbreaking work paves the way for the development of even more versatile and comprehensive LMMs in the field.
MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants by Hritik Bansal, Daniel Israel, Siyan Zhao, Shufan Li, Tung Nguyen, Aditya Grover https://arxiv.org/abs/2412.12661
Caption: This bar chart compares the average accuracy of several multimodal models on 12 downstream biomedical visual question answering (VQA) tasks. The MedMax-7B model, fine-tuned on the MEDMAX dataset, significantly outperforms other models, including Chameleon-7B and GPT-4o, demonstrating the effectiveness of the MEDMAX dataset for multimodal biomedical instruction tuning.
The landscape of biomedical AI is undergoing a rapid transformation, with mixed-modal foundation models holding immense promise for integrating information from diverse sources like images and text. However, these models often face challenges when dealing with biomedical data due to the inherent distribution shifts from the natural data they are typically trained on. This paper introduces MEDMAX, a groundbreaking large-scale multimodal biomedical instruction-tuning dataset designed to address this challenge and empower the next generation of biomedical AI assistants.
Existing resources in this domain are often hampered by small scale, narrow domain coverage, and restricted sources. MEDMAX tackles these limitations head-on with its impressive 1.47 million instances, encompassing a diverse range of tasks, including multimodal content generation, image captioning, visual chatting, and report understanding across various medical domains like radiology and histopathology. A key innovation within MEDMAX is MEDMAX-INSTRUCT, a novel dataset specifically designed for generating interleaved image-text content, which is crucial for advanced diagnostics, report generation, and medical training.
The authors fine-tuned a mixed-modal foundation model, Chameleon, on the MEDMAX dataset, which comprises 1.7B multimodal discrete tokens. The model utilizes an autoregressive sequence modeling objective, maximizing the log-likelihood of the next token given the preceding sequence: max<sub>θ</sub> E<sub>x~D</sub> Σ<sup>n</sup><sub>k=1</sub> log P<sub>θ</sub>(x<sub>k</sub>|x<sub>1:k-1</sub>). Instruction tuning further refines the model's capabilities by focusing on specific tasks. This is achieved by training on paired multimodal sequences (x, y), where x represents the instruction and y the response, optimizing the following objective: max<sub>θ</sub> E<sub>(x,y)~D<sub>1</sub></sub> Σ<sup>n</sup><sub>k=1</sub> log P<sub>θ</sub>(y<sub>k</sub>|y<sub>1:k−1</sub>,x). This targeted approach concentrates the learning process on the desired tasks, making the model more effective at responding to specific user instructions.
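The sketch below (PyTorch, illustrative only) shows this kind of instruction-tuning objective: the next-token loss is computed only on response tokens y, conditioned on the instruction x. The tiny transformer and random token IDs are stand-ins for the Chameleon backbone and its multimodal vocabulary, not the paper's training code.

```python
# Instruction-tuning loss sketch: cross-entropy over response tokens only.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)

def instruction_tuning_loss(instruction_ids, response_ids):
    """Negative log-likelihood of response tokens given the instruction prefix."""
    tokens = torch.cat([instruction_ids, response_ids], dim=1)          # (B, L)
    L = tokens.size(1)
    causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)  # causal mask
    hidden = backbone(embed(tokens), mask=causal)
    logits = lm_head(hidden[:, :-1])                                    # next-token logits
    targets = tokens[:, 1:].clone()
    # Mask positions whose target is still part of the instruction x, so the
    # loss is summed only over response tokens y, as in the objective above.
    targets[:, : instruction_ids.size(1) - 1] = -100
    return F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1),
                           ignore_index=-100)

loss = instruction_tuning_loss(torch.randint(0, vocab_size, (2, 8)),
                               torch.randint(0, vocab_size, (2, 6)))
```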
The results of this research are compelling, demonstrating the power and efficacy of MEDMAX. The fine-tuned model significantly outperforms existing open and closed multimodal models. On 12 downstream biomedical visual question-answering (VQA) tasks, it achieved a remarkable 26% gain over the Chameleon model and an 18.3% improvement over GPT-4o. Moreover, MEDMAX exhibited substantial improvements in biomedical image captioning, image generation, visual chatting, and, notably, multimodal data generation. The authors also introduce a unified evaluation suite for biomedical tasks, providing a robust and standardized framework for assessing and comparing future mixed-modal models.
The paper also investigates the impact of dataset scaling and specialized visual encoders. Experiments revealed that performance improves with increasing dataset size, underscoring the high quality of the MEDMAX data. Interestingly, fine-tuning the visual encoder separately resulted in a slight performance decrease, suggesting that the original visual tokens from the base VQGAN are better aligned with the model's internal representations. Ablation studies further confirmed the importance of including diverse tasks in the dataset, as removing VQA or visual chat data led to significant performance drops on those respective tasks. MEDMAX represents a significant leap forward in multimodal biomedical AI. By providing a large-scale, diverse dataset and demonstrating its effectiveness in training a powerful mixed-modal model, this work paves the way for the development of more sophisticated and capable biomedical assistants. The unified evaluation suite further contributes to the field by establishing a standardized benchmark for future research and development in this exciting area.
A Knowledge-enhanced Pathology Vision-language Foundation Model for Cancer Diagnosis by Xiao Zhou, Luoyi Sun, Dexuan He, Wenbin Guan, Ruifen Wang, Lifeng Wang, Xin Sun, Kun Sun, Ya Zhang, Yanfeng Wang, Weidi Xie https://arxiv.org/abs/2412.13126
Caption: This figure illustrates the KEEP model's architecture for knowledge-enhanced pathology image analysis. It showcases various components, including image and text encoding, denoised contrastive learning with positive and negative sample selection, and zero-shot classification for segmentation and diagnosis. The figure also highlights pre-processing steps like pathology image detection, caption overlap identification, and UMLS entity extraction, as well as downstream tasks such as grouping similar images/captions and cross-modal ranking.
Deep learning has revolutionized computational pathology, but existing models often struggle with the complexities of rare cancer subtypes and lack the crucial integration of medical knowledge. Researchers have introduced KEEP, a KnowledgE-Enhanced Pathology vision-language foundation model that directly addresses these limitations, achieving state-of-the-art performance in zero-shot cancer diagnosis. Unlike its data-driven predecessors, KEEP leverages a vast disease knowledge graph (KG) encompassing 11,454 human diseases, complete with attributes such as synonyms, definitions, and hierarchical relationships. This KG, encoded by a language model, guides the model's learning process, allowing it to interpret complex medical data with greater accuracy and understanding. Furthermore, KEEP tackles the prevalent issue of noisy public pathology image-text data by filtering and restructuring it into semantic groups linked by the KG's hierarchical relations. This structured pre-training approach, which aligns semantically grouped data, significantly enhances the model's ability to understand and categorize pathology images.
The KEEP model architecture consists of a knowledge encoding stage and a vision-language alignment stage. A BERT-based text encoder is trained on the disease KG using metric learning to encode the hierarchical relationships: sim(Φ<sub>k</sub>(a<sub>i</sub>), Φ<sub>k</sub>(a′<sub>i</sub>)) ≫ sim(Φ<sub>k</sub>(a<sub>i</sub>), Φ<sub>k</sub>(a<sub>j</sub>)), i ≠ j, where a<sub>i</sub> and a′<sub>i</sub> are attributes of the same disease and a<sub>j</sub> is an attribute of a different one. This ensures that attributes of the same disease are mapped to similar embeddings. The vision-language pre-training utilizes a novel semantic-level alignment approach. Noisy public image-text data is meticulously cleaned and organized into semantic groups. Within these groups, images are augmented with cropping and random dropping, while captions are augmented with random word dropping and paraphrasing using templates like "[Template] + disease label". A metric loss function is then employed to align the visual and textual embeddings within each semantic group while separating embeddings across different groups: min sim(Φ<sub>v</sub>(x<sub>p</sub>), Φ<sub>k</sub>(ĉ<sub>i</sub>)) > max sim(Φ<sub>v</sub>(x<sub>q</sub>), Φ<sub>k</sub>(ĉ<sub>j</sub>)), i ≠ j. Strategies such as positive mining, hardest negative selection, and false negative elimination further refine the training process.
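As a rough illustration of that group-wise objective (an assumption-laden sketch, not the KEEP training code), the snippet below enforces that the hardest within-group image-caption pair still scores above the hardest cross-group pair, using a simple margin.

```python
# Group-wise alignment sketch: hardest positive vs. hardest negative with a margin.
import torch
import torch.nn.functional as F

def group_alignment_loss(img_emb, txt_emb, group_ids, margin=0.2):
    """img_emb, txt_emb: (N, D) L2-normalized embeddings; group_ids: (N,) semantic groups."""
    sim = img_emb @ txt_emb.t()                              # (N, N) cosine similarities
    same = group_ids.unsqueeze(0) == group_ids.unsqueeze(1)  # same-group mask
    pos = sim.masked_fill(~same, float("inf")).min()         # hardest within-group pair
    neg = sim.masked_fill(same, float("-inf")).max()         # hardest cross-group pair
    return F.relu(neg - pos + margin)

img = F.normalize(torch.randn(8, 128), dim=-1)
txt = F.normalize(torch.randn(8, 128), dim=-1)
groups = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
loss = group_alignment_loss(img, txt, groups)
```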
Evaluations conducted across 18 diverse benchmarks with over 14,000 whole slide images (WSIs) demonstrate KEEP's superior performance in zero-shot cancer region segmentation, detection, and subtyping. For cancer detection, KEEP achieves an average sensitivity of 89.8% at a specificity of 95.0% across seven cancer types, significantly outperforming vision-only models and previous vision-language models. In the challenging task of rare cancer subtyping (30 rare brain cancers), KEEP attains a median balanced accuracy of 0.456, surpassing the previous best model by an impressive 8.5 percentage points. These compelling results highlight KEEP's immense potential for clinical application, particularly in the diagnosis of rare and challenging tumor types.
The remarkable success of KEEP can be attributed to two key factors: the integration of disease knowledge during the training process and the tumor-ratio-based prediction method. The incorporation of disease knowledge enhances the model's understanding of disease characteristics and improves semantic alignment, while the tumor-ratio-based prediction offers superior interpretability, mimicking the diagnostic reasoning employed by human pathologists. Despite these advancements, some limitations remain, such as the potential for false positives in cancer region segmentation due to the independent processing of image tiles and the reliance on prompt engineering for zero-shot performance. Future research directions include incorporating contextual information for segmentation, exploring few-shot learning for rare cancer subtypes, and implementing prompt learning for enhanced adaptability and robustness. The integration of genomic or epigenomic information also presents a promising avenue for further improving the model's diagnostic capabilities.
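For intuition, here is a hedged sketch of what a tumor-ratio-based slide call might look like: each WSI tile is classified zero-shot against text prompts, and the slide-level prediction follows the class with the largest tile fraction. The function and its handling of the normal class are illustrative assumptions rather than the paper's exact procedure.

```python
# Tumor-ratio-based slide prediction (illustrative assumption, not KEEP's code).
import torch

def slide_prediction(tile_probs, class_names, normal_idx=0):
    """tile_probs: (num_tiles, num_classes) zero-shot probabilities per tile."""
    tile_labels = tile_probs.argmax(dim=-1)
    ratios = torch.bincount(tile_labels, minlength=len(class_names)).float() / len(tile_labels)
    tumor_ratios = ratios.clone()
    tumor_ratios[normal_idx] = 0.0                        # ignore the normal class
    predicted = class_names[int(tumor_ratios.argmax())]   # subtype with the largest tile ratio
    return predicted, ratios

classes = ["normal", "subtype_A", "subtype_B"]
probs = torch.softmax(torch.randn(500, 3), dim=-1)
label, ratios = slide_prediction(probs, classes)
```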
Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning by Hai-Ming Xu, Qi Chen, Lei Wang, Lingqiao Liu https://arxiv.org/abs/2412.10840
Caption: The top figure illustrates the process of grounding a visual element given a text query using cross-attention maps from a pre-trained MLLM. The bottom figure details the novel Self-Attention Head Selection mechanism within the Tuning-free Attention-driven Grounding (TAG) pipeline, where attention maps are aggregated and filtered to select the most relevant heads for accurate grounding without fine-tuning.
Multimodal Large Language Models (MLLMs) are transforming how we interact with Graphical User Interfaces (GUIs), but accurately pinpointing GUI elements (grounding) presents a significant challenge. Current methods often rely on fine-tuning MLLMs on specialized datasets, a process that can be resource-intensive and susceptible to overfitting. This paper introduces a novel approach called Tuning-free Attention-driven Grounding (TAG) that leverages the inherent attention mechanisms within pre-trained MLLMs, eliminating the need for additional fine-tuning.
TAG operates by strategically selecting and aggregating attention maps from within the MLLM. Instead of directly predicting element locations, TAG identifies specific tokens from the user query or the model's generated response and propagates the corresponding attention values back to the image plane. A key innovation of TAG is the adaptive text token selection, which prioritizes attention between the most relevant tokens, improving localization accuracy. Furthermore, a self-attention head selection mechanism filters irrelevant attention heads, ensuring only the most pertinent information is used for grounding. The core of the method involves propagating attention from selected text tokens {Tj} to image patches. First, head-wise attention maps A<sub>llm</sub> ∈ [0, 1]<sup>N×T×Q</sup> are generated, representing attention between text tokens and visual query tokens across all layers. These are aggregated using weighted summation: A<sub>llm</sub>(T<sub>j</sub>) = (1/N) Σ<sub>k=1</sub><sup>N</sup> a<sub>k,j</sub> A<sub>llm</sub>[k, j, :] ∈ [0, 1]<sup>Q</sup>. Then, attention is propagated to image patches using R<sub>j</sub> = A<sub>llm</sub>(T<sub>j</sub>) × A<sub>cross</sub> ∈ [0, 1]<sup>H⋅W</sup>, where A<sub>cross</sub> ∈ [0, 1]<sup>Q×(H⋅W)</sup> represents attention between visual query tokens and image patches. Finally, an overall relationship is obtained by averaging across selected text tokens: R = (1/T) Σ<sub>j∈{1,2,...,T}</sub> R<sub>j</sub> ∈ [0, 1]<sup>H⋅W</sup>.
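The following sketch traces that propagation step with random tensors standing in for attention maps extracted from a pretrained MLLM: head-wise attention is aggregated per selected text token, multiplied by the cross-attention to image patches, and averaged into a single relevance map. Shapes follow the notation above; the uniform head weights are a simplifying assumption in place of TAG's head selection.

```python
# Attention propagation sketch following the R_j and R definitions above.
import torch

N, T, Q, H, W = 4, 3, 16, 24, 24
A_llm = torch.rand(N, T, Q)            # text-token -> visual-query attention
A_cross = torch.rand(Q, H * W)         # visual-query -> image-patch attention
head_weights = torch.ones(N, T) / N    # a_{k,j}; uniform here, selected heads in TAG

# Aggregate over heads for each selected text token, then propagate to patches.
A_agg = (head_weights.unsqueeze(-1) * A_llm).sum(dim=0)   # (T, Q)
R_per_token = A_agg @ A_cross                             # (T, H*W)
R = R_per_token.mean(dim=0).reshape(H, W)                 # relevance map over patches

# The grounded location can then be read off as, e.g., the argmax patch.
best = torch.argmax(R)
row, col = divmod(best.item(), W)
```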
The researchers evaluated TAG on three benchmarks: a novel Optical Character Grounding (OCG) dataset derived from Mind2Web, the ScreenSpot GUI element grounding dataset, and the Mind2Web GUI agent evaluation dataset. On the OCG dataset, TAG achieved an average accuracy of 84.5% across 10 different aspect ratios, significantly outperforming the fine-tuned SeeClick model (60.2%) and the foundation model MiniCPMV2.5 (48.1%). On ScreenSpot, TAG achieved an average accuracy of 54.8%, surpassing both tuning-free and tuning-based state-of-the-art methods. It also demonstrated superior performance on the Mind2Web agent evaluation, achieving comparable accuracy to the best fine-tuned approach. The results demonstrate that TAG effectively leverages the inherent spatial awareness of pre-trained MLLMs, offering a more efficient and scalable alternative to traditional fine-tuning methods for GUI grounding. The tuning-free nature of TAG is particularly appealing, as it simplifies deployment and reduces the risk of overfitting to specific datasets. This innovative approach opens exciting new avenues for developing more robust and adaptable AI agents capable of seamlessly interacting with a wide range of GUIs.
CATALOG: A Camera Trap Language-guided Contrastive Learning Model by Julian D. Santamaria, Claudia Isaza, Jhony H. Giraldo https://arxiv.org/abs/2412.10624
Caption: The diagram illustrates the architecture of CATALOG, a novel foundation model for camera trap image recognition. It leverages multiple foundation models, including CLIP, LLaVA, and BERT, to learn domain-invariant features and align multi-modal information from images and text descriptions, ultimately improving accuracy in identifying animal species from camera trap images. The model uses a contrastive loss function to optimize the alignment of image and text embeddings, leading to improved performance in both in-domain and out-of-domain evaluations.
Camera traps are revolutionizing wildlife research, but analyzing the massive volumes of image data they generate remains a significant challenge. Traditional computer vision models often struggle with the inherent variability in camera trap images, such as changing lighting conditions, animal camouflage, and occlusions. Furthermore, these models are often susceptible to domain shift – the difference between the dataset used for training and the dataset used for testing. A new model, Camera Trap Language-guided Contrastive Learning (CATALOG), addresses these limitations by harnessing the power of foundation models (FMs).
CATALOG combines multiple FMs, including a large language model (LLM), CLIP, LLaVA, and BERT. The model's innovative approach involves three key components. First, it combines text information from various sources, including LLM-generated descriptions and predefined templates, by calculating the centroid in the embedding space. This creates a more robust and comprehensive representation of each animal category. Second, CATALOG aligns the multi-modal (text and image) features using a convex combination of different sources, controlled by a hyperparameter α. The alignment is calculated using the following formula: S = αW + (1 - α)Q, where W and Q represent the cosine similarity matrices between text and image embeddings, and text and image-text embeddings, respectively. Third, it utilizes a contrastive loss function to train the model, encouraging it to learn domain-invariant features. This loss function is defined as: L(S) = (1/B) Σᵢ [-log (exp(Sᵢₖ/T) / Σⱼ exp(Sᵢⱼ/T))], where B is the batch size, Sᵢₖ is the similarity between the i-th image and its correct class k, T is a temperature parameter, and j iterates over all classes.
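Putting the two formulas together, a minimal sketch (with random embeddings and illustrative hyperparameter values, not the CATALOG code) looks like this:

```python
# Convex combination of similarity matrices plus a temperature-scaled contrastive loss.
import torch
import torch.nn.functional as F

B, C, D = 16, 5, 256                     # batch size, classes, embedding dim
alpha, temperature = 0.7, 0.1            # illustrative hyperparameter values

img = F.normalize(torch.randn(B, D), dim=-1)       # image embeddings
txt = F.normalize(torch.randn(C, D), dim=-1)       # per-class text-centroid embeddings
img_txt = F.normalize(torch.randn(B, D), dim=-1)   # image-text embeddings (e.g. LLaVA-based)

W = img @ txt.t()                         # image vs. text similarities, (B, C)
Q = img_txt @ txt.t()                     # image-text vs. text similarities, (B, C)
S = alpha * W + (1 - alpha) * Q           # convex combination controlled by alpha

labels = torch.randint(0, C, (B,))        # ground-truth class per image
# L(S) = (1/B) * sum_i -log( exp(S_ik / T) / sum_j exp(S_ij / T) )
loss = F.cross_entropy(S / temperature, labels)
```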
The researchers evaluated CATALOG on two benchmark datasets: Snapshot Serengeti and Terra Incognita. In out-of-domain evaluations (trained on Snapshot Serengeti, tested on Terra Incognita), CATALOG achieved an accuracy of 48.59% on the Cis-Test set and 41.92% on the Trans-Test set, significantly outperforming existing FMs like CLIP and WildCLIP. In in-domain evaluations (training and testing on the same dataset), CATALOG achieved 90.63% accuracy on Snapshot Serengeti and 89.64% and 84.32% on the Cis-Test and Trans-Test sets of Terra Incognita, respectively. Ablation studies confirmed the importance of each component of CATALOG, particularly the CLIP image encoder and the combined use of LLM descriptions and predefined templates. CATALOG represents a substantial step forward in camera trap image recognition. Its ability to leverage multiple FMs and learn domain-invariant features makes it particularly well-suited for real-world applications where domain shift is a major challenge. While the model shows promising results, further research is needed to improve its performance in out-of-domain scenarios and to close the gap between in-domain and out-of-domain accuracy. Collecting larger, more diverse datasets and incorporating expert knowledge are promising avenues for future development.
This newsletter has showcased several significant advancements in the field of multimodal image and text foundation models. From unifying patch and WSI analysis in computational pathology with CPath-Omni to leveraging the inherent attention mechanisms of pre-trained MLLMs for GUI grounding with TAG, these models are pushing the boundaries of what's possible. The introduction of MEDMAX, a large-scale biomedical instruction-tuning dataset, and the knowledge-enhanced KEEP model for cancer diagnosis further demonstrate the transformative potential of these models in specialized domains. Finally, CATALOG's innovative approach to camera trap image recognition highlights the power of combining multiple FMs and contrastive learning to address the challenges of domain shift. These advancements underscore the rapid pace of innovation in this field and pave the way for even more powerful and versatile multimodal models in the future.