The convergence of vision and language is reshaping the landscape of AI, and this newsletter dives into the latest breakthroughs in multimodal image and text foundation models. From revolutionizing pathology to detecting image manipulation, and from generating creative content to enhancing zero-shot learning, these papers showcase the power and potential of combining pixels and prose. Let's explore the innovative architectures, training strategies, and exciting applications driving this rapidly evolving field.
Multimodal Whole Slide Foundation Model for Pathology by Tong Ding, Sophia J. Wagner, Andrew H. Song, Richard J. Chen, Ming Y. Lu, Andrew Zhang, Anurag J. Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, Drew F.K. Williamson, Bowen Chen, Cristina Almagro-Perez, Paul Doucet, Sharifa Sahai, Chengkuan Chen, Daisuke Komura, Akihiro Kawabe, Shumpei Ishikawa, Georg Gerber, Tingying Peng, Long Phi Le, Faisal Mahmood https://arxiv.org/abs/2411.19666
Caption: This image describes TITAN, a multimodal foundation model for whole-slide image analysis in pathology. It leverages a three-stage pretraining process, combining visual and textual data from WSIs, synthetic captions, and pathology reports to learn rich slide representations. The diagram illustrates TITAN's architecture, data processing steps (patch encoding, region cropping), and downstream applications like report generation and rare cancer retrieval.
The field of computational pathology is being transformed by foundation models trained via self-supervised learning (SSL). These models convert histopathology regions-of-interest (ROIs) into versatile feature representations. However, applying these advancements to patient and slide-level clinical challenges is hampered by limited clinical data, especially for rare diseases. TITAN, a multimodal whole-slide foundation model, addresses this limitation. Unlike models focused solely on visual data, TITAN incorporates both visual and textual information, enabling a more comprehensive analysis.
TITAN's pretraining involves three stages. Stage 1 focuses on visual self-supervised pretraining using 335,645 WSIs across 20 organ types, learning ROI-level features. Stage 2 incorporates cross-modal alignment using 423,122 synthetic captions generated by a pathology AI called PathChat, linking visual features with morphological descriptions. Stage 3 aligns WSI representations with 182,862 pathology reports, adding clinical context. This approach leverages millions of high-resolution ROIs and employs a novel paradigm utilizing pre-extracted patch features, enabling large-scale, resolution-agnostic pretraining and scalable WSI encoding. To handle long WSI sequences, TITAN tiles slides into non-overlapping 512×512-pixel patches and applies Attention with Linear Biases (ALiBi) at inference, computing attention as softmax(qᵢkⱼ − m·√((iₓ − jₓ)² + (iᵧ − jᵧ)²)), where qᵢ and kⱼ are the query and key vectors, m is a head-specific slope, and (iₓ, iᵧ) and (jₓ, jᵧ) are the 2D grid coordinates of patches i and j.
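To make the distance-biased attention above concrete, here is a minimal PyTorch-style sketch of a 2D ALiBi bias computed from patch grid coordinates; the per-head slope handling and how TITAN folds this bias into its attention layers are assumptions for illustration only.

```python
import torch

def alibi_2d_bias(coords: torch.Tensor, slope: float) -> torch.Tensor:
    """coords: (N, 2) integer (x, y) grid positions of N patches.
    Returns an (N, N) additive bias for the pre-softmax attention scores."""
    diffs = coords[:, None, :].float() - coords[None, :, :].float()  # (N, N, 2)
    dist = torch.sqrt((diffs ** 2).sum(dim=-1))                      # Euclidean grid distance
    return -slope * dist                                             # nearer patches are penalized less

# Usage (per attention head, matching the expression above):
# scores = q @ k.transpose(-2, -1) + alibi_2d_bias(coords, m)
# attn = scores.softmax(dim=-1)
```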
Evaluations show TITAN outperforms existing ROI and slide foundation models across various tasks. In morphological subtyping, TITAN and its vision-only variant (TITANᵥ) showed significant improvements over the next-best model (+8.4% and +6.7%, respectively, in balanced accuracy/AUROC). TITAN also excelled in few-shot learning, especially one-shot scenarios. In cross-modal zero-shot classification, TITAN substantially outperformed PRISM (+56.52% in balanced accuracy and +13.8% in AUROC). TITAN's report generation capabilities also surpassed PRISM by an average of 161% across various metrics. Its superior performance in rare cancer retrieval highlights its potential for complex diagnostics. TITAN's multimodal pretraining, combining visual self-supervision with vision-language alignment, enables it to capture multiscale morphological semantics and generalize effectively. While promising, limitations include the dataset size relative to patch-level models and potential improvements through larger pretraining contexts and advanced report preprocessing.
ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection by Zhihao Sun, Haoran Jiang, Haoran Chen, Yixin Cao, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang https://arxiv.org/abs/2411.19466
Caption: This diagram illustrates the architecture of ForgerySleuth, a framework that leverages Multimodal Large Language Models (M-LLMs) for image manipulation detection. It integrates a trace encoder with an M-LLM, allowing the model to capture both high-level semantic inconsistencies and low-level forgery traces, which are then fused with visual features by a vision decoder to generate precise segmentation masks of manipulated regions. The framework also incorporates a novel Chain-of-Clues prompting strategy and the ForgeryAnalysis dataset to enhance the M-LLM's reasoning capabilities for explaining detected forgeries.
Multimodal Large Language Models (M-LLMs) offer exciting possibilities, but their direct application to image manipulation detection (IMD) has limitations. M-LLMs often generate hallucinated and overthought explanations, lacking the precision of traditional segmentation-based methods. Recognizing this, ForgerySleuth empowers M-LLMs for IMD, generating both textual explanations and precise segmentation masks of tampered regions.
ForgerySleuth combines an M-LLM with a trace encoder to capture both high-level semantic anomalies and low-level forgery traces. The trace encoder, using constrained convolutions with residual connections, learns manipulation features. A vision decoder, inspired by Transformer segmentation models, fuses the M-LLM's high-level anomalies, the trace encoder's low-level traces, and dense visual features. This facilitates comprehensive clue fusion and accurate mask generation. The model is trained end-to-end using a weighted loss function: L = λ<sub>txt</sub>L<sub>txt</sub> + λ<sub>mask</sub>L<sub>mask</sub>, balancing textual and mask losses.
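As a rough sketch of how such a weighted objective can be composed (not ForgerySleuth's exact implementation), the snippet below pairs an autoregressive text loss with a BCE-plus-Dice mask loss; the choice of individual loss terms and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_text_mask_loss(text_logits, text_targets, mask_logits, mask_targets,
                            lambda_txt: float = 1.0, lambda_mask: float = 1.0):
    """text_logits: (B, T, V) explanation-token logits; text_targets: (B, T) token ids.
    mask_logits / mask_targets: (B, 1, H, W) predicted logits and binary ground-truth masks."""
    # Autoregressive text loss over the generated explanation tokens.
    l_txt = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    # Pixel-level mask loss: binary cross-entropy plus a Dice term (assumed composition).
    bce = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    probs = torch.sigmoid(mask_logits)
    dice = 1 - (2 * (probs * mask_targets).sum() + 1) / (probs.sum() + mask_targets.sum() + 1)
    l_mask = bce + dice
    # L = lambda_txt * L_txt + lambda_mask * L_mask, as in the formula above.
    return lambda_txt * l_txt + lambda_mask * l_mask
```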
To improve M-LLM reasoning for IMD, the ForgeryAnalysis dataset was created. This dataset, generated using a "Chain-of-Clues" prompt with GPT-4o and expert refinement, provides detailed reasoning about tampered regions. A data engine further generates a larger ForgeryAnalysis-PT dataset for pre-training.
Experiments show ForgerySleuth's superior performance. It outperforms state-of-the-art methods in pixel-level localization by up to 24.7%, achieving impressive AUC scores across various datasets. On ForgeryAnalysis-Eval, ForgerySleuth surpasses GPT-4o by 35.8% in comprehensive scoring, showcasing improved generalization, robustness, and explainability. The robust evaluation on NIST16 with distortions further highlights its resilience. Ablation studies confirm the contribution of each component.
JetFormer: An Autoregressive Generative Model of Raw Images and Text by Michael Tschannen, André Susano Pinto, Alexander Kolesnikov https://arxiv.org/abs/2411.19722
Caption: This diagram illustrates the architecture of JetFormer, a transformer-based model that uses a normalizing flow ("Flow") to encode images into a tokenized representation. This allows the transformer to jointly process text and image inputs, optimizing for the raw data likelihood and enabling both image and text generation. The loss function compares the target tokens with the transformer's output, guiding the training process.
JetFormer, a novel autoregressive, decoder-only transformer, offers a streamlined approach to joint generative modeling of images and text. Unlike models relying on separate components, JetFormer directly maximizes the likelihood of raw data, enabling it to both understand and generate both modalities. Central to JetFormer is a normalizing flow model ("jet") that encodes images into a soft-token representation, also serving as a decoder during inference. This unified architecture is trained end-to-end.
Addressing the challenge of global coherence in likelihood-based image generation, JetFormer employs a noise curriculum during training, gradually decreasing noise levels to prioritize high-level image structure. It also tackles image redundancy by factoring out redundant dimensions or using PCA for dimensionality reduction.
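Below is an illustrative sketch of such a noise curriculum, with Gaussian noise whose strength decays over training so that early steps emphasize global structure; the cosine schedule and maximum noise level are assumptions rather than JetFormer's published settings.

```python
import math
import torch

def add_curriculum_noise(images: torch.Tensor, step: int, total_steps: int,
                         sigma_max: float = 0.25) -> torch.Tensor:
    """images: batch of images scaled to [0, 1]; noise strength decays from sigma_max to 0."""
    progress = min(step / total_steps, 1.0)
    sigma = sigma_max * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay (assumption)
    return (images + sigma * torch.randn_like(images)).clamp(0.0, 1.0)
```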
JetFormer's performance is competitive with existing methods. On ImageNet256, it achieves a FID score comparable to VQVAE- and VAE-based baselines, while also providing strong log-likelihood bounds. On MS-COCO, its text-to-image generation shows promising results. Beyond generation, JetFormer demonstrates robust image understanding capabilities, performing well on zero-shot classification, image captioning, and visual question answering. While its text-to-image generation currently trails diffusion models, its unified architecture and explicit log-likelihood modeling represent a significant step towards more flexible generative models.
HOPPR Medical-Grade Platform for Medical Imaging AI by Kalina P. Slavkova, Melanie Traughber, Oliver Chen, Robert Bakos, Shayna Goldstein, Dan Harms, Bradley J. Erickson, Khan M. Siddiqui https://arxiv.org/abs/2411.17891
The HOPPR Medical-Grade Platform aims to overcome the barriers hindering widespread LVLM adoption in medical imaging. High computational costs, specialized AI expertise requirements, and limited access to diverse datasets have slowed progress. HOPPR tackles these challenges with a comprehensive platform built on four pillars: Platform, Data, Foundation Models, and Validation & Regulatory.
HOPPR's massive dataset of over 120 million imaging studies, with 70 million added annually, sets it apart. Sourced from over 400 imaging centers across eight states, this diverse dataset enables the training of robust foundation models that generalize well across different populations.
The platform allows users to fine-tune HOPPR's pre-trained foundation models for specific clinical applications, drastically reducing resource requirements compared to training from scratch. Both self-service tools and expert support are provided. Integration into clinical workflows is facilitated via a secure API, with HIPAA compliance ensured through robust de-identification techniques. HOPPR's commitment to quality control, operating under an ISO 13485-compliant QMS, sets a new standard for foundation models in medical imaging, building trust in deployed AI solutions. By providing the infrastructure, data, and tools, HOPPR accelerates AI development and clinical implementation, optimizing workflows and meeting the growing demands of medical imaging.
Automatic Prompt Generation and Grounding Object Detection for Zero-Shot Image Anomaly Detection by Tsun-Hin Cheung, Ka-Chun Fung, Songjiang Lai, Kwan-Ho Lin, Vincent Ng, Kin-Man Lam https://arxiv.org/abs/2411.19220
Caption: This diagram illustrates the zero-shot anomaly detection pipeline. It leverages an LLM for prompt generation (positive and negative descriptions of the product), Grounding DINO for object detection and cropping, and CLIP for image-text matching to produce an anomaly score. The text branch processes prompts while the image branch processes the input image, both feeding into the anomaly detection module.
This paper presents a novel zero-shot, training-free method for automated industrial image anomaly detection. This approach leverages a multimodal pipeline of three foundation models: an LLM (GPT-3), a grounding object detection model (Grounding DINO), and a zero-shot image-text matching model (CLIP). This eliminates the reliance on extensive labeled training data.
The method utilizes GPT-3 to generate text prompts describing normal and anomalous product appearances. Grounding DINO then locates the product within the image, crucial for mitigating background noise and multi-resolution issues. Finally, CLIP compares the cropped product image to the generated prompts, producing an anomaly score (s) calculated as:
s = (e<sub>fused</sub> ⋅ t<sub>anomaly</sub>) / (e<sub>fused</sub> ⋅ t<sub>anomaly</sub> + e<sub>fused</sub> ⋅ t<sub>normal</sub>)
where e<sub>fused</sub> is the fused image embedding, and t<sub>anomaly</sub> and t<sub>normal</sub> are the text embeddings for "anomaly" and "normal" prompts, respectively.
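A minimal sketch of this scoring step, assuming the cropped product image from Grounding DINO and the LLM-generated prompt lists are already in hand, might look like the following; the CLIP checkpoint and the mean aggregation over multiple prompts are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def anomaly_score(crop: Image.Image, normal_prompts: list[str],
                  anomaly_prompts: list[str]) -> float:
    inputs = processor(text=normal_prompts + anomaly_prompts, images=crop,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)   # (1, D)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)     # (P, D)
    sims = (img @ txt.T).squeeze(0)                     # cosine similarity per prompt
    s_normal = sims[: len(normal_prompts)].mean()       # aggregated "normal" match
    s_anomaly = sims[len(normal_prompts):].mean()       # aggregated "anomaly" match
    return (s_anomaly / (s_anomaly + s_normal)).item()  # ratio score as in the formula above
```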
This method achieves impressive results on MVTec-AD (AUROC of 93.2%, AUPR of 96.6%) and VisA (AUROC of 82.9%, AUPR of 85.7%), outperforming existing zero-shot and few-shot methods. Ablation studies confirm the importance of both prompt generation and object detection. This research demonstrates the potential of combining LLMs, grounding object detection, and zero-shot image-text matching for efficient and scalable industrial anomaly detection, paving the way for future research in this area.
CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections by Mohamed Fazli Imam, Rufael Fedaku Marew, Jameel Hassan, Mustansar Fiaz, Alham Fikri Aji, Hisham Cholakkal https://arxiv.org/abs/2411.19346
Caption: The figure illustrates the NoLA (No Labels Attached) framework for zero-shot image classification. (a) shows the generation of Class Description Embeddings (CDEs) using an LLM and CLIP's text encoder. (b) depicts the DINO-based Labeling (DL) network aligning with CLIP's embedding space. (c) visualizes DINO-assisted prompt learning, where learnable visual prompts are prepended to the CLIP vision encoder input and trained using the DL network as an auto-labeller. Snowflakes represent frozen weights and flames represent trainable weights.
While VLMs like CLIP excel in zero-shot image classification, their performance often falls short of supervised approaches. NoLA (No Labels Attached) addresses this by combining CLIP with the self-supervised model DINO to fine-tune zero-shot classifiers using only unlabeled images.
NoLA's three-stage process begins by leveraging LLMs to generate robust textual embeddings from class-specific descriptions, forming a Class Description Embedding (CDE) classifier $\phi$: $\phi_c = \frac{1}{K} \sum_{i=1}^{K} f_t(d_i^c; \theta_t)$ and $\phi = \text{Concat}[\phi_1, \phi_2, ..., \phi_C]$, where $\phi_c$ is the embedding for class $c$, $d_i^c$ is the $i$-th LLM-generated description for class $c$, $K$ is the number of descriptions per class, and $f_t$ is the CLIP text encoder. Next, a DINO-based Labeling (DL) network, comprising a pre-trained DINO vision encoder and an alignment module, is trained using top-k confident samples from the CDE classifier. Finally, DINO-assisted prompt learning fine-tunes CLIP's vision encoder using learnable visual prompts trained with the DL network as an auto-labeller. The training objective is $\min_{\theta_p, \phi} L_{SCE}(\phi(F_v(X_s; \theta_v, \theta_p)), h(g_s(X_o; \theta_g)))$, where $L_{SCE}$ is the smoothed cross-entropy loss, $F_v$ is the CLIP vision encoder with visual prompt parameters $\theta_p$, $X_s$ and $X_o$ are augmented views of an image, $h$ and $g_s$ are the DL network's alignment module and DINO encoder, and $\theta_g$ are the DINO parameters.
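As a minimal sketch of the CDE construction, assuming a Hugging Face CLIP checkpoint and a dictionary of LLM-generated descriptions keyed by class name (both placeholders), the classifier can be built by averaging and stacking text embeddings:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def build_cde_classifier(descriptions_per_class: dict[str, list[str]]) -> torch.Tensor:
    """Returns phi as a (C, D) matrix; row c averages the K description embeddings of class c."""
    rows = []
    for cls_name, descriptions in descriptions_per_class.items():
        inputs = processor(text=descriptions, return_tensors="pt", padding=True)
        with torch.no_grad():
            emb = model.get_text_features(**inputs)        # (K, D) CLIP text embeddings
        emb = emb / emb.norm(dim=-1, keepdim=True)
        rows.append(emb.mean(dim=0))                       # phi_c: mean over K descriptions
    return torch.stack(rows)                               # phi: one row per class

# Pseudo-labels from (image_embedding @ phi.T).argmax(-1) would then provide the
# top-k confident samples used to train the DL network, per the description above.
```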
Evaluated on 11 datasets, NoLA achieves state-of-the-art performance on 9, surpassing existing label-free and some few-shot methods. It achieves a 3.6% average absolute gain over LaFter. Ablation studies confirm the contribution of each component, showcasing the power of combining LLM-generated descriptions and DINO's visual features for enhanced zero-shot classification.
GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation by Pengfei Zhou, Xiaopeng Peng, Jiajun Song, Chuanhao Li, Zhaopan Xu, Yue Yang, Ziyao Guo, Hao Zhang, Yuqi Lin, Yefei He, Lirui Zhao, Shuo Liu, Tianhua Li, Yuxuan Xie, Xiaojun Chang, Yu Qiao, Wenqi Shao, Kaipeng Zhang https://arxiv.org/abs/2411.18499
Caption: This diagram outlines the OpenING benchmark creation pipeline, including data curation, annotation, filtering, and processing, as well as the training and evaluation pipeline for IntJudge, a novel judge model for interleaved image-text generation. It showcases the data flow through various stages, from dev set queries and gold answers to model training with RAG data and interleaved arena comparisons, ultimately leading to IntJudge's evaluation against human and GPT judges. The diagram also illustrates the performance of different MLLMs on the OpenING benchmark, highlighting the challenges and future directions in interleaved generation.
While MLLMs excel in various tasks, interleaved image-text generation remains challenging. Existing benchmarks lack the scale and diversity to adequately evaluate these models. GATE OpenING (OpenING) aims to solve this with 5,400 human-annotated instances across 56 real-world tasks, including travel planning, design, and brainstorming. A rigorous annotation pipeline ensures high-quality data.
Recognizing the limitations of current evaluation metrics, the authors also introduce IntJudge, a judge model trained using an "Interleaved Arena" for pairwise comparisons and a "Reference-Augmented Generation" (RAG) approach for data augmentation. IntJudge achieves 82.42% agreement with human judgments, surpassing GPT-4o by 11.34%.
Experiments on OpenING reveal that integrated pipelines generally outperform end-to-end models, although two-stage generators based on unified models show promise. Generating high-quality images remains a key challenge. Interestingly, GPT-4o's generated text often surpasses human-annotated text. Ablation studies confirm the positive impact of increased sampling size and RAG data on IntJudge's performance. OpenING and IntJudge provide valuable resources for advancing interleaved image-text generation.
This newsletter highlights the exciting progress in multimodal image and text foundation models. From enhancing pathology analysis and detecting image manipulations to advancing generative capabilities and zero-shot learning, these models are transforming how we interact with and understand visual and textual information. The development of new benchmarks and evaluation metrics, like GATE OpenING and IntJudge, further fuels this progress, enabling more rigorous and nuanced assessment of these increasingly sophisticated models. The trend towards unified architectures, like JetFormer, and the innovative use of LLMs for prompt generation and reasoning, as seen in the zero-shot anomaly detection work, point towards a future where the seamless integration of vision and language unlocks even greater potential in AI.