The field of multimodal AI is exploding, with new research constantly pushing the boundaries of what's possible. This newsletter dives into the latest advancements in image and text foundation models, covering novel architectures, efficient transfer learning techniques, and safety considerations. From boosting performance in specific domains like pathology and transmission line defect detection to unifying modalities through innovative "next-frame prediction," these papers showcase the exciting trajectory of multimodal AI.
Free Lunch in Pathology Foundation Model: Task-specific Model Adaptation with Concept-Guided Feature Enhancement by Yanyan Huang, Weiqin Zhao, Yihang Chen, Yu Fu, Lequan Yu https://arxiv.org/abs/2411.09894
Caption: This diagram illustrates the Concept Anchor-guided Task-specific Feature Enhancement (CATE) framework for boosting foundation models in computational pathology. CATE uses concept anchors from a pathology vision-language model to guide two modules: the Concept-guided Information Bottleneck (CIB) and the Concept-Feature Interference (CFI), which enhance feature expressiveness and discriminativeness for improved performance in downstream tasks like cancer subtyping. The overview (a) shows the overall workflow, while (b), (c), and (d) detail the text encoder, CIB, and CFI modules, respectively.
Pathology foundation models, trained on vast datasets, offer powerful feature representations for Whole Slide Image (WSI) analysis. However, their general-purpose nature isn't always ideal for specific downstream tasks or cancer types. This research introduces Concept Anchor-guided Task-specific Feature Enhancement (CATE), a new paradigm designed to boost these models' expressiveness and discriminativeness without extra supervision or substantial computational overhead. CATE effectively provides a "free lunch" by adapting the foundation model to specific tasks, improving performance and generalization.
CATE leverages task-specific concept anchors derived from a pathology vision-language model using expert-designed prompts. These anchors guide two interconnected modules: the Concept-guided Information Bottleneck (CIB) and the Concept-Feature Interference (CFI). Following the information bottleneck principle, the CIB enhances task-relevant characteristics by maximizing the mutual information I(â; c) between the calibrated features â and the concept anchors c, while minimizing the superfluous information I(x; â|c) carried over from the original image features x. The CFI then refines the calibrated features by exploiting their similarity to the concept anchors, producing highly discriminative task-specific features.
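To make the mechanism concrete, here is a minimal PyTorch-style sketch of concept-anchor-guided enhancement. It is not the authors' implementation: the module name, the single linear layer standing in for the CIB, and the similarity-weighted interference step are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptGuidedEnhancer(nn.Module):
    """Minimal sketch of CATE-style feature enhancement (interfaces assumed).

    `concept_anchors` stand in for text embeddings of expert-designed prompts
    from a pathology vision-language model; `x` is a bag of patch features
    extracted from a WSI by a frozen foundation model.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.calibrate = nn.Linear(dim, dim)  # stands in for the CIB encoder

    def forward(self, x: torch.Tensor, concept_anchors: torch.Tensor):
        # x: (N, D) patch features; concept_anchors: (C, D)
        a_hat = F.normalize(self.calibrate(x), dim=-1)       # calibrated features â
        anchors = F.normalize(concept_anchors, dim=-1)
        # CFI-like step: similarity to the anchors re-weights the features,
        # sharpening task-relevant directions before MIL aggregation.
        sim = a_hat @ anchors.t()                            # (N, C) concept similarity
        task_feat = a_hat + sim.softmax(dim=-1) @ anchors    # concept-informed features
        return task_feat, sim

# Toy usage: 1,000 patch embeddings, 4 task-specific concept anchors.
enhancer = ConceptGuidedEnhancer(dim=512)
patches = torch.randn(1000, 512)
anchors = torch.randn(4, 512)
feats, sim = enhancer(patches, anchors)
# Training would additionally optimize the CIB objective, i.e. variational
# bounds on I(â; c) and I(x; â | c); that machinery is omitted here.
```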
Evaluated on cancer subtyping tasks using three public WSI datasets (BRCA, NSCLC, and RCC) from The Cancer Genome Atlas (TCGA), CATE consistently improved the performance of several state-of-the-art Multiple Instance Learning (MIL) models. For example, on BRCA, CATE boosted the out-of-domain (OOD) Area Under the Curve (AUC) of ABMIL by an impressive 4.05% and OOD accuracy by 5.28% when only a single site was used for in-domain training. Similar improvements were observed across other MIL models and datasets, particularly in OOD settings, highlighting CATE's ability to enhance generalization. For NSCLC and RCC, the OOD performance was emphasized due to the single-subtype nature of samples within each site, reflecting the models' discriminative and generalization capabilities.
TL-CLIP: A Power-specific Multimodal Pre-trained Visual Foundation Model for Transmission Line Defect Recognition by Ke Zhang, Zhaoye Zheng, Yurong Guo, Jiacun Wang, Jiyuan Yang, Yangjie Xiao https://arxiv.org/abs/2411.11370
Caption: This diagram illustrates the two-stage training pipeline of TL-CLIP. Stage 1 shows the power-specific vision-language pre-training with three tasks: Image-Text Contrastive learning (ITC), Component Type Matching (CTM), and Defect-Normality Comparison (DNC). Stage 2 depicts the fine-tuning process, incorporating the ITC task to mitigate overfitting and enhance downstream task performance for both classification and detection.
Traditional transmission line defect recognition models often rely on general-purpose pre-trained visual foundation models (VFMs), which lack domain-specific knowledge and therefore generalize poorly. This paper introduces TL-CLIP, a power-specific multimodal pre-trained VFM designed to close that gap. By leveraging vision-language pre-training (VLP), TL-CLIP injects power-related semantic knowledge and significantly improves defect recognition.
TL-CLIP uses a two-stage training pipeline. The first stage involves power-specific VLP, building on the Chinese CLIP (CN-CLIP) model. Beyond standard image-text contrastive learning (ITC), TL-CLIP introduces two novel pre-training tasks: Component Type Matching (CTM) and Defect-Normality Comparison (DNC). CTM enhances component discrimination by matching multimodal instances and judging inter-class relations (STSS: Same type & same status, STDS: Same type & different status, DT: Different types). DNC focuses on learning normality and defect concepts by comparing instances within the same component type. The pre-training loss is: L<sub>pre-train</sub> = λ<sub>1</sub>L<sub>itc</sub> + λ<sub>2</sub>L<sub>ctm</sub> + λ<sub>3</sub>L<sub>dnc</sub>, where L<sub>itc</sub>, L<sub>ctm</sub>, and L<sub>dnc</sub> are the respective losses, and λ<sub>1</sub>, λ<sub>2</sub>, and λ<sub>3</sub> are their weights.
The second stage is fine-tuning with the pre-training objective (FTP). To combat overfitting due to limited inspection data, FTP incorporates the ITC task into fine-tuning. This balances the strong discrimination of supervised learning with the weaker signals of contrastive learning. For defect classification, the loss is: L<sub>cls-ftp</sub> = L<sub>cls</sub> + L<sub>itc</sub>, where L<sub>cls</sub> is the classification loss. Defect detection uses a two-step approach: training the backbone and classification branch with ITC and image classification on cropped images, then training the full detector on the original dataset.
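For concreteness, here is a minimal sketch of how the objectives of the two stages compose. The individual loss terms are assumed to be computed elsewhere; the `ctm_relation` helper and the loss weights are hypothetical illustrations rather than details from the paper.

```python
import torch

# Hypothetical helper: derive the three-way CTM relation for a pair of
# instances from their (component type, status) annotations.
# 0 = STSS (same type & same status), 1 = STDS (same type & different status),
# 2 = DT (different types).
def ctm_relation(type_a: str, status_a: str, type_b: str, status_b: str) -> int:
    if type_a != type_b:
        return 2
    return 0 if status_a == status_b else 1

def tlclip_pretrain_loss(l_itc: torch.Tensor, l_ctm: torch.Tensor,
                         l_dnc: torch.Tensor,
                         lambdas=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Stage 1: weighted sum of the ITC, CTM and DNC losses.
    The weights here are placeholders, not the paper's values."""
    return lambdas[0] * l_itc + lambdas[1] * l_ctm + lambdas[2] * l_dnc

def ftp_classification_loss(l_cls: torch.Tensor, l_itc: torch.Tensor) -> torch.Tensor:
    """Stage 2 (FTP): keep the contrastive ITC term alongside the supervised
    classification loss to curb overfitting on scarce inspection data."""
    return l_cls + l_itc
```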
Experiments on transmission line defect datasets demonstrate TL-CLIP's effectiveness. On TLDC (classification), TL-CLIP achieves accuracy improvements of 2.9% and 5.6% over the CN-CLIP baseline with RN50 and ViT backbones, respectively. On TLDD (detection), it achieves mean Average Precision (mAP) improvements of 3.1% and 3.9%. Ablation studies confirm the contributions of both the power-specific VLP tasks and FTP. Attention map visualization reveals that TL-CLIP effectively learns component-related semantics and defect-normality concepts, focusing on relevant image regions.
JRadiEvo: A Japanese Radiology Report Generation Model Enhanced by Evolutionary Optimization of Model Merging by Kaito Baba, Ryota Yagi, Junichiro Takahashi, Risa Kishikawa, Satoshi Kodera https://arxiv.org/abs/2411.09933
Caption: This bar graph compares the density and weight parameters of different large language models (LLMs) used in the JRadiEvo model. It highlights the relatively balanced parameter distribution of OpenBioLLM, contributing to its crucial role in adapting the non-medical vision-language model to the medical domain. This efficient parameter usage allows JRadiEvo to achieve state-of-the-art performance with limited data.
JRadiEvo, a novel model for generating Japanese radiology reports, leverages evolutionary optimization to merge a non-medical vision-language model with medical and Japanese language text-to-text models. This addresses data scarcity in non-English medical contexts and limitations of relying on computationally expensive and privacy-compromising APIs. Remarkably, JRadiEvo achieves state-of-the-art performance with only 50 translated samples from the MIMIC-CXR dataset.
Its innovation lies in its efficient use of limited data and novel application of model merging. Traditional methods require large datasets, a hurdle in the privacy-conscious medical field. JRadiEvo bypasses this using an evolutionary algorithm to optimize the merging process (TIES-Merging with DARE), integrating the strengths of multiple pre-trained models without extensive training data. The model merges the parameters of a large language model (LLM) component (M<sub>L</sub>) of a vision-language model fine-tuned on vision-to-text data (θ<sup>ft</sup><sub>t1</sub>) with LLMs fine-tuned on medical and Japanese text-to-text data (θ<sup>ft</sup><sub>t2</sub>, θ<sup>ft</sup><sub>t3</sub>,...). The output (y) is: y = M<sub>L</sub>(M<sub>P</sub>(M<sub>v</sub>(x<sub>I</sub>)), x<sub>T</sub>), where x<sub>I</sub> is the image, x<sub>T</sub> is the text, M<sub>v</sub> is the vision encoder, and M<sub>P</sub> is the projector.
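The merging step can be pictured roughly as follows: form DARE-style sparsified task vectors for each source LLM and add them onto the base weights with searched coefficients. This is a schematic sketch over `state_dict`-like parameter dictionaries, not the authors' code; a faithful TIES-Merging implementation would also resolve sign conflicts between task vectors.

```python
import torch

def dare_task_vector(base: dict, finetuned: dict, drop_p: float = 0.9) -> dict:
    """DARE: randomly drop a fraction of each task vector's entries and
    rescale the survivors by 1 / (1 - drop_p). Simplified sketch."""
    tv = {}
    for name in base:
        delta = finetuned[name] - base[name]
        mask = (torch.rand_like(delta) > drop_p).float()
        tv[name] = delta * mask / (1.0 - drop_p)
    return tv

def merge(base: dict, task_vectors: list, weights: list) -> dict:
    """Weighted addition of task vectors onto the base LLM weights."""
    merged = {name: p.clone() for name, p in base.items()}
    for tv, w in zip(task_vectors, weights):
        for name in merged:
            merged[name] += w * tv[name]
    return merged

# The per-model weights (and drop rates) are what the evolutionary search
# optimizes: propose candidate coefficient vectors, evaluate each merged model
# on the small validation set of translated reports, keep the fittest, mutate,
# and repeat. That loop is omitted here.
```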
Evaluated against leading models (CheXagent, GPT-4o, and instruction-tuned approaches) using BLEU, ROUGE-L, and METEOR, JRadiEvo consistently outperformed the instruction-tuned models and CheXagent. While GPT-4o scored higher on BLEU, JRadiEvo excelled on ROUGE-L and METEOR, metrics that track human judgment more closely. This is impressive given its compact size (8 billion parameters), which allows local deployment within hospitals and thus addresses the privacy and security concerns that rule out API-based models like GPT-4o.
Analysis revealed that OpenBioLLM's medical knowledge is crucial for adapting the non-medical VLM. While Llama 3 Swallow contributed Japanese proficiency, the lack of medical knowledge proved a greater limitation than language, underscoring the importance of domain-specific knowledge. JRadiEvo demonstrates the potential of evolutionary model merging for efficient domain adaptation, particularly for non-English languages and resource-limited settings.
Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations by Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, Mahesh Pasupuleti https://arxiv.org/abs/2411.10414
Caption: An example from Llama Guard 3 Vision's evaluation dataset demonstrates the model's ability to analyze image-prompt pairs and classify responses for safety. The model assesses the user's question about tax evasion alongside an image of tax documents, classifying the agent's response as "unsafe" due to its advice on illegal activities (S2). This highlights the model's capability to identify harmful content within multimodal conversations.
Meta introduces Llama Guard 3 Vision, a multimodal safeguard for human-AI image conversations. Unlike previous text-only versions, it supports image reasoning, detecting harmful content in both multimodal prompts (prompt classification) and responses (response classification). This addresses the growing need for safety in vision-language models. Fine-tuned from Llama 3.2-Vision, it uses the MLCommons taxonomy of 13 hazard categories.
The model was trained with supervised fine-tuning (sequence length 8192, learning rate 1 × 10⁻⁵, 3600 training steps) on a hybrid dataset of human-generated and synthetic examples: image-prompt pairs with model-generated responses, some elicited via jailbreaking. Data augmentation, dropping random categories from the safety prompt and shuffling category indices, improved generalization.
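A rough sketch of that augmentation is below. It assumes the safety prompt enumerates hazard categories by index and that labels are re-mapped after dropping and shuffling; the function name, the keep probability, and the abridged category list are illustrative, not taken from the paper.

```python
import random

# Illustrative subset of the 13 MLCommons hazard categories.
CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S9": "Indiscriminate Weapons",
    "S13": "Elections",
}

def augment_prompt_categories(violated: set, keep_prob: float = 0.7):
    """Drop random non-violated categories and shuffle/re-index the rest,
    so the classifier does not memorize fixed category positions.
    Assumes `violated` only contains keys present in CATEGORIES."""
    kept = [c for c in CATEGORIES if c in violated or random.random() < keep_prob]
    random.shuffle(kept)
    remap = {old: f"S{i + 1}" for i, old in enumerate(kept)}        # new indices
    prompt_lines = [f"{remap[c]}: {CATEGORIES[c]}." for c in kept]
    new_labels = sorted(remap[c] for c in violated if c in remap)   # violated kept
    return prompt_lines, new_labels
```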
Evaluation on an internal benchmark built around the MLCommons taxonomy showed Llama Guard 3 Vision outperforming GPT-4o and GPT-4o mini in F1 score, particularly for response classification (F1: 0.938 vs. 0.667 and 0.641), with a lower false-positive rate. Prompt classification was weaker (F1: 0.733) due to the inherent ambiguity of image-text prompts, but the model still achieved F1 scores above 0.69 in every category, excelling in "Indiscriminate Weapons" and "Elections."
Robustness testing (PGD image attacks and GCG text attacks) showed the model is more robust in response classification than in prompt classification. Small PGD perturbations raised prompt misclassification (harmful content judged safe) from 21% to 70%, while even unbounded PGD attacks reached only 27% misclassification on responses. GCG attacks could bypass prompt classification (72% misclassification), whereas response classification remained comparatively more robust (30%-75% misclassification). This highlights the importance of running both classification modes and the need for further research into adversarial robustness.
Everything is a Video: Unifying Modalities through Next-Frame Prediction by G. Thomas Hudson, Dean Slack, Thomas Winterbottom, Jamie Sterling, Chenghao Xiao, Junjie Shentu, Noura Al Moubayed https://arxiv.org/abs/2411.10503
Caption: This image showcases examples of the "Everything is a Video" framework, demonstrating its application across diverse multimodal tasks. Each row represents a different task (video classification, video colorization, video QA, and text classification), with input frames and corresponding predictions or outputs displayed as a unified video sequence. The colored bars represent attention weights, highlighting the model's focus on relevant information within the sequence.
This research proposes a novel framework extending task reformulation from NLP to multimodal learning. The core idea: reformulate diverse tasks into a unified next-frame prediction problem. This allows a single model to handle various modalities without modality-specific components, treating inputs and outputs as sequential video frames. This facilitates seamless modality integration and knowledge transfer.
The methodology converts every task into a standardized 64x64 RGB video sequence. Text is tokenized and rendered as frames, while images and audio are preprocessed into the same format. A separator token (|) delineates input and output frames. A transformer-based model, inspired by ViT and TimeSformer, applies local and global spatiotemporal self-attention to patches at progressively smaller resolutions. Training is end-to-end on each dataset independently with a multi-scale structural similarity (MS-SSIM) loss, without any language-model or image pre-training.
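As a toy illustration of the packing, the sketch below renders a text example into 64x64 frames, inserts a separator frame, and appends the target frames the model must predict. The chunking, rendering, and separator details are assumptions; the paper tokenizes text and supervises predicted frames with the MS-SSIM loss.

```python
import numpy as np
from PIL import Image, ImageDraw

FRAME = 64  # every task is expressed as a sequence of 64x64 RGB frames

def text_to_frames(text: str, chars_per_frame: int = 8) -> np.ndarray:
    """Hypothetical helper: render fixed-size chunks of text as frames."""
    chunks = [text[i:i + chars_per_frame] for i in range(0, len(text), chars_per_frame)]
    frames = []
    for chunk in chunks:
        img = Image.new("RGB", (FRAME, FRAME), "black")
        ImageDraw.Draw(img).text((2, 26), chunk, fill="white")
        frames.append(np.asarray(img))
    return np.stack(frames)                          # (T, 64, 64, 3), uint8

def separator_frame() -> np.ndarray:
    """A distinctive all-white frame standing in for the '|' separator."""
    return np.full((1, FRAME, FRAME, 3), 255, dtype=np.uint8)

# A sentiment-classification example becomes one "video": the input text,
# the separator, then the answer frames the model is trained to predict.
sequence = np.concatenate([
    text_to_frames("the movie was great"),
    separator_frame(),
    text_to_frames("positive"),
])
```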
Evaluation across diverse tasks (text/image/video/audio classification, video QA, object tracking, video colorization) shows promising performance, often comparable to single-task models trained without extra data (e.g., 76.8 F1 on SST-2, 89.1% accuracy on CIFAR-10, 97.1% on AudioMNIST, 52.5% on CLEVRER, 0.63 IoU on LaSOT). While not exceeding state-of-the-art everywhere, it demonstrates the feasibility of this unified approach without extensive pre-training.
Attention map analysis reveals interesting learning patterns. For video classification, the model focuses on the action-performing object/person and indicative frames. In colorization, it prioritizes initial frames for consistent coloring and key objects/color areas. For VQA, it focuses on question keywords and object trajectories. In text classification, it attends to emotive words, similar to LLMs. Colorization highlighted a trade-off between color diversity and temporal consistency. This research is a step towards generalized multimodal foundation models.
Efficient Transfer Learning for Video-language Foundation Models by Haoxing Chen, Zizheng Huang, Yan Hong, Yanshuo Wang, Zhongcai Lyu, Zhuoer Xu, Jun Lan, Zhangxuan Gu https://arxiv.org/abs/2411.11223
Caption: The diagram illustrates the Multi-modal Spatio-Temporal Adapter (MSTA) architecture for efficient transfer learning in video-language models. MSTA uses separate and shared projection layers for text and video inputs, incorporating a spatio-temporal description-guided consistency constraint to preserve general knowledge while adapting to video action recognition. This constraint (Lcc) is visualized by comparing outputs from pre-trained and trainable language branches given template and LLM-generated descriptions.
Adapting pre-trained vision-language models to video action recognition efficiently while preserving generalized knowledge is challenging. Current methods, while effective, can lead to catastrophic forgetting. This paper introduces Multi-modal Spatio-Temporal Adapter (MSTA), a more efficient and effective transfer learning method balancing task-specific and general knowledge.
MSTA uses independent projection layers (text and video branches) to learn modality-specific knowledge and a shared projection layer for representation alignment. This shared space receives gradients from both modalities during fine-tuning. The video branch has spatial and temporal up-projection layers for spatiotemporal feature adaptation. A spatio-temporal description-guided consistency constraint combats overfitting. This involves feeding template inputs (e.g., "a video of {cls}") to the trainable language branch and LLM-generated descriptions to the pre-trained branch, enforcing output consistency. The consistency constraint loss (L<sub>cc</sub>) uses cosine distance:
L<sub>cc</sub> = Σ<sub>s</sub> [1 − (w<sub>c</sub> ⋅ D<sub>s</sub>) / (||w<sub>c</sub>|| ||D<sub>s</sub>||)] + Σ<sub>t</sub> [1 − (w<sub>c</sub> ⋅ D<sub>t</sub>) / (||w<sub>c</sub>|| ||D<sub>t</sub>||)]
where w<sub>c</sub> is the text embedding of the class-c template from the trainable branch, and D<sub>s</sub> and D<sub>t</sub> are the average embeddings of the spatial and temporal descriptions for class c from the frozen pre-trained branch. This anchors the adapted model to the pre-trained knowledge.
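A compact sketch of how such a consistency term could be computed is shown below, assuming per-class template embeddings from the trainable branch and pre-averaged spatial/temporal description embeddings from the frozen branch; the exact sign convention and weighting in the paper may differ.

```python
import torch
import torch.nn.functional as F

def consistency_loss(template_emb: torch.Tensor,
                     spatial_desc_emb: torch.Tensor,
                     temporal_desc_emb: torch.Tensor) -> torch.Tensor:
    """Sketch of a spatio-temporal description-guided consistency constraint.

    template_emb: (C, D) embeddings of "a video of {cls}" from the trainable
        language branch, one row per class.
    spatial_desc_emb / temporal_desc_emb: (C, D) average embeddings of the
        LLM-generated spatial / temporal descriptions per class, taken from
        the frozen pre-trained branch.
    """
    w = F.normalize(template_emb, dim=-1)
    d_s = F.normalize(spatial_desc_emb, dim=-1)
    d_t = F.normalize(temporal_desc_emb, dim=-1)
    cos_s = (w * d_s).sum(dim=-1)                    # per-class cosine similarity
    cos_t = (w * d_t).sum(dim=-1)
    # Cosine distance to each description embedding, averaged over classes.
    return ((1 - cos_s) + (1 - cos_t)).mean()
```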
Evaluated on zero-shot, few-shot, base-to-novel generalization, and fully-supervised learning across six datasets, MSTA achieved state-of-the-art results in base-to-novel generalization (e.g., +37% novel category accuracy on Something-Something V2 vs. OST). In few-shot learning, it secured 9 of 12 best performances with only 10% of the parameters compared to the second-best method. It also showed consistent improvements in zero-shot transfer and achieved state-of-the-art results in fully-supervised learning on Kinetics-400. Ablation studies confirmed each component's importance. MSTA offers a promising direction for efficient transfer learning, promoting robust and generalizable video understanding.
This newsletter has highlighted several key trends in multimodal image and text foundation models. The development of techniques like CATE and MSTA demonstrates a clear focus on efficient adaptation and transfer learning, addressing the challenges of overfitting and maximizing the utility of pre-trained models. The introduction of TL-CLIP and JRadiEvo showcases the power of specializing these models for specific domains and languages, opening up new possibilities for applications in fields like medicine and infrastructure monitoring. Finally, the ongoing work on safety mechanisms like Llama Guard 3 Vision underscores the crucial importance of addressing ethical considerations and building robust safeguards as these powerful models become increasingly integrated into human-AI interactions. The "Everything is a Video" framework represents a particularly compelling vision for the future, suggesting a path towards truly unified multimodal models capable of seamlessly processing and understanding information from diverse sources.