Hi Elman,
In this newsletter, we delve into the latest advancements in multimodal image and text foundation models. We'll explore two exciting new papers that tackle critical challenges in this field: scaling model training and addressing data contamination. The first paper proposes a novel architecture to improve the efficiency of training massive multimodal models, while the second investigates the pervasive issue of data contamination and introduces a framework for detection and analysis.
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models by Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin https://arxiv.org/abs/2411.04996
Caption: The image presents the architecture of a Mixture-of-Transformers (MoT) model, highlighting the modality-specific processing blocks for text, speech, and image inputs. It contrasts the autoregressive and diffusion objective pathways, showing how tokens are processed differently for next token prediction versus continuous diffusion timesteps. The diagram also illustrates the joint attention mechanism in a shared feature space and the sequence re-ordering buffer within the MoT framework.
Training large multimodal models is computationally demanding. This paper introduces Mixture-of-Transformers (MoT), a novel architecture designed to address this challenge. MoT applies a modality-aware sparsity strategy: it decouples the model's non-embedding parameters (feed-forward networks, attention projection matrices, and layer normalization) by modality, so each modality is processed by its own weights, while global self-attention still operates over the full input sequence to capture cross-modal relationships. The authors posit that this approach acknowledges the inherent differences in processing various modalities, as observed in distinct modality clusters within the feature space of existing multimodal models.
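To make the parameter decoupling concrete, here is a minimal PyTorch-style sketch of a single MoT-style layer. This is not the authors' code; class and argument names such as `MoTLayer` and `modality_ids` are illustrative, and a single attention head is shown for brevity. The point is that tokens are routed to modality-specific projections, FFNs, and layer norms, while one global self-attention pass mixes all tokens.

```python
import torch
import torch.nn as nn

class MoTLayer(nn.Module):
    """Sketch of one Mixture-of-Transformers layer: modality-specific
    non-embedding parameters, global self-attention over all tokens."""
    def __init__(self, d_model: int, n_modalities: int):
        super().__init__()
        # One set of attention projections, FFN, and layer norms per modality.
        self.qkv = nn.ModuleList(nn.Linear(d_model, 3 * d_model) for _ in range(n_modalities))
        self.out = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_modalities))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_modalities))
        self.ln1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_modalities))
        self.ln2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_modalities))

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, d_model); modality_ids: (seq_len,) integer tag per token.
        q, k, v = torch.empty_like(x), torch.empty_like(x), torch.empty_like(x)
        for m in range(len(self.qkv)):
            idx = modality_ids == m              # route tokens to their modality's weights
            if idx.any():
                h = self.ln1[m](x[idx])
                q[idx], k[idx], v[idx] = self.qkv[m](h).chunk(3, dim=-1)
        # Global self-attention: every token attends to every token, across modalities.
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1) @ v
        out = torch.empty_like(x)
        for m in range(len(self.qkv)):
            idx = modality_ids == m
            if idx.any():
                y = x[idx] + self.out[m](attn[idx])          # modality-specific output proj
                out[idx] = y + self.ffn[m](self.ln2[m](y))   # modality-specific FFN
        return out
```

In the full model, this kind of routing is what lets text, image, and speech tokens use dedicated weights while still being mixed in a single shared attention pass.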
The authors evaluated MoT across three increasingly complex multimodal settings. The first, Chameleon, focuses on autoregressive text and image generation. The second extends Chameleon to include speech (Chameleon+Speech). The third, Transfusion, uses autoregressive objectives for text and diffusion-based objectives for images. MoT was compared against dense transformer and Mixture-of-Experts (MoE-4x) baselines, controlling for FLOPs to ensure fair comparison. The study also explored combining MoT with MoE-4x, integrating the latter into MoT's text transformer. System profiling was conducted to assess wall-clock time efficiency.
Results showed significant computational gains. In Chameleon, the 7B MoT model matched the dense baseline's performance using only 55.8% of the FLOPs. In Chameleon+Speech, MoT maintained comparable performance across modalities, matching the dense baseline on speech with only 37.2% of the FLOPs. Impressively, in Transfusion, a 760M MoT model outperformed a 1.4B dense baseline on key image generation metrics using only one-third of the FLOPs. System profiling revealed MoT reached dense-baseline image quality in 47.2% and text quality in 75.6% of the wall-clock time. Further analysis through leave-one-modality-out experiments and ablation studies confirmed the benefits of modality-specific parameter allocation and the effectiveness of untying parameters in different model components.
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination by Dingjie Song, Sicheng Lai, Shunian Chen, Lichao Sun, Benyou Wang https://arxiv.org/abs/2411.03823
While MLLMs demonstrate impressive performance, concerns arise regarding data contamination, where benchmark data leaks into the training process, potentially inflating performance metrics. This paper introduces MM-Detect, a framework specifically designed to detect such contamination in MLLMs, focusing on Visual Question Answering (VQA) tasks.
MM-Detect employs two key methods. For multiple-choice VQA, it uses the Option Order Sensitivity Test, shuffling answer choices and measuring performance changes. For caption-based VQA, it uses Slot Guessing for Perturbation Captions, masking keywords in captions and their back-translated versions and testing whether the model can recover them. MM-Detect quantifies contamination at both dataset and instance levels: Δ = PCR - CR (the change in correct rate from the original to the perturbed benchmark, where a large negative value suggests contamination) and IL = X / |D| (instance leakage, where X is the number of instances answered correctly before but incorrectly after perturbation, and |D| is the dataset size).
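As a rough illustration of the two metrics (not the paper's implementation), the sketch below computes Δ and IL from per-instance correctness flags before and after perturbation; the function and variable names are assumed.

```python
def contamination_metrics(correct_before, correct_after):
    """Compute MM-Detect-style dataset- and instance-level metrics from
    boolean correctness flags (one pair per benchmark instance).

    correct_before[i]: model answered instance i correctly on the original item
    correct_after[i]:  model answered instance i correctly after perturbation
                       (shuffled options or masked caption keyword)
    """
    n = len(correct_before)
    cr = sum(correct_before) / n      # correct rate on original items
    pcr = sum(correct_after) / n      # correct rate on perturbed items
    delta = pcr - cr                  # dataset-level metric: large negative values
                                      # hint at memorized benchmark items
    x = sum(b and not a for b, a in zip(correct_before, correct_after))
    il = x / n                        # instance leakage: right before, wrong after
    return delta, il

# Toy usage: 5 instances, two of which flip from correct to incorrect.
delta, il = contamination_metrics(
    [True, True, True, False, True],
    [True, False, True, False, False],
)
print(f"Delta = {delta:+.2f}, IL = {il:.2f}")   # Delta = -0.40, IL = 0.40
```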
Applying MM-Detect to eleven MLLMs across five VQA datasets revealed widespread contamination, varying in degree across models and datasets. For instance, Claude-3.5-Sonnet showed a significant Δ of -5.3 on the ScienceQA training set, while fuyu-8b showed a substantial Δ on MMStar. Experiments with intentional contamination further validated MM-Detect's effectiveness, demonstrating its sensitivity to varying contamination levels. Introducing contaminated data led to an average 8.2% increase in the correct rate (CR) and a 4.5% decrease in Δ.
Interestingly, the study also suggests that contamination can originate from the pre-training phase of the LLMs used by MLLMs. LLaMA2-7b, used by LLaVA-1.5 and VILA, exhibited a high contamination rate (25.6%) even without image input. Finally, the paper demonstrates that training set leakage can significantly boost test set performance (average 4.3% increase in CR), highlighting the potential for unfair evaluation.
This newsletter highlighted two critical aspects of multimodal model development. The introduction of MoT offers a promising pathway towards more efficient training of large-scale multimodal models by reducing computational costs without sacrificing performance. Complementing this, the findings on data contamination and the introduction of MM-Detect underscore the importance of rigorous evaluation and the need for standardized practices in dataset usage and reporting. Together, these two papers push the evolving landscape of multimodal research towards more scalable and reliable models.