This newsletter dives into the cutting edge of multimodal AI, exploring recent breakthroughs in image and video understanding, scientific table interpretation, and parameter-efficient fine-tuning techniques for these powerful models. From novel training paradigms to innovative architectural designs, these papers offer a glimpse into the future of multimodal AI and its potential to revolutionize how we interact with and understand information across different modalities.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding by Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao https://arxiv.org/abs/2501.13106
Caption: The image illustrates the four-stage training process of VideoLLaMA3: 1) Vision Encoder Adaptation, 2) Vision-Language Alignment, 3) Multi-task Fine-tuning, and 4) Video-centric Fine-tuning. Each stage leverages different datasets (sizes indicated below each stage) and modalities, culminating in a model capable of diverse image and video understanding tasks. The data used in each stage is visually represented by icons and quantities.
VideoLLaMA3 presents a new paradigm in multimodal foundation models, focusing on image and video understanding through a vision-centric approach. Unlike models that prioritize video-text data, VideoLLaMA3 emphasizes the importance of high-quality image-text data. This is reflected in its four-stage training process, sketched in code after the list below:
Vision Encoder Adaptation: This stage primes the vision encoder to handle dynamic image resolutions and capture fine-grained details. This is crucial for handling the variability inherent in real-world images and videos.
Vision-Language Alignment: This stage leverages diverse image-text data and a small portion of text-only data to jointly tune the vision encoder, projector, and LLM. This broad exposure to varied visual and textual concepts lays the foundation for robust multimodal understanding.
Multi-task Fine-tuning: This stage incorporates image-text and video-text data for specific downstream tasks. This targeted training helps specialize the model for practical applications, bridging the gap between general understanding and specific task performance.
Video-centric Fine-tuning: This final stage refines the model's capabilities for video understanding using diverse video datasets, including those designed for streaming video understanding and temporal grounding. This specialization ensures the model effectively handles the temporal dynamics inherent in video data.
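Below is a minimal, schematic rendering of this curriculum as plain config data. The dataset descriptions follow the summary above; exactly which modules are trainable at each stage (beyond the jointly tuned second stage) is an assumption for illustration, not the paper's specification.

```python
# Schematic view of the four-stage curriculum; module freezing per stage is assumed.
TRAINING_STAGES = [
    {"stage": "vision_encoder_adaptation",
     "data": "image-text pairs at varied resolutions",
     "trainable": ["vision_encoder", "projector"]},          # assumed
    {"stage": "vision_language_alignment",
     "data": "diverse image-text plus a small amount of text-only data",
     "trainable": ["vision_encoder", "projector", "llm"]},   # per the summary
    {"stage": "multi_task_fine_tuning",
     "data": "image-text and video-text data for downstream tasks",
     "trainable": ["vision_encoder", "projector", "llm"]},   # assumed
    {"stage": "video_centric_fine_tuning",
     "data": "video data incl. streaming video and temporal grounding",
     "trainable": ["vision_encoder", "projector", "llm"]},   # assumed
]

for cfg in TRAINING_STAGES:
    print(f"{cfg['stage']}: tune {', '.join(cfg['trainable'])}")
```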
Two key innovations underpin VideoLLaMA3's architecture: Any-resolution Vision Tokenization (AVT) and the Differential Frame Pruner (DiffFP). AVT allows the vision encoder to process images and videos of any resolution by using Rotary Position Embeddings (RoPE) instead of fixed positional embeddings, capturing more detailed visual information regardless of input size. DiffFP acts as a video compressor: it reduces redundancy in video tokens by pruning patches whose $L_1$ distance in pixel space to the corresponding patch in the adjacent frame is minimal, which improves efficiency, especially for long videos. The authors also constructed VL3-Syn7M, a high-quality re-captioned image dataset sourced from COYO-700M, using aspect-ratio filtering, aesthetic scoring, and text-image similarity checks to clean the data before re-captioning.
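To make the pruning rule concrete, here is a minimal sketch of differential frame pruning in the spirit of DiffFP, assuming video frames have already been split into aligned patch grids; the threshold value and tensor layout are illustrative choices, not the paper's exact settings.

```python
import torch

def prune_redundant_patches(frames: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    """frames: (T, N, P) tensor of T frames, each split into N patches of P pixel values.
    Returns a boolean (T, N) mask marking which patch tokens to keep."""
    keep = torch.ones(frames.shape[:2], dtype=torch.bool)
    # Mean L1 distance in pixel space between each patch and the same patch
    # in the previous frame.
    diffs = (frames[1:] - frames[:-1]).abs().mean(dim=-1)  # (T-1, N)
    # Keep only patches that changed more than the threshold; the first frame is always kept.
    keep[1:] = diffs > threshold
    return keep

# Example: 16 frames, a 14x14 patch grid, 16x16x3 pixel values per patch.
frames = torch.rand(16, 196, 16 * 16 * 3)
mask = prune_redundant_patches(frames)
print(f"kept {int(mask.sum())} of {mask.numel()} patch tokens")
```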
VideoLLaMA3 achieves state-of-the-art results on several image and video understanding benchmarks. On image tasks, the 2B model excels on InfoVQA, MathVista, and RealWorldQA. The 7B model further pushes these boundaries. On video tasks, both the 2B and 7B models demonstrate strong performance across various benchmarks, including VideoMME, PerceptionTest, MLVU-dev, TempCompass, and NextQA. While these results are impressive, the authors acknowledge limitations, including the need for higher-quality video-text data and optimization for real-time inference. Future work will focus on addressing these limitations and exploring the integration of other modalities, such as audio.
Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning by Bohao Yang, Yingji Zhang, Dong Liu, André Freitas, Chenghua Lin https://arxiv.org/abs/2501.13042
Caption: This diagram illustrates the MMSci framework for multimodal scientific table understanding. The framework begins with the generation and verification of scientific table images (MMSci-Pre) and then proceeds to table structure learning and visual instruction tuning (MMSci-Ins) using these images along with instruction-following data. Finally, the framework outputs responses for tasks like table question answering (TQA), table fact verification (TFV), and table-to-text (T2T) as part of the MMSci dataset construction and evaluation (MMSci-Eval).
This paper addresses the challenges of scientific table understanding, a domain where traditional LLMs and even MLLMs with fixed input resolutions often fall short. The authors introduce MMSci, a comprehensive framework designed to improve multimodal scientific table understanding and reasoning with dynamic input image resolutions. MMSci consists of three key components:
MMSci-Pre: A domain-specific dataset of 52K scientific table structure recognition samples. This dataset is crucial for training models to understand the specific layout and conventions of scientific tables.
MMSci-Ins: An instruction tuning dataset with 12K samples across three table-based tasks: Table Question Answering (TQA), Table Fact Verification (TFV), and Table-to-Text generation (T2T). This dataset focuses on developing models' reasoning capabilities and their grasp of scientific concepts within tabular data. Crucially, it includes explicit intermediate reasoning steps for enhanced learning (a sample format is sketched after this list).
MMSci-Eval: A benchmark with 3,114 testing samples specifically designed to evaluate numerical reasoning capabilities, a key requirement for understanding scientific tables.
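To illustrate what an instruction-tuning sample with intermediate reasoning might look like, here is a hypothetical MMSci-Ins-style record for the TQA task; the field names, file path, and table values are invented for illustration and do not reflect the dataset's actual schema.

```python
# Hypothetical record pairing a rendered table image with step-by-step reasoning.
sample = {
    "task": "TQA",                              # one of TQA, TFV, T2T
    "image": "tables/example_table.png",        # rendered scientific table
    "instruction": "Which catalyst gives the highest yield?",
    "reasoning": [
        "Locate the 'Yield (%)' column in the table image.",
        "Compare the yield values across the catalyst rows: 62, 78, 91, 84.",
        "The maximum, 91%, appears in the Pd/C row.",
    ],
    "answer": "Pd/C",
}
print(sample["answer"])
```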
A key finding is the importance of data quality over quantity: models trained on the smaller, domain-specific MMSci-Pre dataset significantly outperformed those trained on a larger, general-domain table dataset, highlighting the value of curated, high-quality data for specialized tasks. The framework was implemented with dynamic input resolutions on two MLLM architectures, Qwen2-VL-7B-Instruct and LLaVA-NeXT-7B. Experiments demonstrate significant improvements in both general table understanding and numerical reasoning, with strong generalization to held-out datasets such as TABMWP and TAT-QA. Ablation studies confirm the benefits of incorporating reasoning steps in the training data, while a representational alignment analysis highlights the superior language-vision alignment of Qwen2-VL-7B-Instruct.
Parameter-Efficient Fine-Tuning for Foundation Models by Dan Zhang, Tao Feng, Lilong Xue, Yuandong Wang, Yuxiao Dong, Jie Tang https://arxiv.org/abs/2501.13787
Caption: This infographic illustrates various Parameter-Efficient Fine-Tuning (PEFT) techniques categorized by their approach (Selective, Additive, Prompt, Reparameterization) and mapped to foundation models (LLMs, VFMs, VLMs, MFMs, VGMs) and their release years. It visually represents how PEFT methods modify model architectures and parameters, emphasizing their efficient adaptation for diverse tasks like language understanding, image understanding, and code generation. The infographic also highlights the progression of foundation models and associated PEFT tools over time.
This survey offers a deep dive into Parameter-Efficient Fine-Tuning (PEFT) techniques for foundation models. PEFT aims to minimize the computational and storage costs associated with fine-tuning large models while maintaining optimal performance on downstream tasks. The survey categorizes PEFT techniques into five families: Selective PEFT (e.g., BitFit, PASTA), Additive PEFT (adapter networks), Prompt PEFT (learned prompts), Reparameterization PEFT (e.g., LoRA), and Hybrid PEFT (e.g., UniPELT).
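As a concrete example of the reparameterization family, here is a minimal LoRA-style sketch in PyTorch: the pretrained weight stays frozen and only a low-rank correction is trained. The rank, scaling convention, and initialization are simplified choices, not taken from the survey.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the pretrained weights
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank correction; only A and B get gradients.
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total} parameters")
```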
The survey explores the application of these techniques across foundation model types, including large language models (LLMs), vision foundation models (VFMs), vision-language models (VLMs), multimodal foundation models (MFMs), and visual generation models (VGMs), providing a comprehensive overview of the current landscape. It highlights the strengths and weaknesses of each approach in different contexts: for example, LoRA and P-Tuning v2 have shown promise for LLMs, while adapter-based methods and visual prompt tuning are prevalent in VFMs. The survey also emphasizes the importance of hyperparameter tuning in PEFT and the ongoing challenge of interpretability.
Looking forward, the survey identifies several key research directions, including domain-specific PEFT, continual PEFT, scaling laws for PEFT, and the development of PEFT-optimized architectures. It also suggests exploring inspiration from neuroscience for developing more efficient and biologically plausible PEFT methods.
RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering by Yang Bai, Christan Earl Grant, Daisy Zhe Wang https://arxiv.org/abs/2501.13297
Caption: This diagram illustrates the two-stage architecture of RAMQA, a novel framework for Multi-modal Question Answering. The first stage uses RankLLaVA to rank image-related documents, while the second stage employs RAMLLaMA to re-rank and generate answers, leveraging image-to-text transformations and a cache for efficiency. The example shows a query about a sheep image, demonstrating the flow of information through the system.
RAMQA introduces a unified framework for Multi-modal Retrieval-Augmented Question Answering (MRAQA), addressing the limitations of traditional ranking methods when combined with modern generative LLMs. RAMQA uses a two-stage approach:
RankLLaVA: A pointwise multi-modal ranker built on LLaVA that encodes candidate documents and ranks them by relevance to the query. Zero-shot LLaVA transforms image data into text, simplifying the input for the LLM.
RAMLLaMA: A LLaMA model, fine-tuned with instruction tuning and an autoregressive multi-task learning approach, that simultaneously re-ranks the top-k documents from the first stage and generates the answer. Feeding the model permutations of the document candidates helps reduce position bias in the ranking and generation process, as sketched below.
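Here is a schematic of the permutation idea, assuming a simple text prompt template and majority-vote aggregation (neither of which is claimed to be RAMQA's exact setup): the same top-k candidates are presented to the generator in several shuffled orders so that no candidate systematically benefits from its position in the prompt.

```python
import random
from collections import Counter

def build_rerank_prompts(question: str, candidates: list[str],
                         num_perms: int = 3, seed: int = 0) -> list[str]:
    """Present the same top-k candidates in several shuffled orders."""
    rng = random.Random(seed)
    prompts = []
    for _ in range(num_perms):
        perm = candidates[:]
        rng.shuffle(perm)
        listing = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(perm))
        prompts.append(
            f"Question: {question}\nCandidates:\n{listing}\n"
            "Rank the candidates by relevance, then answer the question."
        )
    return prompts

def aggregate_answers(answers: list[str]) -> str:
    # Majority vote over answers generated from the different orderings.
    return Counter(answers).most_common(1)[0][0]

prompts = build_rerank_prompts("What color is the sheep?", ["doc A", "doc B", "doc C"])
print(prompts[0])
```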
Evaluations on WebQA and MultiModalQA demonstrate RAMQA's effectiveness, with state-of-the-art results and significant improvements over strong baselines. Ablation studies highlight the importance of both permutation-based generative retrieval and the multi-task generation objective. While promising, the authors acknowledge limitations related to data dependence and potential biases.
This newsletter highlights significant advancements in multimodal image and text foundation models. From vision-centric training paradigms in VideoLLaMA3 to the specialized framework for scientific table understanding in MMSci, and the efficient fine-tuning strategies explored in the PEFT survey, these works push the boundaries of multimodal AI. RAMQA's unified approach to retrieval and generation further demonstrates the innovative ways researchers are combining traditional techniques with the power of generative LLMs. These advancements collectively pave the way for more robust, efficient, and versatile multimodal AI systems capable of tackling complex tasks across various domains. The identified limitations and future research directions provide a roadmap for continued progress in this exciting field.