This newsletter dives into the latest advancements in multimodal image and text foundation models. We'll explore groundbreaking new models like Magma and GRAPHGPT-O, which push the boundaries of AI agent capabilities and graph-based multimodal understanding. We'll also examine innovative training techniques like ViFT that challenge conventional wisdom and frameworks like HermesFlow that aim to harmonize the understanding and generation abilities of these powerful models. Finally, we'll delve into the limitations revealed by MET-Bench, a new benchmark for multimodal entity tracking, highlighting the challenges that still lie ahead.
Magma: A Foundation Model for Multimodal AI Agents by Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao https://arxiv.org/abs/2502.13130
Tired of AI agents confined to single tasks? Magma, a groundbreaking foundation model developed at Microsoft, tackles multimodal AI agentic tasks in both digital and physical realms. Unlike previous Vision-Language-Action (VLA) models specializing in either UI navigation or robot manipulation, Magma boasts a unified architecture capable of understanding, planning, and acting across diverse domains. This single model can navigate websites, control robotic arms, and answer visual questions, thanks to novel pretraining techniques and a massive, heterogeneous dataset.
The key to Magma's versatility lies in its two novel pretraining objectives: Set-of-Mark (SoM) and Trace-of-Mark (ToM). SoM enhances action grounding by labeling actionable objects in images with numeric marks, simplifying prediction. ToM extends this to videos, forcing the model to predict the future trajectories of these marks, improving action planning. This allows Magma to learn from UI datasets, robotics data, instructional videos, and image-text pairs, bridging the gap between verbal and spatial intelligence.
Architecturally, Magma combines a convolutional vision encoder (ConvNeXt) with a decoder-only LLM (LLaMA-3-8B) to process visual and textual inputs and generate verbal, spatial, and action outputs. Agentic modeling is formulated as an autoregressive decoding procedure: O = π(I, task, ctx) = {o<sub>1</sub>,…, o<sub>T</sub>}, where O represents output tokens, I visual observations, task the task description, and ctx the context.
Zero-shot evaluations demonstrate Magma's superior performance. It achieves state-of-the-art results on UI navigation benchmarks like ScreenSpot and VisualWebBench, outperforming existing general-domain LMMs and specialized agentic models. In robotic manipulation, Magma surpasses even domain-specific models like OpenVLA in simulated environments (SimplerEnv), nearly doubling the average success rate. Furthermore, Magma shows promise in real-world robotic tasks, successfully performing complex manipulations like hot dog assembly and cloth pushing on a WidowX 250 robot arm. It also maintains competitive performance on visual question answering tasks, comparable to leading LMMs.
Ablation studies confirm the significance of both SoM and ToM. Combining UI and robotics data without these techniques degrades performance due to domain discrepancies. However, their unified interface allows Magma to learn effectively from heterogeneous data, significantly boosting both verbal and spatial intelligence, further validated by its superior performance on visual spatial reasoning benchmarks like VSR and SpatialEval. With moderate finetuning, Magma adapts efficiently to downstream tasks, achieving state-of-the-art results on web and mobile UI navigation (Mind2Web and AITW) and demonstrating strong few-shot learning in robotic manipulation (LIBERO).
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation by Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui https://arxiv.org/abs/2502.12148
Multimodal Large Language Models (MLLMs) have significantly advanced both the understanding and the generation of visual content. However, a key discrepancy exists: MLLMs typically demonstrate stronger understanding than generation. This gap, quantified through a novel evaluation pipeline involving GPT-4o, poses a significant challenge for developing truly unified multimodal models. Existing methods often focus on improving either understanding or generation individually, neglecting the potential synergy between them.
HermesFlow addresses this imbalance by leveraging homologous data (image-caption pairs) to curate paired preference data for both understanding and generation. For understanding, MLLMs generate multiple captions for a single image, and BERT similarity scores select the best and worst, forming the preference pair. For generation, MLLMs generate multiple images from a single caption, using a self-critique approach based on self-VQA scoring to select the best and worst.
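The best/worst selection step can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: `select_preference_pair` and `token_overlap` are hypothetical names, and the word-overlap scorer is a toy stand-in for the BERT similarity (or self-VQA) score described above.

```python
from typing import Callable, List, Tuple


def select_preference_pair(
    candidates: List[str],
    score: Callable[[str], float],
) -> Tuple[str, str]:
    """Return (chosen, rejected): the highest- and lowest-scoring candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=score)  # ascending by score
    return ranked[-1], ranked[0]


def token_overlap(reference: str) -> Callable[[str], float]:
    """Toy scorer (assumption): Jaccard overlap of word sets with a reference."""
    ref_words = set(reference.lower().split())

    def _score(candidate: str) -> float:
        words = set(candidate.lower().split())
        union = ref_words | words
        return len(ref_words & words) / len(union) if union else 0.0

    return _score


reference_caption = "a brown dog runs on the beach"
sampled_captions = [
    "a dog runs on the beach",
    "a cat sleeps on a sofa",
    "a dog on sand",
]
chosen, rejected = select_preference_pair(
    sampled_captions, token_overlap(reference_caption)
)
```

The same helper would apply to the generation side by passing a list of candidate images (or their identifiers) and a scorer that returns the self-VQA score.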
The curated homologous preference data is used to train the MLLM with Pair-DPO, an extension of Direct Preference Optimization. The Pair-DPO loss, L<sub>Pair-DPO</sub>(θ) = -E<sub>(x,y,x<sub>w</sub>,x<sub>l</sub>,y<sub>w</sub>,y<sub>l</sub>)~D</sub> [log σ(Δ<sub>Und</sub> + Δ<sub>Gen</sub>)], where Δ<sub>Und</sub> and Δ<sub>Gen</sub> are the preference margins for understanding and generation, optimizes both at once by raising the likelihood of winning samples (x<sub>w</sub>, y<sub>w</sub>) relative to losing samples (x<sub>l</sub>, y<sub>l</sub>). This joint optimization, combined with self-play iterative training in which the MLLM refines its own preference data, allows continuous self-improvement without external high-quality data.
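To make the loss concrete, here is a minimal per-example sketch in Python. It assumes the two margins are summed inside the sigmoid and that each margin takes the standard DPO form (a scaled difference of policy-versus-reference log-ratios); the function names and the `beta` default are illustrative choices, not values from the paper.

```python
import math


def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO margin (assumed form): beta * (winner log-ratio - loser log-ratio)."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def pair_dpo_loss(delta_und, delta_gen):
    """Per-example loss -log(sigmoid(delta_und + delta_gen))."""
    z = delta_und + delta_gen
    # -log(sigmoid(z)) = log(1 + exp(-z)); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical sequence log-probabilities under the policy and the frozen reference.
delta_und = dpo_margin(logp_w=-1.0, logp_l=-2.0, ref_logp_w=-1.5, ref_logp_l=-1.5)
delta_gen = dpo_margin(logp_w=-3.0, logp_l=-3.5, ref_logp_w=-3.2, ref_logp_l=-3.2)
loss = pair_dpo_loss(delta_und, delta_gen)
```

Because both margins feed a single sigmoid, a gradient step that widens either the understanding margin or the generation margin lowers the same loss, which is what couples the two abilities during training.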
Experiments demonstrate HermesFlow's effectiveness. On understanding benchmarks like POPE, MME, Flickr30k, VQAv2, GQA, and MMMU, it matches or exceeds larger models despite having fewer parameters. For instance, it achieves an MMMU score of 28.3 compared to Show-o's 27.4. On generation benchmarks like GenEval and DPG-Bench, it outperforms diffusion-based and autoregressive models, with gains in object attribute generation and counting accuracy. User studies confirm superior generation quality, with a clear preference for HermesFlow's generated images.
Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models by Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen https://arxiv.org/abs/2502.11427
Caption: This diagram illustrates the ViFT (Visual Instruction-Free Fine-tuning) framework for training large vision-language models. It shows the separate training pathways for visual perception (using image captions) and task-solving (using text instructions), and how these abilities are fused during inference using weighted vectors to generate a final response without explicit visual instructions. The diagram also contrasts this approach with traditional sub-ability specific fine-tuning that relies on image-instruction pairs.
Visual instruction tuning is the current standard for training large vision-language models (LVLMs), enabling them to perform various multimodal tasks. However, relying on image-instruction pairs presents challenges, including the cost of dataset creation and potential inaccuracies in synthesized instructions. ViFT (Visual Instruction-Free Fine-tuning) challenges the need for visual instructions by disentangling and separately training visual perception and task-solving abilities.
ViFT uses readily available text-only instruction data (e.g., FLAN, OpenHermes) for task-solving and image caption data (e.g., LAION, supplemented with synthetic captions) for visual perception. During training, the LVLM is jointly fine-tuned on these data sources using a modality-specific batching strategy. The training objective is an auto-regressive negative log-likelihood: L(θ) = -∑<sub>j</sub> log Pr(r<sub>j</sub> | v, q, r<sub>&lt;j</sub>; θ), where v is the image (empty for text-only instructions), q is the query, r<sub>j</sub> is the j-th response token, r<sub>&lt;j</sub> denotes the preceding response tokens, and θ are the model parameters.
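As a small worked example of the auto-regressive objective, written here as a negative log-likelihood to be minimized, the sketch below sums the per-token terms for one response. The function name and the probabilities are made up for illustration; in practice the per-token probabilities come from the model's softmax output.

```python
import math
from typing import Sequence


def autoregressive_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of one response.

    token_probs[j] stands for Pr(r_j | v, q, r_<j; theta), the probability
    the model assigns to the j-th reference token given the image (if any),
    the query, and the preceding response tokens.
    """
    return -sum(math.log(p) for p in token_probs)


# Two response tokens predicted with probabilities 0.5 and 0.25 (hypothetical).
nll = autoregressive_nll([0.5, 0.25])  # equals log(8)
```

For a text-only instruction sample the image v is simply absent from the conditioning context; the loss itself has the same form.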
During inference, ViFT extracts separate "steering vectors" representing individual abilities. A task-solving vector, h(q), is derived from the LLM's hidden states processing the text instruction q. A visual perception vector, h(v, q), is extracted from the hidden states corresponding to q when processing both the image v and instruction q. These vectors are combined using weighted addition: h'(v, q) = αh(v, q) + βh(q). This fused vector guides the LLM's generation, effectively combining both abilities.
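The weighted addition itself is a plain element-wise combination, sketched below with Python lists standing in for hidden-state vectors. The function name, the vector values, and the choice of α and β are illustrative assumptions; how the fused vector is injected back into the model is not covered here.

```python
from typing import List


def fuse_steering_vectors(
    h_vq: List[float], h_q: List[float], alpha: float, beta: float
) -> List[float]:
    """Compute h'(v, q) = alpha * h(v, q) + beta * h(q) element-wise."""
    if len(h_vq) != len(h_q):
        raise ValueError("steering vectors must have the same dimension")
    return [alpha * a + beta * b for a, b in zip(h_vq, h_q)]


h_vq = [1.0, 2.0, -1.0]  # visual perception vector (hypothetical values)
h_q = [0.5, 0.0, 2.0]    # task-solving vector (hypothetical values)
fused = fuse_steering_vectors(h_vq, h_q, alpha=1.0, beta=0.5)
```

Raising β relative to α shifts the fused representation toward the text-only task-solving behaviour, while raising α keeps it closer to what the model computes when it actually sees the image.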
ViFT outperforms state-of-the-art open-source LVLMs on visual reasoning benchmarks like MathVerse and MathVision, achieving scores of 34.8 and 24.0 respectively, compared to LLaVA-OneVision's 31.0 and 18.1, with significantly less training data. Adding a small amount of VQA data (ViFT-A) further improves performance on some benchmarks. ViFT also demonstrates superior performance on LLaVA-Bench, a visual instruction following benchmark.
GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs by Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han https://arxiv.org/abs/2502.11925
Multimodal Large Language Models (MLLMs) excel at understanding and generating text and images individually, but often struggle with relationships between these modalities, particularly in graph structures. Real-world scenarios frequently involve multimodal attributed graphs (MMAGs), where nodes have associated text and image attributes, and edges depict relationships. Existing MLLMs struggle to incorporate this structural information.
GRAPHGPT-O addresses these limitations, enabling synergistic multimodal comprehension and generation on MMAGs. It tackles challenges like graph size explosion, the non-Euclidean nature of graphs, hierarchical modality dependency, and inference dependency.
GRAPHGPT-O uses personalized PageRank (PPR) for neighbor sampling, $N(v_i) = \text{argmax}_{N(v_i) \subset V,\ |N(v_i)|=K} \sum_{v_j \in N(v_i)} P_{i,j}$, to select relevant neighboring nodes. It explores both linearization and a novel hierarchical aligner with node-level and graph structure Q-Formers to transform graph information into sequences for MLLM processing, capturing intricate modality dependencies. It also offers sequential and parallel inference strategies to manage interdependence between text and image generation.
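Because the objective is a sum of per-node PPR scores, picking the best size-K subset reduces to taking the K highest-scoring nodes. The sketch below shows that idea with a simple power iteration on a toy graph. It is a generic PPR implementation, not the paper's code: the restart probability, the iteration count, the dangling-node handling, and the exclusion of the source node from its own neighbor set are all assumptions.

```python
from typing import Dict, List


def personalized_pagerank(
    adj: Dict[int, List[int]], source: int, restart: float = 0.15, iters: int = 100
) -> Dict[int, float]:
    """Approximate PPR scores P[source, j] by power iteration.

    adj maps every node to its list of neighbors; all neighbors must be keys.
    """
    scores = {n: 1.0 if n == source else 0.0 for n in adj}
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, neighbors in adj.items():
            walk_mass = (1.0 - restart) * scores[n]
            if not neighbors:
                nxt[source] += walk_mass  # dangling node: send mass back to source
                continue
            for m in neighbors:
                nxt[m] += walk_mass / len(neighbors)
        nxt[source] += restart  # restart mass (scores always sum to 1)
        scores = nxt
    return scores


def top_k_neighbors(adj: Dict[int, List[int]], source: int, k: int) -> List[int]:
    """Select the K nodes with the highest PPR score with respect to source."""
    scores = personalized_pagerank(adj, source)
    candidates = [n for n in adj if n != source]
    return sorted(candidates, key=lambda n: scores[n], reverse=True)[:k]


# Toy undirected graph: a triangle 0-1-2 with a tail 2-3-4.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
sampled = top_k_neighbors(graph, source=0, k=2)
```

Capping the context at K nodes is what keeps the input sequence bounded as the graph grows, which is the "graph size explosion" problem mentioned above.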
Experiments on ART500K, Amazon-Baby, and Amazon-Beauty datasets demonstrate GRAPHGPT-O's effectiveness, showing significant improvements over baselines like DreamLLM and Chameleon. Qualitative evaluations reveal that GRAPHGPT-O generates images and text more contextually aligned with the graph structure.
MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models by Vanya Cohen, Raymond Mooney https://arxiv.org/abs/2502.10886
Caption: This graph from the MET-Bench benchmark compares the accuracy of various large language models (LLMs) on the Chess task, tracking piece locations after a sequence of image-based moves. It demonstrates a significant drop in accuracy when models process image-based actions compared to text-based actions, highlighting a key limitation in multimodal entity tracking. The x-axis represents the number of moves, and the y-axis represents the accuracy of the models.
Entity tracking is crucial for natural language understanding. While text-based entity tracking has advanced, the multimodal aspect, where entities evolve through both text and visual information, remains challenging. MET-Bench evaluates VLMs on multimodal entity tracking using two structured domains: Chess (tracking piece locations) and the Shell Game (tracking a hidden ball).
MET-Bench represents initial and final states in text but evaluates tracking changes through image-based action sequences. It tests models in zero-shot, few-shot, and chain-of-thought settings, using both text and image-based action representations. The task is defined as inferring the final state S<sub>T</sub> = f(S<sub>0</sub>, A<sub>1</sub>, A<sub>2</sub>, ..., A<sub>T</sub>), where S<sub>0</sub> is the initial state, A<sub>t</sub> is the action at timestep t, and f models the state update function.
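For the Shell Game, the state update function f can be written as a fold over the action sequence. The sketch below assumes each action is a swap of two cup positions given as an index pair; that encoding and the function names are illustrative, not the benchmark's actual data format.

```python
from functools import reduce
from typing import Sequence, Tuple

Swap = Tuple[int, int]  # assumed action encoding: indices of the two cups swapped


def apply_swap(ball_position: int, swap: Swap) -> int:
    """One state update: the ball moves only if its cup is part of the swap."""
    a, b = swap
    if ball_position == a:
        return b
    if ball_position == b:
        return a
    return ball_position


def final_state(initial_position: int, swaps: Sequence[Swap]) -> int:
    """S_T = f(S_0, A_1, ..., A_T), computed by applying each action in order."""
    return reduce(apply_swap, swaps, initial_position)


# Ball starts under cup 0; the cups are swapped three times.
position = final_state(0, [(0, 1), (1, 2), (0, 1)])
```

The update rule itself is trivial once each action is known, so a model that misreports the final state is failing either to read an action from its image or to carry the state across steps.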
Results reveal a significant performance gap between text-based and image-based tracking. In Chess, the best text model achieved 96.8% zero-shot accuracy, while its image-based counterpart dropped to 66.2%. Models showed near-perfect accuracy in classifying individual image actions, suggesting the limitation lies in reasoning about sequential visual updates, not perception. Few-shot learning and chain-of-thought prompting improved performance, particularly in the Shell Game. Fine-tuning also significantly boosted performance, especially for images.
This newsletter highlights the rapid advancements and persistent challenges in multimodal image and text foundation models. Models like Magma demonstrate the potential for unified architectures to handle diverse tasks across digital and physical domains, while GRAPHGPT-O showcases the power of incorporating graph structures into multimodal understanding and generation. Innovative training approaches like ViFT offer potential solutions to the data bottleneck of visual instruction tuning, and frameworks like HermesFlow strive to bridge the gap between understanding and generation capabilities. However, benchmarks like MET-Bench reveal significant limitations in current models' ability to reason about sequential visual information and maintain entity coherence across modalities. These findings underscore the need for continued research into more robust multimodal representations and reasoning techniques, paving the way for truly intelligent and versatile multimodal AI agents.