Hi Elman,
This newsletter delves into the latest advancements and challenges in the rapidly evolving field of multimodal image and text foundation models. We'll explore new benchmarks for evaluating reasoning capabilities, novel architectures for improved performance and scalability, efficient serving strategies for large models, and critical security vulnerabilities that demand attention. The convergence of vision and language continues to be a fertile ground for research, and this newsletter aims to provide you with a concise overview of some of the most exciting recent developments.
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark by Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Yu Cheng https://arxiv.org/abs/2501.05444
Multimodal Large Language Models (MLLMs) have shown remarkable progress, but their ability to reason jointly over text and images remains a critical open question. Existing benchmarks often fall short, relying on superficial visual understanding or text-dominant reasoning. This paper introduces EMMA (Enhanced MultiModal reAsoning), a benchmark specifically designed to evaluate organic multimodal reasoning across mathematics, physics, chemistry, and coding, with the coding tasks centered on visualization generation.
EMMA's 2,788 problems, including 1,796 newly created questions developed with domain experts, require a back-and-forth interplay between visual and textual information. This necessitates cross-modal reasoning that goes beyond independent processing of each modality. The questions are categorized by specific skills within each subject, offering fine-grained insights into MLLM capabilities. For example, math questions assess 3D spatial simulation, 2D transformation, and path tracing, while chemistry questions focus on knowledge-based counting, structure recognition, and reaction simulation. The coding section, uniquely focused on visualization generation, evaluates tasks such as reproducing a visualization from code, generating code from a visualization, and modifying code to achieve a target visualization.
Evaluation of nine state-of-the-art MLLMs on EMMA reveals significant limitations. On EMMA-mini (a balanced subset), the best model achieved only 45.75% accuracy, significantly below human expert performance (77.75%). Even advanced techniques like Chain-of-Thought (CoT) prompting and test-time compute scaling, such as majority voting and best-of-N selection, offered minimal improvements. This suggests that simply increasing candidate responses doesn't address the fundamental challenges of multimodal reasoning. A detailed error analysis points to visual reasoning as a key bottleneck, with models struggling on tasks involving precise spatial simulations, multi-hop visual reasoning, and integrating visual and textual information. Interestingly, CoT prompting, while beneficial for some closed-source models, often hindered open-source models, possibly indicating their inability to effectively leverage textual CoT for visual-centric tasks. EMMA highlights the urgent need for improved architectures and training paradigms to bridge the gap between human and machine reasoning in multimodality.
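For reference, the test-time scaling strategies mentioned above are conceptually simple. The minimal Python sketch below shows generic majority voting and best-of-N selection; `sample_answer` and `score` are hypothetical placeholders for an MLLM decoding call and a verifier, not anything from the paper:

```python
from collections import Counter

def majority_vote(sample_answer, question, n=16):
    """Majority voting: sample N candidate answers and return the most
    frequent one. `sample_answer` is any function that queries an MLLM
    once (hypothetical placeholder)."""
    answers = [sample_answer(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(sample_answer, score, question, n=16):
    """Best-of-N selection: sample N candidates and keep the one that a
    reward/verifier function `score` rates highest (also a placeholder)."""
    answers = [sample_answer(question) for _ in range(n)]
    return max(answers, key=score)
```

EMMA's finding is that such candidate-level scaling barely moves the needle, which is exactly why the authors point to visual reasoning itself as the bottleneck.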
Valley2: Exploring Multimodal Models with Scalable Vision-Language Design by Ziheng Wu, Zhenghao Chen, Ruipu Luo, Can Zhang, Yuan Gao, Zhentao He, Xian Wang, Haoran Lin, Minghui Qiu https://arxiv.org/abs/2501.05901
ByteDance introduces Valley2, a new MLLM focused on e-commerce and short-video applications. Utilizing Qwen2.5 as its LLM backbone and SigLIP-384 as the vision encoder, Valley2 incorporates a ConvAdapter, a lightweight component that reduces vision encoder output tokens without expanding dimensions, improving efficiency and training stability. The Eagle Module, featuring a parallel vision encoder, addresses distortions and expands token representation, particularly beneficial for OCR and large-document understanding.
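The exact ConvAdapter design is not spelled out here, but one plausible reading is a strided convolution over the vision token grid that shrinks the token count while keeping the hidden dimension fixed. A minimal PyTorch sketch under that assumption (the 1152-dim hidden size and stride of 2 are illustrative choices, not confirmed details):

```python
import torch
import torch.nn as nn

class ConvAdapter(nn.Module):
    """Illustrative sketch (not the official implementation): reduce the
    number of vision tokens with a strided convolution over the ViT token
    grid, keeping the hidden dimension unchanged."""
    def __init__(self, hidden_dim=1152, stride=2):
        super().__init__()
        self.conv = nn.Conv2d(hidden_dim, hidden_dim,
                              kernel_size=stride, stride=stride)

    def forward(self, tokens):
        # tokens: (B, N, D) with N = H * W vision tokens from the encoder
        b, n, d = tokens.shape
        h = w = int(n ** 0.5)
        x = tokens.transpose(1, 2).reshape(b, d, h, w)
        x = self.conv(x)                      # (B, D, H/s, W/s)
        return x.flatten(2).transpose(1, 2)   # (B, N/s^2, D)
```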
ByteDance curated specialized e-commerce datasets and benchmarks, including Ecom-Caption and Ecom-VQA, which target text-vision alignment and domain-specific knowledge and reasoning across single images, multiple images, and video. The training corpus comprises 2M multimodal alignment samples, 5M knowledge-injection instances, and 1M high-quality instruction examples, with Chain-of-Thought (CoT) data added to strengthen systematic reasoning and structured outputs.
Valley2's four-stage training pipeline (Text-Vision Aligning, High-Quality Knowledge Learning, Instruction Fine-Tuning, and CoT Post-Training) incorporates offline packing for a 220% training speed boost, curriculum learning, and annealing techniques.
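Offline packing itself is a standard trick: variable-length samples are grouped into fixed token budgets ahead of training so little compute is wasted on padding. A generic first-fit-decreasing packer is sketched below; Valley2's actual packing logic is not described in detail, so treat this purely as an illustration:

```python
def pack_samples(sample_lengths, max_len=4096):
    """Greedy offline packing sketch: group sample indices into bins so
    each packed sequence stays within `max_len` tokens, reducing padding
    waste. Samples longer than max_len simply get their own bin."""
    bins, bin_loads = [], []
    # First-fit decreasing: place longest samples first.
    for idx, length in sorted(enumerate(sample_lengths),
                              key=lambda x: -x[1]):
        for b, load in enumerate(bin_loads):
            if load + length <= max_len:
                bins[b].append(idx)
                bin_loads[b] += length
                break
        else:
            bins.append([idx])
            bin_loads.append(length)
    return bins
```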
Valley2 ranks second on OpenCompass among sub-10B parameter models (average score: 67.4) and achieves state-of-the-art results on e-commerce benchmarks (79.66 vs. 72.76 for similar-sized open-source models). Ablation studies confirm the contributions of Qwen2.5, ConvAdapter, and the Eagle Module. Future work will integrate audio, develop multimodal embedding training for retrieval, and create more complex benchmarks.
Efficiently serving large multimedia models using EPD Disaggregation by Gursimran Singh, Xinglu Wang, Ivan Hu, Timothy Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Xiaolong Bai, Yi Li, Ying Xiong, Yong Zhang, Zhenan Fan https://arxiv.org/abs/2501.05460
Caption: This diagram illustrates the EPD Disaggregation framework, which decouples the encoding, prefill, and decoding stages of Large Multimodal Model (LMM) inference onto dedicated resources (highlighted by the blue borders). This allows for asynchronous data transfer and optimized resource allocation for each stage, leading to improved performance and efficiency. The red-bordered components are specific to the EPD framework.
Large Multimodal Models (LMMs) introduce a computationally and memory-intensive encoding stage, impacting key Service Level Objectives (SLOs) like Time To First Token (TTFT) and end-to-end throughput. This paper introduces EPD (Encode-Prefill-Decode) Disaggregation, a framework decoupling these stages onto dedicated resources, enabling customized batching, parallelization, and scheduling. A new caching mechanism for multimodal tokens allows asynchronous transfer between stages, minimizing latency. A black-box optimization algorithm dynamically determines optimal configurations (p for parallelization, b for batch size, s for scheduling) to maximize a performance metric f(p, b, s) while minimizing GPU cost cost(p):
max<sub>(p,b,s)∈X</sub> f(p, b, s) − β · cost(p)
where X is the search space for system configurations and β controls the performance-cost trade-off.
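As a concrete (and deliberately simplified) reading of this objective, the sketch below exhaustively scores a small configuration grid. The paper uses a black-box optimizer rather than brute force, and `f` and `cost` are assumed to be user-supplied profiling callbacks, so this is only a sketch of the search problem:

```python
from itertools import product

def best_epd_config(f, cost, parallel_opts, batch_opts, sched_opts, beta=0.1):
    """Illustrative exhaustive search over the configuration space X:
    pick (p, b, s) maximizing f(p, b, s) - beta * cost(p)."""
    best_cfg, best_val = None, float("-inf")
    for p, b, s in product(parallel_opts, batch_opts, sched_opts):
        val = f(p, b, s) - beta * cost(p)
        if val > best_val:
            best_cfg, best_val = (p, b, s), val
    return best_cfg, best_val
```

The β knob is what lets operators trade raw SLO performance against the GPU budget assigned to each disaggregated stage.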
Evaluations with MiniCPMv 2.6, InternVL2-8B, and InternVL2-26B show that EPD Disaggregation substantially improves memory efficiency (up to 15x lower memory usage on encoding GPUs) and supports up to 22x larger batch sizes, 10x more images per request, and 2.2x larger KV caches. It also improves SLO attainment (by up to 90-100%), reduces TTFT by up to 71%, and raises end-to-end throughput by up to 57% compared to non-disaggregated systems. Asynchronous token transfer and dynamic resource allocation further enhance efficiency.
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency by Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Jialing Tao, YueFeng Chen, Hui Xue, Xingxing Wei https://arxiv.org/abs/2501.04931
Caption: This diagram illustrates the SI-Attack, a novel jailbreaking technique exploiting the "Shuffle Inconsistency" vulnerability in Multimodal Large Language Models (MLLMs). By shuffling text and image inputs at the word and patch levels respectively, and using a toxic judge model for optimization, SI-Attack bypasses MLLM safety mechanisms to elicit harmful responses, demonstrating significantly higher attack success rates against both open-source and closed-source models.
Despite advancements, MLLMs remain vulnerable to jailbreak attacks. This paper introduces SI-Attack, which exploits Shuffle Inconsistency: the gap between an MLLM's comprehension ability and its safety ability when processing shuffled harmful instructions. MLLMs can still understand harmful instructions whose text and images have been shuffled, but their safety mechanisms are far less effective at flagging them.
SI-Attack shuffles text at the word level (T' = Shuffle<sub>w</sub>(T), T = [w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>]) and images at the patch level (I' = Shuffle<sub>p</sub>(I), I = [p<sub>1</sub>, p<sub>2</sub>, ..., p<sub>m</sub>]). A query-based black-box optimization iteratively queries the MLLM with shuffled inputs and uses a toxic judge model (ChatGPT-3.5) to assess response harmfulness, guiding selection of increasingly harmful inputs.
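The shuffling operations themselves are straightforward. The sketch below shows word-level text shuffling and patch-level image shuffling (the 56-pixel patch size is an illustrative assumption); the query-based optimization loop and the toxic judge model are omitted:

```python
import random
import numpy as np

def shuffle_words(text, rng=random):
    """Word-level shuffle T' = Shuffle_w(T)."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)

def shuffle_patches(image, patch=56, rng=np.random):
    """Patch-level shuffle I' = Shuffle_p(I) for an HxWxC array whose
    sides are divisible by `patch` (patch size is an assumption)."""
    h, w, c = image.shape
    patches = (image.reshape(h // patch, patch, w // patch, patch, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch, patch, c))
    rng.shuffle(patches)  # shuffles along the patch axis
    return (patches.reshape(h // patch, w // patch, patch, patch, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(h, w, c))
```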
SI-Attack outperforms existing methods on MM-SafetyBench, HADES, and SafeBench. It achieves attack success rates (ASR) of 62.68%, 62.44%, 71.01%, and 40.77% for LLaVA-NEXT, MiniGPT-4, InternVL-2, and VLGuard, respectively, surpassing QR-Attack by a large margin (18.69% to 35.24%). It is also effective against closed-source models such as GPT-4o (68.57% ASR) and Claude-3.5-Sonnet (47.20% ASR). This Shuffle Inconsistency highlights a critical security gap and demands more robust safety mechanisms.
This newsletter has highlighted key themes in multimodal research. The introduction of EMMA provides a much-needed robust benchmark for evaluating true multimodal reasoning capabilities, revealing current limitations of MLLMs. Valley2 demonstrates progress in architectural design and training methodologies, particularly for specialized domains like e-commerce. EPD Disaggregation offers a practical solution for efficiently serving large multimodal models, addressing critical performance bottlenecks. Finally, the discovery of Shuffle Inconsistency exposes a significant security vulnerability, emphasizing the urgent need for stronger safety mechanisms in MLLMs. These advancements and challenges underscore the dynamic nature of this field and the continued push towards more capable, robust, and secure multimodal AI systems.