This newsletter explores the cutting edge of multimodal AI, focusing on rapid advances in image and text foundation models. We'll look at new benchmarks that expose limitations in current models, architectures pushing toward unified multimodal understanding and generation, and new datasets that widen the range of tasks these models can tackle. From probing ethical blind spots to generating remote sensing imagery, this collection of research highlights how quickly the multimodal landscape is evolving.
M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs by Bei Yan, Jie Zhang, Zhiyuan Chen, Shiguang Shan, Xilin Chen https://arxiv.org/abs/2412.20718
Caption: This diagram illustrates the construction of M³oralBench, a multimodal moral benchmark for Large Vision-Language Models (LVLMs). The process involves expanding Moral Foundations Vignettes (MFVs) using GPT-4o, generating images from these scenarios with SD3.0, and creating moral tasks (judgment, classification, response) presented as multiple-choice questions. The resulting benchmark assesses LVLMs' ability to understand and reason about moral situations presented in both visual and textual formats.
The increasing integration of Large Vision-Language Models (LVLMs) into critical sectors necessitates a robust evaluation of their ethical alignment. M³oralBench, the first multimodal moral benchmark specifically for LVLMs, addresses this need by leveraging visual and textual cues to assess moral understanding and reasoning. Grounded in Moral Foundations Theory (MFT), this benchmark expands upon existing Moral Foundations Vignettes (MFVs).
The construction of M³oralBench involved expanding MFV scenarios with GPT-4o, resulting in 1,160 diverse scenarios depicting everyday moral violations. These scenarios were then transformed into detailed image generation prompts and used to create corresponding images with the SD3.0 diffusion model. Three distinct moral tasks were designed: moral judgment (assessing right from wrong), moral classification (identifying the violated moral foundation), and moral response (choosing an appropriate reaction). These tasks, presented as multiple-choice questions, resulted in a total of 4,640 instructions.
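To make the task format concrete, here is a rough sketch of what a single instruction could look like; the field names and example content below are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical shape of one M³oralBench multiple-choice instruction.
# Field names and contents are illustrative assumptions, not the real schema.
example_instruction = {
    "image": "images/scenario_0421.png",   # rendered by SD3.0 from the scenario prompt
    "task": "moral_judgment",              # or "moral_classification", "moral_response"
    "question": "Is the behavior shown in the image morally acceptable?",
    "options": {"A": "Acceptable", "B": "Not acceptable",
                "C": "Cannot be determined", "D": "Not a moral issue"},
    "answer": "B",
    "foundation": "Care/Harm",             # MFT foundation targeted by the scenario
}
```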
Evaluation on ten prominent LVLMs, including both open-source and closed-source models, revealed significant performance gaps. Closed-source models generally outperformed their open-source counterparts, likely due to more rigorous alignment procedures during development. Among open-source models, GLM-4V showed the best overall performance. Across all models, moral classification emerged as the most challenging task, with average accuracy below 40%. Models also exhibited stronger performance on Care/Harm and Fairness/Cheating foundations compared to Loyalty/Betrayal and Sanctity/Degradation, highlighting specific areas for improvement. Model performance was quantified by the probability of choosing the correct option $o_r$: $\frac{1}{M}\sum_{j=1}^{M}\mathbb{I}[a_j = o_r]$, where $M$ is the number of sampled responses $a_j$ and $\mathbb{I}[\cdot]$ is the indicator function.
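As a minimal sketch, this metric can be computed as below, assuming each question is posed $M$ times and the sampled responses have already been parsed into option letters (function and variable names are ours, not the paper's):

```python
def option_accuracy(sampled_answers, correct_option):
    """Fraction of M sampled responses that match the correct option o_r."""
    M = len(sampled_answers)
    return sum(a == correct_option for a in sampled_answers) / M

# Example: 3 of 4 sampled responses pick the ground-truth option "B" -> 0.75
print(option_accuracy(["B", "B", "A", "B"], "B"))
```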
The findings from M³oralBench underscore the need for continued research in aligning LVLMs with human values. The benchmark exposes limitations in nuanced moral reasoning and handling sensitive moral foundations, paving the way for the development of more ethically sound and reliable AI. It also highlights the need for improved evaluation methods that capture the multimodal nature of moral understanding and the interplay between visual and textual information in moral decision-making.
Dual Diffusion for Unified Image Generation and Understanding by Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, Peng Wang https://arxiv.org/abs/2501.00289
This paper addresses the challenge of building a single model proficient in both image generation and understanding, introducing the Dual Diffusion Transformer (D-DiT). The architecture unifies continuous diffusion for images with masked discrete diffusion for text under a cross-modal maximum likelihood estimation framework, allowing D-DiT to learn the conditional likelihoods of both images and text simultaneously under a single loss function.
D-DiT consists of two transformer branches, one for image tokens and another for text tokens, with cross-attention between them at each layer. The image branch predicts the velocity field of a continuous diffusion process conditioned on text. The text branch uses a masked discrete diffusion process to predict denoised text tokens conditioned on the image. The joint training objective combines a flow matching loss for image diffusion, $\mathcal{L}_{\text{image}} = \mathbb{E}_{t,\,q(\text{img})}\big\|v_\theta(x_t^{(\text{img})}, t, x^{(\text{txt})}) - (\epsilon - x^{(\text{img})})\big\|^2$, and a masked diffusion loss for text, $\mathcal{L}_{\text{text}} = \mathbb{E}_{q(\text{txt})}\Big[\sum_{i=1}^{K} \tfrac{1}{t_i}\log\big(x_\theta(x_{t_i}^{(\text{txt})}, x^{(\text{img})}) \cdot x^{(\text{txt})}\big)\Big]$, where $\epsilon$ is the sampled noise, $v_\theta$ is the predicted velocity, $x_\theta$ is the predicted denoised text, and $t_i$ are sampled diffusion timesteps. The overall loss is a weighted sum: $\mathcal{L}_{\text{dual}} = \mathcal{L}_{\text{image}} + \lambda\,\mathcal{L}_{\text{text}}$.
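A condensed, PyTorch-style sketch of how such a joint objective could be assembled; the branch interfaces (`image_branch`, `text_branch`), masking scheme, and loss weighting here are our assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def dual_diffusion_loss(model, img, txt_tokens, mask_token_id, lam=0.5):
    """Illustrative joint objective: flow-matching loss on the image branch plus a
    masked-token reconstruction loss on the text branch (hypothetical interfaces)."""
    B = img.size(0)

    # --- Image branch: continuous diffusion with a flow-matching target ---
    t = torch.rand(B, device=img.device)                      # per-sample timestep in [0, 1)
    eps = torch.randn_like(img)                                # Gaussian noise
    x_t = (1 - t.view(B, 1, 1, 1)) * img + t.view(B, 1, 1, 1) * eps  # noisy interpolant
    v_pred = model.image_branch(x_t, t, txt_tokens)            # predicted velocity field
    loss_image = F.mse_loss(v_pred, eps - img)                 # regress toward (eps - x_img)

    # --- Text branch: masked discrete diffusion ---
    t_txt = torch.rand(B, 1, device=img.device)                # masking rate per sample
    mask = torch.rand(txt_tokens.shape, device=img.device) < t_txt
    noisy_txt = torch.where(mask, torch.full_like(txt_tokens, mask_token_id), txt_tokens)
    logits = model.text_branch(noisy_txt, img)                 # (B, L, vocab) clean-token logits
    ce = F.cross_entropy(logits.transpose(1, 2), txt_tokens, reduction="none")  # (B, L)
    # weight masked positions by 1/t, as in masked diffusion objectives
    loss_text = ((ce * mask.float()) / t_txt).sum() / mask.sum().clamp(min=1)

    return loss_image + lam * loss_text
```

The point mirrored from the paper is that both branches are optimized under one weighted sum of losses rather than alternating generation and understanding objectives.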
Evaluations across image generation, captioning, and visual question answering demonstrate D-DiT's capabilities. It achieved text-to-image generation performance comparable to SD3 and competitive results on VQA benchmarks, even surpassing some models that combine image-to-text (I2T) and text-to-image (T2I) capabilities. This is particularly significant because D-DiT is the first diffusion-only model to do so. It also showed promise in long-form visual assistance, providing detailed responses to image-related queries. Ablation studies confirmed the importance of joint training and the bidirectional transformer architecture, positioning D-DiT as a potent alternative to autoregressive models for unified multimodal understanding and generation.
GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models by Shangyu Xing, Changhao Xiang, Yuteng Han, Yifan Yue, Zhen Wu, Xinyu Liu, Zhangtai Wu, Fei Zhao, Xinyu Dai https://arxiv.org/abs/2412.21036
Caption: This scatter plot visualizes MLLM performance on the GePBench, a new benchmark assessing geometric perception. It compares performance on GePBench (y-axis) with performance on OpenCompass (x-axis), highlighting that even top-performing models struggle with basic geometric understanding, often scoring worse than random guessing (25%). The plot also shows the "passing threshold" for GePBench, indicating that most models fall short of satisfactory performance.
While MLLMs excel in complex real-world scenarios, GePBench reveals a surprising weakness: these models struggle with basic geometric perception. This benchmark focuses exclusively on geometric figures, assessing core competencies like spatial perception, shape understanding, and relationship identification across six dimensions: location, size, existence, counting, reference, and relationships. Comprising 20,000 images and 250,000 multiple-choice questions, GePBench offers easy and hard levels based on shape complexity and visual noise.
A specialized data synthesis engine generates structured textual descriptions, translates them into geometric figures with Matplotlib, and adds visual noise. Multiple-choice questions are automatically generated, targeting the six dimensions of geometric perception. Evaluations revealed significant limitations in current MLLMs, with even the best model, Gemini-1.5-pro, achieving only 69.4% average accuracy. Performance was particularly weak in the size and location dimensions, with some models performing worse than random guessing. Scaling model size provided limited improvement, suggesting that simply increasing size isn't enough to address these fundamental perceptual challenges.
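As a toy illustration of the kind of synthesis engine described above, the sketch below renders a structured scene description with Matplotlib, adds scatter-point noise, and pairs the image with an auto-generated counting question; the scene format, noise model, and question template are assumptions on our part, not GePBench's actual pipeline:

```python
import random
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def render_scene(scene, out_path="scene.png", noise_points=40):
    """Render a structured scene description into a geometric figure,
    with random scatter points as simple visual noise (illustrative only)."""
    fig, ax = plt.subplots(figsize=(4, 4))
    for shape in scene["shapes"]:
        if shape["type"] == "circle":
            ax.add_patch(patches.Circle(shape["center"], shape["radius"], fill=False))
        elif shape["type"] == "rectangle":
            ax.add_patch(patches.Rectangle(shape["xy"], shape["w"], shape["h"], fill=False))
    # visual noise: random gray scatter points over the canvas
    xs = [random.uniform(0, 10) for _ in range(noise_points)]
    ys = [random.uniform(0, 10) for _ in range(noise_points)]
    ax.scatter(xs, ys, s=4, c="gray")
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 10)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)

scene = {"shapes": [
    {"type": "circle", "center": (3, 7), "radius": 1.5},
    {"type": "rectangle", "xy": (5, 2), "w": 3, "h": 2},
]}
render_scene(scene)

# Auto-generated counting question (one of the six perception dimensions)
question = {
    "image": "scene.png",
    "dimension": "counting",
    "question": "How many shapes are in the image?",
    "options": {"A": "1", "B": "2", "C": "3", "D": "4"},
    "answer": "B",
}
```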
Different visual encoders within MLLMs showed specialization in different geometric dimensions, but combining them didn't significantly improve results. Training a new model, LLaVA-GeP, on GePBench data led to improvements on downstream tasks involving spatial perception and diagram understanding, averaging a 0.8% boost across nine real-world benchmarks. This underscores the importance of geometric perception as a foundational skill for advanced multimodal applications and highlights GePBench as a valuable tool for evaluating and improving MLLMs.
Text2Earth: Unlocking Text-driven Remote Sensing Image Generation with a Global-Scale Dataset and a Foundation Model by Chenyang Liu, Keyan Chen, Rui Zhao, Zhengxia Zou, Zhenwei Shi https://arxiv.org/abs/2501.00895