This newsletter explores cutting-edge research in multimodal image and text foundation models, covering new architectures, training strategies, evaluation benchmarks, and domain-specific applications. From Apple's new open vision model to advancements in 4D scene simulation and specialized medical AI, this edition provides a comprehensive overview of the key developments shaping the future of this exciting field.
Multimodal Autoregressive Pre-training of Large Vision Encoders by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby https://arxiv.org/abs/2411.14402
Caption: This diagram illustrates the architecture of AIMv2, showcasing the Prefix Vision Encoder processing image patches (1-4) and the Causal Multimodal Decoder generating both text ("Grayscale photo of the Eiffel Tower") and image patches based on the encoded visual information. The model is trained using a combined Pixel MSE Loss for image generation and Cross-entropy Loss for text generation. The final grayscale image of the Eiffel Tower is generated by "patchifying" the output image patches.
Apple researchers have introduced AIMv2, a family of open vision models designed to advance visual understanding. Departing from methods relying solely on discriminative or generative pre-training, AIMv2 combines the strengths of both in a multimodal setting. A large vision encoder is pre-trained alongside a multimodal decoder that autoregressively generates both image patches and text tokens. This joint training, using a unified objective function, allows AIMv2 to learn richer visual representations.
The objective function minimizes the combined loss of image and text generation:
L<sub>text</sub> + α * L<sub>img</sub>
where L<sub>text</sub> is the cross-entropy loss for text tokens and L<sub>img</sub> is the L2 pixel-level regression loss for image patches. This offers several advantages, including straightforward implementation, seamless integration with LLM-powered multimodal applications, and denser supervision compared to discriminative methods.
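To make the objective concrete, here is a minimal PyTorch-style sketch of how such a combined loss could be computed; the function name, tensor shapes, and default α are illustrative assumptions, not the authors' implementation.

```python
import torch.nn.functional as F

def aimv2_style_loss(text_logits, text_targets, patch_preds, patch_targets, alpha=1.0):
    """Combined autoregressive objective: cross-entropy on next text tokens plus
    L2 pixel regression on image patches (illustrative sketch only)."""
    # Cross-entropy over the vocabulary for each predicted text token.
    l_text = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )
    # Mean squared error between predicted and ground-truth pixel patches.
    l_img = F.mse_loss(patch_preds, patch_targets)
    return l_text + alpha * l_img
```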
AIMv2 uses a Vision Transformer (ViT) for the encoder and a causal multimodal decoder. The encoder processes image patches using prefix attention, while the decoder combines these representations with text embeddings, generating the next token in the sequence, regardless of modality. Trained on a massive 12 billion image-text pair dataset (a mix of public and proprietary data), AIMv2 boasts impressive performance. Post-training strategies further enhance its capabilities for downstream tasks. Notably, AIMv2-3B achieves 89.5% top-1 accuracy on ImageNet-1k with a frozen trunk, outperforming state-of-the-art contrastive models like CLIP and SigLIP in multimodal image understanding. Its strong performance and open-source nature position it as a promising foundation for future advancements in vision modeling.
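For readers unfamiliar with prefix attention, the toy function below builds the kind of mask it implies: tokens inside a prefix attend bidirectionally while the remaining tokens stay causal. The prefix length and boolean-mask convention are assumptions for illustration; AIMv2's exact masking scheme may differ.

```python
import torch

def prefix_attention_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Boolean attention mask (True = attend): bidirectional within the first
    `prefix_len` tokens, causal for the rest. Illustrative sketch."""
    # Standard causal (lower-triangular) mask.
    mask = torch.ones(seq_len, seq_len).tril().bool()
    # Let prefix tokens attend to each other in both directions.
    mask[:prefix_len, :prefix_len] = True
    return mask

# Example: 8 patch tokens with a prefix of 3.
print(prefix_attention_mask(8, 3).int())
```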
Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains by Yurii Paniv, Artur Kiulian, Dmytro Chaplynskyi, Mykola Khandoga, Anton Polishko, Tetiana Bas, Guillermo Gabrielli https://arxiv.org/abs/2411.14647
Caption: This chart visualizes the number of questions included in the ZNO-Vision benchmark, broken down by year and academic subject from 2006 to 2024. The benchmark, derived from the Ukrainian standardized university entrance exam (ZNO), tests vision-language models across diverse disciplines, as demonstrated by the variety of subjects represented.
This research addresses the significant gap in evaluating vision-language models (VLMs) for low-resource languages, specifically Ukrainian. The researchers introduce ZNO-Vision, a comprehensive multimodal Ukrainian benchmark based on the standardized university entrance exam (ZNO). This benchmark includes over 4,300 questions across 12 academic disciplines. The study also includes the first evaluation of multimodal text generation for Ukrainian, assessing caption generation on Multi30K-UK and translating the VQA benchmark. Additionally, UACUISINE, a novel benchmark based on Ukrainian cuisine, tests models' cultural knowledge.
Several proprietary and open-source VLMs were evaluated. ZNO-Vision required models to answer multiple-choice questions based on images and text, and OCR capabilities were also tested. Caption generation on Multi30K-UK was scored with SacreBLEU and BERTScore. The translated VQA 2.0 benchmark followed the original evaluation protocol. UACUISINE employed exact match (EM) for dish recognition, intersection match (IM) for ingredient recognition, and BERTScore for recipe generation. A fine-tuned PaliGemma model was also tested.
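As a rough illustration of the UACUISINE metrics, the snippet below sketches plausible exact-match and intersection-match implementations; the benchmark's actual normalization and scoring rules may differ.

```python
def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def intersection_match(predicted: set[str], reference: set[str]) -> float:
    """Fraction of reference items (e.g. ingredients) recovered by the prediction."""
    return len(predicted & reference) / len(reference) if reference else 0.0

# Example: ingredient recognition for a dish with four ground-truth ingredients.
print(intersection_match({"beetroot", "cabbage", "potato"},
                         {"beetroot", "cabbage", "potato", "dill"}))  # 0.75
```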
Results on ZNO-Vision showed Gemini Pro and Claude 3.5 Sonnet performing best, with Qwen2-VL-72B being the only open-source model to significantly outperform the baseline. Results on Multi30K-UK were inconclusive. The translated VQA revealed performance degradation across all models compared to the English version, highlighting potential biases. UACUISINE demonstrated the benefits of fine-tuning, with the specialized PaliGemma model outperforming Qwen. This research provides valuable insights into Ukrainian VLM development, underscoring the need for improved language capabilities and culturally sensitive training data.
Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation by Zhuoman Liu, Weicai Ye, Yan Luximon, Pengfei Wan, Di Zhang https://arxiv.org/abs/2411.14423
Caption: This framework uses multi-modal foundation models (e.g., GPT-4) to infer material properties from images and text, then refines these properties through video diffusion and a differentiable Material Point Method (MPM) guided by optical flow loss. The optimized material parameters are applied to a 3D Gaussian splat representation of the scene, enabling realistic 4D dynamic simulations of complex object interactions. The example shows an alocasia plant swaying, demonstrating the method's ability to simulate deformable objects.
This research introduces a novel approach for enhanced 4D dynamic scene simulation, leveraging the power of multi-modal foundation models and video diffusion. Realistic simulation requires accurate material properties and complex object interactions grounded in physics. Existing methods often fall short, limited by basic material types and parameters. This new method utilizes multi-modal models like GPT-4 to identify materials and initialize parameters from image and text queries, simultaneously inferring 3D Gaussian splats for detailed scene representation.
These initial parameters are then refined using video diffusion with a differentiable Material Point Method (MPM), guided by optical flow. This avoids computationally expensive render loss or SDS loss. The optical flow loss is defined as:
L<sub>flow</sub> = ∑<sub>t</sub> ||U(I<sub>t</sub>, I<sub>t+1</sub>) – Û(Î<sub>t</sub>, Î<sub>t+1</sub>)||<sub>2</sub>,
where U and Û represent the optical flow from input and simulated frames. This allows for efficient optimization of material properties even for complex motions.
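A compact sketch of this loss is shown below; `flow_fn` stands in for any differentiable optical-flow estimator (e.g. a pretrained RAFT wrapper), which is an assumption on our part rather than the paper's specific choice.

```python
import torch

def optical_flow_loss(flow_fn, input_frames, simulated_frames):
    """Sum over time of the L2 distance between flow computed on the input
    video and flow computed on the simulated video (illustrative sketch)."""
    loss = torch.zeros(())
    for t in range(len(input_frames) - 1):
        u_real = flow_fn(input_frames[t], input_frames[t + 1])
        u_sim = flow_fn(simulated_frames[t], simulated_frames[t + 1])
        loss = loss + torch.linalg.vector_norm(u_real - u_sim)
    return loss
```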
Evaluated on synthetic and real-world datasets, the method demonstrates superior performance. On synthetic data, it achieves lower absolute errors in predicting material properties compared to methods like PAC-NeRF and GIC. On real-world data, human evaluation shows significant improvements in physical-realism and photo-realism. This integrated framework, combining multi-modal understanding, video diffusion, and differentiable physics, represents a significant advancement in physics-based simulation, offering increased accuracy and flexibility for applications in robotics and video generation.
HeadRouter: A Training-free Image Editing Framework for MM-DiTs by Adaptively Routing Attention Heads by Yu Xu, Fan Tang, Juan Cao, Yuxin Zhang, Xiaoyu Kong, Jintao Li, Oliver Deussen, Tong-Yee Lee https://arxiv.org/abs/2411.15034
Caption: This diagram illustrates the HeadRouter framework for training-free image editing with Multimodal Diffusion Transformers (MM-DiTs). It showcases the Reconstruction and Editing branches, highlighting the Instance-adaptive Attention Head Router (IARouter) for selective head activation and the Dual-token Refinement Module (DTR) for enhancing image and text tokens. The framework enables precise semantic control for targeted edits by leveraging the inherent properties of MM-DiTs.
Multimodal Diffusion Transformers (MM-DiTs) excel in image generation, but targeted text-guided editing remains challenging due to their lack of explicit cross-attention mechanisms. HeadRouter, a novel training-free framework, addresses this by adaptively routing attention heads within MM-DiTs for precise semantic control.
The research reveals that different attention heads within MM-DiTs exhibit varying sensitivities to specific semantic concepts. This contrasts with CLIP-ViT, where specific heads are linked to specific image properties. HeadRouter leverages this adaptive semantic distribution with two key techniques: the Instance-adaptive Attention Head Router (IARouter) and the Dual-token Refinement Module (DTR).
IARouter selectively activates heads based on their sensitivity to the target semantic, calculated using a normalized dissimilarity score (d<sub>h</sub>) based on cosine similarity (s<sub>h</sub>): d<sub>h</sub> = (s<sub>max</sub> - s<sub>h</sub>) / (s<sub>max</sub> - s<sub>min</sub>). Weights (w<sub>h</sub>) are then assigned using a sigmoid function. DTR refines the editing process by enhancing image and text tokens, focusing edits on key regions and preserving text guidance across attention blocks.
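The snippet below sketches how this head weighting could look in practice. The sigmoid gain, the threshold, and the convention that more dissimilar (i.e. more semantic-sensitive) heads receive larger weights are assumptions made for illustration; HeadRouter's exact mapping may differ.

```python
import torch

def iarouter_weights(sims: torch.Tensor, k: float = 10.0, tau: float = 0.5) -> torch.Tensor:
    """Turn per-head cosine similarities s_h into normalized dissimilarities d_h,
    then map them to soft head weights with a sigmoid (illustrative sketch)."""
    s_max, s_min = sims.max(), sims.min()
    d = (s_max - sims) / (s_max - s_min + 1e-8)   # d_h as defined above, in [0, 1]
    return torch.sigmoid(k * (d - tau))           # w_h in (0, 1)

print(iarouter_weights(torch.tensor([0.91, 0.34, 0.58])))
```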
Experimental results on TEDBench++ and PIE-Bench demonstrate HeadRouter's effectiveness, achieving better structure alignment (DINO: 0.9194), prompt alignment (CLIP: 0.3203), and image quality (LPIPS: 0.2103) than other training-free methods. This approach offers a promising direction for future research in text-guided image editing with diffusion transformers.
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI by Tianbin Li, Yanzhou Su, Wei Li, Bin Fu, Zhe Chen, Ziyan Huang, Guoan Wang, Chenglong Ma, Ying Chen, Ming Hu, Yanjun Li, Pengcheng Chen, Xiaowei Hu, Zhongying Deng, Yuanfeng Ji, Jin Ye, Yu Qiao, Junjun He https://arxiv.org/abs/2411.14522
Caption: Researchers have introduced GMAI-VL, a novel large vision-language model specifically designed for the medical domain, trained on a massive, multimodal dataset called GMAI-VL-5.5M. This dataset comprises 5.5 million image-text pairs spanning various medical specialties and modalities, enabling GMAI-VL to achieve state-of-the-art performance on several medical benchmarks, paving the way for more accurate and reliable medical image understanding. The image depicts the data collection, preprocessing, dataset creation, and the three-stage training process of the GMAI-VL model.
This research introduces GMAI-VL, a large vision-language model specialized for the medical domain, trained on GMAI-VL-5.5M, a comprehensive multimodal dataset containing 5.5 million image-text pairs derived from 219 medical imaging datasets. This dataset covers 13 modalities and 18 specialties, offering broad coverage of medical tasks such as diagnosis and severity assessment. The inclusion of multilingual data (English and Chinese) enhances generalization capabilities.
GMAI-VL-5.5M was created using an annotation-guided data generation method. Key annotations from open-source datasets were used to generate detailed image descriptions and instruction-following data with GPT-4o, ensuring high-quality, pathology-specific descriptions.
GMAI-VL, based on the LLaVA architecture, utilizes a three-stage training strategy: shallow alignment, deep alignment, and instruction tuning. This progressive approach strengthens the model's ability to integrate visual and linguistic features, improving its performance on various medical tasks.
GMAI-VL achieves state-of-the-art results on various medical benchmarks, including VQA-RAD (66.3%), SLAKE, PMC-VQA, OmniMedVQA (88.48% average accuracy), GMAI-MMBench (62.43%), and the Health & Medicine track of MMMU (51.3%). This work represents a significant step towards building truly generalist medical AI, with the comprehensive dataset and robust architecture paving the way for more accurate medical image understanding.
Evaluating and Advancing Multimodal Large Language Models in Ability Lens by Feng Chen, Chenhui Gou, Jing Liu, Yang Yang, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Bohan Zhuang, Qi Wu https://arxiv.org/abs/2411.14725
Caption: (a) Offline evaluation of MLLMs using AbilityLens, which assesses various perception skills. (b) Online evaluation during training tracks performance over time, revealing ability conflicts and informing checkpoint selection. (c) Ability-Specific Model Merging (ASMM) combines checkpoints to enhance overall performance and stability by mitigating ability conflicts.
This research tackles the challenge of evaluating Multimodal Large Language Models (MLLMs), introducing AbilityLens, a new benchmark designed for a more robust and unified assessment of their visual perception skills. Existing benchmarks often lead to inconsistent evaluations due to their focus on specific question types or metrics, neglecting the crucial aspect of stability.
AbilityLens evaluates six key perception abilities: counting, OCR, attribute recognition, entity extraction, grounding, and structural data understanding. It leverages data from 11 existing benchmarks, providing diverse question types and domains. Model accuracy (A) is calculated as a weighted sum of baseline-corrected sub-metrics:
A = (Σ nᵢmᵢ) / N,
where N is the total sample count and nᵢ is the sample count for sub-metric mᵢ. Stability (I) is measured by the standard deviation of the sub-metrics:
I = std(m).
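Concretely, the two quantities could be computed per ability as in the sketch below, assuming the sub-metrics mᵢ have already been baseline-corrected; the variable names and example numbers are illustrative.

```python
import statistics

def ability_accuracy(sub_metrics: list[float], sample_counts: list[int]) -> float:
    """Accuracy A: sample-count-weighted average of an ability's sub-metrics."""
    return sum(n * m for n, m in zip(sample_counts, sub_metrics)) / sum(sample_counts)

def ability_instability(sub_metrics: list[float]) -> float:
    """Stability indicator I: standard deviation across the sub-metrics."""
    return statistics.pstdev(sub_metrics)

# Example: three sub-benchmarks contributing to the "counting" ability.
metrics, counts = [0.62, 0.55, 0.70], [500, 300, 200]
print(ability_accuracy(metrics, counts), ability_instability(metrics))
```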
AbilityLens offers both online and offline evaluation modes. Online evaluation enables real-time monitoring of training dynamics, while offline evaluation provides absolute rankings. Evaluations of 14 state-of-the-art MLLMs reveal a significant stability gap between open-source and commercial models. The research also identifies "ability conflict," where some abilities decline during training as others improve.
To mitigate this, the authors propose Ability-Specific Model Merging (ASMM), which combines a final checkpoint with earlier, ability-specific checkpoints using linear interpolation:
θ<sub>interpolate</sub> = α ⋅ θ<sub>base</sub> + (1 - α) ⋅ θ<sub>checkpoint</sub>.
ASMM improves both accuracy and stability, offering valuable insights for future MLLM development.
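A minimal sketch of such a merge over two PyTorch state dicts is given below; the mixing coefficient α and the assumption that both checkpoints share identical parameter names are illustrative choices, not details taken from the paper.

```python
def merge_checkpoints(base_state: dict, ckpt_state: dict, alpha: float = 0.7) -> dict:
    """Linear interpolation of PyTorch parameters:
    theta = alpha * theta_base + (1 - alpha) * theta_checkpoint (sketch)."""
    return {name: alpha * p + (1.0 - alpha) * ckpt_state[name]
            for name, p in base_state.items()}

# Usage sketch: blend the final checkpoint with an earlier, ability-specific one.
# merged = merge_checkpoints(final_model.state_dict(), earlier_model.state_dict())
```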
This newsletter has highlighted significant advancements in the field of multimodal image and text foundation models. We've seen the introduction of novel architectures like AIMv2, which leverage autoregressive pre-training for enhanced visual understanding. New benchmarks like AbilityLens and ZNO-Vision are addressing the critical need for robust evaluation and focusing on low-resource languages. Furthermore, the integration of multi-modal foundation models with techniques like video diffusion is pushing the boundaries of 4D dynamic scene simulation. Finally, the development of specialized models like GMAI-VL, trained on massive multimodal datasets like GMAI-VL-5.5M, demonstrates the potential of these models to revolutionize specific domains like medicine. These diverse research efforts collectively contribute to a more comprehensive and nuanced understanding of multimodal AI, paving the way for exciting future applications.