This newsletter explores the cutting edge of multimodal AI, focusing on the exciting developments in image and text foundation models. We'll delve into novel architectures, training paradigms, and evaluation strategies that are pushing the boundaries of what's possible in this rapidly evolving field. From generating images from complex text-image instructions to understanding the nuances of multimodal persona embodiment, this newsletter provides a concise yet comprehensive overview of the key advancements driving the future of multimodal AI.
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think by Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang https://arxiv.org/abs/2502.20172
Caption: This diagram illustrates three different approaches to conditioning diffusion models for image generation. (a) Emu2 uses regression on the LMM output. (b) Seed employs special tokens within the LMM. (c) Dream Engine (Ours) leverages the hidden states of the LMM, providing a richer representation for multimodal control.
The current landscape of text-to-image generation is rapidly evolving, with unified frameworks combining powerful text encoders like CLIP and T5 with Diffusion Transformer backbones gaining prominence. However, these models often struggle with complex, interleaved text-image instructions, particularly those involving the fusion of concepts from multiple images. Existing control methods like ControlNet and IP-Adapter provide low-level control, but they fall short when it comes to high-level semantic manipulation. DREAM ENGINE addresses this limitation by introducing a novel approach to arbitrary text-image interleaved control.
The core innovation of DREAM ENGINE lies in its use of Large Multimodal Models (LMMs), such as QwenVL. These LMMs provide a shared representation space where text and images align seamlessly, creating an ideal conditioning input for external diffusion models. Instead of relying on traditional text-only encoders as seen in models like Stable Diffusion v3.5, DREAM ENGINE incorporates an LMM and a lightweight projector layer. The training process involves a two-stage paradigm: joint text-image alignment followed by multimodal interleaved instruction tuning. A key feature is the inclusion of a skip connection for visual features within the LMM, controlled by a blending ratio r (h₁ = (1 − r) · h_LLM + r · h_ViT), which enables fine-grained control over visual consistency.
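The skip connection above is a simple convex combination of the two feature streams. A minimal NumPy sketch, with illustrative function and variable names (not from the paper's code), shows how the blending ratio r trades off between the LMM's semantic features and the raw ViT visual features:

```python
import numpy as np

def blend_visual_features(h_llm: np.ndarray, h_vit: np.ndarray, r: float) -> np.ndarray:
    """Blend LMM hidden states with raw ViT visual features.

    Implements h1 = (1 - r) * h_llm + r * h_vit. A higher blending
    ratio r weights the raw visual features more heavily, which the
    paper's ablation links to greater visual fidelity in reconstruction.
    """
    assert 0.0 <= r <= 1.0, "blending ratio must lie in [0, 1]"
    return (1.0 - r) * h_llm + r * h_vit

# Toy example with 4-dimensional feature vectors.
h_llm = np.array([1.0, 0.0, 1.0, 0.0])
h_vit = np.array([0.0, 1.0, 0.0, 1.0])
blended = blend_visual_features(h_llm, h_vit, 0.25)  # stays closer to h_llm
```

With r = 0 the diffusion model is conditioned purely on the LMM's aligned representation; with r = 1 it sees only the ViT features, maximizing pixel-level consistency at the cost of semantic flexibility.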
The results achieved by DREAM ENGINE are impressive. On the GenEval benchmark, it attains an overall score of 0.69, comparable to state-of-the-art models like SDv3.5 (0.71) and outperforming FLUX.1 Dev (0.66). This demonstrates the feasibility of replacing text encoders with powerful multimodal encoders without compromising generation quality. Moreover, DREAM ENGINE excels in tasks requiring text-image interleaved instructions, significantly outperforming models like Emu2-gen with considerably less training data. It exhibits emergent capabilities, such as synthesizing concepts from multiple input images based on textual prompts, a feat not explicitly present in the training data.
The training process reveals a fascinating concept-to-detail progression. Initially, DREAM ENGINE focuses on reconstructing primary concepts from images, leveraging the aligned representation space of the LMM. As training progresses, it refines details like colors, shapes, and poses. This dynamic showcases the strength of LMMs in providing a unified framework for text and image understanding. An ablation study on the visual blending ratio r further confirms its importance in controlling visual consistency, with higher ratios resulting in greater fidelity in image reconstruction.
A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs by Julius Broomfield, Kartik Sharma, Srijan Kumar https://arxiv.org/abs/2502.20504
Caption: This bar chart visualizes the refusal counts of five multimodal LLMs (GPT-4o, GPT-4o Mini, Llama 11B, Llama 90B, and Pixtral 12B) across four persona modalities: text, image, assisted image (image with short text), and descriptive image (text embedded within the image). The results demonstrate a significant preference for text-based personas, with Llama 90B exhibiting the highest refusal count across all modalities. The other LLMs have negligible refusal counts.
Large language models (LLMs) are becoming increasingly proficient at embodying personas, which enhances their capabilities as conversational agents and virtual assistants. While LLMs have advanced significantly in multimodal processing, the influence of persona modality (text vs. image) on LLM embodiment remains largely unexplored. This research investigates how different modalities affect the expressiveness of personas in multimodal LLMs.
The researchers constructed a modality-parallel dataset of 40 diverse personas, each represented in four ways: image-only, text-only, image with short text, and typographical images (text stylized within the image). They then employed a systematic evaluation framework with 60 questions and corresponding metrics to assess how well five multimodal LLMs (GPT-4o, GPT-4o mini, Llama 3.2 11B, Llama 3.2 90B, and Pixtral 12B) embodied each persona across various attributes and scenarios. The evaluation included both LLM-based assessments (persona consistency, linguistic habits, action justification, and expected action) and linguistic analysis (types, root type-token ratio, and measure of textual lexical diversity).
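Of the linguistic measures listed above, the root type-token ratio (also known as Guiraud's index) is the most straightforward: it divides the number of unique word types by the square root of the token count, dampening the usual tendency of the plain type-token ratio to shrink as texts get longer. A minimal sketch, using naive whitespace tokenization for illustration (the paper's exact tokenization is not specified here):

```python
import math

def root_ttr(text: str) -> float:
    """Root type-token ratio: unique types / sqrt(total tokens).

    Uses simple lowercased whitespace tokenization; a real evaluation
    pipeline would likely use a proper tokenizer.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / math.sqrt(len(tokens))

sample = "the cat sat on the mat"
score = root_ttr(sample)  # 5 unique types over 6 tokens -> 5 / sqrt(6)
```

Higher values indicate more lexically diverse persona responses, which is one signal the study uses to compare text-conditioned and image-conditioned outputs.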
The study reveals a clear preference for text-based personas. Across all LLMs, text-based personas consistently scored higher in most evaluation criteria, especially in linguistic habits, with a minimum 0.2 increase in score. This suggests that LLMs are more proficient at extracting and utilizing information from textual descriptions than from visual representations. This preference was further corroborated by a preference-based evaluation, where an LLM judge chose text-based responses over 90% of the time. Interestingly, descriptive images (text embedded within the image) sometimes outperformed text in persona consistency and expected action, indicating that LLMs might prioritize embedded text for guiding actions.
This research highlights a significant gap in the current capabilities of multimodal LLMs. Despite impressive advancements in language modeling, the vision frontier requires further exploration. The richness of visual information is often overlooked by LLMs, hindering their ability to fully embody visual personas. This underscores the need for future work to improve the vision-understanding capabilities of multimodal LLMs to enable more realistic and immersive virtual interactions.
Towards General Visual-Linguistic Face Forgery Detection (V2) by Ke Sun, Shen Chen, Taiping Yao, Ziyin Zhou, Jiayi Ji, Xiaoshuai Sun, Chia-Wen Lin, Rongrong Ji https://arxiv.org/abs/2502.20698
Caption: The Face Forgery Text Generator (FFTG) pipeline analyzes manipulated regions in forged images (e.g., mouth) and uses handcrafted features to determine forgery types (texture, structure). This information guides an MLLM (e.g., GPT-4o mini) to generate accurate text descriptions for improved deepfake detection by multimodal models like CLIP.
The increasing sophistication of face manipulation techniques, or deepfakes, presents a serious threat to security and trust. While multimodal models offer a promising approach to deepfake detection by leveraging both visual and linguistic information, their effectiveness relies heavily on the quality of text annotations. Current annotation methods, whether through human labeling or direct generation from Multimodal Large Language Models (MLLMs), often suffer from hallucinations, resulting in inaccurate descriptions, especially for high-quality forgeries. This paper introduces Face Forgery Text Generator (FFTG), a novel annotation pipeline designed to address this challenge.
FFTG combines concrete visual evidence with the capabilities of MLLMs to generate more accurate and detailed text descriptions. The process begins by creating forgery masks, defined as M = |I_r − I_f| / 255, where I_r and I_f represent the real and forged images, respectively. This mask highlights the manipulated regions. FFTG then analyzes the forgery degree of different facial components and estimates forgery types (e.g., color difference, blur, structural abnormalities) using handcrafted features. This information is compiled into a raw annotation, which serves as a concrete guide for subsequent text generation by an MLLM (e.g., GPT-4o mini). To further mitigate hallucinations and ensure annotation quality, a comprehensive prompting strategy is employed, incorporating visual prompts (paired real-fake images), guide prompts (explaining the raw annotation derivation), task description prompts (guiding step-by-step analysis), and pre-defined prompts (standardizing output format).
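The mask computation itself is a normalized per-pixel absolute difference between the real and forged images. A minimal NumPy sketch of that step, under the assumption of standard 8-bit image arrays (function name is illustrative):

```python
import numpy as np

def forgery_mask(real: np.ndarray, fake: np.ndarray) -> np.ndarray:
    """Normalized forgery mask M = |I_real - I_fake| / 255.

    Inputs are uint8 images of identical shape; the output has values
    in [0, 1], with larger values marking manipulated pixels.
    """
    assert real.shape == fake.shape, "images must be aligned and same size"
    # Cast to a signed type before subtracting to avoid uint8 wraparound.
    diff = np.abs(real.astype(np.int16) - fake.astype(np.int16))
    return diff.astype(np.float32) / 255.0

# Toy 2x2 grayscale example: only the top-left pixel is manipulated.
real = np.array([[200, 50], [50, 50]], dtype=np.uint8)
fake = np.array([[100, 50], [50, 50]], dtype=np.uint8)
mask = forgery_mask(real, fake)
```

Thresholding or region-pooling this mask over facial components (mouth, eyes, nose) is what lets FFTG ground its forgery-type estimates in concrete pixel evidence rather than relying on the MLLM alone.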
The effectiveness of FFTG-generated annotations was evaluated by fine-tuning both CLIP, a multimodal model, and MLLMs. For CLIP, a three-branch training framework combining unimodal and multimodal objectives was used. The results demonstrate that FFTG annotations lead to significant improvements in model performance across various forgery detection benchmarks. Compared to human annotations and direct GPT-4o mini annotations, FFTG achieved higher accuracy in identifying manipulated regions (89.48% precision and 57.12% recall) and improved CLIP's average AUC to 89.08% and lowered EER to 17.61% across five datasets. For MLLMs, FFTG improved both interpretability and classification accuracy (95.84% on FaceForensics++ and 75.00% on Celeb-DF) and the quality of explanations.
A Non-contrast Head CT Foundation Model for Comprehensive Neuro-Trauma Triage by Youngjin Yoo, Bogdan Georgescu, Yanbo Zhang, Sasa Grbic, Han Liu, Gabriela D. Aldea, Thomas J. Re, Jyotipriya Das, Poikavila Ullaskrishnan, Eva Eibenberger, Andrei Chekkoury, Uttam K. Bodanapally, Savvas Nicolaou, Pina C. Sanelli, Thomas J. Schroeppel, Yvonne W. Lui, Eli Gibson https://arxiv.org/abs/2502.21106
Caption: This image illustrates the workflow of the DeepCNTD-Net model for head CT interpretation. Patient data is input into a deep learning model (represented by the interconnected nodes), which processes the information to identify key neuro-trauma findings listed in the table, such as hemorrhage, midline shift, and various other critical conditions. This AI-powered approach aims to improve the speed and accuracy of neuro-trauma diagnoses.
This study introduces a 3D foundation model, DeepCNTD-Net, designed to transform emergency head CT interpretation for neuro-trauma. Given the increasing demand for head CT scans and the global shortage of radiologists, this AI-powered solution offers the potential for faster and more accurate diagnoses. The model leverages large language models (LLMs) for automated labeling and integrates task-specific pretrained neural networks for hemorrhage subtype segmentation and brain anatomy parcellation. This multimodal approach enables the efficient detection of a wide range of neuro-trauma conditions, from common hemorrhages to rarer, critical findings like cerebral edema and arterial hyperdensity.
The researchers used a multi-site dataset of 29,395 non-contrast head CT scans, with 26,514 used for model development and 2,881 for independent performance evaluation. A private GPT-4o model generated multi-labels for 16 critical neuro-trauma findings based on radiology reports, demonstrating high accuracy (92-99%) compared to expert manual labels for six major findings. DeepCNTD-Net, an enhanced version of a 3D densely connected network originally designed for brain hemorrhage classification, has increased capacity to handle diverse pathologies. It incorporates Squeeze-and-Excitation blocks and fuses features from the hemorrhage subtype segmentation and brain parcellation networks for improved multi-label classification.
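The Squeeze-and-Excitation blocks mentioned above recalibrate channel responses: features are globally pooled ("squeeze"), passed through a small bottleneck MLP with a sigmoid ("excitation"), and the resulting per-channel gates rescale the feature map. A minimal 2D NumPy sketch of the standard SE mechanism (DeepCNTD-Net applies it in 3D; weights here are random stand-ins, not the paper's):

```python
import numpy as np

def squeeze_excite(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Channel recalibration via a Squeeze-and-Excitation block.

    x:  feature map of shape (C, H, W)
    w1: reduction weights of shape (C // r, C)
    w2: expansion weights of shape (C, C // r)
    """
    z = x.mean(axis=(1, 2))                # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)            # excitation: FC + ReLU bottleneck
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))    # FC + sigmoid -> gates in (0, 1)
    return x * s[:, None, None]            # rescale each channel by its gate

rng = np.random.default_rng(0)
C, r = 8, 2
x = rng.normal(size=(C, 4, 4))
w1 = rng.normal(size=(C // r, C))
w2 = rng.normal(size=(C, C // r))
y = squeeze_excite(x, w1, w2)
```

Because the gates lie strictly in (0, 1), the block can only attenuate channels, letting the network emphasize findings-relevant feature channels (e.g., hemorrhage cues) over less informative ones.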
Performance evaluations showed DeepCNTD-Net's superiority over existing methods. Compared to CT-CLIP, a general-purpose medical image foundation model, DeepCNTD-Net achieved significantly higher average AUC scores (0.875 vs. 0.822 for six major findings; 0.861 vs. 0.835 for all 16 findings). This highlights the benefits of incorporating specialized neuro-specific features. Ablation studies confirmed the contribution of each component, with hemorrhage segmentation and brain anatomy features significantly boosting performance. Evaluation on the external CQ500 dataset showed robust generalizability, with DeepCNTD-Net maintaining strong performance in hemorrhage and midline shift detection and even outperforming the existing FM-CT model in some areas.
AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs by Xuyang Wei, Chunlin Tian, Li Li https://arxiv.org/abs/2502.20035
Caption: This figure illustrates three different approaches to parameter-efficient fine-tuning of Multimodal Large Language Models (MLLMs). (a) All-in-one uses a single set of parameters (A and B) for all datasets, (b) Multiple uses separate sets of parameters (A<sub>i</sub> and B<sub>i</sub>) for each dataset, and (c) AsymLoRA utilizes shared parameters A' and task-specific parameters B<sub>i</sub> to balance commonalities and conflicts across datasets. The circles represent different datasets, while the trapezoids represent the parameter matrices.
Fine-tuning Multimodal Large Language Models (MLLMs) on diverse image-text datasets is essential for achieving versatility across various tasks. However, inherent conflicts from modality-specific objectives and latent commonalities enabling cross-task transfer pose a significant challenge. Existing methods often address these two aspects separately, limiting their effectiveness. AsymLoRA introduces a novel parameter-efficient tuning framework that effectively harmonizes these conflicting elements through an asymmetric Low-Rank Adaptation (LoRA) architecture.
Unlike traditional LoRA, which uses a single pair of low-rank matrices (A and B) uniformly across all tasks, AsymLoRA employs a shared projection matrix A to capture cross-task commonalities and task-specific projection matrices B<sub>i</sub> to preserve distinct adaptation pathways for conflicting objectives. This asymmetric design allows the model to leverage shared knowledge while mitigating interference between different tasks. Formally, given a dataset D = {D<sub>1</sub>, D<sub>2</sub>, ..., D<sub>N</sub>} where each D<sub>i</sub> corresponds to a subtask T<sub>i</sub>, AsymLoRA optimizes the shared parameters A and task-specific parameters B<sub>i</sub> to minimize the task-specific loss L<sub>i</sub> for each T<sub>i</sub>: min<sub>A, {B<sub>i</sub>}</sub> Σ<sup>N</sup><sub>i=1</sub> L<sub>i</sub>(T<sub>i</sub>; A, B<sub>i</sub>). The paper also extends AsymLoRA with a Mixture of Experts (MoE) mechanism, where multiple experts share the common A matrix while each possesses distinct B<sub>i</sub> matrices. A gating network dynamically selects the appropriate expert based on the input, further enhancing adaptability.
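The asymmetric parameterization can be made concrete in a few lines: one low-rank matrix A is shared across tasks, while each task keeps its own B<sub>i</sub>, and the adapted weight for task i is W<sub>0</sub> + B<sub>i</sub>A. A minimal NumPy sketch under that reading (dimensions and names are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, rank, n_tasks = 16, 16, 4, 3

# Frozen base weight and the shared low-rank projection A (common to all tasks).
W0 = rng.normal(size=(d, k))
A_shared = rng.normal(size=(rank, k))

# Task-specific matrices B_i, zero-initialized as in standard LoRA,
# so every task's adapted weight starts out equal to W0.
B = [np.zeros((d, rank)) for _ in range(n_tasks)]

def adapted_weight(task: int) -> np.ndarray:
    """Effective weight for task i: W0 + B_i @ A_shared."""
    return W0 + B[task] @ A_shared

# Task-specific fine-tuning only touches B_i; A_shared carries the
# cross-task commonalities and stays shared.
B[1] += 0.01 * rng.normal(size=(d, rank))
```

Compared with one monolithic LoRA pair (risking cross-task interference) or fully separate pairs per task (losing transfer), this splits the parameter budget so that commonalities live in A and conflicts are isolated in the B<sub>i</sub>.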
Evaluations on various single and multi-task domain settings across benchmarks like TextVQA, VizWiz, MME, and GQA show AsymLoRA's superior performance. In single-domain tasks, it consistently outperforms vanilla LoRA and MoE-LoRA. For instance, on TextVQA, AsymLoRA achieves 55.51%, surpassing MoE-LoRA (53.33%) and significantly outperforming LoRA (36.43%). On MME, it achieves the highest Perception (1327.93) and Cognition (287.14) scores. In multi-task settings, AsymLoRA maintains its superior performance, achieving the highest TextVQA score (54.25%) and VizWiz average score (38.10%), outperforming both MoE-LoRA and LoRA.
This newsletter has highlighted several key advancements in multimodal image and text foundation models. From novel architectures like DREAM ENGINE that enable complex text-image interleaved control for image generation to the exploration of persona modality influences in LLMs, the field is rapidly progressing. The development of specialized models like DeepCNTD-Net for neuro-trauma triage demonstrates the potential of these models in real-world medical applications. Furthermore, innovations in parameter-efficient fine-tuning techniques like AsymLoRA are paving the way for more adaptable and efficient MLLMs. These advancements collectively point towards a future where multimodal AI plays an increasingly significant role in various domains, from creative content generation to critical medical diagnostics.