This newsletter dives into the cutting edge of multimodal image and text foundation models, exploring advancements in model architecture, training methodologies, and evaluation techniques. We'll cover novel frameworks for generating diverse modality combinations, innovative evaluation tasks that address limitations of existing metrics, and strategies for enhancing model robustness against adversarial attacks. From improving cross-modal consistency to efficiently handling long contexts, these papers offer a glimpse into the future of multimodal AI.
Enhancing Multimodal Query Representation via Visual Dialogues for End-to-End Knowledge Retrieval by Yeong-Joon Ju, Ho-Joong Kim, Seong-Whan Lee https://arxiv.org/abs/2411.08334
Existing multimodal retrieval systems often rely on separate, specialized models for image understanding, like object detectors and caption generators. This approach can lead to complex implementations and potential for errors to propagate through the system. This paper proposes Ret-XKnow, an end-to-end retrieval system designed to directly integrate visual information into a text retriever, streamlining the process and enhancing its ability to handle multimodal queries. Instead of relying on disjointed models, Ret-XKnow uses a partial convolution mechanism to focus on the visual information most relevant to the given textual query. This mechanism, inspired by image inpainting, effectively compresses visual embeddings by using relevance scores as an adaptive mask, highlighting the regions of interest within the image and creating a more focused multimodal query representation.
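To make the idea concrete, here is a minimal sketch (not the paper's code) of how query-conditioned relevance scores could act as an adaptive mask that compresses visual patch embeddings; the shapes, the softmax mask, and the top-k pooling are our own simplifications.

```python
# Illustrative sketch: compress visual patch embeddings using query-conditioned
# relevance scores as a soft mask, in the spirit of partial convolution.
import torch
import torch.nn.functional as F

def compress_visual_embeddings(visual_emb: torch.Tensor,
                               query_emb: torch.Tensor,
                               keep_ratio: float = 0.25) -> torch.Tensor:
    """visual_emb: (num_patches, dim); query_emb: (num_query_tokens, dim)."""
    v = F.normalize(visual_emb, dim=-1)
    q = F.normalize(query_emb, dim=-1)
    # Relevance of each patch = max similarity to any query token.
    relevance = (v @ q.T).max(dim=-1).values      # (num_patches,)
    mask = torch.softmax(relevance, dim=0)        # adaptive soft mask
    # Keep only the most relevant patches, re-weighted by the mask.
    k = max(1, int(keep_ratio * visual_emb.size(0)))
    top_val, top_idx = mask.topk(k)
    return visual_emb[top_idx] * top_val.unsqueeze(-1)

# Example: 196 ViT patches compressed against a 12-token textual query.
compressed = compress_visual_embeddings(torch.randn(196, 128), torch.randn(12, 128))
print(compressed.shape)  # torch.Size([49, 128])
```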
To effectively train this system on the nuances of multimodal interaction, the authors also introduce a new dataset: the Visual Dialogue-to-Retrieval (ViD2R) dataset. This dataset is automatically generated from existing visual dialogue datasets, transforming the rich information contained within dialogues into a format suitable for information retrieval tasks. The construction process involves preprocessing, neural filtering using a text retriever (to ensure reliance on visual information), and a crucial response-to-passage conversion step. This novel dataset construction method addresses a critical limitation of previous approaches that often failed to leverage the rich visual information present in dialogue datasets.
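The neural-filtering step can be pictured as follows: keep only the dialogue turns whose target passage a text-only retriever cannot already find from the question alone, so the retained examples genuinely require the image. The function below is a hedged sketch under that reading; `text_retriever`, the field names, and the cutoff are hypothetical stand-ins, not the authors' pipeline.

```python
# Hedged sketch of the neural-filtering idea: discard turns that a text-only
# retriever can already solve, keeping examples that depend on visual context.
from typing import Callable, List, Dict

def filter_dialogue_turns(turns: List[Dict],
                          text_retriever: Callable[[str], List[str]],
                          top_k: int = 5) -> List[Dict]:
    """Each turn: {'question': str, 'passage': str, 'image_id': str}."""
    kept = []
    for turn in turns:
        ranked = text_retriever(turn["question"])[:top_k]
        # If text alone suffices to retrieve the passage, the turn is dropped.
        if turn["passage"] not in ranked:
            kept.append(turn)
    return kept
```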
The performance of Ret-XKnow was evaluated on four established multimodal retrieval datasets: two versions of OK-VQA (differentiated by their knowledge bases), ReMuQ, and A-OKVQA. The results are compelling. In zero-shot retrieval settings, Ret-XKnow consistently outperformed existing baselines, including text-only retrievers and other multimodal retrieval models, across all datasets. For example, on OK-VQA (using the Google Search knowledge base), Ret-XKnow achieved a Recall@100 of 98.08%, significantly surpassing the text-only ColBERTv2 (96.51%) and other multimodal baselines. Furthermore, fine-tuning experiments demonstrated that Ret-XKnow, when pre-trained with ViD2R, achieves state-of-the-art performance on downstream tasks, even approaching the performance of models that explicitly use image captions. This highlights the effectiveness of the proposed end-to-end framework in learning rich multimodal representations without relying on external captioning modules. Ablation studies further confirmed the importance of both the partial convolution mechanism and the response-to-passage conversion process in the ViD2R dataset construction, showcasing their contribution to the overall performance gains. The relevance score r<sub>Q,D</sub> between a multimodal query Q and a document D is calculated using the MaxSim operation: r<sub>Q,D</sub> = Σ<sup>l<sub>Q</sub></sup><sub>i=1</sub> max<sub>j=1...l<sub>D</sub></sub> (E<sub>Q<sub>i</sub></sub> ⋅ E<sub>D<sub>j</sub></sub>), where E<sub>Q</sub> and E<sub>D</sub> denote the L2-normalized token-level embeddings of the query and document, respectively, and l<sub>Q</sub> and l<sub>D</sub> denote the number of embeddings.
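For reference, a minimal sketch of this ColBERT-style MaxSim score, assuming token embeddings that we normalize on the fly:

```python
# MaxSim relevance: max over document tokens, summed over query tokens.
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (l_Q, dim); doc_emb: (l_D, dim) -> scalar relevance r_{Q,D}."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    sim = q @ d.T                        # (l_Q, l_D) token-level similarities
    return sim.max(dim=-1).values.sum()  # MaxSim aggregation

score = maxsim_score(torch.randn(16, 128), torch.randn(180, 128))
```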
Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models by Chutian Meng, Fan Ma, Jiaxu Miao, Chi Zhang, Yi Yang, Yueting Zhuang https://arxiv.org/abs/2411.09449
Caption: This diagram illustrates the ImageRepainter framework for image regeneration, a novel approach to text-to-image model evaluation. The framework uses an Image Understanding Tree (IUT) derived from the reference image by a Multimodal Large Language Model (MLLM) to initialize and iteratively refine prompts for a text-to-image model, aiming to regenerate the reference image. The iterative process involves prompt generation, image generation, draft selection, and feedback generation, ultimately producing a best image for evaluation against the reference.
Traditional evaluation metrics for text-to-image (T2I) models rely heavily on aligning generated images with the input text prompts. However, the inherent asymmetry of information between text and images makes this a challenging and often incomplete assessment. This paper proposes a novel evaluation framework called Image Regeneration. Instead of comparing across modalities (text to image), Image Regeneration tasks the T2I model with recreating a reference image, enabling a more direct and intuitive within-modality comparison. To bridge the gap between the image input and the text-driven nature of T2I models, the authors leverage the capabilities of Multimodal Large Language Models (MLLMs) like GPT-4V. This approach results in a more human-aligned assessment of T2I model capabilities.
The core of this evaluation framework is ImageRepainter, which operates in two stages: image understanding and iterative generation. In the image understanding stage, the MLLM analyzes the reference image and constructs an Image Understanding Tree (IUT). This hierarchical structure organizes the image information, capturing features at different levels of detail and preventing redundancy. The IUT then informs the creation of initial text prompts for the T2I model. The iterative generation stage refines these prompts and the resulting images through a four-part cycle: prompt generation/revision, image generation, image selection, and feedback generation. This iterative process utilizes CLIP, DINOv2, and GPT-4V to assess similarity between the generated images and the reference image, guiding prompt revisions towards a more accurate and refined final output.
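The iterative stage reduces to a generate-score-revise loop. The sketch below is schematic only: the MLLM reviser, T2I model, and similarity scorer are passed in as callables, and none of the names reflect the authors' actual API.

```python
# Schematic generate-score-revise loop for the iterative-generation stage.
from typing import Callable, Any

def regenerate(reference: Any,
               init_prompt: str,
               t2i: Callable[[str], Any],
               score: Callable[[Any, Any], float],      # e.g. CLIP/DINOv2 similarity
               revise: Callable[[str, Any, Any], str],  # MLLM feedback -> new prompt
               rounds: int = 4):
    best_img, best_score, prompt = None, float("-inf"), init_prompt
    for _ in range(rounds):
        candidate = t2i(prompt)                        # image generation
        s = score(candidate, reference)                # draft selection signal
        if s > best_score:
            best_img, best_score = candidate, s
        prompt = revise(prompt, candidate, reference)  # feedback -> prompt revision
    return best_img, best_score
```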
To facilitate this new evaluation paradigm, the authors introduce two new benchmark datasets: one focusing on style diversity (200 samples across 10 styles) and another on content diversity (100 samples across 4 content types). Experiments using leading T2I models, including various Stable Diffusion versions and community favorites like JuggernautXL, demonstrate that ImageRepainter aligns better with human judgment than traditional T2I metrics. For example, while SDXL1.0 performed well on existing benchmarks like T2I-CompBench for content consistency, human evaluators often preferred the output of other models, a preference captured by the Image Regeneration scores. Quantitatively, JuggernautXLv9 emerged as the top performer across several metrics using the Image Regeneration method.
Ablation studies further validate the effectiveness of the ImageRepainter framework. Directly using GPT-4V for text-image matching proved less effective than Image Regeneration, highlighting the value of within-modality comparison. The IUT showed particular benefits for higher-quality models, while the iterative refinement process was crucial for boosting the performance of weaker models. Overall, Image Regeneration offers a more human-centric and nuanced approach to T2I evaluation, moving beyond the limitations of text-prompt based metrics.
Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey by Xuannan Liu, Xing Cui, Peipei Li, Zekun Li, Huaibo Huang, Shuhan Xia, Miaoxuan Zhang, Yueying Zou, Ran He https://arxiv.org/abs/2411.09259
Caption: This diagram illustrates a unified framework for jailbreak attacks and defenses against multimodal models, categorized across input, encoder, generator, and output levels. Each level depicts both attack (red) and defense (blue) strategies for various modalities (text, image, audio, video), showcasing how malicious actors attempt to manipulate model outputs and how defenders counteract these threats. The framework encompasses different model architectures, including Any-to-Text, Any-to-Vision, and Any-to-Any, emphasizing the cross-modal nature of vulnerabilities.
Multimodal foundation models are powerful tools, but they are also susceptible to jailbreak attacks, which bypass built-in safety mechanisms to trigger the generation of harmful content. This survey provides a comprehensive overview of the evolving landscape of jailbreak attacks and defenses in multimodal models, covering a broader range of modalities and system architectures than previous surveys focused on single modalities like text or image. It provides a unified framework that encompasses Any-to-Text, Any-to-Vision, and Any-to-Any models, highlighting the interconnectedness of vulnerabilities across different system types.
The survey categorizes attacks and defenses into four key levels: input, encoder, generator, and output. This layered approach provides a comprehensive view of the lifecycle of a jailbreak attack and the corresponding defense strategies. At each level, attackers employ specific tactics, while defenders deploy countermeasures. For example, at the input level, attackers might modify inputs to trigger unintended behaviors, while defenders might embed protective cues within the input. At the encoder level, attackers could inject malicious information into the encoding process, while defenders would develop methods to prevent such encoding. Similar attack-defense dynamics play out at the generator and output levels.
Attack methods are further categorized into black-box, gray-box, and white-box attacks, reflecting the attacker's level of access to the model's internal workings. Black-box attacks rely on manipulating input-output behavior without internal knowledge, using techniques like prompt engineering, image manipulation, and role-playing scenarios. Gray-box and white-box attacks exploit deeper access to the model, leveraging gradient information or intermediate representations for more targeted adversarial modifications. At the encoder level, attackers often optimize an adversarial input to maximize its cosine similarity to a malicious target in the latent space: X<sub>adv</sub> = arg max Cos(E<sub>M</sub>(X<sub>adv</sub>), E<sub>M</sub>(X<sub>mal</sub>)). At the generator level, attacks typically maximize the likelihood of a target harmful output V by minimizing its negative log-likelihood: X<sub>adv</sub> = arg min -Σ<sub>i</sub> log p<sub>θ</sub>(V<sub>i</sub>|X<sub>adv</sub>).
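As a concrete picture of the encoder-level objective, here is a generic gradient-ascent sketch that pushes an adversarial image's embedding toward a malicious target embedding. It illustrates the objective above, not any specific published attack; `encoder` is assumed to be a differentiable PyTorch module and the step sizes are arbitrary.

```python
# Generic embedding-space attack: gradient ascent on cosine similarity to a
# malicious target embedding, constrained to an L-infinity ball around x_init.
import torch
import torch.nn.functional as F

def embedding_space_attack(encoder: torch.nn.Module,
                           x_init: torch.Tensor,
                           target_emb: torch.Tensor,
                           steps: int = 100,
                           step_size: float = 1e-2,
                           eps: float = 8 / 255) -> torch.Tensor:
    x_adv = x_init.clone().detach().requires_grad_(True)
    for _ in range(steps):
        sim = F.cosine_similarity(encoder(x_adv), target_emb, dim=-1).mean()
        grad, = torch.autograd.grad(sim, x_adv)
        with torch.no_grad():
            x_adv += step_size * grad.sign()          # ascend on similarity
            x_adv.clamp_(x_init - eps, x_init + eps)  # stay near the clean input
            x_adv.clamp_(0.0, 1.0)                    # keep valid pixel range
    return x_adv.detach()
```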
Defenses are broadly categorized as discriminative or transformative. Discriminative defenses focus on detecting malicious inputs through methods like statistical analysis, embedding comparisons, and output discrepancy detection. Transformative defenses aim to modify the generation process itself, ensuring benign outputs even with adversarial inputs. These include prefix-based defenses, refiners that transform harmful content into safe content, tuning models with safe data, guidance through embedding specific information in latent spaces, and pruning parameters associated with unsafe concepts.
The survey also discusses evaluation datasets and metrics, distinguishing between simulated-malicious and real-malicious datasets. It emphasizes the importance of choosing appropriate metrics like Attack Success Rate, Prompt Perplexity, Fréchet Inception Distance, and CLIP Score for evaluating the effectiveness of both attacks and defenses. This comprehensive overview highlights open challenges and future research directions in this critical area, including addressing vulnerabilities in video and audio modalities, understanding attacks on Any-to-Any models, and developing more robust, transparent, and personalized defense mechanisms.
Spider: Any-to-Many Multimodal LLM by Jinxiang Lai, Jie Zhang, Jun Liu, Jian Li, Xiaocheng Lu, Song Guo https://arxiv.org/abs/2411.09439
Caption: This diagram illustrates the architecture of Spider, a novel framework for Any-to-Many Modalities Generation (AMMG). It features a TM-Fusion module integrating text and modality prompts, a Unified Decoder Projector for efficient control of multiple decoders, and a Modality Router guided by learned routing weights. These components enable Spider to generate arbitrary combinations of modalities from a single input, moving beyond the limitations of existing Any-to-Any models.
While existing Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in integrating text with other modalities, they typically operate within an "Any-to-Any" paradigm, generating pairwise outputs like "Text + Image" or "Text + Audio." This paper introduces Spider, a novel framework that breaks this limitation by enabling Any-to-Many Modalities Generation (AMMG). Spider allows for the generation of arbitrary combinations of modalities, such as "Text + Image + Audio + Video," within a single response, offering a significantly more cohesive and comprehensive user experience.
Spider achieves this through three key innovations. First, it utilizes a Base Model, handling basic X-to-X modality processing. This model employs a unified encoder like ImageBind to simplify the handling of diverse modalities. Second, a novel Efficient Decoders-Controller allows the LLM to effectively manage and control multiple task decoders for generating the many-modal content. This controller incorporates a Unified Decoder Projector and TM-Fusion module to align the LLM with different decoders and integrate text and modality prompts (T-Prompts and M-Prompts) for precise control. The training process utilizes M-Alignment Loss and M-Reconstruction Loss to optimize the Decoders-Controller, ensuring semantic similarity and preventing information loss during the generation process. Finally, an Any-to-Many Instruction Template, employing Modality-wise Grouping, allows the LLM to interpret complex multimodal instructions and generate specific signal prompts for each desired modality, facilitating accurate and coordinated AMMG.
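To illustrate what modality-wise grouping might look like in practice, the sketch below formats a response carrying one signal-prompt block per requested modality, which a decoder controller could then route. The tag names and structure are our own illustration, not Spider's actual token vocabulary.

```python
# Hypothetical Any-to-Many response template with modality-wise grouping:
# one signal-prompt block per requested output modality.
from typing import Dict

def build_many_modal_response(text: str, modality_prompts: Dict[str, str]) -> str:
    """modality_prompts maps a modality name (e.g. 'IMAGE', 'AUDIO', 'VIDEO')
    to the signal prompt describing what that decoder should generate."""
    blocks = [text]
    for modality, prompt in modality_prompts.items():
        blocks.append(f"<{modality}>{prompt}</{modality}>")
    return "\n".join(blocks)

print(build_many_modal_response(
    "Here is a beach scene with matching sound and motion.",
    {"IMAGE": "a sunny beach at noon",
     "AUDIO": "waves breaking on sand",
     "VIDEO": "a slow pan across the shoreline"},
))
```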
To train Spider effectively, the authors constructed a new Text-formatted Many-Modal (TMM) dataset, comprising various input-output combinations to enable the model to learn the crucial X-to-Xs capability. The training process involves three stages: X-to-X Pretraining, X-to-TXs Finetuning, and X-to-TXs Instruction Finetuning. This staged approach optimizes the model's learning and fine-tuning for AMMG.
Evaluation on various benchmark tasks shows Spider's superiority over existing Any-to-Any models while achieving competitive results with state-of-the-art methods. On the TMM test dataset, Spider achieved a B@4 score of 74.8 for Text-formatted Xs generation. Significantly, the trained Spider enabled the generation of a pseudo X-to-Xs dataset, the first of its kind, providing valuable data for future research on AMMG. This work pushes the boundaries of multimodal generation and lays a strong foundation for future research with its novel paradigm and rich dataset.
Cross-Modal Consistency in Multimodal Large Language Models by Xiang Zhang, Senyu Li, Ning Shi, Bradley Hauer, Zijun Wu, Grzegorz Kondrak, Muhammad Abdul-Mageed, Laks V.S. Lakshmanan https://arxiv.org/abs/2411.09273
Caption: This image contrasts "Naive Prompting" with "Vision Depicting Prompting (VDP)" for a multimodal model processing a math problem presented as an image. Naive prompting yields an incorrect answer, while VDP, which extracts the textual representation of the problem before solving, produces the correct solution, highlighting the model's improved cross-modal consistency. A large red X indicates the failure of the naive approach, while a green checkmark signifies the success of VDP.
While Multimodal Large Language Models (MLLMs) like GPT-4V excel at processing diverse data types, existing evaluations often overlook a crucial aspect: cross-modal consistency. This refers to a model's ability to achieve the same level of accuracy on identical tasks presented in different modalities. Formally, consistency between modalities a and b is defined as: M(d<sub>a</sub>, q) = M(K<sup>q</sup><sub>a,b</sub>(d<sub>a</sub>), q), where M is the model, d<sub>a</sub> is data in modality a, q is the task query, and K<sup>q</sup><sub>a,b</sub> converts data from modality a to b while preserving information. This consistency is fundamental for building reliable and interpretable multimodal systems.
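Operationally, this definition suggests a simple metric: the fraction of instances on which the model's answer is unchanged after converting the input from modality a to modality b. The sketch below instantiates that idea; all callables are placeholders rather than the authors' evaluation harness.

```python
# Consistency rate: how often M(d_a, q) == M(K(d_a), q) across a dataset.
from typing import Callable, Iterable, Tuple, Any

def consistency_rate(model: Callable[[Any, str], str],
                     convert: Callable[[Any], Any],   # plays the role of K^q_{a,b}
                     instances: Iterable[Tuple[Any, str]]) -> float:
    pairs = [(model(d_a, q), model(convert(d_a), q)) for d_a, q in instances]
    return sum(a == b for a, b in pairs) / len(pairs)
```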
To assess this, the researchers created a parallel vision-language dataset across seven tasks, including math, logical reasoning, table understanding, and reading comprehension. Each task had equivalent instances in both image and text formats. Evaluating GPT-4V on these datasets revealed significant inconsistencies. While excelling in language-based tasks in text format, GPT-4V's performance dropped considerably (up to 90%) when presented with the same tasks as images. This disparity was observed even in reasoning tasks, where both modalities struggled, but image-based performance consistently lagged. Interestingly, the highest consistency was observed in math reasoning, despite lower overall accuracy, while the lowest consistency was in logical reasoning, where individual modality accuracies were higher.
This highlights a dominant advantage of the language modality in GPT-4V, even with equivalent visual information. To address this, the researchers propose Vision-Depicting-Prompting (VDP), which involves prompting the model to first generate a textual description of the visual task before answering. VDP significantly improved vision-based accuracy, approaching text-based performance in understanding-focused tasks. This suggests that explicitly leveraging the model's language processing strengths can mitigate cross-modal inconsistencies. The findings underscore the need for more integrated model designs and further research into cross-modal interactions.
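A rough sketch of the two-stage VDP flow follows: first prompt the model to transcribe the image-rendered task into text, then answer from that textual depiction. The prompt wording and the `mllm` interface here are assumptions for illustration.

```python
# Two-stage Vision-Depicting-Prompting flow (illustrative prompts only).
from typing import Callable, Any

def vdp_answer(mllm: Callable[..., str], image: Any, question: str) -> str:
    depiction = mllm(image=image,
                     prompt="Describe exactly what this image shows, including "
                            "any text, symbols, or tables, without solving anything.")
    return mllm(prompt=f"Task description: {depiction}\n\nNow answer: {question}")
```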
Can MLLMs Guide Weakly-Supervised Temporal Action Localization Tasks? by Quan Zhang, Yuxin Qi https://arxiv.org/abs/2411.08466
Caption: This diagram illustrates the MLLM4WTAL architecture for Weakly-Supervised Temporal Action Localization. It leverages Multimodal Large Language Models (MLLMs) through Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR) modules to improve action localization accuracy. The system uses RGB and flow video input, processes it through a feature extractor and attention blocks, then utilizes the MLLM to generate semantic prompts and guide the localization head during training.
Weakly-supervised Temporal Action Localization (WTAL) offers a cost-effective way to locate actions in videos using only video-level labels. However, existing WTAL models often suffer from incomplete or over-complete localization. This paper proposes MLLM4WTAL, a novel paradigm that leverages the semantic understanding of Multimodal Large Language Models (MLLMs) to enhance traditional WTAL methods.
MLLM4WTAL incorporates two key modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR). KSM utilizes MLLMs to generate key semantic descriptions of actions, matching them with video segments to identify critical temporal intervals. It employs a specific prompt template to elicit concise action descriptions and computes video-level scores via top-k multi-instance learning, averaging the k highest segment activations per class: S<sub>j</sub> = (1/k) Σ<sub>i∈Ω<sub>j</sub></sub> M<sub>i</sub>(j) and Ŝ<sub>j</sub> = (1/k) Σ<sub>i∈Ω<sub>j</sub></sub> Ṁ<sub>i</sub>(j), where Ω<sub>j</sub> denotes the top-k temporal indices for class j. CSR reconstructs masked key action words in complete video descriptions provided by the MLLM, capturing the full temporal extent of actions. A dual prior interactive distillation strategy combines the strengths of KSM and CSR, with the two modules iteratively refining each other's predictions to mitigate their individual weaknesses.
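The top-k aggregation itself is straightforward; here is a minimal sketch under assumed tensor shapes (T segments by C classes) with a binary cross-entropy loss standing in for the paper's video-level supervision.

```python
# Top-k multi-instance aggregation: average the k most confident segments per
# class to obtain video-level scores, then supervise with video-level labels.
import torch
import torch.nn.functional as F

def video_level_scores(cas: torch.Tensor, k: int) -> torch.Tensor:
    """cas: (T, C) class activation scores over T segments -> (C,) video scores."""
    topk, _ = cas.topk(k, dim=0)   # k most confident segments per class
    return topk.mean(dim=0)        # S_j = (1/k) * sum over the top-k segments

cas = torch.randn(200, 20)         # 200 segments, 20 action classes
s = video_level_scores(cas, k=8)
label = torch.zeros(20).scatter_(0, torch.tensor([3]), 1.0)  # video-level label
loss = F.binary_cross_entropy_with_logits(s, label)
```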
Experiments on THUMOS14 and ActivityNet-v1.2 demonstrate significant performance improvements. On THUMOS14, MLLM4WTAL achieves a 1.8% improvement in average mAP over the DELU baseline and even surpasses some fully-supervised approaches. On ActivityNet-v1.2, it achieves a 0.5% improvement compared to state-of-the-art methods. Ablation studies confirm the effectiveness of both KSM and CSR modules and the interactive distillation strategy. Importantly, MLLM4WTAL's generalizability is demonstrated by its successful integration with existing WTAL methods like UM and CO2-Net, consistently improving their performance. While MLLMs are computationally expensive, MLLM4WTAL uses them only during training, ensuring efficient inference. This work opens new avenues for leveraging large language models in video understanding tasks.
Multimodal Instruction Tuning with Hybrid State Space Models by Jianing Zhou, Han Li, Shuai Zhang, Ning Xie, Ruijie Wang, Xiaohan Nie, Sheng Liu, Lingyun Wang https://arxiv.org/abs/2411.08840
Handling lengthy input sequences is a major challenge for Multimodal Large Language Models (MLLMs), particularly when dealing with high-resolution images or high-frame-rate videos. This paper introduces MMJAMBA, a novel approach using a hybrid transformer-MAMBA model designed for efficient multimodal instruction tuning with long contexts.
MMJAMBA employs a standard MLLM architecture, but with crucial modifications for efficiency. It utilizes an AnyRes vision encoder for images, dynamically adapting to different aspect ratios while preserving layout information. For videos, a consistent frame sampling approach is used. An MLP adapter aligns visual and textual features, and the core LLM leverages Jamba, a hybrid decoder architecture combining Transformer and Mamba layers for efficient processing of long sequences. A key innovation is the "train-on-short-infer-on-long" strategy, allowing the model to train on lower-resolution data while maintaining the ability to infer on high-resolution inputs.
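For intuition on the AnyRes-style preprocessing, the sketch below splits a high-resolution image into base-resolution tiles (preserving layout) plus a downscaled global view, so each piece can pass through the same vision encoder. The grid selection and tile size are simplified assumptions, not MMJAMBA's exact recipe.

```python
# Simplified AnyRes-style tiling: global thumbnail plus layout-preserving tiles.
from PIL import Image

def anyres_tiles(img: Image.Image, tile: int = 336):
    w, h = img.size
    cols, rows = max(1, round(w / tile)), max(1, round(h / tile))
    resized = img.resize((cols * tile, rows * tile))
    tiles = [resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
             for r in range(rows) for c in range(cols)]
    return [img.resize((tile, tile))] + tiles  # global view first, then patches
```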
Evaluation on 18 benchmark datasets demonstrates MMJAMBA's state-of-the-art performance, surpassing open-source models and occasionally matching or exceeding proprietary models like GPT-4V. Remarkably, MMJAMBA achieves these results with significantly improved efficiency, processing high-resolution images up to four times faster than comparable models. The "train-on-short-infer-on-long" paradigm is crucial for this efficiency gain. While other models suffer performance degradation with increasing inference resolution, MMJAMBA maintains or even improves its performance, effectively leveraging the fine-grained information in high-resolution inputs during inference. This work highlights the potential of hybrid state-space models like Jamba for tackling the computational challenges of long contexts in multimodal instruction tuning.
This newsletter highlights several key trends in multimodal research. We see a push towards more holistic and integrated model architectures, exemplified by Ret-XKnow and Spider. The introduction of Image Regeneration signifies a shift towards more human-centric and nuanced evaluation methods for text-to-image models. The survey on jailbreak attacks underscores the growing importance of robustness and security in multimodal systems. Finally, MMJAMBA and MLLM4WTAL demonstrate the potential of hybrid architectures and strategic use of MLLMs for addressing computational challenges and enhancing performance in complex tasks. These advancements collectively paint a picture of a rapidly evolving field, with a clear focus on developing more powerful, efficient, and reliable multimodal AI systems.