This newsletter dives into the latest advancements in multimodal image and text foundation models. We'll explore new architectures, benchmarks, and security concerns surrounding these increasingly sophisticated models, offering insights into how they are pushing the boundaries of visual and textual understanding. From deciphering visual puns to securing models against backdoor attacks, this collection of research highlights the exciting progress and critical challenges in this rapidly evolving field.
Backdooring Vision-Language Models with Out-Of-Distribution Data by Weimin Lyu, Jiachen Yao, Saumya Gupta, Lu Pang, Tao Sun, Lingjie Yi, Lijie Hu, Haibin Ling, Chao Chen https://arxiv.org/abs/2410.01264
Caption: This diagram illustrates the VLOOD attack framework, which backdoors Vision-Language Models (VLMs) using out-of-distribution data. It shows how clean and poisoned inputs are processed by both benign and target BLIP-2 models, highlighting the key components: Clean Knowledge Preservation (CKP), Conceptual Consistency Preservation (CCP), and dynamic weight adjustment (λ) to balance clean accuracy and attack success. The framework aims to maintain semantic coherence even with the injected backdoor, making the attack stealthy while minimizing performance degradation on clean data.
Vision-Language Models (VLMs), combining visual models with the generative power of LLMs, have revolutionized image-to-text tasks. However, a critical vulnerability has emerged: backdoor attacks. This paper introduces VLOOD (Backdooring VLMs with Out-of-Distribution Data), a novel attack demonstrating how backdoors can be injected using solely Out-Of-Distribution (OOD) data, a more realistic threat scenario than previous research that assumed access to original training data. VLOOD's innovation lies in maintaining conceptual consistency – ensuring the generated text aligns with the image content even under attack, thus keeping the backdoor stealthy.
VLOOD achieves this through three key components. Clean Knowledge Preservation (CKP) uses knowledge distillation to minimize representation shifts between the benign and backdoored models on clean inputs via a KL divergence loss: L<sub>CKP</sub> = KL(F(I,T) || F'(I,T)), where F is the benign model and F' the backdoored one. Conceptual Consistency Preservation (CCP) constrains predicted token embeddings using Manhattan distance, ensuring semantic coherence on poisoned samples: L<sub>CCP</sub> = 1/(1 + exp(-S)), where S is the Manhattan-distance-based score. Finally, a dynamically adjusted weight λ balances the influence of clean and poisoned data during training, optimizing for both clean accuracy and attack success rate: λ ← λ + (Impact<sub>clean</sub> − Impact<sub>poisoned</sub>). The overall loss integrates these components: L = (1 − λ) · (L<sub>LM(clean)</sub> + L<sub>CKP</sub>) + λ · (L<sub>LM(poisoned)</sub> + L<sub>CCP</sub>).
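To make the interplay of these terms concrete, here is a minimal PyTorch-style sketch of how such an objective could be assembled. The tensor shapes, function signature, and the exact form of the CCP score are assumptions of this sketch, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vlood_style_loss(benign_logits, bd_logits_clean, bd_logits_poisoned,
                     clean_labels, poisoned_labels,
                     pred_embeds, ref_embeds, lam):
    """Sketch of a VLOOD-style training objective (shapes/names illustrative).

    benign_logits, bd_logits_clean : [B, T, V] logits of the benign / backdoored
                                     model on the same clean inputs
    bd_logits_poisoned             : [B, T, V] backdoored-model logits on poisoned inputs
    pred_embeds, ref_embeds        : [B, T, D] predicted vs. reference token embeddings
    lam                            : dynamically adjusted weight (lambda)
    """
    # CKP: KL divergence between benign and backdoored models on clean inputs.
    l_ckp = F.kl_div(F.log_softmax(bd_logits_clean, dim=-1),
                     F.softmax(benign_logits, dim=-1),
                     reduction="batchmean")

    # Standard language-modeling losses on clean and poisoned captions.
    l_lm_clean = F.cross_entropy(bd_logits_clean.flatten(0, 1), clean_labels.flatten())
    l_lm_poisoned = F.cross_entropy(bd_logits_poisoned.flatten(0, 1), poisoned_labels.flatten())

    # CCP: squash a Manhattan-distance score S through a sigmoid, L_CCP = 1/(1 + exp(-S)).
    s = (pred_embeds - ref_embeds).abs().sum(dim=-1).mean()
    l_ccp = torch.sigmoid(s)

    # L = (1 - lambda)(L_LM(clean) + L_CKP) + lambda(L_LM(poisoned) + L_CCP)
    return (1 - lam) * (l_lm_clean + l_ckp) + lam * (l_lm_poisoned + l_ccp)
```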
Evaluations on image captioning (Flickr8k, Flickr30k, COCO) and VQA (OK-VQA, VQAv2) confirm VLOOD's effectiveness, achieving high Attack Success Rates (ASRs) while maintaining text quality metrics comparable to benign models. For example, on Flickr8k, VLOOD reached an ASR of 0.999 with a BLEU@4 score of 36.1 on poisoned inputs. Similarly impressive results were observed on VQA tasks. Furthermore, VLOOD proved resistant to existing defenses like Spectral Signatures and Beatrix, highlighting the severity of this vulnerability. Its effectiveness across different VLM architectures, including MiniGPT-4 and InstructBLIP, underscores its broad applicability. This research emphasizes the urgent need for robust defense mechanisms against this emerging threat.
Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! by Jiwan Chung, Seungwon Lim, Jaehyun Jeon, Seungbeen Lee, Youngjae Yu https://arxiv.org/abs/2410.01023
Caption: This image illustrates the three tasks of the UNPIE benchmark: pun grounding (identifying "cans"), pun disambiguation (choosing the correct translation based on the image of cans vs. the "We Can Do It!" poster), and pun reconstruction (reconstructing the original pun from the German translation and the disambiguating image). The image showcases how visual context helps resolve the ambiguity of the pun "Success comes in cans, failure comes in can'ts" by associating "can" with a physical tin can rather than the word "can't."
The UNPIE (Understanding Pun with Image Explanations) benchmark introduces a novel approach to evaluating multimodal literacy in machines. Focusing on the inherent ambiguity of puns, the benchmark comprises 1,000 puns paired with images clarifying their dual meanings, and translations into German, French, and Korean. This contrasts with datasets like Multi30k, which lack ambiguity and primarily serve machine translation. UNPIE assesses the active integration of visual and textual information, a key aspect of multimodal literacy.
Three tasks are introduced: pun grounding (identifying the pun phrase), pun disambiguation (choosing the correct translation based on an image), and pun reconstruction (recreating the original pun from a translated version and an image). These tasks increase in complexity, offering a nuanced evaluation of model capabilities. VLMs and Socratic Models (SMs, which pipeline visual information through a captioning model before feeding it to a language model) were evaluated.
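As a rough illustration of the Socratic-Model setup, the sketch below routes an image through a captioner and hands only the resulting text to a language model for disambiguation; the callables and prompt wording are placeholders of this sketch, not the benchmark's actual evaluation harness.

```python
from typing import Callable, Sequence

def disambiguate_pun(
    caption_image: Callable[[str], str],   # hypothetical captioning-model wrapper
    query_llm: Callable[[str], str],       # hypothetical text-only LLM wrapper
    image_path: str,
    pun: str,
    candidates: Sequence[str],
) -> str:
    """Socratic-Model-style pun disambiguation: the LLM never sees pixels,
    only a textual caption of the disambiguating image."""
    caption = caption_image(image_path)
    prompt = (
        f"Image description: {caption}\n"
        f"Pun: {pun}\n"
        "Pick the translation that matches the sense of the pun suggested by the image:\n"
        + "\n".join(f"({i}) {c}" for i, c in enumerate(candidates))
    )
    return query_llm(prompt)
```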
Results show that both VLMs and SMs benefit from visual context. In pun grounding, visual input improved accuracy across all models. For disambiguation, VLMs like LLaVA and Qwen-VL, along with GPT-4 using image captions, effectively leveraged visual cues. Interestingly, fine-tuning LLaVA on Multi30k reduced performance in pun reconstruction, suggesting standard MMT datasets don't capture the nuances of visual dependency for ambiguity resolution. In pun reconstruction, visual context consistently improved accuracy, with GPT-4 excelling. For example, in German-to-English reconstruction, GPT-4's accuracy increased from 44.2% to 54.4% with visual context for homographic puns.
While demonstrating improvement with visual context, the study highlights current model limitations. The human-machine performance gap remains significant, especially in pun reconstruction. The dataset's English-centric nature also limits generalizability. Future work will expand UNPIE to other languages and address potential cultural biases in humor. Despite limitations, UNPIE provides a valuable benchmark for advancing multimodal literacy in machines.
Uncertainty-Guided Enhancement on Driving Perception System via Foundation Models by Yunhao Yang, Yuxin Hu, Mao Ye, Zaiwei Zhang, Zhichao Lu, Yi Xu, Ufuk Topcu, Ben Snyder https://arxiv.org/abs/2410.01144
Caption: This diagram illustrates a novel method for frugally enhancing driving perception models with foundation models. The system checks the probabilistic guarantee (G<sub>p</sub>) of the perception model's predictions and queries the foundation model only if G<sub>p</sub> falls below a threshold. A temporal inference mechanism further refines predictions and reduces reliance on the foundation model.
Multimodal foundation models hold promise for improving driving perception, but their high computational costs hinder real-time applications. This paper introduces a method that leverages foundation models to refine predictions from existing perception models (e.g., UniAD) while minimizing their use. The method focuses on enhancing object classification accuracy while maintaining resource efficiency.
The key lies in quantifying prediction uncertainty. Conformal prediction calibrates confidence scores into theoretical lower bounds (G) on the probability of a correct prediction. The foundation model is queried only when G falls below a threshold T. A temporal inference mechanism integrates past predictions, leading to tighter bounds and improved accuracy. It computes a new probabilistic guarantee G<sub>p</sub> = max<sub>j=i−k,…,i−1</sub> { c<sub>j</sub> × Π<sub>l=j</sub><sup>i−1</sup> t<sub>l+1</sub> }, where c<sub>j</sub> and t<sub>j</sub> denote the category and tracking confidence for frame j, respectively.
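The gating logic can be summarized in a few lines. The sketch below assumes calibrated per-frame category and tracking bounds are already available; the function and parameter names are illustrative rather than taken from the paper.

```python
from typing import Callable, Sequence, Tuple

def temporal_guarantee(history: Sequence[Tuple[float, float]]) -> float:
    """Propagate past guarantees to the current frame i.

    history holds (c_j, t_j) for frames j = i-k, ..., i (most recent last), where
    c_j is the calibrated category bound and t_j the tracking confidence. Each past
    bound c_j is discounted by the tracking confidences t_{j+1}, ..., t_i, and G_p
    is the maximum of these propagated bounds."""
    assert len(history) >= 2, "need at least one past frame plus the current one"
    candidates = []
    for j in range(len(history) - 1):          # past frames only
        discount = 1.0
        for l in range(j, len(history) - 1):   # multiply t_{l+1} up to t_i
            discount *= history[l + 1][1]
        candidates.append(history[j][0] * discount)
    return max(candidates)

def refine_prediction(history: Sequence[Tuple[float, float]],
                      threshold: float,
                      query_foundation_model: Callable[[], str],
                      perception_label: str) -> str:
    """Query the expensive foundation model only when G_p falls below the threshold T."""
    return query_foundation_model() if temporal_guarantee(history) < threshold else perception_label
```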
Using nuScenes with UniAD and GPT-4o-mini, the method significantly improved accuracy, raising category prediction accuracy from 90% to 93% and attribute accuracy from 75% to 85%, while reducing foundation model queries by 50%. The temporal inference mechanism boosted accuracy by a further 5% under the same query budget.
The method's performance across varying conditions (sunny, rainy, night) showed the foundation model's generalization capabilities, particularly beneficial at night where initial perception model performance was weakest. This adaptability allows for dynamic threshold adjustment based on environment. This work presents a practical approach to integrating powerful but expensive foundation models into driving perception systems, promoting safer and more efficient autonomous driving.
ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer by Zhen Han, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang, Chaojie Mao, Chenwei Xie, Yu Liu, Jingren Zhou https://arxiv.org/abs/2410.00086
ACE (All-round Creator and Editor) offers a unified solution for diverse image generation and editing tasks. Unlike existing foundational models primarily focused on text-to-image generation, ACE handles a wider range of tasks, from controllable generation and semantic editing to repainting and layer manipulation, all within a single model. This is achieved using the Condition Unit (CU), a standardized input format: CU = {T, V}, V = {[I<sup>1</sup>; M<sup>1</sup>], [I<sup>2</sup>; M<sup>2</sup>], …, [I<sup>N</sup>; M<sup>N</sup>]}, where T is the textual instruction and the I<sup>n</sup> and M<sup>n</sup> are images and their masks. For multi-turn editing, ACE uses Long-context Condition Units (LCUs) that incorporate historical information.
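To make the CU/LCU abstraction concrete, here is a small illustrative data structure; the field names are assumptions of this sketch and do not reflect ACE's actual interface.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

ImageMaskPair = Tuple[Any, Any]  # (image, mask) placeholders for I^n and M^n

@dataclass
class ConditionUnit:
    """CU = {T, V}: a textual instruction T plus a visual sequence V of [I^n; M^n] pairs."""
    text: str
    visuals: List[ImageMaskPair] = field(default_factory=list)

@dataclass
class LongContextConditionUnit:
    """LCU: the current CU together with the CUs of earlier turns, so that a
    multi-turn edit can condition on its own editing history."""
    current: ConditionUnit
    history: List[ConditionUnit] = field(default_factory=list)
```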
Training this versatile model involved a novel data collection pipeline, combining synthesis-based methods with a clustering approach on large-scale datasets, yielding 0.7 billion image pairs. A fine-tuned MLLM generated accurate textual instructions.
A new benchmark dataset of 12,000 manually annotated pairs across 31 tasks was created to evaluate ACE. On MagicBrush, ACE achieved state-of-the-art results for single-turn editing (CLIP-I: 0.9453, DINO: 0.9215) and showed competitive performance in multi-turn editing, enhanced by LCUs. A user study on the ACE benchmark highlighted its superiority in prompt following and image quality for various editing tasks. Quantitative results on facial editing and local text rendering also demonstrated significant improvements over existing methods.
ACE's practical application was showcased with a multimodal chatbot for interactive image creation and editing, leveraging LCUs to maintain context across conversations. This streamlined approach improves efficiency compared to traditional visual agent pipelines. While impressive, limitations remain: aesthetic quality in text-to-image generation lags behind specialized models, complex instruction interpretation is still challenging, and computational constraints limit the length of contextual information. Future work will focus on scaling, incorporating LLMs for better instruction understanding, and exploring effective long-sequence modeling for multimodal data.
LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks by Mengzhao Jia, Wenhao Yu, Kaixin Ma, Tianqing Fang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Meng Jiang, Dong Yu https://arxiv.org/abs/2410.01744
Caption: This diagram illustrates the architecture of LEOPARD, a Multimodal Large Language Model (MLLM) designed for text-rich multi-image understanding. It highlights the adaptive high-resolution multi-image encoding module, which dynamically partitions and resizes images before encoding, and the subsequent vision-language connection for processing within the LLM. The diagram also showcases the original multi-image sample, the image allocation computation, and the final image encoding process.
Text-rich images are ubiquitous, but existing MLLMs struggle with multi-image scenarios involving these images. LEOPARD, a new MLLM, addresses this challenge with a new dataset and an adaptive encoding module.
The LEOPARD-INSTRUCT dataset, containing nearly one million multimodal instruction-tuning instances (739K focused on text-rich multi-image scenarios), tackles the scarcity of relevant training data. This dataset, augmented with existing open-source data and GPT-4 generated rationales, enhances reasoning capabilities.
LEOPARD also introduces an adaptive high-resolution multi-image encoding module to balance image resolution against sequence length. This module dynamically allocates visual sequence length based on each image's size (S<sub>i</sub> = (h<sub>i</sub>/v) × (w<sub>i</sub>/v), where h<sub>i</sub> and w<sub>i</sub> are the image dimensions and v is the visual encoder's input resolution) and uses pixel shuffling to compress long sequences without losing detail. If the total request exceeds a sub-image budget M, a scaling factor α = M/∑<sub>i</sub> S<sub>i</sub> proportionally reduces each image's sub-image count.
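The allocation step might look something like the sketch below, which scales each image's requested sub-image count by α whenever the total exceeds the budget M; the rounding behaviour and the one-sub-image floor are assumptions of this sketch.

```python
import math
from typing import List, Tuple

def allocate_subimages(image_sizes: List[Tuple[int, int]], v: int, budget: int) -> List[int]:
    """Adaptive sub-image allocation sketch.

    Each image of size (h_i, w_i) initially requests S_i = (h_i / v) * (w_i / v)
    sub-images at encoder resolution v; if the total exceeds the budget M, every
    request is scaled by alpha = M / sum_i S_i."""
    requests = [(h / v) * (w / v) for h, w in image_sizes]
    alpha = min(1.0, budget / sum(requests))      # only shrink when over budget
    return [max(1, math.floor(alpha * s)) for s in requests]

# e.g. three images with a 448-px encoder and a 16-sub-image budget:
# allocate_subimages([(896, 1344), (448, 448), (1344, 896)], v=448, budget=16) -> [6, 1, 6]
```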
Evaluated on 13 benchmarks, LEOPARD excelled in text-rich multi-image tasks, outperforming the best open-source MLLM by an average of +9.61 points on five such benchmarks. It also remained competitive in single-image and general-domain tasks. Ablation studies validated the effectiveness of the new dataset and encoding module. LEOPARD and LEOPARD-INSTRUCT represent a significant step towards robust MLLMs for complex real-world applications involving text-rich imagery.
This newsletter has showcased significant strides in multimodal image and text foundation models. From VLOOD's unsettling revelation of backdoor vulnerabilities to UNPIE's playful yet insightful exploration of visual pun understanding, the research covered in this newsletter highlights both the exciting potential and the critical challenges facing this field. ACE and LEOPARD demonstrate the push towards unified and specialized architectures for diverse multimodal tasks, while the uncertainty-guided enhancement method offers a practical approach to integrating powerful but computationally expensive foundation models into real-world applications like autonomous driving. As these models continue to evolve, addressing security concerns, improving robustness, and developing efficient training and deployment strategies will be crucial for realizing their full potential in shaping a truly multimodal future.