Hi Elman,
This newsletter dives into the latest advancements in multimodal foundation models, exploring exciting new approaches in sequential recommendation, universal retrieval, image editing, and benchmark development. We'll dissect four cutting-edge papers that tackle key challenges in adapting, applying, and evaluating these powerful models, focusing on the interplay between text and image modalities. From efficient fine-tuning strategies to novel retrieval frameworks, exemplar-based editing, and multi-document evaluation benchmarks, this edition provides a comprehensive overview of the rapidly evolving landscape of multimodal AI.
Efficient and Effective Adaptation of Multimodal Foundation Models in Sequential Recommendation by Junchen Fu, Xuri Ge, Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, Kaiwen Zheng, Yongxin Ni, Joemon M. Jose https://arxiv.org/abs/2411.02992
Caption: This figure illustrates three different adaptation strategies for multimodal foundation models: Full Fine-tuning (FFT), Embedded PEFT (EPEFT), and Decoupled PEFT (DPEFT). DPEFT, the core of IISAN-Versa, uses separate trainable modules (diamonds) alongside frozen Transformer modules (TRM rectangles), allowing for efficient adaptation by updating only the smaller modules and caching the outputs of the larger, frozen ones. The color coding indicates frozen (blue) and updated (orange) components.
Multimodal foundation models (MFMs) have significantly improved sequential recommender systems. However, adapting these large models, especially with the increasing asymmetry between text and visual encoders, presents substantial efficiency challenges. While Parameter-Efficient Fine-Tuning (PEFT) methods like Adapters and LoRA address parameter efficiency, they often overlook practical aspects like GPU memory usage and training speed. This paper introduces IISAN-Versa, a versatile extension of the Intra- and Inter-model Side Adapted Network (IISAN) framework, designed for efficient and effective adaptation of both symmetrical and asymmetrical MFMs in sequential recommendation.
IISAN-Versa utilizes a Decoupled PEFT (DPEFT) structure. Instead of embedding trainable parameters within the large backbone models, it creates separate, smaller trainable side-adapted networks (SANs) for text, image, and inter-modal interactions. This decoupling, along with a caching strategy for backbone hidden states, dramatically reduces the computational burden during backpropagation. Furthermore, to handle the asymmetry between large language models (LLMs) and smaller vision transformers, IISAN-Versa introduces a straightforward yet powerful strategy combining group layer-dropping with dimension transformation alignment. This allows the framework to harness the power of larger LLMs without the computational cost of adapting the entire model.
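To make the decoupled structure concrete, here is a minimal PyTorch sketch of a side-adapted tower that consumes cached hidden states from a frozen backbone. The class name, layer counts, and dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SideAdaptedNetwork(nn.Module):
    """Small trainable tower fed by cached hidden states of a frozen backbone
    (illustrative sketch; dimensions and layer selection are assumptions)."""
    def __init__(self, backbone_dim=768, side_dim=128, num_layers=4):
        super().__init__()
        self.down_proj = nn.ModuleList(
            [nn.Linear(backbone_dim, side_dim) for _ in range(num_layers)]
        )
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(side_dim, nhead=4, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, cached_states):
        # cached_states: list of [B, L, backbone_dim] tensors, computed once by
        # the frozen text/image backbone and stored (the caching strategy), so
        # no backbone forward or backward pass is needed during training.
        h = 0.0
        for proj, block, states in zip(self.down_proj, self.blocks, cached_states):
            h = block(proj(states) + h)   # fuse each selected backbone layer
        return h.mean(dim=1)              # pooled item representation
```

Because only modules like this receive gradients, GPU memory and training time scale with the small side networks rather than with the frozen multi-billion-parameter backbones.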
The paper explores two IISAN-Versa variants: the symmetrical IISAN-VS and the asymmetrical IISAN-VA. Experiments on Amazon review datasets demonstrate that IISAN-VA, incorporating a larger text encoder (BERT-large), substantially outperforms IISAN-VS (BERT-base), traditional full fine-tuning, and embedded PEFT methods. IISAN-VA achieves gains of up to +6.31% in HR@10 and +6.12% in NDCG@10. Additional experiments reveal a scaling effect, where larger text encoders generally lead to better performance, highlighting the potential of leveraging even more powerful LLMs. Notably, scaling the visual encoder provided minimal performance improvements. The decoupled nature of IISAN-Versa, combined with caching, significantly reduces both GPU memory usage and training time compared to full fine-tuning and EPEFT methods. The caching strategy trades on-disk memory for efficiency by storing hidden states to avoid redundant computations. The paper also explores IISAN-Versa's performance with multimodal text scenarios, using generated captions from images and videos. Results on the MicroLens dataset show that IISAN-Versa effectively handles multimodal text, achieving state-of-the-art performance and outperforming existing methods, including a full fine-tuning multimodal baseline. While multimodal text proves informative, the best performance is achieved using raw image and title data, emphasizing the importance of raw modality information.
MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs by Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping https://arxiv.org/abs/2411.02571
Caption: This diagram illustrates a universal multimodal retrieval system using multimodal large language models (MLLMs). It shows how instructions and queries, which can include both text and images, are processed by the MM-Embed model to retrieve candidates from various modalities (text, image, or text+image). Finally, a zero-shot reranker refines the results based on a relevance score, improving performance on complex multimodal queries.
Traditional retrieval models typically focus on narrow, single-modality scenarios. This paper introduces techniques for universal multimodal retrieval using multimodal large language models (MLLMs), accommodating diverse user-instructed tasks with multimodal queries and documents. The authors explore fine-tuning an MLLM as a bi-encoder retriever, guided by instructions, across 16 retrieval tasks from 10 datasets in the M-BEIR benchmark. They discovered that while MLLMs excel at understanding complex, interleaved text-image queries, they demonstrate a modality bias, underperforming smaller CLIP retrievers in cross-modal tasks.
To address this bias, the authors introduce modality-aware hard negative mining. This technique generates two types of hard negatives: those with incorrect modality (ranked higher than the positive example but of a different modality) and those with unsatisfactory information (ranked lower than a threshold but of the correct modality). Both types are added to the training pool, and the InfoNCE contrastive loss is then minimized:
$\mathcal{L}_{\text{InfoNCE}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(n_{\theta}(inst_i, q_i) \cdot n_{\theta}(c_i^{+})/T)}{\sum_{c' \in D_B} \exp(n_{\theta}(inst_i, q_i) \cdot n_{\theta}(c')/T)}$
where $D_B$ includes positive and negative documents, $n_{\theta}(\cdot)$ is a normalized embedding vector, and $T$ is the temperature. To maintain strong text retrieval performance, they employ continual fine-tuning on public text-to-text retrieval datasets. The resulting model, MM-Embed, achieves state-of-the-art performance on M-BEIR and surpasses the previous best text retrieval model, NV-Embed-v1, on the MTEB benchmark. MM-Embed achieves an average R@5 of 52.7% on M-BEIR and an average nDCG@10 of 60.3% on MTEB, compared to NV-Embed-v1's 59.36%.
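Below is a minimal PyTorch sketch of this contrastive objective with the mined hard negatives treated as explicit negative documents. The embedding shapes, the temperature value, and the restriction of the denominator to these negatives (rather than all in-batch documents $D_B$) are simplifying assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, hard_neg_embs, temperature=0.05):
    """query_emb, pos_emb: [B, d]; hard_neg_embs: [B, K, d] stacking both
    wrong-modality and unsatisfactory-information negatives. All embeddings
    are assumed L2-normalized (n_theta in the formula above)."""
    pos_logit = (query_emb * pos_emb).sum(-1, keepdim=True)            # [B, 1]
    neg_logits = torch.einsum("bd,bkd->bk", query_emb, hard_neg_embs)  # [B, K]
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long)            # positive at index 0
    return F.cross_entropy(logits, targets)
```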
Finally, the paper investigates using off-the-shelf MLLMs as zero-shot rerankers. By prompting the MLLM to answer true/false questions about the relevance of retrieved candidates, they find significant improvements in tasks with complex, interleaved text-image queries. For instance, on the CIRCO composed image retrieval dataset, reranking boosts mAP@5 by over 7 points compared to existing state-of-the-art methods. This highlights the potential of MLLMs for refining retrieval results, particularly for challenging queries. These findings demonstrate the potential of MLLMs for universal multimodal retrieval, offering a path towards more flexible and powerful search systems. The authors suggest future research directions, including distilling MM-Embed to smaller models and integrating the zero-shot reranker into the retriever itself, paving the way for more efficient and effective multimodal retrieval systems.
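A sketch of the zero-shot reranking step is below. Scoring a candidate by comparing the model's next-token logits for "True" versus "False" is an assumption made here for illustration, and the image inputs an MLLM would also receive are omitted; `mllm` and `tokenizer` are placeholder interfaces.

```python
import torch

@torch.no_grad()
def rerank_score(mllm, tokenizer, query_text, candidate_text):
    """Return a relevance score for one (query, candidate) pair based on a
    true/false relevance prompt (illustrative scoring rule and prompt wording)."""
    prompt = (
        f"Query: {query_text}\n"
        f"Candidate: {candidate_text}\n"
        "Is the candidate relevant to the query? Answer True or False. Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = mllm(**inputs).logits[0, -1]                  # next-token distribution
    true_id = tokenizer.convert_tokens_to_ids("True")
    false_id = tokenizer.convert_tokens_to_ids("False")
    return torch.log_softmax(logits[[true_id, false_id]], dim=-1)[0].item()
```

Retrieved candidates are then re-sorted by this score, which is the step that yields the reported gains on interleaved text-image queries.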
ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models by Ashutosh Srivastava, Tarun Ram Menta, Abhinav Java, Avadhoot Jadhav, Silky Singh, Surgan Jandial, Balaji Krishnamurthy https://arxiv.org/abs/2411.03982
Caption: The dog's breed has been changed from a Golden Retriever to a Dalmatian.
Moving beyond text-based methods, ReEdit presents a novel framework that leverages the power of exemplars for more intuitive and efficient image editing. Instead of relying on precise textual descriptions, ReEdit learns edits directly from a pair of exemplar images (x, x<sub>edit</sub>) and applies these edits to a new target image (y). This bypasses the inherent ambiguity of language and facilitates more nuanced and complex edits that are challenging to describe with words.
The ReEdit framework is modular and efficient, operating entirely at inference time without requiring any finetuning or optimization. It captures the edit in both image and text modalities. In the image space, the edit is captured using pretrained adapter modules applied to CLIP embeddings, represented as A<sub>img</sub> = λ(H(x<sub>edit</sub>) - H(x)) + (1 - λ)H(y), where H represents a linear projection followed by layer normalization, and λ is a weighting factor balancing the edit direction against the target image's embedding. A multimodal VLM (LLaVA) generates a textual description (g<sub>caption</sub>) of the edit by analyzing the exemplar pair and the target image. This dual representation, g = (A<sub>img</sub>, E<sub>text</sub>(g<sub>caption</sub>)), provides a comprehensive understanding of the desired transformation.
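A minimal sketch of the image-space edit representation, following the weighted form written above; `clip_image_encoder`, `proj` (a linear projection plus layer norm standing in for H), and the value of λ are assumptions.

```python
import torch
import torch.nn as nn

def edit_representation(clip_image_encoder, proj, x, x_edit, y, lam=0.6):
    """Compute A_img = lam * (H(x_edit) - H(x)) + (1 - lam) * H(y), where H is
    a linear projection + layer norm over CLIP image embeddings (an
    illustrative stand-in for the pretrained adapter modules)."""
    def H(image):
        return proj(clip_image_encoder(image))
    return lam * (H(x_edit) - H(x)) + (1.0 - lam) * H(y)

# Example of the assumed projection head:
# proj = nn.Sequential(nn.Linear(768, 768), nn.LayerNorm(768))
```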
To ensure fidelity to the target image's structure, ReEdit utilizes a conditioning mechanism within a pretrained Stable Diffusion model. The target image is first inverted using DDIM inversion, and features and attention maps are extracted from the denoising process. These features and attention maps, along with the edit embedding g, are then injected into specific layers of the diffusion model during the generation of the edited image (ŷ<sub>edit</sub>). This process ensures that the edit is applied seamlessly while preserving the original content and layout.
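The inversion step can be sketched as the standard deterministic DDIM update run toward higher noise levels. Here `unet(x, t, cond)` predicting noise and `alphas_cumprod` are assumed interfaces, and the feature/attention-map injection itself is only indicated by the trajectory being returned for reuse.

```python
import torch

@torch.no_grad()
def ddim_invert(unet, latents, cond, alphas_cumprod, timesteps):
    """Map the target image's latents back to noise with deterministic DDIM,
    keeping the per-step states so that features and attention maps extracted
    along this trajectory can be injected when generating the edited image."""
    x = latents
    trajectory = [x]
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):     # increasing noise
        eps = unet(x, t, cond)                               # predicted noise
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # predicted clean latents
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
        trajectory.append(x)
    return trajectory
```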
Evaluations on a newly curated dataset of exemplar-based edits demonstrate ReEdit's superior performance. Quantitatively, ReEdit outperforms strong baselines like VISII and InstructPix2Pix across several metrics, including LPIPS (0.26), SSIM (0.51), and Directional Similarity (0.05). It also achieves a competitive CLIP score of 31.38 and S-Visual score of 0.39. Additionally, ReEdit is roughly four times faster than the next best baseline. Qualitatively, ReEdit consistently produces more accurate and visually pleasing edits, maintaining the target image's integrity while faithfully applying the desired transformations. While ReEdit shows promising results, it currently has limitations in handling edits involving the addition or removal of very small objects, an area for future research.
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models by Chuhan Li, Ziyao Shangguan, Yilun Zhao, Deyuan Li, Yixin Liu, Arman Cohan https://arxiv.org/abs/2411.04075
Caption: This image illustrates the six-step process for creating the M3SCIQA benchmark, a multi-modal, multi-document scientific question answering dataset. The process begins with selecting an anchor paper and a visual element (figure/table) within it, then formulating related questions and identifying a relevant reference paper. Finally, reference-based questions are generated and combined with the visual context questions to form the final benchmark entry.
Existing benchmarks for evaluating large language models (LLMs) and large multi-modal models (LMMs) often focus on single-document, text-only tasks. These benchmarks fail to capture the complexity of real-world research where researchers consult multiple papers and interpret non-textual data such as figures and tables. To address this, researchers have introduced M3SciQA, a multi-modal, multi-document scientific question answering benchmark. M3SciQA consists of 1,452 expert-annotated questions across 70 natural language processing (NLP) paper clusters. Each cluster includes an anchor paper and all its cited papers. The benchmark mimics a common research workflow: a finding from an image in the anchor paper leads to investigation within a referenced paper, requiring models to cross-reference and integrate information across multiple documents.
M3SciQA evaluates models in two stages. First, the visual context evaluation assesses the ability of LMMs to rank cited papers based on their relevance to a question about a figure in the anchor paper. Second, the reference-based evaluation tests LLMs' ability to answer questions about the top-ranked papers retrieved in the first stage. The benchmark includes diverse question types, covering comparison, data extraction, location, and visual understanding for visual context questions, and conceptual understanding, methodological analysis, results interpretation, implications, and critical analysis for reference-based questions. The evaluation included 18 foundation models, both open-source and proprietary. The results revealed significant limitations in current LMMs and LLMs. In the visual context evaluation, the best-performing model, GPT-4o, achieved a Mean Reciprocal Rank (MRR) of 0.488 compared to a human expert score of 0.796. Open-source LMMs struggled significantly with this task due to limited context windows, hallucinated outputs, and formatting issues. In the reference-based evaluation, using GPT-4o's top-ranked papers as context, LLMs were evaluated on their ability to answer questions based on the provided text. Performance generally improved with access to more context (up to the top 3 ranked papers).
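For reference, the ranking metric used in the first stage can be computed as below; the list-of-rankings input format is an assumption for illustration.

```python
def mean_reciprocal_rank(rankings, gold_papers):
    """MRR over the visual-context stage: reciprocal of the position at which
    the gold reference paper appears in each model ranking (0 if missing)."""
    total = 0.0
    for ranking, gold in zip(rankings, gold_papers):
        total += 1.0 / (ranking.index(gold) + 1) if gold in ranking else 0.0
    return total / len(rankings)
```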
The study highlights several key findings. Even the best-performing models struggle with visual reasoning and paper ranking on scientific data. Open-source LMMs face inherent limitations in long-range ranking tasks. There's a trade-off between precision and recall in retrieval, with performance peaking when considering the top 3 ranked papers. Finally, LLMs exhibit varying degrees of instruction compliance, especially when asked to acknowledge when an answer cannot be derived from the given information. These findings underscore the need for further development in multi-modal and multi-document understanding for foundation models in scientific domains.
This newsletter showcases the rapid progress and ongoing challenges in the field of multimodal foundation models. From efficient adaptation techniques like IISAN-Versa for sequential recommendation to the development of universal multimodal retrieval systems like MM-Embed, the research highlights the growing sophistication of these models. However, challenges remain, as evidenced by the limitations in handling complex edits involving small objects in ReEdit and the struggles of current models on the M3SciQA benchmark, which underscores the need for improved multi-document and multi-modal understanding, especially in scientific domains. The development of specialized benchmarks and evaluation frameworks like M3SciQA is crucial for driving further advancements and ensuring robust and reliable performance. The trend towards leveraging larger LLMs within multimodal systems is evident, though efficient adaptation and mitigation of modality biases remain key areas for future research. As the field continues to evolve, we can expect even more powerful and versatile multimodal AI systems capable of tackling increasingly complex real-world tasks.