Elman, Your Multimodal Digest: Navigating the Latest in Image and Text Foundation Models

Hello Elman,

In this newsletter, we delve into the rapidly evolving landscape of multimodal image and text foundation models. We'll explore breakthroughs in complex reasoning, efficient embedding utilization, multilingual capabilities, and novel architectures designed to mimic human-like cognitive processes for continuous multimodal interaction. From enhancing reasoning with similarity computations to leveraging the power of Multimodal Large Language Models (MLLMs) for multi-reference image generation and calorie estimation, this collection of recent papers offers valuable insights into the latest advancements in this dynamic field. Prepare to be informed and inspired by the ingenuity driving this exciting area of research.

Enhancing Reasoning in Multimodal LLMs Through Similarity Computation

Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation by Xiaofeng Zhang, Fanshuo Zeng, Yihao Quan, Zheng Hui, Jiawei Yao https://arxiv.org/abs/2412.09817

Image Caption: This diagram illustrates the Simignore method for enhancing multimodal LLM reasoning. It shows how image and text embeddings are compared using cosine similarity to select the most relevant image tokens. The less relevant image tokens are then ignored by the multimodal encoder, improving the LLM's reasoning ability.

Multimodal large language models, often referred to as Large Vision-Language Models (LVLMs), have demonstrated impressive progress, particularly in complex reasoning tasks. However, understanding their internal mechanisms, especially during chain-of-thought reasoning, remains a challenge. This paper investigates the interaction between image and text tokens within LVLMs, shedding light on their information flow. The authors observe a crucial distinction: image tokens semantically related to the text exhibit stronger information-flow convergence in the LLM's decoding layers and receive higher attention scores, whereas less relevant image tokens lack this convergence and receive negligible attention. This key finding suggests that not all image tokens contribute equally to the reasoning process, opening avenues for optimization.

Based on this observation, the authors introduce Simignore, a novel image token reduction method designed to enhance complex reasoning abilities. Simignore computes the similarity between image and text embeddings and selectively ignores image tokens deemed irrelevant. The method maps image and text token embeddings into a shared similarity metric space and uses cosine similarity: the similarity between each image token and the text tokens is computed, and the top K image tokens with the highest similarity are retained, prioritizing the most relevant visual information. The remaining, less important tokens have their attention masks set to 0, removing them from subsequent processing. Formally, the cosine similarity S(i, j) between the i-th normalized image token embedding ImgEmb<sub>norm</sub>(i, :) and the j-th normalized text token embedding TextEmb<sub>norm</sub>(j, :) is S(i, j) = ImgEmb<sub>norm</sub>(i, :) ⋅ TextEmb<sub>norm</sub>(j, :)<sup>T</sup>.
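To make the selection step concrete, here is a minimal sketch assuming per-token embeddings have already been extracted; the tensor names, the way scores are aggregated over text tokens (a simple sum), and the masking interface are illustrative assumptions, not the authors' implementation.

```python
import torch

def simignore_mask(img_emb, txt_emb, top_k):
    """Keep only the top-k image tokens most similar to the text (illustrative sketch).

    img_emb: (num_img_tokens, dim) image token embeddings
    txt_emb: (num_txt_tokens, dim) text token embeddings
    Returns a (num_img_tokens,) attention mask with 1 for kept tokens, 0 for ignored ones.
    """
    # Normalize so the dot product equals cosine similarity: S(i, j) = img_i · txt_j^T
    img_norm = torch.nn.functional.normalize(img_emb, dim=-1)
    txt_norm = torch.nn.functional.normalize(txt_emb, dim=-1)
    sim = img_norm @ txt_norm.T                      # (num_img_tokens, num_txt_tokens)

    # Score each image token by its aggregate similarity to all text tokens
    # (how the paper aggregates over text tokens is an assumption here)
    scores = sim.sum(dim=-1)

    # Retain the top-k image tokens; zero out the attention mask for the rest
    keep = scores.topk(top_k).indices
    mask = torch.zeros(img_emb.size(0), dtype=img_emb.dtype, device=img_emb.device)
    mask[keep] = 1.0
    return mask
```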

Evaluations on the ScienceQA dataset, a benchmark for complex visual reasoning, demonstrate Simignore's effectiveness. Integrating Simignore with state-of-the-art LVLMs, including LLaVA-1.5 (7B and 13B) and Mipha-3B, consistently improved performance. For example, LLaVA-1.5 (7B) achieved an accuracy improvement of 2.87%. Further ablation studies highlighted the impact of ignoring varying numbers of image tokens. Ignoring unimportant tokens yielded the best results, while ignoring important tokens significantly hampered performance, validating the importance of selective token retention. Experiments with different similarity metrics confirmed cosine similarity as the most effective for this task.

The impact of image tokens on LLM reasoning was further explored through k-means clustering of image token embeddings. The authors observed that ignoring certain clusters of image tokens, specifically those with low similarity to the text, led to correct answers in cases where the baseline model failed. This suggests the presence of "spy" tokens that negatively influence the reasoning process. While Simignore shows promising results, the authors acknowledge limitations and suggest future work on adaptively choosing the number of tokens to ignore based on the specific task and input. This work provides valuable insights into the interplay of image and text within LVLMs, offering a simple yet effective method for enhancing complex reasoning.

Jina AI's Multilingual and Multimodal Embeddings for Text and Images

jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images by Andreas Koukounas, Georgios Mastrapas, Bo Wang, Mohammad Kalim Akram, Sedigheh Eslami, Michael Günther, Isabelle Mohr, Saba Sturua, Scott Martens, Nan Wang, Han Xiao https://arxiv.org/abs/2412.08802

Image Caption: The architecture of Jina-CLIP-V2, a multilingual, multimodal embedding model, is depicted, showcasing its dual encoder setup with JINA-XLM-ROBERTA for text and EVA02-L for images. The model processes text inputs (left, right, caption) and images, generating embeddings for various downstream tasks like cross-modal and text retrieval. The output embeddings are variable in dimension, ranging from 512 to 1024 depending on the input modality and configuration.

Jina AI introduces jina-clip-v2, a powerful new multilingual, multimodal embedding model designed for both cross-modal and text retrieval. Building on its predecessor, jina-clip-v1, this enhanced model addresses limitations in language support, handling of complex visual content, and text retrieval performance. Jina-clip-v2 utilizes a sophisticated multi-task, multi-stage contrastive learning approach across multiple languages, coupled with an improved training recipe to boost text-only retrieval. It also incorporates Matryoshka Representation Learning and vector truncation for efficient storage and computation.

The model's architecture features a dual encoder setup. The text encoder is initialized with pre-trained Jina-XLM-ROBERTa weights, incorporating Flash Attention, rotary positional embeddings, and LoRA. The image encoder utilizes the EVA02 family of ViT models, featuring 2D rotary positional embeddings and a memory-efficient attention implementation. The training data encompasses multilingual text pairs and triplets, image-caption pairs (both short and long), and visually rich datasets including PDFs, scientific graphs, and Wikipedia images. The training process unfolds in three stages: initial alignment of multimodal and text representations, refinement using longer texts and detailed image captions, and further enhancement with hard negatives and high-resolution images. The loss function used is a combination of InfoNCE loss for text-image and text-text matching, with an extended version incorporating hard negatives in the later stages:

L<sub>nce</sub>(B) = L<sub>nce</sub><sup>→</sup>(B) + L<sub>nce</sub><sup>←</sup>(B), with L<sub>nce</sub><sup>→</sup>(B) := E<sub>(q,p)~B</sub>[-ln(e<sup>cos(q,p)/T</sup> / Σ<sub>i=1</sub><sup>M</sup> e<sup>cos(q,p<sub>i</sub>)/T</sup>)] and L<sub>nce</sub><sup>←</sup>(B) := E<sub>(q,p)~B</sub>[-ln(e<sup>cos(p,q)/T</sup> / Σ<sub>i=1</sub><sup>M</sup> e<sup>cos(p,q<sub>i</sub>)/T</sup>)]
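As a rough illustration of this bidirectional in-batch InfoNCE objective, the sketch below computes both directions over a pairwise similarity matrix; the temperature value and the omission of the hard-negative extension are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def bidirectional_infonce(q_emb, p_emb, temperature=0.05):
    """In-batch bidirectional InfoNCE (sketch): query->positive plus positive->query.

    q_emb, p_emb: (batch, dim) embeddings of paired items (e.g., texts and images).
    """
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)

    # cos(q_i, p_j) for every pair in the batch, scaled by the temperature T
    logits = q @ p.T / temperature
    targets = torch.arange(q.size(0), device=q.device)  # matching pairs lie on the diagonal

    # L_nce = L_nce(q -> p) + L_nce(p -> q)
    loss_qp = F.cross_entropy(logits, targets)
    loss_pq = F.cross_entropy(logits.T, targets)
    return loss_qp + loss_pq
```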

Jina-clip-v2 demonstrates impressive performance across various benchmarks. On the English CLIP Benchmark, it achieves 79.09% and 89.73% for text-to-image and image-to-text retrieval, respectively. In multilingual evaluations, it obtains competitive results on Crossmodal-3600 (81.43% and 83.23% for text-to-image and image-to-text) and multilingual MS COCO (84.87% and 86.03% for text-to-image and image-to-text). It also performs comparably to dedicated multilingual text embedding models on MTEB Retrieval (49.32% for English, 69.85% for multilingual) and STS tasks (81.29% for English, 67.77% for multilingual). Notably, jina-clip-v2 achieves state-of-the-art performance on the ViDoRe benchmark for visual document understanding with an average nDCG@5 score of 52.65%.

The paper also investigates the impact of image resolution on complex document retrieval, concluding that a resolution of 512×512 pixels offers a good balance between performance and computational efficiency. It further compares unified-batch training against the multi-task learning strategy actually employed, finding the multi-task approach superior due to the inherent information asymmetry between the visual and textual modalities and highlighting the importance of accounting for this asymmetry in future CLIP-style models. Finally, Matryoshka Representation Learning allows output vectors to be truncated with minimal performance loss, demonstrating robustness in feature acquisition.
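Matryoshka-style truncation is straightforward to apply at inference time. The snippet below shows the general idea of slicing and re-normalizing an embedding; the 1024-dimensional output size matches the model's full embedding, while everything else is illustrative.

```python
import numpy as np

def truncate_embedding(embedding, target_dim):
    """Truncate a Matryoshka-trained embedding and re-normalize it (illustrative).

    embedding: 1-D vector, e.g. the full 1024-dim jina-clip-v2 output.
    target_dim: smaller dimension to keep, e.g. 512 or 64.
    """
    truncated = embedding[:target_dim]
    return truncated / np.linalg.norm(truncated)

# Example: shrink a full-size vector to 512 dims for cheaper storage and search
full = np.random.randn(1024).astype(np.float32)
compact = truncate_embedding(full, 512)
```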

EasyRef: Leveraging MLLMs for Omni-Generalized Group Image Reference

EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM by Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li https://arxiv.org/abs/2412.09618

Image Caption: EasyRef uses a multimodal large language model (MLLM) to process reference images and text prompts, generating consistent image embeddings that are then projected into a diffusion model's latent space. The diffusion model, guided by these embeddings and user prompts, synthesizes a new image, demonstrating improved performance in multi-reference image generation compared to existing methods like IP-Adapter and LoRA. This diagram illustrates the architecture of EasyRef, showcasing the flow of information from reference images and text prompts through the MLLM and into the diffusion model.

Personalized image generation with diffusion models has made significant strides, but existing methods often struggle to maintain consistency across multiple reference images. Tuning-free methods, which typically average image embeddings, can lead to spatial misalignments and loss of crucial visual elements. Tuning-based methods like LoRA, while effective, require computationally expensive fine-tuning for each new group of reference images, limiting their practical application. EasyRef, a novel plug-and-play adaptation method, tackles these limitations by harnessing the power of Multimodal Large Language Models (MLLMs).

EasyRef conditions diffusion models on multiple reference images and text prompts using a four-part architecture. A pre-trained MLLM encodes the reference images and text prompt based on a specific instruction, capturing consistent visual elements across the references. To improve efficiency, EasyRef utilizes learned reference tokens F<sub>ref</sub> ∈ ℝ<sup>N×D</sup> and aggregates reference representations within the deepest layer of the MLLM, reducing the computational overhead associated with long context inputs. These aggregated representations are then projected into the diffusion model's latent space using a condition projector. Finally, trainable adapters integrate the image conditioning embedding into the diffusion process through cross-attention layers, defined as:

X = Softmax(QK<sup>T</sup>/√d)V + Softmax(QK̂<sup>T</sup>/√d)V̂

where Q, K, and V come from the standard text cross-attention, and K̂ = c<sub>i</sub>Ŵ<sub>k</sub> and V̂ = c<sub>i</sub>Ŵ<sub>v</sub> are projected from the image conditioning embedding c<sub>i</sub> by the newly added adapter weights.
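A minimal sketch of this decoupled cross-attention, in the spirit of IP-Adapter-style adapters, is shown below; the module structure, parameter names, and single-head simplification are assumptions rather than EasyRef's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """Adds an image-conditioned attention branch on top of the text cross-attention (sketch)."""

    def __init__(self, dim, cond_dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        # Text-branch projections (kept frozen in adapter-style training)
        self.to_k_txt = nn.Linear(cond_dim, dim, bias=False)
        self.to_v_txt = nn.Linear(cond_dim, dim, bias=False)
        # New trainable projections for the image conditioning embedding c_i
        self.to_k_img = nn.Linear(cond_dim, dim, bias=False)
        self.to_v_img = nn.Linear(cond_dim, dim, bias=False)

    def forward(self, x, text_ctx, image_ctx):
        # x: (batch, seq, dim), text_ctx / image_ctx: (batch, ctx_len, cond_dim)
        q = self.to_q(x)
        # Text term: Softmax(Q K^T / sqrt(d)) V
        txt_out = F.scaled_dot_product_attention(
            q, self.to_k_txt(text_ctx), self.to_v_txt(text_ctx))
        # Image term with K̂, V̂ projected from c_i, added to the text term
        img_out = F.scaled_dot_product_attention(
            q, self.to_k_img(image_ctx), self.to_v_img(image_ctx))
        return txt_out + img_out
```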

A progressive training scheme further enhances performance. EasyRef begins with alignment pretraining on a large image-text dataset, followed by single-reference finetuning on real-world images. Finally, multi-reference finetuning on a curated dataset of image groups sharing a common topic enables the MLLM to learn consistent element extraction across multiple references. The training objective mirrors the original stable diffusion model, minimizing the difference between the added noise and the model's denoising prediction:

L = 𝔼<sub>x<sub>0</sub>, ε, c<sub>t</sub>, c<sub>i</sub>, t</sub> ||ε − ε<sub>θ</sub>(x<sub>t</sub>, c<sub>t</sub>, c<sub>i</sub>, t)||²
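For readers less familiar with this objective, a condensed training step might look like the following; the `unet` and `scheduler` interfaces are stand-ins for whichever diffusion backbone is used, not EasyRef's released code.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(unet, scheduler, x0, text_cond, image_cond,
                            num_train_timesteps=1000):
    """One denoising training step (sketch): predict the added noise, take an MSE loss.

    unet is assumed to be a noise-prediction model eps_theta(x_t, t, c_t, c_i);
    scheduler is assumed to expose add_noise(x0, noise, t), as diffusers-style
    schedulers do. Both signatures are assumptions for illustration.
    """
    noise = torch.randn_like(x0)                              # epsilon
    t = torch.randint(0, num_train_timesteps, (x0.size(0),), device=x0.device)
    x_t = scheduler.add_noise(x0, noise, t)                   # forward diffusion to step t

    # L = E || eps - eps_theta(x_t, c_t, c_i, t) ||^2
    noise_pred = unet(x_t, t, text_cond=text_cond, image_cond=image_cond)
    return F.mse_loss(noise_pred, noise)
```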

Evaluated on the newly introduced multi-reference generation benchmark, MRBench, EasyRef demonstrates superior performance compared to IP-Adapter and LoRA. On the held-in split of MRBench, EasyRef achieved a CLIP-I score of 0.843, CLIP-T of 0.726, and DINO-I of 0.672, surpassing IP-Adapter (0.768, 0.632, and 0.527 respectively). In zero-shot evaluations on the held-out split, EasyRef (0.833, 0.709, 0.614) again outperformed IP-Adapter (0.795, 0.645, 0.579), while LoRA failed to generalize. Human evaluations further confirmed EasyRef's superiority in both aesthetic quality and reference consistency. Ablation studies validated the effectiveness of the reference aggregation strategy, multimodal instruction input, and progressive training scheme. EasyRef represents a significant advancement in multi-reference image generation, offering a powerful and efficient solution for personalized image synthesis with enhanced control and fidelity.

Embeddings: A Training-Free Approach to Medical Image Classification

Embeddings are all you need! Achieving High Performance Medical Image Classification through Training-Free Embedding Analysis by Raj Hansini Khoiwal, Alan B. McMillan https://arxiv.org/abs/2412.09445

Developing AI models for medical image analysis traditionally requires extensive training on large datasets, demanding substantial computational resources. This research explores a more efficient alternative: leveraging image embeddings from pre-trained models combined with simple linear classifiers. Instead of training a model from scratch, the researchers used pre-trained Convolutional Neural Networks (CNNs) like ResNet50 and multimodal models like CLIP to generate image embeddings. These embeddings, capturing high-level visual and semantic features, were then fed into linear classifiers such as Logistic Regression (LR) and Support Vector Machines (SVM). This approach significantly reduces computational burden, as embeddings are computed only once per image.

This embedding-based approach was evaluated across five diverse medical imaging datasets: CBIS-DDSM (mammography), CheXpert (chest radiographs), HAM10000 and PAD-UFES-20 (skin lesions), and ODIR (ocular diseases). Performance was compared against benchmark AUC scores from traditional, fully-trained models. Each dataset was split into 80% training and 20% testing sets, with dataset-specific preprocessing steps including resizing and normalization. A grid search strategy with five-fold cross-validation was employed for hyperparameter tuning, optimizing the regularization strength (C) and, for some models, the loss function and kernel parameters.
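A bare-bones version of this pipeline, using precomputed embeddings and scikit-learn, could look like the sketch below; the embedding files, binary-label setup, and hyperparameter grid are placeholders rather than the study's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score

# X: (n_images, embed_dim) precomputed CLIP or ResNet50 embeddings (computed once per image)
# y: (n_images,) binary labels -- both file names are hypothetical placeholders
X = np.load("embeddings.npy")
y = np.load("labels.npy")

# 80/20 split, mirroring the evaluation protocol described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Grid search over the regularization strength C with five-fold cross-validation
clf = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
clf.fit(X_train, y_train)

auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
```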

The results demonstrated that embedding-based classifiers achieved comparable or superior performance to benchmark models in four out of five datasets. Significant improvements were observed in skin lesion classification (HAM10000 and PAD-UFES-20) and ocular disease detection (ODIR), with AUC increases ranging from 0.3 to 0.5 points. In the CBIS-DDSM dataset, improvements were modest but still comparable to traditional models. CLIP embeddings generally outperformed ResNet50 embeddings, particularly when combined with LR. For example, on HAM10000, CLIP with LR achieved the highest AUC of 0.9586, compared to a benchmark AUC of 0.609. Similarly, on PAD-UFES-20, CLIP with LR achieved an AUC of 0.9145, compared to a benchmark of 0.487.

This study suggests that using pre-trained embeddings with linear classifiers can be a highly effective and computationally efficient alternative to training deep learning models from scratch, especially for specific medical imaging applications. The substantial improvements in skin and ocular image classification highlight the potential of this approach to enhance diagnostic accuracy while minimizing computational demands. However, the authors acknowledge limitations, including reliance on publicly available datasets, lack of external validation, and the potential for linear classifiers to miss complex non-linear relationships. Further research is needed to address these limitations and explore the application of this approach to more complex multi-label tasks and real-world clinical data.

CaLoRAify: Calorie Estimation using VLMs and LoRA

CaLoRAify: Calorie Estimation with Visual-Text Pairing and LoRA-Driven Visual Language Models by Dongyu Yao, Keling Yao, Junhong Zhou, Yinghao Zhang https://arxiv.org/abs/2412.09936

Image Caption: This diagram illustrates the architecture of CaLoRAify, a vision-language model for calorie estimation and ingredient analysis. It shows how visual input from a food image is processed by a Vision Transformer and combined with textual input from a question set and retrieved nutritional facts from a RAG database (like USDA) to generate a final calorie estimate and ingredient list using LLaMA 2. The system leverages techniques like LoRA and RAG to achieve accurate results.

The ongoing obesity epidemic underscores the need for effective dietary management tools. Traditional calorie estimation methods, often relying on multi-step pipelines involving food classification, portion size estimation, and caloric calculation, are limited by their dependence on specific metadata, error propagation, and hardware requirements. CaLoRAify, a novel vision-language model (VLM) framework, addresses these challenges by providing accurate calorie estimation and ingredient analysis from a single food image.

CaLoRAify leverages MiniGPT-v2, a unified interface for various vision-language tasks, and integrates it with a Retrieval-Augmented Generation (RAG) mechanism. This hybrid approach combines the visual understanding of MiniGPT-v2 with RAG's knowledge retrieval capabilities, allowing the system to access external nutritional databases like the USDA. Users simply provide a monocular food image. The system processes the image through a Vision Transformer (ViT) and extracts visual features, which are projected into the LLaMA-2 embedding space. A structured question, guided by a task-specific identifier, is then formulated and sent to the RAG module. The RAG module retrieves relevant nutritional information from the external database based on the identified ingredients. Finally, LLaMA-2 integrates the retrieved information with the visual features to generate accurate calorie estimates and detailed ingredient analysis.

A new dataset, CalData, comprising 330K image-text pairs derived from Recipe1M+ and augmented with detailed nutritional information, was curated for training and evaluation. The training pipeline incorporates Low-rank Adaptation (LoRA) and RAG techniques to enhance performance in this specialized domain. A rephrasing model diversifies the question set during training.
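While the paper's training code is not reproduced here, a LoRA setup of this kind is commonly expressed with the PEFT library roughly as follows; the base checkpoint name, target modules, and rank/alpha/dropout values are illustrative assumptions, not CaLoRAify's exact configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base language model backing the VLM; the checkpoint name is a placeholder
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Low-rank adapters on the attention projections; hyperparameters are illustrative
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights remain trainable
```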

Experimental results demonstrate CaLoRAify's effectiveness. The fine-tuned model shows significant improvements compared to the baseline MiniGPT-4, with ROUGE-2 increasing by 55.01%, BLEU by 61.48%, and the aggregate metric (weighted average of ROUGE-L and BLEU) by 8.16%. These results highlight CaLoRAify's potential for real-world dietary applications, offering a convenient and accurate solution for calorie tracking and management. Future work will focus on optimizing the framework for real-time inference on mobile devices, expanding the dataset to include diverse cuisines, and developing more interactive features, such as personalized dietary recommendations.

InternLM-XComposer2.5-OmniLive: Streaming Multimodal Interactions

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions by Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li, Junbo Niu, Shuangrui Ding, Qipeng Guo, Haodong Duan, Xin Chen, Han Lv, Zheng Nie, Min Zhang, Bin Wang, Wenwei Zhang, Xinyue Zhang, Jiaye Ge, Wei Li, Jingwen Li, Zhongying Tu, Conghui He, Xingcheng Zhang, Kai Chen, Yu Qiao, Dahua Lin, Jiaqi Wang https://arxiv.org/abs/2412.09596

Image Caption: The architecture of InternLM-XComposer2.5-OmniLive (IXC2.5-OL) is depicted, showcasing its specialized modules for streaming perception (audio and vision encoders), reasoning (Large Language Model), and memory. The system processes streaming multimodal data (audio and video) and uses a compressor model within its memory module to efficiently store and retrieve long-term representations, enabling continuous and adaptive AI interaction. This design allows IXC2.5-OL to overcome the limitations of single decoder-only LLMs in handling continuous multimodal interaction, mimicking human-like cognition by performing perception, memory, and reasoning concurrently.

Moving beyond the limitations of single decoder-only multimodal large language models (MLLMs) for continuous interaction, researchers introduce InternLM-XComposer2.5-OmniLive (IXC2.5-OL), a system designed to emulate human-like cognition when processing streaming multimodal data. Current MLLMs struggle to perceive and reason simultaneously, analogous to being unable to think while perceiving, and storing extensive historical data in long contexts becomes impractical for extended interactions. IXC2.5-OL addresses these challenges with a Specialized Generalist AI approach, featuring disentangled modules for streaming perception, reasoning, and memory.

The Streaming Perception Module processes multimodal information on-the-fly. A live video perception model encodes video streams, storing key details, while an audio model recognizes speech and other sounds, triggering reasoning based on user queries. The Multi-modal Long Memory Module addresses the context window bottleneck by integrating short-term and long-term memory. It compresses short-term memories (video clip features F<sub>k</sub> ∈ ℝ<sup>T×N×C'</sup>) into long-term representations (H<sub>k</sub> ∈ ℝ<sup>P×C</sup>, Ĥ<sub>k</sub> ∈ ℝ<sup>C</sup>) using a compressor model, enhancing retrieval efficiency and accuracy. The compression and integration are formulated as:

H<sub>k</sub>, Ĥ<sub>k</sub> = Compressor([F<sub>k</sub> ○ H<sub>k</sub> ○ Ĥ<sub>k</sub>])

H<sub>1</sub>, H<sub>2</sub>, ..., H<sub>k</sub> = Compressor([H<sub>1</sub> ○ H<sub>2</sub> ○ ... ○ H<sub>k</sub>, Ĥ<sub>1</sub> ○ Ĥ<sub>2</sub> ○ ... ○ Ĥ<sub>k</sub>])
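The sketch below conveys the general shape of such a compressor interface (concatenate clip features with existing memory tokens and attend over them with learned queries); the module design, shapes, and head count are assumptions for illustration, not the released IXC2.5-OL code.

```python
import torch
import torch.nn as nn

class MemoryCompressor(nn.Module):
    """Illustrative compressor: learned queries attend over concatenated features and memory."""

    def __init__(self, dim, num_memory_tokens):
        super().__init__()
        # P learned queries produce the clip-level memory H_k; one extra query yields the global Ĥ_k
        self.queries = nn.Parameter(torch.randn(num_memory_tokens + 1, dim))
        # dim is assumed divisible by the head count
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, clip_feats, long_mem, global_mem):
        # clip_feats: (B, T*N, C), long_mem: (B, P, C), global_mem: (B, 1, C)
        # [F_k ○ H_k ○ Ĥ_k]: concatenate along the token dimension
        context = torch.cat([clip_feats, long_mem, global_mem], dim=1)
        q = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        out, _ = self.attn(q, context, context)
        # Split the outputs into the P-token clip memory and the single global token
        return out[:, :-1], out[:, -1]
```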

The Reasoning Module, powered by an enhanced InternLM-XComposer2.5, handles queries and performs reasoning, coordinating with the perception and memory modules. This modular design enables simultaneous perception, memory, and reasoning, providing continuous and adaptive AI service.

IXC2.5-OL excels across benchmarks, achieving competitive Word Error Rates (WER) on WenetSpeech (Chinese) and LibriSpeech (English). For video understanding, it achieves state-of-the-art results among open-source models under 10B parameters on MLVU (66.2% M-Avg) and MVBench (68.7% overall accuracy), with strong performance on Video-MME (60.6%) and MMBench-Video (1.42). Impressively, IXC2.5-OL sets a new state-of-the-art for open-source models on StreamingBench (73.79%), showcasing its proficiency in real-time video interactions.

IXC2.5-OL represents a significant step towards creating AI systems capable of sustained and dynamic interaction with multimodal environments. By decoupling perception, memory, and reasoning, it overcomes limitations of traditional MLLMs, providing a promising blueprint for future development in long-term multimodal AI interaction. The open-source release of the model and code further fosters community involvement and accelerates progress in this exciting field.

Conclusion

This newsletter highlighted key advancements in multimodal image and text foundation models. We saw how similarity computations can refine reasoning in LVLMs by prioritizing relevant visual information, as demonstrated by Simignore. Jina-clip-v2 showcased impressive multilingual and multimodal capabilities, excelling in both cross-modal and text retrieval tasks. EasyRef introduced a novel approach to multi-reference image generation, leveraging the power of MLLMs to enhance consistency and control. Finally, InternLM-XComposer2.5-OmniLive presented a compelling vision for continuous multimodal interaction, mimicking human-like cognition through disentangled modules for perception, memory, and reasoning. These advancements collectively push the boundaries of multimodal AI, paving the way for more sophisticated and human-like interactions with machines.