Dive into the Latest Advancements in Multimodal Image and Text Foundation Models

This newsletter explores the cutting-edge research in multimodal image and text foundation models, covering novel architectures, training methodologies, and benchmark datasets. We'll delve into how these models are pushing the boundaries of document understanding, addressing object hallucinations, enhancing image compression, and revolutionizing medical image analysis. Prepare to discover the latest innovations that are shaping the future of multimodal AI.

TokenOCR: A New Foundation for Text-Image Understanding

A Token-level Text Image Foundation Model for Document Understanding by Tongkun Guan, Zining Wang, Pei Fu, Zhengtao Guo, Wei Shen, Kai Zhou, Tiezhu Yue, Chen Duan, Hao Sun, Qianyi Jiang, Junfeng Luo, Xiaokang Yang https://arxiv.org/abs/2503.02304

Image Caption: This diagram illustrates the architecture of TokenOCR, a novel token-level Visual Foundation Model (VFM) trained on the TokenIT dataset, which contains 20 million images and 1.8 billion token-mask pairs for fine-grained text understanding. It also shows how TokenOCR is integrated into TokenVL, a document-level Multi-modal Large Language Model (MLLM), to enhance performance on tasks like text segmentation and retrieval by aligning visual and language embeddings at the token level. The diagram further depicts the downstream tasks of text segmentation and scene text retrieval, showcasing the application of TokenOCR in practical scenarios.

Recent advancements in Visual Foundation Models (VFMs) have been crucial for powering Multi-modal Large Language Models (MLLMs). However, existing VFMs, often trained with image-level supervision, struggle with the detailed understanding of text within images, especially in document-heavy contexts. This limitation impacts performance in tasks like document understanding, where accurate semantic capture of small and dense text is essential. This paper introduces TokenOCR, a novel token-level VFM specifically designed to address this gap.

At the heart of TokenOCR's development is TokenIT, a groundbreaking dataset of 20 million images and 1.8 billion token-mask pairs. This dataset, the first of its kind at this scale, provides token-level annotations, connecting individual text tokens (BPE subwords) to their precise locations within the image. The TokenIT dataset was meticulously constructed using a multi-step pipeline, including text image segmentation, text recognition, BPE tokenization, and token-level image-text pair generation, followed by stringent quality control. This rich dataset empowers TokenOCR to learn fine-grained image-as-text representations.

TokenOCR is trained by aligning token-level visual embeddings with corresponding language embeddings. For each token, a visual embedding is derived by mean-pooling the image features within its corresponding mask, while a simple token embedding layer generates the language embedding. The model's training aims to minimize the following objectives: L<sub>dis</sub>, L<sub>sim</sub>, and L<sub>sig</sub>. L<sub>dis</sub> and L<sub>sim</sub> ensure that embeddings of matching token pairs are close, while embeddings of non-matching pairs are distant. L<sub>sig</sub> employs a sigmoid loss to further refine the alignment. This approach effectively bridges the gap between visual and language modalities, creating a unified sequence representation suitable for integration with LLMs.
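
To make the token-level alignment concrete, here is a minimal PyTorch-style sketch of mean-pooling visual features inside each token's mask and scoring them against language embeddings with a sigmoid loss in the spirit of L<sub>sig</sub>. The function name, shapes, and temperature are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def token_level_alignment_loss(image_features, token_masks, token_ids,
                               token_embedding, temperature=0.07):
    """Hypothetical sketch of TokenOCR-style token-level alignment.

    image_features:  (H*W, D) patch features from the vision encoder
    token_masks:     (T, H*W) binary masks, one per BPE token in the image
    token_ids:       (T,) BPE ids of those tokens
    token_embedding: nn.Embedding producing language embeddings of dim D
    """
    # Mean-pool visual features inside each token's mask -> (T, D)
    mask = token_masks.float()
    visual_emb = (mask @ image_features) / mask.sum(dim=1, keepdim=True).clamp(min=1.0)

    # Language embedding for each BPE token -> (T, D)
    lang_emb = token_embedding(token_ids)

    visual_emb = F.normalize(visual_emb, dim=-1)
    lang_emb = F.normalize(lang_emb, dim=-1)

    # Pairwise similarities between visual and language token embeddings
    logits = visual_emb @ lang_emb.t() / temperature

    # Matching pairs lie on the diagonal; everything else is a negative.
    targets = torch.eye(len(token_ids), device=logits.device)

    # Sigmoid (SigLIP-style) loss standing in for L_sig; contrastive terms
    # akin to L_dis / L_sim could be added on the same similarity matrix.
    return F.binary_cross_entropy_with_logits(logits, targets)
```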

Building on this foundation, the paper introduces TokenVL, a document-level MLLM built upon TokenOCR that further refines spatial visual-language token alignment at the LLM level. It employs a two-stage training process: LLM-guided Token Alignment and Supervised Instruction Tuning. The former uses the TokenIT dataset for VQA-based text parsing and explicit token alignment, while the latter fine-tunes the model on various document VQA datasets. Experiments demonstrate TokenOCR's superior performance in zero-shot text segmentation, achieving a top score of 34.59%, and text retrieval, reaching 63.62% on bilingual tasks. TokenVL, with 8B parameters, shows significant improvements on OCRBench (+38 points) and across ten document VQA tasks (+8.8% on average).

Seeing is Believing: New Multimodal LLM Tackles Object Hallucinations

Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs by Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi https://arxiv.org/abs/2503.02597

Image Caption: This image visualizes the Modality-Mutual Attention (MMA) mechanism of the AKI model. Green squares represent image tokens (V), black squares represent text query tokens (Tq), and grey squares represent the attention weights. MMA allows image tokens to attend to text query tokens (represented by no blockage between green and black squares), enabling richer cross-modal understanding. This contrasts with standard causal attention where earlier tokens cannot attend to later ones (represented by the blockage between green and blue squares).

Multimodal Large Language Models (MLLMs) have shown impressive progress in processing combined visual and textual information. However, a persistent challenge is vision-language misalignment, leading to object hallucinations where the generated text doesn't accurately reflect the visual input. Current approaches focus on specialized vision-language connectors or expanding training data, but a fundamental architectural limitation remains largely unexplored.

Most MLLMs are built on decoder-only LLMs utilizing causal attention, preventing earlier tokens (e.g., image tokens) from attending to later tokens (e.g., text tokens). This "modality blindness" hinders effective cross-modal interaction, as the visual modality can't fully incorporate information from the textual query. This paper introduces AKI, a novel MLLM that proposes modality-mutual attention (MMA) to address this limitation.

MMA modifies the causal attention mechanism by adjusting the attention mask M. Instead of the standard causal mask (M<sub>ij</sub> = -∞ if j > i), MMA allows image tokens to attend to text tokens during supervised fine-tuning. The modified mask M' is defined as:

M'<sub>ij</sub> = 0 if j ≤ i

M'<sub>ij</sub> = 0 if 1 ≤ i ≤ |V| and |V| + 1 ≤ j ≤ |V| + |T<sub>Q</sub>|

M'<sub>ij</sub> = -∞ otherwise

where V represents image tokens and T<sub>Q</sub> represents text query tokens. This allows image tokens to "see" the text, enabling richer cross-modal understanding without adding parameters or training time. An alternative, dual-order training (DOT), doubles training time without consistent performance improvement.
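
A minimal sketch of how such a mask could be constructed, assuming the sequence is laid out as [image tokens | text-query tokens | remaining tokens] and an additive mask convention (0 = attend, −∞ = blocked); this is illustrative rather than AKI's actual implementation.

```python
import torch

def modality_mutual_mask(num_image, num_query, num_rest):
    """Build an additive attention mask in the spirit of MMA.

    Rows index the attending token i, columns the attended token j;
    0 means "may attend", -inf means "blocked".
    """
    n = num_image + num_query + num_rest

    # Standard causal mask: positions j <= i are visible.
    mask = torch.full((n, n), float("-inf"))
    mask = torch.triu(mask, diagonal=1)  # block only strictly-future positions

    # MMA modification: image tokens (rows 0..|V|-1) may also attend to the
    # text-query tokens (columns |V|..|V|+|T_Q|-1).
    mask[:num_image, num_image:num_image + num_query] = 0.0
    return mask

# Example: 4 image tokens, 3 query tokens, 2 response tokens.
print(modality_mutual_mask(4, 3, 2))
```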

AKI was evaluated on 12 multimodal benchmarks, encompassing general, knowledge-based, and vision-centric tasks. Results show AKI with MMA significantly outperforms the baseline, state-of-the-art CCA, and DOT, achieving a +7.2% average gain over CCA. A scaled-up AKI-4B model, trained on an extended schedule with OCR data, further demonstrated superior performance. This highlights MMA's potential as a generic and scalable solution for enhancing multimodal understanding and mitigating object hallucinations. While currently focused on image-text pairs, MMA is extensible to other modalities and scenarios.

Taming Large Multimodal Agents for Ultra-Low Bitrate Image Compression

Taming Large Multimodal Agents for Ultra-low Bitrate Semantically Disentangled Image Compression by Juan Song, Lijie Yang, Mingtao Feng https://arxiv.org/abs/2503.00399

A novel image compression framework, Semantically Disentangled Image Compression (SEDIC), addresses the challenge of achieving both semantic consistency and high perceptual quality at ultra-low bitrates. Traditional methods struggle with significant information loss at these extreme compression levels. SEDIC leverages the power of Large Multimodal Models (LMMs), like GPT-4 Vision, to disentangle image content into semantically rich representations, enabling more efficient encoding and high-quality reconstruction below 0.05 bpp.

SEDIC consists of a semantically disentangled encoder and a multi-stage semantic decoder. The encoder uses LMMs to extract essential semantic information: a low-quality reference image (compressed using a retrained learned image compression model), overall and object-level text descriptions, and semantic segmentation masks (generated using Grounding DINO and SAM). The decoder progressively restores the image object-by-object, starting with the compressed reference image.

Each decoding stage uses a pre-trained controllable diffusion model (ControlNet) to refine the reference image based on the extracted text descriptions and masks. This progressive, object-focused approach enables accurate placement and restoration of details, resulting in higher fidelity reconstructions. The diffusion model's attention map is guided by an energy function:

E(A, Mⱼ, k) = (1 − ∑ₘ∈Mⱼ Aₘ,ₖ / ∑ₘ Aₘ,ₖ)²

where Aₘ,ₖ represents the cross-attention map between spatial location m and token k in the object description, and Mⱼ is the semantic mask for object j. This function maximizes correlation within the mask and minimizes it outside, ensuring accurate object placement.
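
As a rough illustration of the guidance (not the authors' code), the energy for one object token can be computed from a cross-attention map and its binary mask as below; during sampling, the gradient of this energy with respect to the latent would steer the diffusion model toward the desired layout.

```python
import torch

def layout_energy(attn, mask, token_idx):
    """Energy that is low when token k's attention mass falls inside its mask.

    attn:      (HW, K) cross-attention map over spatial locations and text tokens
    mask:      (HW,) binary mask for object j (M_j)
    token_idx: index k of the object's token in the description
    """
    a = attn[:, token_idx]                 # attention of token k over space
    inside = (a * mask).sum()              # attention mass inside the mask
    total = a.sum().clamp(min=1e-8)        # total attention mass
    return (1.0 - inside / total) ** 2     # zero when all mass lies inside M_j

# During sampling, the summed energies over all objects would be differentiated
# with respect to the latent to nudge each object into its mask, e.g.:
# latent = latent.detach().requires_grad_(True)
# energy = sum(layout_energy(attn_map, m, k) for (m, k) in objects)
# latent = latent - step_size * torch.autograd.grad(energy, latent)[0]
```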

Experimental results on Kodak, DIV2K validation, and CLIC2020 datasets demonstrate SEDIC's superior performance compared to state-of-the-art codecs at ultra-low bitrates, consistently outperforming others across perceptual quality metrics (LPIPS, DISTS, FID, and KID). Visually, SEDIC reconstructions exhibit greater detail and fewer artifacts. Ablation studies confirm the importance of each encoding component, particularly the compressed reference image. SEDIC offers a promising new direction for ultra-low bitrate image compression by effectively leveraging the semantic understanding and generative capabilities of LMMs.

MLLMs Can Answer, But Can They Cite? New Benchmark Reveals Attribution Bottleneck

MciteBench: A Benchmark for Multimodal Citation Text Generation in MLLMs by Caiyu Hu, Yikai Zhang, Tinghui Zhu, Yiwei Ye, Yanghua Xiao https://arxiv.org/abs/2503.02589

Image Caption: This diagram illustrates the four-stage construction process of MCiteBench, a new benchmark for evaluating multimodal citation generation in Multimodal Large Language Models (MLLMs). The process begins with collecting academic papers and their rebuttals, followed by generating Explanation and Locating question-answer pairs using GPT-4o. Subsequently, candidate evidence (figures, tables, and text) is paired with answers, and finally, a rigorous quality-control process involving both automated and human filtering ensures the benchmark's reliability.

Multimodal Large Language Models (MLLMs) have progressed in integrating diverse information, but hallucination remains a challenge. Generating citations alongside text offers a verifiable chain of attribution, but current research primarily focuses on text-only content. This paper introduces MCITEBENCH, the first benchmark to evaluate and analyze MLLMs' ability to generate citations for multimodal content, addressing a crucial research gap by focusing on the challenges and opportunities of incorporating visual and tabular information.

MCITEBENCH comprises 3,000 samples from academic papers and review-rebuttals, providing rich multimodal content (text, figures, tables). The benchmark's four-stage construction involves: collecting an attribution corpus, constructing question-answer pairs, pairing answers with supporting evidence, and implementing rigorous quality control. Questions are categorized into Explanation and Locating types, with single-source and multi-source evidence examples covering various difficulty levels and modality combinations. Evaluation spans citation quality (Citation F1), source reliability (Source F1 and Source Exact Match), and answer accuracy (Accuracy).
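
For intuition, a toy set-overlap version of the source-level metrics might look like the sketch below; MCITEBENCH's official scoring protocol (including how citations are parsed from generated text) may differ.

```python
def source_f1(predicted_sources, gold_sources):
    """Toy set-overlap F1 between cited source IDs and gold evidence IDs."""
    pred, gold = set(predicted_sources), set(gold_sources)
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def source_exact_match(predicted_sources, gold_sources):
    """1.0 only if the cited sources exactly match the gold evidence set."""
    return float(set(predicted_sources) == set(gold_sources))

print(source_f1(["Figure 2", "Table 1"], ["Figure 2"]))   # ~0.67
print(source_exact_match(["Figure 2"], ["Figure 2"]))     # 1.0
```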

Experiments on MCITEBENCH reveal that while MLLMs often answer correctly, they struggle to cite properly, especially in multi-source scenarios, and exhibit a modality bias toward attributing textual evidence. Further analysis confirms this: models attend to distractor information even when their answers are correct. Deeper analysis highlights that the bottleneck lies in attribution, not understanding; when tested on both tasks with the same input, models achieved over 90% accuracy on understanding but performed significantly worse on attribution. This limitation shows that MLLMs can process and understand multimodal information yet struggle to accurately link their outputs to sources. The research emphasizes the need for future work on improving attribution capabilities toward more trustworthy and verifiable responses.

ABC: Taking Control of Multimodal Embeddings with VLMs

ABC: Achieving Better Control of Multimodal Embeddings using VLMs by Benjamin Schneider, Florian Kerschbaum, Wenhu Chen https://arxiv.org/abs/2503.00329

Image Caption: The diagram illustrates the two-stage training process of the ABC model. First, contrastive pretraining with negative mining creates robust image-text representations. Then, instruction fine-tuning allows the model to dynamically adjust its embeddings based on natural language instructions, enabling more nuanced understanding and control over multimodal representations.

Visual embedding models excel at zero-shot tasks but falter with ambiguity or user instructions. Existing multimodal models (often CLIP-based) embed image and text separately before fusion, resulting in weak inter-modality interactions and limited user control. This paper introduces ABC, an open-source multimodal embedding model using a vision-language model (VLM) backbone for deep integration of image features and natural language instructions, enabling a more nuanced and controllable representation for complex visual tasks.

ABC's two-stage training involves contrastive pretraining with negative mining (using a preliminary model to select "almost plausible" negative captions) and instruction fine-tuning (a lightweight adapter trained in 100 steps to incorporate synthetic instructions). On MSCOCO image-to-text retrieval, ABC surpasses all CLIP models up to 8B parameters (69.2% R@1), and sets a new state-of-the-art on the Massive Multimodal Embedding Benchmark (MMEB). On the new CtrlBench benchmark (requiring interleaving image content and textual instructions), ABC achieves 39.7% R@1, significantly outperforming CLIP-based models.
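
A simplified sketch of the negative-mining idea, assuming a preliminary embedding model has already produced image and caption embeddings (names and shapes are illustrative, not ABC's actual code).

```python
import torch
import torch.nn.functional as F

def mine_hard_negatives(image_emb, caption_embs, positive_idx, num_negatives=7):
    """Pick "almost plausible" negative captions for one image.

    image_emb:    (D,) image embedding from a preliminary embedding model
    caption_embs: (N, D) embeddings of candidate captions
    positive_idx: index of the true caption, excluded from the negatives
    """
    sims = F.cosine_similarity(image_emb.unsqueeze(0), caption_embs, dim=-1)
    sims[positive_idx] = float("-inf")                 # never pick the positive
    return torch.topk(sims, k=num_negatives).indices   # hardest remaining captions

# These indices would fill the negative slots of a contrastive batch, forcing
# the model to separate the true caption from near-miss alternatives.
```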

The paper highlights the importance of decoupling pretraining and instruction fine-tuning for faster iteration and broader accessibility. Factors like batch size and data quality during pretraining are crucial, mirroring large-scale CLIP training findings. The VLM backbone choice (Qwen2-VL-7B performed best) correlates with generative VLM benchmark performance. The authors advocate for robust multimodal benchmarks like CtrlBench, emphasizing tasks requiring true image-text integration and diverse instructions. ABC's ability to dynamically adjust embeddings based on instructions represents a significant step in multimodal learning.

Teaching Machines to Understand Distance: A New Loss Function for Multimodal Foundational Models

Teaching Metric Distance to Autoregressive Multimodal Foundational Models by Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, Youngjae Yu https://arxiv.org/abs/2503.02379

Image Caption: This image illustrates the DIST²Loss framework applied to visual grounding. An instruction prompts the model to locate a zebra, and the model responds with bounding box coordinates. The lower section details the loss calculation, contrasting the traditional cross-entropy loss with the proposed discretized distance loss, which leverages distance metrics between predicted and target coordinates.

Large Language Models (LLMs) are increasingly handling numerical data (coordinates, angles, embeddings) where metric distances are crucial. Traditional training treats these as discrete categories, ignoring inherent distance relationships. The DIScreTized DISTance Loss (DIST²Loss) framework addresses this by leveraging predefined distance relationships among output tokens to train autoregressive discrete models, enabling them to learn and preserve these relationships.

DIST²Loss transforms continuous exponential family distributions (derived from distance metrics) into discrete categorical optimization targets, compatible with existing architectures while incorporating distance information. The method calculates the distance between the target token and other candidates within a metric space, constructing a target likelihood distribution where closer tokens have higher probabilities. The model is optimized by minimizing the KL divergence between this target distribution and the model's predicted distribution. The target distribution is:

P<sub>d</sub>(v|x, t) = exp(-d(v, x, t)) / Σ<sub>v'∈V<sub>a</sub></sub> exp(-d(v', x, t))

where d is the distance metric, v is a token, x is the target subsequence, and t is the time step.
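
A small sketch of how this target distribution and the resulting loss could be computed, using an absolute-difference metric over coordinate tokens as an assumed example (not the paper's exact implementation).

```python
import torch
import torch.nn.functional as F

def dist2_target(distances):
    """Soft target P_d(v|x,t) = exp(-d(v,x,t)) / sum_v' exp(-d(v',x,t))."""
    return F.softmax(-distances, dim=-1)

def dist2_loss(logits, distances):
    """KL divergence between the distance-aware target distribution and the
    model's predicted next-token distribution."""
    target = dist2_target(distances)
    log_pred = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")

# Example: a coordinate vocabulary of tokens 0..99, ground-truth coordinate 42,
# using the absolute difference |v - 42| as the metric d.
coords = torch.arange(100, dtype=torch.float32)
distances = (coords - 42.0).abs().unsqueeze(0)   # (1, 100)
logits = torch.randn(1, 100)                     # model's next-token logits
print(dist2_loss(logits, distances))
```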

Evaluations of DIST²Loss across various multimodal tasks (visual grounding, robotic manipulation, generative reward modeling, and image generation with vector-quantized features) showed consistent benefits. In visual grounding, it improved bounding-box accuracy, rivaling state-of-the-art models pre-trained on large object detection datasets. In robotic manipulation, it increased success rates, especially with limited data. For generative reward modeling, it significantly outperformed cross-entropy loss, and for image generation it produced higher-quality images. DIST²Loss is particularly effective in resource-constrained settings, leading to faster convergence and improved generalization. Its simplicity allows plug-and-play integration, making it a promising tool for enhancing multimodal foundational models. This research suggests that moving beyond one-hot next-token prediction and incorporating structured numerical information can significantly improve performance.

Revolutionizing PET/CT Analysis with FratMAE: A New Foundation Model for Multimodal Medical Imaging

Developing a PET/CT Foundation Model for Cross-Modal Anatomical and Functional Imaging by Yujin Oh, Robert Seifert, Yihan Cao, Christoph Clement, Justin Ferdinandus, Constantin Lapa, Alessandro Liebich, Michelle Amon, Johanna Enke, Sifan Song, Runqi Meng, Fang Zeng, Ning Guo, Xiang Li, Pedram Heidari, Axel Rominger, Kuangyu Shi, Quanzheng Li https://arxiv.org/abs/2503.02824

Image Caption: Figure 1 illustrates the architecture of X-FratMAE, a novel foundation model for PET/CT analysis. Subfigure (a) depicts Stage 1 unsupervised pre-training with masked autoencoding and cross-modal interaction between PET and CT data, incorporating text metadata. Subfigure (b) shows Stage 2 supervised training for multimodal/unimodal segmentation using pre-trained encoders and a layer-wise concatenation and alignment strategy.

PET/CT is crucial in oncology, providing anatomical (CT) and functional/molecular (PET) information. Existing AI analyses often rely on task-specific models, limiting generalizability. This paper introduces FratMAE (Cross-Fraternal Twin Masked Autoencoder), a foundation model for PET/CT imaging, aiming to unlock the full potential of multimodal medical image analysis.

FratMAE utilizes a unique architecture integrating whole-body anatomical and functional context. Processing coronal scan stack-based 3D patches (instead of traditional axial stacks), it captures global uptake patterns and anatomical relevance across the body. Separate ViT encoders for PET and CT scans, coupled with cross-attention decoders, enable synergistic interactions during masked autoencoder training, learning intricate cross-modal relationships. Textual metadata (radiotracer type, demographics) is incorporated using contrastive learning with the InfoNCE loss:

L<sub>InfoNCE</sub> = −E<sub>(Z<sup>CLS</sup><sub>PET</sub>, Z<sup>CLS</sup><sub>TEXT</sub>)~P<sub>data</sub></sub> log [ exp(Z<sup>CLS</sup><sub>PET</sub> · Z<sup>CLS</sup><sub>TEXT</sub> / τ) / Σ<sup>N</sup><sub>j=1</sub> exp(Z<sup>CLS</sup><sub>PET</sub> · Z<sup>CLS</sup><sub>TEXT,j</sub> / τ) ]

where Z<sup>CLS</sup><sub>PET</sub> and Z<sup>CLS</sup><sub>TEXT</sub> are CLS tokens from encoded PET and text, N is batch size, τ is temperature, and P<sub>data</sub> is the joint distribution of paired PET and text. This contextualizes functional information with clinical data.
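
A compact sketch of this contrastive objective, assuming batch-paired PET and text CLS embeddings; the L2 normalization is an added assumption not shown in the formula above.

```python
import torch
import torch.nn.functional as F

def infonce_loss(z_pet, z_text, tau=0.07):
    """One-directional (PET -> text) InfoNCE over a batch of paired CLS tokens.

    z_pet, z_text: (N, D) CLS embeddings from the PET and text encoders.
    Positive pairs share the same row index; all other rows act as negatives.
    """
    z_pet = F.normalize(z_pet, dim=-1)     # L2 normalization (an added assumption)
    z_text = F.normalize(z_text, dim=-1)
    logits = z_pet @ z_text.t() / tau      # (N, N) similarity matrix
    labels = torch.arange(z_pet.size(0), device=z_pet.device)
    # Cross-entropy with the diagonal as the positive class equals
    # -log( exp(sim_ii / tau) / sum_j exp(sim_ij / tau) ) averaged over the batch.
    return F.cross_entropy(logits, labels)
```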

Pre-trained on AutoPET III and evaluated on the GHSG dataset for lesion segmentation and lymphoma staging, FratMAE showed significant improvements. For lesion segmentation, it achieved a Dice score of 0.640 and IoU of 0.496 with only 20% of training data, outperforming baselines. Superior Ann Arbor staging performance highlighted its robust whole-body metabolic representations. Qualitative analysis showed effective reduction of false positives/negatives. FratMAE offers a promising direction for multimodal medical image analysis. Integrating whole-body context, cross-modal relationships, and textual metadata provides significant advantages. Expanding the pre-training dataset with multi-center and multi-tracer data could further enhance performance and generalizability. Future work will explore additional downstream tasks like treatment response prediction.

Conclusion: A Multifaceted Look at the Future of Multimodal Models

This newsletter has showcased the rapid advancements happening in the field of multimodal image and text foundation models. From the granular text understanding capabilities of TokenOCR to the innovative attention mechanism of AKI, these models are addressing critical challenges like object hallucinations and improving performance on complex tasks. The introduction of SEDIC demonstrates how LMMs can be leveraged for ultra-efficient image compression while maintaining high perceptual quality. Meanwhile, MCITEBENCH provides a much-needed benchmark for evaluating citation generation in multimodal contexts, highlighting the importance of accurate attribution. Finally, FratMAE showcases the potential of foundation models to revolutionize medical image analysis by effectively integrating anatomical and functional information. These diverse approaches underscore the growing sophistication and potential of multimodal models to transform various domains, paving the way for more robust, versatile, and trustworthy AI systems.