Dive into the Latest Advancements in Multimodal Image and Text Foundation Models

This newsletter explores the cutting edge of multimodal image and text foundation models, covering novel architectures, training techniques, and applications across various domains, from scene understanding and medical imaging to artistic poster generation and remote sensing. We'll delve into the challenges faced by these powerful models, such as fine-grained alignment, computational efficiency, and domain adaptation, and examine the innovative solutions proposed by researchers.

M3: Revolutionizing Scene Understanding with 3D Spatial Multimodal Memory

M3: 3D-Spatial MultiModal Memory by Xueyan Zou, Yuchen Song, Ri-Zhao Qiu, Xuanbin Peng, Jianglong Ye, Sifei Liu, Xiaolong Wang https://arxiv.org/abs/2503.16413

Researchers have introduced M3 (3D Spatial MultiModal Memory), a groundbreaking system designed to capture and retain comprehensive information about static scenes from video sources. Unlike existing methods primarily focused on visual reconstruction, M3 integrates the power of 3D Gaussian Splatting with foundation models (like CLIP, SigLIP, DINOv2, and LLaMA) to build a semantically rich and spatially precise memory. This approach allows M3 to store and render feature representations at varying levels of detail, mimicking the human ability to recall scenes with different granularities.

A key challenge in previous feature splatting methods is the computational bottleneck caused by storing high-dimensional features for each Gaussian primitive. Furthermore, directly distilling 2D features into 3D can lead to misalignment and information loss. M3 tackles these challenges by introducing principal scene components (PSC) and Gaussian Memory Attention. PSCs are essentially a compressed memory bank of key features extracted from the foundation models across all views of the scene. Instead of distilling features directly, M3 uses low-dimensional principal queries from the 3D Gaussians as indices to retrieve relevant information from the PSCs via Gaussian Memory Attention. This allows M3 to preserve the expressive power of the original foundation model features while maintaining a computationally efficient 3D Gaussian structure.
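To make the PSC idea concrete, here is a minimal sketch of compressing multi-view foundation-model features into a small memory bank. The k-means clustering step, the feature dimensions, and the build_psc_memory helper are illustrative assumptions, not the paper's exact selection procedure.

```python
# Sketch: building a compressed "principal scene components" (PSC) memory bank
# from foundation-model features gathered across all views of a scene.
# The clustering choice (k-means) and dimensions are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def build_psc_memory(per_view_features: list[np.ndarray], num_components: int = 1024) -> np.ndarray:
    """per_view_features: one (H*W, C) feature map per view, e.g. from CLIP or DINOv2."""
    all_feats = np.concatenate(per_view_features, axis=0)   # (sum_i H_i*W_i, C)
    # Compress the highly redundant multi-view features into a small bank of representatives.
    kmeans = KMeans(n_clusters=num_components).fit(all_feats)
    return kmeans.cluster_centers_.astype(np.float32)       # (num_components, C)
```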

Formally, the rendered feature R̂ is calculated as R̂ = A<sub>gm</sub>(Q) = Softmax(Q × W<sub>m</sub> × PSC<sup>T</sup>) × PSC, where Q represents the view-based principal queries, W<sub>m</sub> is a learned memory projection matrix, and PSC denotes the principal scene components. This approach to feature storage and retrieval allows M3 to overcome the limitations of previous methods and efficiently represent complex 3D scenes.
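As an illustration of the retrieval step, the sketch below implements the attention read-out above in NumPy; the array shapes and the function name gaussian_memory_attention are assumptions for exposition, not the authors' code.

```python
import numpy as np

def gaussian_memory_attention(queries: np.ndarray, W_m: np.ndarray, psc: np.ndarray) -> np.ndarray:
    """
    queries: (N, d_q)  low-dimensional principal queries rasterized from the 3D Gaussians
    W_m:     (d_q, C)  learned memory projection matrix
    psc:     (K, C)    principal scene components (compressed feature bank)
    returns: (N, C)    rendered high-dimensional features
    """
    logits = queries @ W_m @ psc.T                 # (N, K) attention scores against the memory
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over the K memory slots
    return attn @ psc                              # weighted read-out of foundation-model features
```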

The researchers evaluated M3 on various datasets, including Garden, Train, Playroom, and a custom robot dataset (M3-Robot), using a combination of low-level metrics (PSNR, SSIM, LPIPS, cosine and L2 distance) and high-level metrics (mIoU, cIoU, AP, IR, and TR) to assess performance on tasks like image rendering, feature similarity, grounding, and retrieval. Quantitatively, M3 outperformed existing feature distillation methods like F-Splat and F-3DGS, achieving better feature similarity and downstream task performance while using fewer parameters. For instance, on the Train dataset, M3 achieved an mIoU of 25.4 for CLIP features compared to 24.2 for F-3DGS, while using only 35M parameters compared to F-3DGS's 61M. Qualitatively, M3 demonstrated a superior ability to preserve fine-grained details, handle overlapping objects, and generate coherent feature representations across different foundation models.

Finally, to showcase real-world applicability, M3 was deployed on a quadruped robot for a grasping task. The robot successfully used M3's memorized scene representation to locate and grasp a target object specified by a text query. This demonstration highlights the potential of M3 for robotic manipulation and navigation tasks in real-world environments. The researchers suggest future work could focus on developing a reasoning module that can directly operate on the optimized memory bank, further enhancing M3's capabilities.

RL Takes Center Stage in Guiding Medical Image Generation with Vision-Language Models

RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models by Parham Saremi, Amar Kumar, Mohammed Mohammed, Zahra TehraniNasab, Tal Arbel https://arxiv.org/abs/2503.15784

Image Caption: This diagram illustrates the RL4Med-DDPO framework, which uses reinforcement learning to enhance Stable Diffusion for medical image generation. The framework uses a policy gradient update process guided by an attribute classifier's reward, calculated as the ratio of correctly predicted attributes to the total number of attributes, to refine the alignment between generated images and text prompts, such as generating dermoscopic images with specific artifacts and diseases.

Vision-Language Foundation Models (VLFMs) have revolutionized image generation, but they often struggle with the fine-grained alignment crucial for medical imaging. Accurate localization and detection of clinical features are essential for diagnosis and analysis, requiring precise correspondence between image regions and textual descriptions. This paper introduces RL4Med-DDPO, a novel framework leveraging reinforcement learning (RL) to enhance the control and adaptability of VLFMs, specifically Stable Diffusion, for generating diverse and contextually accurate medical images.

The core of RL4Med-DDPO is a two-stage process. First, Stable Diffusion v1.5 is fine-tuned on a medical image dataset to establish a baseline alignment between text-image pairs. Then, a pre-trained multi-head EfficientNet classifier evaluates the generated images, providing a reward signal based on the alignment between the image content and the input text prompt. This reward guides the denoising U-Net of Stable Diffusion through a policy optimization process, specifically Denoising Diffusion Policy Optimization (DDPO). DDPO reframes the diffusion process as a Markov Decision Process (MDP), allowing the RL agent to learn a policy that maximizes the cumulative reward, effectively refining the alignment between generated images and textual descriptions. The attribute-alignment reward r<sub>a</sub>(·) is the ratio of correctly predicted attributes to the total number of attributes: r<sub>a</sub>(x<sup>(i)</sup>) = (# correctly predicted attributes) / (total number of attributes), where x<sup>(i)</sup> denotes the i-th generated sample.
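A minimal sketch of such an attribute-alignment reward is shown below; the attribute_classifier interface and the dictionary-based attribute encoding are assumptions standing in for the paper's multi-head EfficientNet.

```python
# Sketch: attribute-alignment reward used to guide DDPO-style fine-tuning.
# `attribute_classifier` is a stand-in for the pre-trained multi-head EfficientNet;
# its interface here is an assumption for illustration.
import numpy as np

def attribute_alignment_reward(image: np.ndarray, prompt_attributes: dict[str, int],
                               attribute_classifier) -> float:
    """Reward = fraction of prompted attributes the classifier finds in the generated image."""
    predicted = attribute_classifier(image)        # e.g. {"hair": 1, "ruler": 0, "melanoma": 1}
    correct = sum(int(predicted[name] == value) for name, value in prompt_attributes.items())
    return correct / len(prompt_attributes)
```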

The authors tested their framework on the ISIC 2019 skin cancer dataset, focusing on generating images of melanoma and melanocytic nevus with specific attributes like hairs, gel bubbles, ink, or a ruler. They introduced a new metric, the Artifact Prevalence Rate (APR), which measures the proportion of synthesized images whose generated attributes accurately reflect the input text prompt: APR = |{x<sub>i</sub> ∈ X : f(x<sub>i</sub>) = C(input text)}| / N, where X is the set of all N synthesized samples, f(·) is the attribute classifier, and C(·) encodes the input text. Results showed that RL4Med-DDPO significantly outperformed the fine-tuned Stable Diffusion baseline in terms of APR, achieving a score of 17.13% compared to 6.08% for the baseline. The framework generated realistic and diverse images, even for subgroups with limited or no real samples in the training data. Quantitative evaluation using Fréchet Inception Distance (FID) and Learned Perceptual Image Patch Similarity (LPIPS) also favored RL4Med-DDPO, demonstrating improved image quality and alignment.
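For concreteness, a small sketch of the APR computation follows; classify_attributes and encode_prompt are hypothetical helpers standing in for f(·) and C(·).

```python
# Sketch: Artifact Prevalence Rate (APR) over a batch of synthesized images.
# `classify_attributes` and `encode_prompt` are hypothetical helpers standing in for
# the paper's attribute classifier f(.) and prompt encoder C(.).
def artifact_prevalence_rate(images, prompt, classify_attributes, encode_prompt) -> float:
    target = encode_prompt(prompt)                       # attributes requested in the text prompt
    matches = sum(int(classify_attributes(x) == target) for x in images)
    return matches / len(images)                         # fraction of images matching the prompt
```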

The study also explored the impact of augmented data on downstream tasks. Classifiers trained on real data augmented with synthetic images generated by RL4Med-DDPO showed improved performance, particularly for underrepresented subclasses. This suggests that the synthesized images carry valuable discriminative information, highlighting the potential of RL-guided image generation for enhancing medical image analysis. This research paves the way for more sophisticated applications of RL in medical image generation, including tasks like subgroup clustering and disease marker discovery.

LLMs Show Promise, But Struggle with HTR Consistency

Benchmarking Large Language Models for Handwritten Text Recognition by Giorgia Crosilla, Lukas Klic, Giovanni Colavizza https://arxiv.org/abs/2503.15195

This paper benchmarks the performance of Multimodal Large Language Models (MLLMs) for Handwritten Text Recognition (HTR), comparing them to established, supervised models like those available on Transkribus. The study evaluates eight MLLMs, including proprietary models like GPT-4 and Claude Sonnet 3.5, and open-source alternatives like MiniCPMV-2 and Qwen2-vl-7B, on datasets in English, French, German, and Italian, covering both modern and historical handwriting. The core methodology involves a zero-shot approach where the MLLM is presented with an image and prompted to transcribe the handwritten text. A post-correction step, where the LLM is prompted to refine its initial transcription, was also evaluated.

The results reveal a clear performance gap between MLLMs on modern vs. historical handwriting and English vs. other languages. On modern English handwriting, MLLMs, particularly GPT-4o-mini, achieved impressive results, outperforming Transkribus' supermodel on the IAM dataset with a CER of 1.71% and WER of 3.34%, and achieving similar performance on the RIMES dataset. However, performance degraded significantly on historical English texts and texts in other languages. While results on historical English were more balanced, with LLMs approaching the performance of Transkribus' "The Text Titan I" model, accuracy on non-English datasets was generally poor, with CERs often exceeding 20%. For instance, on the Italian LAM dataset, the best-performing LLM, Claude Sonnet 3.5, achieved a CER of 20.55% and WER of 27.78%.
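For readers less familiar with these metrics, the sketch below shows how CER and WER are conventionally computed (edit distance normalized by reference length); it reflects the standard definitions rather than the benchmark's evaluation code.

```python
# Sketch: character and word error rates as typically defined for HTR evaluation
# (Levenshtein edit distance normalized by reference length).
def levenshtein(ref: list, hyp: list) -> int:
    # Classic dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # drop a reference symbol
                            curr[j - 1] + 1,           # drop a hypothesis symbol
                            prev[j - 1] + (r != h)))   # substitute (or match if equal)
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)
```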

The post-correction experiments showed limited success. While some minor improvements were observed in isolated cases, the LLMs generally struggled to consistently refine their initial predictions. The best improvement was observed with GPT-4o on the ICDAR2017 dataset, reducing the CER by 8% and the WER by 4.7%. However, the overall error rates remained high, rendering the transcriptions unusable. This suggests that, at present, LLM self-correction is not a viable replacement for manual post-correction in HTR. The study also investigated the potential for dataset memorization by LLMs, finding minimal evidence of this, suggesting that the benchmark datasets used were likely not part of the LLMs' pre-training data.

POSTA: A New Framework Revolutionizes Artistic Poster Design

POSTA: A Go-to Framework for Customized Artistic Poster Generation by Haoyu Chen, Xiaojie Xu, Wenbo Li, Jingjing Ren, Tian Ye, Songhua Liu, Ying-Cong Chen, Lei Zhu, Xinchao Wang https://arxiv.org/abs/2503.14908

Image Caption: The image illustrates the POSTA framework for artistic poster design, showing its three-stage process: background generation, design planning by MLLM, and artistic text stylization. Each stage is depicted with inputs, outputs, and the specific model or technique employed, demonstrating how POSTA transforms a user prompt into a visually appealing and informative poster.

Creating visually stunning and informative posters requires a delicate balance of artistry and precision. Existing automated poster design methods often fall short in terms of text accuracy, customization options, and overall aesthetic appeal. POSTA (A Go-to Framework for Customized Artistic Poster Generation) aims to address these limitations by combining the power of diffusion models and multimodal large language models (MLLMs). This allows for a highly controllable and customizable design process, enabling the creation of professional-quality posters for artistic domains like movies and exhibitions.

POSTA tackles poster creation in three stages. First, the Background Generation Module uses a diffusion model and trained LoRA models to create a professional-grade background image. Users can also upload custom backgrounds. Next, the Design Planning Module, powered by MLLMs, intelligently plans the layout, text placement, and font attributes, ensuring 100% text accuracy and full editability. Finally, the Artistic Text Stylization Module uses a mask-guided inpainting model to add artistic flair to key text elements, seamlessly integrating them with the background and enhancing the overall aesthetic. The blending process is described by the formula: I<sub>blended</sub> = M ⊙ I<sub>1</sub> + (1 − M) ⊙ I<sub>2</sub>, where M is the mask, I<sub>1</sub> is the generated image, and I<sub>2</sub> is the original image.
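The blending step can be made concrete with a short sketch; the array shapes are assumptions, and the function simply applies the formula above.

```python
# Sketch: mask-guided blending of a stylized text render with the original poster,
# following I_blended = M ⊙ I_1 + (1 − M) ⊙ I_2. Array shapes are assumptions.
import numpy as np

def blend(mask: np.ndarray, stylized: np.ndarray, original: np.ndarray) -> np.ndarray:
    """
    mask:     (H, W, 1) values in [0, 1], 1 where the inpainted text should appear
    stylized: (H, W, 3) image produced by the mask-guided inpainting model (I_1)
    original: (H, W, 3) poster before stylization (I_2)
    """
    return mask * stylized + (1.0 - mask) * original   # element-wise (Hadamard) blend
```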

To train these models, the researchers developed the PosterArt dataset, comprising PosterArt-Design (high-quality posters annotated with layout, typography, and background information) and PosterArt-Text (focused on artistic text stylization with pixel-wise segmentation). In a comparative study, POSTA demonstrated significant improvements in text accuracy, user description understanding, and artistic stylization, particularly for longer text sequences. Quantitative analysis involving human evaluators and GPT-4V showed POSTA consistently scored higher in visual appeal, text readability, and prompt relevance. OCR-based comparisons revealed POSTA achieved the highest text consistency with user input.

MLLMs Struggle to See the World Through Remote Sensing Lenses

A Vision Centric Remote Sensing Benchmark by Abduljaleel Adejumo, Faegheh Yeganli, Clifford Broni-bediako, Aoran Xiao, Naoto Yokoya, Mennatullah Siam https://arxiv.org/abs/2503.15816

Image Caption: This image illustrates the process of creating the Remote Sensing Multimodal Visual Patterns (RSMMVP) benchmark. CLIP-blind pairs are identified by comparing similarity scores from CLIP and DINOv2, and then these pairs are used to construct a Visual Question Answering (VQA) task to evaluate the spatial reasoning capabilities of Multimodal Large Language Models (MLLMs) on remote sensing imagery. The example shows two visually distinct images that CLIP incorrectly assigns a high similarity score to, leading to an incorrect answer on the VQA task.

MLLMs have excelled in vision-language tasks with natural images, but their performance in remote sensing (RS) remains a challenge due to the unique characteristics of RS imagery, like fine-grained spatial structures and diverse sensor modalities. This paper investigates the limitations of CLIP-based MLLMs in RS, focusing on their weaknesses in visual grounding and spatial reasoning.

The core issue is the inability of these models to differentiate visually distinct yet semantically similar RS images. The authors introduce the Remote Sensing Multimodal Visual Patterns (RSMMVP) benchmark to address this. RSMMVP identifies CLIP-blind pairs – pairs of RS images with high CLIP similarity scores (> 0.95) but low DINOv2 similarity scores (< 0.6) – and uses a VQA task built on these pairs to assess MLLMs' ability to ground visual information in high-resolution RS imagery.
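A minimal sketch of how such CLIP-blind pairs could be mined is shown below; embed_clip and embed_dino are placeholder encoders (any CLIP and DINOv2 feature extractors would do), and the thresholds follow the criteria quoted above.

```python
# Sketch: mining "CLIP-blind" pairs from a pool of remote-sensing images.
# `embed_clip` and `embed_dino` are placeholder encoders returning one feature
# vector per image; the thresholds follow the benchmark's stated criteria.
import numpy as np
from itertools import combinations

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_clip_blind_pairs(images, embed_clip, embed_dino,
                          clip_thresh: float = 0.95, dino_thresh: float = 0.6):
    clip_feats = [embed_clip(x) for x in images]
    dino_feats = [embed_dino(x) for x in images]
    pairs = []
    for i, j in combinations(range(len(images)), 2):
        # High CLIP similarity but low DINOv2 similarity: CLIP "sees" the pair as near-identical,
        # while a vision-centric encoder still distinguishes the two images.
        if cosine(clip_feats[i], clip_feats[j]) > clip_thresh and \
           cosine(dino_feats[i], dino_feats[j]) < dino_thresh:
            pairs.append((i, j))
    return pairs
```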

The benchmark evaluated state-of-the-art MLLMs, including domain-adapted models like GeoChat and RS-LLaVA. Results revealed a significant performance gap between humans (91.7% accuracy) and MLLMs, with GPT-4 achieving the highest accuracy (45.3%). The lower performance of domain-adapted models highlights the limitations of simply fine-tuning existing architectures on RS data. The analysis revealed specific visual patterns in RS imagery that challenge CLIP-based models, including object counting, object detection, geometric relationships, orientation, color recognition, and size comparisons.

Conclusion: A Multifaceted Look at Multimodal Models

This newsletter has showcased the exciting progress and persistent challenges in the field of multimodal image and text foundation models. From innovative memory systems for 3D scene understanding to the application of reinforcement learning for fine-grained control in medical image generation, the research covered here highlights the diverse approaches being explored to enhance the capabilities of these models. The limitations of current MLLMs in specialized domains like handwritten text recognition and remote sensing underscore the need for continued research into domain adaptation and the development of more robust architectures. The innovative solutions presented, such as M3's principal scene components and Gaussian Memory Attention, and RL4Med-DDPO's use of reinforcement learning, offer promising directions for future development. As the field continues to evolve, we can anticipate even more powerful and versatile multimodal models capable of tackling increasingly complex tasks across a wide range of applications.