The convergence of vision and language is rapidly transforming the AI landscape. This newsletter delves into the latest advancements in multimodal image and text foundation models, exploring novel architectures, training methodologies, and evaluation benchmarks designed to push the boundaries of AI capabilities. From universal text-driven image segmentation to unifying Earth observation data and assessing expert-level reasoning, these papers highlight the exciting progress and persistent challenges in this dynamic field.
Towards Universal Text-driven CT Image Segmentation by Yuheng Li, Yuxiang Lai, Maria Thor, Deborah Marshall, Zachary Buchwald, David S. Yu, Xiaofeng Yang https://arxiv.org/abs/2503.06030
Caption: The diagram illustrates the OpenVocabCT framework for universal text-driven 3D CT image segmentation. It leverages LLMs (LLaMa 3 and GPT-4) to process radiology reports and generate granular organ-level captions, which are then used alongside image encoders and decoders for multi-granularity contrastive learning. This approach enables accurate segmentation based on diverse text prompts, as demonstrated by the final segmented abdominal CT scan.
Accurate and efficient segmentation of Computed Tomography (CT) images is crucial for medical diagnosis and treatment planning. While deep learning has revolutionized medical image analysis, existing models often struggle with the diversity and complexity of real-world clinical data. OpenVocabCT addresses these limitations by introducing a universal text-driven model specifically pre-trained on large-scale 3D CT images and paired radiology reports.
The key innovation of OpenVocabCT lies in its pretraining framework. Leveraging the extensive CT-RATE dataset, the model employs Large Language Models (LLMs) like LLaMa 3 and GPT-4 to break down complex diagnostic reports into granular, organ-level descriptions. This granular approach tackles the challenge of aligning lengthy and intricate radiology reports with corresponding image data. Furthermore, a novel multi-granularity contrastive learning strategy is employed, utilizing both organ-level captions and full reports. This approach is formulated as a combined loss function: L_final = L_CLIP + L_MGCL, where L_CLIP is the standard CLIP contrastive loss computed against the full report, and L_MGCL is the multi-granularity contrastive loss computed against the generated organ-level captions. This combined loss allows the model to capture both fine-grained details and broader contextual information.
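A minimal sketch of this combined objective, assuming batched image/text embeddings and treating both terms as symmetric InfoNCE (CLIP-style) losses; the function names and the averaging over per-organ caption batches are illustrative, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    # Average the image-to-text and text-to-image cross-entropy terms.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def combined_loss(image_emb, report_emb, caption_embs):
    """L_final = L_CLIP (full report) + L_MGCL (organ-level captions).

    caption_embs: list of per-organ caption embedding batches; their
    contrastive losses are averaged to form the multi-granularity term.
    """
    l_clip = clip_loss(image_emb, report_emb)
    l_mgcl = torch.stack([clip_loss(image_emb, c) for c in caption_embs]).mean()
    return l_clip + l_mgcl
```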
Evaluations across nine public datasets for organ and tumor segmentation demonstrate OpenVocabCT's superior performance. On the TotalSegmentator dataset, OpenVocabCT achieves a mean Dice Similarity Coefficient (DSC) of 90.7%, outperforming both vision-only models (nnUNetv2 at 86.9% and UniMiSS at 88.0%) and text-driven models (CLIP-Driven at 84.6% and SAT-Pro at 87.6%). For tumor segmentation, OpenVocabCT achieves comparable performance to the best text-driven method and surpasses the best vision-only method by an average of 4.3% DSC. Importantly, the model exhibits remarkable generalizability to diverse and unseen text prompts, including combined organ descriptions (e.g., "left and right lung") and synonymous terms (e.g., "renal organs" for "kidney"). This robustness to varied clinical terminology highlights the model's potential for real-world clinical deployment.
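The Dice Similarity Coefficient used throughout these comparisons measures overlap between a predicted and a reference segmentation mask; a minimal sketch for binary masks:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice Similarity Coefficient (DSC) between two binary masks.

    DSC = 2|A ∩ B| / (|A| + |B|); 1.0 means perfect overlap.
    The eps term guards against division by zero for empty masks.
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```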
GeoLangBind: Unifying Earth Observation with Agglomerative Vision-Language Foundation Models by Zhitong Xiong, Yi Wang, Weikang Yu, Adam J Stewart, Jie Zhao, Nils Lehmann, Thomas Dujardin, Zhenghang Yuan, Pedram Ghamisi, Xiao Xiang Zhu https://arxiv.org/abs/2503.06312
Caption: This figure showcases the diverse Earth observation (EO) data modalities incorporated in the GeoLangBind-2M dataset, including Sentinel-2 multispectral, Sentinel-1 SAR, EnMAP hyperspectral, elevation maps, infrared, and aerial imagery. Each modality provides unique information about the landscape, ranging from land cover and vegetation types to terrain elevation and object detection, which GeoLangBind leverages to create a unified vision-language foundation model for EO. These diverse data sources are integrated using language as a unifying medium, enabling cross-modal analysis and a more comprehensive understanding of Earth systems.
Earth observation (EO) data, derived from a multitude of sensors with varying imaging principles, poses a significant challenge for creating unified analytical frameworks. GeoLangBind tackles this challenge by introducing an agglomerative vision-language foundation model that utilizes language as a unifying bridge between heterogeneous EO data modalities.
The core of GeoLangBind is the GeoLangBind-2M dataset, a massive collection of two million image-text pairs spanning six EO data modalities: RGB, SAR, multispectral, hyperspectral, infrared, and elevation. This diverse dataset allows the model to learn a shared language embedding space, enabling seamless integration and complementary feature learning across different sensor types. The model architecture incorporates a wavelength-aware dynamic encoder to handle the variable number of input channels across modalities, a Modality-aware Knowledge Agglomeration (MaKA) module for refined understanding, and a progressive weight-space merging strategy for efficient scaling. The MaKA module utilizes wavelength as a modality-specific condition when distilling features from teacher models (SigLIP, DINOv2, and ViT). The progressive merging strategy addresses data imbalance by initially training separate models on RGB and non-RGB subsets, then merging their weights with pre-trained SigLIP weights using a linear strategy: θ_merged = (1 - m₁)θ_SigLIP + m₁θ_RGB, followed by θ_final = (1 - m₂)θ_merged + m₂θ_others, where m₁ and m₂ are weighting ratios.
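The two-stage linear merge can be sketched directly over model state dicts; the staging below follows the formulas above, while the helper names and default ratios are illustrative assumptions:

```python
import torch

def linear_merge(theta_a, theta_b, m):
    """Linearly interpolate two state dicts: (1 - m) * theta_a + m * theta_b."""
    return {k: (1 - m) * theta_a[k] + m * theta_b[k] for k in theta_a}

def progressive_merge(theta_siglip, theta_rgb, theta_others, m1=0.5, m2=0.5):
    """Progressive weight-space merging (ratios m1, m2 are placeholders).

    Stage 1: merge the RGB-trained weights into the SigLIP initialization.
    Stage 2: merge the non-RGB ("others") weights into the stage-1 result.
    """
    theta_merged = linear_merge(theta_siglip, theta_rgb, m1)
    return linear_merge(theta_merged, theta_others, m2)
```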
Evaluations across 23 datasets demonstrate GeoLangBind's superior performance in zero-shot classification, semantic segmentation, and cross-modal image retrieval tasks. It achieves state-of-the-art results on several scene classification benchmarks, outperforming existing CLIP-based models. Furthermore, ablation studies confirm the effectiveness of the MaKA module and the progressive weight merging strategy, highlighting their contribution to the model's robust performance.
ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks by Yan Yang, Dongxu Li, Haoning Wu, Bei Chen, Liu Liu, Liyuan Pan, Junnan Li https://arxiv.org/abs/2503.06885
Caption: The ProBench framework evaluates Multimodal Large Language Models (MLLMs) using crowdsourced, professional-level queries, spanning diverse fields and languages. After filtering and categorizing the queries, various MLLMs generate responses which are then judged by another MLLM, with debiasing techniques ensuring fair and robust rankings on a leaderboard.
Evaluating the ability of Multimodal Large Language Models (MLLMs) to solve expert-level tasks is crucial for assessing their true potential. ProBench introduces a challenging benchmark comprising 4,000 open-ended queries sourced directly from professionals across 10 fields and 56 sub-fields, reflecting the demands of their daily work.
ProBench employs a novel MLLM-as-a-Judge evaluation methodology, leveraging the reasoning capabilities of advanced MLLMs like GPT-4 to assess the quality of responses from other models. To mitigate bias, a de-biasing technique based on the Bradley-Terry model is implemented. This model refines the Elo rating system by accounting for stylistic variations and presentation order biases, using the formula r_i^ref = C + K × β_i, where β = arg min ∑ ℓ_bce(β^T X^win + γ^T X^style, S_{i,j}). Here, β and γ represent model strength and style coefficients, respectively, X^win and X^style encode the matchup and the style features of the compared responses, and S_{i,j} is the comparison outcome.
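The style-controlled Bradley-Terry fit can be sketched as a small logistic regression on pairwise outcomes; the gradient-descent fit, signed matchup indicators, and default constants below are illustrative assumptions, not ProBench's exact implementation:

```python
import numpy as np

def fit_style_controlled_bt(X_model, X_style, outcomes, lr=0.1, steps=2000):
    """Fit strength (beta) and style (gamma) coefficients by minimizing
    binary cross-entropy over pairwise comparison outcomes.

    X_model: (n, M) signed indicators (+1 for model i, -1 for model j).
    X_style: (n, K) style-difference features between the two responses.
    outcomes: (n,) 1 if model i won the comparison, 0 otherwise.
    """
    n = X_model.shape[0]
    beta = np.zeros(X_model.shape[1])
    gamma = np.zeros(X_style.shape[1])
    for _ in range(steps):
        logits = X_model @ beta + X_style @ gamma
        p = 1.0 / (1.0 + np.exp(-logits))     # win probability
        grad = p - outcomes                    # BCE gradient w.r.t. logits
        beta -= lr * (X_model.T @ grad) / n
        gamma -= lr * (X_style.T @ grad) / n
    return beta, gamma

def elo_style_rating(beta, C=1000.0, K=400.0):
    """Map fitted strengths onto an Elo-style scale: r_i = C + K * beta_i."""
    return C + K * beta
```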
Evaluation of 24 leading MLLMs on ProBench reveals that while top open-source models compete with proprietary ones, significant challenges remain in visual perception, textual understanding, domain-specific knowledge, and advanced reasoning. The benchmark also demonstrates the robustness of the MLLM-as-a-Judge approach, showing high correlation with human expert evaluations. To address the cost of using powerful MLLMs for evaluation, a distilled version of Llama-vision is offered as a cost-effective local evaluator.
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation by Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, Li Yuan https://arxiv.org/abs/2503.07265
Caption: This image contrasts previous straightforward T2I benchmarks (a), which focused on basic image attributes like color and object position, with the new WISE benchmark (b), which evaluates a model's ability to apply world knowledge to prompts like "Einstein's favorite musical instrument." The WISE benchmark assesses reasoning capabilities by comparing model-generated images (violin - correct, piano - incorrect) against prompts requiring cultural, spatio-temporal, or natural science knowledge.
While Text-to-Image (T2I) models excel at generating visually appealing images, their ability to integrate and apply world knowledge remains a crucial area for improvement. WISE (World Knowledge-Informed Semantic Evaluation) addresses this gap by introducing a benchmark comprising 1,000 carefully crafted prompts across 25 subdomains within Cultural Common Sense, Spatio-Temporal Reasoning, and Natural Science. Accompanying this benchmark is WiScore, a novel evaluation metric that emphasizes accurate depiction of objects and entities within the generated image. WiScore is calculated as: WiScore = 0.7 × Consistency + 0.2 × Realism + 0.1 × Aesthetic Quality.
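The WiScore weighting is a fixed linear combination of the three sub-scores and can be computed directly:

```python
def wiscore(consistency, realism, aesthetic):
    """WiScore = 0.7 * Consistency + 0.2 * Realism + 0.1 * Aesthetic Quality."""
    return 0.7 * consistency + 0.2 * realism + 0.1 * aesthetic
```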
Evaluation of 20 T2I and unified multimodal models reveals limitations in their world knowledge integration. Surprisingly, unified multimodal models often underperform dedicated T2I models. Even after simplifying prompts using GPT-4, performance improvements remain limited, suggesting the need for improved training methodologies rather than just prompt engineering.
Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models by Boyu Jia, Junzhe Zhang, Huixuan Zhang, Xiaojun Wan https://arxiv.org/abs/2503.04801
Caption: This image illustrates the four novel evaluation tasks (Single-Image Recognition, Multi-Image Recognition, Multi-Image Retrieval, and Knowledge Association) used to assess the consistency of Multimodal Large Language Models (MLLMs) in knowledge reasoning. Each task presents a different challenge related to image and text understanding, requiring the MLLM to identify individuals, teams, and their relationships based on varying numbers of images and reasoning steps. These tasks reveal inconsistencies in MLLM reasoning, even when individual steps are correctly processed.
Multimodal Large Language Models (MLLMs) are making strides in understanding various data modalities. However, a key challenge lies in their ability to maintain consistency in multimodal knowledge reasoning. This study introduces four novel evaluation tasks and a corresponding dataset to explore this issue. Consistency is quantified by the Consistency Rate (CR) metric, CR = |{qₖ ∈ S : qₖ is correctly answered}| / |S|, where S is the set of samples in which every individual reasoning step and the textual reasoning chain are correct.
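Given per-sample correctness flags, the Consistency Rate can be sketched as follows (the field names are hypothetical stand-ins for the paper's per-step and final-answer annotations):

```python
def consistency_rate(samples):
    """Consistency Rate: fraction of S answered correctly, where S contains
    samples whose individual reasoning steps and textual chain are correct.

    samples: iterable of dicts with boolean fields 'steps_correct' and
    'final_correct' (hypothetical names for illustration).
    """
    S = [s for s in samples if s["steps_correct"]]
    if not S:
        return 0.0
    return sum(s["final_correct"] for s in S) / len(S)
```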
Experiments on prominent MLLMs demonstrate a concerning level of inconsistency, especially in complex tasks involving multiple images and reasoning steps. The study also highlights the effectiveness of visual consistency enhancement prompts, which prioritize visual feature extraction before textual reasoning, leading to improved performance.
This newsletter has showcased the latest advancements and ongoing challenges in multimodal image and text foundation models. From specialized models like OpenVocabCT for medical image segmentation to broader frameworks like GeoLangBind for Earth observation, the field is rapidly evolving. However, benchmarks like ProBench and WISE reveal persistent limitations in expert-level reasoning and world knowledge integration. The exploration of consistency in multimodal knowledge reasoning further emphasizes the need for more robust training methodologies and evaluation strategies to ensure the reliability and trustworthiness of future MLLMs. These developments underscore the exciting trajectory of multimodal AI and the continuous pursuit of more sophisticated and capable models.