Hi Elman,
This newsletter dives into the latest advancements in multimodal image and text foundation models. We'll explore novel approaches to improve image generation, enhance retrieval capabilities, address hallucinations in medical imaging, and boost interpretability. The research highlighted here pushes the boundaries of cross-cultural representation, tackles the challenges of limited datasets, and introduces innovative frameworks for evaluating and understanding these increasingly complex models. Let's get started!
Multi-Agent Multimodal Models for Multicultural Text to Image Generation by Parth Bhalerao, Mounika Yalamarty, Brian Trinh, Oana Ignat https://arxiv.org/abs/2502.15972
Caption: This diagram illustrates the MosAIG multi-agent interaction model for enhanced multicultural text-to-image generation. A moderator agent assigns tasks to social agents (age/gender, country, landmark) who converse to create detailed image captions, summarized for input to AltDiffusion/FLUX, ultimately generating a culturally nuanced image. This process aims to improve the representation of diverse demographics and landmarks in generated images.
The increasing diversity of our globalized world demands AI systems that reflect and respect this richness. Text-to-image generation models, often trained on Western-centric datasets, frequently fall short in depicting multicultural scenarios accurately. This research introduces MosAIG, a multi-agent framework designed to enhance the cultural sensitivity of these models. Accompanying this framework is a new dataset of 9,000 multicultural images, spanning diverse demographics (five countries, three age groups, two genders) and 25 historical landmarks.
MosAIG's innovative approach utilizes five interacting agents: a Moderator, three Social Agents (representing the person's culture, age/gender, and the landmark), and a Summarizer. The Moderator orchestrates the process, assigning tasks based on input demographics and landmark information. The Social Agents engage in a dynamic question-and-answer dialogue, refining their descriptions to incorporate cultural nuances and contextual details. Finally, the Summarizer synthesizes these descriptions into a comprehensive caption for image generation.
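The agent flow described above can be sketched roughly as follows. This is an illustration only, not the authors' implementation: `ask` is a hypothetical stand-in for any LLM call, and the function name, prompt wording, and number of dialogue rounds are all assumptions.

```python
from typing import Callable, Dict


def generate_caption(person: Dict[str, str], landmark: str,
                     ask: Callable[[str, str], str], rounds: int = 2) -> str:
    """Hypothetical moderator -> social agents -> summarizer flow."""
    # Moderator: derive one task per social agent from the input attributes.
    tasks = {
        "country": f"Describe culturally specific details for a person from {person['country']}.",
        "age_gender": f"Describe the appearance of a {person['age']} {person['gender']}.",
        "landmark": f"Describe the setting at {landmark}.",
    }
    # Social agents: write a first draft each (ask(role, prompt) -> text).
    notes = {role: ask(role, task) for role, task in tasks.items()}
    # Dialogue rounds: each agent refines its draft after seeing the others.
    for _ in range(rounds):
        for role in tasks:
            others = " ".join(text for r, text in notes.items() if r != role)
            notes[role] = ask(role, f"Refine your description given: {others}")
    # Summarizer: merge the refined notes into one caption for the image model.
    return ask("summarizer", "Summarize into one caption: " + " ".join(notes.values()))
```

The returned caption would then be passed to the text-to-image model; that step is omitted here.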
Evaluation involved automated metrics and human assessment. Automated metrics included CLIPScore (text-image alignment), Inception Score (image quality), a SigLIP-based aesthetic predictor, and metrics for Fairness and Knowledge. Results revealed that MosAIG significantly outperformed simpler, non-agent models in Alignment, Aesthetics, Quality, and Knowledge. For instance, MosAIG achieved substantially higher Inception Scores (0.77 vs. 0.48 for Alt-En and 0.65 vs. 0.45 for Flux-En).
However, a trade-off emerged: MosAIG scored lower in Fairness. This is attributed to the richer, more detailed captions, which amplified differences when demographic attributes were altered. This difference is quantified by $\Delta S = |S(c, I) - S(c', I')|$, where $S$ is CLIPScore, $c$ and $I$ are the original caption and image, and $c'$ and $I'$ are the caption and image after a demographic attribute is altered. Human evaluation corroborated these findings, noting improvements in landmark and person rendering while also revealing persistent challenges in accurately depicting body structures and backgrounds. This research underscores the potential of multi-agent models but also highlights the need for future work to balance expressiveness and demographic consistency.
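As a small worked sketch, the score difference above can be computed from precomputed text and image embeddings. The CLIPScore form used here, w * max(cos, 0) with w = 2.5, follows the commonly cited definition and is an assumption about what the paper used; the function names are hypothetical and the embeddings are plain lists standing in for real CLIP outputs.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def clip_score(text_emb: Sequence[float], image_emb: Sequence[float],
               w: float = 2.5) -> float:
    # Assumed CLIPScore form: rescaled cosine similarity, clipped at zero.
    return w * max(cosine(text_emb, image_emb), 0.0)


def score_gap(c, i, c_alt, i_alt) -> float:
    """|S(c, I) - S(c', I')|: a larger gap means less consistent alignment
    between the original pair and the attribute-altered pair."""
    return abs(clip_score(c, i) - clip_score(c_alt, i_alt))
```

A gap of zero means the original and altered caption-image pairs are aligned equally well.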
Reducing Hallucinations of Medical Multimodal Large Language Models with Visual Retrieval-Augmented Generation by Yun-Wei Chu, Kai Zhang, Christopher Malon, Martin Renqiang Min https://arxiv.org/abs/2502.15040
Caption: This diagram illustrates the Visual Retrieval-Augmented Generation (V-RAG) framework for medical imaging. It shows how V-RAG combines a medical report, an X-ray image, and retrieved similar image/text pairs to answer medical questions posed to a Med-MLLM, comparing the model's predictions to ground truth answers for evaluation. This approach aims to reduce hallucinations in Med-MLLMs by providing richer visual and textual context.
Hallucinations in Multimodal Large Language Models (MLLMs) are a critical concern, especially in healthcare where accuracy is paramount. This research introduces Visual Retrieval-Augmented Generation (V-RAG), a framework designed to mitigate these hallucinations by grounding the MLLM's responses in retrieved visual and textual data from similar images.
The effectiveness of V-RAG was tested on the MIMIC-CXR (chest X-rays) and MultiCaRe datasets using entity probing. This involved posing yes/no questions about the presence of specific medical entities and comparing the MLLM's answers to ground truth. V-RAG significantly outperformed text-only RAG baselines, highlighting the value of incorporating visual context. Furthermore, fine-tuning the MLLM with tasks designed to enhance image-text association further improved performance.
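Entity probing as described above amounts to a loop of yes/no questions scored against ground truth. The sketch below is a simplified illustration: `answer` is a hypothetical callable wrapping the model (with or without retrieved context), the question template is an assumption, and plain accuracy is used for brevity rather than whichever metric the paper reports.

```python
from typing import Callable, Dict


def entity_probe_accuracy(entities: Dict[str, bool],
                          answer: Callable[[str], str]) -> float:
    """Ask one yes/no question per entity and compare with ground truth.

    `entities` maps an entity name to whether it is truly present.
    """
    correct = 0
    for entity, present in entities.items():
        reply = answer(f"Does the image show {entity}? Answer yes or no.")
        # Treat any reply that starts with "yes" as a positive prediction.
        predicted = reply.strip().lower().startswith("yes")
        correct += predicted == present
    return correct / len(entities)
```

A "yes" for an entity that is absent from the ground truth corresponds to a hallucinated finding.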
Recognizing that some MLLMs are trained only on single images, the researchers developed specific fine-tuning tasks. These tasks focused on image-text awareness (matching text to the correct image), image focus (generating text based on a specific image), and learning from extracted similar information (using information from multiple retrieved images and reports). Applying these tasks to a single-image-trained MLLM (LLaVA) significantly improved its performance in V-RAG.
Finally, V-RAG was applied to chest X-ray report generation. Entity probing with V-RAG identified potential hallucinations in initial reports, which were then revised using a text-only LLM. This strategy led to a substantial improvement in RadGraph-F1 scores, demonstrating V-RAG's practical utility in enhancing the quality of generated medical reports. This research provides a promising direction for developing more reliable and trustworthy AI systems for healthcare.
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval by Guanqi Zhan, Yuanpei Liu, Kai Han, Weidi Xie, Andrew Zisserman https://arxiv.org/abs/2502.15682
Caption: The architecture of the Enhanced Language-Image Pre-training (ELIP) framework is shown, which enhances text-to-image retrieval by generating text-guided visual prompt vectors (v) from the text query (T) and the CLS token (tCLS). These vectors condition the ViT image encoder processing the input image (I), leading to a query-aware image embedding used for improved ranking. The framework incorporates learned queries within a self-attention and cross-attention mechanism, along with feed-forward networks and a Q-Former, to generate a refined image representation for retrieval.
This research introduces Enhanced Language-Image Pre-training (ELIP), a framework designed to enhance text-to-image retrieval by boosting the performance of large-scale pre-trained vision-language models during the re-ranking stage. ELIP achieves this by using the text query to generate visual prompt vectors that condition the Vision Transformer (ViT) image encoder. This text-guided prompting mechanism makes the image embedding query-aware, leading to more accurate ranking.
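The two-stage idea of ranking first and then re-ranking with query-aware embeddings can be sketched as below. This is not ELIP's code: `conditioned_encode` is a hypothetical stand-in for an image encoder conditioned on the text query, the gallery holds precomputed query-independent embeddings, and `k` is an assumed re-ranking depth.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve_and_rerank(query: Vector, gallery: Dict[str, Vector],
                        conditioned_encode: Callable[[str, Vector], Vector],
                        k: int = 2) -> List[str]:
    # Stage 1: rank the whole gallery using query-independent image embeddings.
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]),
                    reverse=True)
    head, tail = ranked[:k], ranked[k:]
    # Stage 2: re-encode only the top-k images conditioned on the query,
    # then re-sort that shortlist by the new similarity.
    head.sort(key=lambda name: cosine(query, conditioned_encode(name, query)),
              reverse=True)
    return head + tail
```

Restricting the conditioned encoder to the top-k shortlist keeps the expensive query-dependent pass off the rest of the gallery.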
Training large vision-language models requires substantial computational resources. The authors address this challenge with a 'student-friendly' approach. This involves a global hard sample mining strategy, where training batches are constructed by grouping semantically similar image-text pairs. Additionally, a data curation technique based on learnability is employed, further optimizing training with limited resources.
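One simple way to group semantically similar pairs into the same batch is a greedy nearest-neighbour pass, sketched below. The greedy strategy, the function name, and the use of a single embedding per image-text pair are assumptions made for illustration; the paper's actual mining procedure may differ.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_hard_batches(embeddings: Dict[str, Vector],
                       batch_size: int) -> List[List[str]]:
    """Greedily fill each batch with the pairs most similar to a seed pair."""
    remaining = list(embeddings)
    batches: List[List[str]] = []
    while remaining:
        seed = remaining.pop(0)
        # Order the leftover pairs by similarity to the seed, closest first.
        remaining.sort(key=lambda name: cosine(embeddings[seed], embeddings[name]),
                       reverse=True)
        batches.append([seed] + remaining[:batch_size - 1])
        remaining = remaining[batch_size - 1:]
    return batches
```

Because every batch holds near-duplicates in meaning, the in-batch negatives for a contrastive loss are harder than those in a randomly sampled batch.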
ELIP's performance was evaluated on standard benchmarks (COCO and Flickr30k) and two new out-of-distribution (OOD) benchmarks: Occluded COCO and ImageNet-R. These OOD benchmarks assess generalization capabilities in challenging scenarios. ELIP consistently improved the performance of CLIP and SigLIP, and achieved state-of-the-art results when applied to the BLIP-2 backbone. Qualitative analysis and visualization of attention maps further demonstrated ELIP's effectiveness in aligning image and text embeddings. This research provides a valuable tool for enhancing text-to-image retrieval performance with limited computational resources.
M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment by Chuan Cui, Kejiang Chen, Zhihua Wei, Wen Shen, Weiming Zhang, Nenghai Yu https://arxiv.org/abs/2502.15167
Caption: M3-AGIQA, a novel framework for assessing AI-generated image quality, leverages a multi-round conversational approach with a fine-tuned MLLM to analyze aspects like quality, prompt alignment, and authenticity. This three-stage process involves initial description generation with a large MLLM, inference with a smaller fine-tuned MLLM, and training with a regression head to predict Mean Opinion Scores (MOS), ultimately outperforming existing methods on benchmark datasets.
Evaluating AI-generated images (AGIs) requires considering not only perceptual quality but also prompt correspondence and authenticity. M3-AGIQA, a Multimodal, Multi-Round, and Multi-Aspect framework, addresses this challenge by leveraging Multimodal Large Language Models (MLLMs).
M3-AGIQA operates in three stages. First, it distills the advanced captioning capabilities of a powerful online MLLM into a smaller, local MLLM using Low-Rank Adaptation (LoRA). This is achieved by prompting the online MLLM to generate aspect-specific descriptions (quality, correspondence, authenticity) and then fine-tuning the local MLLM on this data using a structured conversational format. Second, during inference, the fine-tuned MLLM uses zero-shot Chain of Thought (CoT) reasoning to refine its quality assessment. Finally, an xLSTM and a regression head process the MLLM's output to predict the Mean Opinion Score (MOS), represented by the formula $\hat{y} = f(i, p)$, where $\hat{y}$ is the predicted MOS, $i$ is the AGI, and $p$ is the text prompt.
Evaluated on AGIQA-3k, AIGCIQA2023, and AIGCIQA-20k datasets, M3-AGIQA outperforms existing methods. Ablation studies confirm the importance of each component, particularly the distilled image descriptions and the use of the MLLM as an encoder. Cross-dataset validation demonstrates strong generalizability, although some limitations were observed with larger datasets. While acknowledging limitations related to computational resources and potential ethical concerns, M3-AGIQA offers a promising direction for evaluating AGIs.
Audio Visual Segmentation Through Text Embeddings by Kyungbok Lee, You Zhang, Zhiyao Duan https://arxiv.org/abs/2502.16359
Caption: The AV2T-SAM framework enhances audio-visual segmentation by projecting combined audio and visual features ($f_{\mathrm{CLIP} \cap \mathrm{CLAP}}$) into the text embedding space of a pre-trained text-prompted SAM. This architecture, incorporating adaptable SAM decoder blocks and a multimodal encoder, leverages the semantic richness of large text-image datasets to improve performance on tasks like identifying sounding objects in video. The CLIP, CLAP, and SAM encoders are frozen; only the projectors and adapters are trainable.
Audio-visual segmentation (AVS) aims to segment sounding objects in videos, a task hampered by limited datasets. AV2T-SAM (Audio-Visual to Text SAM) addresses this challenge by connecting audio features to the text embedding space of pre-trained text-prompted SAM models.
AV2T-SAM projects audio features into the text embedding space, enabling it to leverage the knowledge of pre-trained text-prompted SAM. This is achieved using a novel feature, $f_{\mathrm{CLIP} \cap \mathrm{CLAP}}$, calculated as the element-wise multiplication of CLIP and CLAP embeddings: $f_{\mathrm{CLIP} \cap \mathrm{CLAP}} = f_{\mathrm{CLIP}} \odot f_{\mathrm{CLAP}}$. This captures shared semantics between audio and visual modalities while filtering out irrelevant noise. Adapters in the SAM decoder further facilitate the fusion of projected audio-visual information with image features. The training objective combines Binary Cross-Entropy (BCE) and Intersection over Union (IoU) loss: $L_{\mathrm{total}} = L_{\mathrm{BCE}} + L_{\mathrm{IoU}}$.
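The feature fusion and the combined loss above can be illustrated in a few lines. This is a sketch under stated assumptions: embeddings and masks are flat Python lists, the IoU term is a soft (differentiable-style) IoU, and the epsilon values are arbitrary; the paper's exact loss formulation may differ.

```python
import math
from typing import List, Sequence


def fuse(f_clip: Sequence[float], f_clap: Sequence[float]) -> List[float]:
    """Element-wise product of same-length CLIP and CLAP embeddings."""
    return [a * b for a, b in zip(f_clip, f_clap)]


def bce_loss(pred: Sequence[float], target: Sequence[float],
             eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over predicted mask probabilities."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def iou_loss(pred: Sequence[float], target: Sequence[float],
             eps: float = 1e-7) -> float:
    """1 - soft IoU, with products standing in for set intersection."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return 1.0 - (inter + eps) / (union + eps)


def total_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    # Unweighted sum of the BCE and IoU terms.
    return bce_loss(pred, target) + iou_loss(pred, target)
```

The product in `fuse` keeps only dimensions where both embeddings are active, which is one way to read the "shared semantics" claim.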
Evaluated on AVSBench, AV2T-SAM achieves state-of-the-art performance on both the Single Sound Source (S4) and Multi Sound Source (MS3) subsets. Notably, the authors identified a vision bias in S4, achieving state-of-the-art results using only visual features, highlighting the need for more robust datasets. This work represents a significant advancement in AVS by leveraging text embeddings and pre-trained SAM to overcome data limitations.
Chitrarth: Bridging Vision and Language for a Billion People by Shaharukh Khan, Ayush Tarun, Abhinav Ravi, Ali Faraz, Akshat Patidar, Praveen Kumar Pokala, Anagha Bhangare, Raja Kolla, Chandra Khatri, Shubham Agarwal https://arxiv.org/abs/2502.15392
Caption: Chitrarth's two-stage training process: Stage 1 aligns visual and textual features using translated image-text pairs, while Stage 2 refines the model through instruction tuning on a diverse multilingual dataset incorporating translated and culturally relevant data. This approach enables Chitrarth to effectively handle complex visual reasoning and generate descriptions in multiple Indian languages.
Existing multimodal models often lack support for languages beyond English and high-resource European languages. Chitrarth, meaning "Image Meaning," aims to address this gap by focusing on 10 prominent Indian languages.
Chitrarth integrates a state-of-the-art multilingual LLM (Krutrim) with a vision module. Training occurs in two stages: feature alignment using translated image-text pairs, followed by instruction tuning on a diverse multilingual dataset reflecting Indian cultural diversity. Along with Chitrarth, the researchers introduce BharatBench, a comprehensive evaluation benchmark for 10 under-resourced Indic languages.
Chitrarth achieves state-of-the-art results on several English academic datasets and sets new benchmarks for the multilingual datasets in BharatBench. It exhibits strong performance across various tasks, including creative writing, attribute extraction, and anomaly detection, in multiple Indian languages. Qualitative analysis suggests a better understanding of images with Indian cultural context compared to models like GPT-4. This work pushes the boundaries of multilingual multimodal capabilities, offering improvements over existing models and establishing a foundation for future advancements.
This newsletter showcased a diverse array of advancements in multimodal image and text foundation models. From mitigating hallucinations in medical imaging and improving cross-cultural representation in image generation to enhancing retrieval capabilities and developing new evaluation frameworks, these research efforts demonstrate the ongoing evolution and expanding applications of multimodal AI. The approaches presented, such as multi-agent interactions, visual retrieval augmentation, and text-guided visual prompting, offer promising directions for future research and development. The emphasis on inclusivity, as seen with Chitrarth's focus on Indian languages, highlights the growing importance of serving a diverse global community. Several papers also point to the limitations of existing datasets and evaluation metrics, and addressing them is crucial for further progress.