This newsletter explores the cutting edge of multimodal image and text foundation models, covering exciting new developments in text generation from EEG signals, debiasing techniques, specialized agricultural models, and vision-centric benchmarks. These advancements represent significant steps toward more robust, fair, and specialized applications of these powerful models.
Thought2Text: Text Generation from EEG Signal using Large Language Models (LLMs) by Abhijit Mishra, Shreya Shukla, Jose Torres, Jacek Gwizdka, Shounak Roychowdhury https://arxiv.org/abs/2410.07507
Decoding thoughts into text has long been a sci-fi dream, but Thought2Text brings us closer to reality. This groundbreaking system uses the power of Large Language Models (LLMs) to translate EEG signals into coherent text, offering a potential paradigm shift in Brain-Computer Interface (BCI) technology. This opens doors to more accessible and portable "thoughts-to-text" applications.
The system's multimodal approach uses visual stimuli to evoke EEG signals, training LLMs to interpret these signals and generate corresponding text descriptions. This three-stage process begins with training a multichannel EEG encoder, inspired by ChannelNet, to extract meaningful embeddings from raw EEG signals. This encoder utilizes a combined loss function: Mean Squared Error (MSE) loss for aligning EEG embeddings with image embeddings from a pre-trained CLIP model, and cross-entropy loss for accurate object label prediction.
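To make the two-part objective concrete, here is a minimal PyTorch sketch of the Stage 1 loss, assuming hypothetical tensors for the encoder output, the frozen CLIP image embeddings, and the object-label logits; the actual ChannelNet-inspired encoder architecture and loss weighting follow the paper.

```python
import torch.nn.functional as F

def stage1_loss(eeg_emb, clip_img_emb, label_logits, labels, alpha=1.0):
    """Combined Stage-1 objective (sketch): align EEG embeddings with frozen
    CLIP image embeddings via MSE, and predict the object class via
    cross-entropy. The weighting factor `alpha` is an assumption, not the
    paper's exact setting."""
    align_loss = F.mse_loss(eeg_emb, clip_img_emb)    # EEG-to-CLIP embedding alignment
    cls_loss = F.cross_entropy(label_logits, labels)  # object label prediction
    return align_loss + alpha * cls_loss
```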
Next, LLMs are primed with image embeddings and corresponding text descriptions, training a projector to map the visual embeddings into the LLM's token embedding space. Finally, this projector is refined using EEG embeddings, enabling the LLM to directly generate text from EEG signals during inference.
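The projector can be pictured as a small network that turns one CLIP-sized vector into a few soft-prompt tokens in the LLM's embedding space, as in the illustrative sketch below; the dimensions, depth, and number of prompt tokens are assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

class EmbeddingProjector(nn.Module):
    """Sketch of the projector that maps a CLIP-sized embedding (image in
    Stage 2, EEG in Stage 3) into the LLM's token-embedding space. The
    projected vectors are prepended to the text token embeddings as a
    soft prompt."""
    def __init__(self, in_dim=512, llm_dim=4096, n_tokens=4):
        super().__init__()
        self.n_tokens = n_tokens
        self.proj = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim * n_tokens),
        )

    def forward(self, emb):                               # emb: (batch, in_dim)
        out = self.proj(emb)                              # (batch, llm_dim * n_tokens)
        return out.view(emb.size(0), self.n_tokens, -1)   # (batch, n_tokens, llm_dim)
```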
Evaluations on the CVPR2017 dataset, containing EEG data from six participants viewing various images, showed promising results. Fine-tuned instruction-based LLMs, including LLaMA-3-8B, Mistral-7B-v0.3, and Qwen2.5-7B, were tested. Using standard NLG metrics (BLEU, METEOR, ROUGE, and BERTScore) and GPT-4-based assessments, the complete Thought2Text pipeline significantly outperformed chance-based baselines and ablations that omit the crucial EEG-image embedding alignment. LLaMA3-8B_ALL achieved a BLEU-1 score of 25.5%, Mistral-7B_ALL scored 26%, and Qwen2.5-7B_ALL reached 22.7%.
Surprisingly, using only EEG embeddings during inference performed comparably to using both EEG embeddings and object labels, suggesting the robustness of EEG embeddings. Subject-wise analysis further highlighted the consistent improvements across individuals. While the results are encouraging, the research acknowledges the inherent noise and variability of EEG data and potential object misclassification. Future work will address these limitations, focusing on optimizing model architectures, improving EEG-text alignment, and exploring practical applications while considering the ethical implications.
A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks by Hoin Jung, Taeuk Jang, Xiaoqian Wang https://arxiv.org/abs/2410.07593
While Vision-Language Models (VLMs) excel at multimodal tasks, they often inherit and amplify societal biases. This paper introduces Selective Feature Imputation for Debiasing (SFID), a novel and efficient method that mitigates these biases without retraining. SFID combines feature pruning with Low Confidence Imputation (LCI) to address the limitations of existing debiasing techniques, which are often modality- or task-specific and require costly retraining.
SFID leverages the feature-importance scores of a RandomForest classifier to identify bias-related features within the frozen representations of VLMs. By ranking how strongly each feature predicts sensitive attributes like gender, SFID pinpoints and prunes the most influential dimensions contributing to bias. LCI then replaces these pruned features with the average value of the corresponding features from low-confidence samples in the validation set, i.e., ambiguous instances identified by the RandomForest that are less likely to exhibit strong bias. This imputation preserves semantic integrity and dimensionality while neutralizing bias.
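A rough sketch of this procedure on frozen VLM features, using scikit-learn's RandomForestClassifier, might look like the following; the number of pruned dimensions and the confidence threshold are illustrative placeholders rather than values from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def sfid_debias(train_feats, train_attr, val_feats, test_feats, k=20, conf_thresh=0.6):
    """Sketch of Selective Feature Imputation for Debiasing (SFID) on
    frozen VLM features (NumPy arrays). Ranks dimensions by how well they
    predict a sensitive attribute, prunes the top-k, and imputes them with
    the mean of those dimensions over low-confidence validation samples."""
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(train_feats, train_attr)

    biased_dims = np.argsort(rf.feature_importances_)[-k:]   # most attribute-predictive dims

    val_conf = rf.predict_proba(val_feats).max(axis=1)       # per-sample prediction confidence
    low_conf = val_feats[val_conf < conf_thresh]              # ambiguous validation samples
    if low_conf.shape[0] == 0:                                # fallback if none are low-confidence
        low_conf = val_feats
    fill_values = low_conf[:, biased_dims].mean(axis=0)       # imputation values

    debiased = test_feats.copy()
    debiased[:, biased_dims] = fill_values                    # prune + impute
    return debiased
```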
The paper demonstrates SFID's effectiveness across a range of VLM tasks. In zero-shot classification, SFID reduced the Average Demographic Disparity (ADP) while maintaining accuracy. In text-to-image retrieval, it decreased Skew@100, promoting a more balanced gender distribution without sacrificing recall. In image captioning, it reduced gender misclassification in generated captions while maintaining caption quality (measured by METEOR and SPICE). In text-to-image generation, it reduced gender-specific prompt mismatches and improved the Skew metric for neutral prompts.
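For readers unfamiliar with the retrieval metric, a common formulation of Skew@K from the fairness-aware ranking literature is sketched below; the paper's exact variant may differ.

```python
import math

def skew_at_k(retrieved_attrs, attr_value, k=100, desired_prop=0.5, eps=1e-6):
    """Log-ratio of the observed proportion of a sensitive-attribute value
    (e.g., one gender) in the top-K retrieved items to a desired proportion
    (0.5 for gender balance). Values near 0 indicate a balanced ranking."""
    top_k = retrieved_attrs[:k]
    observed = sum(1 for a in top_k if a == attr_value) / max(len(top_k), 1)
    return math.log((observed + eps) / (desired_prop + eps))
```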
Caption: This diagram illustrates the Selective Feature Imputation for Debiasing (SFID) method. It shows how SFID identifies and prunes bias-amplifying features from both the image and text encoders of a Vision-Language Model (VLM), then imputes them with average feature values from low-confidence samples (represented by the lighter green and beige bars) to mitigate bias without retraining. This process is shown for both image and text modalities, demonstrating SFID's cross-modal applicability.
Consistently outperforming existing methods like DeAR and CLIP-clip, SFID offers a practical and efficient debiasing solution due to its no-retraining approach and simple implementation. Its versatility and cross-modal applicability make it a significant advancement in promoting fairness and reliability in VLMs.
AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning by Muhammad Awais, Ali Husain Salem Abdulla Alharthi, Amandeep Kumar, Hisham Cholakkal, Rao Muhammad Anwer https://arxiv.org/abs/2410.08405
Large Multimodal Conversational Models (LMMs) often struggle with specialized domains like agriculture due to the lack of domain-specific data. AgroGPT tackles this challenge by leveraging vision-only agricultural datasets to create expert-level instruction-tuning data, eliminating the need for pre-existing image-text data.
The key innovation is AgroInstruct, a pipeline that starts from vision-only datasets of plant diseases, weeds, insects, and fruits. General-purpose LLMs generate image descriptions, extract attributes, and incorporate class-specific knowledge from agricultural sources. This information is assembled into instruction-following examples and system prompts that are fed to language-only LLMs, which generate contextually relevant conversations, yielding a 70k-example expert-tuning dataset.
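Conceptually, each vision-only sample passes through a generation step along these lines; the callables and prompts below are hypothetical stand-ins for the general-purpose multimodal and language-only LLMs used in the paper, not its actual implementation.

```python
def build_expert_example(image, class_label, attributes, domain_notes, describe_fn, chat_fn):
    """Sketch of one AgroInstruct-style generation step. `describe_fn` stands
    in for a general-purpose multimodal model that captions the image, and
    `chat_fn` for a language-only LLM that turns the assembled context into
    an instruction-following conversation; both are hypothetical."""
    description = describe_fn(image)                 # e.g., caption of a diseased-leaf image
    context = (
        f"Image description: {description}\n"
        f"Class label: {class_label}\n"
        f"Attributes: {', '.join(attributes)}\n"
        f"Domain knowledge: {domain_notes}"
    )
    system_prompt = (
        "You are an agronomy expert. Using the context, write a multi-turn "
        "conversation in which a farmer asks about the problem shown in the "
        "image and the assistant diagnoses it and suggests treatment."
    )
    return chat_fn(system_prompt, context)           # one expert-tuning conversation
```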
This dataset is used to train AgroGPT-3B and AgroGPT-7B, efficient conversational models specialized for agriculture through a two-step visual instruction and expert tuning process. AgroEvals, a set of visual question-answering (VQA) tasks focusing on agricultural issue identification, general categorization, and fine-grained concept recognition, was developed for evaluation.
AgroGPT significantly outperformed baselines and ChatGPT in fine-grained concept identification, achieving 51.37% accuracy in disease identification compared to ChatGPT's 30.82%. It also excelled in multi-turn, complex conversations, providing more insightful guidance than general-purpose models. Expert evaluations further validated AgroGPT's detailed and relevant answers.
AgroGPT and the AgroInstruct pipeline represent a significant advancement in applying LMMs to specialized fields. This work opens doors for similar advancements in other domains where joint image-text data is limited.
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models by Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou, Mohsen Fayyaz, Pan Lu, Kai-Wei Chang, Nanyun Peng https://arxiv.org/abs/2410.08182
Existing multimodal retrieval benchmarks often prioritize text retrieval, overlooking the importance of visual information. MRAG-BENCH addresses this gap by providing a vision-centric evaluation for retrieval-augmented Large Vision-Language Models (LVLMs). It focuses on scenarios where visual knowledge is more crucial than textual information.
Comprising 16,130 images and 1,353 multiple-choice questions across nine scenarios (categorized by perspective and transformative aspects, plus "others" for geographic knowledge), MRAG-BENCH includes a ground-truth image corpus for evaluation. This framework systematically assesses how LVLMs utilize retrieved visual information compared to text.
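In spirit, the evaluation loop pairs each question's image with its nearest visual neighbors from the corpus before querying the LVLM; the sketch below uses hypothetical embedder and model callables and is not the benchmark's actual harness.

```python
import numpy as np

def evaluate_vision_rag(questions, corpus_embs, corpus_images, embed_fn, lvlm_fn, k=5):
    """Sketch of a vision-centric retrieval-augmented evaluation loop.
    `embed_fn` is a hypothetical image embedder (e.g., a CLIP vision
    encoder producing unit-norm vectors) and `lvlm_fn` a hypothetical LVLM
    call that returns one of the multiple-choice options."""
    correct = 0
    for q in questions:  # each q: {"image", "question", "choices", "answer"}
        q_emb = embed_fn(q["image"])
        sims = corpus_embs @ q_emb                   # cosine similarity for unit-norm embeddings
        top_idx = np.argsort(-sims)[:k]              # top-k visually similar corpus images
        retrieved = [corpus_images[i] for i in top_idx]
        pred = lvlm_fn(q["image"], retrieved, q["question"], q["choices"])
        correct += int(pred == q["answer"])
    return correct / len(questions)
```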
Evaluation of 14 LVLMs (four proprietary, ten open-source) confirmed MRAG-BENCH's vision-centric nature, with all models performing better when augmented with retrieved images. GPT-4o achieved the highest accuracy (74.5% with ground-truth retrieval), significantly outperforming open-source models. Even so, GPT-4o showed only limited gains from ground-truth images relative to human performance, highlighting the challenge of effectively using retrieved visual knowledge.
Caption: This image contrasts previous multimodal models' text-centric approach with the new vision-centric MRAG-BENCH. Previous models relied on text retrieval, while MRAG-BENCH leverages image retrieval to augment large vision-language models (LVLMs), enabling them to better answer questions like "What car model is this?" by utilizing visual knowledge. This shift allows LVLMs to handle scenarios where images provide more information than text, as demonstrated by the improved performance when using image retrieval.
Further analysis revealed open-source models' struggles with noisy retrieved images, contrasting with proprietary models' robustness. This suggests a gap in discerning high-quality visual information. The performance correlation with retriever quality emphasizes the importance of robust multimodal retrieval.
MRAG-BENCH provides a valuable tool for evaluating and improving LVLMs' visual reasoning capabilities, highlighting the need for further research in handling noisy images and leveraging visual information in multimodal generation.
This newsletter has showcased a range of exciting developments in multimodal image and text foundation models. From generating text from brainwaves with Thought2Text to debiasing models with SFID, creating specialized models like AgroGPT, and evaluating them with vision-centric benchmarks like MRAG-BENCH, the field is rapidly advancing. These advancements promise to unlock new possibilities in various applications, from healthcare and assistive technologies to agriculture and beyond. However, challenges remain, particularly in robustly handling noisy data, effectively leveraging retrieved information, and addressing ethical implications. Continued research and development in these areas are crucial for realizing the full potential of these powerful models and ensuring their responsible deployment.