This newsletter explores the latest advancements in multimodal foundation models, focusing on their transformative impact on image and text understanding. From revolutionizing affective computing to enhancing medical image retrieval, these models are rapidly reshaping the landscape of AI.
Affective Computing Has Changed: The Foundation Model Disruption by Björn Schuller, Adria Mallol-Ragolta, Alejandro Peña Almansa, Iosif Tsangko, Mostafa M. Amin, Anastasia Semertzidou, Lukas Christ, Shahin Amiriparian https://arxiv.org/abs/2409.08907
Caption: This image illustrates the use of foundation models for generating and classifying emotions in text and images. Neutral sentences are transformed into emotionally charged ones using large language models like LLaMA, Mistral, and Mixtral. These outputs, along with sentences expressing surprise, are then classified by RoBERTa, GPT-3.5, and GPT-4.
The world of Affective Computing is undergoing a seismic shift, thanks to the emergence of Foundation Models (FMs). These models, trained on massive datasets and boasting billions of parameters, are revolutionizing how we generate and analyze affective data across various modalities – vision, linguistics, and to a lesser extent, speech.
One of the most striking capabilities of FMs is their ability to generate synthetic affective data. The authors demonstrate this by using Stable Diffusion XL to create a dataset of facial images conveying various emotions, styles, and demographic features. Interestingly, these synthetic images proved to be quite convincing when analyzed by a pre-trained Vision Transformer for Facial Expression Recognition (ViT-FER), achieving an accuracy of 35-57.5% across different generation styles.
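To make this generate-then-verify loop concrete, here is a minimal sketch using the Hugging Face diffusers and transformers libraries. The prompt wording and the facial-expression checkpoint are illustrative assumptions, not the authors' exact setup.

```python
# Sketch: generate a synthetic emotional face with Stable Diffusion XL, then
# classify it with a pre-trained ViT expression model. The prompt and the
# classifier checkpoint are illustrative placeholders.
import torch
from diffusers import StableDiffusionXLPipeline
from transformers import pipeline

sdxl = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "photorealistic portrait of a middle-aged woman expressing surprise"
image = sdxl(prompt=prompt, num_inference_steps=30).images[0]

# Any ViT fine-tuned for facial expression recognition can serve as the judge.
fer = pipeline("image-classification", model="trpakov/vit-face-expression")
print(fer(image))  # e.g. [{'label': 'surprise', 'score': 0.81}, ...]
```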
Similarly, in the linguistic domain, the authors explored the affective style transfer capabilities of LLMs such as LLaMA2, Mistral, and Mixtral. They tasked these models with injecting specific emotions into neutral phrases, creating a corpus of emotionally charged sentences. Evaluation with RoBERTa and GPT models revealed that while these LLMs tend to exaggerate the injected emotions, they show a clear ability to understand and express emotions in text. In a zero-shot emotion recognition task on the GoEmotions dataset, the LLMs scored well above the chance level of 14.3%, with the best model reaching a UAR of 48.59%.
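For readers curious how such a zero-shot evaluation might look in practice, the sketch below pairs a simple classification prompt with a UAR (unweighted average recall) score. The prompt template and the seven-emotion label set are assumptions; the 14.3% chance level simply reflects having seven classes.

```python
# Sketch: zero-shot emotion labelling with an instruction-tuned LLM, scored by
# unweighted average recall (UAR). The prompt template and label set are
# assumptions; the paper's exact wording may differ.
from sklearn.metrics import recall_score

EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]

def build_prompt(sentence: str) -> str:
    return (
        "Classify the emotion of the following sentence. "
        f"Answer with exactly one of: {', '.join(EMOTIONS)}.\n"
        f"Sentence: {sentence}\nEmotion:"
    )

def uar(y_true, y_pred) -> float:
    # UAR is simply recall averaged over classes (macro-averaged recall).
    return recall_score(y_true, y_pred, average="macro")

# Predictions would come from one LLM call per sentence using build_prompt().
y_true = ["joy", "anger", "surprise", "neutral"]
y_pred = ["joy", "anger", "joy", "neutral"]
print(f"UAR: {uar(y_true, y_pred):.2%}")  # 75.00% on this toy example
```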
However, the speech modality seems to be lagging behind in this FM revolution. While models like UniAudio and PromptTTS2 show promise for adaptation to emotional speech synthesis, they haven't yet demonstrated emergent capabilities in this area. The authors remain optimistic, predicting that future multimodal FMs, enriched with affective speech data, will eventually master the art of generating and analyzing emotions in spoken language. This rapidly evolving landscape necessitates a shift in evaluation paradigms for Affective Computing, moving away from potentially biased "Internet-sourced" data towards more robust and generalizable benchmarks.
Evaluating Pre-trained Convolutional Neural Networks and Foundation Models as Feature Extractors for Content-based Medical Image Retrieval by Amirreza Mahbod, Nematollah Saeidi, Sepideh Hatamikia, Ramona Woitek https://arxiv.org/abs/2409.09430
Caption: This bar graph showcases the performance (ACC@1) of various foundation models and CNNs on the DermaMNIST dataset across different image sizes. Foundation models, particularly UNI and CONCH, consistently achieve higher accuracy compared to CNNs, highlighting their growing importance in medical image retrieval tasks.
A new study delves into the rapidly evolving field of Content-Based Medical Image Retrieval (CBMIR), comparing the performance of pre-trained Convolutional Neural Networks (CNNs) and cutting-edge foundation models. The research team used eight diverse datasets from the MedMNIST V2 collection, encompassing both 2D and 3D medical images, to evaluate retrieval accuracy across different image sizes.
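The core recipe behind such a comparison is straightforward: use each network as a frozen feature extractor, index the gallery embeddings, and retrieve nearest neighbours by cosine similarity. The sketch below illustrates this with a generic timm backbone standing in for the CNNs and foundation models in the study, and with one common definition of ACC@k (a hit if any of the top-k retrieved images shares the query's class).

```python
# Sketch of the retrieval recipe: embed images with a frozen backbone, retrieve
# nearest neighbours by cosine similarity, and score ACC@k. The timm backbone
# is a stand-in for the CNNs and foundation models compared in the study.
import numpy as np
import timm
import torch

backbone = timm.create_model("densenet121", pretrained=True, num_classes=0).eval()

@torch.no_grad()
def embed(batch: torch.Tensor) -> np.ndarray:
    feats = backbone(batch)                                 # (N, D) pooled features
    return torch.nn.functional.normalize(feats, dim=1).cpu().numpy()

def acc_at_k(query_emb, query_lbl, gallery_emb, gallery_lbl, k=1):
    sims = query_emb @ gallery_emb.T                        # cosine similarity
    topk = np.argsort(-sims, axis=1)[:, :k]                 # k nearest gallery items
    hits = [np.any(gallery_lbl[idx] == lbl)
            for idx, lbl in zip(topk, query_lbl)]
    return float(np.mean(hits))
```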
The results highlight a significant trend: foundation models, particularly UNI and CONCH, consistently outperformed CNNs on 2D datasets. Notably, UNI, trained solely on histological images, demonstrated remarkable generalization capabilities, surpassing even models trained on diverse medical image types. This finding underscores the potential of specialized medical foundation models for broader applications.
While the performance gap between foundation models and CNNs narrowed for 3D datasets, CONCH, a model trained on histological images, still achieved the highest overall retrieval accuracy. Interestingly, increasing the image size yielded only marginal improvements in retrieval performance, suggesting that competitive results can be achieved even with smaller images.
This comprehensive analysis provides valuable insights into the evolving landscape of CBMIR, emphasizing the rising prominence of foundation models. The superior performance of UNI, exceeding the best CNN model (DenseNet121) by 4.40%, 3.65%, 5.38%, 2.29%, and 1.28% for mAP@5, mMV@5, ACC@1, ACC@3, and ACC@5 scores respectively, underscores the potential of foundation models to revolutionize medical image retrieval. This research paves the way for future investigations into optimizing feature extraction techniques, enhancing 3D retrieval, and expanding the application of foundation models in diverse medical imaging tasks.
Keypoints-Integrated Instruction-Following Data Generation for Enhanced Human Pose Understanding in Multimodal Models by Dewen Zhang, Wangpeng An, Hayaru Shouno https://arxiv.org/abs/2409.09306
Caption: This image showcases a person cross-country skiing, with yellow dots highlighting their key body joints. This technique of integrating human keypoints into image data helps multimodal models better understand human poses and actions, leading to more accurate image descriptions and richer human-computer interactions.
Researchers have introduced a novel method for generating language-image instruction-following data by integrating human keypoints, which represent the specific locations of joints and other critical body parts, alongside traditional bounding box information. This approach significantly enhances the multimodal model's understanding of human poses and actions, enabling more robust conversations about human activities and deeper understanding of human-related visual contexts.
The researchers utilized the LLaVA-7B architecture for their study, fine-tuning it with datasets focused on conversation, detailed description, and complex reasoning related to human poses and actions. This fine-tuning process used a carefully curated set of 200,328 instruction-following samples derived from COCO training images using GPT-4o.
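The sketch below shows one way the keypoint annotations might be serialized into a textual context for a text-only LLM, following the standard COCO keypoint convention; the exact template and instructions given to GPT-4o in the paper may differ.

```python
# Sketch: turning COCO-style annotations (bounding box + 17 keypoints) into a
# textual context that a text-only LLM can use to write instruction-following
# data about a person's pose. The prompt framing is an assumption.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def annotation_to_context(ann: dict) -> str:
    x, y, w, h = ann["bbox"]
    lines = [f"person: bounding box [{x:.0f}, {y:.0f}, {w:.0f}, {h:.0f}]"]
    kps = ann["keypoints"]                      # flat list: x1, y1, v1, x2, ...
    for name, i in zip(COCO_KEYPOINTS, range(0, len(kps), 3)):
        px, py, visible = kps[i], kps[i + 1], kps[i + 2]
        if visible > 0:                         # 0 = keypoint not labelled
            lines.append(f"{name}: ({px:.0f}, {py:.0f})")
    return "\n".join(lines)

# The resulting text block, together with image captions, would be sent to
# GPT-4o with instructions to produce conversations, detailed descriptions,
# and complex-reasoning Q&A grounded in the person's pose.
```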
The results showed significant improvements across all categories compared to the original LLaVA-7B model. The aggregate performance score increased by 21.18%, highlighting the efficacy of incorporating conversation, detailed description, and complex reasoning datasets, enriched with human keypoints, into the training regimen. This method allows pre-existing architectures to be significantly enhanced to meet the demands of more sophisticated AI applications that require a deeper understanding of human interactions.
This research paves the way for developing more intuitive and capable multimodal systems that can operate effectively in human-centric environments. Future work could explore the integration of temporal information to further enhance the model's reasoning abilities in dynamic environments, making it even better suited for real-world applications.
MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection by Yaning Zhang, Tianyi Wang, Zitong Yu, Zan Gao, Linlin Shen, Shengyong Chen https://arxiv.org/abs/2409.09724
Caption: This image illustrates the architecture of MFCLIP, a novel multi-modal fine-grained CLIP model designed for detecting diffusion-based face forgeries. It highlights key components such as the fine-grained text generator, noise encoder, and sample pair attention (SPA) module, which contribute to MFCLIP's superior performance in capturing forgery artifacts and generalizing across different generators and forgery types. The diagram also depicts the data flow and operations involved in both the training and inference phases of the model.
The rise of sophisticated face forgery techniques, particularly those leveraging diffusion models, demands robust and generalizable detection methods. Existing approaches often struggle with generalization across different generators and forgery types, particularly when it comes to diffusion-generated images. This paper introduces MFCLIP, a novel multi-modal fine-grained CLIP model designed to address these limitations.
Unlike traditional methods that primarily rely on image data, MFCLIP incorporates fine-grained noise patterns extracted from the richest patches of images, alongside global image features. This multi-modal approach allows for a more comprehensive capture of forgery artifacts. The model utilizes a fine-grained text generator to create hierarchical text prompts, enhancing the learning of general visual forgery patterns across image-noise modalities via text-guided representation learning. Furthermore, a novel plug-and-play sample pair attention (SPA) module is introduced to adaptively emphasize relevant negative pairs and suppress irrelevant ones during cross-modal feature alignment, thereby improving generalization.
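As a rough illustration of the SPA idea, the sketch below re-weights negative pairs inside a CLIP-style contrastive loss using a small learned scoring network. This is an interpretation of the mechanism described above, not the paper's exact formulation.

```python
# Loose sketch of sample pair attention (SPA): in a CLIP-style contrastive
# alignment, re-weight negative pairs by a learned relevance score so that
# informative negatives dominate the loss. Interpretation only, not MFCLIP's
# exact formulation.
import torch
import torch.nn.functional as F

def spa_contrastive_loss(img_emb, txt_emb, spa_mlp, temperature=0.07):
    # spa_mlp could be e.g. nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, 1))
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.T / temperature                  # (B, B) similarities

    # Relevance weights for every image-text pair, predicted from the pair's
    # joint representation; positives (the diagonal) keep weight 1.
    pair_feats = img_emb.unsqueeze(1) * txt_emb.unsqueeze(0)    # (B, B, D)
    weights = torch.sigmoid(spa_mlp(pair_feats)).squeeze(-1)    # (B, B)
    diag = torch.eye(len(img_emb), dtype=torch.bool, device=img_emb.device)
    weights = weights.masked_fill(diag, 1.0)

    targets = torch.arange(len(img_emb), device=img_emb.device)
    weighted_logits = logits + torch.log(weights + 1e-6)        # down-weight negatives
    return F.cross_entropy(weighted_logits, targets)
```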
Extensive experiments on various datasets, including GenFace, FF++, DFDC, Celeb-DF, and DF-1.0, demonstrate the superiority of MFCLIP over state-of-the-art methods. Notably, the model achieves significant improvements in cross-generator and cross-forgery evaluations, highlighting its strong generalization capabilities. For instance, in cross-generator evaluation on GenFace, MFCLIP outperforms CLIP, DIRE, and FreqNet by a significant margin, achieving an AUC improvement of approximately 40.60%, 27.26%, and 26.36%, respectively, on images generated by IAFaces after training on Diffae. The effectiveness of individual components, such as the noise encoder, fine-grained language encoder, and SPA module, is validated through ablation studies.
The paper highlights the importance of incorporating multi-modal information and fine-grained analysis for robust and generalizable diffusion face forgery detection. The proposed MFCLIP model, with its innovative use of noise patterns, hierarchical text prompts, and adaptive sample pair attention, offers a promising avenue for tackling the growing challenge of detecting sophisticated face forgeries. Future work will focus on further enhancing generalization across diverse datasets and reducing computational complexity, potentially by integrating pre-trained components with multi-modal large language models.
PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage by Denis Zavadski, Damjan Kalšan, Carsten Rother https://arxiv.org/abs/2409.09144
Caption: This image depicts the architecture of PrimeDepth, a new method for zero-shot monocular depth estimation. It leverages Stable Diffusion to extract a "preimage" from a single denoising step, which is then processed by a refiner network to predict depth. The frozen Stable Diffusion components are marked with snowflakes, while trainable components are marked with gears.
This paper introduces PrimeDepth, a novel method for zero-shot monocular depth estimation that leverages the power of Stable Diffusion, a latent Text-to-Image diffusion model. Unlike previous diffusion-based methods that fine-tune the diffusion model itself or require multiple denoising steps, PrimeDepth extracts a rich, frozen image representation called the preimage from a single denoising step of Stable Diffusion. This preimage, consisting of feature maps and attention maps, is then fed into a refiner network with an architectural inductive bias specifically designed to process multi-scale features.
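To give a sense of what harvesting such a preimage could look like, the sketch below hooks the down and mid blocks of a frozen Stable Diffusion UNet (via diffusers) and runs a single denoising pass. The checkpoint, timestep, and choice of activations are assumptions, and the attention maps that PrimeDepth also collects are omitted for brevity.

```python
# Sketch: harvesting a multi-scale "preimage" from one pass through a frozen
# Stable Diffusion UNet using forward hooks. Checkpoint, timestep, and kept
# activations are assumptions; attention maps are omitted for brevity.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel

repo = "stabilityai/stable-diffusion-2-1"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae").eval()
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet").eval()

preimage = []
for blk in list(unet.down_blocks) + [unet.mid_block]:
    blk.register_forward_hook(
        lambda m, args, out: preimage.append(out[0] if isinstance(out, tuple) else out))

@torch.no_grad()
def extract_preimage(image: torch.Tensor, text_emb: torch.Tensor, t: int = 199):
    """image: (1, 3, H, W) in [-1, 1]; text_emb: frozen CLIP embedding, e.g. of an empty prompt."""
    preimage.clear()
    latents = vae.encode(image).latent_dist.mode() * vae.config.scaling_factor
    unet(latents, torch.tensor([t]), encoder_hidden_states=text_emb)  # single denoising step
    return list(preimage)   # multi-scale feature maps for the refiner network
```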
The key advantage of PrimeDepth lies in its efficiency and performance. By utilizing a single diffusion step, it achieves significantly faster inference speeds compared to multi-step diffusion approaches like Marigold, being on average over 100x faster. Despite using a significantly smaller amount of synthetically labeled training data (74K vs. 1.5M), PrimeDepth achieves results that are quantitatively on par with Depth Anything, the current leading data-driven method. Notably, PrimeDepth even surpasses Depth Anything in terms of detail and robustness in challenging scenarios like nighttime scenes.
The authors demonstrate the effectiveness of PrimeDepth through extensive experiments on various datasets, including KITTI, NYUv2, ETH3D, rabbitai, and a challenging subset of nuScenes. They show that PrimeDepth consistently outperforms Marigold and achieves comparable or even superior results to Depth Anything, while being significantly faster. This highlights the potential of leveraging pre-trained generative models like Stable Diffusion for downstream tasks, particularly in the realm of depth estimation.
The authors also provide insights into the importance of architectural choices and the benefits of using the complete preimage representation. They argue that retaining the frozen weights of the pre-trained Stable Diffusion model contributes to the robustness of PrimeDepth, as opposed to fine-tuning the model itself. Furthermore, they demonstrate that utilizing the full preimage, including both feature maps and attention maps, is crucial for achieving optimal performance. The authors' findings suggest a promising direction for future research in zero-shot depth estimation and highlight the potential of combining data-driven and diffusion-based approaches for improved generalization capabilities.
This newsletter has showcased the remarkable advancements being made in multimodal image and text foundation models. From generating synthetic affective data to detecting deepfakes and estimating depth from single images, these models are pushing the boundaries of AI capabilities. As research in this field continues to evolve, we can expect even more groundbreaking applications and a deeper understanding of the interplay between vision and language. The studies highlighted in this newsletter demonstrate the potential of these models to revolutionize various domains, including healthcare, human-computer interaction, and content authenticity verification.