This newsletter explores the cutting edge of multimodal image and text foundation models, showcasing novel approaches to zero-shot learning, compositional retrieval, aerial detection, and remote sensing analysis. We delve into four recent papers that leverage large language models, contrastive learning, and innovative training strategies to push the boundaries of performance and generalization in these rapidly evolving fields. Prepare to discover how these models are transforming diverse applications from e-commerce and web search to surgical workflow analysis and remote sensing.
SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval by Bhavin Jawade, Joao V. B. Soares, Kapil Thadani, Deen Dayal Mohan, Amir Erfan Eshratifar, Benjamin Culpepper, Paloma de Juan, Srirangaraj Setlur, Venu Govindaraju https://arxiv.org/abs/2501.08347
Compositional Image Retrieval (CIR) allows users to refine image searches by providing textual modifications to a reference image. Think of searching for a "red dress" and then refining it to a "red dress with long sleeves." Current CIR methods heavily rely on supervised learning with labeled image-text-image triplets (reference image, modification text, target image). This approach limits their ability to generalize to unseen objects and domains, as models are confined to the specific relationships present in the training data.
SCOT (Self-Supervised COmpositional Training) offers a novel solution by introducing a zero-shot compositional pretraining strategy. This approach leverages existing image-text pair datasets and large language models (LLMs) to overcome the limitations of supervised learning. Instead of relying on labeled triplets, SCOT exploits the semantic alignment between visual and textual representations within pretrained vision-language models like CLIP, BLIP, and BLIP-2.
The SCOT method uses image-caption pairs and prompts an LLM to generate both a modification text (e.g., "with long sleeves") and a corresponding modified caption (e.g., "a red dress with long sleeves"). A composition function f<sub>c</sub> (such as the Combiner network) is trained to combine the reference image embedding V<sub>i</sub> and modification text embedding T<sub>m</sub> to produce a composed embedding V<sub>c</sub>. The training objective is a contrastive loss that pulls V<sub>c</sub> towards the embedding of the LLM-generated modified caption T<sub>u</sub> and pushes it away from embeddings of other captions, including the original caption. This contrastive approach allows the model to learn the compositional relationships between images and text without needing explicit target image examples. The loss function can be represented as: L = α<sub>pos</sub> ⋅ L<sub>pos</sub> + α<sub>neg</sub> ⋅ L<sub>neg</sub>, where L<sub>pos</sub> and L<sub>neg</sub> are the positive and negative loss components, respectively, and α<sub>pos</sub> and α<sub>neg</sub> are scaling factors.
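To make the training objective concrete, here is a minimal PyTorch-style sketch of an in-batch contrastive loss of this kind. The function and argument names (`scot_contrastive_loss`, `combiner`, `temperature`) are illustrative assumptions rather than the authors' code, and the separate α<sub>pos</sub>/α<sub>neg</sub> weighting is folded into a single cross-entropy term for brevity.

```python
import torch
import torch.nn.functional as F

def scot_contrastive_loss(v_img, t_mod, t_target, combiner, temperature=0.07):
    """Illustrative SCOT-style objective (not the authors' implementation).

    v_img:    (B, D) reference-image embeddings from a frozen VLM encoder
    t_mod:    (B, D) modification-text embeddings
    t_target: (B, D) embeddings of the LLM-generated modified captions
    combiner: composition function f_c, e.g. a Combiner-style network
    """
    # Compose image + modification text into a single query embedding V_c.
    v_c = F.normalize(combiner(v_img, t_mod), dim=-1)
    t_u = F.normalize(t_target, dim=-1)

    # In-batch contrastive loss: the matching modified caption is the positive;
    # all other captions in the batch serve as negatives.
    logits = v_c @ t_u.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(v_c.size(0), device=v_c.device)
    return F.cross_entropy(logits, labels)
```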
Evaluating SCOT's zero-shot capabilities on FashionIQ and CIRR datasets reveals impressive results. On FashionIQ, SCOT achieves state-of-the-art zero-shot performance, surpassing existing methods like SEARLE-XL by a substantial margin (11.78% on R@10 and 13.8% on R@20). Remarkably, SCOT's zero-shot performance even approaches that of some fully-supervised methods, highlighting its ability to generalize effectively. Similar improvements are observed on CIRR, with SCOT outperforming SEARLE-XL by 12.58% at R@1 and 10.86% at R@5. Further analysis through ablation studies demonstrates the impact of different vision-language backbones, the benefits of using text embeddings as supervision targets, and the influence of training dataset size and distribution.
A Simple Aerial Detection Baseline of Multimodal Language Models by Qingyun Li, Yushi Chen, Xinya Shu, Dong Chen, Xin He, Yi Yu, Xue Yang https://arxiv.org/abs/2501.09720
Multimodal language models (MLMs) have proven their effectiveness in various remote sensing (RS) tasks like visual question answering and visual grounding. However, their application to aerial detection, a crucial task involving the detection of multiple object categories, has remained unexplored. This paper introduces LMMRotate, a pioneering baseline for applying MLMs to aerial detection, demonstrating their potential to rival conventional detectors.
The core challenge addressed by LMMRotate is the fundamental difference between the autoregressive textual output of MLMs and the numerical coordinate-based output of traditional detection models. LMMRotate tackles this by introducing a normalization method that transforms detection outputs into a textual format compatible with MLMs. Specifically, the 8-parameter polygon coordinates of detected objects are normalized to integers between 0 and 1000, and object categories are represented textually. This allows the MLM to process detection information as a sequence of text tokens. The input to the language model is a concatenated sequence of visual tokens (derived from the image) and text tokens (representing the detection instruction). The model is then trained with a standard language modeling objective under cross-entropy loss: L = - Σ<sub>j</sub> log P(r<sub>j</sub> | r<sub>&lt;j</sub>, T), where r is the sequence of output token indices and T is the input sequence (visual and text tokens).
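The coordinate-to-text step is simple enough to sketch. The snippet below shows one plausible serialization, assuming corner coordinates in pixels and a 0-1000 quantization grid; the exact output format (separators, token vocabulary, instruction wording) used by LMMRotate is not specified here, and the `serialize_detections` helper is hypothetical.

```python
def serialize_detections(detections, img_w, img_h, n_bins=1000):
    """Illustrative serialization of aerial detections into text: each object
    becomes its category name followed by its eight polygon coordinates
    quantized to integers in [0, n_bins]."""
    parts = []
    for category, polygon in detections:  # polygon: 8 floats in pixel coordinates
        coords = []
        for k, value in enumerate(polygon):
            extent = img_w if k % 2 == 0 else img_h  # x uses width, y uses height
            coords.append(str(round(value / extent * n_bins)))
        parts.append(f"{category} " + " ".join(coords))
    return "; ".join(parts)

# Example: one ship whose oriented-box corners are given in pixel coordinates.
print(serialize_detections(
    [("ship", [120.0, 40.5, 300.2, 60.0, 280.0, 150.7, 100.3, 131.0])],
    img_w=1024, img_h=1024))
```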
To ensure a fair comparison between MLMs and conventional detectors, the authors introduce a novel evaluation metric: mAP<sub>nc</sub> (mean average precision without confidence). This metric addresses the observation that confidence scores significantly influence mAP; because MLMs emit detections as plain text without confidence scores, ranking-based mAP hands conventional detectors an unfair advantage. By disregarding confidence scores, mAP<sub>nc</sub> provides a more equitable comparison. Experiments conducted on benchmark datasets like DOTA-v1.0, DIOR-R, FAIR1M-v1.0, SRSDD, and RSAR reveal impressive results. Using fine-tuned versions of Florence-2, a general-purpose MLM, LMMRotate achieves detection performance comparable to, and in some cases exceeding, conventional detectors in terms of mAP<sub>nc</sub>. Furthermore, LMMRotate demonstrates superior performance in terms of mF<sub>1</sub> on certain datasets, highlighting the potential of MLMs as powerful aerial detectors. The authors explore both single- and multi-dataset training strategies, with joint training across multiple datasets leading to further performance improvements, particularly for smaller datasets like SRSDD.
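The paper's exact definition of mAP<sub>nc</sub> is not reproduced here, but the intuition can be sketched: neutralize the predicted scores so that confidence-based ranking can no longer differentiate methods before running a standard rotated-box evaluation. The data layout below (image id mapped to `(category, polygon, score)` tuples) is a purely hypothetical illustration.

```python
def strip_confidence(predictions):
    """Overwrite every predicted score with the same constant so a downstream
    evaluator cannot exploit confidence-based ranking.

    predictions: dict mapping image_id -> list of (category, polygon, score).
    """
    return {
        image_id: [(category, polygon, 1.0) for category, polygon, _ in dets]
        for image_id, dets in predictions.items()
    }
```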
FLAVARS: A Multimodal Foundational Language and Vision Alignment Model for Remote Sensing by Isaac Corley, Simone Fobi Nsutezo, Anthony Ortiz, Caleb Robinson, Rahul Dodhia, Juan M. Lavista Ferres, Peyman Najafirad https://arxiv.org/abs/2501.08490
Remote sensing imagery, abundant in visual information, has seen increasing use of multimodal learning, particularly by pairing images with text captions. While methods like CLIP excel at vision-language alignment and zero-shot classification, they often struggle with visually dense tasks like semantic segmentation. FLAVARS addresses this trade-off by combining contrastive learning, masked modeling, and geospatial alignment.
Building on the FLAVA framework (which integrates masked-image-modeling (MIM), masked-language-modeling (MLM), and contrastive learning), FLAVARS adds a crucial element for remote sensing: contrastive location-image alignment. This is achieved by incorporating a location encoder, initialized with SatCLIP weights and pretrained on SkyScript dataset coordinates, to align image, text, and location embeddings. This geospatial awareness enhances the model's understanding of remote sensing data. Pretraining on a subset of SkyScript allows for direct comparison with SkyCLIP and original FLAVA.
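The geospatial alignment term can be pictured as a CLIP-style contrastive loss between image embeddings and location embeddings. The sketch below assumes a batch of images with known latitude/longitude and treats the location encoder as a black box; names such as `location_image_contrastive_loss` are illustrative, not the FLAVARS training code.

```python
import torch
import torch.nn.functional as F

def location_image_contrastive_loss(image_emb, latlon, location_encoder, temperature=0.07):
    """Illustrative image-location contrastive alignment.

    image_emb:        (B, D) image embeddings from the vision encoder
    latlon:           (B, 2) per-image (latitude, longitude) pairs
    location_encoder: maps (B, 2) coordinates to (B, D) embeddings,
                      e.g. a SatCLIP-initialized encoder
    """
    img = F.normalize(image_emb, dim=-1)
    loc = F.normalize(location_encoder(latlon), dim=-1)

    # Symmetric InfoNCE: each image is matched to the location it was captured at,
    # with the rest of the batch serving as negatives.
    logits = img @ loc.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```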
Evaluation across 12 remote sensing scene recognition datasets and the SpaceNet1 semantic segmentation dataset demonstrates FLAVARS' effectiveness. In KNN image classification, FLAVARS significantly outperforms SkyCLIP on most datasets, showcasing its superior visual representations. Location encoding further boosts this performance. While CLIP remains the champion in zero-shot classification, FLAVARS retains respectable zero-shot capabilities while improving visual encoder performance compared to FLAVA. Most notably, FLAVARS achieves a +6% mIoU improvement over SkyCLIP on SpaceNet1 semantic segmentation, demonstrating its strength in dense prediction tasks. Its performance also compares favorably to vision-only pretraining methods, reaching an mIoU of 78.1% with location encoding.
Text-driven Adaptation of Foundation Models for Few-shot Surgical Workflow Analysis by Tingxuan Chen, Kun Yuan, Vinkle Srivastav, Nassir Navab, Nicolas Padoy https://arxiv.org/abs/2501.09555
Surgical workflow analysis, vital for improving surgical safety and efficiency, traditionally relies on large annotated datasets, posing challenges in cost, scalability, and expert annotation dependence. Surg-FTDA (Few-shot Text-driven Adaptation) offers a groundbreaking solution by adapting surgical foundation models to various downstream tasks with minimal paired image-label data.
Surg-FTDA employs a two-stage approach. First, few-shot selection-based modality alignment selects a small subset of images (data anchors) and aligns their embeddings with text embeddings from the downstream task. This alignment is achieved by training an MLP to minimize the MSE loss between aligned image embeddings and text embeddings. Specifically, they minimize (1/K) ⋅ Σ<sub>i=1</sub><sup>K</sup> ||v'<sub>image,i</sub> - v<sub>text,i</sub>||<sup>2</sup>, where v'<sub>image,i</sub> = MLP(v<sub>image,i</sub>; θ) is the aligned embedding of the i-th image anchor, v<sub>text,i</sub> is the corresponding text embedding, and K is the number of sampled pairs. This bridges the modality gap, a key challenge in adapting foundation models. Second, text-driven adaptation trains a decoder using only text data, eliminating the need for paired image-text data. This decoder, applied to aligned image embeddings, enables image-related tasks without explicit image-text pairs.
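The first stage reduces to fitting a small network on a handful of embedding pairs. Below is a minimal PyTorch sketch of that alignment step, assuming image and text embeddings share the same dimension; the `AlignmentMLP` architecture and training hyperparameters are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class AlignmentMLP(nn.Module):
    """Maps frozen image embeddings toward the text-embedding space."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, v_image):
        return self.net(v_image)

def train_alignment(v_image, v_text, epochs=200, lr=1e-3):
    """v_image, v_text: (K, D) precomputed embeddings of the K anchor pairs."""
    mlp = AlignmentMLP(v_image.size(1))
    opt = torch.optim.Adam(mlp.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        # (1/K) * sum_i ||MLP(v_image_i; theta) - v_text_i||^2 (the MSE alignment objective)
        loss = ((mlp(v_image) - v_text) ** 2).sum(dim=1).mean()
        loss.backward()
        opt.step()
    return mlp
```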
Evaluating Surg-FTDA on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition) reveals promising results. For triplet recognition, Surg-FTDA significantly outperforms baselines, achieving an F1 score of 23.87% using SurgVLP and 20.33% using CLIP. Similar improvements are observed in phase recognition, where Surg-FTDA approaches the performance of fully supervised models. Ablation studies confirm the effectiveness of the few-shot selection strategy and the choice of foundation model.
This newsletter has showcased a diverse range of advancements in multimodal image and text foundation models. From zero-shot compositional retrieval with SCOT to aerial detection with LMMRotate, and from enhanced remote sensing analysis with FLAVARS to efficient surgical workflow analysis with Surg-FTDA, these papers highlight the growing power and versatility of these models. The common thread connecting these works is the innovative use of techniques like contrastive learning, large language models, and few-shot learning to overcome traditional limitations and unlock new possibilities in various applications. These advancements pave the way for more robust, generalizable, and efficient models that can tackle complex real-world challenges with less reliance on extensive labeled data.