This newsletter explores the cutting edge of multimodal AI, focusing on the exciting intersection of image and text understanding. We'll delve into six recent papers that showcase innovative approaches to leveraging the power of large language models (LLMs) and vision models, addressing challenges in text-to-image and text-to-video generation, underwater scene analysis, in-context learning, 3D scene understanding, and efficient model adaptation. Prepare to be immersed in a world of dynamic KL-weighting, contrastive learning, prompt-guided encoders, and massive datasets, all pushing the boundaries of what's possible in multimodal AI.
Weak Supervision Dynamic KL-Weighted Diffusion Models Guided by Large Language Models by Julian Perry, Frank Sanders, Carter Scott https://arxiv.org/abs/2502.00826
This paper presents a novel method that marries the strengths of LLMs and diffusion models for improved text-to-image generation. This hybrid approach aims to achieve both higher quality and efficiency in image synthesis. The key innovation lies in a dynamic KL-weighting strategy that optimizes the diffusion process, coupled with semantic understanding from pre-trained LLMs to guide the generation.
The method employs a generative diffusion model that progressively denoises random noise to create images. Crucially, LLMs are integrated to provide contextual embeddings z<sub>t</sub> at each timestep t of the reverse diffusion process. This conditions the model on the textual description, ensuring semantic alignment between the text and the generated image. The reverse diffusion process, guided by the LLM embedding, is mathematically represented as:
p<sub>θ</sub>(x<sub>t-1</sub>|x<sub>t</sub>, z<sub>t</sub>) = N(x<sub>t-1</sub>; μ<sub>θ</sub>(x<sub>t</sub>, t, z<sub>t</sub>), Σ<sub>θ</sub>(x<sub>t</sub>, t))
A cross-attention mechanism further refines this alignment by computing the attention between image features x<sub>t</sub> and text features z<sub>t</sub>: a<sub>t</sub> = Attention(x<sub>t</sub>, z<sub>t</sub>).
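To make the conditioning mechanism concrete, here is a minimal PyTorch-style sketch of a denoiser block in which image features cross-attend to the LLM embeddings z<sub>t</sub>. The module layout, dimensions, and timestep embedding are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class TextConditionedDenoiser(nn.Module):
    """Illustrative denoiser block: image features cross-attend to LLM embeddings."""

    def __init__(self, img_dim=256, txt_dim=768, n_heads=4, n_timesteps=1000):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, img_dim)         # map z_t into the image feature space
        self.t_embed = nn.Embedding(n_timesteps, img_dim)   # embedding for the timestep t
        # Cross-attention: queries from image features x_t, keys/values from text z_t.
        self.cross_attn = nn.MultiheadAttention(img_dim, n_heads, batch_first=True)
        self.to_eps = nn.Linear(img_dim, img_dim)            # predicts the noise used to form mu_theta

    def forward(self, x_t, z_t, t):
        # x_t: (B, N_patches, img_dim) noisy image features at timestep t
        # z_t: (B, N_tokens, txt_dim) contextual LLM embeddings of the prompt
        # t:   (B,) integer timesteps
        txt = self.txt_proj(z_t)
        a_t, _ = self.cross_attn(query=x_t, key=txt, value=txt)   # a_t = Attention(x_t, z_t)
        h = x_t + a_t + self.t_embed(t).unsqueeze(1)               # fuse text and timestep information
        return self.to_eps(h)
```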
The model is trained using a variational objective, minimizing the Kullback-Leibler (KL) divergence between the true and generated distributions. A dynamic weighting strategy for the KL divergence terms prioritizes learning in early timesteps where the image structure is rough, shifting focus to finer details in later timesteps. The weighted loss function is:
L<sub>weighted</sub> = E<sub>q</sub>[Σ<sup>T</sup><sub>t=1</sub> α<sub>t</sub> D<sub>KL</sub>(q(x<sub>t-1</sub>|x<sub>t</sub>, x<sub>0</sub>) || p<sub>θ</sub>(x<sub>t-1</sub>|x<sub>t</sub>, z<sub>t</sub>))]
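As a rough illustration of how such dynamic weighting might be implemented, the sketch below computes timestep weights α<sub>t</sub> that shift emphasis from coarse, high-noise timesteps toward fine-detail timesteps as training progresses, and applies them to per-timestep loss terms. The specific schedule is an assumption; the paper's exact formula may differ:

```python
import torch

def dynamic_kl_weights(T, progress):
    """Illustrative dynamic schedule (an assumption, not the paper's exact formula).

    progress in [0, 1] is the fraction of training completed. Early in training the
    weights emphasize coarse, high-noise timesteps; as training progresses they
    shift toward low-noise timesteps where fine details are formed.
    """
    t = torch.linspace(0.0, 1.0, T)               # normalized timestep index
    focus = 1.0 - progress                        # moves from 1 (coarse) toward 0 (fine)
    alpha = torch.exp(-((t - focus) ** 2) / 0.1)  # Gaussian bump centered on the current focus
    return alpha / alpha.sum()                    # normalize so the weights sum to 1

def weighted_diffusion_loss(per_step_kl, alpha):
    """per_step_kl: (B, T) per-timestep KL terms D_KL(q || p_theta);
    alpha: (T,) dynamic weights. Returns L_weighted as in the equation above."""
    return (per_step_kl * alpha.unsqueeze(0)).sum(dim=1).mean()
```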
Evaluated on the COCO dataset, this model outperformed traditional GAN-based models and other state-of-the-art text-to-image diffusion models, achieving a lower Fréchet Inception Distance (FID) of 30.5 and a higher Inception Score (IS) of 5.4. Human evaluations confirmed these findings, rating the model highly (4.6 out of 5) for realism, text alignment, and overall quality. An ablation study confirmed the crucial role of both LLM guidance and dynamic KL-weighting. The model also demonstrated improved efficiency, avoiding computationally expensive adversarial training and reducing training time.
Efficient Domain Adaptation of Multimodal Embeddings using Contrastive Learning by Georgios Margaritis, Periklis Petridis, Dimitris J. Bertsimas https://arxiv.org/abs/2502.02048
This research addresses the challenge of adapting powerful, but computationally demanding, foundational models (FMs) to specific downstream tasks in resource-constrained environments. The proposed method leverages frozen embeddings from LLMs and vision models, avoiding the computational burden of fine-tuning. Instead, it employs contrastive learning to train a small, task-specific nonlinear projection that maps the original high-dimensional embeddings into a lower-dimensional space optimized for the downstream task.
Caption: This diagram illustrates the proposed method for adapting multimodal embeddings using contrastive learning. Frozen embeddings from feature extractors (e.g., BERT) are passed through nonlinear projections trained via contrastive loss to generate task-specific embeddings. These low-dimensional embeddings are then concatenated and fed into a supervised ML model for downstream task prediction.
The contrastive loss function encourages embeddings with the same labels to cluster together while pushing apart those with different labels:
L<sup>(j)</sup> = −Σ<sub>(u,v,l)∈C<sup>(j)</sup></sub>[l · log σ(g<sub>f<sub>j</sub></sub>(u)·g<sub>f<sub>j</sub></sub>(v)/T) + (1 − l) · log(1 − σ(g<sub>f<sub>j</sub></sub>(u)·g<sub>f<sub>j</sub></sub>(v)/T))]
where C<sup>(j)</sup> is the set of contrastive pairs for modality j, u and v are projected embeddings, l is 1 if u and v share the same label and 0 otherwise, g<sub>f<sub>j</sub></sub> represents the projection function, σ is the sigmoid function, and T is a temperature parameter.
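A minimal PyTorch sketch of this setup, assuming frozen foundation-model embeddings are already available as tensors: a small nonlinear projection head is trained with a sigmoid-based pairwise loss matching the form above. The architecture sizes and temperature are placeholder choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Small task-specific nonlinear projection g_f applied to frozen FM embeddings."""

    def __init__(self, in_dim=768, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # unit-norm low-dimensional embeddings

def pairwise_contrastive_loss(z_u, z_v, same_label, temperature=0.1):
    """Sigmoid-based pairwise loss: pull same-label pairs together, push others apart.
    z_u, z_v: (P, d) projected embeddings for P pairs; same_label: (P,) with values in {0, 1}."""
    sim = (z_u * z_v).sum(dim=-1) / temperature
    # Binary cross-entropy over pair similarity, i.e. the negated sum in the loss above.
    return F.binary_cross_entropy_with_logits(sim, same_label.float())
```

Only the projection head's parameters are updated, while the underlying foundation models stay frozen, which is what keeps the adaptation step cheap enough to run on a CPU.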
Evaluations on a healthcare dataset (predicting diabetes and hypertension) and a movie dataset (genre prediction) demonstrated significant performance gains (up to a 20% improvement in F1 score) compared to unprojected baselines. Importantly, the method demonstrated minimal computational overhead, running in minutes even on CPU-only settings. This makes it an attractive solution for resource-constrained environments where fine-tuning is impractical.
AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis by Basit Alawode, Iyyakutti Iyappan Ganapathi, Sajid Javed, Naoufel Werghi, Mohammed Bennamoun, Arif Mahmood https://arxiv.org/abs/2502.01785
This work introduces AquaticCLIP, a specialized VLM designed for the unique challenges of underwater scene analysis. Addressing the scarcity of annotated underwater data, the researchers curated a massive 2 million image-text pair dataset from diverse sources and augmented it using MarineGPT, a specialized VLM for marine environments.
AquaticCLIP utilizes a dual-encoder architecture featuring a prompt-guided vision encoder and a vision-guided text encoder. The prompt-guided vision encoder uses learnable prompts to aggregate patch features, capturing global context effectively. The vision-guided text encoder integrates image information into the text encoder, improving alignment between modalities. These encoders are trained using a cross-modal contrastive loss: L<sub>cont</sub> = L<sub>i2t</sub> + L<sub>t2i</sub>, where L<sub>i2t</sub> and L<sub>t2i</sub> represent the image-to-text and text-to-image contrastive losses, respectively.
Caption: This diagram illustrates the AquaticCLIP architecture, a Vision-Language Model (VLM) designed for underwater scene analysis. It highlights the dual-encoder structure with prompt-guided vision and vision-guided text encoders trained using a cross-modal contrastive loss on a 2M image-text pair dataset. The dataset creation process, including MarineGPT augmentation and a cleaning module, is also depicted.
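The symmetric objective L<sub>cont</sub> = L<sub>i2t</sub> + L<sub>t2i</sub> follows the familiar CLIP recipe. A compact sketch is shown below; the temperature value and normalization details are assumptions rather than values taken from the paper:

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss L_cont = L_i2t + L_t2i over a batch of matched
    image-text pairs (a standard CLIP-style formulation, shown for illustration)."""
    img = F.normalize(img_emb, dim=-1)            # (B, d) image embeddings
    txt = F.normalize(txt_emb, dim=-1)            # (B, d) text embeddings
    logits = img @ txt.t() / temperature          # (B, B) pairwise similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)   # match each image to its own caption
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return loss_i2t + loss_t2i
```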
AquaticCLIP achieved state-of-the-art performance in zero-shot classification across seven datasets, including a remarkable 96.80% accuracy on the Coral Species Classification dataset. It also excelled in downstream tasks like segmentation, object detection, and counting, demonstrating its versatility and robustness in underwater environments.
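For context, zero-shot classification with a CLIP-style model typically works by scoring the image embedding against one encoded text prompt per class and taking the best match. The sketch below is a generic illustration; the prompt template and the encode_text helper are hypothetical placeholders, not AquaticCLIP's API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb, class_names, encode_text,
                       template="a photo of {} underwater"):
    """image_emb: (d,) embedding from the vision encoder. encode_text is a
    placeholder for the model's text encoder, assumed to return a (C, d) tensor.
    Returns the index of the predicted class."""
    prompts = [template.format(name) for name in class_names]
    text_emb = F.normalize(encode_text(prompts), dim=-1)   # (C, d) class prototypes
    image_emb = F.normalize(image_emb, dim=-1)
    scores = text_emb @ image_emb                           # cosine similarity per class
    return scores.argmax().item()
```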
The in-context inductive biases of vision-language models differ across modalities by Kelsey Allen, Ishita Dasgupta, Eliza Kosoy, Andrew K. Lampinen https://arxiv.org/abs/2502.01530
This paper delves into the inductive biases of VLMs during in-context learning, exploring how generalizations vary based on the modality (vision or text) of stimuli and their textual descriptions. Using three experimental paradigms (one-category generalization, two-category cue conflict, and odd-one-out) and a dataset of geometric shapes, the study investigated the shape-color bias in VLMs.
Caption: This bar graph visualizes the shape vs. color bias in several Vision-Language Models (VLMs) across three experimental paradigms (one-category, two-category, odd-one-out). It demonstrates that VLMs generally exhibit a stronger shape bias when learning from images compared to text, highlighting the influence of modality on inductive biases. Error bars represent standard error.
The findings reveal that VLMs exhibit stronger shape biases when learning from images compared to text, suggesting differences in internal representations across modalities. Moreover, adjective order in textual descriptions influences generalization, with models favoring the first-mentioned feature. These results highlight the impact of modality and format in few-shot learning with VLMs and suggest further research into the interplay of cognitive science principles and VLM behavior.
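To make the cue-conflict paradigm concrete, a text-modality trial might look like the hypothetical prompt below, where the probe item combines the color of one learned category with the shape of the other; the exact wording used in the paper may differ:

```python
# Hypothetical two-category cue-conflict trial in the text modality: "dax" is paired
# with "red circle" and "fep" with "blue square" in context, then the probe mixes one
# feature from each category. Answering "dax" indicates a color bias; "fep" a shape bias.
few_shot_prompt = (
    "A red circle is a dax.\n"
    "A blue square is a fep.\n"
    "A red square is a"
)
```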
IPO: Iterative Preference Optimization for Text-to-Video Generation by Xiaomeng Yang, Zhiyu Tan, Xuecheng Nie, Hao Li https://arxiv.org/abs/2502.02088
This paper introduces Iterative Preference Optimization (IPO), a novel post-training framework for aligning text-to-video (T2V) models with human preferences. IPO utilizes a critic model, trained on a preference dataset with pairwise ranking and pointwise scoring, to guide the optimization of T2V models through reinforcement learning. IPO supports both Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO). The DPO loss function is:
L<sub>Diffusion-DPO</sub>(θ) = −E<sub>(x<sub>0</sub><sup>w</sup>, x<sub>0</sub><sup>l</sup>)∼D, t∼U(0,T), x<sub>t</sub><sup>w</sup>∼q(x<sub>t</sub><sup>w</sup>|x<sub>0</sub><sup>w</sup>), x<sub>t</sub><sup>l</sup>∼q(x<sub>t</sub><sup>l</sup>|x<sub>0</sub><sup>l</sup>)</sub>[log σ(−βTω(λ<sub>t</sub>)((‖ε<sup>w</sup> − ε<sub>θ</sub>(x<sub>t</sub><sup>w</sup>, t)‖² − ‖ε<sup>w</sup> − ε<sub>ref</sub>(x<sub>t</sub><sup>w</sup>, t)‖²) − (‖ε<sup>l</sup> − ε<sub>θ</sub>(x<sub>t</sub><sup>l</sup>, t)‖² − ‖ε<sup>l</sup> − ε<sub>ref</sub>(x<sub>t</sub><sup>l</sup>, t)‖²)))]
The KTO objective, which maximizes a utility function U of the implicit reward, is:
max<sub>θ</sub> E<sub>x<sub>0</sub>∼D, t∼U(0,T)</sub>[U(w(x<sub>0</sub>)(log(π<sub>θ</sub>(x<sub>t−1</sub>|x<sub>t</sub>)/π<sub>ref</sub>(x<sub>t−1</sub>|x<sub>t</sub>)) − Q<sub>ref</sub>))]
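To ground the pairwise objective above, here is a PyTorch-style sketch of a Diffusion-DPO-like loss computed from noise-prediction errors for the preferred (winner) and non-preferred (loser) samples. The timestep weighting is omitted and the β value is a placeholder; this is a sketch of the general technique, not IPO's exact implementation:

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_w, eps_theta_w, eps_ref_w,
                       eps_l, eps_theta_l, eps_ref_l,
                       beta=500.0):
    """Pairwise DPO-style loss over denoising errors (illustrative only).
    *_w tensors come from the preferred (winner) video, *_l from the loser;
    each has shape (B, ...) matching the noise tensor, e.g. (B, C, T, H, W)."""
    err = lambda a, b: ((a - b) ** 2).flatten(1).sum(dim=1)
    # How much better the trained model explains the winner than the reference does ...
    delta_w = err(eps_w, eps_theta_w) - err(eps_w, eps_ref_w)
    # ... and how much better it explains the loser than the reference does.
    delta_l = err(eps_l, eps_theta_l) - err(eps_l, eps_ref_l)
    # Encourage the model to fit the winner relatively better than the loser.
    return -F.logsigmoid(-beta * (delta_w - delta_l)).mean()
```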
Caption: This diagram illustrates the Iterative Preference Optimization (IPO) framework for enhancing text-to-video generation. It shows the process of collecting prompts, generating video variants, training a critic model on preference data, and iteratively refining the T2V model through reinforcement learning using pairwise and pointwise feedback. This approach allows smaller T2V models to achieve higher quality video generation compared to larger baseline models.
Experiments on the VBench benchmark demonstrated significant quality improvements using IPO. Notably, a 2B parameter model optimized with IPO outperformed a larger 5B parameter baseline model. This highlights the efficiency and effectiveness of IPO in aligning T2V models with human preferences, improving subject consistency, motion smoothness, and aesthetic quality.
Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation by Junha Lee, Chunghyun Park, Jaesung Choe, Yu-Chiang Frank Wang, Jan Kautz, Minsu Cho, Chris Choy https://arxiv.org/abs/2502.02548
This paper tackles the challenging task of open-vocabulary 3D scene understanding by introducing Mosaic3D, which combines a novel data generation pipeline with a powerful foundation model. The automated data generation engine leverages 2D vision foundation models (VFMs) to create Mosaic3D-5.6M, a massive dataset of 3D mask-text pairs. The Mosaic3D model uses a two-stage training process: first, contrastive learning aligns the 3D encoder's point cloud features with text embeddings (L<sub>point</sub> = (1/K) Σ<sub>k=1</sub><sup>K</sup> Pool(s<sub>k</sub>, σ(Z, w<sub>k</sub>))); second, a lightweight mask decoder is trained for open-vocabulary 3D segmentation.
Caption: This image shows the two-stage training process of the Mosaic3D model. Stage 1 involves training a 3D encoder using contrastive learning to align point cloud features with text embeddings. Stage 2 trains a mask decoder to predict object instances from these language-aligned features, utilizing a combined loss function.
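As a rough sketch of the stage-1 alignment, the snippet below average-pools the 3D encoder's per-point features inside each mask and contrasts the pooled features against their caption embeddings. The pooling choice, loss form, and tensor shapes are assumptions rather than the exact L<sub>point</sub> defined above:

```python
import torch
import torch.nn.functional as F

def mask_pooled_alignment_loss(point_feats, masks, text_embs, temperature=0.07):
    """Illustrative stage-1 objective.
    point_feats: (N, d) per-point features from the 3D encoder,
    masks: (K, N) boolean masks for K region descriptions (assumed non-empty),
    text_embs: (K, d) caption embeddings from a frozen text encoder."""
    pooled = []
    for k in range(masks.size(0)):
        pooled.append(point_feats[masks[k]].mean(dim=0))    # average-pool points inside mask k
    pooled = F.normalize(torch.stack(pooled), dim=-1)        # (K, d) mask-level features
    text = F.normalize(text_embs, dim=-1)
    logits = pooled @ text.t() / temperature                 # (K, K) mask-caption similarities
    targets = torch.arange(masks.size(0), device=point_feats.device)
    return F.cross_entropy(logits, targets)                  # align each mask with its caption
```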
Mosaic3D achieved state-of-the-art results on open-vocabulary 3D semantic and instance segmentation benchmarks, surpassing previous methods by significant margins. This work demonstrates the power of scaling data and leveraging advanced 2D VFMs for complex 3D scene understanding tasks.
This newsletter has showcased a range of exciting advancements in multimodal image and text foundation models. From leveraging LLMs to guide diffusion models and enhance video generation, to efficiently adapting embeddings with contrastive learning and creating specialized models for underwater scenes, the field is rapidly evolving. The development of large-scale datasets like Mosaic3D-5.6M and innovative architectures like AquaticCLIP and Mosaic3D underscores the progress being made towards more robust, versatile, and efficient multimodal AI systems. The insights into inductive biases further our understanding of how these models learn and generalize. These advancements pave the way for a future where AI can seamlessly integrate and interpret information across modalities, opening up new possibilities in various domains.