This newsletter explores the cutting edge of multimodal AI, showcasing exciting new research in image and text generation, understanding, and forecasting. From enhancing fidelity in generated images to boosting temporal reasoning in video analysis and accelerating multimodal model inference, these papers offer a glimpse into the future of AI. Prepare to delve into novel architectures, benchmarks, and training strategies that push the boundaries of what's possible with multimodal models.
Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment by Minh-Quan Le, Gaurav Mittal, Tianjian Meng, A S M Iftekhar, Vishwas Suryanarayanan, Barun Patra, Dimitris Samaras, Mei Chen https://arxiv.org/abs/2502.05153
Caption: This diagram illustrates the architecture of Hummingbird, a novel diffusion model for multimodal context-aware image generation. It leverages a Multimodal Context Evaluator with a Multimodal Large Language Model (MLLM) to generate a context description, which guides a UNet Denoiser alongside image and text embeddings from CLIP encoders to produce a generated image. This process is optimized using global semantic and fine-grained consistency rewards to ensure high fidelity and diversity in the generated output.
Generating diverse synthetic images while preserving specific scene attributes has been a persistent challenge, especially in scene-aware tasks like Visual Question Answering (VQA) and Human-Object Interaction (HOI) reasoning. Existing diffusion models often struggle to maintain fidelity (accurate attribute preservation) while maximizing diversity (generating visually distinct images). Hummingbird, a novel diffusion-based image generator, addresses this trade-off by leveraging multimodal context alignment.
Hummingbird takes a multimodal context as input—a reference image and accompanying text guidance (e.g., a question about the image). At its core is a novel Multimodal Context Evaluator, utilizing a Multimodal Large Language Model (MLLM) to generate a detailed textual description of relevant scene attributes. This description, along with the reference image, fine-tunes a pre-trained SDXL diffusion model. The evaluator simultaneously maximizes two novel rewards: a Global Semantic Reward, ensuring overall scene context alignment via cosine similarity between image and text features, and a Fine-grained Consistency Reward, capturing detailed multimodal alignment at the token level using BLIP-2's QFormer. The total loss function combines these rewards: L<sub>total</sub> = -(λ<sub>1</sub>R<sub>global</sub> + λ<sub>2</sub>R<sub>fine-grained</sub>).
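To make the combined objective concrete, here is a minimal sketch of how the two rewards could be folded into one loss. It is not the authors' implementation: the function names and default weights are illustrative assumptions, the feature vectors are plain Python lists rather than real CLIP embeddings, and the fine-grained reward is passed in as a precomputed number because the Q-Former scoring is not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


def total_loss(image_feat, text_feat, fine_grained_reward,
               lambda_1=1.0, lambda_2=1.0):
    """L_total = -(lambda_1 * R_global + lambda_2 * R_fine_grained).

    R_global is the cosine similarity between image and text features.
    fine_grained_reward is a hypothetical stand-in for the token-level
    score; lambda_1 and lambda_2 are assumed weights.
    """
    r_global = cosine_similarity(image_feat, text_feat)
    return -(lambda_1 * r_global + lambda_2 * fine_grained_reward)


# Perfectly aligned features give R_global = 1, so the loss is -(1 + 0.5).
print(total_loss([1.0, 0.0], [1.0, 0.0], fine_grained_reward=0.5))
```

Because the loss is the negated weighted sum, minimizing it pushes both rewards up at the same time; the two weights set how much each one counts.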
A new benchmark leveraging the MME Perception and Bongard HOI datasets rigorously evaluates Hummingbird. MME Perception employs Test-Time Augmentation (TTA) with real and synthetic images to evaluate fidelity on attributes like spatial existence, count, position, color, and scene. Bongard HOI uses Test-time Prompt Tuning (TPT) to assess fidelity in capturing complex human-object interactions. Diversity is measured using feature-based distance metrics. Experimental results show that Hummingbird achieves higher fidelity than existing methods while maintaining diversity, validating its potential for complex visual tasks.
Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting by Siru Zhong, Weilin Ruan, Ming Jin, Huan Li, Qingsong Wen, Yuxuan Liang https://arxiv.org/abs/2502.04395
Caption: The diagram illustrates the architecture of Time-VLM, a multimodal time series forecasting framework. It highlights the three key learners: Retrieval-Augmented Learner (RAL), Vision-Augmented Learner (VAL), and Text-Augmented Learner (TAL), each processing temporal, visual, and textual data respectively. These components interact with pre-trained Vision-Language Models (VLMs) and a gated fusion mechanism to generate predictions.
Augmenting time series forecasting with text or vision has shown promise, but existing methods face limitations. Text often lacks temporal detail, while vision lacks semantic context. Time-VLM addresses this by leveraging pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities, marking the first attempt to integrate all three using VLMs for this task.
Time-VLM comprises three core components. The Retrieval-Augmented Learner (RAL) extracts enriched temporal features via patch-based processing and memory bank interactions, capturing local and global dependencies. The Vision-Augmented Learner (VAL) encodes time series as images using multi-scale convolution, frequency, and periodic encoding, preserving fine-grained details and high-level structures. The Text-Augmented Learner (TAL) generates contextual textual descriptions, including statistics, domain context, and image descriptions. These modules collaborate with frozen pre-trained VLMs to produce multimodal embeddings, fused with temporal features for final prediction through a gated mechanism: G = σ(W<sub>g</sub>[F<sub>tem</sub>; F<sub>mm</sub>] + b<sub>g</sub>), F<sub>fused</sub> = G ⊙ F<sub>attn</sub> + (1 − G) ⊙ F<sub>mm</sub>, where F<sub>tem</sub> are temporal memory embeddings, F<sub>mm</sub> are multimodal embeddings, F<sub>attn</sub> is the feature that the gate blends with F<sub>mm</sub>, W<sub>g</sub> and b<sub>g</sub> are the gate's learnable parameters, ⊙ denotes element-wise multiplication, and G is the learned gate.
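The gated fusion step is easy to trace by hand. The sketch below follows the formula above on plain Python lists; the function names, the list-of-rows weight layout, and the toy inputs are assumptions made for illustration, not the paper's code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(f_tem, f_mm, f_attn, w_g, b_g):
    """Compute G = sigmoid(W_g [F_tem; F_mm] + b_g) and
    F_fused = G * F_attn + (1 - G) * F_mm, element-wise.

    w_g is a list of d rows, each of length len(f_tem) + len(f_mm);
    b_g, f_attn and f_mm all have length d (assumed shapes).
    """
    concat = list(f_tem) + list(f_mm)  # [F_tem; F_mm]
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(w_g, b_g)
    ]
    fused = [g * a + (1.0 - g) * m for g, a, m in zip(gate, f_attn, f_mm)]
    return gate, fused


# With zero weights and bias the gate is 0.5 everywhere, so the fused
# feature is the plain average of F_attn and F_mm.
zero_w = [[0.0] * 4 for _ in range(2)]
gate, fused = gated_fusion(
    f_tem=[0.3, -0.2], f_mm=[0.0, 2.0], f_attn=[2.0, 4.0],
    w_g=zero_w, b_g=[0.0, 0.0],
)
print(gate, fused)
```

A gate value near 1 lets the first term dominate, while a value near 0 passes the multimodal embedding through almost unchanged.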
Extensive experiments demonstrate Time-VLM's superior performance, particularly in few-shot and zero-shot scenarios. Ablation studies highlight each component's importance, with the RAL demonstrating the most significant impact. Time-VLM establishes a new direction for multimodal time series forecasting, showcasing the potential of VLMs in this domain.
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation by Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, Zhijie Deng https://arxiv.org/abs/2502.05415
Show-o, a unified model for multimodal understanding and generation, has shown promise but suffers from inference inefficiency. Show-o Turbo addresses this by unifying the generation process and shortening the denoising trajectories.
The key idea is to treat text generation as a denoising process, akin to image generation. By using parallel decoding algorithms like Jacobi Decoding, Show-o Turbo establishes a unified denoising perspective for both modalities, enabling the application of Consistency Distillation (CD). This trains Show-o Turbo to map any point on Show-o's sampling trajectory to the same endpoint, shortening the trajectory for faster generation. Trajectory segmentation and curriculum learning further improve training convergence.
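For readers new to Jacobi Decoding, the toy sketch below shows why it can be read as a denoising trajectory: every position is refined in parallel from an arbitrary initial guess until the sequence stops changing, and the fixed point matches ordinary greedy autoregressive decoding. The `next_token` function is a made-up deterministic stand-in for a model's greedy prediction, and all names here are illustrative assumptions.

```python
def next_token(prefix):
    # Hypothetical stand-in for a model's greedy next-token choice.
    return (sum(prefix) * 31 + len(prefix)) % 7


def autoregressive_decode(prompt, n):
    """Standard left-to-right greedy decoding of n tokens."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt, n, init_token=0):
    """Refine all n positions in parallel until a fixed point is reached.

    Returns the final tokens and the trajectory of intermediate guesses.
    After iteration k the first k tokens are already correct, so at most
    n iterations are needed.
    """
    guess = [init_token] * n
    trajectory = [list(guess)]
    for _ in range(n):
        new_guess = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        trajectory.append(new_guess)
        if new_guess == guess:
            break
        guess = new_guess
    return guess, trajectory


tokens, trajectory = jacobi_decode([3, 1, 4], n=6)
print(tokens == autoregressive_decode([3, 1, 4], n=6))
for step, state in enumerate(trajectory):
    print(step, state)
```

Each row of the printed trajectory is one "noisy" intermediate state. Consistency distillation, as described above, trains the student to jump from any such state to the final row in fewer steps.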
Show-o Turbo achieves impressive results. In text-to-image generation, it outperforms the original Show-o with fewer steps and without classifier-free guidance (CFG). In image-to-text generation, it exhibits a 1.5x speedup without significant performance loss. The consistency loss is formulated as: L = E<sub>k~U(0,K)</sub> d(p<sub>Φ</sub>(·|u<sup>k</sup>, v), p<sub>θ</sub>(·|u<sup>K</sup>, v)), where p<sub>Φ</sub> is the student (Show-o Turbo), p<sub>θ</sub> is the teacher (Show-o), u<sup>k</sup> are the image tokens at step k, v are the text tokens, u<sup>K</sup> are the image tokens at the final step K, and d is a divergence measure. This work offers a promising direction for efficient multimodal model design.
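As a rough illustration of the consistency loss, the sketch below averages a divergence between the student's prediction at each trajectory point and the teacher's prediction at the endpoint. Using KL divergence for d, averaging over all steps instead of sampling k, and representing each prediction as a single categorical distribution are all simplifying assumptions for this example.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def consistency_loss(student_dists, teacher_endpoint):
    """Average d(student at step k, teacher at final step K) over k.

    student_dists[k] is a hypothetical student distribution predicted
    from the trajectory point u^k; teacher_endpoint is the teacher's
    distribution at the endpoint u^K. d is taken to be KL divergence.
    """
    losses = [kl_divergence(s, teacher_endpoint) for s in student_dists]
    return sum(losses) / len(losses)


teacher = [0.7, 0.2, 0.1]
early_student = [0.4, 0.3, 0.3]   # far from the endpoint
late_student = [0.65, 0.25, 0.1]  # close to the endpoint
print(consistency_loss([early_student, late_student], teacher))
```

The loss reaches zero only when the student predicts the teacher's endpoint distribution from every point on the trajectory, which is the property that lets sampling stop after a few steps.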
Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding by Xingjian Diao, Chunhui Zhang, Weiyi Wu, Zhongyu Ouyang, Peijun Qing, Ming Cheng, Soroush Vosoughi, Jiang Gui https://arxiv.org/abs/2502.06020
While Multimodal Foundation Models (MFMs) excel in tasks like video captioning and question answering, their limited internal capacity hinders processing extended temporal sequences. Temporal Working Memory (TWM) addresses this by selectively retaining task-relevant information. Inspired by human working memory, TWM enhances MFMs' temporal modeling by preserving critical details during video and audio processing.
TWM utilizes query-guided attention to focus on the most informative segments within temporal sequences. A search engine mechanism, guided by the input query, identifies and retains key frames and audio segments. Frame selection is based on a Similarity Score: S(vᵢ) = α₁D(vᵢ) + α₂R(vᵢ, q), where D(vᵢ) is the distinctiveness of frame vᵢ, R(vᵢ, q) is its relevance to the query q, and α₁ and α₂ are adaptive weights. Audio segments are selected using visual embeddings as queries, and cross-modal alignment is optimized with InfoNCE loss.
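The frame-selection score can be sketched in a few lines. This is a sketch under stated assumptions, not the paper's definition: distinctiveness is taken as one minus a frame's mean cosine similarity to the other frames, relevance is cosine similarity to the query embedding, and the weights are fixed instead of adaptive.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def frame_scores(frames, query, alpha_1=0.5, alpha_2=0.5):
    """S(v_i) = alpha_1 * D(v_i) + alpha_2 * R(v_i, q) for every frame."""
    scores = []
    for i, frame in enumerate(frames):
        others = [cosine(frame, f) for j, f in enumerate(frames) if j != i]
        mean_sim = sum(others) / len(others) if others else 0.0
        distinctiveness = 1.0 - mean_sim      # assumed form of D(v_i)
        relevance = cosine(frame, query)      # assumed form of R(v_i, q)
        scores.append(alpha_1 * distinctiveness + alpha_2 * relevance)
    return scores


def select_frames(frames, query, k, alpha_1=0.5, alpha_2=0.5):
    """Return the indices of the k best-scoring frames in temporal order."""
    scores = frame_scores(frames, query, alpha_1, alpha_2)
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # two near-duplicates, one distinct
query = [0.0, 1.0]
print(frame_scores(frames, query))
print(select_frames(frames, query, k=1))
```

In this toy input the third frame is both the most distinctive and the most relevant to the query, so it is retained first, while the two duplicate frames receive the same low score.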
Integrating TWM into nine state-of-the-art MFMs resulted in significant performance improvements across tasks like audio-visual question answering, video captioning, and video-text retrieval. Ablation studies confirmed the effectiveness of TWM's components—Visual Working Memory (VWM) and Auditory Working Memory (AWM). TWM proves valuable for enhancing multimodal temporal reasoning in MFMs.
Goku: Flow Based Video Generative Foundation Models by Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu https://arxiv.org/abs/2502.04896
Goku, a family of joint image-and-video generation models, achieves industry-leading performance using rectified flow Transformers. Focusing on data curation, model architecture, flow formulation, and training infrastructure, Goku generates high-quality visual content.
A robust data pipeline with aesthetic filtering, OCR analysis, and subjective evaluations built a massive dataset of video-text and image-text pairs. Goku's architecture features a 3D joint image-video VAE compressing both inputs into a shared latent space, enabling seamless joint training. The model uses plain full attention, Patch n' Pack, 3D RoPE embeddings, and Q-K normalization. Training is multi-stage, starting with text-semantic pairing, followed by joint training, and modality-specific fine-tuning.
Goku's innovation lies in its use of rectified flow (RF), using linear interpolation: x<sub>t</sub> = t · x<sub>1</sub> + (1 − t) · x<sub>0</sub>, where x<sub>t</sub> is the training sample, x<sub>1</sub> is real data, x<sub>0</sub> is noise, and t is the interpolation coefficient. The model predicts velocity (v<sub>t</sub> = dx<sub>t</sub>/dt), guiding transformations towards real data during inference. Robust infrastructure with advanced parallelism, checkpointing, and fault tolerance supports training.
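The rectified flow recipe is compact enough to sketch end to end. Differentiating the interpolation with respect to t gives a constant target velocity of x<sub>1</sub> − x<sub>0</sub>, and sampling integrates the predicted velocity from noise (t = 0) to data (t = 1). The code below is an illustrative toy on Python lists: the function names are assumptions, simple Euler integration is assumed as the sampler, and an oracle velocity function stands in for the trained Transformer.

```python
def interpolate(x0, x1, t):
    """x_t = t * x_1 + (1 - t) * x_0, element-wise."""
    return [t * a1 + (1.0 - t) * a0 for a0, a1 in zip(x0, x1)]


def velocity_target(x0, x1):
    """d x_t / dt for the linear path above, which is x_1 - x_0."""
    return [a1 - a0 for a0, a1 in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.0, 2.0]   # x_0
data = [1.0, 0.0]    # x_1


def oracle(x, t):
    # Stand-in for a trained model: returns the exact target velocity.
    return velocity_target(noise, data)


print(interpolate(noise, data, 0.5))
print(euler_sample(noise, oracle, num_steps=4))
```

With the exact velocity the path is a straight line, so even a handful of Euler steps lands on the data point. A learned velocity field is only approximately straight, which is why real samplers use more steps.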
Goku achieves state-of-the-art results on benchmarks like GenEval, DPG-Bench, and VBench, surpassing leading models in both text-to-image and text-to-video generation. Goku represents a significant advancement in generative AI, demonstrating the power of rectified flow with transformers.
EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models by He Hu, Yucheng Zhou, Lianzhong You, Hongbo Xu, Qianning Wang, Zheng Lian, Fei Richard Yu, Fei Ma, Laizhong Cui https://arxiv.org/abs/2502.04424
Caption: The radar chart visualizes the performance of several Multimodal Large Language Models (MLLMs) across 13 emotion-centric tasks within EmoBench-M, a new benchmark designed to assess MLLM Emotional Intelligence (EI). The chart compares the models' accuracy against human performance and a random baseline, highlighting the gap between MLLM and human EI capabilities. Example tasks from the benchmark, such as recognizing emotion from video dialogues and detecting sarcasm, are showcased alongside the chart.
Embedding emotional intelligence (EI) into Multimodal Large Language Models (MLLMs) is crucial for human-robot interaction and AI applications. Existing benchmarks, however, overlook the dynamic, multimodal nature of emotional expressions. EmoBench-M addresses this by using video and audio data, providing a more realistic evaluation of MLLM EI across three dimensions: Foundational Emotion Recognition, Conversational Emotion Understanding, and Socially Complex Emotion Understanding.
Evaluating open-source and closed-source MLLMs on 13 diverse scenarios revealed a significant performance gap between MLLMs and humans: Gemini-2.0-Flash performed best among the tested models, yet it still fell well short of human performance. The results emphasize the need for further development in MLLM EI, particularly in understanding complex social cues and nuanced emotions.
This newsletter highlights the rapid advancements in multimodal image and text foundation models. From generating high-fidelity images with context awareness to leveraging VLMs for time series forecasting and accelerating multimodal inference, these papers showcase innovative approaches to complex challenges. The introduction of new benchmarks like EmoBench-M underscores the ongoing effort to evaluate and improve MLLMs' emotional intelligence, a critical aspect for future AI applications. The research presented here paves the way for more robust, efficient, and emotionally intelligent multimodal models, pushing the boundaries of AI capabilities and opening exciting possibilities for future research and applications.