This newsletter explores the cutting edge of multimodal AI, focusing on enhancing the capabilities of image and text foundation models. We'll delve into four recent papers that tackle key challenges in this field, from controllable data synthesis and multi-granular visual generation to novel benchmark development and knowledge transfer techniques. Prepare to uncover innovative approaches for boosting the performance, robustness, and adaptability of these powerful models.
CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning by Qingqing Cao, Mahyar Najibi, Sachin Mehta https://arxiv.org/abs/2410.11963
Training robust vision and multimodal foundation models like CLIP heavily relies on massive, often imperfect datasets. CtrlSynth offers a solution through controllable image-text synthesis, addressing the limitations of existing synthetic data augmentation methods. The core innovation lies in its ability to decompose visual semantics into basic elements, allowing users to apply specific control policies (e.g., remove, add, or replace operations) and recompose these elements into diverse and natural synthetic images and texts.
Leveraging pre-trained foundation models like LLMs and diffusion models, CtrlSynth operates in a closed-loop, training-free, and modular fashion. A vision tagging model (VTM) extracts visual tags (objects, attributes, relationships) from an image. A text controller, guided by user-defined policies, uses these tags to generate instructions for the LLM, which synthesizes new text descriptions. An image controller then guides a text-to-image model to generate synthetic images from these prompts. This closed-loop design allows for iterative refinement and filtering of low-quality samples.
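To make the closed loop concrete, here is a minimal Python sketch of one synthesis round. It is illustrative only: the callables `vision_tagger`, `llm`, `text_to_image`, and `clip_filter` are hypothetical wrappers around whatever pre-trained models are plugged in, and `policy` stands in for a user-defined control policy; this is not the authors' implementation.

```python
# Illustrative sketch of the CtrlSynth closed loop (not the authors' code).
# Assumes generic wrappers around a vision tagging model, an LLM,
# a text-to-image diffusion model, and a CLIP-based quality filter.

def ctrlsynth_sample(image, policy, vision_tagger, llm, text_to_image, clip_filter,
                     max_rounds=3, min_score=0.25):
    """Generate one (synthetic_text, synthetic_image) pair under a control policy."""
    # 1. Decompose visual semantics into basic elements.
    tags = vision_tagger(image)        # e.g. {"objects": [...], "attributes": [...], "relations": [...]}

    # 2. Text controller: apply user-defined edit operations to the tags.
    edited_tags = policy.apply(tags)   # e.g. remove an object, add an attribute, replace a relation

    for _ in range(max_rounds):
        # 3. The LLM recomposes the edited elements into a natural caption.
        prompt = policy.build_instruction(edited_tags)
        synthetic_text = llm(prompt)

        # 4. Image controller: a text-to-image model renders the new caption.
        synthetic_image = text_to_image(synthetic_text)

        # 5. Closed loop: self-filter low-quality samples and retry if needed.
        score = clip_filter(synthetic_image, synthetic_text)
        if score >= min_score:
            return synthetic_text, synthetic_image

    return None  # discard the sample if no round passes the filter
```

Because every stage is a swappable callable, the same loop works whether the synthesis path is text-only, image-only, or both.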
Extensive experiments across 31 datasets showcase CtrlSynth's impact. In zero-shot image classification, CtrlSynth boosts accuracy by 2.5% to 9.4% on CLIP models trained on CC3M and CC12M. Image-text retrieval improves markedly, with Recall@1 on Flickr rising by up to 24% and 36%, and CC3M-trained models gaining 23.4% on average. Compositional reasoning also benefits significantly, with accuracy gains of 4.5% for CC3M and 3% for CC12M on the SugarCrepe benchmark. Furthermore, CtrlSynth shines in long-tail tasks, improving tail-class accuracy by 21.3% on ImageNet-LT and 16.2% on Places-LT.
Ablation studies confirm the importance of CtrlSynth's components. A stronger LLM is crucial for high-quality text generation, while incorporating visual relations significantly enhances compositional reasoning. The closed-loop design with self-filtering further refines sample quality. CtrlSynth's flexibility, allowing for easy swapping of pre-trained models and adaptation to different synthesis paths, makes it a versatile tool for data augmentation in multimodal learning.
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation by Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu https://arxiv.org/abs/2410.13861
Existing MLLMs often struggle to balance the need for diversity in text-to-image generation with the precise controllability required for tasks like image editing. PUMA introduces a novel approach using multi-granular visual features as both inputs and outputs, addressing the varying granularity demands of diverse image generation tasks within a unified MLLM framework.
PUMA incorporates a multi-granular image encoder (based on CLIP) extracting features at different resolutions, dedicated diffusion-based image decoders fine-tuned on SDXL for various granularity levels, and an autoregressive MLLM. The MLLM is trained with a combined loss function:
L = −Σᵢ log P(tᵢ | t<ᵢ, F<ᵢ) + Σᵢ αᵢ Σⱼ ‖f̂ᵢ,ⱼ − fᵢ,ⱼ‖²

where tᵢ denotes the i-th text token, f̂ᵢ,ⱼ and fᵢ,ⱼ are the predicted and ground-truth feature tokens at granularity level i, and αᵢ weights the contribution of each granularity level. This combined loss balances text prediction against image feature regression.
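For intuition, the objective amounts to standard next-token cross-entropy over text plus a weighted regression over image feature tokens at each granularity. The PyTorch sketch below is an assumed implementation for illustration; the argument names, tensor shapes, and `alphas` weights are placeholders rather than PUMA's actual code.

```python
import torch
import torch.nn.functional as F

def puma_style_loss(text_logits, text_targets, pred_feats, gt_feats, alphas):
    """Combined autoregressive loss: text cross-entropy + weighted feature regression.

    text_logits:  (B, T, V)   next-token logits from the MLLM
    text_targets: (B, T)      ground-truth text token ids
    pred_feats:   list of (B, N_i, D_i) predicted feature tokens, one per granularity i
    gt_feats:     list of (B, N_i, D_i) encoder (ground-truth) feature tokens
    alphas:       list of scalars weighting each granularity level
    """
    # Negative log-likelihood over text tokens (the −Σ log P term).
    text_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )

    # Σᵢ αᵢ Σⱼ ‖f̂ᵢ,ⱼ − fᵢ,ⱼ‖² over image feature tokens
    # (mean-reduced MSE stands in for the summed squared error here).
    feat_loss = sum(
        a * F.mse_loss(pred, gt)
        for a, pred, gt in zip(alphas, pred_feats, gt_feats)
    )
    return text_loss + feat_loss
```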
A two-stage training process, involving multimodal pretraining and task-specific instruction tuning, allows PUMA to acquire broad capabilities and then specialize. Evaluation across various tasks demonstrates its effectiveness. On ImageNet validation, PUMA achieves a PSNR of 18.16 and LPIPS of 0.2215 for fine-grained image reconstruction, exceeding existing methods. On MSCOCO 30K, it shows improved text-to-image generation (CLIP-I: 0.736, CLIP-T: 0.317). For image editing on Emu-Edit, PUMA surpasses the state-of-the-art (CLIP-T: 0.270) while maintaining strong preservation (CLIP-I: 0.846, DINO: 0.785). PUMA also demonstrates competitive performance on image understanding benchmarks. These results highlight the power of the multi-granular approach for unifying visual understanding and generation.
FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion by Jiacheng Ruan, Yebin Yang, Zehao Lin, Feiyu Xiong, Zeyun Tang, Zhiyu Li https://arxiv.org/abs/2410.12564
FTII-Bench introduces a challenging task: Flow Text with Image Insertion, requiring LVLMs to integrate image comprehension, instruction understanding, and long-text interpretation. Given flowing text paragraphs and candidate images, the model must select the most appropriate image to insert after each paragraph.
The benchmark leverages 318 Chinese and 307 English news articles, providing a gold standard for image-text sequencing. It includes two question types: single-choice (four difficulty levels) and flow-insertion (three difficulty levels), offering granular performance assessment.
[Figure: Examples from FTII-Bench illustrating the image-insertion task. Given a text paragraph and a set of candidate images (A–H), the model must select the most appropriate image to insert, along with its confidence; the ground truth is shown for comparison. The examples highlight the challenge of aligning images with an evolving textual narrative.]
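To make the selection step concrete, a CLIP-style baseline (one of the model families the benchmark evaluates) can embed the current paragraph and every candidate image and pick the highest-similarity match. The sketch below uses the Hugging Face `transformers` CLIP API; the checkpoint choice and overall setup are illustrative assumptions, not the benchmark's official evaluation code.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Hypothetical CLIP baseline for flow insertion: pick the candidate image
# whose embedding best matches the current paragraph.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pick_image(paragraph, candidate_images):
    """Return the index of the best-matching candidate image for a paragraph."""
    inputs = processor(
        text=[paragraph], images=candidate_images,
        return_tensors="pt", padding=True, truncation=True,
    )
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text has shape (1, num_images): similarity of the paragraph to each image.
    return out.logits_per_text.argmax(dim=-1).item()
```

Note that the paragraph is truncated to CLIP's 77-token text limit, which hints at why long flowing text is difficult for CLIP-style models on this benchmark.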
Evaluation of various LVLMs and CLIP-based models revealed that even state-of-the-art models, like GPT-4o, struggle with this task. GPT-4o achieved only 61.0% accuracy on the hardest single-choice questions. In flow-insertion tasks, many LVLMs exhibited near-random performance, highlighting their limitations in handling long text, complex instructions, and multiple images. FTII-Bench offers a valuable resource for pushing the boundaries of LVLM evaluation and driving further research.
TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration by Yiwei Guo, Shaobin Zhuang, Kunchang Li, Yu Qiao, Yali Wang https://arxiv.org/abs/2410.12183
TransAgent addresses the domain shift challenge in vision-language models by leveraging the diverse knowledge of 11 heterogeneous agents through multi-source knowledge distillation. These agents include models specialized in various aspects of vision and language processing. A Mixture-of-Agents (MoA) gating mechanism dynamically weights agent contributions based on the target domain.
A unified distillation process transfers knowledge from these agents to a CLIP-like model. Visual features from vision agents are gated using MoA and distilled to CLIP's visual prompts. Textual knowledge from language agents is similarly distilled to enhance CLIP's textual representations. For multi-modal agents, score vectors are extracted, gated, and used to align CLIP's visual and textual prompts. The overall loss function combines cross-entropy and distillation losses:
L_TransAgent = L_CE + λ₁·L_VAC + λ₂·L_LAC + λ₃·L_MAC
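As a rough sketch of how the gating and the combined objective fit together (an assumption for illustration, not the authors' released code), the MoA gate can be modeled as a learned softmax over agents that fuses their features into a single distillation target, with the three collaboration losses instantiated as feature-matching and score-alignment terms:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoAGate(nn.Module):
    """Mixture-of-Agents gating: softmax-weighted fusion of per-agent features.

    Simplified sketch: agent_feats is (num_agents, B, D). The paper conditions
    the gate on the target domain; a plain learnable vector stands in for that here."""
    def __init__(self, num_agents):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_agents))

    def forward(self, agent_feats):
        w = F.softmax(self.logits, dim=0)                   # (num_agents,)
        return torch.einsum("a,abd->bd", w, agent_feats)    # fused target, (B, D)

def transagent_loss(ce_loss, clip_visual_prompt, vision_target,
                    clip_text_feat, lang_target,
                    clip_scores, agent_scores,
                    lambdas=(1.0, 1.0, 1.0)):
    """L_TransAgent = L_CE + λ1·L_VAC + λ2·L_LAC + λ3·L_MAC (sketch).

    Vision/language collaboration terms match CLIP's prompts to gated agent
    features; the multi-modal term aligns score distributions via KL divergence."""
    l_vac = F.mse_loss(clip_visual_prompt, vision_target)
    l_lac = F.mse_loss(clip_text_feat, lang_target)
    l_mac = F.kl_div(F.log_softmax(clip_scores, dim=-1),
                     F.softmax(agent_scores, dim=-1),
                     reduction="batchmean")
    return ce_loss + lambdas[0] * l_vac + lambdas[1] * l_lac + lambdas[2] * l_mac
```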
Experimental results on 11 visual recognition benchmarks demonstrate TransAgent's effectiveness. It outperforms state-of-the-art methods like CoOp, achieving around 10% average improvement and 20% on EuroSAT. TransAgent remains efficient in deployment as the external agents are unloaded after training.
This newsletter has highlighted various innovative approaches to enhance multimodal image and text foundation models. From generating controllable synthetic data with CtrlSynth to leveraging multi-granular features with PUMA, and from establishing challenging benchmarks like FTII-Bench to transferring knowledge from heterogeneous agents with TransAgent, these papers offer valuable insights into pushing the boundaries of multimodal AI. These advancements promise more robust, adaptable, and efficient models capable of tackling complex real-world applications.