The convergence of vision and language in AI is rapidly reshaping how we interact with and generate information. This newsletter dives into the latest advancements in multimodal image and text foundation models, exploring novel architectures, training strategies, and applications that are pushing the boundaries of AI-driven creativity and understanding. From precise engineering design synthesis to the intricacies of image-text communication within VLMs, this collection of papers offers a glimpse into the exciting future of multimodal AI.
Parametric-ControlNet: Multimodal Control in Foundation Models for Precise Engineering Design Synthesis by Rui Zhou, Yanxia Zhang, Chenyang Yuan, Frank Permenter, Nikos Arechiga, Matt Klenk, Faez Ahmed https://arxiv.org/abs/2412.04707
Caption: The architecture of Parametric-ControlNet fuses parametric data, assembly graphs, component images, and text descriptions through specialized encoders and a multimodal fusion module. This joint embedding conditions a ControlNet-like module, which modifies a foundation model (e.g., Stable Diffusion) to generate engineering drawings adhering to the multimodal input specifications. This approach allows for precise control over generated designs, enabling accurate reflection of input parameters, component arrangements, and textual descriptions.
Engineering design is being revolutionized by AI, with generative models leading the charge. However, current models, while adept at creating realistic images, often lack the precision and adherence to specifications crucial for engineering. Parametric-ControlNet addresses this limitation by offering multimodal control over text-to-image foundation models like Stable Diffusion, specifically designed for engineering design synthesis. This innovative approach goes beyond text prompts, incorporating parametric data, assembly graphs, and component images, empowering designers with unprecedented control.
The architecture revolves around a ControlNet-like module that modifies the foundation model layer by layer. This module receives input from a parametric encoder (with diffusion-based autocompletion for handling incomplete data), a component encoder (processing assembled component images based on assembly graphs), and a CLIP text encoder for semantic understanding. A multimodal fusion module with attention layers synthesizes these diverse inputs into a joint embedding, which then conditions the generative process. This allows the model to tackle complex engineering design tasks involving specific functional requirements and spatial constraints—tasks beyond the capabilities of existing models.
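To make the fusion step concrete, here is a minimal sketch of how attention-based fusion over the three encoder outputs could produce a joint conditioning embedding; the dimensions and attention configuration are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Toy fusion module: attends over parametric, component, and text embeddings
    and produces a single joint conditioning vector for a ControlNet-like branch."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, param_emb, comp_emb, text_emb):
        # Stack the per-modality embeddings as a short token sequence: (B, 3, D)
        tokens = torch.stack([param_emb, comp_emb, text_emb], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)   # self-attention across modalities
        fused = self.norm(fused + tokens)              # residual connection + norm
        return self.proj(fused.mean(dim=1))            # pooled joint embedding (B, D)

# Example usage with random stand-ins for the three encoder outputs
fusion = MultimodalFusion()
joint = fusion(torch.randn(2, 768), torch.randn(2, 768), torch.randn(2, 768))
print(joint.shape)  # torch.Size([2, 768])
```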
Evaluated on a bike design problem using the BIKED dataset augmented with synthetic data, Parametric-ControlNet demonstrated impressive accuracy in reflecting input parameters, component images, and text descriptions, even when those inputs conflicted. Quantitative metrics confirm its strong performance: R² scores reach up to 0.98 for parametric accuracy and Intersection over Component (IoC) scores average above 0.8 for component fidelity, with standard image quality metrics such as PSNR and SSIM also reported. Furthermore, a Diversity Score was introduced to showcase the model's ability to generate a range of unique designs: it was 0.16 with the imputation model and 0.12 without, demonstrating that the imputation capability increases design variety. This model paves the way for a new era of engineering design, combining creative flexibility with technical precision.
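The paper's exact Diversity Score formulation isn't reproduced here; purely as an illustration, design diversity is often quantified as the mean pairwise distance between embeddings of the generated samples.

```python
import torch

def mean_pairwise_diversity(embeddings: torch.Tensor) -> float:
    """Illustrative diversity measure: average pairwise cosine distance between
    embeddings of generated designs, shape (N, D). Not the paper's exact metric."""
    normed = torch.nn.functional.normalize(embeddings, dim=-1)
    sims = normed @ normed.T                      # (N, N) cosine similarities
    n = sims.shape[0]
    mask = ~torch.eye(n, dtype=torch.bool)        # ignore self-similarity on the diagonal
    return (1.0 - sims[mask]).mean().item()       # distance = 1 - similarity

designs = torch.randn(8, 512)                     # stand-in embeddings of 8 generated designs
print(round(mean_pairwise_diversity(designs), 3))
```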
The Narrow Gate: Localized Image-Text Communication in Vision-Language Models by Alessandro Serra, Francesco Ortu, Emanuele Panizon, Lucrezia Valeriani, Lorenzo Basile, Alessio Ansuini, Diego Doimo, Alberto Cazzaniga https://arxiv.org/abs/2412.06646
Caption: This figure visualizes the impact of ablating increasing numbers of layers on the cross-modal attention between image and text tokens (χ<sub>out,gt</sub>). It shows a steep decline in performance for Chameleon models (7B and 34B) as layers connected to the [EOI] token are ablated, while Pixtral-12B exhibits a more gradual decline, highlighting the localized information flow in multimodal-output models compared to the distributed processing in unimodal-output models.
This study investigates how Vision-Language Models (VLMs) process and transfer visual information to the textual domain. Researchers compared multimodal-output models (e.g., Chameleon, generating both images and text) with unimodal-output models (e.g., Pixtral, generating only text), focusing on differences in information flow during image understanding tasks. Using tools like cross-modal attention, neighborhood overlap, attention knockout, and activation patching, they probed the internal mechanisms of these models.
A key discovery was the contrasting image-text communication patterns. Chameleon relied on a highly localized flow, primarily through a single "narrow gate" token—the end-of-image token ([EOI]). Ablating this token drastically reduced performance on tasks like VQA and image captioning, with accuracy on VQAv2 dropping from 0.51 to 0.25 in Chameleon-7B and from 0.59 to 0.39 in Chameleon-34B. In contrast, Pixtral exhibited a distributed communication pattern, with information flowing through multiple image tokens. Ablating individual tokens in Pixtral had minimal impact, suggesting a more robust processing strategy.
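The attention-knockout idea behind these ablations can be sketched in a few lines: mask one key position (e.g., the [EOI] token) so that no query can attend to it. This is a conceptual sketch, not the authors' implementation.

```python
import torch

def knockout_attention(scores: torch.Tensor, blocked_key_idx: int) -> torch.Tensor:
    """Attention knockout, conceptually: prevent all queries from attending to one
    key position by masking its score before softmax.
    `scores` has shape (batch, heads, query_len, key_len)."""
    scores = scores.clone()
    scores[..., blocked_key_idx] = float("-inf")   # no probability mass on that key
    return torch.softmax(scores, dim=-1)

# Toy example: 1 batch, 2 heads, 4 queries, 4 keys; block key position 2
attn = knockout_attention(torch.randn(1, 2, 4, 4), blocked_key_idx=2)
print(attn[..., 2].max())  # ~0: the blocked key receives no attention
```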
Furthermore, manipulating the [EOI] token in Chameleon allowed researchers to steer the model's understanding of image semantics. Patching its activation changed the model's interpretation of an image and its textual description, demonstrating the power of targeted interventions to control global behavior. This raises possibilities for image editing and content creation, but also necessitates further investigation into potential vulnerabilities and biases.
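Activation patching of a single token can be approximated with a forward hook that overwrites the token's hidden state at a chosen layer with an activation captured from a different image. The function below is a hedged sketch; the model, layer handle, and input format are placeholders that depend on the actual VLM.

```python
def patch_token_activation(model, layer_module, token_idx, donor_activation, inputs):
    """Activation patching, conceptually: run `model` while overwriting the hidden
    state of one token (e.g., [EOI]) at `layer_module` with an activation taken
    from a different image. All arguments are placeholders for a real VLM setup."""
    def hook(module, hook_inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, token_idx, :] = donor_activation      # overwrite the gate token's state
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    handle = layer_module.register_forward_hook(hook)
    try:
        return model(**inputs)                          # patched forward pass
    finally:
        handle.remove()                                 # always detach the hook
```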
Multimodal Fact-Checking with Vision Language Models: A Probing Classifier based Solution with Embedding Strategies by Recep Firat Cekinel, Pinar Karagoz, Cagri Coltekin https://arxiv.org/abs/2412.05155
Caption: This diagram illustrates the two fusion methods used to evaluate Vision Language Models (VLMs) for fact-checking. The top section depicts intrinsic fusion, where embeddings are taken directly from the VLM after processing both text and image inputs. The bottom section shows extrinsic fusion, where separate text and image encoders generate embeddings that are then combined before classification. Both methods utilize a feed-forward veracity classifier to predict claim veracity.
This study explores the effectiveness of Vision Language Models (VLMs) in multimodal fact-checking. The key questions are whether visual information enhances performance compared to text-only models and how effectively VLMs utilize both modalities. The researchers propose a probing classifier approach, extracting embeddings from pre-trained VLMs (Qwen-VL, Idefics2, PaliGemma-3b) and feeding them into a neural classifier for veracity classification on the Mocheg and Factify2 datasets. The study compares intrinsic fusion (using VLM embeddings directly) with extrinsic fusion (combining embeddings from separate text and image encoders).
The methodology involves extracting last-hidden-layer representations from the VLMs and applying mean pooling to obtain a single embedding per instance. For extrinsic fusion, embeddings are extracted separately from text and image encoders. The probing classifier consists of two linear layers with ReLU activation and dropout, trained with a weighted cross-entropy loss to address class imbalance. Zero-shot experiments with text-only and multimodal models were also conducted. Finally, an ablation study compared the neural classifier with KNN and SVM baselines.
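A minimal sketch of this probing setup follows, with placeholder dimensions and example class weights (the actual values depend on the VLM and dataset).

```python
import torch
import torch.nn as nn

class ProbingClassifier(nn.Module):
    """Two linear layers with ReLU and dropout, as described for the veracity probe.
    Embedding size, hidden size, and number of classes are placeholders."""
    def __init__(self, embed_dim=2048, hidden_dim=512, num_classes=3, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)

# Mean-pool the VLM's last hidden layer into one embedding per instance,
# then train the probe with class-weighted cross-entropy to counter imbalance.
last_hidden = torch.randn(4, 128, 2048)          # (batch, seq_len, dim) stand-in
embedding = last_hidden.mean(dim=1)              # mean pooling over tokens
probe = ProbingClassifier()
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 1.5]))  # example class weights
loss = loss_fn(probe(embedding), torch.tensor([0, 1, 2, 1]))
```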
Results indicate that multimodality can improve fact-checking, with Idefics2-8b and LVLM4FV outperforming text-only counterparts in zero-shot tests. However, extrinsic fusion generally yielded superior results. On Mocheg, the highest F1-macro scores with extrinsic fusion were 0.514 (Qwen-VL) and 0.528 (Idefics2-8b). On Factify2, the highest scores were 0.629 (Qwen-VL), 0.670 (Idefics2-8b), and 0.590 (PaliGemma-3b). The neural classifier significantly outperformed KNN and SVM. The study suggests future research explore using VLMs as assistants to text-only models, providing summaries or justifications to enhance fact-checking rather than acting as primary fact-checkers.
Compositional Image Retrieval via Instruction-Aware Contrastive Learning by Wenliang Zhong, Weizhi An, Feng Jiang, Hehuan Ma, Yuzhi Guo, Junzhou Huang https://arxiv.org/abs/2412.05756
Caption: This diagram contrasts the InstructCIR model with previous State-of-the-Art (SOTA) models for Zero-Shot Composed Image Retrieval (ZS-CIR). InstructCIR leverages a Multimodal Large Language Model (MLLM) trained in two stages: first, aligning visual and textual modalities, and second, fine-tuning for instruction awareness using a triplet dataset of source images, modifier text, and target captions. Previous SOTA approaches, which rely on vision encoders, VLMs, and adapters driven by modifier text, struggle with the nuanced instruction-following aspect of ZS-CIR that InstructCIR addresses.
Zero-Shot Composed Image Retrieval (ZS-CIR)—finding images based on a source image and a modification instruction—is a challenging task. Existing CLIP-based models struggle with the inherent instruction-following aspect. This paper introduces InstructCIR, which leverages instruction-tuned Multimodal Large Language Models (MLLMs) to enhance ZS-CIR performance. A two-stage training strategy addresses the challenge of adapting MLLMs, designed for text generation, to embedding extraction.
The first stage aligns visual and textual modalities in a shared embedding space with contrastive learning, using a symmetric InfoNCE objective $L_1 = \tfrac{1}{2}(L_{i2c} + L_{c2i})$, where $L_{i2c} = -\log \frac{\exp(s(h_i, h_c))}{\exp(s(h_i, h_c)) + \sum_{c^-} \exp(s(h_i, h_{c^-}))}$ and $L_{c2i}$ is defined symmetrically for caption-to-image matching, with $s(\cdot,\cdot)$ a similarity score and $c^-$ ranging over negative captions. The second stage focuses on instruction awareness. Using a triplet dataset (CC3M-Instruct) of source image, modifier text, and target caption, the model is fine-tuned with another InfoNCE loss, $L_2 = -\log \frac{\exp(s(h_{it}, h_{ct}))}{\exp(s(h_{it}, h_{ct})) + \sum_{n} \exp(s(h_{it}, h_{n}))}$, where $h_{it}$ is the embedding of the source image combined with the modifier text, $h_{ct}$ the target-caption embedding, and $h_n$ the embeddings of negative captions.
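Both stages boil down to an InfoNCE objective over embedding pairs. The snippet below is a generic sketch with in-batch negatives and a conventional temperature, which the paper may configure differently.

```python
import torch
import torch.nn.functional as F

def info_nce(h_a: torch.Tensor, h_b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch: cosine similarities, in-batch negatives.
    The temperature and similarity choice are conventional assumptions."""
    h_a = F.normalize(h_a, dim=-1)
    h_b = F.normalize(h_b, dim=-1)
    logits = h_a @ h_b.T / temperature              # (B, B) similarity matrix
    targets = torch.arange(h_a.size(0))             # positives lie on the diagonal
    loss_a2b = F.cross_entropy(logits, targets)     # e.g., image -> caption
    loss_b2a = F.cross_entropy(logits.T, targets)   # caption -> image
    return 0.5 * (loss_a2b + loss_b2a)

# Stage 1: image vs. caption embeddings; stage 2 reuses the same form with
# (source image + modifier text) embeddings against target-caption embeddings.
loss = info_nce(torch.randn(16, 768), torch.randn(16, 768))
```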
Evaluated on FashionIQ, CIRR, GeneCIS, and CIRCO, InstructCIR achieves significant improvements. On CIRCO, it reaches an mAP@5 of 22.32%, a substantial 13.60% and 10.64% improvement over Pic2Word and SEARLE. On CIRR, it surpasses the state-of-the-art by 11.28% and 10.94% in R@1. Similar gains are observed on FashionIQ and GeneCIS. Ablation studies confirm the importance of both training stages and the triplet dataset, and show performance scales with larger MLLMs.
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance by Chunwei Wang, Guansong Lu, Junwei Yang, Runhui Huang, Jianhua Han, Lu Hou, Wei Zhang, Hang Xu https://arxiv.org/abs/2412.06673
Caption: This diagram illustrates the architecture of ILLUME, a unified multimodal large language model. (a) shows the MLLM architecture for understanding and generation tasks, using a vision encoder and adapter for understanding and a vision tokenizer for generation. (b) details the vision tokenizer, which uses a denoising U-Net and upsampling to reconstruct the input image from Gaussian noise conditioned on discrete image tokens, enabling efficient image-text alignment.
ILLUME is a unified multimodal large language model (MLLM) that integrates understanding and generation capabilities within a single architecture. Remarkably, it achieves competitive performance using only 15M image-text pairs for pretraining, significantly less than models like Janus. This efficiency is attributed to a semantic vision tokenizer and a progressive three-stage training procedure, including an image reconstruction task.
ILLUME extends existing VLMs with a vision vocabulary for generating discrete vision tokens. A UNIT encoder and vision adapter handle understanding tasks, aligning visual features with the LLM's input space. For generation, the vision tokenizer converts images into discrete indices, which the model predicts using a shared prediction head. The model is optimized with the standard language-modeling objective $L = -\sum_{i} \log P_\theta(y_i \mid y_{<i})$.
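Because image tokens share the LLM's prediction head, the generation loss is ordinary next-token cross-entropy over an extended vocabulary. A minimal sketch, with assumed vocabulary sizes:

```python
import torch
import torch.nn.functional as F

# Text and discrete vision tokens live in one extended vocabulary;
# the loss is standard cross-entropy over shifted targets.
vocab_size = 32000 + 8192                 # assumed text vocab + vision codebook sizes
logits = torch.randn(2, 64, vocab_size)   # (batch, seq_len, extended vocab)
tokens = torch.randint(0, vocab_size, (2, 64))

loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predict token i+1 from the prefix up to i
    tokens[:, 1:].reshape(-1),
)
```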
A novel self-enhancing multimodal alignment scheme further strengthens the synergy between understanding and generation. This involves corpus self-generation (generating images from text), assessment generation (evaluating consistency between generated images and original text), and supervised fine-tuning using these assessments.
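One round of this scheme might look roughly like the loop below; all interfaces (generate_image, assess_fn, finetune_fn) are hypothetical placeholders, not ILLUME's actual API.

```python
def self_enhance_round(mllm, prompts, assess_fn, finetune_fn):
    """Rough sketch of one self-enhancement round: generate images from text,
    have the model assess image-text consistency, then fine-tune on the
    assessed corpus. All function interfaces are placeholders."""
    corpus = []
    for prompt in prompts:
        image_tokens = mllm.generate_image(prompt)          # 1. corpus self-generation
        assessment = assess_fn(mllm, prompt, image_tokens)  # 2. assessment generation
        corpus.append({"prompt": prompt,
                       "image_tokens": image_tokens,
                       "assessment": assessment})
    return finetune_fn(mllm, corpus)                        # 3. supervised fine-tuning
```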
ILLUME excels on various benchmarks, outperforming other unified MLLMs and specialized models in understanding tasks. It achieves competitive results in image generation (FID of 7.76 on MJHQ30K) and image editing.
SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation by Leigang Qu, Haochuan Li, Wenjie Wang, Xiang Liu, Juncheng Li, Liqiang Nie, Tat-Seng Chua https://arxiv.org/abs/2412.05818
Text-image alignment, especially in compositional scenarios, remains a challenge for Large Multimodal Models (LMMs). SILMM (Self-Improving LMMs) introduces a model-agnostic iterative framework for self-feedback and alignment optimization. It operates in five steps: Compositional Prompt Generation; Diverse Image Generation (using nucleus sampling, or a novel DropDiv strategy for continuous LMMs); Decompositional Self-Questioning; VQA-based Self-Feedback; and Learning from Self-Feedback, via Direct Preference Optimization (DPO) or a novel Kernel-based Continuous DPO (KC-DPO) with the quadruplet objective $L_{\text{KC-DPO}} = -\mathbb{E}_{(x, H_w, H_l)\sim D}\big[\log \sigma\big(\gamma - k(H, H_w) + k(H_r, H_w) + k(H, H_l) - k(H_r, H_l)\big)\big]$.
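A hedged sketch of the KC-DPO objective as quoted above, using an RBF kernel as one possible choice of k (the paper's kernel and hyperparameters may differ):

```python
import torch

def kc_dpo_loss(H, H_r, H_w, H_l, gamma=0.1, sigma=1.0):
    """Kernel-based continuous DPO, mirroring the quadruplet objective quoted above.
    H: policy features; H_r: reference-model features; H_w / H_l: preferred /
    dispreferred features. The RBF kernel and hyperparameters are illustrative
    assumptions, not necessarily the paper's choices."""
    def rbf(a, b):
        return torch.exp(-((a - b) ** 2).sum(dim=-1) / (2 * sigma ** 2))

    margin = gamma - rbf(H, H_w) + rbf(H_r, H_w) + rbf(H, H_l) - rbf(H_r, H_l)
    return -torch.nn.functional.logsigmoid(margin).mean()

loss = kc_dpo_loss(*(torch.randn(4, 256) for _ in range(4)))
```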
Evaluated on T2I-CompBench++, DPG-Bench, and TIFA, SILMM demonstrates substantial improvements. On T2I-CompBench++, it achieves over 30% improvement, and around 20% on DPG-Bench. Discrete LMMs show greater self-improvement than continuous LMMs, likely due to DPO stability. While improvements are more pronounced in attribute categories, the results highlight the potential of self-improvement in LMMs for enhanced text-to-image generation.
This newsletter highlighted several key advancements in multimodal image and text foundation models. We've seen a push towards greater control and precision in design generation with Parametric-ControlNet, a deeper understanding of the intricacies of image-text communication within VLMs, and innovative approaches to leverage both visual and textual information for tasks like fact-checking and image retrieval. The development of efficient training strategies, like those employed by ILLUME, and self-improvement frameworks like SILMM, pave the way for more robust and capable multimodal models. The trends showcased in this newsletter underscore the rapid pace of innovation in this field, promising even more sophisticated and powerful multimodal AI systems in the near future.