The world of multimodal foundation models is rapidly evolving, with new architectures and training paradigms emerging at a breathtaking pace. This newsletter dives into four recent papers that push the boundaries of what's possible in this exciting field, exploring novel approaches to any-to-any generation, ontological commitment extraction, automated dataset creation, and multi-task learning. From generating interleaved video-text narratives to uncovering the hidden knowledge within these complex models, these papers offer valuable insights into the future of multimodal AI.
MIO: A Foundation Model on Multimodal Tokens by Zekun Wang, King Zhu, Chunpu Xu, et al. https://arxiv.org/abs/2409.17692
Caption: MIO's architecture processes interleaved multimodal sequences of text, speech, and images, using specialized tokenizers for each modality. These multimodal tokens are then fed into an LLM backbone and trained with a four-stage process involving alignment, interleaved, and speech-enhanced pre-training, followed by supervised fine-tuning. This allows MIO to perform tasks such as interleaved video-text generation and chain-of-visual-thought reasoning.
The rise of Large Language Models (LLMs) has been transformative, yet their limited multimodal capabilities hinder broader applications. While Multimodal LLMs (MM-LLMs) have emerged, they often lack true any-to-any understanding and generation, relying on separate encoders and decoders for different modalities. MIO (Multimodal Input and Output) breaks this mold, offering seamless understanding and generation of speech, text, images, and videos in an end-to-end, autoregressive manner. Unlike its predecessors, MIO treats different modalities as "foreign languages" within a shared token space, facilitating interleaved multimodal sequence generation – a crucial feature absent in many existing MM-LLMs.
MIO's architecture relies on discrete multimodal tokens generated by specialized tokenizers for each modality. Image tokens are derived from a ViT based on BLIP-2 and quantized using a Causal Q-Former, while speech tokens utilize an 8-layer RVQ with separate content and timbre codebooks. These tokens, along with text tokens, are fed into an LLM backbone (Yi-6B-Base) trained using causal multimodal modeling with a next-token-prediction objective. The four-stage training process involves: (1) alignment pre-training to align multimodal representations with the language space, (2) interleaved pre-training to enrich contextual understanding, (3) speech-enhanced pre-training to optimize speech capabilities, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks.
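To make the "modalities as foreign languages" idea concrete, here is a minimal sketch of how interleaved multimodal sequences might be packed into one shared discrete vocabulary and trained with a next-token-prediction objective. The vocabulary sizes, offsets, special tokens (BOI/EOI etc.), and the `lm` callable are illustrative assumptions, not MIO's actual tokenizer layout or code.

```python
import torch
import torch.nn.functional as F

# Illustrative vocabulary layout (assumed values, not MIO's real configuration).
TEXT_VOCAB = 50_000
IMG_VOCAB = 8_192      # codebook size of a hypothetical image quantizer
SPCH_VOCAB = 1_024     # codebook size of one hypothetical speech RVQ layer
IMG_OFFSET = TEXT_VOCAB                 # image codes shifted into the shared token space
SPCH_OFFSET = TEXT_VOCAB + IMG_VOCAB    # speech codes shifted after the image codes
BOI, EOI, BOS_SP, EOS_SP = range(SPCH_OFFSET + SPCH_VOCAB, SPCH_OFFSET + SPCH_VOCAB + 4)

def interleave(text_ids, image_codes, speech_codes):
    """Pack text, image, and speech tokens into a single causal sequence."""
    seq = list(text_ids)
    seq += [BOI] + [c + IMG_OFFSET for c in image_codes] + [EOI]
    seq += [BOS_SP] + [c + SPCH_OFFSET for c in speech_codes] + [EOS_SP]
    return torch.tensor(seq)

def next_token_loss(lm, seq):
    """Standard causal LM objective over the shared multimodal vocabulary.

    `lm` is any module that maps a (batch, length) token tensor to logits.
    """
    logits = lm(seq[:-1].unsqueeze(0))              # (1, L-1, vocab)
    return F.cross_entropy(logits.squeeze(0), seq[1:])
```

Because every modality lives in the same discrete vocabulary, generation reduces to one autoregressive loop, regardless of whether the next span to be produced is text, an image, or speech.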
MIO demonstrates competitive performance across various benchmarks, achieving comparable or superior results to dual-modal, any-to-any, and even modality-specific baselines. Its strength lies in its any-to-any and interleaved output features, enabling novel capabilities like interleaved video-text generation, chain-of-visual-thought reasoning, and instructional image editing. While limitations remain regarding fine-grained visual details and speech timbre control, MIO’s innovative approach paves the way for more versatile and interactive MM-LLMs.
Unveiling Ontological Commitment in Multi-Modal Foundation Models by Mert Keser, Gesina Schwalbe, Niki Amini-Naieni, et al. https://arxiv.org/abs/2409.17109
Deep Neural Networks (DNNs), particularly multi-modal foundation models, have achieved remarkable success, but their internal workings remain largely opaque. This opacity hinders validation, verification, and adaptation, particularly regarding the underlying ontological commitment – the concepts, relations, and assumptions employed by the model. This paper introduces a method to extract the learned superclass hierarchy from a multi-modal DNN, providing a crucial step towards understanding and verifying these complex models.
The proposed method leverages the inherent properties of vision DNNs and foundation models, capitalizing on the encoding of semantic similarities via vector distances. Using hierarchical clustering on leaf concept embeddings obtained from the DNN's textual input modality, the method extracts superclass representations. These parent concepts are then labeled using a search within available ontologies. The method assumes effective text-to-image alignment within the DNN and a concentric distribution of subconcept representations around their parent concept, formalized as e(parent) = mean<sub>child∈C<sub>s</sub></sub> e(child), where e is the embedding function and C<sub>s</sub> is the set of child concepts.
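The following is a minimal sketch of this extraction loop, assuming a CLIP-style text encoder, SciPy's agglomerative clustering, and a small hand-picked list of candidate superclass labels; the leaf classes, candidate names, and clustering parameters are illustrative rather than the paper's exact setup.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def embed(texts):
    """L2-normalized CLIP text embeddings for a list of concept names."""
    inputs = tok(texts, padding=True, return_tensors="pt")
    feats = model.get_text_features(**inputs).detach().numpy()
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

leaves = ["cat", "dog", "horse", "airplane", "ship", "truck"]   # leaf concepts (example subset)
candidates = ["animal", "vehicle", "plant", "furniture"]        # candidate superclass labels

E = embed(leaves)
Z = linkage(E, method="average", metric="cosine")               # hierarchical clustering of leaf embeddings
clusters = fcluster(Z, t=2, criterion="maxclust")               # cut the dendrogram into two superclasses

cand_emb = embed(candidates)
for c in np.unique(clusters):
    members = [leaves[i] for i in np.where(clusters == c)[0]]
    parent = E[clusters == c].mean(axis=0)                      # e(parent) = mean of child embeddings
    parent /= np.linalg.norm(parent)
    label = candidates[int(np.argmax(cand_emb @ parent))]       # label the parent via nearest candidate
    print(label, "<-", members)
```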
Evaluation using the CLIP model and the CIFAR-10 dataset demonstrates the method's potential. Experiments explored the impact of various hyperparameters, including affinity, linkage, and inference similarity, on the human-alignedness of the extracted ontology. Prompt engineering, specifically using the prompt template "a photo of a {classname}" with the class name filled in, significantly improved results. The best-performing configuration achieved 92% accuracy in classification tasks, showcasing the feasibility of extracting meaningful ontologies. The paper further explores ontology validation and verification, revealing inconsistencies between the learned knowledge and external ontologies, particularly for technical terms. This opens new avenues for understanding and refining DNN knowledge representations.
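For context, a minimal sketch of the zero-shot classification setup such numbers refer to, with the prompt template applied to each CIFAR-10 class name; the image path is a placeholder and the dataset loop and hyperparameter sweep are omitted.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]          # CIFAR-10 class names
prompts = [f"a photo of a {c}" for c in classes]             # prompt template per class

image = Image.open("example.png")                            # placeholder input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image                # image-to-text similarity scores
print(classes[logits.argmax(dim=-1).item()])                 # predicted class
```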
GeoBiked: A Dataset with Geometric Features and Automated Labeling Techniques to Enable Deep Generative Models in Engineering Design by Phillip Mueller, Sebastian Mueller, and Lars Mikelsons https://arxiv.org/abs/2409.17045
Deep Generative Models (DGMs) hold immense promise for engineering design, but their application has been limited by the lack of suitable datasets. GeoBiked addresses this gap, providing a rich resource of 4,355 bicycle images annotated with structural, technical, and geometric features. This detailed annotation, including bicycle styles, wheel diameters, and key geometric reference points, enables exploration of conditional control of DGMs, crucial for generating feasible designs.
Beyond the dataset, the researchers explored automated labeling techniques using foundation models. Diffusion Hyperfeatures (latent features consolidated from Stable Diffusion) were used to detect geometric correspondences, with accuracy improving when multiple source images are provided. Additionally, GPT-4o was employed to generate text descriptions, exploring image-only, label-only, and combined input configurations. Image-only grounding yielded diverse but sometimes hallucinatory descriptions, while label-only grounding improved accuracy but restricted diversity. Combining both inputs offered a balance between these aspects.
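The correspondence step can be pictured as nearest-neighbor matching in dense feature space. Below is a minimal sketch assuming the hyperfeature maps have already been extracted and resized to a common resolution; the function names are hypothetical and the feature extraction and the authors' exact aggregation are not shown.

```python
import torch
import torch.nn.functional as F

def transfer_point(src_feats, tgt_feats, src_xy):
    """Find the target pixel whose feature best matches an annotated source point.

    src_feats, tgt_feats: (C, H, W) dense feature maps (e.g. consolidated diffusion features).
    src_xy: (x, y) location of a geometric reference point in the source image.
    """
    C, H, W = src_feats.shape
    query = src_feats[:, src_xy[1], src_xy[0]]                     # feature at the source keypoint
    sims = F.cosine_similarity(tgt_feats.reshape(C, -1),
                               query[:, None], dim=0)              # similarity to every target location
    idx = sims.argmax().item()
    return idx % W, idx // W

def transfer_point_multi(src_list, tgt_feats):
    """Average similarity maps over several annotated source images before taking the argmax."""
    C, H, W = tgt_feats.shape
    acc = torch.zeros(H * W)
    for feats, xy in src_list:
        query = feats[:, xy[1], xy[0]]
        acc += F.cosine_similarity(tgt_feats.reshape(C, -1), query[:, None], dim=0)
    idx = acc.argmax().item()
    return idx % W, idx // W
```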
The results demonstrate the potential of Diffusion Hyperfeatures for automating geometric annotation in technical images and the power of GPT-4o for generating descriptive text, albeit with the need for careful prompt engineering. GeoBiked and the proposed labeling techniques represent a significant step towards integrating DGMs into engineering design workflows.
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE by Xun Zhu, Ying Hu, Fanbin Mo, et al. https://arxiv.org/abs/2409.17508
Caption: The architecture of Uni-Med, a unified medical foundation model, is depicted, highlighting its key components: a visual feature extraction module, a Connector Mixture-of-Experts (CMoE) module, and a large language model. The CMoE module dynamically routes visual features through projection experts to mitigate multi-task interference, enabling Uni-Med to effectively handle various medical tasks, including question answering, report generation, and image classification across diverse modalities like X-ray, CT, and MRI. The diagram also shows the flow of data from input modalities and text data through the model to the final response.
Building unified MLLMs for multi-task learning in medicine faces the "tug-of-war" problem, where simultaneous optimization across tasks leads to interference. Uni-Med addresses this challenge by focusing on the connector module. The model comprises a visual feature extraction module, a Connector Mixture-of-Experts (CMoE) module, and an LLM. The CMoE, with its mixture of projection experts and a router, dynamically aligns visual and language embedding spaces, mitigating the tug-of-war problem.
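A minimal sketch of such a connector-level mixture of projection experts is shown below, assuming soft routing over all experts; the dimensions, expert count, and routing signal are illustrative choices, not Uni-Med's exact design.

```python
import torch
import torch.nn as nn

class ConnectorMoE(nn.Module):
    """Routes visual tokens through several projection experts into the LLM embedding space."""
    def __init__(self, vis_dim=1024, llm_dim=4096, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(vis_dim, llm_dim) for _ in range(num_experts))
        self.router = nn.Linear(vis_dim, num_experts)

    def forward(self, vis_tokens):                        # (B, N, vis_dim)
        weights = self.router(vis_tokens).softmax(-1)     # (B, N, E) per-token expert weights
        outs = torch.stack([e(vis_tokens) for e in self.experts], dim=-2)  # (B, N, E, llm_dim)
        return (weights.unsqueeze(-1) * outs).sum(-2)     # weighted sum over experts

# Usage: project visual features into the LLM input space before concatenating with text embeddings.
connector = ConnectorMoE()
llm_inputs = connector(torch.randn(2, 256, 1024))         # -> (2, 256, 4096)
```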
The effectiveness of CMoE is demonstrated through ablation studies, showing up to 8% performance gains. Analysis of gradient optimization and parameter statistics reveals that CMoE balances gradient updates and reduces parameter conflicts. The tug-of-war index, G(GD, GM) = [Σ<sub>j=1</sub><sup>T</sup> GD<sub>ij</sub>GM<sub>ij</sub>]<sub>i=1</sub><sup>T</sup>, quantifies interference between tasks, while the parameter statistics score, Σ<sub>i</sub>|∇<sub>o</sub>L<sub>i</sub>| / Σ<sub>i</sub>∇<sub>o</sub>L<sub>i</sub>, analyzes parameter-level conflicts. Uni-Med achieves competitive or superior performance compared to existing medical MLLMs across six diverse tasks, showcasing its potential as a unified and generalist foundation model for medical AI.
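To make the interference measurement concrete, here is a small illustrative computation of a per-task agreement score from per-task gradients on shared parameters. The cosine-similarity definition of GD and the magnitude-similarity definition of GM below are assumptions made for illustration, not necessarily the paper's exact formulation.

```python
import torch

def tug_of_war_index(grads):
    """Per-task gradient agreement score.

    grads: list of T flattened gradient vectors, one per task, w.r.t. shared parameters.
    Returns a length-T tensor: for task i, the sum over tasks j of GD_ij * GM_ij,
    with GD as cosine similarity and GM as a magnitude-similarity term (assumed definitions).
    """
    G = torch.stack(grads)                                               # (T, D)
    norms = G.norm(dim=1, keepdim=True)                                  # (T, 1)
    GD = (G @ G.T) / (norms * norms.T).clamp_min(1e-8)                   # direction agreement
    GM = 2 * (norms * norms.T) / (norms**2 + norms.T**2).clamp_min(1e-8) # magnitude agreement
    return (GD * GM).sum(dim=1)
```

Under these assumed definitions, larger scores indicate tasks pulling the shared parameters in compatible directions, while low or negative scores flag the kind of conflict the CMoE connector is designed to relieve.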
This newsletter highlights the diverse and rapid advancements in multimodal foundation models. From MIO's innovative any-to-any generation capabilities to Uni-Med's targeted approach to multi-task learning, these papers underscore the growing sophistication of these models. The development of specialized datasets like GeoBiked and methods for extracting ontological commitment further pave the way for more robust, understandable, and applicable multimodal AI systems. While challenges remain, the progress showcased in these papers points towards a future where multimodal models can seamlessly integrate and interpret diverse data sources, opening up exciting possibilities across numerous domains.