This newsletter dives into the cutting edge of multimodal AI, exploring recent breakthroughs in image and text foundation models. We'll examine novel architectures, training techniques, and applications, highlighting key advancements in bridging the gap between visual and textual understanding. From efficiently adapting existing LLMs for multimodal generation to developing specialized models for specific languages and tasks, this newsletter offers a comprehensive overview of the evolving landscape of multimodal AI. We'll also delve into the critical challenges of uncertainty calibration and explainability in these increasingly complex models, paving the way for more robust and reliable multimodal applications.
LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation by Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu https://arxiv.org/abs/2412.15188
Caption: The LlamaFusion architecture diagram showcases the integration of pre-trained LLMs (text module) with dedicated image processing modules. Shared self-attention layers facilitate cross-modal interactions, while modality-specific components (FFN, QKV) and a U-Net structure enable independent processing of text and images. The VAE encoder and decoder handle image understanding and generation, respectively.
Large language models (LLMs) have revolutionized text processing, but their multimodal capabilities are still developing. Training multimodal models from scratch is computationally expensive, and simply fine-tuning existing LLMs on multimodal data often degrades their core language abilities. LlamaFusion offers an innovative solution by adapting pretrained text-only LLMs for multimodal generation, leveraging existing weights and incorporating dedicated modules for visual processing. This framework enables LLMs to understand and generate both text and images in arbitrary sequences, opening up exciting new possibilities for AI communication.
The key innovation of LlamaFusion lies in its architecture, which integrates the pretrained Llama-3 model with dedicated transformer modules for visual understanding and generation. Instead of starting from scratch, LlamaFusion builds upon the strengths of Llama-3 for text processing while introducing new, parallel transformer modules for image processing using diffusion. Crucially, these modules employ modality-specific feedforward layers, query-key-value (QKV) projections, and normalization layers. This allows independent processing of each modality (text and image). However, shared self-attention layers enable crucial cross-modal interactions, allowing the model to learn relationships between text and images.
During training, the text-specific modules are frozen (η<sub>text</sub> = 0), preserving the language capabilities of the original LLM. Only the image-specific modules are trained on image data. This strategic freezing and training regime allows for efficient learning of visual understanding and generation without sacrificing text proficiency.
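To make the modality-separation idea concrete, here is a minimal PyTorch sketch of a transformer block with modality-specific QKV projections, feedforward layers, and norms but shared self-attention, followed by the freezing of the text-side parameters. All module names, dimensions, and the mask-based routing are illustrative assumptions, not the released LlamaFusion code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalitySplitBlock(nn.Module):
    """Transformer block with modality-specific QKV, FFN, and norms but shared
    self-attention, loosely in the spirit of LlamaFusion (illustrative sketch)."""
    def __init__(self, d_model=1024, n_heads=16):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        mods = ("text", "image")
        self.qkv = nn.ModuleDict({m: nn.Linear(d_model, 3 * d_model) for m in mods})
        self.ffn = nn.ModuleDict({m: nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model)) for m in mods})
        self.norm1 = nn.ModuleDict({m: nn.LayerNorm(d_model) for m in mods})
        self.norm2 = nn.ModuleDict({m: nn.LayerNorm(d_model) for m in mods})
        self.out_proj = nn.Linear(d_model, d_model)

    def _route(self, x, is_image, layers):
        # Image tokens go through the image-specific layer, text tokens through the text one.
        return torch.where(is_image.unsqueeze(-1), layers["image"](x), layers["text"](x))

    def forward(self, x, is_image):
        # x: (batch, seq, d_model); is_image: bool (batch, seq), True for image tokens.
        B, S, D = x.shape
        q, k, v = self._route(self._route(x, is_image, self.norm1),
                              is_image, self.qkv).chunk(3, dim=-1)
        q, k, v = (t.reshape(B, S, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        # Shared self-attention: text and image tokens attend to one another here.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.out_proj(attn.transpose(1, 2).reshape(B, S, D))
        return x + self._route(self._route(x, is_image, self.norm2), is_image, self.ffn)

def freeze_text_modules(model):
    # Equivalent to a zero learning rate for the text path: every parameter
    # registered under a ".text." module stays fixed, preserving the pretrained LLM.
    for name, p in model.named_parameters():
        if ".text." in name:
            p.requires_grad = False
```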
The effectiveness of LlamaFusion was evaluated against Transfusion, a state-of-the-art multimodal generative model trained from scratch. Using the same image data, LlamaFusion demonstrated significant improvements: with only half the FLOPs of Transfusion, it achieved a 20% improvement in image understanding and a 3.6% improvement in image generation while maintaining Llama-3's text-only performance, which itself outperforms Transfusion's by 11.6%. Further experiments showed that LlamaFusion can also endow existing vision-language models such as LLaVA with multimodal generation abilities. Ablation studies confirmed the importance of modality separation and of freezing the text modules for preserving language capabilities while developing visual skills.
Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models by Kunat Pipatanakul, Potsawee Manakul, Natapong Nitarach, Warit Sirichotedumrong, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Pittawat Taveekitworachai, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, Kasima Tharnpipitchai https://arxiv.org/abs/2412.13702
Caption: This diagram illustrates the agentic framework used for data curation and refinement within Typhoon2-Vision, particularly for Thai Chart Visual Question Answering (VQA). The process involves a chain of agent generations (Grandparents, Parents, Grandchildren) leveraging Typhoon 1.5X, cutting-edge models, and Meta-Prompt-CoT to iteratively enhance ground truth labels and generate high-quality Thai VQA data from chunked page overviews. The final output is markdown formatted content.
SCB 10X has released Typhoon 2, a new series of open large language models (LLMs) specialized in Thai language and multimodal understanding. Building upon their previous Typhoon models, this release leverages state-of-the-art architectures like Llama 3 and Qwen2.5 and incorporates novel training techniques for significant performance gains. The Typhoon 2 family includes text models ranging from 1 billion to 70 billion parameters, in both base and instruction-tuned versions, alongside dedicated models for vision and audio. A key focus is improved Thai language capability while retaining strong English performance.
For the text models (Typhoon2-Text), a high-quality, diverse Thai training corpus was prioritized. They expanded upon their existing Typhoon 1 corpus, gathering data from diverse sources including culturally relevant texts, high-quality web content filtered using an iterative classifier approach, synthetically generated textbook-style content, and educational resources like Thai Wikipedia. This data, mixed with English data, was used for continual pre-training (CPT) on the base models. Post-training involved supervised fine-tuning (SFT) on a combined English and Thai instruction dataset, including a newly created TyphoonIF dataset based on AutoIF. Enhancements include domain-specific SFT for math and coding, long context extension up to 128,000 tokens, function calling capabilities, and model merging using techniques like DARE + linear.
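As an illustration of the "DARE + linear" merging step, here is a minimal sketch: each fine-tuned model's delta from the base weights is randomly dropped and rescaled (DARE), and the surviving deltas are linearly combined back onto the base. The drop rate, merge weights, and checkpoint names are assumptions for illustration, not the values used for Typhoon 2.

```python
import torch

def dare_linear_merge(base_state, finetuned_states, weights, drop_rate=0.9, seed=0):
    """DARE + linear merge (sketch): drop a random fraction of each model's delta
    from the base, rescale survivors by 1/(1 - drop_rate), then add the weighted
    sum of deltas back onto the base weights."""
    torch.manual_seed(seed)
    merged = {}
    for name, base in base_state.items():
        if not torch.is_floating_point(base):      # skip integer buffers etc.
            merged[name] = base.clone()
            continue
        delta_sum = torch.zeros_like(base)
        for state, w in zip(finetuned_states, weights):
            delta = state[name] - base
            keep = (torch.rand_like(delta) > drop_rate).to(delta.dtype)
            delta_sum += w * keep * delta / (1.0 - drop_rate)
        merged[name] = base + delta_sum
    return merged

# Hypothetical usage with two domain-specific fine-tunes of the same base model:
# merged = dare_linear_merge(base.state_dict(),
#                            [general_sft.state_dict(), math_sft.state_dict()],
#                            weights=[0.6, 0.4])
```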
The vision model (Typhoon2-Vision) builds upon Qwen2-VL, specializing in image-based tasks. It focuses on enhancing Thai document understanding, particularly in Optical Character Recognition (OCR) and Chart Visual Question Answering (VQA). A combination of translation and distillation adapted the Cambrian-737K dataset for Thai. An agentic framework, extending Chain-of-Thought (CoT) to Tree-of-Thought (ToT), was used for data curation, refining ground truth labels, and generating Thai VQA data. Fine-tuning was performed using LoRA.
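For readers unfamiliar with the fine-tuning setup, a minimal LoRA configuration with the Hugging Face peft library might look like the sketch below; the rank, alpha, dropout, and target modules are illustrative defaults rather than the settings reported for Typhoon2-Vision.

```python
from transformers import Qwen2VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Illustrative LoRA setup on a Qwen2-VL backbone; hyperparameters are placeholders.
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
lora_cfg = LoraConfig(
    r=16,                                   # low-rank dimension
    lora_alpha=32,                          # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()          # only the LoRA adapters are trainable
```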
Typhoon2-Audio introduces a novel end-to-end architecture for speech processing and generation. It combines a Whisper/BEATS-based audio encoder, a Q-Former adapter, a Typhoon LLM, a non-autoregressive speech decoder, and a unit vocoder. This architecture allows parallel text and speech generation, reducing latency. Training involved a two-phase approach: pre-training on ASR and audio captioning data, followed by SFT on various tasks, including newly created Speech Instruction Following (SpeechIF) and Complex Instruction Following (ComplexIF) datasets. Typhoon2-Audio achieves strong performance on Thai ASR (WER below 15.0), translation, and spoken QA, outperforming existing open-source models. It also demonstrates promise on SpeechIF and ComplexIF tasks and functions as a non-autoregressive Text-to-Speech (TTS) system, achieving a CER below 20% on a Thai test set.
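The end-to-end dataflow can be sketched roughly as follows; every submodule here is a placeholder standing in for the components named above (Whisper/BEATS encoder, Q-Former adapter, Typhoon LLM, non-autoregressive speech decoder, unit vocoder), not the released implementation.

```python
import torch
import torch.nn as nn

class SpeechLMSketch(nn.Module):
    """Rough dataflow of a Typhoon2-Audio-style stack (all submodules are placeholders)."""
    def __init__(self, audio_encoder, qformer, llm, speech_decoder, vocoder):
        super().__init__()
        self.audio_encoder = audio_encoder    # Whisper/BEATS-style acoustic features
        self.qformer = qformer                # adapter: audio features -> LLM embedding space
        self.llm = llm                        # Typhoon text LLM (returns hidden states)
        self.speech_decoder = speech_decoder  # non-autoregressive unit predictor
        self.vocoder = vocoder                # discrete units -> waveform

    def forward(self, audio, prompt_embeds):
        audio_tokens = self.qformer(self.audio_encoder(audio))
        # Adapted audio tokens are concatenated with the text prompt embeddings.
        hidden = self.llm(torch.cat([audio_tokens, prompt_embeds], dim=1))
        # Text and speech come out in parallel: the same hidden states feed a
        # non-autoregressive decoder whose discrete units the vocoder renders as
        # audio, which is what keeps generation latency low.
        units = self.speech_decoder(hidden)
        waveform = self.vocoder(units)
        return hidden, waveform
```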
Finally, Typhoon2-Safety, a lightweight binary classifier, detects Thai-sensitive content. A dedicated dataset combined a translated WildGuard dataset with a new dataset focused on Thai cultural sensitivities. This mDeBERTa-v3 based model outperforms existing safety classifiers, effectively handling Thai-specific and universal safety concerns.
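A minimal version of such a classifier, built on the same mDeBERTa-v3 backbone, could be wired up as below; the checkpoint name and label mapping are assumptions, not the released Typhoon2-Safety weights.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Binary safety classifier sketch on an mDeBERTa-v3 backbone (placeholder checkpoint).
tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/mdeberta-v3-base", num_labels=2)   # assumed labels: 0 = safe, 1 = sensitive

def is_sensitive(text: str) -> bool:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item() == 1
```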
A Review of Multimodal Explainable Artificial Intelligence: Past, Present and Future by Shilin Sun, Wenbin An, Feng Tian, Fang Nan, Qidong Liu, Jun Liu, Nazaraf Shah, Ping Chen https://arxiv.org/abs/2412.14056
Caption: This figure visualizes the historical progression of AI model development, categorized by parameter count and era (Traditional Machine Learning, Deep Learning, Discriminative Foundation Models, and Generative Large Language Models). The color-coding differentiates model types (machine learning-based, DNN-based, transformer-based, and generative LLMs-based), showcasing the increasing complexity and parameter scale over time. This timeline aligns with the paper's focus on the evolution of Multimodal Explainable AI (MXAI) and its adaptation to increasingly complex models.
Explainable AI (XAI) is increasingly crucial as AI models become more complex. This paper reviews Multimodal Explainable AI (MXAI), focusing on methods utilizing multiple data modalities for prediction or explanation. The authors divide MXAI's evolution into four eras: Traditional Machine Learning (2000-2009), Deep Learning (2010-2016), Discriminative Foundation Models (2017-2021), and Generative Large Language Models (2022-2024). Within each era, MXAI techniques are categorized into data explainability (pre-model), model explainability (in-model), and post-hoc explainability (post-model).
The Traditional Machine Learning era favored interpretable methods like decision trees and rule-based systems due to limited data and computational power. Data explainability relied on dimensionality reduction, model explainability on inherent transparency, and post-hoc methods like causal inference provided further insights. The Deep Learning era saw the rise of complex neural networks, prompting the development of techniques to interpret these black boxes. Data explainability shifted towards analyzing data quality, model explainability explored techniques like decomposability and attention mechanisms, and post-hoc methods like Grad-CAM++ offered visual explanations.
The Discriminative Foundation Models era, marked by Transformer models, focused on interpreting these models through techniques like attention visualization and feature attribution. Data explainability involved analyzing multimodal datasets and constructing structural relationships using graph-based methods. Model explainability delved into behavioral explanations for Transformers and CLIP models. Post-hoc methods explored counterfactual reasoning and bias mitigation.
The Generative LLMs era brought new challenges and opportunities. LLMs' interactivity allowed for adaptive explanations, while complex multimodal data demanded sophisticated techniques. Data explainability focused on explaining datasets and graph modeling. Model explainability explored process explanations like In-Context Learning (ICL) and Chain of Thought (CoT), and inherent interpretability through probing. Post-hoc explainability leveraged example-based explanations, including counterfactual and adversarial examples. The paper also summarizes common datasets and evaluation metrics in MXAI research, categorized by tasks and explanation types.
FedPIA -- Permuting and Integrating Adapters leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning by Pramit Saha, Divyanshu Mishra, Felix Wagner, Konstantinos Kamnitsas, J. Alison Noble https://arxiv.org/abs/2412.14424
Caption: The figure presents an overview of the FedPIA framework for federated learning of vision-language models. It depicts the server-level and client-level permutation and integration of adapters, along with a visualization of the loss contour. This approach enables efficient and stable convergence in multi-modal FL settings by aligning and integrating adapters trained on diverse client data.
Fine-tuning large Vision-Language Models (VLMs) in federated learning (FL) presents challenges due to data privacy, limited resources, and data/task heterogeneity. FedPIA (Federated Learning via Permuting and Integrating Adapters) addresses these challenges through a novel approach to adapter fusion.
FedPIA leverages Wasserstein Barycenters to align and integrate adapters trained on diverse client data. At the server level, client adapters are permuted to match the initialized global adapter before integration using: W<sup>(l,l-1)</sup><sub>G</sub> = (1/K) Σ<sub>k=1</sub><sup>K</sup> W<sup>(l,l-1)</sup><sub>k</sub> ||W<sub>k</sub> - W<sub>G</sub>||<sup>-γ</sup>, where W<sup>(l,l-1)</sup><sub>G</sub> is the global adapter weight between layers l-1 and l, W<sup>(l,l-1)</sup><sub>k</sub> is the corresponding client adapter weight, and γ is a hyperparameter. This process bridges the parameter space gap between local adapters. At the client level, a similar process aligns the global adapter with the client-specific adapter. This two-fold approach promotes stable convergence.
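A stripped-down sketch of the server-side integration step (after the Wasserstein-barycenter permutation alignment, which is omitted here) makes the distance-based weighting explicit; the tensors and γ value are illustrative.

```python
import torch

def integrate_adapters(global_w, client_ws, gamma=1.0, eps=1e-8):
    """Server-level integration (sketch): average permutation-aligned client adapter
    weights, down-weighting clients whose adapters drift far from the global one,
    mirroring W_G = (1/K) * sum_k W_k * ||W_k - W_G||^(-gamma)."""
    agg = torch.zeros_like(global_w)
    for w_k in client_ws:
        coeff = (torch.linalg.norm(w_k - global_w) + eps) ** (-gamma)
        agg += coeff * w_k
    return agg / len(client_ws)
```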
FedPIA was evaluated across five medical vision-language FL tasks, using 48 medical image datasets, ten modalities, and two VLM backbones (ViLT and ALBEF). It consistently outperformed state-of-the-art PEFT-FL baselines. In modality-specific VQA, FedPIA achieved a 5.18% mean improvement over FedDAT. In heterogeneous tasks combining VQA and disease classification, it surpassed full fine-tuning by 1.81%.
Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models by Zijun Chen, Wenbo Hu, Guande He, Zhijie Deng, Zheng Zhang, Richang Hong https://arxiv.org/abs/2412.14660
Caption: This bar graph compares the Expected Calibration Error (ECE) of several Multimodal Large Language Models (MLLMs) across different calibration techniques. It shows the effectiveness of Temperature Scaling (TS), Prompt Tuning (PT), and their combination (TS+PT) in reducing ECE compared to the original uncalibrated models (Origin) for Qwen-VL, Qwen-VL-Chat, LLaVA-7B, and LLaVA-13B. The results demonstrate that while all models are initially miscalibrated, the combination of TS and PT offers the most significant improvement in calibration across the tested MLLMs.
Multimodal Large Language Models (MLLMs) face challenges in reliable uncertainty calibration. This paper investigates MLLMs such as LLaVA and Qwen-VL across a range of scenarios, including before and after visual fine-tuning and multimodal training. Calibration remained consistent before and after fine-tuning, and multimodal training had minimal impact on calibration for purely linguistic tasks, yet the models were consistently miscalibrated overall.
The study also examined uncertainty across modalities, observing lower uncertainty for text inputs than for images; integrating both modalities effectively reduced overall uncertainty. Experiments with the "I Don't Know" (IDK) dataset revealed that MLLMs often prefer to provide an answer even when uncertain, a tendency that can be mitigated through prompt adjustments.
To address miscalibration, the paper proposes and evaluates calibration techniques. Temperature scaling (TS) adjusts output probabilities: min<sub>T</sub> - Σ<sup>M</sup><sub>i=1</sub> Σ<sup>|y|</sup><sub>j=1</sub> 1<sub>yᵢ=j</sub> log [softmax(lᵢ/T)]<sub>j</sub>, where M is the number of samples, |y| is the number of classes, yᵢ is the true label, 1<sub>yᵢ=j</sub> is an indicator function, and lᵢ are the logits. Prompt tuning optimizes prompt suffixes. Combining TS and prompt tuning showed promising results.
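Temperature scaling is simple enough to capture in a few lines; the sketch below fits a single scalar T on held-out logits by minimizing the negative log-likelihood above (the optimizer and step count are arbitrary choices, not from the paper).

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Post-hoc temperature scaling (sketch): find T > 0 minimizing the NLL of
    softmax(logits / T) on a held-out calibration set."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# Calibrated probabilities are then softmax(logits / T); the argmax (accuracy) is unchanged.
```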
Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models by Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, Huaping Liu https://arxiv.org/abs/2412.14058
Caption: This figure categorizes various Vision-Language-Action (VLA) models based on their action space (continuous or discrete) and history integration method (one-step, interleaved, or policy-head). Markers differentiate between policy head (square), interleaved (triangle), and one-step (circle) models, with color indicating open-source (green) or closed-source (red) availability. This categorization aids in understanding the design space explored by the RoboVLMs framework, which systematically evaluates different VLA configurations for generalist robot policies.
This paper explores Vision-Language-Action Models (VLAs) for generalist robot policies. VLAs, built by fine-tuning pre-trained Vision-Language Models (VLMs), leverage the VLMs' multi-modal representation learning. The authors introduce RoboVLMs, a flexible framework for integrating any VLM into a VLA architecture.
The study evaluates various VLA configurations across simulated and real-world tasks. It covers four VLM backbones (Flamingo, LLaVA, Kosmos, Paligemma), four VLA structure formulations (categorized by action space and history integration), and three training data recipes incorporating cross-embodiment data. Action prediction for continuous actions uses a combined loss: L<sub>VLA</sub> = Σ<sup>t+L-1</sup><sub>i=t</sub> [MSE(a<sub>i,pose</sub>, ā<sub>i,pose</sub>) + λ · BCE(a<sub>i,gripper</sub>, ā<sub>i,gripper</sub>)], where a denotes predicted actions and ā the ground truth.
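A literal reading of that loss over a predicted action chunk of length L looks like the following sketch; the tensor shapes and the weight λ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def vla_action_loss(pred_pose, gt_pose, pred_grip_logits, gt_grip, lam=0.01):
    """Chunk-level action loss (sketch): MSE on the continuous end-effector pose
    plus weighted BCE on the binary gripper state, summed over the chunk and
    averaged over the batch."""
    # pred_pose, gt_pose: (batch, L, 6); pred_grip_logits, gt_grip: (batch, L)
    batch = pred_pose.shape[0]
    pose_loss = F.mse_loss(pred_pose, gt_pose, reduction="sum") / batch
    grip_loss = F.binary_cross_entropy_with_logits(
        pred_grip_logits, gt_grip.float(), reduction="sum") / batch
    return pose_loss + lam * grip_loss
```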
The best-performing VLA (Kosmos backbone, policy head) significantly outperformed existing methods on benchmarks like CALVIN, achieving a 30.3% improvement for 5 consecutive tasks zero-shot, averaging 4.25 completed tasks out of 5. It also showed robustness in real-world scenarios. Continuous action spaces proved more effective, historical context enhanced performance, and the policy head structure outperformed interleaved modeling.
While cross-embodiment data pre-training alone didn't consistently improve performance, post-training after cross-embodiment pre-training showed potential benefits in few-shot learning. In-domain data proved more effective than solely cross-embodiment data.
This newsletter has highlighted significant advancements in multimodal image and text foundation models. From the efficient adaptation of existing LLMs in LlamaFusion to the development of language-specific models like Typhoon 2, the field is evolving rapidly. The exploration of architectural choices for VLAs in RoboVLMs and the focus on calibration in MLLMs further underscore the multifaceted nature of this research area. While performance gains are impressive, the ongoing challenge of explainability, highlighted in the MXAI review, emphasizes the need for continued research into transparency and trustworthiness as these models become increasingly powerful and integrated into real-world applications. The open-sourcing of frameworks like RoboVLMs and the creation of specialized datasets like the IDK dataset demonstrate a commitment to collaborative progress in this exciting field.