This newsletter explores the cutting-edge research in multimodal image and text foundation models, covering advancements in syntactic understanding, data synthesis, unified architectures, and multilingual capabilities. We'll delve into new architectures, training methodologies, and the challenges researchers are tackling to enhance the performance and robustness of these powerful models. Prepare for a deep dive into the intricacies of multimodal AI!
Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models by Sri Harsha Dumpala, David Arps, Sageev Oore, Laura Kallmeyer, Hassan Sajjad https://arxiv.org/abs/2412.08111
Caption: This figure compares the syntactic probing accuracy of different VLMs (CLIP and Sent-MiniLM) and a ULM (RoBERTa) across layers. It demonstrates that VLMs, especially CLIP, significantly underperform ULMs on various dependency parsing metrics (LAS, UAS, UUAS, Label, and Root accuracy), highlighting the impact of pre-training objectives on syntactic knowledge encoding. The middle layers generally exhibit the highest accuracy for most models, except CLIP, whose performance degrades across layers.
Vision-Language Models (VLMs) have revolutionized multimodal tasks, but their linguistic capabilities remain a topic of ongoing research. This paper delves into the syntactic knowledge encoded by VLM text encoders, comparing them to Uni-modal Language Models (ULMs) and exploring the impact of pre-training objectives, model size, and data volume. The research utilizes DepProbe, a probing classifier that decodes syntactic dependency trees from word representations, to analyze the performance of various CLIP and FLAVA variants against ULMs like BERT and RoBERTa on the English Web Treebank (EWT) from Universal Dependencies.
The methodology involves predicting Universal Dependencies (UD) trees from the text encoder representations. DepProbe uses two matrices, L and B, to predict dependency labels and distances, respectively. The label prediction applies a softmax over the output of the L matrix applied to the word representation h<sub>i</sub>: p(r<sub>i</sub> = l<sub>k</sub>|w<sub>i</sub>) = softmax(Lh<sub>i</sub>)<sub>k</sub>. The B matrix projects word representations into a "syntactic subspace" where distances between vectors correspond to distances in the dependency tree. This distance is calculated as: d<sub>B</sub>(h<sub>i</sub>, h<sub>j</sub>) = √((Bh<sub>i</sub> − Bh<sub>j</sub>)<sup>T</sup>(Bh<sub>i</sub> − Bh<sub>j</sub>)). Performance is evaluated using standard dependency parsing metrics: Labeled Attachment Score (LAS), Unlabeled Attachment Score (UAS), Undirected UAS (UUAS), Label accuracy, and Root accuracy.
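To make the two probe components concrete, here is a minimal PyTorch sketch of a DepProbe-style probe over frozen encoder states. The class name, tensor shapes, and the subspace dimension are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DepProbeSketch(nn.Module):
    """Minimal sketch of a DepProbe-style probe: a label matrix L and a
    subspace projection B applied to frozen text-encoder representations."""
    def __init__(self, hidden_dim: int, num_labels: int, subspace_dim: int = 128):
        super().__init__()
        self.L = nn.Linear(hidden_dim, num_labels, bias=False)   # dependency-label classifier
        self.B = nn.Linear(hidden_dim, subspace_dim, bias=False) # projection into the syntactic subspace

    def label_logits(self, h: torch.Tensor) -> torch.Tensor:
        # p(r_i = l_k | w_i) = softmax(L h_i)_k ; softmax is applied in the loss
        return self.L(h)

    def tree_distance(self, h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
        # d_B(h_i, h_j) = sqrt((B h_i - B h_j)^T (B h_i - B h_j))
        diff = self.B(h_i) - self.B(h_j)
        return torch.sqrt((diff * diff).sum(dim=-1) + 1e-9)

# Usage on frozen text-encoder states of shape (seq_len, hidden_dim)
probe = DepProbeSketch(hidden_dim=512, num_labels=37)
h = torch.randn(10, 512)
logits = probe.label_logits(h)        # (10, 37) label scores per token
d = probe.tree_distance(h[2], h[5])   # distance between two tokens in the syntactic subspace
```

Only the probe parameters are trained; the encoder stays frozen, so probe accuracy reflects what the representations already encode.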
The results reveal a significant disparity between VLMs and ULMs. ULMs consistently outperform VLMs, particularly CLIP, in encoding syntactic information. CLIP, trained with a contrastive learning objective, struggles with predicate-argument relations and word order, often representing text in a bag-of-words fashion. FLAVA, incorporating a masked language modeling (MLM) objective alongside contrastive loss, performs better, approaching ULM performance. Surprisingly, increasing CLIP's model size and training data volume does not significantly improve its syntactic abilities. The study also finds that middle layers of most models, except CLIP, are richest in syntactic knowledge, while CLIP's performance deteriorates across layers.
Further analysis suggests that CLIP's limitations stem from its inability to fully capture the core syntactic structure of sentences. While it can identify individual words, it fails to accurately represent their relationships, particularly predicate-argument structure and functional word attachments. Interestingly, Sentence Language Models (SLMs), which, like CLIP, are trained to produce sentence-level embeddings, exhibit similar behavior: they perform well in initial layers but degrade towards the final layer. This suggests that sentence-level training objectives may not prioritize encoding fine-grained syntactic information. Probing on nonsensical sentences with known syntactic structures further confirms that the syntactic information encoded by the models is largely independent of semantic (word co-occurrence) information. The key takeaway is that pre-training objectives significantly influence syntactic learning, with MLM-based objectives proving more effective than purely contrastive loss for encoding syntactic knowledge in these models.
FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering by Amirhossein Abaskohi, Spandana Gella, Giuseppe Carenini, Issam H. Laradji https://arxiv.org/abs/2412.07030
Caption: This diagram illustrates the FM2DS (Few-Shot Multimodal Multihop Data Synthesis) framework, a five-stage pipeline for generating multimodal multihop question-answering training data. The process begins with a multimodal Wikipedia document and leverages LLMs, existing MMQA datasets, and validation steps to create high-quality question-answer pairs with corresponding queries.
Multimodal multihop question answering (MMQA) requires reasoning across diverse sources like images and text, posing a significant challenge in AI. Existing MMQA datasets often rely on short snippets or repetitive templates, hindering model generalization. Creating new datasets is also resource-intensive due to the need for extensive human annotation. This research introduces FM2DS (Few-Shot Multimodal Multihop Data Synthesis), a novel framework designed to address these limitations by automatically synthesizing high-quality MMQA training data.
The FM2DS framework employs a five-stage pipeline. First, it retrieves related documents from Wikipedia using hyperlinks and topic modeling. Next, it leverages few-shot samples from existing datasets like MultiModalQA. The core of the framework lies in its question generation and validation process. Using large language models (LLMs), FM2DS generates multimodal, multihop questions, ensuring they require information from multiple modalities and documents. Rigorous validation steps filter out questions that are unrelated, open-ended, or solvable with a single modality. Subsequently, answers are generated and validated using information extraction techniques and image captions. Finally, queries are created to guide information retrieval, further enhancing the training process for smaller vision-language models (VLMs).
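The pipeline is easiest to read as a generate-then-filter loop. The sketch below is a hypothetical outline only: `generate_question`, `is_multihop`, `answerable_with_single_modality`, `generate_answer`, `answer_supported_by_sources`, and `generate_query` stand in for the LLM-backed stages described in the paper and are not actual FM2DS APIs.

```python
from dataclasses import dataclass

@dataclass
class MMQASample:
    question: str
    answer: str
    query: str
    documents: list  # related multimodal Wikipedia documents

def synthesize_sample(documents, few_shot_examples, llm):
    """Hypothetical outline of the FM2DS generate-then-validate loop."""
    # Generate a candidate multimodal, multihop question from the documents
    question = llm.generate_question(documents, few_shot_examples)

    # Validation: discard questions that are single-hop, single-modality, or unsupported
    if not llm.is_multihop(question, documents):
        return None
    if llm.answerable_with_single_modality(question, documents):
        return None

    # Generate and validate an answer against the source documents and image captions
    answer = llm.generate_answer(question, documents)
    if not llm.answer_supported_by_sources(answer, documents):
        return None

    # Create a retrieval query that guides smaller VLMs to the relevant evidence
    query = llm.generate_query(question, documents)
    return MMQASample(question, answer, query, documents)
```

Samples that fail any check are simply dropped, which is what keeps the synthesized data clean enough to rival human-annotated sets.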
To evaluate the effectiveness of FM2DS, the researchers trained various VLMs (LLaVA, InternVL, Idefics) on the synthesized data and compared their performance against models trained on human-annotated datasets (MultiModalQA, WebQA). Remarkably, models trained on FM2DS data outperformed those trained on human-collected data, achieving an average improvement of 1.9 in Exact Match (EM) on WebQA and 1.81 EM on MultiModalQA. Further analysis revealed that FM2DS data facilitated faster model convergence, requiring fewer samples to achieve comparable performance. A new benchmark, M²QA-Bench, was also introduced to evaluate LVLMs on more complex MMQA tasks involving full documents.
The superior performance of models trained on FM2DS data can be attributed to several factors. The framework's emphasis on multihop reasoning and multimodal integration ensures that the generated data truly reflects the complexity of real-world MMQA scenarios. The rigorous validation steps further enhance data quality by filtering out noisy or irrelevant samples. Moreover, the inclusion of queries provides a valuable learning signal for smaller VLMs, enabling them to effectively retrieve and integrate information from multiple sources. This work demonstrates the potential of data synthesis for addressing the data scarcity challenge in MMQA and paves the way for more efficient and robust question-answering systems.
Multimodal Latent Language Modeling with Next-Token Diffusion by Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei https://arxiv.org/abs/2412.08635
Caption: Latent Language Modeling (LatentLM) uses a causal Transformer and σ-VAE to process both continuous (image, audio, video) and discrete (text) data. Next-token diffusion generates latent representations of continuous data, while standard next-token prediction handles discrete data, all within a unified architecture. This enables multimodal generation and understanding tasks like text-to-image, image-to-text, and text-to-speech synthesis.
Researchers have introduced Latent Language Modeling (LatentLM), a novel approach to multimodal generative modeling that seamlessly integrates continuous and discrete data using causal Transformers. This unified approach addresses the limitations of existing methods that often rely on pipelines or external tools, leading to information loss and hindering end-to-end optimization. LatentLM leverages a variational autoencoder (VAE) to represent continuous data like images, audio, and video as latent vectors. The key innovation lies in the introduction of next-token diffusion, which autoregressively generates these latent vectors one by one, conditioned on the Transformer's hidden state. For discrete data like text and code, LatentLM employs standard next-token prediction with softmax heads within the same shared Transformer backbone. To enhance robustness to exposure bias during autoregressive generation, the authors developed σ-VAE, a modified VAE that maintains variance in the latent space by using a fixed, randomly sampled noise scale: z = µ + σ ⊙ ϵ, where ϵ ∼ N(0, 1) and σ ∼ N(0, C<sub>σ</sub>).
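A minimal sketch of that reparameterization, assuming C<sub>σ</sub> acts as the scale of the sampled noise level (the exact parameterization in the paper may differ):

```python
import torch

def sigma_vae_sample(mu: torch.Tensor, c_sigma: float = 0.5) -> torch.Tensor:
    """Sketch of sigma-VAE sampling: z = mu + sigma * eps, where eps ~ N(0, 1)
    and sigma is itself sampled rather than predicted by the encoder, so the
    latent space keeps non-degenerate variance. c_sigma is an assumed scale."""
    eps = torch.randn_like(mu)              # eps ~ N(0, 1)
    sigma = c_sigma * torch.randn_like(mu)  # sigma ~ N(0, C_sigma), assumption on parameterization
    return mu + sigma * eps

# Latents for a batch of continuous tokens (e.g. image patches)
mu = torch.randn(4, 256)   # encoder means, shape (batch, latent_dim)
z = sigma_vae_sample(mu)   # latent vectors consumed by the causal Transformer
```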
The effectiveness of LatentLM is demonstrated across diverse modalities. In image generation on ImageNet, LatentLM surpasses Diffusion Transformers in both performance and scalability, achieving competitive FID scores and higher throughput. For multimodal large language models trained on interleaved image-text data, LatentLM outperforms both Transfusion and vector-quantized models in language modeling and various multimodal generation tasks. The scalability of LatentLM is highlighted by its favorable performance with increasing training tokens.
In text-to-speech synthesis, LatentLM achieves superior performance compared to VALL-E 2, demonstrating better speaker similarity and robustness while requiring roughly 10× fewer decoding steps. This efficiency gain stems from the higher compression ratio achieved by the continuous representation tokenizer. Ablation studies confirm the benefits of using a higher CFG scale (around 4) and a moderate number of diffusion sampling steps (around 3-5) for optimal performance. LatentLM offers a compelling unified approach to multimodal modeling, streamlining implementation by leveraging existing large language model training infrastructure. Its general-purpose interface allows seamless integration of various modalities, enabling both generation and understanding within a single framework.
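For context on the CFG scale mentioned above, classifier-free guidance combines a conditional and an unconditional prediction at each diffusion step. The sketch below shows the standard CFG update, not LatentLM's specific diffusion head.

```python
import torch

def classifier_free_guidance(pred_cond: torch.Tensor,
                             pred_uncond: torch.Tensor,
                             cfg_scale: float = 4.0) -> torch.Tensor:
    """Standard classifier-free guidance: push the conditional prediction away
    from the unconditional one by cfg_scale (a scale around 4 is what the
    ablations above report as working well)."""
    return pred_uncond + cfg_scale * (pred_cond - pred_uncond)
```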
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities by Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Sara Pieri, Saeed Yahya Alseiari, Shanavas Cholakkal, Khaled Aldahmani, Fahad Khan, Rao Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal https://arxiv.org/abs/2412.07769
Caption: This diagram illustrates the architecture of BiMediX2, a bilingual medical LMM. It shows the process of analyzing a medical image, tokenizing the query, processing it through the Llama 3.1 language model with LoRA adapters, and generating a response in both English and Arabic, verified by a medical expert. The model leverages a bilingual dataset and incorporates visual and textual modalities for comprehensive medical understanding.
This research introduces BiMediX2, a bilingual (Arabic-English) Bio-Medical Expert Large Multimodal Model (LMM) built on the Llama 3.1 architecture. It integrates text and visual modalities within a unified framework, enabling advanced image understanding and diverse medical applications. Unlike models requiring separate checkpoints for different tasks, BiMediX2 handles a wide range of medical applications, including Multi-turn Conversations, Report Summarization, Report Generation, and image analysis across various medical specialties, all within a single model.
Central to BiMediX2's development is the creation of BiMed-V, a novel bilingual and multimodal instruction set comprising 1.6M samples. This dataset combines existing public datasets with custom-curated data and Arabic translations verified by medical experts. The model's training is a two-stage process: aligning visual embeddings to the language embedding space and fine-tuning LoRA adapters for multimodal medical instruction alignment. This ensures precise alignment of visual and textual representations, enabling context-aware medical insights in both languages. The researchers also introduce BiMed-MBench, a bilingual, GPT-4o-based medical LMM evaluation benchmark.
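As a rough illustration of the adapter-based second stage, LoRA modules can be attached to a frozen language backbone with the Hugging Face peft library. The rank, alpha, dropout, target modules, and model id below are placeholder assumptions, not the BiMediX2 configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the language backbone (Llama 3.1 in BiMediX2); the model id here is a placeholder.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_cfg = LoraConfig(
    r=16,                       # adapter rank (placeholder value)
    lora_alpha=32,              # scaling factor (placeholder value)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumption)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable
```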
Evaluation shows BiMediX2 achieves state-of-the-art performance across benchmarks, outperforming models like GPT-4 in clinical LLM benchmarks and UPHILL factual accuracy evaluations. On BiMed-MBench, BiMediX2 showcases strong bilingual capabilities. It also excels in medical Visual Question Answering, Report Summarization, and Report Generation. BiMediX2 represents a significant step towards inclusive and accessible medical AI. Its bilingual support and multimodal capabilities address the needs of diverse linguistic populations. While limitations like potential hallucinations and biases exist, the release of model weights aims to foster further research in safety and alignment strategies.
Maya: An Instruction Finetuned Multilingual Multimodal Model by Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, Timothy Chung, Bala Krishna S Vegesna, Abhipsha Das, Anthony Susevski, Ryan Sze-Yin Chan, S M Iftekhar Uddin, Shayekh Bin Islam, Roshan Santhosh, Snegha A, Drishti Sharma, Chen Liu, Isha Chaturvedi, Genta Indra Winata, Ashvanth.S, Snehanshu Mukherjee, Alham Fikri Aji https://arxiv.org/abs/2412.07112
Caption: This diagram illustrates the workflow for creating Maya's multilingual dataset. It begins with extracting GPT values from the original English LLaVA dataset and translating them into seven additional languages using Aya-23 35B, with batch processing and prompt optimization. The resulting translated data, along with logs, are then compiled into the final pretraining dataset.
Current Vision-Language Models (VLMs) struggle with low-resource languages and diverse cultural contexts due to a lack of high-quality, diverse, and safety-vetted multilingual multimodal data. Existing datasets often contain toxic content, further exacerbating the problem. Researchers introduce Maya, an open-source multilingual multimodal model, to address these challenges.
The researchers build Maya upon the LLaVA framework, developing a new multilingual image-text pretraining dataset in eight languages (English, Chinese, French, Spanish, Russian, Hindi, Japanese, and Arabic), totaling 4.4 million image-text pairs. A crucial aspect is rigorous toxicity filtering using LLaVAGuard and Toxic-BERT, creating a safer training corpus. A hybrid translation method ensures high-quality translations for the new languages.
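A hedged sketch of the text-side toxicity screening, using the publicly available unitary/toxic-bert classifier through the transformers pipeline; the 0.5 threshold and truncation length are arbitrary illustrations, not Maya's actual settings, and the image-side LLaVAGuard check is omitted.

```python
from transformers import pipeline

# Toxic-BERT text classifier; returns a score per toxicity category when top_k=None
toxicity = pipeline("text-classification", model="unitary/toxic-bert", top_k=None)

def keep_caption(caption: str, threshold: float = 0.5) -> bool:
    """Return True if no toxicity label scores above the (illustrative) threshold."""
    scores = toxicity([caption[:512]])[0]  # list of {label, score} dicts for this caption
    return all(s["score"] < threshold for s in scores)

pairs = [("an annotated street scene with two cyclists", "img_001.jpg"),
         ("some offensive caption", "img_002.jpg")]
clean_pairs = [(text, img) for text, img in pairs if keep_caption(text)]
```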
The Maya model architecture leverages the pretrained multilingual Aya-23 8B model and SigLIP as its vision encoder, chosen for its performance, multilingual capabilities, and support for variable-length patch sizes. The model is pretrained on the multilingual dataset and its toxicity-free version, focusing on image-text alignment. Subsequent instruction finetuning is performed using the PALO 150K dataset.
Evaluation shows Maya achieves competitive performance across ten languages, outperforming other 7B-parameter models in several of them. Removing toxic content has a minimal impact on overall English benchmark performance. Qualitative analysis reveals Maya's responses are comparable to LLaVA-7B but lack the nuance of GPT-4. Further analysis on MM-Vet shows toxicity filtering improves performance in some areas but degrades it in others that require complex reasoning. Maya is a significant step towards building more inclusive and culturally sensitive VLMs.
A multimodal ensemble approach for clear cell renal cell carcinoma treatment outcome prediction by Meixu Chen, Kai Wang, Payal Kapur, James Brugarolas, Raquibul Hannan, Jing Wang https://arxiv.org/abs/2412.07136
This study introduces a multi-modal ensemble model (MMEM) for predicting treatment outcomes in clear cell renal cell carcinoma (ccRCC) patients. The model integrates pre-treatment clinical information, multi-omics data (mRNA, miRNA, DNA methylation), and histopathology whole slide images (WSIs) from the TCGA-KIRC dataset, aiming for a more comprehensive and personalized prognostic approach. The study focuses on predicting overall survival (OS) and disease-free survival (DFS).
MMEM combines traditional statistical methods and deep learning. Cox proportional hazards (CPH) models with iterative forward feature selection are used for clinical and multi-omics data. For WSI data, four pre-trained encoder models extract features, and a deep learning-based CPH model predicts outcomes. Predicted risk scores from each model are combined using weighted averaging based on training performance, with the risk score r<sub>i</sub> for patient i calculated as: r<sub>i</sub> = Σ<sub>m=1</sub><sup>M</sup> r<sub>i,m</sub>w<sub>m</sub>, where m is the modality index, M is the total number of modalities (5), r<sub>i,m</sub> is the risk score from modality m, and w<sub>m</sub> is the weight for modality m.
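A minimal numeric sketch of that weighted fusion; the risk scores and weights below are made-up illustrative values, not results from the study.

```python
import numpy as np

# Risk scores r_{i,m} for one patient from the M = 5 modality-specific models
# (clinical, mRNA, miRNA, DNA methylation, WSI) -- illustrative values only
r_im = np.array([1.2, 0.8, 0.6, 0.9, 1.1])

# Weights w_m derived from each model's training performance
# (illustrative, normalized to sum to 1)
w_m = np.array([0.30, 0.15, 0.10, 0.15, 0.30])

# r_i = sum over m of r_{i,m} * w_m
r_i = float(np.dot(r_im, w_m))
print(f"ensemble risk score r_i = {r_i:.3f}")
```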
MMEM outperforms single-modality models, achieving C-indices of 0.820 and 0.833 for OS and DFS, respectively. For binary outcome prediction, MMEM achieves AUROCs of 0.831 and 0.862 for patient death and cancer recurrence at 3 years. The clinical model has the highest weight in the ensemble, followed by the WSI model using the UNI feature encoder. Even with uniform weighting, MMEM still outperforms single-modality models. This research highlights the potential of integrating diverse data modalities for improved prognostic accuracy in ccRCC, emphasizing the use of general-purpose foundation models like UNI. While promising, external validation is needed to confirm generalizability.
This newsletter has highlighted key trends in multimodal image and text foundation models. From grappling with syntactic nuances in VLMs to synthesizing complex multihop data for question answering, researchers are pushing the boundaries of what's possible. The development of unified architectures like LatentLM, capable of handling both continuous and discrete data, streamlines multimodal processing, while specialized models like BiMediX2 demonstrate the potential of these technologies in domain-specific applications like healthcare. The emphasis on multilingualism and cultural sensitivity in models like Maya underscores the importance of inclusive AI development. While challenges remain, such as the syntactic blind spots of contrastively trained text encoders and the need for robust evaluation benchmarks, the rapid pace of innovation in this field promises exciting advancements in the near future.