This newsletter dives into the latest advancements in multimodal foundation models, focusing on two exciting developments: the rise of smaller, more efficient vision-language models and the innovative use of foundation models for complex tasks like survival prediction in oncology. We'll explore how these innovations are pushing the boundaries of multimodal AI, making it more accessible and applicable to a wider range of real-world challenges.
Small Vision-Language Models: A Survey on Compact Architectures and Techniques by Nitesh Patnaik, Navdeep Nayak, Himani Bansal Agrawal, Moinak Chinmoy Khamaru, Gourav Bal, Saishree Smaranika Panda, Rishi Raj, Vishal Meena, Kartheek Vadlamani https://arxiv.org/abs/2503.10665
Caption: This diagram provides a taxonomy of Small Vision-Language Models (sVLMs), categorizing them into transformer-based, mamba-based, and hybrid models, and outlines key research areas for advancing sVLM development, including bias mitigation, multimodal integration, and efficient architectures for edge devices. It also visually represents the evolution of sVLMs and explores various aspects like performance metrics, benchmarking, and data efficiency.
Small vision-language models (sVLMs) are revolutionizing multimodal AI by offering a compelling balance between performance and computational efficiency. Unlike their resource-intensive predecessors, sVLMs are designed for resource-constrained environments, bringing advanced AI capabilities to devices like mobile phones and embedded systems. This survey provides a comprehensive overview of sVLM development, categorizing architectures into three primary paradigms: transformer-based, mamba-based, and hybrid models.
The survey traces the evolution of sVLMs, starting from foundational models like CLIP, which showcased the power of contrastive learning for aligning visual and textual representations. It then delves into subsequent advancements, including ViLT's transformer-only approach, VirTex's data-efficient pretraining using semantically rich captions, and SimVLM's simplified vision-language pretraining with a single PrefixLM objective. Key models like BLIP, Flamingo, and MiniGPT-4 are highlighted, showcasing their contributions to unified vision-language understanding and generation, few-shot learning, and the integration of visual encoders with large language models.
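The contrastive alignment that CLIP pioneered can be condensed into a few lines. The following NumPy sketch of the symmetric InfoNCE objective is illustrative only; the batch size, temperature, and toy embeddings are assumptions, not CLIP's actual configuration:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))             # matched pairs lie on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Cross-entropy in both directions (image->text and text->image).
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each image embedding toward its paired caption and pushes it away from every other caption in the batch, which is what makes the learned space useful for zero-shot transfer.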
A core contribution of this survey is its taxonomy of sVLM architectures. Transformer-based models, such as TinyGPT-V and TinyViT, leverage self-attention mechanisms for efficient multimodal processing. Mamba-based models, like VL-Mamba and Simba, utilize state-space models for linear scalability and efficient handling of long sequences. Hybrid models, exemplified by SAMBA and Zamba, combine elements from transformers, CNNs, and other lightweight mechanisms to optimize for both performance and computational demands. The survey also discusses relevant evaluation metrics and benchmarks used to assess sVLM performance, including accuracy, mIoU, PSNR, AUC, and FLOPs.
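The linear scalability of mamba-based models comes from processing the sequence in a single recurrent sweep rather than comparing every token against every other. A minimal diagonal state-space sketch makes the O(L) cost visible; the dimensions and the fixed (non-selective) parameters here are simplifying assumptions, since real Mamba layers make A, B, and C input-dependent:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a discretized diagonal state-space model over a sequence.

    x: (L, d_in) inputs; A: (n,) diagonal decay; B: (n, d_in); C: (d_out, n).
    One pass over the sequence => linear in L, unlike attention's quadratic cost.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                  # single sweep: O(L) steps
        h = A * h + B @ x_t        # state update (element-wise decay + input)
        ys.append(C @ h)           # readout
    return np.stack(ys)
```

With `A` fixed at ones the state is a running sum of projected inputs, which shows why a finite-size state can summarize an arbitrarily long prefix at constant memory.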
Finally, the survey identifies key challenges and future research directions in the sVLM domain. Addressing data biases, improving generalization to complex tasks, and developing advanced architectures specifically for edge devices are crucial areas for exploration. Expanding multimodal integration beyond text and vision to incorporate other modalities like audio and haptics is another promising avenue. Furthermore, research into bias mitigation, explainability, scalable training strategies, and emerging evaluation paradigms is essential for ensuring the responsible and effective deployment of sVLMs. This work underscores the transformative potential of sVLMs for accessible AI, paving the way for broader adoption and innovative applications.
Multi-Modal Mamba Modeling for Survival Prediction (M4Survive): Adapting Joint Foundation Model Representations by Ho Hin Lee, Alberto Santamaria-Pang, Jameson Merkow, Matthew Lungren, Ivan Tarapov https://arxiv.org/abs/2503.10057
Accurate cancer survival prediction relies on integrating diverse imaging data, but traditional methods often struggle to effectively combine information from different modalities like radiology and pathology. M4Survive (Multi-Modal Mamba Modeling for Survival Prediction) offers a novel framework that leverages the power of foundation models to enhance the accuracy and interpretability of survival predictions. This approach addresses the limitations of existing methods by dynamically fusing information from multiple modalities, accommodating missing data, and capturing complex cross-scale interactions.
M4Survive employs a two-step process. First, pre-trained foundation models like MedImageInsight and Prov-GigaPath are used to extract embeddings from radiology and pathology images, respectively. These embeddings are then projected into a unified semantic space using small encoder networks. Second, the resulting joint embeddings are fed into a lightweight Mamba adapter network. This adapter, based on a selective state-space model, facilitates fine-grained cross-modal interactions, enabling effective combination of information from both imaging domains. The Mamba adapter's linear complexity offers a significant computational advantage over transformer-based approaches, which have quadratic complexity.
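The two-step pipeline can be sketched as follows. Everything concrete here is a hypothetical simplification: the dimensions, the linear-plus-ReLU projection heads, and the mean-pool-plus-linear stand-in for the Mamba adapter are ours, and the foundation-model embeddings are mocked rather than extracted from MedImageInsight or Prov-GigaPath:

```python
import numpy as np

rng = np.random.default_rng(42)

def project(emb, W, b):
    """Small encoder projecting a frozen foundation-model embedding
    into the shared semantic space (linear layer + ReLU for brevity)."""
    return np.maximum(W @ emb + b, 0.0)

# Hypothetical dimensions; the paper's exact sizes are not reproduced here.
d_rad, d_path, d_joint = 1024, 1536, 256

# Step 1: embeddings from frozen encoders (mocked; in practice these come
# from a radiology and a pathology foundation model respectively).
rad_emb = rng.normal(size=d_rad)
path_emb = rng.normal(size=d_path)

W_r, b_r = rng.normal(size=(d_joint, d_rad)) * 0.02, np.zeros(d_joint)
W_p, b_p = rng.normal(size=(d_joint, d_path)) * 0.02, np.zeros(d_joint)

joint = np.stack([project(rad_emb, W_r, b_r),
                  project(path_emb, W_p, b_p)])   # (2, d_joint) token pair

# Step 2: a lightweight adapter fuses the tokens and emits a hazard score.
# A mean-pool + linear head stands in for the Mamba adapter in this sketch.
w_head = rng.normal(size=d_joint) * 0.02
hazard = float(w_head @ joint.mean(axis=0))
```

The key design point survives the simplification: the expensive encoders stay frozen, and only the small projections and the adapter are trained.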
The model is trained using the Cox ranking loss function:
L<sub>cox</sub> = −∑<sub>i:E<sub>i</sub>=1</sub> (F<sub>θ</sub>(m<sub>i</sub>) − log ∑<sub>j:S<sub>j</sub>≥S<sub>i</sub></sub> exp F<sub>θ</sub>(m<sub>j</sub>))
where m<sub>i</sub> is the joint multi-modal representation of patient i, E<sub>i</sub> the censoring indicator (1 if the event was observed), S<sub>i</sub> the survival time, and F<sub>θ</sub> the Mamba adapter network predicting patient-specific hazard values; the inner sum runs over the risk set of patients whose survival time is at least S<sub>i</sub>. The Cox proportional hazards model is then used to derive time-dependent survival probabilities.
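A direct NumPy implementation of this negative log partial likelihood, assuming the hazard scores F<sub>θ</sub>(m<sub>i</sub>) have already been computed per patient (variable names are ours, not the paper's):

```python
import numpy as np

def cox_ranking_loss(hazards, times, events):
    """Negative Cox log partial likelihood, averaged over observed events.

    hazards: F_theta(m_i) per patient; times: survival/censoring times S_i;
    events: 1 if the event was observed (uncensored), 0 if censored.
    """
    loss = 0.0
    n_events = 0
    for i in range(len(hazards)):
        if events[i] != 1:
            continue                      # censored patients only join risk sets
        risk_set = times >= times[i]      # patients still at risk at time S_i
        log_denom = np.log(np.exp(hazards[risk_set]).sum())
        loss -= hazards[i] - log_denom
        n_events += 1
    return loss / max(n_events, 1)
```

Because the risk set of patient i always contains i itself, each term is non-negative, and the loss shrinks when earlier-event patients receive higher hazard scores, which is exactly the ranking behavior survival training needs.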
Experimental evaluations on a glioma tumor dataset demonstrate M4Survive's effectiveness. Achieving a concordance index (c-index) of 81.27±0.56, it outperformed state-of-the-art methods by 5.37%. Ablation studies confirmed the importance of both foundation model selection and adapter architecture. Using specialized foundation models like MedImageInsight for radiology and UNI2-h for pathology yielded the best results, highlighting the value of domain-specific pre-training. The Mamba adapter consistently outperformed other fusion architectures, demonstrating its ability to capture complex cross-modal interactions. These results underscore the potential of foundation model-driven multi-modal fusion in advancing precision oncology.
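For reference, the concordance index used above measures the fraction of comparable patient pairs that the model ranks correctly by predicted hazard. A minimal O(n²) sketch, without the censoring-aware tie refinements of production survival libraries:

```python
def concordance_index(hazards, times, events):
    """Fraction of comparable pairs ranked correctly; ties count half.

    A pair (i, j) is comparable when i's event is observed and
    times[i] < times[j]; it is concordant when hazards[i] > hazards[j].
    """
    num, den = 0.0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue                      # censored i gives no ordered pair
        for j in range(n):
            if times[i] < times[j]:
                den += 1
                if hazards[i] > hazards[j]:
                    num += 1.0
                elif hazards[i] == hazards[j]:
                    num += 0.5
    return num / den
```

A c-index of 0.5 corresponds to random ranking and 1.0 to a perfect ordering, so M4Survive's reported score sits well above the random baseline.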
This newsletter has highlighted two key trends in multimodal AI. The development of smaller, more efficient vision-language models like those surveyed in the first paper democratizes access to powerful AI capabilities, extending their reach to resource-constrained environments. Simultaneously, the innovative application of foundation models in specialized domains, as demonstrated by M4Survive, showcases the transformative potential of multimodal learning for complex tasks like survival prediction. These advancements pave the way for a future where multimodal AI is not only more accessible but also more deeply integrated into critical applications across various fields. The focus on efficiency, adaptability, and specialized foundation models signals a shift towards more practical and impactful deployments of multimodal AI.