This newsletter explores the cutting edge of multimodal image and text foundation models, showcasing innovative approaches to training, architecture, and application. From revolutionizing fetal ultrasound analysis to unlocking the potential of byte-level representations, these papers demonstrate the rapid progress and exciting possibilities within this dynamic field. Prepare to delve into novel techniques for synthetic data generation, enhanced model explainability, and surprisingly effective adaptations of existing models for new tasks.
FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis by Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub https://arxiv.org/abs/2502.14807
Caption: This figure illustrates the FetalCLIP pipeline, from dataset curation and pretraining to performance evaluation and dataset characterization. It highlights the multimodal training process using routine ultrasound scans, textbook images, and corresponding captions, along with the model's architecture and evaluation across various fetal ultrasound tasks. The figure also showcases the clinically annotated keywords used for dataset analysis and the distribution of gestational age and pixel spacing within the dataset.
Fetal ultrasound, a critical component of prenatal care, often faces challenges related to subjectivity and operator dependence in image interpretation. This variability can lead to diagnostic inconsistencies, particularly in resource-limited settings. While AI offers promising solutions, existing foundation models often struggle with the complexities of fetal ultrasound images. FetalCLIP emerges as a groundbreaking vision-language foundation model specifically designed to address these challenges in fetal ultrasound analysis.
Trained on an unprecedented scale, FetalCLIP leverages a massive dataset of 207,943 routine clinical ultrasound images paired with GPT-4o-generated captions and 2,092 expert-annotated image-caption pairs from a fetal ultrasound textbook. This multimodal approach allows FetalCLIP to learn intricate anatomical features and align them with diagnostic descriptions, thereby enhancing interpretability and facilitating knowledge transfer to various downstream tasks.
The model's architecture employs a ViT-L image encoder, a Byte-Pair Encoding tokenizer, and a text encoder capable of processing up to 117 tokens. This design choice allows for the inclusion of rich clinical text descriptions, a key advantage over models trained on simpler captions. The training process utilizes a contrastive learning framework, maximizing similarity between paired image-caption embeddings while minimizing similarity for unrelated pairs. Data augmentation techniques, including random rotation, translation, and color jittering, further enhance the model's robustness. The model was pretrained for 20 epochs using a learning rate of 5e-6, a warmup phase of 2,000 steps, and a cosine scheduler.
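To make the contrastive objective concrete, here is a minimal sketch of a symmetric, CLIP-style loss over a batch of paired embeddings. It is an illustration under stated assumptions, not the authors' training code: the function names are hypothetical, the temperature of 0.07 is a common CLIP default rather than a value reported for FetalCLIP, and plain Python lists stand in for encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i]).

    temperature=0.07 is an assumed default, not a value taken from the paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j]: scaled cosine similarity between image i and caption j.
    logits = [
        [sum(a * b for a, b in zip(images[i], texts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: the matching caption sits on the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    transposed = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(transposed[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

The loss is small when each image is most similar to its own caption and grows when the pairing is scrambled, which is the behavior the paragraph above describes.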
FetalCLIP undergoes rigorous evaluation on several key fetal ultrasound applications. In zero-shot classification of standard fetal views, it achieves an impressive F1 score of 87.1%, significantly outperforming existing foundation models and a specialized model trained with supervised learning (SonoNet) by 17.2%. For zero-shot gestational age estimation, FetalCLIP attains a prediction validity rate of 83.5%, demonstrating its ability to extract meaningful information related to fetal growth. Moreover, when used as a feature extractor for downstream tasks like congenital heart disease (CHD) detection and fetal structure segmentation, FetalCLIP consistently outperforms other foundation models, improving the AUC by 6.92% over previous models in CHD detection and achieving an average Dice Similarity Coefficient (DSC) of 84.22% across three different fetal anatomical planes for segmentation.
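For readers less familiar with the segmentation metric, the Dice Similarity Coefficient is twice the overlap between the predicted and reference masks divided by their combined size. The sketch below computes it for binary masks given as nested lists; the function name is illustrative, and returning 1.0 when both masks are empty is a convention assumed here, not something stated in the paper.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice Similarity Coefficient for two binary masks of equal shape."""
    pred = [bool(p) for row in pred_mask for p in row]
    true = [bool(t) for row in true_mask for t in row]
    if len(pred) != len(true):
        raise ValueError("masks must contain the same number of pixels")
    # Pixels marked as foreground in both masks.
    intersection = sum(p and t for p, t in zip(pred, true))
    total = sum(pred) + sum(true)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

For example, a prediction that covers two pixels, one of which is the single true foreground pixel, scores 2 * 1 / (2 + 1), or about 0.67.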
The study also investigates FetalCLIP's interpretability using Class Activation Mapping (CAM) and Uniform Manifold Approximation and Projection (UMAP). CAM analysis reveals that FetalCLIP effectively highlights relevant anatomical structures when identifying views and estimating gestational age. UMAP visualizations further confirm the model's ability to cluster different fetal planes and differentiate between brain subviews. These findings underscore FetalCLIP's potential to enhance the accuracy and efficiency of prenatal assessment, particularly in resource-constrained environments. The authors' plan to publicly release FetalCLIP holds significant promise for further research and the development of innovative applications in fetal ultrasound analysis.
Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling by Eric Egli, Matteo Manica, Jannis Born https://arxiv.org/abs/2502.14553
Caption: The Multiscale Byte Language Model (MBLM) architecture uses a hierarchical stack of decoder models (Global Models 1 & 2, Local Model). Global models process patches of the input byte stream, while the local model handles byte-level details within each patch, enabling efficient processing of extremely long sequences, up to 5 million bytes. The input byte stream is represented at the bottom, flowing upwards through the model stages.
Tokenization, a fundamental aspect of language modeling, introduces inherent biases and limits adaptability. Byte Language Models (BLMs) offer a compelling alternative by using bytes as a universal encoding, enabling seamless multimodal learning. However, the substantial length of bytestreams presents a significant computational hurdle. This paper introduces the Multiscale Byte Language Model (MBLM), a novel hierarchical decoder stack designed to address this challenge.
MBLM is model-agnostic, allowing for the incorporation of various decoder models at different stages, and enables training with context windows of 5 million bytes on a single GPU with full model precision. The MBLM architecture consists of N causal decoder models stacked hierarchically. The first N-1 stages function as global models, processing patch representations of the input and capturing inter-patch dependencies. The final stage serves as a local model, performing byte-level intra-patch modeling. A key innovation of MBLM lies in its flexibility to integrate different decoder types. The paper explores hybrid architectures combining Transformer and Mamba models, demonstrating their effectiveness in handling extremely long byte sequences. MBLM also introduces granular control over stage parallelism through selective checkpointing of intermediate activations, offering a trade-off between parallelism and compute time.
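One way to picture the hierarchy is as a nested reshaping of the byte stream, with one nesting level per stage. The sketch below is a simplified illustration based only on the description above: the function name, the padding byte, and the example stage lengths are assumptions, and the real model additionally embeds patches and passes representations between stages.

```python
def nest_bytes(data, stage_lengths, pad_byte=0):
    """Reshape a flat byte stream into nested patches, one level per stage.

    stage_lengths[0] is the number of patches seen by the first global model;
    the last entry is the number of bytes the local model sees per patch.
    """
    capacity = 1
    for length in stage_lengths:
        capacity *= length
    if len(data) > capacity:
        raise ValueError("input is longer than the hierarchy can hold")
    # Pad so the stream divides evenly into patches (padding choice is assumed).
    flat = list(data) + [pad_byte] * (capacity - len(data))

    def split(seq, lengths):
        if len(lengths) == 1:
            return seq  # innermost level: raw bytes for the local model
        chunk = len(seq) // lengths[0]
        return [split(seq[i * chunk:(i + 1) * chunk], lengths[1:]) for i in range(lengths[0])]

    return split(flat, list(stage_lengths))
```

With stage lengths (2, 3, 2), a 12-byte input becomes 2 outer patches, each holding 3 inner patches of 2 bytes. The context the stack can cover is the product of the stage lengths, which is how a few moderately sized stages can reach millions of bytes.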
Experiments on the Project Gutenberg dataset (PG19) showcase the scalability of MBLMs. A three-stage MBLM with a global Mamba followed by two Transformer decoders achieved 2.448 bits-per-byte (BPB) on the PG19 test set after processing 100GB of UTF-8 bytes. Hybrid MBLMs outperformed Transformer-only models on long sequences exceeding 1 million bytes. Interestingly, the study reveals that purely Mamba-based MBLMs, while offering the best performance, were computationally more demanding during training, particularly when used as the local model. Further investigation into the impact of context length on perplexity revealed diminishing returns beyond 4K bytes on PG19, suggesting a potential limit to the usefulness of extremely long contexts for this specific dataset.
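As a reminder of what the bits-per-byte figure measures: it is the model's average negative log-likelihood per byte, expressed in base 2. Training losses are usually reported in nats, so the conversion divides by ln 2. The helper below is a small sketch with a hypothetical name; it assumes the summed loss is taken over individual bytes, as in a byte-level model.

```python
import math


def bits_per_byte(total_nll_nats, num_bytes):
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    return total_nll_nats / (num_bytes * math.log(2))
```

A model that guesses uniformly over all 256 byte values scores exactly 8 bits per byte, which gives a useful upper reference point when reading reported numbers.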
In a novel application of BLMs to multimodal tasks, the paper evaluates MBLM on visual question answering (VQA) using the CLEVR dataset. A 3D MBLM with a 500K byte context window achieved 44% accuracy on CLEVR's validation set, demonstrating the model's capacity to learn from mixed-modality bytestreams. Remarkably, MBLMs outperformed LSTM baselines and achieved comparable performance to a CNN+LSTM model, even without a dedicated image encoder. Moreover, using discretized images and JPEG representations improved accuracy on certain question types, highlighting the potential of byte-level representations for capturing relevant visual features. The study also observed that pre-training on text data positively impacted VQA performance, contrary to some previous findings. These results underscore the potential of MBLMs as a basis for omnimodal foundation models capable of learning from and generating diverse data representations.
Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data by Yucheng Shi, Quanzheng Li, Jin Sun, Xiang Li, Ninghao Liu https://arxiv.org/abs/2502.14044
Caption: This diagram illustrates a framework for enhancing Large Multimodal Models (LMMs) using self-synthesized data. It shows a two-step process involving concept selection based on mutual information and a rejection sampling technique to refine generated explanations, ultimately improving the LMM's performance on specialized visual classification tasks. The Venn diagrams visualize the relationships between image content (X), label-level concepts (Z), selected image-level concepts (Z*), generated descriptions (D), and candidate explanations (Y).
Large multimodal models (LMMs) demonstrate impressive performance across various visual tasks, yet they often encounter difficulties with fine-grained visual reasoning and providing justifiable explanations, particularly within specialized domains. This research introduces a novel framework designed to enhance both the cognition and explainability of LMMs for specialized visual classification tasks by leveraging self-synthesized data, thereby eliminating the need for extensive manual annotation.
At the heart of this framework lies a two-step process. First, it utilizes an information bottleneck approach to select the most relevant visual concepts for each image. Given an image and its label, the method leverages the LMM's captioning capabilities to generate descriptions, approximating the true distribution of the image's features. It then selects a subset of expert-defined concepts, $Z^* \subset Z$, that maximizes mutual information with the image content $X$, $I(X; Z^*)$, while minimizing redundancy by penalizing the mutual information between the selected concepts and the full concept set, $I(Z^*; Z)$. This selection process is formalized as $Z^* = \arg\max_{Z' \subset Z} \left[ I(D; Z') - \beta I(Z'; Z) \right]$, where $D$ represents the set of image descriptions and $\beta$ balances relevance and redundancy. Second, a reward model-free rejection sampling technique filters the synthesized answers, selecting the one that best aligns with the chosen concepts for subsequent fine-tuning rounds. This iterative process of data synthesis and fine-tuning progressively refines the LMM's ability to generate accurate and explainable predictions.
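The second step can be pictured as a simple filter over candidate answers. The sketch below is a loose illustration, not the authors' procedure: scoring by substring matches, requiring the label to appear in the answer, and the 0.5 threshold are all assumptions made here to keep the example small, and every name in it is hypothetical.

```python
def concept_coverage(answer, concepts):
    """Fraction of the selected concepts that the answer mentions (assumed proxy score)."""
    text = answer.lower()
    if not concepts:
        return 0.0
    hits = sum(1 for concept in concepts if concept.lower() in text)
    return hits / len(concepts)


def rejection_sample(candidates, label, concepts, min_score=0.5):
    """Keep the candidate that names the label and best covers the concepts.

    Returns None when no candidate passes, so the image is skipped this round.
    The label check and min_score threshold are illustrative assumptions.
    """
    valid = [c for c in candidates if label.lower() in c.lower()]
    if not valid:
        return None
    best = max(valid, key=lambda c: concept_coverage(c, concepts))
    return best if concept_coverage(best, concepts) >= min_score else None
```

Because the score is computed directly from the selected concepts, no separate reward model has to be trained, which is the property the summary highlights.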
The researchers evaluate their framework on diverse datasets encompassing fine-grained classification, medical images, and plant disease images. The results demonstrate significant improvements in classification accuracy compared to baselines trained solely with labels or with general label-level explanations. For instance, on the Stanford Dogs dataset, the proposed method achieves 86.91% accuracy, surpassing the baseline trained with only labels (84.27%) and the baseline trained with general explanations (76.55%). Furthermore, the framework consistently generates higher-quality explanations, as measured by coherence, logical flow, and fluency. The method also exhibits superior performance in concept selection compared to using GPT-4o, LLaVA, or CLIP for concept extraction, achieving a peak precision of 72.89% with 25 descriptions.
This research offers a promising solution to the challenges of fine-grained visual reasoning and explainability in LMMs. By leveraging self-synthesized data and an iterative refinement process, the framework significantly enhances the model's cognitive abilities and provides more justifiable explanations. This advancement paves the way for more reliable and trustworthy multimodal models in specialized domains. The elimination of the need for extensive manual labeling makes this approach particularly valuable for knowledge-intensive applications where detailed image annotations are impractical to obtain.
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation by Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark https://arxiv.org/abs/2502.14846
Caption: CoSyn-400K is a synthetic dataset of 400,000 text-rich images across nine diverse categories (documents, charts, tables, diagrams, etc.) and 2.7M instruction-tuning data rows, used to train vision-language models (VLMs). This collage showcases the variety of synthetic data generated by CoSyn, including math problems, music sheets, chemical structures, and vector graphics, enabling VLMs to better understand and interact with complex visuals.
Vision-language models (VLMs) excel in general image understanding but frequently struggle with text-rich images such as charts, documents, and diagrams. These complex visuals demand both textual comprehension and spatial reasoning, abilities hampered by the limited availability of diverse training data. The paper introduces CoSyn (Code-Guided Synthetic data generation system), a novel framework that leverages the coding capabilities of text-only large language models (LLMs) to generate synthetic text-rich multimodal data. This approach circumvents the need for manual annotation and enables the creation of large-scale, diverse datasets.
CoSyn operates by first generating code (Python, HTML, LaTeX, etc.) based on a user-provided text query describing the desired image type (e.g., "book covers"). This generated code is then executed to render synthetic images. Critically, the underlying code acts as a grounded textual representation of the image, allowing for the generation of high-quality instruction-tuning data, again leveraging a text-only LLM. This process results in a rich multimodal dataset comprising image-instruction pairs, well-suited for training VLMs. Using CoSyn, the researchers constructed CoSyn-400K, a dataset containing 400,000 images and 2.7 million rows of vision-language instruction-tuning data across nine diverse categories.
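The key idea, that the code behind an image doubles as its ground truth, can be shown with a toy example. In the sketch below, fixed templates stand in for both LLM steps, so this is not CoSyn itself: the function names, the SVG bar-chart layout, and the question templates are all assumptions chosen for illustration.

```python
from html import escape


def render_bar_chart_svg(title, data, bar_width=40, max_height=100):
    """Produce SVG markup for a bar chart; a stand-in for LLM-written rendering code."""
    peak = max(data.values())
    width = len(data) * (bar_width + 10) + 10
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{max_height + 40}">',
        f"<title>{escape(title)}</title>",
    ]
    for index, (label, value) in enumerate(data.items()):
        height = round(max_height * value / peak)
        x = 10 + index * (bar_width + 10)
        parts.append(f'<rect x="{x}" y="{max_height - height + 10}" width="{bar_width}" height="{height}"/>')
        parts.append(f'<text x="{x}" y="{max_height + 30}">{escape(label)}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


def qa_from_data(title, data):
    """Derive question-answer pairs from the same data that drew the chart."""
    top = max(data, key=data.get)
    return [
        (f"Which category has the highest value in '{title}'?", top),
        (f"What is the total across all categories in '{title}'?", str(sum(data.values()))),
    ]
```

Because the answers are computed from the source data rather than read off the rendered pixels, they are correct by construction, which is what makes this style of synthetic supervision attractive.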
Models trained on CoSyn-400K achieved state-of-the-art performance on seven text-rich VQA benchmarks, surpassing not only competitive open-source models but also proprietary models like GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn facilitates sample-efficient learning, achieving strong results with less data. The framework also demonstrates effectiveness in zero-shot generalization to novel tasks, such as understanding nutrition labels (NutritionQA dataset), where open-source VLMs typically struggle. By generating a small, targeted synthetic dataset for fine-tuning, the model adapted remarkably well to this new domain.
Analysis reveals that CoSyn's synthetic data helps mitigate biases present in existing datasets, leading to improved generalization. For example, in ChartQA, models trained on CoSyn-400K exhibited a smaller performance gap between human-asked and machine-generated questions compared to models trained solely on ChartQA. Finally, CoSyn's capabilities extend beyond standard VQA to generating synthetic pointing data, enabling VLMs to locate specific elements within images. A model trained on this data achieved state-of-the-art performance on the ScreenSpot click prediction benchmark. Overall, CoSyn presents a powerful and efficient solution for scaling text-rich image understanding and propelling the development of multimodal digital assistants.
Pretrained Image-Text Models are Secretly Video Captioners by Chunhui Zhang, Yiren Jian, Zhongyu Ouyang, Soroush Vosoughi https://arxiv.org/abs/2502.13363
Caption: This diagram illustrates the architecture of a video captioning system adapted from the image-text model BLIP-2. The system processes video frames with a ViT, aggregates the frame embeddings (e.g., by concatenation), and feeds them into a Q-Former along with learned queries. Finally, an LLM generates the caption, which can be further refined through post-training with reinforcement learning using a CIDEr reward model.
Developing dedicated video captioning models is computationally expensive and often requires complex architectural designs to handle temporal dynamics. However, this research reveals a surprising finding: with minimal modifications and computational resources, a pre-trained image-text model can be effectively repurposed for video captioning, outperforming several specialized video captioning systems. The researchers adapted BLIP-2, a state-of-the-art image captioning model, by simply concatenating frame embeddings and post-training it on a relatively small dataset of 6,000 video-text pairs. This approach significantly reduces the data requirements compared to other methods that utilize millions of video-text pairs.
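The difference between concatenating and averaging frame embeddings is easy to see on toy data. In the sketch below, each frame is a list of token vectors, as a ViT would produce; the function names and tiny dimensions are illustrative assumptions, not the paper's implementation.

```python
def concat_frames(frame_tokens):
    """Join all frames' tokens into one long sequence, preserving frame order."""
    return [token for frame in frame_tokens for token in frame]


def average_frames(frame_tokens):
    """Average each token position across frames, discarding temporal order."""
    count = len(frame_tokens)
    return [
        [sum(values) / count for values in zip(*same_position)]
        for same_position in zip(*frame_tokens)
    ]
```

Concatenation hands the downstream module T times as many tokens but keeps every frame distinct, whereas averaging returns the same result no matter how the frames are ordered. That loss of ordering is one plausible reason concatenation works better for temporal content.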
The study examines three key factors for optimizing video captioning: model scale, data efficiency, and training supervision. The researchers found that mid-sized large language models (LLMs), such as Flan-T5-XL-3B, offer the best balance of trainability and performance, challenging the assumption that larger models are always superior. Furthermore, the extensive image-text pre-training of BLIP-2 proved highly transferable to video, enabling strong performance with limited video data. Finally, using reinforcement learning with the CIDEr metric as the reward significantly improved caption quality compared to training with traditional cross-entropy loss alone.
The adapted BLIP-2 model achieved remarkable results on established benchmarks, ranking 2nd on MSR-VTT and MSVD, and 3rd on VATEX. This performance highlights the surprising effectiveness of this simple adaptation strategy, particularly in low-resource scenarios. The researchers also explored the impact of video resolution and temporal fusion methods, finding that lower resolution (224x224) is sufficient for competitive performance and that frame concatenation outperforms averaging for capturing temporal dynamics. These findings offer valuable insights for optimizing resource allocation in video captioning.
This newsletter highlights the rapid advancements in multimodal image and text foundation models. From specialized models like FetalCLIP, tailored for complex medical image analysis, to the innovative architectural designs of MBLM that push the boundaries of sequence length, the field is witnessing remarkable progress. The clever repurposing of existing image-text models for video captioning, as demonstrated with BLIP-2, underscores the potential for resource-efficient solutions. Furthermore, the focus on explainability and the use of self-synthesized data, as explored in the research on enhancing LMMs, addresses critical challenges in deploying these powerful models in real-world applications. These developments collectively paint a picture of a rapidly evolving landscape where multimodal models are becoming increasingly sophisticated, adaptable, and accessible.