Hi Elman,
In this newsletter, we'll delve into the latest advancements in multimodal image and text foundation models, exploring novel approaches to continual learning, image segmentation, and fake news detection. These papers showcase innovative techniques that push the boundaries of what's possible with these powerful models, addressing challenges like catastrophic forgetting, efficient integration of dense prediction tasks, and cross-modal understanding in low-resource settings. Let's dive in.
LLaCA: Multimodal Large Language Continual Assistant by Jingyang Qiao, Zhizhong Zhang, Xin Tan, Yanyun Qu, Shouhong Ding, Yuan Xie https://arxiv.org/abs/2410.10868
Multimodal Large Language Models (MLLMs) excel at various tasks, but incorporating new knowledge through multimodal continual instruction tuning (MCIT) presents a significant hurdle. Catastrophic forgetting, where the model loses performance on previously learned tasks when fine-tuned on new instructions, is a major obstacle. While methods like model expansion exist, they suffer from memory explosion and high computational costs.
This paper introduces LLaCA (Multimodal Large Language Continual Assistant), a novel method that dynamically adjusts the Exponential Moving Average (EMA) weight during training to balance retaining old knowledge and integrating new information. The core issue with traditional gradient updates is the prioritization of new information at the expense of previously learned knowledge. EMA offers a potential solution by averaging parameters from previous iterations, but its fixed weight struggles to adapt to diverse datasets.
LLaCA addresses this by deriving the optimal EMA weight based on gradient information and previous parameters. Starting from an ideal condition for MCIT – perfect assimilation of new knowledge while retaining previous performance – and using a Taylor expansion of the loss function, the authors arrive at the following optimal weight:
β<sub>t</sub> = (L'(θ<sub>t</sub>) + 1) / ((θ<sub>t</sub> - θ*<sub>t-1</sub>) L''(θ<sub>t</sub>))
where β<sub>t</sub> is the dynamic EMA weight at step t, θ<sub>t</sub> are the model parameters after the current gradient update, θ*<sub>t-1</sub> are the EMA-averaged parameters from the previous step, and L'(θ<sub>t</sub>) and L''(θ<sub>t</sub>) are the first- and second-order derivatives of the loss.
This dynamic weight effectively balances plasticity (learning new information) and stability (retaining old knowledge). To keep computation cheap, the second-order derivative is approximated with a first-order difference of gradients, and the parameter-wise weights are collapsed into a single unified weight via the L1 norm.
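To make this concrete, here is a minimal PyTorch-style sketch of how such a dynamically weighted EMA update could look, assuming access to the current and previous gradients of the trainable (e.g. LoRA) parameters. The function name, the clamping, and the exact way the parameter-wise weights are aggregated are illustrative choices, not the authors' implementation.

```python
import torch

def dynamic_ema_update(theta_t, theta_prev, ema_prev, grad_t, grad_prev, eps=1e-8):
    """Sketch of a dynamically weighted EMA step in the spirit of LLaCA.

    theta_t    : parameters after the current gradient update
    theta_prev : parameters before that update
    ema_prev   : EMA-averaged parameters carried over from the last step
    grad_t     : gradient of the loss at theta_t   (~ L'(theta_t))
    grad_prev  : gradient at theta_prev, used to approximate L''(theta_t)
    """
    # Approximate the second derivative with a first-order difference of gradients.
    hessian_approx = (grad_t - grad_prev) / (theta_t - theta_prev + eps)

    # Parameter-wise weight from the closed form
    # beta_t = (L'(theta_t) + 1) / ((theta_t - ema_prev) * L''(theta_t)).
    beta = (grad_t + 1.0) / ((theta_t - ema_prev) * hessian_approx + eps)

    # One plausible reading of the unified parameter-wise weight: collapse the
    # per-parameter values with an averaged L1 norm, then clamp to [0, 1] so
    # the update stays a valid interpolation.
    beta = beta.abs().mean().clamp(0.0, 1.0)

    # EMA interpolation: beta controls stability (old EMA) vs. plasticity (new weights).
    return beta * ema_prev + (1.0 - beta) * theta_t

# Dummy usage on a single weight tensor.
theta_prev = torch.randn(4, 4)
theta_t    = theta_prev - 0.01 * torch.randn(4, 4)   # after a gradient step
ema_prev   = theta_prev.clone()
grad_prev, grad_t = torch.randn(4, 4), torch.randn(4, 4)
ema_new = dynamic_ema_update(theta_t, theta_prev, ema_prev, grad_t, grad_prev)
```

In practice such an update would run once per training step over the trainable parameters, so only one extra copy of the weights (the EMA parameters) needs to be stored.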
Experiments on LLaVA-1.5 using a continual visual-question-answering benchmark demonstrate LLaCA's effectiveness. Compared to a LoRA fine-tuning baseline, LLaCA drastically reduces forgetting (from 22.67 to 2.68) and significantly improves average accuracy (from 41.31 to 61.89). Moreover, LLaCA shows strong zero-shot generalization and robustness across different instruction types and training orders. Its low computational cost, achieved by training only a single set of LoRA parameters and a projection layer, makes it a practical and promising solution for continual learning in MLLMs.
Text4Seg: Reimagining Image Segmentation as Text Generation by Mengcheng Lan, Chaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang https://arxiv.org/abs/2410.09855
Integrating dense prediction tasks like image segmentation into MLLMs has traditionally been complex, often requiring the addition of dedicated visual decoders. Text4Seg proposes a novel, streamlined approach by reimagining image segmentation as a text generation problem. This is achieved through the text-as-mask paradigm, where image patches are mapped to corresponding text labels, forming semantic descriptors.
The core of Text4Seg lies in these semantic descriptors, a sequence representation of segmentation masks. Instead of traditional index masks or numerical coordinates, each image patch is assigned a text label, which can be a word, phrase, or even a descriptive sentence. This textual representation integrates seamlessly into the auto-regressive training pipeline of MLLMs, simplifying optimization and leveraging the inherent text generation capabilities of these models. The paper demonstrates that representing an image with a 16×16 grid of semantic descriptors (256 in total) achieves competitive segmentation performance.
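To picture the text-as-mask paradigm, here is a toy sketch of how a patch-level label grid could be flattened into the descriptor sequence an MLLM is trained to generate; the 4×4 grid and the label names are made up for illustration (Text4Seg itself uses a 16×16 grid).

```python
# Toy example: a 4x4 patch grid labeled with text descriptors.
# In Text4Seg the grid is 16x16 (256 descriptors) and the labels come from
# the referring expression or vocabulary, not hard-coded strings.
patch_grid = [
    ["sky",   "sky",   "sky",   "sky"],
    ["sky",   "dog",   "dog",   "sky"],
    ["grass", "dog",   "dog",   "grass"],
    ["grass", "grass", "grass", "grass"],
]

# Flatten row by row into the descriptor sequence the model emits
# auto-regressively, one text label per image patch.
semantic_descriptors = [label for row in patch_grid for label in row]
print(", ".join(semantic_descriptors))
```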
To address redundancy in these representations, Text4Seg introduces Row-wise Run-Length Encoding (R-RLE). This technique compresses repeated descriptors within each image row, reducing the length of semantic descriptors by a remarkable 74% and accelerating inference by 3x without sacrificing performance. An off-the-shelf mask refiner (SAM) is then applied as a post-processing step to obtain pixel-level segmentation masks.
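The row-wise compression itself is simple to sketch; the serialization below (a label, a run length, and a row delimiter) is an assumed format for illustration rather than the paper's exact encoding.

```python
from itertools import groupby

def rrle_encode(patch_grid):
    """Row-wise Run-Length Encoding: compress repeated descriptors per row."""
    rows = []
    for row in patch_grid:
        runs = [f"{label}*{len(list(group))}" for label, group in groupby(row)]
        rows.append(" ".join(runs))
    return " | ".join(rows)          # '|' marks the end of each image row

def rrle_decode(encoded):
    """Invert the encoding back to the full per-patch descriptor grid
    (assumes single-token labels for simplicity)."""
    grid = []
    for row in encoded.split(" | "):
        labels = []
        for run in row.split(" "):
            label, count = run.rsplit("*", 1)
            labels.extend([label] * int(count))
        grid.append(labels)
    return grid

grid = [["sky", "sky", "sky", "sky"],
        ["sky", "dog", "dog", "sky"]]
encoded = rrle_encode(grid)
print(encoded)                       # 'sky*4 | sky*1 dog*2 sky*1'
assert rrle_decode(encoded) == grid
```

Because each row is encoded independently, spatial structure is preserved and decoding back to the full descriptor grid is exact.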
Text4Seg was evaluated on various vision-centric tasks, including referring expression segmentation, open-vocabulary segmentation, and visual grounding, using several MLLM backbones ranging from 1.3B to 13B parameters. On the refCOCO benchmarks for referring expression segmentation (RES), Text4Seg achieved state-of-the-art performance, with LLaVA-1.5-13B reaching an average cIoU of 76.2. Similarly impressive results were observed on gRefCOCO for generalized referring expression segmentation (GRES), with LLaVA-1.5-13B achieving an average cIoU of 71.5. Furthermore, Text4Seg showed competitive performance on open-vocabulary segmentation benchmarks, highlighting its versatility and efficiency. This decoder-free approach simplifies training and allows for easier adaptation to different MLLM architectures, opening exciting possibilities for future research.
MMCFND: Multimodal Multilingual Caption-aware Fake News Detection for Low-resource Indic Languages by Shubhi Bansal, Nishit Sushil Singh, Shahid Shafi Dar, Nagendra Kumar https://arxiv.org/abs/2410.10407
The spread of multimodal fake news poses a significant threat, especially in regions with diverse languages like India. Existing research focuses on either multimodal approaches for high-resource languages or text-based methods for low-resource languages, neglecting the specific challenges of multimodal fake news detection in Indic languages. This paper introduces MMCFND (Multimodal Multilingual Caption-aware Framework for Fake News Detection) and a new dataset, MMIFND (Multimodal Multilingual dataset for Indic Fake News Detection), to address this gap.
MMIFND is a crucial resource comprising 28,085 real and fake news samples across seven low-resource Indic languages: Hindi, Bengali, Marathi, Malayalam, Tamil, Gujarati, and Punjabi. The MMCFND framework leverages pre-trained unimodal and pairwise encoders from a foundational model (FLAVA) that aligns vision and language, extracting deep representations from both visual and textual components of news articles. A multimodal fusion encoder integrates these representations for comprehensive crossmodal understanding.
Critically, MMCFND uses a vision-language model (BLIP-2) to generate descriptive image captions, providing valuable context for identifying the inconsistencies and manipulations often present in fake news. The caption features, along with the original text and image representations, are fused and fed into a classifier to determine news authenticity.
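At a high level, the fusion stage is straightforward to sketch. The code below assumes the FLAVA image, text, and multimodal encoders and the BLIP-2 caption branch have already produced fixed-size feature vectors; the module name, feature dimension, and classifier layout are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CaptionAwareFusionClassifier(nn.Module):
    """Illustrative fusion head: concatenates image, text, multimodal, and
    caption features (e.g. from FLAVA and BLIP-2 encoders) and predicts
    real vs. fake. Dimensions are placeholders."""

    def __init__(self, feat_dim=768, hidden_dim=512, num_classes=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(4 * feat_dim, hidden_dim),  # image + text + multimodal + caption
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, img_feat, txt_feat, mm_feat, caption_feat):
        fused = torch.cat([img_feat, txt_feat, mm_feat, caption_feat], dim=-1)
        return self.classifier(fused)

# Dummy usage with random features standing in for encoder outputs.
head = CaptionAwareFusionClassifier()
batch = [torch.randn(8, 768) for _ in range(4)]
logits = head(*batch)            # shape: (8, 2)
```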
MMCFND was evaluated on MMIFND and a Tamil text-only dataset. On MMIFND, it significantly outperformed several state-of-the-art methods (SpotFake, Semi-FND, Mul-FAD, HFND-TE, and FND-CLIP) in both accuracy and F1-score. Similar superior performance was observed on the Tamil dataset. Ablation studies confirmed the importance of both textual and visual features, with the inclusion of FLAVA-guided multimodal features and BLIP-2 captions contributing significantly to performance gains.
This newsletter highlighted key advancements in multimodal image and text foundation models. We explored LLaCA's innovative approach to continual learning, mitigating catastrophic forgetting by dynamically adjusting EMA weights. Text4Seg redefined image segmentation as a text generation task, simplifying integration with MLLMs through semantic descriptors and R-RLE compression. Finally, MMCFND addressed the critical challenge of multimodal fake news detection in low-resource Indic languages, leveraging a novel caption-aware framework and introducing the valuable MMIFND dataset. These advancements demonstrate the continued evolution and expanding capabilities of multimodal models, paving the way for more robust, adaptable, and efficient solutions across diverse applications.