The quest for efficient and effective long-context language models continues to drive exciting innovation in deep learning. This newsletter explores three recent papers that introduce novel architectures and training strategies to tackle the challenges of processing and generating text from extremely long sequences. From hybrid approaches combining state-space models with selective attention to innovative context compression techniques, these works offer promising pathways towards unlocking the full potential of LLMs in handling extensive textual data.
LOGO -- Long cOntext aliGnment via efficient preference Optimization by Zecheng Tang, Zechen Sun, Juntao Li, Qiaoming Zhu, Min Zhang https://arxiv.org/abs/2410.18533
Long-context models (LCMs) hold great promise, but their generation quality often falls short, leading to problems like hallucinations and deviation from instructions. This paper introduces LOGO (Long cOntext aliGnment via efficient preference Optimization), a novel training strategy designed to improve the alignment of LCMs with long contexts. The core issue addressed is the inefficiency of traditional cross-entropy loss in optimizing generation for long sequences, where the signal from the vast context can overwhelm the prediction signal.
LOGO leverages preference optimization, guiding the model to differentiate between preferred (aligned) and dis-preferred (misaligned) outputs. This approach moves beyond simply predicting the next token and focuses on generating outputs that align better with the desired behavior in a long-context setting. The training objective is based on SimPO, a variant of Direct Preference Optimization (DPO), and is formally defined as:
L<sub>LOGO</sub>(π<sub>θ</sub>) = -E<sub>(x, y<sub>w</sub>, {y<sub>l</sub><sup>(1...M)</sup>})</sub> log σ(β log π<sub>θ</sub>(y<sub>w</sub>|x) - (β/M) Σ<sub>j=1</sub><sup>M</sup> log π<sub>θ</sub>(y<sub>l</sub><sup>(j)</sup>|x) - γ)
Here, π<sub>θ</sub> represents the policy model (LCM), x is the input context, y<sub>w</sub> is the preferred output, {y<sub>l</sub><sup>(1...M)</sup>} represents a set of M dis-preferred outputs, β scales the reward difference between preferred and dis-preferred outputs, and γ is the target reward margin. An additional SFT (Supervised Fine-Tuning) regularization term is included to prevent performance degradation on other tasks, ensuring that the model retains its capabilities in areas like language modeling and multiple-choice question answering.
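To make the objective concrete, here is a minimal PyTorch sketch of a SimPO-style preference loss with multiple dis-preferred outputs and an optional SFT regularizer. The function name, the default values of β, γ, and the SFT weight, and the exact form of the regularization term are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def logo_loss(logp_w, logp_l, beta=2.0, gamma=1.0, sft_logp=None, lam=0.1):
    """Hypothetical sketch of a LOGO-style preference objective.

    logp_w:   (batch,) log-probabilities of the preferred output y_w
    logp_l:   (batch, M) log-probabilities of the M dis-preferred outputs
    beta:     reward scale (placeholder value)
    gamma:    target reward margin (placeholder value)
    sft_logp: optional (batch,) log-probabilities for the SFT regularization term
    lam:      weight of the SFT term (placeholder value)
    """
    # Reward difference: preferred reward minus the average dis-preferred
    # reward, shifted by the target margin gamma.
    margin = beta * logp_w - (beta / logp_l.size(-1)) * logp_l.sum(-1) - gamma
    loss = -F.logsigmoid(margin).mean()
    if sft_logp is not None:
        # SFT regularization to preserve general (short-context) capabilities.
        loss = loss - lam * sft_logp.mean()
    return loss
```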
To address the memory constraints inherent in long-context training, LOGO employs a reference-free preference optimization strategy, avoiding the need to keep a separate reference model in memory. Furthermore, it utilizes a position synthesis method to construct training data: positional indices are synthesized to simulate long sequences while only short segments of context data are actually processed, significantly reducing memory requirements. This allows the entire training procedure to run efficiently on a single machine with 8×A800 GPUs.
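A rough sketch of what such position synthesis could look like is below; the sampling scheme and the `synthesize_positions` helper are assumptions for illustration, not the paper's exact recipe:

```python
import torch

def synthesize_positions(chunk_lengths, target_length=64_000, seed=0):
    """Illustrative sketch: assign each short context chunk a block of
    positional indices drawn from a much longer range, so the model sees
    'long-context' positions while only attending over the short chunks."""
    g = torch.Generator().manual_seed(seed)
    total = sum(chunk_lengths)
    slack = target_length - total
    # Randomly choose (sorted) offsets so blocks stay ordered and non-overlapping.
    offsets = torch.sort(torch.randint(0, slack, (len(chunk_lengths),), generator=g)).values
    position_ids, cursor = [], 0
    for off, length in zip(offsets.tolist(), chunk_lengths):
        start = off + cursor
        position_ids.append(torch.arange(start, start + length))
        cursor += length
    return torch.cat(position_ids)  # shape: (sum(chunk_lengths),)
```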
The LOGO training dataset is constructed using a three-step pipeline. First, an automatic evaluator assesses the importance of different context chunks based on entity overlap with the given question, identifying the parts of the context most relevant to a particular query. Second, preferred outputs are synthesized using only these relevant context chunks, while dis-preferred outputs are created by including irrelevant or partially relevant chunks, mimicking common misalignment patterns observed in LCMs. Finally, positional indices are synthesized to create the illusion of long sequences, enabling efficient training with limited GPU memory.
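As a toy illustration of the first step, the sketch below scores chunks by entity overlap with the question; the regex-based "entity" extraction is a stand-in for whatever extractor the authors actually use:

```python
import re
from collections import Counter

def chunk_importance(question: str, chunks: list[str]) -> list[float]:
    """Toy stand-in for the automatic evaluator: score each context chunk by
    how many question entities (approximated here by capitalized words and
    numbers) it contains."""
    entities = set(re.findall(r"\b(?:[A-Z][\w-]*|\d+)\b", question))
    scores = []
    for chunk in chunks:
        tokens = Counter(re.findall(r"\b[\w-]+\b", chunk))
        overlap = sum(tokens[e] for e in entities)
        scores.append(overlap / (len(entities) or 1))
    return scores
```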
Experiments show that LOGO substantially improves LCM performance on real-world long-context tasks, achieving a 5-point average improvement on the LongBench benchmark for the Llama-3-8B-Instruct-80K model. Notably, the performance approaches that of GPT-4. Moreover, LOGO generalizes to short-context LLMs, enabling context window expansion up to 8 times while simultaneously improving performance. Critically, it maintains or even enhances performance on language modeling and short-context tasks like MMLU.
Taipan: Efficient and Expressive State Space Language Models with Selective Attention by Chien Van Nguyen, Huy Huu Nguyen, Thang M. Pham, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Ryan A. Rossi, Trung Bui, Viet Dac Lai, Franck Dernoncourt, Thien Huu Nguyen https://arxiv.org/abs/2410.18572
Caption: This diagram illustrates the architecture of Taipan, a novel long-context language model. It combines Mamba-2 with Selective Attention Layers (SALs), which use a gating network and Gumbel-Softmax to identify and refine important tokens for softmax attention, while less critical tokens bypass this step for efficiency. This hybrid approach allows Taipan to balance computational cost with the ability to capture crucial long-range dependencies.
Efficient long-context language modeling remains a significant challenge. While Transformers excel in many language tasks, their quadratic computational complexity during training and linear memory scaling during inference hinder their effectiveness with long sequences. State Space Models (SSMs), such as Mamba, offer constant memory usage but struggle with tasks demanding substantial in-context retrieval. This paper introduces Taipan, a novel hybrid architecture combining the efficiency of Mamba-2 with the expressiveness of Transformers.
Taipan integrates Selective Attention Layers (SALs) within the Mamba-2 framework. These SALs pinpoint tokens requiring long-range interactions using a lightweight gating network. Selected tokens undergo feature refinement, removing less important information, and are then augmented with information from a softmax attention module. Less crucial tokens bypass the attention step, preserving Mamba-2's efficiency. This targeted approach, represented by:
ĥ<sub>i</sub> = (1 - α<sub>i</sub>)h<sub>i</sub> + α<sub>i</sub>o<sub>i</sub>
where h<sub>i</sub> is the original token representation, o<sub>i</sub> is the attention output, and α<sub>i</sub> is a mixing weight from the gating network, balances computational efficiency with the ability to capture critical long-range dependencies. Taipan also uses Sliding Window Attention within the SALs to maintain linear time complexity, enabling theoretically unlimited context lengths.
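A simplified sketch of such a selective attention layer is shown below; the module names, the use of full (rather than sliding-window) attention, and the soft Gumbel-Softmax gating are illustrative assumptions rather than Taipan's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveAttentionLayer(nn.Module):
    """Simplified sketch of a Taipan-style SAL: a gating network scores each
    token, Gumbel-Softmax yields a selection weight alpha_i, selected tokens
    are refined and passed through softmax attention, and the rest keep their
    original representation."""
    def __init__(self, d_model: int, n_heads: int = 8, tau: float = 1.0):
        super().__init__()
        self.gate = nn.Linear(d_model, 2)          # logits for {skip, attend}
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.refine = nn.Linear(d_model, d_model)  # feature refinement
        self.tau = tau

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model) hidden states from the Mamba-2 backbone
        logits = self.gate(h)
        # Differentiable token selection; alpha is the "attend" probability.
        alpha = F.gumbel_softmax(logits, tau=self.tau, hard=False)[..., 1:]
        refined = self.refine(h)
        o, _ = self.attn(refined, refined, refined, need_weights=False)
        # Mix: hat{h}_i = (1 - alpha_i) h_i + alpha_i o_i
        return (1.0 - alpha) * h + alpha * o
```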
Taipan outperforms Transformer and Mamba baselines in zero-shot language modeling across various benchmarks. On recall-intensive tasks like SWDE, FDA, and SQuAD, Taipan shows marked improvement over Mamba-2, demonstrating its strength in in-context retrieval. Further, Taipan exhibits impressive extrapolation capabilities, maintaining strong performance on sequences up to 1 million tokens with efficient generation.
A key element of Taipan's design is the attention budget constraint, represented by C in the constraint loss:
L<sub>constraint</sub> = ((1/L) ∑<sub>i=1</sub><sup>L</sup> m<sub>i</sub> - C)<sup>2</sup>
where m<sub>i</sub> is the selection indicator for token i and L is the sequence length. An optimal value of C = 0.15 balances performance and efficiency. Removing positional embeddings from the attention module further enhances Taipan's extrapolation capabilities.
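The constraint itself is straightforward to express in code; this sketch assumes `alpha` holds the per-token selection indicators m<sub>i</sub> produced by the gating network:

```python
import torch

def attention_budget_loss(alpha: torch.Tensor, budget: float = 0.15) -> torch.Tensor:
    """Sketch of the attention-budget constraint: penalize the squared gap
    between the fraction of selected tokens and the target budget C."""
    # alpha: (batch, seq_len) selection indicators m_i from the gating network
    selected_fraction = alpha.mean(dim=-1)   # (1/L) * sum_i m_i per sequence
    return ((selected_fraction - budget) ** 2).mean()
```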
Two are better than one: Context window extension with multi-grained self-injection by Wei Han, Pan Zhou, Soujanya Poria, Shuicheng Yan https://arxiv.org/abs/2410.19318
Caption: SharedLLM uses a lower model to compress past context into a multi-grained context tree, where each node stores token sequences of varying lengths. The upper model then accesses this compressed information through shared key-value states and cross-attention, enabling long-context modeling without continual pre-training. The context tree is dynamically constructed based on query relevance, optimizing efficiency by expanding only necessary nodes.
Limited context windows in LLMs restrict their application to long documents. While continual pre-training on long-context data is effective, it's computationally expensive. SharedLLM offers a novel approach extending context windows by using multi-grained context compression and query-aware information retrieval, achieving a balance between efficiency and performance without extensive pre-training.
SharedLLM utilizes two short-context LLMs: an upper model (decoder) and a lower model (compressor). The lower model compresses past context into multi-grained representations organized in a context tree. Each tree handles a text chunk, with nodes storing token sequences of varying lengths. Higher-level nodes represent longer sequences with higher compression ratios, encoding coarser information. A query-dependent dynamic tree construction algorithm expands only relevant nodes based on the input query, optimizing efficiency. The upper model receives compressed information from the lower model through shared key-value states and cross-attention, enabling context-aware modeling. This exchange occurs only at the lowest layers, minimizing computational overhead.
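To illustrate the data structure, here is a toy sketch of a multi-grained context tree; the `ContextNode` fields, the halving-based splitting, and the compression ratios are illustrative assumptions, not SharedLLM's actual implementation:

```python
from dataclasses import dataclass, field
from typing import Optional
import torch

@dataclass
class ContextNode:
    """Toy sketch of a context-tree node: each node covers a token span;
    higher levels cover longer spans and store more heavily compressed
    key-value states produced by the lower (compressor) model."""
    start: int
    end: int
    compression_ratio: int
    kv_states: Optional[torch.Tensor] = None   # filled only if the node is expanded
    children: list["ContextNode"] = field(default_factory=list)

def build_tree(start: int, end: int, ratio: int = 8, min_span: int = 256) -> ContextNode:
    """Recursively split a chunk into halves; children use a smaller
    compression ratio, i.e. finer-grained representations."""
    node = ContextNode(start, end, ratio)
    if end - start > min_span and ratio > 1:
        mid = (start + end) // 2
        node.children = [build_tree(start, mid, ratio // 2, min_span),
                         build_tree(mid, end, ratio // 2, min_span)]
    return node
```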
The context tree is built dynamically using a depth-first search algorithm. Given a query, the algorithm selects nodes with higher similarity in the hidden space for expansion, ensuring fine-grained information for relevant parts while maintaining coarse-grained representations for less relevant sections. The selection policy π is defined as:
π((X<sub>left</sub>, X<sub>right</sub>), y) → left or right
where X<sub>left</sub> and X<sub>right</sub> are the subsequences of a node, and y is the query. The algorithm picks the node with maximum similarity to y.
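Building on the toy tree above, a sketch of the query-aware expansion might look like this; `encode_span` is an assumed helper that returns a hidden-state summary vector for a span, and cosine similarity stands in for whatever similarity measure the paper uses:

```python
import torch

def expand_tree(node, query_vec, encode_span, max_depth=3):
    """Sketch of the selection policy pi: at each node, compare the query with
    summaries of the left/right child spans and descend only into the more
    similar one; everything else stays at its coarse representation."""
    expanded = [node]
    depth = 0
    while node.children and depth < max_depth:
        left, right = node.children
        sim_left = torch.cosine_similarity(query_vec, encode_span(left.start, left.end), dim=0)
        sim_right = torch.cosine_similarity(query_vec, encode_span(right.start, right.end), dim=0)
        node = left if sim_left >= sim_right else right   # pi picks the more relevant child
        expanded.append(node)
        depth += 1
    return expanded  # nodes to represent at finer granularity
```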
SharedLLM demonstrates strong extrapolation capabilities, even when trained on shorter sequences. It outperforms baselines on mixed datasets and achieves comparable results to CEPE on RedPajama. On long-context understanding benchmarks like InfiniBench and LongBench, SharedLLM achieves superior or comparable results to state-of-the-art baselines. It also demonstrates significant speed-ups and reduced memory consumption.
This newsletter highlighted three distinct but interconnected approaches to enhancing long-context language modeling. LOGO focuses on improving alignment through preference optimization, addressing the challenges of generating coherent and instruction-following outputs in long-context scenarios. Taipan introduces a hybrid architecture that combines the efficiency of state-space models with the expressiveness of attention mechanisms, allowing for scalable processing of extremely long sequences. Finally, SharedLLM presents a novel context compression and retrieval mechanism, effectively extending context windows without the need for computationally expensive continual pre-training. These diverse strategies represent significant advancements in the field and pave the way for more powerful and efficient long-context language models. They offer valuable insights into addressing the challenges of memory, computation, and alignment that are crucial for unlocking the full potential of LLMs in handling extensive textual data.