This newsletter explores the cutting edge of deep learning architectures designed to tackle the challenges of long-context language modeling. As sequence lengths grow, traditional Transformer-based architectures run into limits of computational efficiency and memory. This has spurred innovative approaches that address the quadratic complexity of attention and enable efficient processing of sequences spanning hundreds of thousands, or even millions, of tokens. The three recent papers highlighted below introduce novel architectures and training strategies for scalable, performant long-context language modeling.
Recycled Attention: Efficient inference for long-context language models by Fangyuan Xu, Tanya Goyal, Eunsol Choi https://arxiv.org/abs/2411.05787
This paper introduces Recycled Attention, a clever inference-time method designed to alleviate the computational burden of long-context language models. The key innovation lies in strategically reusing previously computed attention patterns to reduce the cost of attention calculations without sacrificing performance.
Traditional long-context LLM inference suffers from the quadratic complexity of attention mechanisms. As the input sequence length grows, the computational cost of attending to all previous tokens becomes prohibitive. Existing methods often resort to evicting tokens from the key-value (KV) cache, which can negatively impact performance on tasks requiring access to non-local context. Recycled Attention offers a compelling alternative by maintaining the full KV cache while selectively attending to a smaller, dynamically constructed subset of tokens.
The method alternates between two modes: full attention over the entire KV cache and recycled attention over a much smaller cache. During a recycled-attention step, the model attends only to the top-K most-attended tokens from the previous full-attention step. This targeted reuse of attention patterns substantially reduces both data movement and attention computation.
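To make the mechanism concrete, here is a minimal sketch of how the recycled cache could be built after a full-attention step: aggregate the attention weights of the latest query across heads, keep the K highest-scoring positions, and restrict subsequent steps to that subset. The function name, tensor layout, and default K are illustrative assumptions, not the authors' implementation.

```python
import torch

def build_recycled_cache(attn_weights, keys, values, k=1024):
    """Select the top-k most-attended cache positions after a full-attention step.

    attn_weights: [num_heads, seq_len] attention of the latest query over the cache.
    keys, values: [seq_len, num_heads, head_dim] KV cache (illustrative layout).
    Returns the kept indices and the reduced cache used at recycled-attention steps.
    """
    scores = attn_weights.mean(dim=0)                       # aggregate heads -> [seq_len]
    k = min(k, scores.numel())
    top_idx = torch.topk(scores, k).indices.sort().values   # keep positional order
    return top_idx, keys[top_idx], values[top_idx]
```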
The paper explores two scheduling strategies for switching between full and recycled attention: a fixed stride approach, where full attention is performed every S steps, and a dynamic strategy based on the dissimilarity of consecutive query embeddings. The dynamic approach aims to trigger full attention only when the current token's focus significantly deviates from the previous one, further optimizing efficiency.
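A hedged sketch of how the two schedules might be expressed in code, assuming the dynamic trigger is cosine dissimilarity between the current query and the query used at the last full-attention step; the stride and threshold values are placeholders rather than numbers from the paper.

```python
import torch.nn.functional as F

def should_run_full_attention(step, query, last_full_query,
                              stride=50, threshold=0.9, dynamic=False):
    """Decide whether the next step uses full attention or recycled attention.

    Fixed-stride schedule: refresh every `stride` steps.
    Dynamic schedule: refresh when the current query has drifted away from the
    query at the last full-attention step (cosine similarity below a threshold).
    """
    if not dynamic:
        return step % stride == 0
    sim = F.cosine_similarity(query, last_full_query, dim=-1)
    return bool(sim < threshold)
```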
Experiments on Llama-3.1-8B and Qwen-2-7B with context lengths up to 128K tokens demonstrate the effectiveness of Recycled Attention. It achieves speedups comparable to eviction-based baselines while significantly improving accuracy on the RULER benchmark and on language modeling. For instance, on Llama-3.1-8B with a 32K context, Recycled Attention reaches 63% accuracy on RULER versus ~22% for baselines such as StreamingLLM and H2O, at similar speed. Further gains come from dynamic scheduling and from continued pre-training with Recycled Attention.
The authors attribute the superior performance of Recycled Attention to its ability to effectively recover a larger fraction of the attention mass of full attention compared to competing methods. Furthermore, it dynamically adapts to the context requirements of the task, attending to a flexible mix of local and non-local tokens, a key advantage over strictly local attention methods. While Recycled Attention doesn't reduce memory requirements, it offers a promising path towards efficient long-context LLM inference without compromising performance on tasks requiring access to the full context.
Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences by Niklas Schmidinger, Lisa Schneckenreiter, Philipp Seidl, Johannes Schimunek, Pieter-Jan Hoedt, Johannes Brandstetter, Andreas Mayr, Sohvi Luukkonen, Sepp Hochreiter, Günter Klambauer https://arxiv.org/abs/2411.04165
This paper introduces Bio-xLSTM, a family of recurrent neural network architectures tailored for modeling long biological and chemical sequences. While Transformers have dominated these fields, their quadratic complexity limits their applicability to long sequences and in-context learning scenarios. Bio-xLSTM builds on the xLSTM architecture, whose runtime scales linearly with sequence length and whose decoding uses constant memory, making it well-suited to the challenges of genomic, proteomic, and cheminformatic data.
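The practical difference is easiest to see at decode time: a Transformer appends keys and values for every generated token, so memory grows with sequence length, whereas a recurrent model carries a fixed-size state forward. The toy loop below illustrates only that contrast, using a generic recurrent cell as a stand-in; it is not the xLSTM update rule.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 256, 64
cell = nn.GRUCell(d_model, d_model)     # generic recurrent cell, stand-in for xLSTM
embed = nn.Embedding(vocab_size, d_model)
to_logits = nn.Linear(d_model, vocab_size)

state = torch.zeros(1, d_model)         # fixed-size state: memory does not grow with length
token = torch.tensor([0])
for _ in range(10_000):                 # O(1) work and memory per generated token
    state = cell(embed(token), state)
    token = to_logits(state).argmax(dim=-1)
```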
The authors present specialized Bio-xLSTM variants: DNA-xLSTM for genomic sequences, Prot-xLSTM for proteins, and Chem-xLSTM for small molecules. These models are trained and evaluated on large-scale datasets using various approaches, including causal and masked language modeling, fill-in-the-middle, and in-context learning.
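Of these objectives, fill-in-the-middle is the least self-explanatory: a sequence is rearranged with sentinel tokens so that a causal model learns to generate a missing span given both its prefix and its suffix. The sketch below shows that generic construction; the sentinel names and random split are placeholders, not the paper's exact recipe.

```python
import random

def make_fim_example(tokens, pre="<PRE>", suf="<SUF>", mid="<MID>"):
    """Rearrange one sequence (len >= 3) into prefix/suffix/middle order so a
    causal LM is trained to generate the missing middle span."""
    i, j = sorted(random.sample(range(1, len(tokens)), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [pre, *prefix, suf, *suffix, mid, *middle]
```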
DNA-xLSTM, trained on the human genome, outperforms existing DNA models, including Transformers, DNA-Mamba, and HyenaDNA, in downstream classification tasks. Prot-xLSTM excels in homology-aware protein generation, particularly with longer contexts, surpassing other architectures. Chem-xLSTM achieves state-of-the-art results in unconditional molecule generation and demonstrates promising in-context learning capabilities, generating molecules from unseen chemical domains based on few-shot examples.
The success of Bio-xLSTM highlights the potential of recurrent architectures in long-sequence modeling, offering a compelling alternative to Transformers. The linear runtime complexity and constant-memory decoding of xLSTM enable efficient processing of long sequences, addressing a key limitation of Transformer-based models. The results across genomics, proteomics, and cheminformatics demonstrate the versatility and effectiveness of Bio-xLSTM, paving the way for new research and applications in these critical domains.
Context Parallelism for Scalable Million-Token Inference by Amy (Jie) Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang https://arxiv.org/abs/2411.01783
This paper tackles the challenge of scaling LLM inference to extremely long contexts (up to one million tokens) by introducing context parallelism (CP), a novel parallelization strategy. The authors focus on optimizing prefill latency in multi-turn conversations, achieving near-linear scaling with up to 128 H100 GPUs.
The core of their approach lies in two optimized variants of ring attention: pass-KV and pass-Q. These variants are designed to minimize communication overhead in different inference scenarios. Pass-KV excels when KV cache hit rates are low, while pass-Q is optimized for decode operations and high KV cache hit rates. The authors provide clear criteria for selecting the optimal variant based on context length and cache characteristics.
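The choice boils down to a communication-cost trade-off: with a cold cache the freshly computed K/V tensors must circulate around the ring anyway (pass-KV), whereas with a mostly warm cache it is cheaper to rotate the small query tensor to where the cached K/V already lives (pass-Q). A hedged sketch of such a dispatch rule follows; the hit-rate threshold is illustrative, not the paper's exact criterion.

```python
def choose_ring_attention_variant(new_tokens: int, cached_tokens: int, is_decode: bool) -> str:
    """Heuristic sketch: pass-Q when decoding or when most of the context is
    already cached (high KV-cache hit rate); pass-KV otherwise, e.g. a full
    prefill where the new K/V tensors must be exchanged anyway."""
    hit_rate = cached_tokens / max(new_tokens + cached_tokens, 1)
    if is_decode or hit_rate > 0.8:   # 0.8 is a placeholder threshold
        return "pass-Q"
    return "pass-KV"
```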
Load-balanced sharding algorithms for both input tokens and KV cache entries ensure even distribution of compute and memory across CP ranks, maximizing efficiency. Experimental results on the Grand Teton platform demonstrate impressive scalability, achieving a 1M context prefill with Llama3 405B in 77 seconds with 93% parallelization efficiency.
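Load balancing matters because causal attention makes later tokens more expensive than earlier ones. A common remedy for ring attention, and one plausible reading of the sharding described here, is to split the sequence into 2P chunks and give rank r both chunk r and chunk 2P-1-r, so every rank pairs a cheap slice with an expensive one. The sketch below shows that pairing; the paper's exact algorithm may differ.

```python
def shard_tokens_balanced(tokens, num_ranks):
    """Split the sequence into 2*P contiguous chunks and assign rank r the pair
    (chunk r, chunk 2P-1-r), balancing causal-attention cost across CP ranks.
    Illustrative sketch, not the paper's algorithm."""
    p = num_ranks
    size = (len(tokens) + 2 * p - 1) // (2 * p)
    chunks = [tokens[i * size:(i + 1) * size] for i in range(2 * p)]
    return [chunks[r] + chunks[2 * p - 1 - r] for r in range(p)]

# e.g. shard_tokens_balanced(list(range(16)), 4) -> rank 0 holds tokens 0-1 and 14-15
```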
The authors demonstrate the superiority of context parallelism over multi-node tensor parallelism, particularly for long contexts where inter-node communication becomes a bottleneck. The robustness of the approach is further validated by comparable performance on both RDMA and TCP interconnects. This work opens exciting new possibilities for leveraging the power of LLMs with extremely long contexts.
This newsletter has showcased three distinct yet complementary approaches to advancing long-context language modeling. Recycled Attention offers a clever inference-time optimization to reduce computational costs without sacrificing performance. Bio-xLSTM demonstrates the resurgence of recurrent architectures, tailored for the specific challenges of biological and chemical sequences. Context Parallelism provides a powerful parallelization strategy for scaling inference to unprecedented context lengths. These innovations represent significant progress towards enabling efficient and performant long-context language modeling, opening up new possibilities for research and applications across various domains.