This newsletter dives into the cutting-edge advancements in deep learning architectures designed to tackle the challenges of long-context language modeling. We'll explore novel approaches to scaling mutual information, optimizing normalization strategies, and intelligently selecting training data, all aimed at pushing the boundaries of context window size and model performance.
L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling by Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić https://arxiv.org/abs/2503.04725
Caption: (a) An example text passage with X and Y representing two non-adjacent segments. (b) The bipartite mutual information (I) between X and Y scales as a power law with the sequence length (L). (c) An illustration of the autoregressive language modeling process, where the model generates output Y conditioned on input X and latent state Z. (d) The L<sup>2</sup>M condition states that the dimension of the latent state (Z) must be at least as large as the bipartite mutual information between X and Y to effectively capture long-range dependencies.
This paper introduces a groundbreaking bipartite mutual information scaling law specifically designed for natural language, differentiating it from the traditional two-point mutual information. The authors rigorously demonstrate that long-range dependencies in natural language adhere to a power-law scaling: I(X<sub>1:ℓ</sub>; Y<sub>1:L-ℓ</sub>) ~ L<sup>β</sup>, where X and Y are adjacent segments of lengths ℓ and L−ℓ that together make up a text of total length L. The scaling law is validated across multiple datasets using state-of-the-art LLMs such as LLaMA and DeepSeek, which consistently exhibit the predicted power-law growth.
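To make the power-law claim concrete, here is a minimal Python sketch of how one might fit the exponent β in log-log space; the sequence lengths and mutual-information values below are hypothetical placeholders, not numbers from the paper.

```python
# Minimal sketch: fit the exponent beta in I(L) ~ L^beta from bipartite
# mutual-information estimates. The mi_estimates values are illustrative
# placeholders, not results from the paper.
import numpy as np

seq_lengths = np.array([512, 1024, 2048, 4096, 8192])
mi_estimates = np.array([35.0, 47.0, 63.0, 85.0, 114.0])  # assumed I(X; Y) in nats

# A power law I = c * L^beta is linear in log-log space: log I = beta*log L + log c.
beta, log_c = np.polyfit(np.log(seq_lengths), np.log(mi_estimates), deg=1)
print(f"estimated power-law exponent beta ≈ {beta:.2f}")
```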
Based on this novel scaling law, the authors introduce the Long-context Language Modeling (L$^2$M) condition, which establishes a theoretical lower bound on the latent state size required for effective long-context modeling. The condition posits that, for a model to capture long-range dependencies, the dimension of its latent state must grow at least as fast as the power-law scaling of the bipartite mutual information: dim(z) ≥ I(X<sub>1:ℓ</sub>; Y<sub>1:L-ℓ</sub>). The condition is both theoretically proven and empirically validated on transformer and state-space models.
Empirical validation was conducted on both synthetic data (sub-volume Gaussian distributions) and real-world data (PG19 dataset). The synthetic data experiments confirm the bipartite mutual information scaling and the L$^2$M condition. Experiments on the PG19 dataset reveal that transformer models inherently satisfy the L$^2$M condition, while state-space models necessitate scaling up their size to maintain performance as sequence length increases, aligning with the theoretical predictions of the L$^2$M condition.
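As a back-of-the-envelope illustration of this contrast (with made-up layer counts, head sizes, and state sizes rather than configurations from the paper): a transformer's KV cache, which plays the role of the latent state, grows linearly with sequence length and so outpaces any sublinear power law, whereas a fixed-size state-space model's recurrent state must eventually fall below the growing mutual information unless the model itself is scaled up.

```python
# Rough illustration of why transformers satisfy the L^2M condition while
# fixed-size state-space models eventually cannot. All model sizes here are
# illustrative assumptions, not configurations from the paper.

def transformer_state_dim(n_layers: int, n_kv_heads: int, d_head: int, seq_len: int) -> int:
    # The KV cache acts as the latent state z: it grows linearly with sequence length.
    return 2 * n_layers * n_kv_heads * d_head * seq_len

def ssm_state_dim(n_layers: int, d_state: int, d_model: int) -> int:
    # A state-space model carries a fixed-size recurrent state, independent of length.
    return n_layers * d_state * d_model

for seq_len in (1_024, 8_192, 65_536):
    tf = transformer_state_dim(n_layers=24, n_kv_heads=16, d_head=64, seq_len=seq_len)
    ssm = ssm_state_dim(n_layers=24, d_state=16, d_model=1024)
    print(f"L={seq_len:>6}: transformer dim(z)={tf:,}  vs  SSM dim(z)={ssm:,}")
```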
The authors highlight the inadequacy of two-point mutual information, previously considered a key indicator of long-range dependencies, in capturing true multi-token dependencies. The bipartite mutual information, in contrast, offers a more comprehensive understanding of long-range dependencies in natural language. This proposed bipartite mutual information scaling law and the L$^2$M condition provide a theoretical framework for understanding and enhancing the efficiency and scalability of long-context language models. The authors acknowledge limitations and suggest future research directions, including extending the framework to other architectures, languages, and tasks, and investigating the ethical implications of improved long-context modeling.
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization by Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma https://arxiv.org/abs/2503.04598
Caption: The image diagrams four different normalization strategies for transformer blocks: (a) Post-Norm, (b) Pre-Norm, (c) a variant where Q and K are normalized, and (d) HybridNorm, where Q, K, and V are normalized before attention and Post-Norm is applied to the feed-forward network (FFN). HybridNorm aims to combine the training stability of Pre-Norm with the performance of Post-Norm by normalizing Q, K, and V within the attention mechanism and using Post-Norm in the FFN.
Training deep transformer networks, particularly Large Language Models (LLMs), presents ongoing challenges, especially regarding the optimal placement of layer normalization. While Pre-Norm stabilizes training thanks to its more prominent identity path, it often underperforms Post-Norm in final model quality. This paper introduces HybridNorm, a novel hybrid normalization strategy designed to combine the strengths of both.
HybridNorm uses QKV normalization within the attention mechanism and Post-Norm within the feed-forward network (FFN) of each transformer block. In the attention mechanism, the query (Q), key (K), and value (V) matrices are individually normalized before the attention computation: attn<sub>QKV</sub>(Q, K, V) = softmax((Norm(Q)Norm(K)<sup>T</sup>)/√d<sub>k</sub>)Norm(V). This stabilizes the information flow between layers, while the Post-Norm applied around the FFN retains Post-Norm's performance benefits as depth increases. A variant, HybridNorm*, applies Pre-Norm to both the MHA and the FFN in the first block (while keeping QKV-Norm in the MHA) to further enhance early training stability.
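For readers who want to see the block structure in code, here is a minimal PyTorch sketch of a HybridNorm-style block based on the description above. The choice of LayerNorm (rather than whichever norm the authors use), the per-head normalization, and all hyperparameters are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridNormBlock(nn.Module):
    """Sketch of one transformer block with QKV-Norm attention and a Post-Norm FFN."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)
        # Norm(Q), Norm(K), Norm(V) from the formula above; applied per head here.
        self.norm_q = nn.LayerNorm(self.d_head)
        self.norm_k = nn.LayerNorm(self.d_head)
        self.norm_v = nn.LayerNorm(self.d_head)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ffn_norm = nn.LayerNorm(d_model)  # Post-Norm: applied after the FFN residual

    def _heads(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q = self.norm_q(self._heads(self.wq(x)))
        k = self.norm_k(self._heads(self.wk(x)))
        v = self.norm_v(self._heads(self.wv(x)))
        # scaled_dot_product_attention already divides by sqrt(d_k) internally.
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(attn.transpose(1, 2).reshape(b, t, d))  # attention residual
        return self.ffn_norm(x + self.ffn(x))                   # Post-Norm around the FFN
```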
Extensive experiments on dense and Mixture of Experts (MoE) models (550M to 1.2B parameters for dense models, and a 6.9B-parameter MoE with 1.3B activated) show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm. For example, on a 1.2B-parameter dense model, HybridNorm* improved performance across various benchmarks, and similar improvements were observed on the MoE model.

Further analysis revealed that HybridNorm effectively balances gradient flow, mitigating the gradient explosion (Pre-Norm) and vanishing gradient (Post-Norm) issues. The paper also explored initialization methods and normalization positions, finding that HybridNorm performs best with Megatron initialization, while Pre-Norm benefits from Normal initialization. Ablation studies confirmed the superiority of QKV normalization, and experiments on deeper models (up to 29 layers) showed that HybridNorm maintains stability and performance where Post-Norm struggles to converge. These findings highlight HybridNorm's potential for training deep transformer models, especially LLMs, paving the way for larger, more complex architectures.
LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs by Jianghao Chen, Junhong Wu, Yangyifan Xu, Jiajun Zhang https://arxiv.org/abs/2503.02502
Long-context modeling is crucial for LLMs, but ensuring the quality of training data remains a challenge. This paper introduces LADM (Long-context data selection framework with Attention-based Dependency Measurement) to identify high-quality long-context data from large corpora. LADM prioritizes data with rich, long-range contextual relationships, leveraging the attention mechanism's retrieval capabilities.
LADM trains a smaller Long Attention Calculator on long-context data. This calculator processes candidate samples, computing a Pairwise Focus Score (PFS) between spans based on accumulated attention scores: PFS(i, j) = Sum(Softmax(Q<sub>j</sub>K<sub>0:j</sub><sup>T</sup> / √d<sub>k</sub>)[:, i]), where Q<sub>j</sub> are query states of span s<sub>j</sub>, K<sub>0:j</sub> are key states of spans from s<sub>0</sub> to s<sub>j</sub>, and d<sub>k</sub> is the scaling factor. PFS scores are aggregated into a span-level Aggregated Focus Score (AFS), considering dependency strength and variance. Finally, the sample-level Contextual Dependency Score (CDS) is calculated by weighting and summing AFS of all spans, emphasizing later positions for long-range relationships.
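To make the scoring pipeline concrete, the following hedged PyTorch sketch computes span-level scores from a full causal attention matrix produced by the calculator model. The block-summing step mirrors the PFS definition, but the AFS and CDS aggregations below are simplified stand-ins for the paper's exact formulas (which also account for dependency variance and position weighting).

```python
import torch

def pairwise_focus_scores(attn: torch.Tensor, span_len: int) -> torch.Tensor:
    """attn: (seq_len, seq_len) causal attention probabilities, e.g. averaged over
    the heads/layers of the Long Attention Calculator. Returns a (n_spans, n_spans)
    matrix whose [j, i] entry is the total attention mass that queries in span j
    place on keys in span i (a PFS-style score)."""
    n_spans = attn.shape[0] // span_len
    blocks = attn[: n_spans * span_len, : n_spans * span_len]
    blocks = blocks.view(n_spans, span_len, n_spans, span_len)
    return blocks.sum(dim=(1, 3))

def contextual_dependency_score(attn: torch.Tensor, span_len: int) -> float:
    pfs = pairwise_focus_scores(attn, span_len)
    n = pfs.shape[0]
    # AFS (simplified): average focus each span places on the spans before it.
    afs = torch.stack([pfs[j, :j].mean() for j in range(1, n)])
    # CDS (simplified): weight later spans more heavily, since attention reaching
    # far back from late positions is the signature of long-range dependency.
    weights = torch.arange(1, n, dtype=afs.dtype)
    return float((weights * afs).sum() / weights.sum())
```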
LADM was evaluated against random sampling and ProLong using several LLMs (OpenLlama-3B-v2, Llama2-7B/13B, and Mistral-7B-v0.1) on perplexity, a synthetic retrieval task, and real-world tasks from LongBench (QA, summarization, code completion). LADM consistently boosted performance, achieving an average improvement of 2.16% over ProLong across the four models on LongBench, with a notable 10.09% gain on single-document QA for Mistral-7B, using only 1B tokens for continual training. This highlights the importance of contextual dependencies in long-context training and LADM's efficient use of attention scores for identifying high-quality data, leading to more powerful long-context LLMs.
This newsletter has highlighted several key advancements in the pursuit of effective long-context language modeling. From establishing new scaling laws for mutual information to innovative normalization techniques and data selection strategies, the field is rapidly developing new tools to tackle the complexities of long-range dependencies in language. The common thread connecting these works is the recognition that effectively modeling long contexts requires not just larger models, but also smarter architectures and training paradigms. These developments promise to unlock even more powerful and capable LLMs in the near future.