Several recent studies have explored novel deep learning architectures for analyzing and predicting biological sequences, with a particular focus on mRNA and DNA. Wood, Klop, and Allard (2025) introduced Helix-mRNA, a hybrid structured state-space and attention model designed for mRNA therapeutics. This model uniquely incorporates a two-stage pre-training process and codon-aware tokenization, enabling it to analyze both coding regions and UTRs. It outperforms existing methods while maintaining efficiency with fewer parameters and extended sequence handling capabilities. This focus on UTR analysis addresses a significant gap in current mRNA optimization research, which often overlooks these crucial regulatory regions.
Concurrently, Oh et al. (2025) presented scMamba, a pre-trained model specifically tailored for analyzing single-nucleus RNA sequencing (snRNA-seq) data in neurodegenerative disorders. By incorporating a linear adapter layer, gene embeddings, and bidirectional Mamba blocks, scMamba effectively processes snRNA-seq data, addressing challenges posed by data quality and heterogeneity. This specialized model demonstrates improved performance in downstream tasks such as cell type annotation and differential gene expression analysis.
The application of advanced language models to DNA sequences is also gaining traction. Ma et al. (2025) introduced HybriDNA, a hybrid Transformer-Mamba2 model designed for long-range DNA sequence modeling. This architecture leverages the strengths of both attention mechanisms and selective state-space models, enabling efficient processing of ultra-long sequences with single-nucleotide resolution. HybriDNA achieves state-of-the-art performance in both generative and understanding tasks, demonstrating its potential for applications in synthetic biology and genomic research. Complementing this, Zhou et al. (2025) developed MLFformer, a Transformer-based model incorporating a Fast Attention mechanism and a multilayer perceptron (MLP) module for genotype-to-phenotype prediction in rice. This approach addresses the challenges posed by high-dimensional nonlinear features in genomic data, improving predictive accuracy compared to traditional methods. It showcases the potential of Transformer-based architectures for complex phenotypic prediction tasks.
Beyond sequence modeling, Momenzadeh and Meyer (2025) reviewed recent advancements in single-cell proteomics (SCP) using mass spectrometry. Their review highlights the transformative potential of SCP in revealing cellular heterogeneity and disease pathogenesis, while also acknowledging the persistent challenges related to sensitivity, data processing, and standardization. The authors emphasize the importance of integrating analytical, computational, and experimental strategies for advancing the field. Similarly, Saadat and Fellay (2025) focused on predicting nonsense-mediated decay (NMD) with NMDEP, a novel framework integrating rule-based methods, sequence embeddings, and biological features. By leveraging explainable AI, they identified key NMD determinants, contributing to a deeper understanding of this critical post-transcriptional surveillance mechanism.
A common thread across these studies is the increasing adoption of hybrid architectures that combine the strengths of different deep learning approaches, such as attention mechanisms and state-space models. This trend reflects the growing need for models capable of handling the complexity and scale of biological data while preserving crucial information at the nucleotide or single-cell level. Furthermore, the emphasis on pre-training and the incorporation of domain-specific knowledge, as seen in scMamba and NMDEP, underscores the importance of tailoring models to specific biological tasks and datasets.
Finally, Su, Yu, Zhi, and Ji (2025) introduced Seq2Exp, a network designed for predicting gene expression from DNA sequences by discovering regulatory elements. Their approach utilizes an information bottleneck with the Beta distribution to filter non-causal components and focus on the causal relationship between epigenomic signals, DNA sequences, and regulatory elements. This focus on causality represents a significant step towards understanding the complex interplay of factors governing gene expression. Collectively, these studies highlight the rapid advancements in applying deep learning to diverse biological problems, paving the way for more accurate predictions, deeper biological insights, and novel therapeutic applications.
Helix-mRNA: A Hybrid Foundation Model For Full Sequence mRNA Therapeutics by Matthew Wood, Mathieu Klop, Maxime Allard https://arxiv.org/abs/2502.13785
Caption: (A) The Helix-mRNA model architecture processes tokenized nucleotide sequences through Mamba2 layers (M²), MLP layers, and attention layers before a task-specific head generates output. (B) t-SNE visualizations of codon-only (top) and full-sequence (bottom) embeddings demonstrate Helix-mRNA's ability to capture phylogenetic relationships across different phyla. (C) Helix-mRNA's transfer learning approach freezes initial layers during fine-tuning for specific downstream tasks. (D) Scatter plots comparing Helix-mRNA's predictions of Mean Ribosome Load (MRL) against experimental measurements in different cell lines (HEK293T, T cells, HepG2) and replicates, showcasing its superior performance with high r² values.
mRNA therapeutics hold immense potential, but optimizing mRNA sequences for crucial properties like translation efficiency and stability remains a significant hurdle. Existing deep learning models often fall short by focusing solely on coding regions, neglecting the vital role of untranslated regions (UTRs). Helix-mRNA, a new hybrid foundation model, addresses these limitations by considering full-length mRNA sequences, incorporating both UTRs and coding regions in its analysis.
The innovative architecture of Helix-mRNA combines state-space-based and attention-based approaches. This hybrid design allows it to process sequences six times longer than current methods while using a fraction (only 10%) of the parameters of existing foundation models. A key feature is its use of single nucleotide tokenization with codon separation, preserving critical biological and structural information from the original mRNA sequence. The model's training strategy also contributes to its success. A two-stage pre-training approach, utilizing Warmup-Stable-Decay (WSD) scheduling, enhances both generalization and specialization. The first stage trains on a diverse dataset of mRNA sequences from various phyla, providing a broad foundation. The second stage then focuses on high-quality human mRNA data, refining the model for human applications.
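The codon-aware tokenization described above can be illustrated with a short sketch. This is an assumption-laden toy, not the released Helix-mRNA vocabulary: the token ids and the convention of a separator token after each codon in the coding region are illustrative stand-ins for whatever scheme the authors actually use.

```python
# Hypothetical sketch of single-nucleotide tokenization with codon
# separation, in the spirit of Helix-mRNA. Token ids and the separator
# convention are illustrative assumptions, not the model's real vocabulary.

NUCLEOTIDES = {"A": 0, "C": 1, "G": 2, "U": 3}
CODON_SEP = 4  # assumed special token marking codon boundaries in the CDS


def tokenize_mrna(utr5: str, cds: str, utr3: str) -> list[int]:
    """Tokenize a full mRNA at single-nucleotide resolution.

    UTRs are emitted one token per base; the coding region additionally
    gets a separator token after every codon (3 bases), so the model can
    recover the reading frame while still seeing individual nucleotides.
    """
    tokens = [NUCLEOTIDES[b] for b in utr5]
    for i in range(0, len(cds), 3):
        tokens.extend(NUCLEOTIDES[b] for b in cds[i:i + 3])
        tokens.append(CODON_SEP)
    tokens.extend(NUCLEOTIDES[b] for b in utr3)
    return tokens


toks = tokenize_mrna("GA", "AUGGCC", "UAA")
# 2 UTR tokens + 6 CDS tokens + 2 separators + 3 UTR tokens = 13 tokens
```

The point of the separator is that frame information survives tokenization without collapsing codons into a 64-word vocabulary, which is how full-sequence models can treat UTRs (no reading frame) and CDS (framed) uniformly.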
Helix-mRNA's performance was rigorously evaluated on several benchmark tasks related to coding region properties, including mRNA stability, degradation, and translation efficiency. It consistently outperformed existing models like CodonBERT, Transformer HELM, and Transformer XE. For example, on the mRFP expression task, Helix-mRNA achieved a Spearman rank correlation of 0.86 ± 0.008, compared to 0.85 for HELM XE and 0.82 for Transformer XE. Importantly, Helix-mRNA also demonstrated its unique capability to analyze UTRs by surpassing Optimus 5-Prime in predicting Mean Ribosome Load (MRL) in different cell lines. The most significant improvements were seen in T cells and HepG2 cell lines, where Helix-mRNA achieved r² values exceeding 0.8, compared to Optimus 5-Prime's r² values around 0.78-0.8. This comprehensive analysis of both UTRs and coding regions represents a significant advance in the field.
The open-sourcing of the Helix-mRNA model and its weights further amplifies its impact, encouraging broader adoption and development by the scientific community. This breakthrough has the potential to accelerate the development of more effective and broadly applicable mRNA therapeutics across various clinical and industrial domains.
HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model by Mingqian Ma, Guoqing Liu, Chuan Cao, Pan Deng, Tri Dao, Albert Gu, Peiran Jin, Zhao Yang, Yingce Xia, Renqian Luo, Pipi Hu, Zun Wang, Yuan-Jyue Chen, Haiguang Liu, Tao Qin https://arxiv.org/abs/2502.10807
Caption: The HybriDNA architecture combines Mamba2 blocks and a Transformer to process long DNA sequences. Mamba2 blocks efficiently capture long-range dependencies, while the Transformer focuses on local context. The model processes input nucleotides (A, T, C, G) to predict the next nucleotide in the sequence and utilizes a hybrid DNA block composed of SSD layers and convolutional operations.
Deciphering the intricate "language" of DNA has long been a central goal in biology. Foundation models, inspired by natural language processing, offer a powerful approach. However, existing DNA models face challenges in balancing contextual understanding with generative capabilities, and in efficiently processing the ultra-long sequences critical for understanding DNA function. HybriDNA, a novel decoder-only DNA language model, addresses these limitations through a hybrid architecture combining the strengths of Transformer and Mamba2 models. Mamba2 excels at capturing long-range dependencies with subquadratic complexity, while the Transformer component focuses on fine-grained, token-level details. This hybrid approach allows HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution.
The model is pre-trained on a massive, multi-species genome dataset (160 billion nucleotides from 845 species) using a next-token prediction objective. This pre-training, combined with a base-level tokenization scheme (treating each nucleotide A, C, T, G as a separate token), enables HybriDNA to capture subtle sequence variations crucial for understanding DNA function. A multi-stage warm-up procedure, gradually increasing the context length during pre-training, further enhances its ability to handle long-range dependencies.
For downstream understanding tasks, HybriDNA employs a novel "echo embedding" technique. This involves duplicating the input sequence and extracting embeddings from the latter half, effectively incorporating bidirectional context into the autoregressive model. For generative tasks, prompt tokens are used to encode task-specific instructions, guiding the model to generate DNA sequences with desired properties.
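The echo embedding idea can be sketched in a few lines. This is a minimal illustration of the mechanism as described, not HybriDNA's API: the input is duplicated, a causal (left-to-right) model encodes the doubled sequence, and each position's embedding is read from the second copy, where the model has already attended over the entire original sequence.

```python
# Minimal sketch of "echo embeddings" for a causal model: duplicate the
# input and keep hidden states from the second copy, which carry full
# (effectively bidirectional) context. `causal_model` is a stand-in.

from typing import Callable, List, Sequence


def echo_embeddings(
    tokens: Sequence[int],
    causal_model: Callable[[Sequence[int]], List[List[float]]],
) -> List[List[float]]:
    doubled = list(tokens) + list(tokens)  # "echo" the sequence
    hidden = causal_model(doubled)         # one vector per position
    return hidden[len(tokens):]            # keep only the second half


# Toy causal model: the embedding at position i is the running sum of
# token ids up to i, so second-copy positions reflect the whole input.
def toy_model(seq):
    out, acc = [], 0
    for t in seq:
        acc += t
        out.append([float(acc)])
    return out


emb = echo_embeddings([1, 2, 3], toy_model)
# Every returned embedding has already "seen" all of [1, 2, 3].
```

The design choice is worth noting: instead of retraining with a bidirectional objective, the same autoregressive weights are reused, and bidirectionality is obtained purely by input duplication at inference time.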
The performance of HybriDNA was rigorously tested on a suite of DNA understanding and generation tasks. On 33 DNA understanding datasets from the BEND, GUE, and LRB benchmarks, HybriDNA achieved state-of-the-art performance, surpassing existing encoder-only and decoder-only models. For example, on the GUE benchmark, HybriDNA-7B achieved an MCC of 88.10% for Promoter Detection, 72.03% for Core Promoter Detection, and 90.12% for Splice Site Detection, exceeding previous best results. The model also demonstrated strong performance on long-range understanding tasks, with improved results observed as the pre-training context length increased. In generative tasks, HybriDNA showed exceptional proficiency in designing synthetic cis-regulatory elements (CREs) with desirable properties, outperforming the baseline HyenaDNA model in generating higher-activity human enhancers and yeast promoters with greater diversity.
Beyond its performance gains, HybriDNA also exhibits favorable scaling behavior and computational efficiency. As the model size increased from 300M to 3B and 7B parameters, performance consistently improved, adhering to scaling laws observed in language models. Moreover, HybriDNA showed significantly higher training throughput and lower GPU memory consumption compared to a standard Transformer model with a similar parameter size, particularly when processing long context lengths. These findings underscore the potential of HybriDNA as a powerful tool for advancing both the understanding and engineering of genomic sequences.
From Mutation to Degradation: Predicting Nonsense-Mediated Decay with NMDEP by Ali Saadat, Jacques Fellay https://arxiv.org/abs/2502.14547
Nonsense-mediated mRNA decay (NMD) is a vital quality control mechanism that degrades transcripts containing premature termination codons (PTCs). Predicting NMD efficiency is crucial for understanding gene expression and disease, but existing models often rely on simplistic rules or limited features, hindering their accuracy. NMDEP (NMD Efficiency Predictor), a new framework, addresses this challenge by integrating sequence embeddings, curated biological features, and optimized rule-based methods to achieve state-of-the-art performance in predicting NMD efficiency.
The researchers initially benchmarked embedding-only models using paired DNA and RNA data from The Cancer Genome Atlas (TCGA). They found these models underperformed compared to a simple rule-based approach, highlighting the need for incorporating additional biological context. NMDEP addresses this by incorporating sequence embeddings from a pre-trained mRNA foundation model, along with features like variant position, transcript characteristics, and evolutionary conservation. Crucially, the study also optimized the thresholds for existing rule-based features like the "penultimate exon" rule and "close to start" rule using a two-step grid search, minimizing validation loss. NMD efficiency was quantified using allele-specific expression (ASE) derived from paired DNA and RNA sequencing, with the formula: NMD efficiency = −log₂(VAF_RNA / VAF_DNA), where VAF represents variant allele frequency.
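The ASE-based efficiency measure is simple enough to compute directly. A one-function sketch (the function name and example values are illustrative, not from the paper):

```python
import math


def nmd_efficiency(vaf_rna: float, vaf_dna: float) -> float:
    """NMD efficiency from allele-specific expression, per the paper's
    definition: -log2(VAF_RNA / VAF_DNA). Larger values mean stronger
    degradation of the PTC-bearing allele."""
    return -math.log2(vaf_rna / vaf_dna)


# A heterozygous variant at 50% allele frequency in DNA but only 12.5%
# in RNA implies the mutant transcript was degraded ~4-fold:
eff = nmd_efficiency(0.125, 0.5)  # → 2.0
```

An efficiency of 0 means the variant allele is expressed at the level its DNA frequency predicts (no decay), while each unit above 0 corresponds to a further two-fold depletion in RNA.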
The results demonstrated that NMDEP significantly outperformed both the baseline rule-based model and embedding-only approaches. Specifically, NMDEP improved Mean Absolute Error (MAE) by 10.7%, Root Mean Squared Error (RMSE) by 9.2%, and R² by 24.4%, showcasing its superior predictive power. Furthermore, by leveraging explainable AI (SHAP), the researchers identified key NMD determinants, reaffirming the importance of variant position while uncovering novel contributors like ribosome loading. This provides valuable insights into the underlying mechanisms of NMD.
To demonstrate its practical utility, NMDEP was applied to over 2.9 million simulated stop-gain variants, generating a comprehensive resource for assessing transcript stability. This large-scale application showcases the scalability and practical utility of NMDEP for variant interpretation and disease research. While promising, the authors acknowledge limitations, such as the lack of tissue-specific considerations. Future work will focus on incorporating tissue-specific data and expanding the model to other variant types like frameshifts and splicing mutations, further refining NMD efficiency prediction and its applications in understanding transcriptome regulation and disease.
scMamba: A Pre-Trained Model for Single-Nucleus RNA Sequencing Analysis in Neurodegenerative Disorders by Gyutaek Oh, Baekgyu Choi, Seyoung Jin, Inkyung Jung, Jong Chul Ye https://arxiv.org/abs/2502.19429
Caption: This diagram illustrates the architecture of scMamba, a pre-trained model for snRNA-seq analysis. It features bidirectional Mamba blocks, gene embeddings processed with masked expression modeling (MEM), and a linear adapter layer, enabling efficient handling of raw snRNA-seq data without dimensionality reduction.
Single-nucleus RNA sequencing (snRNA-seq) provides invaluable insights into the cellular landscape of neurodegenerative diseases. However, analyzing this data is challenging due to low sample quality, disease heterogeneity, and high dropout rates. Existing computational methods struggle with long processing times for imputation and often rely on dimensionality reduction techniques that can lead to information loss. scMamba, a novel pre-trained model, is designed to address these challenges and enhance snRNA-seq analysis, specifically for neurodegenerative disorders.
Inspired by the recent Mamba model, scMamba leverages a unique architecture incorporating a linear adapter layer, gene embeddings, and bidirectional Mamba blocks. Unlike Transformer-based models, scMamba efficiently handles long snRNA-seq data without dimensionality reduction. The model's pre-training utilizes masked expression modeling (MEM), where a subset of input embeddings is randomly masked, and the model predicts the masked expression levels. This pre-training objective is defined as: ℒ_MEM = Σᵢ∈M (Cᵢ − Ĉᵢ)², where M denotes the set of masked indices, and Cᵢ and Ĉᵢ represent the true and predicted gene expression levels, respectively. This approach allows scMamba to learn generalizable features of both cells and genes from raw snRNA-seq data.
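The MEM objective amounts to a squared-error sum restricted to masked gene indices. A plain-Python sketch (scMamba itself of course operates on batched tensors; the variable names below are illustrative):

```python
# Illustrative computation of the masked expression modeling (MEM)
# objective: squared error between true and predicted expression,
# summed only over the masked gene indices M.

def mem_loss(true_expr, pred_expr, masked_idx):
    """L_MEM = sum over i in M of (C_i - C_hat_i)^2."""
    return sum((true_expr[i] - pred_expr[i]) ** 2 for i in masked_idx)


C = [1.0, 0.0, 3.0, 2.0]      # true expression levels
C_hat = [0.5, 0.0, 2.0, 2.0]  # model predictions
loss = mem_loss(C, C_hat, masked_idx={0, 2})  # (0.5)^2 + (1.0)^2 = 1.25
```

Restricting the loss to masked positions is what forces the model to reconstruct expression from context rather than copy its input, analogous to masked-token prediction in language models.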
scMamba was evaluated on five diverse datasets from different brain tissues and compared against established methods like Seurat, SciBet, scBERT, and scHyena. In cell type classification, scMamba consistently outperformed other methods, achieving the highest F1 scores, particularly for detailed subtypes and subclusters. For doublet detection, scMamba demonstrated superior performance with the highest recall, F1 score, AUROC, and AUPRC in most experiments. In imputation tasks, scMamba effectively corrected batch effects and improved data quality, leading to denser clusters in UMAP visualizations and higher NMI and ARI values compared to other imputation methods.
Furthermore, scMamba demonstrated improved robustness in differential gene expression (DEG) analysis. Imputation with scMamba significantly enhanced the recovery of DEGs between diseased and neurotypical samples, even with subsampled data. For instance, with only 20% subsampling, scMamba recovered approximately 50% of the original upregulated DEGs in microglia, while non-imputed data recovered only 20%. This enhanced robustness is crucial for integrating data from multiple sources and conducting large-scale studies.
The results highlight scMamba's ability to learn meaningful representations of cells and genes, leading to improved performance in various downstream tasks. Its ability to handle raw snRNA-seq data without dimensionality reduction and its effectiveness in correcting batch effects and improving data quality make it a valuable tool for analyzing snRNA-seq data, particularly in the context of complex neurodegenerative diseases. scMamba's efficient processing and scalability further suggest its potential for large-scale studies aimed at uncovering the underlying causes of these devastating disorders.
Learning to Discover Regulatory Elements for Gene Expression Prediction by Xingyu Su, Haiyang Yu, Degui Zhi, Shuiwang Ji https://arxiv.org/abs/2502.13991
Caption: The Seq2Exp model architecture integrates DNA sequence (X_seq) and epigenomic signal (X_sig) data. A generator module uses these inputs to produce a probabilistic mask (P(m_s|X)), representing learned regulatory elements, which is then applied to the input. The masked input is then fed into a predictor module to estimate gene expression (Y).
Predicting gene expression from DNA sequences remains a core challenge in genomics. While deep learning has shown promise, many models overlook the critical role of regulatory elements and their interactions with epigenomic signals. Seq2Exp (Sequence to Expression), a new framework, addresses this by explicitly learning to discover and extract these regulatory elements, improving the accuracy of gene expression prediction. The key innovation lies in its integration of DNA sequence and epigenomic data, capturing the causal relationships between these elements and target gene expression.
Seq2Exp operates on the principle of a causal relationship between genomic data and gene expression. It categorizes regulatory elements into three types: Rg (potential interactors with the target gene), Rm (elements discovered through measurements like DNase-seq peaks), and Rag (elements actively interacting with the target gene and causally influencing its expression). Using this framework, the model decomposes the learning process into components based on DNA sequences and epigenomic signals. A generator module learns a token-level mask based on both data types, extracting relevant DNA sub-sequences. A predictor module then uses these extracted sequences to predict gene expression. An information bottleneck mechanism, implemented using a Beta distribution prior on the mask, constrains the mask size, ensuring that only the most influential regions are extracted, thus filtering out non-causal components. The objective function combines a task-specific loss (e.g., mean squared error) with a KL divergence term that enforces sparsity in the learned regulatory elements:
L ≈ (1/N) Σᵢ E_{p(mᵢ|xᵢ)}[ −log q(yᵢ | mᵢ ⊙ xᵢ) ] + β · KL[ p(mᵢ|xᵢ) ‖ r(mᵢ) ]
where mᵢ is the mask, xᵢ is the input sequence, yᵢ is the target gene expression, and β is a hyperparameter.
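The shape of this objective can be sketched numerically. Note the simplifications: this toy uses a fixed MSE surrogate for the expected prediction term and a per-position Bernoulli prior with rate r instead of the paper's Beta-distribution prior, so it illustrates only the trade-off structure, not Seq2Exp's actual loss.

```python
# Simplified sketch of an information-bottleneck objective in the style
# of Seq2Exp: task loss plus beta-weighted KL pushing the learned mask
# toward a sparse prior. Bernoulli prior used here for tractability;
# the paper's formulation uses a Beta distribution.

import math


def seq2exp_loss(mse, mask_probs, r=0.1, beta=0.01):
    """mse: surrogate for E[-log q(y | m ⊙ x)].
    mask_probs: per-position P(m_i = 1 | x) from the generator.
    KL[Bernoulli(p) || Bernoulli(r)] is summed over positions."""
    eps = 1e-9
    kl = sum(
        p * math.log((p + eps) / r)
        + (1 - p) * math.log((1 - p + eps) / (1 - r))
        for p in mask_probs
    )
    return mse + beta * kl
```

A mask that matches the sparse prior (e.g. keeping ~10% of positions) incurs almost no KL penalty, while a dense mask is penalized, so the generator only retains positions that measurably help the predictor.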
Seq2Exp was evaluated on two cell types (K562 and GM12878) using CAGE data from the ENCODE project, predicting expression values for over 18,000 protein-coding genes. Performance was compared against established baselines, including Enformer, HyenaDNA, Mamba, Caduceus, and EPInformer, using metrics like Mean Squared Error (MSE), Mean Absolute Error (MAE), and Pearson Correlation. Seq2Exp consistently outperformed all baselines. For example, on GM12878, Seq2Exp-soft achieved an MSE of 0.1873, MAE of 0.3137, and Pearson Correlation of 0.8951, compared to EPInformer's MSE of 0.1975, MAE of 0.3246, and correlation of 0.8907. Furthermore, Seq2Exp performed better than using regulatory elements identified by the peak-calling method MACS3, demonstrating its ability to learn more informative regulatory regions.
Seq2Exp represents a significant advance in gene expression prediction by explicitly incorporating the discovery of regulatory elements. Its integration of sequence and epigenomic information, coupled with the information bottleneck mechanism, allows for more accurate and biologically relevant predictions. This approach opens exciting avenues for future research, including extending the framework to diverse cell types and epigenomic data, and applying it to other genomic tasks related to regulatory element discovery and sequence analysis.
This newsletter highlights the significant advancements in applying deep learning to complex biological problems. A recurring theme is the development of hybrid architectures, combining the strengths of different deep learning approaches like attention mechanisms and state-space models, as seen in Helix-mRNA and HybriDNA. This reflects the growing need for models capable of handling the scale and complexity of biological sequences while preserving crucial information at the nucleotide level. The focus on pre-training and incorporation of domain-specific knowledge, exemplified by scMamba for snRNA-seq analysis and NMDEP for NMD prediction, emphasizes the importance of tailoring models to specific biological tasks and datasets. Finally, Seq2Exp's focus on causality in gene expression prediction through the discovery of regulatory elements represents a significant step towards a deeper understanding of the complex interplay of factors governing gene regulation. These advances pave the way for more accurate predictions, deeper biological insights, and novel therapeutic applications.