Several recent studies have explored novel applications of deep learning and algorithmic advancements in genomics and bioinformatics. One notable area of progress is in variant identification and primer design. Wang et al. (2025) introduce Primer C-VAE, a variational autoencoder with convolutional neural networks designed for this purpose. Their model achieved high accuracy in classifying SARS-CoV-2 variants and generated variant-specific primers suitable for qPCR, demonstrating its potential for rapid detection of emerging viral strains (Wang et al., 2025). This work addresses the critical need for efficient and adaptable primer design in epidemiological studies, particularly for rapidly mutating viruses and organisms with similar genomes.
Concurrently, research is also focusing on improving the speed, accuracy, and efficiency of real-time genome analysis. Firtina (2025) developed novel algorithms—BLEND, RawHash, RawHash2, and Rawsamble—to mitigate noise in various sequencing data types, with Rawsamble enabling de novo assembly directly from raw nanopore signals, bypassing basecalling (Firtina, 2025). These tools offer significant advancements in handling the increasing volume and complexity of genomic data.
Understanding the relationship between genomic variation and environmental factors is another key area of investigation. Davenport and Harrison (2025) explored the association between fungal genetic variants and environmental conditions in oceanic environments, suggesting a potential link between specific fungal genes and variations in iron, salt, and phosphate levels (Davenport & Harrison, 2025). This research contributes to our understanding of adaptation and phenotypic plasticity in fungi within dynamic oceanic ecosystems.
In evolutionary genomics, Albors et al. (2025) present PhyloGPN, a genomic language model (gLM) trained using a novel phylogenetic approach. By incorporating multispecies whole-genome alignments, PhyloGPN demonstrates improved performance in predicting functionally disruptive variants and exhibits strong transfer learning capabilities (Albors et al., 2025). This highlights the potential of integrating evolutionary information into gLMs for enhanced variant interpretation.
Finally, the application of large language models (LLMs) is gaining traction in bioinformatics. Wang et al. (2025) provide a comprehensive survey of recent advancements in using LLMs for various tasks, including genomic sequence modeling, RNA structure prediction, and protein function inference, and discuss the challenges and future directions of this rapidly evolving field (Wang et al., 2025). Newsham et al. (2025) introduce a framework for using LLMs to infer causal structures in biological systems, exploring their performance in zero-shot inference using interventional data (Newsham et al., 2025). In a separate theoretical contribution, Sahu et al. (2025) present a model of the evolutionary pressures driving genome length divergence, proposing a connection between genome length, cellular energetics, and endosymbiotic organelles (Sahu et al., 2025).
Enabling Fast, Accurate, and Efficient Real-Time Genome Analysis via New Algorithms and Techniques by Can Firtina https://arxiv.org/abs/2503.02997
Caption: This infographic illustrates the standard genome analysis pipeline, highlighting key stages where noise reduction and real-time analysis can significantly improve efficiency and accuracy. Novel algorithms like BLEND, RawHash, and Rawsamble are introduced to address challenges in seed matching, raw signal analysis, and de novo assembly directly from raw sequencing data, bypassing computationally expensive basecalling steps.
High-throughput sequencing (HTS) has revolutionized genomics, but the sheer volume and complexity of the data, coupled with inherent noise, pose significant challenges to accuracy, scalability, and efficiency. This dissertation introduces novel algorithms and techniques designed to address these challenges by mitigating noise for faster, more accurate, and efficient real-time analysis of genomic sequencing data. The work targets key steps in the genome analysis pipeline, from seed matching to raw signal analysis and de novo assembly.
A major contribution is BLEND, a noise-tolerant hashing mechanism for fuzzy seed matching. Traditional methods require exact seed matches, which limits sensitivity and increases computational cost. BLEND overcomes this by assigning identical hash values to highly similar seeds, enabling the identification of both exact and fuzzy matches with a single lookup. Evaluations show BLEND is 2.4x-83.9x faster in read overlapping and 0.8x-4.1x faster in read mapping than existing tools such as minimap2, with a 0.9x-14.1x lower memory footprint for overlapping, while also improving the quality of overlaps and yielding more accurate de novo assemblies.
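The fuzzy-matching idea behind BLEND can be illustrated with a SimHash-style scheme: a seed's hash is a per-bit majority vote over the hashes of its constituent k-mers, so seeds that differ by only a few bases tend to receive the same value and collide in a single hash-table lookup. The sketch below is a minimal illustration of that concept, not BLEND's actual implementation; the function names and parameters are assumptions.

```python
import hashlib

def kmer_hash(kmer: str, bits: int = 32) -> int:
    # Stable 32-bit hash of a k-mer (illustrative; BLEND uses its own hashing)
    digest = hashlib.blake2b(kmer.encode(), digest_size=4).digest()
    return int.from_bytes(digest, "big")

def fuzzy_seed_hash(seed: str, k: int = 4, bits: int = 32) -> int:
    """SimHash-style seed hash: similar seeds tend to get identical values."""
    votes = [0] * bits
    for i in range(len(seed) - k + 1):
        h = kmer_hash(seed[i:i + k], bits)
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    # Majority vote per bit position builds the final hash value
    return sum(1 << b for b in range(bits) if votes[b] > 0)

# Two seeds differing by one base share most k-mers, so their majority
# votes, and often their final hash values, coincide; one lookup then
# finds both exact and fuzzy matches.
a = fuzzy_seed_hash("ACGTACGTACGTACGT")
b = fuzzy_seed_hash("ACGTACGTACGAACGT")  # one substitution
```

Because a single substitution changes only k of the overlapping k-mers, most bit positions keep their majority, which is what makes the lookup tolerant to sequencing noise.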
For raw nanopore signal analysis, the dissertation introduces RawHash, a novel mechanism enabling real-time analysis without computationally expensive basecalling. RawHash quantizes raw signals, reducing noise and enabling hash-based similarity search. It also proposes Sequence Until, a mechanism to dynamically stop sequencing runs when sufficient data for accurate analysis is obtained. Compared to existing methods like UNCALLED and Sigmap, RawHash demonstrates a remarkable 25.8x and 3.4x improvement in throughput, respectively, along with significantly better accuracy for large genomes. RawHash2 further refines this approach with adaptive quantization and improved chaining algorithms, achieving even higher accuracy (10.57% improvement on average) and throughput (4.0x improvement on average).
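The quantization step at the heart of this approach can be sketched simply: nearby current values fall into the same discrete bucket, absorbing small-amplitude noise, and runs of quantized events are packed into integer keys for hash-table similarity search. This is an illustrative toy, not RawHash's actual algorithm; the bucket count, signal range, and event-packing parameters are assumptions.

```python
def quantize(signal, n_levels=8, lo=-3.0, hi=3.0):
    """Map normalized nanopore current values to discrete buckets so that
    small fluctuations (noise) land in the same bucket."""
    step = (hi - lo) / n_levels
    return [int((min(max(x, lo), hi - 1e-9) - lo) / step) for x in signal]

def event_hash(quantized, start, n=6, bits_per=3):
    """Pack n consecutive quantized events into one integer key usable
    as a hash-table lookup for raw-signal similarity search."""
    key = 0
    for q in quantized[start:start + n]:
        key = (key << bits_per) | q
    return key

sig = [0.12, 0.15, -1.4, 2.3, 0.9, -0.2, 1.1]
q = quantize(sig)       # 0.12 and 0.15 map to the same bucket
key = event_hash(q, 0)  # one integer key per window of events
```

The noise tolerance comes from the bucket width: two raw readings of the same underlying event differ slightly in current but produce identical quantized values, and therefore identical keys.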
Finally, to expand the scope of raw signal analysis, Rawsamble is introduced—the first mechanism for all-vs-all overlapping of raw signals using hash-based search. This allows for de novo assembly directly from raw signals, bypassing basecalling altogether. Rawsamble achieves substantial speedups (16.36x on average) and memory reductions (11.73x on average) compared to traditional pipelines using basecalling and overlapping. Notably, Rawsamble constructs unitigs up to 2.7 million bases long directly from raw signals, demonstrating the feasibility of this approach.
A Phylogenetic Approach to Genomic Language Modeling by Carlos Albors, Jianan Canal Li, Gonzalo Benegas, Chengzhong Ye, Yun S. Song https://arxiv.org/abs/2503.03773
Caption: This diagram illustrates the architecture of PhyloGPN, a genomic language model trained to predict F81 nucleotide substitution model parameters for a central position within a 481 bp DNA sequence. By incorporating phylogenetic information during training, PhyloGPN learns from multispecies whole-genome alignments without requiring them for prediction, enabling accurate variant classification and functional element prediction. The model utilizes ByteNet blocks and RCE 1D convolutions to process the sequence data.
Genomic language models (gLMs) hold promise in genomics, but their ability to identify evolutionarily constrained elements—crucial for understanding function and disease—has been limited. This study introduces PhyloGPN (Phylogenetics-based Genomic Pre-trained Network), a gLM trained using a novel framework that explicitly incorporates nucleotide evolution on phylogenetic trees. Unlike previous models like GPN-MSA, which require multiple sequence alignments (MSAs) for prediction, PhyloGPN only needs a single sequence, significantly enhancing its applicability for transfer learning and analysis of non-aligned regions.
The key innovation of PhyloGPN lies in its training methodology. It leverages multispecies whole-genome alignments and phylogenetic trees to model the evolution of nucleotides. Specifically, the model is trained to predict the parameters of a Felsenstein 81 (F81) nucleotide substitution model for the central position of a 481 bp input sequence. The loss function incorporates the likelihood of the observed nucleotides given the phylogenetic tree, effectively capturing evolutionary relationships. To address numerical instability, training minimizes a stable upper bound of the loss derived using the sigmoid function: a(t) > sigmoid(log t + Σ_{α∈ν} θ_α). This approach allows the model to learn from alignment data during training without requiring it for prediction, overcoming limitations of previous MSA-dependent models.
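For reference, the F81 model that PhyloGPN is trained to parameterize has a standard closed-form transition probability, P(x → y | t) = π_y (1 − e^(−βt)) + 1[x = y] e^(−βt), where π is the stationary base distribution. The sketch below implements that textbook formula; the normalization choice β = 1/(1 − Σ π_i²) is a common convention and an assumption here, not necessarily the paper's.

```python
import math

def f81_transition(pi, t, beta=None):
    """4x4 matrix of F81 transition probabilities P(x -> y | t).

    pi: stationary frequencies for the four bases (must sum to 1).
    beta: substitution rate; defaults to the usual normalization so that
    one substitution is expected per unit branch length (an assumption).
    """
    if beta is None:
        beta = 1.0 / (1.0 - sum(p * p for p in pi))
    e = math.exp(-beta * t)
    return [[pi[y] * (1 - e) + (1.0 if x == y else 0.0) * e
             for y in range(4)]
            for x in range(4)]

# Short branch: staying in the same state dominates; each row is a
# probability distribution over the four bases.
P = f81_transition([0.25, 0.25, 0.25, 0.25], t=0.1)
```

As t grows, every row converges to π, which is why the model's predicted parameters encode how strongly each site is constrained.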
The researchers evaluated PhyloGPN on several benchmarks, including ClinVar for classifying pathogenic and benign variants and OMIM for discriminating pathogenic regulatory variants from common variants. PhyloGPN consistently outperformed baseline gLMs, achieving state-of-the-art performance on ClinVar variant classification across all variant categories and demonstrating superior performance on the OMIM dataset across various minor allele frequency thresholds. Furthermore, PhyloGPN excelled in predicting deep mutational scanning (DMS) outcomes, outperforming baselines on 22 out of 25 proteins. In the BEND benchmarking suite, PhyloGPN achieved state-of-the-art performance on Chromatin Accessibility, Histone Modification, and CpG Methylation tasks, and achieved an AUROC of 0.96 on the Disease Variant Effect Prediction task, a substantial improvement over existing gLMs.
Primer C-VAE: An interpretable deep learning primer design method to detect emerging virus variants by Hanyu Wang, Emmanuel K. Tsinda, Anthony J. Dunn, Francis Chikweto, Alain B. Zemkoho https://arxiv.org/abs/2503.01459
Caption: The Primer C-VAE workflow depicts the process of training a convolutional variational autoencoder (C-VAE) model on SARS-CoV-2 and other coronavirus genomic data to generate candidate primers. The model uses feature generation and filtering to select highly specific forward primers based on target variant, other variant, non-human host, and other taxa criteria. The trained C-VAE model achieves high accuracy in differentiating between variants and closely related species, facilitating the design of effective PCR primers for molecular diagnostics.
A new deep learning model, Primer C-VAE (Convolutional Variational Auto-Encoder for Primer design), offers a semi-automated approach to designing PCR primers, addressing key limitations of traditional methods. Existing tools often struggle with long genomic sequences and differentiating between closely related organisms or viral variants. Primer C-VAE overcomes these limitations by using a VAE framework with CNNs to learn latent representations of genomic sequences, enabling the generation of highly specific forward and reverse primers. Its flexibility allows for variable-length primers (18-25 base pairs) and eliminates restrictions on input sequence length, making it suitable for various organisms, including viruses with large genomes and bacteria like E. coli and S. flexneri. The method also incorporates an interpretable feature extraction process, allowing researchers to identify variant-discriminative regions with biological significance.
The Primer C-VAE methodology comprises four stages. First, genomic data is acquired and pre-processed, including sequence standardization and ordinal encoding (f(N) = 0, f(C) = 1, f(T) = 2, f(G) = 3, f(A) = 4). Second, the C-VAE model is trained on the pre-processed data to extract discriminative genomic signatures, and four feature extraction methods (Pooling, Top, Mix, and Reconstruction) are used to generate candidate forward primers. Third, reverse primers are designed using a similar approach, incorporating a synthetic reference dataset to address the challenge of single-label classification for downstream sequences. Finally, candidate primer pairs undergo rigorous validation using Primer-BLAST and in silico PCR to ensure specificity and amplification efficiency.
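The ordinal encoding in the first stage is straightforward to express in code. One detail is assumed here for illustration: any base other than A, C, G, T is treated like 'N' and mapped to 0.

```python
# Encoding table as stated in the paper's pre-processing step
ENCODE = {"N": 0, "C": 1, "T": 2, "G": 3, "A": 4}

def encode_sequence(seq: str) -> list[int]:
    """Ordinal-encode a genomic sequence; unknown/ambiguous bases are
    mapped to 0 like 'N' (an assumption made for this sketch)."""
    return [ENCODE.get(base.upper(), 0) for base in seq]

encode_sequence("ACGTN")  # [4, 1, 3, 2, 0]
```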
Large Language Models for Zero-shot Inference of Causal Structures in Biology by Izzy Newsham, Luka Kovačević, Richard Moulange, Nan Rosemary Ke, Sach Mukherjee https://arxiv.org/abs/2503.04347
Caption: This figure presents the AUROC scores of a large language model (Gemma2-9B-it) predicting causal gene relationships in a zero-shot setting across various prompting strategies. Different shades represent varying chain-of-thought prompting, while the x-axis categorizes these prompts based on the inclusion of gene descriptions, literature information, or experimental context (cancer/mRNA). The dots represent individual runs, and the boxplots summarize the distribution of AUROC scores.
Unlocking the complex web of causal relationships within biological systems, such as gene regulatory networks, remains a significant challenge. Traditional methods are often time-consuming and expensive. This research explores the potential of large language models (LLMs) to infer these causal relationships in a zero-shot manner, leveraging the power of interventional experimental data for validation. The study focuses on directed gene-gene causal relationships, a crucial step towards broader applications of LLMs in causal inference tasks within biology.
The researchers developed a benchmarking approach using a Perturb-seq dataset, which combines large-scale gene perturbations with single-cell RNA sequencing. This dataset provides a causal ground truth by identifying differentially expressed genes (Δ_k) following an intervention on a specific gene (k). Formally, Δ_k = {j : p_j < α}, where p_j is the corrected p-value from a hypothesis test comparing the intervened and unintervened univariate distributions of gene j, and α is the significance level. This process generates an ancestral causal graph, where each edge represents a causal, though potentially indirect, influence. The LLM (Gemma2-9B-it) was then prompted to predict the probability of a causal relationship between gene pairs, generating a probability matrix. Performance was evaluated using the area under the receiver operating characteristic curve (AUROC) by comparing the LLM-derived probabilities to the experimentally derived causal graph.
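The AUROC evaluation admits a simple rank-based formulation: the probability that a randomly chosen true edge receives a higher LLM-derived score than a randomly chosen non-edge, with ties counting half. The sketch below uses made-up scores and labels purely for illustration.

```python
def auroc(scores, labels):
    """Rank-based AUROC: fraction of (positive, negative) pairs where the
    positive outranks the negative; ties contribute 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical LLM edge probabilities vs. Perturb-seq-derived ground truth
probs = [0.9, 0.8, 0.3, 0.2, 0.6]
truth = [1,   1,   0,   0,   1]
auroc(probs, truth)  # 1.0: every true edge outranks every non-edge
```

An AUROC of 0.5 corresponds to random guessing, so scores materially above 0.5 indicate the model carries real signal about gene-gene causality.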
Eukaryotes evade information storage-replication rate trade-off with endosymbiont assistance leading to larger genomes by Parthasarathi Sahu, Sashikanta Barik, Koushik Ghosh, Hemachander Subramanian https://arxiv.org/abs/2502.21125
A new theoretical study investigates why eukaryotic genomes are so much larger than prokaryotic genomes. The research, using a parameter-free model, proposes that the divergence in genome size stems from the interplay of two fundamental selection pressures: minimizing replication time and maximizing information storage capacity. The model quantifies this interplay using a 'selection pressure' factor (γ), defined as the ratio of replication time to information storage capacity:
γ = (replication time) / (information storage capacity)
where replication time is proxied by the length of the longest replichore and information storage capacity by the total genome length. The model simulates genome evolution by starting with a pool of identical sequences and subjecting them to random deletions and duplications, mimicking large-scale genomic mutations. A key difference between the simulated prokaryotes and eukaryotes lies in the number of replication origins. Prokaryotes are restricted to a single origin, while eukaryotes are allowed multiple origins, reflecting the biological reality. The model then applies the selection pressure, favoring sequences with lower γ values (faster replication and/or higher information storage).
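The γ computation can be sketched under simplifying assumptions: a circular genome with bidirectional replication forks, so the longest replichore is half the largest gap between adjacent origins. This toy is not the paper's parameter-free model, but it shows the core asymmetry: a single origin pins γ at 0.5 regardless of length, while multiple origins let longer genomes keep γ low.

```python
def longest_replichore(genome_len, origins):
    """Longest stretch replicated by a single fork: half the largest gap
    between adjacent origins on a circular genome (an assumption here)."""
    pts = sorted(origins)
    gaps = [((pts[(i + 1) % len(pts)] - pts[i]) % genome_len) or genome_len
            for i in range(len(pts))]
    return max(gaps) / 2

def gamma(genome_len, origins):
    # Selection factor: replication-time proxy / information-capacity proxy
    return longest_replichore(genome_len, origins) / genome_len

# Prokaryote-like, one origin: gamma stays 0.5 no matter how long the genome
gamma(5_000_000, [0])
# Eukaryote-like, many origins: replichores stay short, so gamma shrinks
# as the genome grows, easing the replication-time penalty
gamma(100_000_000, list(range(0, 100_000_000, 100_000)))
```

Under selection for lower γ, the single-origin population gains nothing from growing, whereas the multi-origin population can lengthen its genome (more information storage) without paying a replication-time cost, which is the divergence the model proposes.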
This newsletter highlights significant advancements in genomics and bioinformatics, focusing on novel applications of deep learning and algorithmic improvements. From innovative primer design with Primer C-VAE to noise reduction in sequencing data with tools like BLEND, RawHash, and Rawsamble, the field is rapidly developing methods to handle the increasing volume and complexity of genomic data. The development of PhyloGPN demonstrates the power of integrating evolutionary information into genomic language models for enhanced variant interpretation. Perhaps most excitingly, the exploration of LLMs for zero-shot causal inference opens new avenues for understanding complex biological systems, even without extensive training data. Taken together, these advancements promise to accelerate research and discovery across various domains of biology, from epidemiology and evolutionary genomics to systems biology and precision medicine.