Several recent studies apply large language models (LLMs) and other machine learning methods to open problems in genomics and diagnostics. Zhang et al. (2025) introduce a pan-infection foundation framework for pathogen prediction. The framework trains a "teacher" model on a large host-response transcriptome dataset and then distills its knowledge into smaller, specialized "student" models for staphylococcal, streptococcal, HIV, and RSV infections, as well as sepsis. This knowledge distillation approach supports accurate pathogen diagnosis while keeping the models lightweight enough for clinical deployment.
Similarly, Li et al. (2024) propose scReader, a hybrid approach combining LLMs with domain-specific representation models to interpret single-cell RNA-seq data. By initializing gene representations from functional descriptions with language models such as LLaMA-2, scReader improves cell annotation and visualization, particularly for challenging cell types across species. Chen et al. (2024) also employ LLMs in their GeneSUM framework for automated gene summary extraction, addressing the growing need for efficient literature review in biomedical research.
Beyond LLMs, other studies explore novel machine learning techniques for biological age estimation and genomic analysis. Wu et al. (2025) develop iTARGET, an interpretable age regression algorithm that addresses Epigenetic Correlation Drift and Heterogeneity Among CpGs by clustering methylation profiles and using Explainable Boosting Machines for group-specific prediction. This approach not only improves accuracy but also provides insights into key age-related CpG sites and aging rate changes. Zhou (2025) investigates the origin of alpha-satellite repeat arrays, proposing a mitochondrial origin based on analysis of the jewel wasp genome. The research describes a process of mitochondrial insertion, expansion, and rapid evolution of these repeats within the nuclear genome.
Moving towards metagenomics and epigenetic analysis, Liu et al. (2025) introduce METAGENE-1, a large language model pre-trained on a massive metagenomic dataset derived from wastewater samples. This metagenomic foundation model demonstrates promising results in pathogen detection and genomic sequence embedding, highlighting its potential for pandemic monitoring and biosurveillance. Colando et al. (2025) focus on ChIP-Seq normalization, examining the technical conditions underlying different methods and their impact on differential binding analysis. Their simulations and experimental results emphasize the importance of selecting a normalization method that fits the specific experimental context. Finally, Hozumi & Wei (2024) present k-mer topology, a novel method using persistent Laplacians to analyze the shape of genome space. This approach provides a new perspective on evolutionary relationships and demonstrates superior performance in species classification and clustering compared to existing methods.
The integration of explainable AI (XAI) is also gaining traction, as demonstrated by Usman et al. (2025), who combine neural networks with SHAP to reveal disease-related mechanisms in single-cell RNA-seq data. Their work on Huntington's disease showcases the potential of XAI to provide mechanistic insights and complement traditional differential gene expression analysis. These studies collectively contribute to a deeper understanding of complex biological processes and pave the way for more accurate diagnostics, personalized medicine, and effective disease monitoring.
Pan-infection Foundation Framework Enables Multiple Pathogen Prediction by Lingrui Zhang, Haonan Wu, Nana Jin, Chenqing Zheng, Jize Xie, Qitai Cai, Jun Wang, Qin Cao, Xubin Zheng, Jiankun Wang, Lixin Cheng https://arxiv.org/abs/2501.01462
Host-response-based diagnostics offer a promising avenue for improving infection diagnosis and reducing inappropriate antibiotic use. However, existing diagnostic models are often hampered by limited sample sizes and coarse infection classifications. This study introduces Teacher-Student Gene Pair Signature (TSGPS), a novel framework leveraging knowledge distillation and a vast pan-infection dataset to enhance pathogen prediction and infection-related disease diagnosis. The researchers curated the largest infection host-response transcriptome dataset to date, encompassing 11,247 samples across 89 blood transcriptome datasets from 13 countries and 21 platforms. This dataset represents a significant advancement in the field, providing a rich resource for developing more robust and generalizable diagnostic models.
The TSGPS framework employs a coarse-to-fine teacher-student architecture. A pan-infection foundation model (PIFM), trained on the diverse pan-infection dataset, serves as the "teacher." This model, built using transformer modules, achieved an AUC of 0.97 in distinguishing bacterial and viral infections, outperforming existing methods like bvnGPS, Random Forest, GBDT, and SVM. The knowledge from this "teacher" is then distilled into lightweight "student" models specialized for specific pathogens and sepsis. This knowledge transfer is facilitated by a distillation loss function, L<sub>distill</sub>, calculated as:
L<sub>distill</sub> = Σ<sub>i</sub> q<sub>i</sub> (q<sub>i</sub>-p<sub>i</sub>) / η·τ<sup>2</sup>
where q<sub>i</sub> and p<sub>i</sub> represent the teacher's and student's predicted probabilities (after softmax with temperature τ), and η is the number of predictions. This loss is combined with a cross-entropy loss to train the student models. The resulting student models are significantly smaller than the teacher model, making them suitable for deployment in clinical settings with limited computational resources. This distillation process allows the specialized models to benefit from the broad knowledge learned by the teacher model, improving their accuracy and generalization capabilities.
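To make the teacher-student training concrete, here is a minimal sketch of a temperature-scaled distillation objective. It uses the widely used formulation (KL divergence between softened teacher and student distributions, multiplied by τ², plus a cross-entropy term), which is an assumption and may differ in scaling from the TSGPS loss as summarized above. The mixing weight `alpha` and the example logits are hypothetical.

```python
import math

def log_softmax(logits, tau=1.0):
    """Log of the temperature-scaled softmax; larger tau gives softer outputs."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum for s in scaled]

def distillation_loss(teacher_logits, student_logits, tau=2.0):
    """KL(teacher || student) on softened probabilities, scaled by tau^2.

    This is the common textbook form, assumed here for illustration.
    """
    log_q = log_softmax(teacher_logits, tau)  # teacher log-probabilities
    log_p = log_softmax(student_logits, tau)  # student log-probabilities
    kl = sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))
    return kl * tau ** 2

def cross_entropy(student_logits, label):
    """Cross-entropy of the student against the hard label (tau = 1)."""
    return -log_softmax(student_logits)[label]

def student_loss(teacher_logits, student_logits, label, tau=2.0, alpha=0.5):
    """Weighted sum of both terms; alpha is a hypothetical mixing weight."""
    soft = distillation_loss(teacher_logits, student_logits, tau)
    hard = cross_entropy(student_logits, label)
    return alpha * soft + (1 - alpha) * hard

if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]   # hypothetical teacher logits
    student = [1.5, 1.2, 0.3]   # hypothetical student logits
    print(student_loss(teacher, student, label=0))
```

The distillation term is zero when the student reproduces the teacher's softened distribution exactly, so the cross-entropy term is what ties the student to the ground-truth labels.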
The TSGPS framework yielded impressive results across multiple pathogen predictions. The student models achieved AUCs of 0.99 for staphylococcal infection, 0.94 for streptococcal infection, 0.93 for HIV, 0.94 for RSV, and 0.99 for sepsis. These results demonstrate the effectiveness of knowledge distillation in transferring insights from the pan-infection data to specific pathogen and disease models, even with limited data for individual conditions. Furthermore, TSGPS demonstrated its utility in cross-disease prediction, significantly improving sepsis diagnosis compared to existing biomarkers like SeptiCyte and sNIP. The framework's ability to generalize across different infections and diseases highlights its potential for broad clinical application.
Origin of $α$-satellite repeat arrays from mitochondrial molecular fossils -- sequential insertion, expansion, and evolution in the nuclear genome by Yihang Zhou https://arxiv.org/abs/2501.02284
Alpha satellite DNA, characterized by large tandem repeats, constitutes a significant portion of eukaryotic genomes, yet its evolutionary origin remains enigmatic. This study challenges conventional understanding by revealing a mitochondrial origin for alpha-satellite-like (SatL) repeats in the jewel wasp Nasonia vitripennis. Researchers identified 1,545 SatL repeat units in the N. vitripennis nuclear genome, with 39 copies organized into two palindromic arrays within the mitochondria, remarkably increasing the mitochondrial genome size by 50%.
The study employed a combination of bioinformatics techniques, including local BLASTN searches, multiple sequence alignments, and phylogenetic analyses, to investigate the origin and evolution of SatL repeats. The analysis revealed that nuclear SatL repeats are located within nuclear mitochondrial DNA (NuMT) regions. Furthermore, the phylogenetic relationships of SatL repeats closely mirrored those of mitochondrial genes and NuMT pseudogenes. This compelling evidence strongly suggests that SatL repeats originated from the mitochondria and were subsequently inserted into the nuclear genome.
The researchers propose that at least ten independent insertion events occurred from the mitochondria to the nuclear genome within the last 500,000 years, after N. vitripennis diverged from its sister species N. giraulti. Genomic neighborhood analyses further supported this hypothesis, demonstrating the close association of nuclear SatL repeats with NuMT elements. This finding provides a novel perspective on the origin of satellite DNA and challenges the prevailing view that these repeats arose solely within the nuclear genome.
The mitochondrial SatL repeats exhibited a dramatic increase in GC content (from 33.9% to 50.4%), likely due to GC-biased gene conversion facilitated by the palindromic structure of the mitochondrial SatL arrays. Upon integration into the nuclear genome, SatL repeats underwent substantial copy number expansion, with the oldest array (SatL4B) expanding to over 400 copies from an initial 12-15 copies. Analysis of SatL4B revealed four distinct repeat unit types derived from deletions within the ancestral repeat's AT-rich region, suggesting that complex higher-order structures evolved through duplication events. This dynamic process of insertion, expansion, and diversification highlights the fluidity of genome architecture and the significant role of mitochondrial DNA in shaping the nuclear genome.
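GC content, the quantity reported to shift from 33.9% to 50.4%, is simply the fraction of G and C bases among the unambiguous bases of a sequence. The sketch below shows the calculation; the function name and the choice to skip ambiguous bases such as N are assumptions, not details taken from the study.

```python
def gc_content(sequence):
    """Fraction of G/C among unambiguous bases (A, C, G, T) in a sequence."""
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        return 0.0  # empty or fully ambiguous input
    gc = sum(1 for base in counted if base in "GC")
    return gc / len(counted)

if __name__ == "__main__":
    # Hypothetical toy sequence, not a real SatL repeat unit.
    print(f"{gc_content('ATGCGCTTAA'):.1%}")
```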
METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring by Ollie Liu, Sami Jaghouar, Johannes Hagemann, Shangshang Wang, Jason Wiemels, Jeff Kaufman, Willie Neiswanger https://arxiv.org/abs/2501.02045
Caption: The image illustrates the METAGENE-1 model development process, starting with wastewater sequencing and deep metagenomic analysis of DNA/RNA. The sequenced data is then tokenized and used to pretrain a 7B parameter transformer model, which can be applied to various downstream tasks like pathogen detection, species classification, and anomaly detection.
Liu et al. present METAGENE-1, a 7-billion-parameter autoregressive transformer model trained on a large dataset of metagenomic sequences derived from human wastewater. This is a departure from previous genomic models, which primarily focus on individual genomes or curated sets of known species. The training data comprises over 1.5 trillion base pairs of DNA and RNA sequences and captures the diversity of the human microbiome, with the aim of representing the full complexity of microbial and viral interactions within human-associated environments. The authors position the model as a tool for pandemic monitoring and pathogen detection.
The model's development involved several key design choices. The researchers used byte-pair encoding (BPE) tokenization, a technique common in natural language processing, to process the metagenomic sequences. BPE allows variable-length tokens and can tokenize novel sequences, which matters for the diverse and often unknown organisms present in wastewater. METAGENE-1's architecture mirrors that of popular language models like GPT and Llama: a decoder-only transformer trained with a causal language modeling objective. The model therefore predicts the next token in a sequence from the preceding tokens, and the researchers can reuse existing infrastructure and techniques developed for this class of models. The training data captures the complex mixture of genetic material found in wastewater, including both known and unknown pathogens.
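The following toy sketch illustrates how BPE can be applied to nucleotide sequences: it starts from single bases, repeatedly merges the most frequent adjacent pair, and then applies the learned merges to a sequence it has not seen. It is a simplified illustration, not METAGENE-1's tokenizer; the training sequences, the number of merges, and the tie-breaking rule are all assumptions.

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def learn_bpe(sequences, num_merges):
    """Learn an ordered list of merges from raw nucleotide strings."""
    corpus = [list(seq) for seq in sequences]  # start from single bases
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself (assumed rule).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges

def tokenize(sequence, merges):
    """Apply learned merges in order; unseen sequences fall back to bases."""
    tokens = list(sequence)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens

if __name__ == "__main__":
    merges = learn_bpe(["ACGTACGT", "ACGACG"], num_merges=2)
    print(merges)
    print(tokenize("ACGTTACG", merges))
```

Because every sequence can always be broken down to single bases, a novel read never produces an out-of-vocabulary token, which is the property the paragraph above highlights.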
The model's performance was evaluated on several benchmarks, demonstrating its potential for real-world applications. On a pathogen detection benchmark, METAGENE-1 achieved state-of-the-art results, outperforming other genomic models by a significant margin, with improvements ranging from 3 to 17 MCC points. It also excelled in generating high-quality sequence embeddings, a crucial capability for downstream tasks like anomaly detection and building predictive models. Further demonstrating its utility, METAGENE-1 successfully identified out-of-distribution data in a wastewater anomaly detection scenario, highlighting its potential for early detection of emerging health threats. These results suggest that METAGENE-1 can be a valuable tool for public health officials in monitoring for and responding to outbreaks of infectious diseases.
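For readers unfamiliar with the metric, the Matthews correlation coefficient (MCC) for a binary classifier is computed from the confusion matrix as (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below implements that definition. Treating an "MCC point" as MCC multiplied by 100 is an assumption, since the summary does not define the unit, and the counts in the example are hypothetical.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when any marginal total is zero
    return (tp * tn - fp * fn) / denominator

if __name__ == "__main__":
    # Hypothetical counts for two classifiers on the same test set.
    baseline = mcc(tp=80, tn=75, fp=25, fn=20)
    improved = mcc(tp=90, tn=85, fp=15, fn=10)
    # Difference expressed in "points", assuming 1 point = 0.01 MCC.
    print(round((improved - baseline) * 100, 1))
```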
scReader: Prompting Large Language Models to Interpret scRNA-seq Data by Cong Li, Qingqing Long, Yuanchun Zhou, Meng Xiao https://arxiv.org/abs/2412.18156
Caption: scReader initializes gene embeddings using GPT-3.5 based on NCBI Gene descriptions and constructs cell embeddings by combining gene expression levels with these embeddings. These cell embeddings are then projected and fed into a frozen Llama-13b model along with a task prompt (e.g., cell type classification), with the output used for downstream tasks.
Large language models (LLMs) have shown remarkable potential in various fields, but their application in the life sciences, particularly in interpreting complex genomic data, remains underexplored. Existing methods often fail to fully leverage the biological knowledge embedded within LLMs, leading to limited gene representations and hindering cross-species analysis. This paper introduces scReader (LLM as Single-Cell RNA Data Reader), a framework that integrates LLMs with gene expression data interpretation to address these limitations and improve the interpretation of single-cell RNA-seq data. scReader targets three challenges: insufficient use of gene knowledge, limited domain data across species, and the semantic gap between biological models and human language understanding.
scReader employs a two-part strategy. First, it initializes gene-level embeddings from detailed functional descriptions extracted from the NCBI Gene database. These descriptions are processed with GPT-3.5 to generate context-aware numerical representations (e<sub>i</sub> = f<sub>gpt</sub>(T<sub>i</sub>), where T<sub>i</sub> is the description of gene g<sub>i</sub>). This approach draws on the LLM's understanding of language and scientific literature to capture functional similarities between genes, going beyond simple keyword matching on gene descriptions.
Second, scReader constructs cell-level representations by combining gene expression levels with the pre-generated gene embeddings. Genes are ranked within each cell by expression level, and a position-aware representation (p<sub>i</sub> = e<sub>i</sub> + PE(i)) is computed for each gene, where PE(i) is the positional encoding of rank i, so that p<sub>i</sub> carries both gene identity and relative expression level. The final cell representation (E<sub>c</sub>) is the sequence of these position-aware gene representations. This approach preserves relative expression levels, combines semantic information with expression patterns, and provides a fixed-length representation for each cell. The position-aware representations are intended to help the model distinguish cells with similar gene expression profiles but different functional roles.
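The cell-level construction can be sketched in a few lines: rank the expressed genes, keep the top ones, and add a positional encoding of each gene's rank to its embedding. The sinusoidal form of PE(i), the truncation to `max_genes`, the omission of padding, and the tiny 4-dimensional embeddings are all assumptions made for illustration; the summary does not specify these details.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal encoding of a rank (assumed form of PE(i))."""
    encoding = []
    for k in range(dim):
        angle = position / (10000 ** (2 * (k // 2) / dim))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding

def cell_representation(expression, gene_embeddings, max_genes):
    """Return ranked gene names and their position-aware vectors p_i.

    expression: dict mapping gene name -> expression level in one cell
    gene_embeddings: dict mapping gene name -> embedding e_i (list of floats)
    """
    expressed = [g for g, level in expression.items()
                 if level > 0 and g in gene_embeddings]
    # Highest expression first; ties broken alphabetically for determinism.
    ranked = sorted(expressed, key=lambda g: (-expression[g], g))[:max_genes]
    vectors = []
    for rank, gene in enumerate(ranked):
        e_i = gene_embeddings[gene]
        pe = positional_encoding(rank, len(e_i))
        vectors.append([a + b for a, b in zip(e_i, pe)])  # p_i = e_i + PE(i)
    return ranked, vectors

if __name__ == "__main__":
    # Hypothetical expression values and embeddings.
    expression = {"G1": 5.0, "G2": 9.0, "G3": 0.0, "G4": 1.0}
    embeddings = {g: [0.1, 0.2, 0.3, 0.4] for g in expression}
    ranked, vectors = cell_representation(expression, embeddings, max_genes=2)
    print(ranked)
```

Because the rank, not the raw count, enters the representation, two cells with different sequencing depth but the same ordering of genes would map to the same sequence in this sketch.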
iTARGET: Interpretable Tailored Age Regression for Grouped Epigenetic Traits by Zipeng Wu, Daniel Herring, Fabian Spill, James Andrews https://arxiv.org/abs/2501.02401
Caption: The iTARGET algorithm predicts biological age from DNA methylation data. First, similar methylation profiles are clustered into age groups using FAISS. Then, an Explainable Boosting Machine (EBM) is trained on each age group using top correlated CpG sites, allowing for accurate age prediction and interpretation of age-related biomarkers.
Accurately predicting chronological age from DNA methylation patterns is crucial for aging research, but existing methods face challenges due to Epigenetic Correlation Drift (ECD) and Heterogeneity Among CpGs (HAC). A new study introduces iTARGET (Interpretable Tailored Age Regression for Grouped Epigenetic Traits), a two-phase algorithm designed to address these issues and improve both the accuracy and the interpretability of DNA methylation age prediction.
The first phase uses Facebook AI Similarity Search (FAISS) to cluster methylation profiles into age groups based on similarity. This step allows for the identification of age-related biomarkers specific to different stages of life, addressing the issue of ECD. The second phase trains an Explainable Boosting Machine (EBM) model for each age group, using the top 30 CpG sites with the strongest linear correlations within that group. This targeted approach reduces the complexity of the relationships within each group, making linear assumptions more valid and improving the overall accuracy of predictions while also addressing HAC. The EBM model also allows for the detection of pairwise interactions between CpG sites, offering insight into the synergistic effects of these sites on age prediction.
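The group-specific feature selection in the second phase can be sketched as follows: for each age group, rank CpG sites by the absolute Pearson correlation between methylation level and age, and keep the top k. This sketch groups samples by known decade of age as a stand-in for the FAISS similarity clustering, which is an assumption, and the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def group_by_decade(ages):
    """Map decade index (age // 10) to sample indices (assumed grouping)."""
    groups = defaultdict(list)
    for index, age in enumerate(ages):
        groups[int(age // 10)].append(index)
    return dict(groups)

def top_cpgs_per_group(methylation, ages, k=30):
    """For each age group, return indices of the k CpGs most correlated with age.

    methylation: list of samples, each a list of beta values (one per CpG)
    """
    selected = {}
    n_cpgs = len(methylation[0])
    for decade, indices in group_by_decade(ages).items():
        if len(indices) < 3:
            continue  # too few samples for a meaningful correlation
        group_ages = [ages[i] for i in indices]
        scores = []
        for j in range(n_cpgs):
            betas = [methylation[i][j] for i in indices]
            scores.append((abs(pearson(betas, group_ages)), j))
        scores.sort(key=lambda item: (-item[0], item[1]))
        selected[decade] = [j for _, j in scores[:k]]
    return selected

if __name__ == "__main__":
    ages = [20, 22, 24, 26, 28, 50, 52, 54, 56, 58]
    noise = [0.1, 0.3, 0.2, 0.1, 0.3]
    # CpG 0 tracks age in the twenties, CpG 1 tracks age in the fifties.
    young = [[a / 100, 0.5, n] for a, n in zip(ages[:5], noise)]
    older = [[0.5, a / 100, n] for a, n in zip(ages[5:], noise)]
    print(top_cpgs_per_group(young + older, ages, k=1))
```

The toy data is built so that a different CpG is most informative in each group, which is the situation the per-group selection is meant to handle.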
The iTARGET method was evaluated on a publicly available dataset of 11,910 blood-derived methylomes, and compared against several established epigenetic clocks and linear regression models. The dataset was split into 80% training and 20% testing sets, with five-fold cross-validation performed to ensure robustness. The EBM model is represented by the following formula:
ŷ = β₀ + Σⱼfⱼ(xⱼ) + ΣᵣΣₛfᵣₛ(xᵣ, xₛ)
where ŷ is the predicted age, β₀ is the intercept, fⱼ(xⱼ) represents the learned feature function for the j-th feature xⱼ, and fᵣₛ(xᵣ, xₛ) represents the interaction function between the r-th and s-th features. This formula allows the model to capture both the individual effects of CpG sites and their interactions, providing a more comprehensive understanding of how methylation patterns relate to age. The use of EBM also allows for the interpretation of feature importance and interactions, making the model more transparent and insightful than traditional black-box machine learning models.
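To show how the additive formula is evaluated, the sketch below represents each learned function as a piecewise-constant lookup table over bins and sums the contributions for one sample. It illustrates the structure of the model only; it is not the API of any EBM library, and the bin edges, table values, and intercept are hypothetical.

```python
from bisect import bisect_right

def lookup(edges, values, x):
    """Piecewise-constant shape function: values[k] applies to the k-th bin.

    `values` must have len(edges) + 1 entries.
    """
    return values[bisect_right(edges, x)]

def ebm_predict(x, intercept, main_effects, interactions):
    """Evaluate y = b0 + sum_j f_j(x_j) + sum_(r,s) f_rs(x_r, x_s).

    main_effects: {j: (edges, values)}
    interactions: {(r, s): (edges_r, edges_s, table)} with table[row][col]
    Returns the prediction and the per-term contributions.
    """
    contributions = {}
    for j, (edges, values) in main_effects.items():
        contributions[f"f{j}"] = lookup(edges, values, x[j])
    for (r, s), (edges_r, edges_s, table) in interactions.items():
        row = bisect_right(edges_r, x[r])
        col = bisect_right(edges_s, x[s])
        contributions[f"f{r},{s}"] = table[row][col]
    return intercept + sum(contributions.values()), contributions

if __name__ == "__main__":
    # Hypothetical model over two CpG beta values.
    main_effects = {0: ([0.5], [-2.0, 3.0]), 1: ([0.5], [-1.0, 4.0])}
    interactions = {(0, 1): ([0.5], [0.5], [[0.0, 1.5], [0.5, 0.0]])}
    age, parts = ebm_predict([0.2, 0.7], 40.0, main_effects, interactions)
    print(age, parts)
```

The returned per-term contributions are what make the model interpretable: each CpG's effect on the predicted age can be read off directly rather than inferred after the fact.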
The results demonstrated that iTARGET-decade (using decade-based age grouping) achieved the lowest Mean Absolute Error (MAE) of 3.7752 years among all evaluated models. While Linear Lasso achieved a slightly lower Root Mean Squared Error (RMSE) of 5.7082 compared to iTARGET-decade's 5.8164, iTARGET still outperformed in terms of MAE, a crucial metric for accuracy in age prediction. An "iTARGET-ideal" scenario, assuming perfect age group classification, achieved an even lower MAE of 1.5744 and RMSE of 2.5867, highlighting the potential for highly precise prediction when age groups are correctly identified.
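The split result, where one model wins on MAE and another on RMSE, follows from how the two metrics weight errors: RMSE squares each error and so penalizes a few large misses more heavily than MAE does. The sketch below reproduces that pattern with hypothetical ages and predictions that are unrelated to the paper's data.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

if __name__ == "__main__":
    ages = [30, 40, 50, 60]
    model_a = [31, 41, 51, 69]  # mostly close, one large miss
    model_b = [34, 44, 54, 64]  # uniformly moderate errors
    print("A:", mae(ages, model_a), rmse(ages, model_a))
    print("B:", mae(ages, model_b), rmse(ages, model_b))
```

In this toy case model A has the lower MAE while model B has the lower RMSE, mirroring the relationship reported between iTARGET-decade and Linear Lasso.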
This newsletter highlights a surge of innovation in applying advanced computational methods to complex biological data. The work covered ranges from foundation models and knowledge distillation for pathogen prediction (TSGPS) and LLM-based single-cell data interpretation (scReader) to novel algorithms for age estimation (iTARGET) and an investigation of the evolutionary origins of genomic elements (the mitochondrial origin of alpha-satellite repeats). Metagenomic foundation models like METAGENE-1 show promise for pandemic monitoring and biosurveillance, while advances in interpretable AI, as exemplified by iTARGET, bring transparency and mechanistic insight to genomic analysis. A common thread across these studies is the push towards more accurate, personalized, and insightful analyses, with implications for diagnostics, personalized medicine, and our understanding of fundamental biological processes.