ArXiv Pulse - Stay updated with the latest research papers

General Overview

Several new computational tools and methodologies are advancing the analysis of genomic data, particularly in the areas of gene regulation, disease association, and drug response. Ghosh, Dutta, and Santoni (2025) introduce TFBS-Finder, a deep learning model for predicting Transcription Factor Binding Sites (TFBSs). Leveraging DNABERT for sequence embedding and incorporating CNNs, a Modified Convolutional Block Attention Module (MCBAM), and a Multi-Scale Convolutions with Attention (MSCA) module, TFBS-Finder outperforms existing methods on ENCODE ChIP-seq datasets. This architecture allows the model to capture both long-range dependencies and higher-order local features within DNA sequences, both of which are crucial for accurate TFBS prediction.

Concurrently, Wu, Remita, and Diallo (2025) present MirLibSpark, a scalable, distributed pipeline for plant microRNA prediction and functional annotation using Apache Spark. This addresses the growing need for efficient processing of large NGS datasets, offering improved speed and accuracy compared to existing methods. The application of machine learning to specific biological problems is also evident in the work of Wright et al. (2025), who developed DepoRanker, a tool for predicting Klebsiella depolymerases. This machine learning model outperforms BLAST in identifying potential depolymerases, offering a promising avenue for phage therapy development.

Meanwhile, Neeley et al. (2025) explore the use of Large Language Models (LLMs) for gene prioritization in rare disease diagnosis. Benchmarking various LLMs, they found that while GPT-4 initially performed best, accuracy decreased with larger gene sets. Their proposed divide-and-conquer strategy, combined with multi-agent and Human Phenotype Ontology (HPO) classification, mitigated biases and improved causal gene identification. This work highlights both the potential and the limitations of LLMs in complex biological tasks, emphasizing the need for tailored strategies to overcome inherent biases.

Moving beyond single-gene analysis, Pena, Lin, and Li (2025) introduce a novel method for constructing cell-type taxonomies using Optimal Transport with Relaxed Marginal Constraints (OT-RMC). This approach facilitates the simultaneous alignment of clusters across multiple single-cell datasets, even when cluster proportions vary or some clusters are absent in certain samples. This work addresses a critical challenge in single-cell analysis, enabling more robust and accurate comparisons across diverse datasets. Similarly, Mohammad, Björkegren, and Michoel (2025) propose a network-driven framework for enhancing gene-disease association studies, specifically in coronary artery disease (CAD). By integrating cis- and trans- genetic regulatory effects within a Transcriptome-Wide Association Study (TWAS) framework using tissue-specific Gene Regulatory Networks (GRNs), they aim to uncover a more complete picture of the genetic architecture of complex diseases.

Finally, Huang et al. (2025) present scGSDR, a model for single-cell pharmacological profiling that leverages gene semantics. By integrating knowledge of cellular states and gene signaling pathways, scGSDR improves the prediction of cellular drug responses and provides interpretable insights into the underlying mechanisms of drug resistance. This approach, validated across multiple drug response experiments, demonstrates the power of incorporating biological context into machine learning models for precision medicine applications. Collectively, these studies showcase the diverse and innovative ways in which computational methods are being applied to address complex biological questions, from gene regulation and disease association to drug response and single-cell analysis. They highlight the increasing importance of integrating diverse data sources and leveraging sophisticated algorithms to gain deeper insights into biological systems.

Paper Highlights

LLMs for Rare Disease Gene Prioritization

Survey and Improvement Strategies for Gene Prioritization with Large Language Models by Matthew Neeley, Guantong Qi, Guanchu Wang, Ruixiang Tang, Dongxue Mao, Chaozhong Liu, Sasidhar Pasupuleti, Bo Yuan, Fan Xia, Pengfei Liu, Zhandong Liu, Xia Hu https://arxiv.org/abs/2501.18794

Rare diseases pose a formidable diagnostic challenge due to their inherent genetic diversity and the scarcity of patient data. Large language models (LLMs) have emerged as powerful tools in various medical applications, yet their effectiveness in diagnosing rare genetic diseases remains largely unexplored. This study delves into the potential of LLMs, including GPT-4, GPT-3.5, Mixtral-8x7B, Llama-2-70B, and BioMistral, for prioritizing causal genes in rare diseases, utilizing datasets from Baylor Genetics, the Undiagnosed Diseases Network (UDN), and the Deciphering Developmental Disorders (DDD) study.

The researchers employed a multi-pronged approach, incorporating multi-agent and Human Phenotype Ontology (HPO) classification to categorize patient cases based on their phenotypic complexity. Recognizing the limitations of LLMs in handling large gene sets, they also implemented a divide-and-conquer strategy. Initial benchmarking revealed GPT-4's superior performance, achieving approximately 30% accuracy in correctly ranking causal genes. The multi-agent and HPO classification systems proved invaluable in distinguishing between readily solvable cases and more challenging ones, with more specific phenotype descriptions correlating with improved LLM performance.

However, the study also uncovered significant biases in LLM performance. The models preferred well-studied genes, ranking genes with more ClinVar submissions higher, and they were sensitive to input order, prioritizing genes that appeared later in the prompt more frequently. The UDN dataset, composed of complex and often novel cases, presented the greatest challenge for the LLMs, further emphasizing the impact of case complexity on performance.

To mitigate these biases, the researchers devised a divide-and-conquer strategy. This involved partitioning the candidate genes into smaller subsets, allowing the LLM to rank genes within each group, and then averaging the probabilities across multiple iterations. This approach significantly enhanced GPT-3.5's performance, especially for larger gene sets, by minimizing the influence of positional and literature biases. It enabled consistent higher scores for causal genes and lower scores for non-causal genes when grouped with their causal counterparts. This research underscores the promise of LLMs, particularly GPT-4, in advancing gene prioritization for rare diseases, but also highlights the critical need to address inherent biases arising from data representation and model architecture. The divide-and-conquer strategy presents a promising avenue for mitigating these biases and improving the reliability of LLM-based gene prioritization. Further research is needed to refine LLMs for handling the diverse and complex landscape of rare diseases, ultimately translating AI potential into practical clinical applications.
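The divide-and-conquer idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the subset size, the number of iterations, and the `toy_scorer` stand-in for an LLM call are all hypothetical, and the paper may form its subsets or aggregate its probabilities differently.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List


def divide_and_conquer_rank(
    genes: List[str],
    score_subset: Callable[[List[str]], Dict[str, float]],
    subset_size: int = 5,
    n_iterations: int = 10,
    seed: int = 0,
) -> List[str]:
    """Rank genes by their mean score over many random small groupings."""
    rng = random.Random(seed)
    totals: Dict[str, float] = defaultdict(float)
    for _ in range(n_iterations):
        shuffled = genes[:]
        # A fresh order and grouping each round dilutes positional effects.
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), subset_size):
            subset = shuffled[start:start + subset_size]
            for gene, score in score_subset(subset).items():
                totals[gene] += score
    # Every gene is scored exactly once per iteration, so divide by n_iterations.
    means = {g: totals[g] / n_iterations for g in genes}
    return sorted(genes, key=lambda g: means[g], reverse=True)


def toy_scorer(subset: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for an LLM call: one probability per gene."""
    raw = {g: (5.0 if g == "GENE_C" else 1.0) for g in subset}
    total = sum(raw.values())
    return {g: v / total for g, v in raw.items()}


candidates = [f"GENE_{c}" for c in "ABCDEFGHIJKL"]
ranking = divide_and_conquer_rank(candidates, toy_scorer, subset_size=4, n_iterations=20)
print(ranking[0])
```

In a real setting `score_subset` would wrap a prompt to the model and parse its per-gene probabilities; the aggregation logic would stay the same.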

Single-Cell Drug Resistance Prediction with Gene Semantics

scGSDR: Harnessing Gene Semantics for Single-Cell Pharmacological Profiling by Yu-An Huang, Xiyue Cao, Zhu-Hong You, Yue-Chao Li, Xuequn Shang, Zhi-An Huang https://arxiv.org/abs/2502.01689

Single-cell sequencing has revolutionized our understanding of drug resistance by revealing the crucial role of cellular heterogeneity. Predicting drug responses at the single-cell level, however, remains challenging due to the inherent data sparsity of scRNA-seq and the limitations of relying solely on bulk-seq data. scGSDR, a novel model, addresses this challenge by incorporating gene semantics—leveraging the biological context of genes—to enhance predictive accuracy and interpretability. Unlike existing methods that primarily focus on expression levels, scGSDR integrates two computational pipelines: one centered on cellular states using marker genes, and another on gene signaling pathways using attention mechanisms and graph neural networks.

The dual pipeline approach is central to scGSDR's functionality. The first pipeline filters genes using marker genes from 14 different cellular states, constructing cellular features that are then mapped into an embedding space using a transformer. The second pipeline utilizes a cell-pathway matrix and dot-product self-attention to learn cell-pathway associations and construct cellular graphs. These graphs, along with gene expression profiles within pathways, are processed by a multi-graph fusion module to generate a second embedding. These two embeddings are then fused to create a final representation for predicting cellular drug responses. Furthermore, scGSDR incorporates domain adaptation to mitigate batch effects between reference and query datasets. It also optionally includes a loss function specifically designed to address data imbalance between drug-resistant and drug-sensitive cells: $A_{LG} = \mathrm{Sigmoid}(QK / d_k)$, where $d_k$ is the length of the key matrix, and $Q$ and $K$ are the query and key matrices.
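To make the sigmoid-gated dot-product formula above concrete, here is a small pure-Python sketch. It rests on two assumptions that the text does not settle: that the product of Q and K means each query row dotted with each key row (that is, Q times K transposed), and that d_k is the feature dimension of the keys. It follows the formula as printed here, dividing by d_k; the standard scaled dot-product form divides by the square root of d_k instead.

```python
import math
from typing import List

Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_attention(Q: Matrix, K: Matrix) -> Matrix:
    """Entry (i, j) is sigmoid(q_i . k_j / d_k) for query row i and key row j."""
    d_k = len(K[0])  # assumption: d_k is the key feature dimension
    return [
        [sigmoid(sum(q * k for q, k in zip(q_row, k_row)) / d_k) for k_row in K]
        for q_row in Q
    ]


# Toy inputs: two query rows (e.g. cells) and three key rows (e.g. pathways).
Q = [[1.0, 0.0], [0.0, 2.0]]
K = [[1.0, 0.0], [0.0, -2.0], [0.0, 0.0]]
A = sigmoid_attention(Q, K)
for row in A:
    print([round(v, 3) for v in row])
```

Unlike a softmax, the sigmoid scores each query-key pair independently, so the entries in a row do not have to sum to one.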

Extensive validation across 16 experiments encompassing 11 drugs demonstrated scGSDR's superior predictive performance compared to existing methods like SCAD and scDEAL. It achieved higher AUROC, AUPR, and F1 scores when trained with either bulk-seq or scRNA-seq data. For example, in bulk-seq reference experiments, scGSDR-overlap (using the Overlap loss function) achieved an average AUROC of 0.9219, AUPR of 0.9352, accuracy of 0.8583, and F1-macro of 0.8567. The model's robustness extended to predicting responses to drug combinations, using data from individual drug treatments as references. In scRNA-seq reference experiments, scGSDR achieved an average accuracy of 0.825 across four experiments, showcasing its strong performance in cross-platform and cross-cell line scenarios.

A key strength of scGSDR lies in its interpretability module, which employs an attention mechanism to reveal the contribution of each pathway in each cell during inference. This allows for the identification of key pathways associated with drug resistance and sensitivity. By analyzing these pathway attention scores, researchers were able to identify genes associated with specific drugs, such as BCL2, CCND1, and PIK3CA for PLX4720, and ICAM1, VCAM1, NFKB1, and RAC1 for Paclitaxel, findings corroborated by existing literature. Moreover, by examining pathways with significant differences in attention scores between drug-resistant and drug-sensitive cells, scGSDR can infer potential novel drug-related genes. This was demonstrated by identifying BCL2 as a potential gene associated with PLX4720 and ICAM1/VCAM1 with Paclitaxel. These findings underscore scGSDR's potential to not only predict drug responses accurately but also provide valuable insights into the mechanisms of drug resistance, paving the way for more targeted and effective therapies.

Network-Driven TWAS for Coronary Artery Disease

A network-driven framework for enhancing gene-disease association studies in coronary artery disease by Gutama Ibrahim Mohammad, Johan LM Björkegren, Tom Michoel https://arxiv.org/abs/2501.19030

Caption: This figure displays scatter plots comparing the predictive R² values for gene expression using cis-regulatory effects alone versus combined cis and trans-regulatory effects across six different tissues relevant to Coronary Artery Disease (CAD). The plots highlight the substantial improvement in predictive accuracy achieved by incorporating trans-eQTLs, especially for genes with low cis-predictability. Each plot also quantifies the total number of genes analyzed, along with the number of genes where the combined model performed equally, better, or worse than the cis-only model.

Genome-wide association studies (GWAS) have successfully identified numerous genetic variants linked to complex diseases like Coronary Artery Disease (CAD). However, the majority of these variants reside in non-coding regions, making it difficult to decipher their functional impact. Transcriptome-wide association studies (TWAS) aim to bridge this gap by connecting genetic variants to gene expression and disease phenotypes. Traditional TWAS methods primarily focus on cis-regulatory effects, often overlooking the important contribution of trans-regulation. This paper introduces a novel network-driven framework that integrates both cis and trans genetic regulatory effects into TWAS, enhancing gene-disease association studies in CAD.

The methodology comprises three key stages. First, tissue-specific gene regulatory networks (GRNs) are reconstructed from CAD-relevant reference datasets using causal inference methods (Findr). Second, a machine learning prediction model (Ridge regression with cross-validation and independent weight optimization) is employed to estimate gene expression levels, integrating both cis and trans regulatory effects derived from the GRNs. The model predicts gene expression ($X_i^{\text{genetic}}$) as a weighted sum of cis ($X_i^{\text{cis}}$) and trans ($X_i^{\text{trans}}$) components: $X_i^{\text{genetic}} = w_{\text{cis}} X_i^{\text{cis}} + w_{\text{trans}} X_i^{\text{trans}}$. Finally, the parameters from the prediction model are combined with GWAS summary statistics using a Z-score method to assess gene-disease associations. The overall gene-disease association Z-score ($Z_{\text{total}}$) is calculated as $Z_{\text{total}} = w_{\text{cis}} Z_{\text{cis}} + w_{\text{trans}} Z_{\text{trans}}$, where $Z_{\text{cis}}$ and $Z_{\text{trans}}$ represent the Z-scores for the cis and trans components, respectively.
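The two weighted sums above are simple enough to show with toy numbers. The sketch below is illustrative only: the function names and all values are made up, and it takes the weights as given rather than showing how the ridge model or the weight optimization estimates them.

```python
from typing import List


def combine_expression(
    x_cis: List[float], x_trans: List[float], w_cis: float, w_trans: float
) -> List[float]:
    """Per-sample predicted expression: w_cis * X_cis + w_trans * X_trans."""
    if len(x_cis) != len(x_trans):
        raise ValueError("cis and trans components must cover the same samples")
    return [w_cis * c + w_trans * t for c, t in zip(x_cis, x_trans)]


def combine_z(z_cis: float, z_trans: float, w_cis: float, w_trans: float) -> float:
    """Overall association score: w_cis * Z_cis + w_trans * Z_trans."""
    return w_cis * z_cis + w_trans * z_trans


# Toy values for one gene across three samples (not taken from the paper).
x_genetic = combine_expression([0.2, -0.1, 0.4], [0.5, 0.3, -0.2], w_cis=0.7, w_trans=0.3)
z_total = combine_z(z_cis=2.0, z_trans=4.0, w_cis=0.7, w_trans=0.3)
print(x_genetic, z_total)
```

With these toy weights a strong trans signal (Z = 4.0) raises the combined score above the cis-only value of 2.0, which is the kind of gain the framework is designed to capture.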

The framework's efficacy was validated using the STARNET dataset, which encompasses multi-tissue gene expression and genetic data from approximately 600 CAD patients. The results demonstrated that incorporating trans-eQTLs significantly improved the prediction performance (R²) for a substantial portion of genes, particularly those with low predictive accuracy based on cis-eQTLs alone. For instance, in mammary artery (MAM) tissue, the inclusion of trans-eQTLs increased the number of significant genes by 42.9%. Furthermore, the method identified novel gene-disease associations not detected by traditional cis-only TWAS or present in the GWAS catalog. Genes like KCNE3, TIPARP, FRK, RAD50, and CUBN, implicated in various biological processes relevant to CAD, were uniquely identified by this framework.

This network-driven TWAS framework underscores the importance of incorporating trans-regulatory effects for a more holistic understanding of complex disease genetic architecture. By enhancing gene expression prediction and uncovering novel gene-disease associations, this approach provides valuable insights into the molecular mechanisms underlying CAD and may lead to the identification of new drug targets. While these findings are promising, further research is needed to validate the functional relevance of the identified associations and explore the applicability of this framework to other complex diseases.

Predicting Klebsiella Depolymerases with Machine Learning

DepoRanker: A Web Tool to predict Klebsiella Depolymerases using Machine Learning by George Wright, Slawomir Michniewski, Eleanor Jameson, Fayyaz ul Amir Afsar Minhas https://arxiv.org/abs/2501.16405

Caption: The Receiver Operating Characteristic (ROC) curve (left) highlights DepoRanker's superior performance (AUROC = 0.99) compared to BLAST (AUROC = 0.94) in distinguishing depolymerases from non-depolymerases. The Precision-Recall (PR) curve (right) further emphasizes DepoRanker's enhanced precision (AUCPR = 0.42) over BLAST (AUCPR = 0.37), reflecting its ability to correctly identify true depolymerases among predicted candidates.

Antimicrobial resistance (AMR) poses a grave threat to global health, with Klebsiella infections being particularly concerning. Phage therapy, utilizing bacteriophages to target and eliminate bacteria, presents a promising alternative to traditional antibiotics. Crucial to effective phage therapy is the identification of phage depolymerases, enzymes that degrade the protective capsules of Klebsiella. Traditional methods like BLAST, relying on sequence homology, have limitations in discovering novel depolymerases. DepoRanker, a new machine learning tool, aims to revolutionize depolymerase discovery.

DepoRanker employs a novel implementation of Extreme Gradient Boosting (XGBoost) to rank proteins based on their likelihood of being a depolymerase. The model was trained on a dataset of experimentally verified depolymerase proteins and their associated phage proteomes, using a simple amino acid composition of a protein as its feature representation. The model learns a prediction function f(x, θ), where x is the feature vector of a protein sequence, and θ are the learnable parameters. The goal is to find optimal parameters θ* such that the score f(xᵢ, θ*) for positive examples (known depolymerases) is higher than f(xⱼ, θ*) for negative examples (non-depolymerases). Non-redundant cross-validation, using CD-HIT to cluster depolymerases based on sequence similarity, was performed to ensure the model's robustness.
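The amino acid composition feature mentioned above is straightforward to compute. The following is a minimal sketch assuming the usual 20-letter standard alphabet and simple frequency normalization; the function name is hypothetical and the tool's actual preprocessing (for example, its handling of ambiguous residues) may differ.

```python
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical


def amino_acid_composition(sequence: str) -> List[float]:
    """Return the fraction of each standard residue, in AMINO_ACIDS order."""
    counts = Counter(sequence.upper())
    # Assumption: non-standard symbols (X, *, gaps) are ignored.
    n_standard = sum(counts[a] for a in AMINO_ACIDS)
    if n_standard == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / n_standard for a in AMINO_ACIDS]


features = amino_acid_composition("MKTAYIAKQR")
print(dict(zip(AMINO_ACIDS, features)))
```

Each protein becomes a fixed-length vector of 20 fractions regardless of its length, which is what lets a gradient-boosted model score every protein in a proteome on the same footing.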

The cross-validation results demonstrated DepoRanker's significant advantage over BLAST. DepoRanker achieved a median rank of the first positive prediction (RFPP) of 1, meaning the true depolymerase was ranked as the top prediction in at least half of the tested proteomes. In contrast, BLAST had a median RFPP of 31. Furthermore, DepoRanker achieved an AUROC of 0.99 and an AUCPR of 0.42, showing excellent discrimination between depolymerases and non-depolymerases. The model's generalization capabilities were assessed using an external test set of five recently characterized proteins and their phage proteomes. DepoRanker consistently ranked a known depolymerase within the top 3 predictions for all five proteomes, demonstrating its ability to generalize to unseen data.
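The RFPP metric can be illustrated with a short sketch: sort one proteome's proteins by score, find the position of the first true depolymerase, and take the median of those positions across proteomes. The scores and labels below are invented for illustration, and the function name is hypothetical.

```python
from statistics import median
from typing import List, Sequence


def rfpp(scores: Sequence[float], labels: Sequence[int]) -> int:
    """1-based rank of the highest-scoring positive within one proteome."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            return rank
    raise ValueError("no positive example in this proteome")


# Three toy proteomes: (model scores, 1 = depolymerase / 0 = other protein).
proteomes = [
    ([0.9, 0.2, 0.1], [1, 0, 0]),  # positive ranked first
    ([0.3, 0.8, 0.5], [0, 0, 1]),  # positive ranked second
    ([0.7, 0.6, 0.1], [1, 0, 0]),  # positive ranked first
]
ranks: List[int] = [rfpp(s, l) for s, l in proteomes]
median_rfpp = median(ranks)
print(ranks, median_rfpp)
```

A median RFPP of 1 therefore means that in at least half of the proteomes the top-scoring protein is a true positive.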

Beyond its predictive power, DepoRanker is readily accessible. A user-friendly web server and the open-source code on GitHub make the model easily usable for researchers, even those without programming expertise. A pre-computed list of the highest-scoring proteins from 665 Klebsiella phage proteomes, processed using DepoRanker, is also available as a valuable resource. While primarily focused on Klebsiella, the underlying approach could be adapted for other pathogens or therapeutic proteins, opening exciting avenues for future research.

Deep Learning Model for Predicting TFBSs

TFBS-Finder: Deep Learning-based Model with DNABERT and Convolutional Networks to Predict Transcription Factor Binding Sites by Nimisha Ghosh, Pratik Dutta, Daniele Santoni https://arxiv.org/abs/2502.01311

Caption: The architecture of TFBS-Finder, a novel deep learning model for predicting Transcription Factor Binding Sites (TFBSs), is visualized. It integrates DNABERT for sequence embedding with CNN, Modified Convolutional Block Attention Module (MCBAM), and Multi-Scale Convolutions with Attention (MSCA) to capture both global and local sequence context. The final output module integrates features from MCBAM and MSCA for enhanced prediction accuracy.

Predicting Transcription Factor Binding Sites (TFBSs) is paramount for understanding gene regulation. While existing deep learning models have shown promise, there's still room for improvement. TFBS-Finder, a novel deep learning model, leverages pre-trained DNABERT for sequence embedding and combines it with convolutional networks and attention mechanisms for superior TFBS prediction. The model architecture consists of a CNN module for extracting higher-order local features, a Modified Convolutional Block Attention Module (MCBAM) and a Multi-Scale Convolutions with Attention (MSCA) module for refining and enhancing these features, and an output module for prediction.

TFBS-Finder's key innovation lies in its combined use of global and local context. DNABERT captures long-range dependencies in DNA sequences, while the CNN, MCBAM, and MSCA modules focus on local motifs and patterns at different scales. MCBAM utilizes both spatial and channel attention, applying spatial attention before channel attention for enhanced performance. MSCA employs multi-scale convolutions with attention to capture broader local context and multi-scale features. The output module integrates features from both MCBAM and MSCA, implementing a parallel attention mechanism. The model is trained using cross-entropy loss: $\mathrm{Loss}(y, \hat{y}) = -\frac{1}{\eta} \sum_{i=1}^{\eta} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$, where $y$ and $\hat{y}$ are the actual and predicted values, and $\eta$ is the batch size.
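As a worked example of the training loss, the sketch below computes the standard binary cross-entropy averaged over a batch, with the leading minus sign so that better predictions give a lower value. The clipping constant `eps` and the function name are assumptions added for numerical safety and illustration; they are not taken from the paper.

```python
import math
from typing import Sequence


def binary_cross_entropy(
    y_true: Sequence[float], y_pred: Sequence[float], eps: float = 1e-12
) -> float:
    """Mean binary cross-entropy over a batch of labels and predicted probabilities."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Toy batch of four sequences: 1 = bound, 0 = unbound.
loss = binary_cross_entropy([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
print(round(loss, 4))
```

A maximally uncertain prediction of 0.5 for every example gives a loss of ln 2, about 0.693, which is a useful baseline when reading training curves.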

Rigorous evaluation on 165 ENCODE ChIP-seq datasets demonstrated the contribution of each module, highlighting the importance of the combined architecture. TFBS-Finder achieved an accuracy of 0.930, a PR-AUC of 0.961, and a ROC-AUC of 0.961, outperforming state-of-the-art models like BERT-TFBS by 7.9%, 4.1%, and 4.2% respectively. Cross-cell line validation experiments further showcased the model's robustness and generalizability in predicting CTCF binding sites across different cell lines, achieving ROC-AUC scores above 0.93 in all cases.

TFBS-Finder represents a significant advancement in TFBS prediction. Its innovative combination of DNABERT with convolutional networks and attention mechanisms allows for a more comprehensive and accurate analysis of DNA sequences. Its superior performance on benchmark datasets and robustness in cross-cell line validation suggest its potential as a valuable tool for researchers studying gene regulation and related fields. Future work will explore incorporating DNA structure information to further enhance the model's predictive capabilities.

Conclusion

This newsletter highlights the rapid advancements in computational genomics, showcasing novel tools and methodologies addressing critical challenges in gene regulation, disease association, and drug response. From predicting transcription factor binding sites with TFBS-Finder's sophisticated deep learning architecture to leveraging LLMs for rare disease gene prioritization with innovative bias mitigation strategies, the field is rapidly evolving. The development of DepoRanker, a machine learning tool outperforming traditional methods in identifying Klebsiella depolymerases, offers a promising avenue for phage therapy development. Meanwhile, the network-driven TWAS framework presented by Mohammad et al. enhances gene-disease association studies by incorporating trans-regulatory effects, providing a more complete understanding of the genetic architecture of complex diseases like CAD. Finally, scGSDR demonstrates the power of incorporating gene semantics into single-cell pharmacological profiling, improving drug response prediction and offering mechanistic insights into drug resistance. These advances collectively underscore the growing importance of integrating diverse data sources and leveraging sophisticated algorithms to unravel the complexities of biological systems.