Subject: Computational Biology Breakthroughs: From Genomics to Clinical Applications
Hi Elman,
Several recent studies have explored novel computational approaches for analyzing complex biological data, ranging from genomic sequences to clinical notes. Turnbull et al. (2025) https://arxiv.org/abs/2503.09312 introduced Terrier, a deep learning model for classifying repetitive DNA sequences. Trained on the comprehensive RepBase library, Terrier was more accurate than existing tools, particularly in divergent taxa, addressing a key limitation in understanding repeat evolution. This advance facilitates research on repeat-driven evolution, genomic instability, and phenotypic variation, especially in non-model organisms. Concurrently, Yang et al. (2025) https://arxiv.org/abs/2503.07981 developed a reinforcement learning approach for designing regulatory DNA sequences. By building biological priors into the reward through computational inference, their method generates high-fitness cis-regulatory elements (CREs), overcoming limitations of traditional iterative optimization methods. This work has significant implications for therapeutic and bioengineering applications.
Moving beyond individual sequences, Esenturk et al. (2025) https://arxiv.org/abs/2503.13189 investigated the causes of evolutionary divergence in prostate cancer using a novel causal inference method. Their analysis identified key genetic alterations that drive or prevent subtype divergence, suggesting a positive-feedback loop accelerating this process. This research provides crucial insights into cancer subtype emergence and informs genomic surveillance strategies. Similarly, Hermansen et al. (2025) https://arxiv.org/abs/2503.13078 proposed a Bayesian Cox model with graph-structured variable selection priors for multi-omics biomarker identification. Leveraging prior biological knowledge through a Markov random field prior, this approach improves the interpretability and stability of biomarker selection, particularly valuable in cancer prognosis. The robustness of this model to partially correct prior information further enhances its practical applicability.
The integration of machine learning with real-time biological data is also gaining traction. Huang et al. (2025) https://arxiv.org/abs/2503.12330 developed a machine learning framework to analyze sleep patterns in Drosophila and identify the influence of metabolic changes. Their findings revealed that ketone metabolism plays a crucial role in sleep stability and circadian dynamics, with metabolic interventions exhibiting time-dependent effects. This work underscores the importance of considering temporal dynamics in studying sleep-metabolism interactions.
Finally, the application of large language models (LLMs) to clinical tasks is being actively explored. Wu et al. (2025) https://arxiv.org/abs/2503.12286 demonstrated the potential of combining Chain-of-Thought (CoT) and Retrieval Augmented Generation (RAG) for rare disease diagnosis from clinical notes. Their RAG-driven CoT and CoT-driven RAG methods, which draw on resources like HPO and OMIM, significantly improved candidate gene prioritization accuracy compared to foundation models alone. This research highlights the importance of tailoring LLMs for specific clinical applications and leveraging domain-specific knowledge bases. Complementing this, Li et al. (2025) https://arxiv.org/abs/2503.06845 introduced Bizard, a community-driven platform for biomedical data visualization. This resource aims to simplify data analysis by providing a repository of visualization code, tutorials, and interactive forums, fostering collaboration and standardization in the field.
Causes of evolutionary divergence in prostate cancer by Emre Esenturk, Atef Sahli, Valeriia Haberland, et al. https://arxiv.org/abs/2503.13189
Prostate cancer, like many cancers, exhibits divergent evolutionary trajectories, leading to distinct subtypes with varying prognoses. This study introduces a novel two-stage semi-parametric method to address the challenges of identifying causal drivers in this complex system. The researchers analyzed genomic data from 829 prostate cancer patients, focusing on a three-event causal chain and incorporating potential confounders. Their method utilizes do-calculus and a semi-parametric approach to estimate causal effects, quantified by the Average Causal Effect (ACE). Incorporating temporal information via Cancer Cell Fraction (CCF), they derived a temporal ACE (tACE) to ensure the correct direction of causality.
The study identified a strong causal link between androgen receptor (AR) dysregulation and divergence toward the Alternative-evotype, a subtype associated with poorer prognosis. The probability of this trajectory increased significantly (ACE<sub>U</sub>(AR, ALT) = 0.69, CI:[0.52, 0.87]) upon AR dysregulation. Similarly, CHD1 loss, which impacts AR function, was causally linked to the Alternative-evotype (ACE<sub>U</sub>(CHD1-, ALT) = 0.28, CI:[0.21, 0.36]). Further analysis revealed that CHD1 loss and AR dysregulation causally influence other genetic alterations associated with the Alternative-evotype, while exhibiting an "anti-causal" effect on some Canonical-evotype alterations. The pairwise analysis of CNAs showed a positive-feedback loop accelerating divergence within the Alternative-evotype, while the Canonical-evotype showed causal neutrality, suggesting its dependence on stochastic acquisition of alterations. This work provides strong evidence for AR dysregulation's causal role in subtype emergence, highlighting its potential as a key biomarker. The novel causal inference method offers a powerful tool for dissecting complex evolutionary processes in cancer and other biological systems.
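To make the ACE quantity more concrete, here is a minimal sketch of the textbook back-door adjustment for a binary exposure, a binary outcome, and one discrete confounder. It is not the authors' two-stage semi-parametric estimator, and it does not use CCF timing. The function name `average_causal_effect` and the toy records are invented for illustration only.

```python
from collections import defaultdict


def average_causal_effect(records):
    """Back-door adjusted ACE of binary x on binary y, stratified by z.

    records: iterable of (x, y, z) tuples with x and y in {0, 1}.
    Assumes every stratum z contains both x = 0 and x = 1 (positivity);
    otherwise a ZeroDivisionError is raised.
    """
    n_z = defaultdict(int)    # number of records in each stratum z
    n_xz = defaultdict(int)   # number of records with a given (x, z)
    n_yxz = defaultdict(int)  # number of records with y == 1 for a given (x, z)
    total = 0
    for x, y, z in records:
        n_z[z] += 1
        n_xz[(x, z)] += 1
        n_yxz[(x, z)] += y
        total += 1

    def p_y_do(x):
        # P(y = 1 | do(x)) = sum over z of P(y = 1 | x, z) * P(z)
        return sum(
            (n_yxz[(x, z)] / n_xz[(x, z)]) * (n_z[z] / total)
            for z in n_z
        )

    return p_y_do(1) - p_y_do(0)


# Toy data (hypothetical): x = alteration present, y = subtype, z = confounder.
toy_records = (
    [(1, 1, 0)] * 3 + [(1, 0, 0)] * 1
    + [(0, 1, 0)] * 1 + [(0, 0, 0)] * 3
    + [(1, 1, 1)] * 2
    + [(0, 1, 1)] * 1 + [(0, 0, 1)] * 1
)
toy_ace = average_causal_effect(toy_records)
print(f"toy ACE = {toy_ace:.3f}")
```

An ACE near 0 would correspond to the "causal neutrality" described for the Canonical-evotype, while a negative value would match the "anti-causal" effects.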
Computational identification of ketone metabolism as a key regulator of sleep stability and circadian dynamics via real-time metabolic profiling by Hao Huang, Kaijing Xu, Michael Lardelli https://arxiv.org/abs/2503.12330
This study reveals the critical role of ketone metabolism in sleep regulation by developing a machine learning framework to analyze Drosophila sleep patterns and identify the influence of metabolic changes at specific times. The framework leverages the independent nature of sleep periods in Drosophila, treating sleep at each hour as a discrete feature. Gradient boosting models and explainable AI quantified the influence of these time-dependent features, while causal inference and autocorrelation analyses confirmed their statistical independence, essential for isolating the effects of metabolic interventions.
Applying this framework to flies with altered expression of monocarboxylate transporter 2 (MCT2), which is crucial for ketone transport, revealed changes in sleep stability and day-night transitions. Increasing concentrations of RU486, used here to inhibit ketone uptake, decreased sleep duration, particularly at night, and increased sleep fragmentation. Circadian rhythmicity was also progressively reduced. In an Alzheimer's disease (AD) Drosophila model, elevating ketones through β-hydroxybutyrate (BHB) supplementation or intermittent fasting (IF) restored sleep duration and continuity, particularly at night, suggesting a protective effect against AD-induced deficits. Both interventions improved sleep consolidation and partially restored circadian function, highlighting the potential of metabolic interventions targeting ketone metabolism for sleep disruptions in AD. The study emphasizes the temporal dynamics of metabolite function, showing that metabolic states exert their strongest influence at distinct time points, shaping sleep stability and circadian transitions. While temporal correlations were observed, no direct causal relationships were found between sleep features at different times, suggesting a complex regulatory process driven by rhythmic fluctuations rather than a linear chain of events.
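As a rough illustration of the autocorrelation check on hourly sleep features, the sketch below computes the sample autocorrelation of a per-hour sleep series at a chosen lag. The function name `autocorrelation` and the 48-hour toy series (minutes asleep per hour) are assumptions made for this example. They are not taken from the paper, whose exact analysis may differ.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a numeric series at a positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    if var == 0:
        raise ValueError("series is constant, autocorrelation is undefined")
    cov = sum(
        (series[t] - mean) * (series[t + lag] - mean)
        for t in range(n - lag)
    )
    return cov / var


# Hypothetical fly: 10 min/hour asleep for 12 hours, then 50 min/hour for 12 hours,
# repeated over two days.
toy_sleep = ([10] * 12 + [50] * 12) * 2
r_12h = autocorrelation(toy_sleep, 12)  # opposite phases of the cycle
r_24h = autocorrelation(toy_sleep, 24)  # same phase on the next day
print(f"lag 12 h: {r_12h:.2f}, lag 24 h: {r_24h:.2f}")
```

A strong positive value at the 24-hour lag with a negative value at 12 hours is the kind of rhythmic, non-chain-like correlation pattern the study describes.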
Regulatory DNA sequence Design with Reinforcement Learning by Zhao Yang, Bing Su, Chuan Cao, Ji-Rong Wen https://arxiv.org/abs/2503.07981
Caption: The image illustrates the reinforcement learning process of TACO for CRE design. Each step represents the addition of a nucleotide (actions a_0 to a_7) to the growing DNA sequence (states S_0 to S_80), guided by a policy and rewarded based on TFBS generation and final CRE fitness, evaluated by an "oracle."
This research introduces TACO (TFBS-Aware Cis-Regulatory Element Optimization), a novel approach leveraging reinforcement learning (RL) for designing synthetic cis-regulatory elements (CREs) like promoters and enhancers. TACO addresses limitations of traditional methods by pre-training an autoregressive (AR) DNA generative model (HyenaDNA) on low-fitness CRE sequences and then fine-tuning it using RL. The AR model acts as the policy network, guided by a reward model incorporating biological prior knowledge about transcription factor binding sites (TFBSs). The reward function combines a fitness reward (applied at the end) with a TFBS reward (r<sub>TFBS</sub>) applied upon generating a TFBS. This TFBS reward is determined by inferring the regulatory role (activator or repressor) of each TFBS using a LightGBM model trained on TFBS frequency features and interpreted with SHAP values. The objective is to maximize the expected cumulative reward:
max<sub>θ</sub> E[Σ<sup>L</sup><sub>i=1</sub> r(s<sub>i-1</sub>, a<sub>i</sub>)]
where θ represents the policy parameters, L is the sequence length, s<sub>i-1</sub> is the partial sequence, and a<sub>i</sub> is the nucleotide selected at position i.
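The cumulative reward above can be sketched for a single generated sequence as follows. This is only an illustration of how a per-step TFBS reward and a terminal fitness reward add up. The motif table `toy_tfbs_rewards`, the stand-in `toy_oracle`, and the function name `episode_return` are all hypothetical. In the paper the TFBS rewards come from a LightGBM model interpreted with SHAP and the fitness from a trained oracle.

```python
def episode_return(sequence, tfbs_rewards, fitness_oracle):
    """Sum the per-step rewards r(s_{i-1}, a_i) for one generated sequence."""
    total = 0.0
    length = len(sequence)
    for i in range(1, length + 1):
        prefix = sequence[:i]  # state reached after appending nucleotide a_i
        # TFBS reward: paid when the newly added nucleotide completes a motif.
        for motif, reward in tfbs_rewards.items():
            if prefix.endswith(motif):
                total += reward
        # Fitness reward: paid once, at the final step only.
        if i == length:
            total += fitness_oracle(sequence)
    return total


# Hypothetical motif rewards: positive for an inferred activator site,
# negative for an inferred repressor site.
toy_tfbs_rewards = {"TATA": 0.5, "GGCC": -0.2}


def toy_oracle(seq):
    """Stand-in fitness score (fraction of A nucleotides), not a real oracle."""
    return seq.count("A") / len(seq)


toy_return = episode_return("ACTATAGGCC", toy_tfbs_rewards, toy_oracle)
print(f"toy episode return = {toy_return:.2f}")
```

In an RL setup, a policy-gradient update would then push θ toward sequences with higher returns. That training loop is omitted here.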
Evaluated on yeast promoter and human enhancer design tasks, TACO consistently generated sequences with higher fitness than baseline methods. In yeast, TACO achieved maximum fitness. For human enhancer design, TACO achieved state-of-the-art fitness in two of three cell lines while maintaining significantly higher sequence diversity. Ablation experiments confirmed the importance of pre-training and the TFBS reward, highlighting the benefit of incorporating biological knowledge. This represents a significant advance in CRE design, offering a generative approach combining RL's power with data-driven biological insights.
Learnable Group Transform: Enhancing Genotype-to-Phenotype Prediction for Rice Breeding with Small, Structured Datasets by Yunxuan Dong, Siyuan Chen, Jisen Zhang https://arxiv.org/abs/2503.11180
Caption: The figure illustrates the Learnable Group Transform (LGT) framework for genotype-to-phenotype prediction in rice. It depicts the data preprocessing steps, the LGT procedure with its group transformations and convolutional layers, and the encoder-decoder architecture used for both single-trait and multi-trait prediction. This approach leverages graph representations and transformer-like operations to capture complex genetic interactions and improve prediction accuracy.
This paper introduces Learnable Group Transform (LGT), a framework enhancing genotype-to-phenotype (G2P) prediction in rice. Addressing challenges posed by complex genetics and small datasets, LGT combines linear genetic modeling with graph-based and transformer-based deep learning. It leverages graph representations of genomic data to capture spatial relationships and genetic structures, and uses a transformer module to learn complex patterns. An optimized training strategy incorporating intelligent sampling and multi-trait integration further aims to improve prediction accuracy with limited data.
LGT's core innovation lies in its group-based transformations of genotype data, employing the permutation group S<sub>M</sub> (on M features) to achieve equivariance to feature reindexing. The transform output is defined via convolution: W<sub>x;ψ</sub>(g, t) = (x * ψ<sub>g</sub>)(t), where ψ<sub>g</sub>(t) = ψ(g(t)) for g ∈ G. This captures higher-order gene interactions (epistasis) often missed by traditional models. The model is trained using max-min alternating optimization, updating filter parameters (ψ) and transformation parameters (g) separately.
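For readers who prefer code to notation, here is a minimal discrete sketch of the transform for one fixed group element g. It assumes x and ψ are equal-length vectors, g is a permutation of the indices 0..M-1, and the convolution is circular. These are simplifying assumptions for illustration, and the paper's implementation (including how g and ψ are learned) may differ. The function name `group_transform` is hypothetical.

```python
def group_transform(x, psi, g):
    """Circular convolution of x with the filter psi reindexed by permutation g."""
    m = len(x)
    if len(psi) != m or sorted(g) != list(range(m)):
        raise ValueError("psi must match x in length and g must permute 0..M-1")
    # psi_g(t) = psi(g(t)): the filter read through the permutation g
    psi_g = [psi[g[t]] for t in range(m)]
    # (x * psi_g)(t) = sum over k of x[k] * psi_g[(t - k) mod M]
    return [
        sum(x[k] * psi_g[(t - k) % m] for k in range(m))
        for t in range(m)
    ]


toy_x = [1, 2, 3]
toy_psi = [1, 0, 0]  # unit impulse filter
out_identity = group_transform(toy_x, toy_psi, [0, 1, 2])
out_swapped = group_transform(toy_x, toy_psi, [1, 0, 2])
print(out_identity, out_swapped)
```

With the impulse filter, the identity permutation returns x unchanged, while swapping the first two filter positions shifts the output. This shows how different group elements g yield different views of the same genotype vector.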
Evaluated on the Rice529 dataset, LGT outperformed baselines in single-trait predictions across multiple agronomic traits. Multi-trait training showed mixed results, with some traits exhibiting positive transfer (improved MSE and RMSE) and others negative transfer, highlighting the complexities of multi-task learning. Fine-tuning the multi-trait model on individual traits partially mitigated negative transfer. This work demonstrates LGT's potential for improving genomic selection in rice, offering a promising solution for identifying superior genotypes.
Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes by Da Wu, Zhanliang Wang, Quan Nguyen, Kai Wang https://arxiv.org/abs/2503.12286
This study explores the potential of LLMs for diagnosing rare diseases directly from clinical notes, bypassing manual preprocessing. The researchers evaluated various LLMs, including Llama 3.3-70B-Instruct and DeepSeek-R1-Distill-Llama-70B, on Phenopacket-derived notes, PubMed narratives, and in-house clinical notes. They investigated the impact of Chain-of-Thought (CoT) prompting and Retrieval Augmented Generation (RAG), using the Human Phenotype Ontology (HPO) and Online Mendelian Inheritance in Man (OMIM) for knowledge retrieval.
Four methodologies were employed: direct inference (base and CoT prompts), RAG-enhanced base prompts, and hybrid approaches (RAG-driven CoT and CoT-driven RAG). RAG-driven CoT retrieves information before reasoning, while CoT-driven RAG reasons before retrieval. Performance was assessed using Top-1 and Top-10 accuracy.
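The Top-1 and Top-10 metrics can be computed as in the sketch below, which counts a case as correct when the true gene appears among the first k ranked candidates. The function name `top_k_accuracy` and the toy rankings are illustrative assumptions. The gene symbols are placeholders and do not come from the study's data.

```python
def top_k_accuracy(ranked_predictions, true_genes, k):
    """Fraction of cases whose true gene is among the first k ranked candidates."""
    if len(ranked_predictions) != len(true_genes) or not true_genes:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_predictions, true_genes)
    )
    return hits / len(true_genes)


# Hypothetical model output: one ranked candidate list per clinical note.
toy_rankings = [
    ["CFTR", "FBN1", "PKD1"],
    ["BRCA2", "FBN1", "TP53"],
    ["PKD1", "CFTR", "HBB"],
]
toy_truth = ["CFTR", "FBN1", "HBB"]

top1 = top_k_accuracy(toy_rankings, toy_truth, 1)
top3 = top_k_accuracy(toy_rankings, toy_truth, 3)
print(f"Top-1 = {top1:.2%}, Top-3 = {top3:.2%}")
```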
Recent LLMs like Llama 3.3 and DeepSeek outperformed earlier versions. Both CoT and RAG individually improved performance, with the hybrid approaches yielding the most substantial gains. On Phenopacket-derived notes, DeepSeek's Top-10 accuracy increased from 11.72% to 42.13% with RAG-driven CoT. CoT-driven RAG benefited noisier in-house notes, increasing DeepSeek's diagnostic accuracy from 29.55% to 35.00%. These findings suggest that combining structured reasoning with knowledge retrieval enhances LLM performance in clinical tasks. While overall accuracy remains suboptimal for standalone clinical use, the improvements through CoT and RAG are promising, highlighting the potential of LLMs for automating rare disease diagnosis.
This newsletter highlights the diverse applications of computational approaches in biology and medicine. From understanding the causal drivers of prostate cancer evolution and the role of ketone metabolism in sleep regulation to designing regulatory DNA sequences and diagnosing rare diseases from clinical notes, these studies showcase the transformative potential of computational methods. The development of novel techniques like Terrier for repeat classification, reinforcement learning for CRE design, and the integration of CoT and RAG for clinical diagnosis underscores the ongoing innovation in this field. The emphasis on incorporating prior biological knowledge and leveraging the power of causal inference further strengthens the impact of these advancements. These studies collectively pave the way for more precise and personalized medicine, promising to accelerate scientific discovery and improve human health.