Several recent studies have explored computational approaches for analyzing complex biological data, particularly in cancer research and single-cell omics. Chowdhury et al. (2025) introduced a multi-view feature selection framework for transcriptome data built on an iterative Boruta-based approach. Coupled with ensemble classifiers combining Logistic Regression, Support Vector Machines, and XGBoost, this framework achieved high accuracy (97.11%) and AUC (0.9996) in classifying 33 cancer types, outperforming existing methods, especially for challenging cancers with similar tissue origins. This work underscores the potential of sophisticated feature selection and ensemble learning for improving cancer classification.
Moving beyond classification, Uthamacumaran (2025) leveraged deep learning and graph-based machine learning to investigate phenotypic plasticity in pediatric high-grade gliomas (pHGGs). By analyzing single-cell transcriptomics data, Uthamacumaran identified key network interactions and transition genes driving glioma morphogenesis and highlighted the role of the tumor-immune microenvironment. This research suggests that pHGGs exhibit maladaptive behaviors and hybrid cellular identities, potentially offering new therapeutic avenues by targeting these plasticity networks.
Further advancing single-cell and spatial omics analysis, Zheng et al. (2025) developed TopoLa, a framework that enhances cell representations by incorporating topological relationships through latent hyperbolic geometry. The introduction of the TopoLa distance (TLd) metric allows for more effective capture of network structure, improving performance across various tasks, including clustering and domain identification. This approach highlights the importance of considering topological information for enhancing cell representations.
Focusing on personalized medicine, Zolotareva et al. (2025) investigated transcriptomic signatures for predicting bevacizumab response in ovarian cancer. Using RNA-seq data, they identified a signature potentially linked to cancer stemness driven by CTCFL activation. Patients with this signature demonstrated improved survival with bevacizumab treatment, suggesting its potential as a biomarker for treatment stratification. This work emphasizes the need for robust biomarkers to guide treatment decisions in ovarian cancer.
Finally, Biar et al. (2025) introduced cliPE (curated loci prime editing), a streamlined pipeline for multiplexed assays of variant effect (MAVEs) using prime editing. This accessible and cost-effective method, accompanied by a dedicated Shiny app for pegRNA design, facilitates the study of variant effects in their native genomic context. This contribution advances the field of functional genomics by providing a robust and user-friendly tool for MAVE studies. Collectively, these studies demonstrate the ongoing development of innovative computational and experimental approaches for addressing critical questions in cancer research and single-cell biology.
Multi-megabase scale genome interpretation with genetic language models by Frederik Träuble, Lachlan Stuart, Andreas Georgiou, Pascal Notin, Arash Mehrjou, Ron Schwessinger, Mathieu Chevalley, Kim Branson, Bernhard Schölkopf, Cornelia van Duijn, Debora Marks, Patrick Schwab https://arxiv.org/abs/2501.07737
Caption: This diagram illustrates the architecture of Phenformer, a novel deep learning model that predicts disease risk and elucidates potential mechanisms directly from DNA sequence. The model processes genome sequences, generates expression predictions, and uses transformer layers to link these predictions to individual disease risk and subtypes. The output includes potential disease mechanisms, individual risk predictions for six major diseases, and disease subtype clustering.
Interpreting the human genome remains a formidable challenge in understanding disease mechanisms. Existing methods, like Polygenic Risk Scores (PRS), often rely on single nucleotide polymorphisms (SNPs) and struggle to capture the full complexity of genetic variation or provide mechanistic insights. This paper introduces Phenformer, a multi-scale genetic language model that links individual genome sequences directly to disease risk and potential underlying mechanisms. Unlike existing approaches, Phenformer can analyze up to 88 million base pairs, nearly 3% of the human genome and an order of magnitude more than previous models. The model’s architecture mimics the biological flow of information: sequence → cell context → expression → phenotype, enabling it to generate rich, multi-scale hypotheses.
Phenformer processes genome sequences in 512 windows of 196 kilobases, centered on transcription start sites (TSS). These sequences are fed into a pre-trained sequence-to-expression model (Enformer) to generate embeddings. These embeddings then serve as input to Phenformer's core, consisting of transformer layers and a pooling mechanism. The model is trained on whole genome sequencing data from over 150,000 individuals in the UK Biobank, using a 60/20/20 train-validation-test split. Separate Phenformer models were trained for six major diseases: psoriasis, type 1 diabetes, type 2 diabetes, diabetic retinopathy, hypothyroidism, and COPD.
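The windowing step can be illustrated with a minimal sketch. It assumes a window length of 196,608 bp (Enformer's input length, which matches the "196 kilobases" above) and half-open coordinates clipped to the chromosome; the function name and the clipping behavior are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

# Assumption: Enformer's input length, roughly the "196 kilobases" mentioned in the text.
WINDOW_BP = 196_608


def tss_window(tss: int, chrom_length: int, window_bp: int = WINDOW_BP) -> Tuple[int, int]:
    """Return a half-open [start, end) window centered on a TSS, clipped to the chromosome."""
    half = window_bp // 2
    start = max(0, tss - half)
    end = min(chrom_length, tss + half)
    return start, end


def windows_for_tss_list(tss_positions: List[int], chrom_length: int) -> List[Tuple[int, int]]:
    """Hypothetical helper: one window per transcription start site."""
    return [tss_window(tss, chrom_length) for tss in tss_positions]
```

Windows near a chromosome end come out shorter under this clipping rule; a real pipeline would more likely pad them to a fixed length before passing them to the sequence model.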
The results demonstrate Phenformer’s remarkable ability to identify disease-associated cell types and generate mechanistic hypotheses. Its predictions for disease-associated cell types were more accurate than existing state-of-the-art methods leveraging additional data like single-cell RNA sequencing. Furthermore, Phenformer highlighted potential mechanisms for several poorly understood disease pathologies, such as liver involvement in psoriasis and appendicitis complications in type 1 diabetes. In terms of predictive performance, ensembles of Phenformer with existing PRS methods showed significant improvement in disease risk prediction, boosting AUROC by up to 4.2% and 11.19% in populations of mixed and non-European ancestry, respectively, compared to PRS alone. When evaluated on the same 3% of the genome used for Phenformer training, the improvement reached 5.49% and 14.59% for mixed and non-European ancestries, respectively. This improvement underscores the potential of Phenformer to address health disparities across diverse populations.
Beyond population-level risk prediction, Phenformer enables subtyping of individuals based on their unique genetic predispositions. The model identified molecular clusters within diseases characterized by differential prevalence of co-morbidities, suggesting the ability to stratify individuals based on underlying molecular processes. This capability opens exciting possibilities for personalized medicine and a deeper understanding of disease heterogeneity. Despite these promising results, limitations exist, primarily the computational constraints that limited the analysis to 3% of the genome and a subset of individuals. Future research with larger datasets and increased computational power could further enhance Phenformer’s performance and broaden its applicability. The authors also emphasize the importance of careful interpretation of Phenformer’s attributions, as they represent potential, not necessarily causal, mechanisms. Ethical considerations regarding data bias and potential misuse of genetic risk predictions are also highlighted, underscoring the need for responsible development and application of such powerful tools.
Deep Learning-based Feature Discovery for Decoding Phenotypic Plasticity in Pediatric High-Grade Gliomas Single-Cell Transcriptomics by Abicumaran Uthamacumaran https://arxiv.org/abs/2501.04181
Pediatric high-grade gliomas (pHGGs), including K27M-mutant DIPG and IDH wild-type glioblastoma, are aggressive childhood brain tumors with limited treatment options. These tumors exhibit significant phenotypic plasticity, enabling them to adapt and evade therapies. This study leveraged deep learning models, including Hopfield networks, variational autoencoders (VAEs), generative adversarial networks (GANs), and graph convolutional networks (GCNs), to analyze single-cell RNA sequencing data from pHGG subtypes. The goal was to identify key regulators of plasticity and potential therapeutic targets. Block Decomposition Method (BDM) was used to quantify shifts in gene expression after perturbations, providing insights into network stability and critical transition genes.
The study employed several deep learning architectures to analyze single-cell transcriptomic data. Hopfield networks identified attractor states in the energy landscape of gene expression, representing stable phenotypic states. VAEs and GANs generated latent representations of cell states and identified transition genes driving phenotypic shifts. GCNs and graph attention networks (GATs) modeled cell-cell interactions as graph networks, revealing key regulators of plasticity. BDM analysis, coupled with network centrality measures, identified genes whose perturbation significantly destabilized network dynamics, highlighting potential therapeutic targets.
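To make the idea of attractor states concrete, here is a minimal sketch of a classical binary Hopfield network. It is not the study's implementation: the Hebbian learning rule, the asynchronous update, and the two eight-gene patterns are assumptions chosen for illustration. A stored pattern plays the role of a stable phenotypic state, and a perturbed state relaxes back to it while the energy decreases.

```python
import numpy as np


def train_hopfield(patterns: np.ndarray) -> np.ndarray:
    """Hebbian weights for patterns with entries in {-1, +1}; shape (n_patterns, n_units)."""
    n_units = patterns.shape[1]
    weights = patterns.T @ patterns / n_units
    np.fill_diagonal(weights, 0.0)  # no self-connections
    return weights


def energy(weights: np.ndarray, state: np.ndarray) -> float:
    """Hopfield energy; attractors are local minima of this function."""
    return float(-0.5 * state @ weights @ state)


def recall(weights: np.ndarray, state: np.ndarray, max_sweeps: int = 20) -> np.ndarray:
    """Update units one at a time until no unit changes, i.e. the state is an attractor."""
    state = state.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            new_value = 1 if weights[i] @ state >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state


# Two hypothetical binarized "expression programs" over eight genes.
patterns = np.array([
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
])
weights = train_hopfield(patterns)

noisy = patterns[0].copy()
noisy[0] = -1  # perturb one gene
recovered = recall(weights, noisy)
```

In this toy case the perturbed state falls back into the first stored pattern, and `energy(weights, recovered)` is lower than `energy(weights, noisy)`.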
The analysis revealed distinct yet overlapping pathways driving plasticity in K27M and IDH wild-type (IDHWT) gliomas. K27M tumors showed enrichment for ribosomal protein-encoding genes, metabolic regulators, and immune signaling components, with ANXA6, NDUFAB1, and UBE2L3 exhibiting significant BDM shifts. IDHWT gliomas displayed upregulation of genes involved in cell cycle dynamics, oxidative stress response, and chromatin remodeling, including SPATS2, TFDP2, HOOK3, and GATAD1. Both subtypes showed evidence of dysregulated neurodevelopmental programs, with markers of radial glia and neuronal differentiation. Calcium dynamics, ECM remodeling, and WNT signaling emerged as shared drivers of plasticity across both pHGG types. The convergence of findings across different deep learning models strengthens the reliability of the identified markers.
The study highlights the role of immune evasion, metabolic reprogramming, and ECM remodeling in pHGG plasticity and suggests these processes as potential therapeutic targets. The observed upregulation of neuronal lineage markers in both subtypes, despite their distinct cells of origin, suggests a potential therapeutic strategy of inducing neuronal differentiation to stabilize tumor plasticity and reduce aggressiveness. This innovative approach could offer new hope for children battling these devastating cancers. The study’s limitations include the lack of longitudinal data, restricting analysis to pseudotemporal inference. Future studies incorporating time-series gene expression could further refine our understanding of pHGG dynamics and inform the development of novel therapeutic strategies.
TopoLa: A Universal Framework to Enhance Cell Representations for Single-cell and Spatial Omics through Topology-encoded Latent Hyperbolic Geometry by Kai Zheng, Shaokai Wang, Yunpei Xu, Qiming Lei, Qichang Zhao, Xiao Liang, Qilong Feng, Yaohang Li, Min Li, Jinhui Xu, Jianxin Wang https://arxiv.org/abs/2501.08363
Researchers have developed a novel computational framework called Topology-encoded Latent Hyperbolic Geometry (TopoLa) to enhance cell representations in both single-cell and spatial transcriptomics. TopoLa addresses a key challenge in the field: capturing the intricate intercellular relationships that drive biological processes. Existing methods, while powerful, often introduce biases and limitations due to their varying underlying principles. TopoLa tackles this by encoding these relationships within a latent hyperbolic space, enabling a more precise and nuanced understanding of cellular interactions.
The framework hinges on two core components: TopoLa distance (TLd) and TopoConv. TLd is a new metric that quantifies the geometric distance between cells in the latent hyperbolic space, reflecting their structural similarity within the cell network. It achieves this by considering the weighted number of even-hop paths connecting cells, incorporating both local node connectivity and global network structure. TopoConv is a specialized spatial convolution technique that leverages TLd to refine cell representations. By convolving neighboring cells based on their positions in the latent hyperbolic space, TopoConv integrates geometric structural information, thereby correcting biases and enhancing the robustness of cell representations. The TLd matrix D<sub>topo</sub> is calculated as:
D<sub>topo</sub>(I - D<sub>topo</sub>)<sup>-1</sup> = A<sup>T</sup>A / λ, which rearranges to D<sub>topo</sub> = A<sup>T</sup>A(λI + A<sup>T</sup>A)<sup>-1</sup>
where A is the adjacency matrix of the original network, and λ is a regularization parameter.
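The closed form for D<sub>topo</sub> can be checked numerically with a short NumPy sketch. The function name, the toy path graph, and the value of λ below are illustrative assumptions; only the formula itself comes from the text.

```python
import numpy as np


def topola_distance(adjacency: np.ndarray, lam: float) -> np.ndarray:
    """Closed form D_topo = A^T A (lam * I + A^T A)^{-1}, for lam > 0."""
    gram = adjacency.T @ adjacency
    n = gram.shape[0]
    # Solve (lam * I + gram) X = gram instead of forming an explicit inverse.
    # gram commutes with (lam * I + gram)^{-1}, so X equals gram (lam * I + gram)^{-1}.
    return np.linalg.solve(lam * np.eye(n) + gram, gram)


# Hypothetical toy network: an undirected path graph on four cells.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
lam = 0.5
D_topo = topola_distance(A, lam)

# Sanity check of the defining relation D (I - D)^{-1} = A^T A / lam.
lhs = D_topo @ np.linalg.inv(np.eye(4) - D_topo)
rhs = A.T @ A / lam
```

Because A<sup>T</sup>A is positive semi-definite, λI + A<sup>T</sup>A is invertible for any λ > 0, so the closed form is always well defined.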
The researchers evaluated TopoLa's performance across seven diverse biological tasks, including single-cell RNA sequencing (scRNA-seq) data clustering, multi-batch and multi-omic integration, rare cell identification, and spatially informed clustering of spatial transcriptomics (ST) data. Across these tasks, TopoLa consistently improved the performance of several state-of-the-art models. For instance, in scRNA-seq clustering, TopoLa enhanced the Adjusted Rand Index (ARI) by 5.8% and Normalized Mutual Information (NMI) by 2.4% compared to SIMLR. In multi-batch integration, scGPT augmented with TopoLa improved NMI by 3.2% and ARI by a substantial 15.4%. For rare cell identification, TopoLa boosted F1 scores by 28.1% compared to standard Surprisal Component Analysis (SCA). In spatially informed clustering of ST data, TopoLa enhanced ARI by 5.2% with predefined labels and by an impressive 20.9% when the number of cell types was unknown.
The consistent improvements across these diverse tasks highlight TopoLa’s generalizability and robustness. The researchers also provided a mathematical justification for its effectiveness, demonstrating that TLd, by incorporating global connectivity, captures the geometric structure of networks more precisely than traditional energy distance measures. Furthermore, they showed that TopoConv, by increasing the singular value gap, reduces noise interference and enhances the robustness of cell representations. TopoLa represents a significant advance in cell analysis, offering a novel and powerful approach to capture and leverage the intricate geometric relationships between cells. Its versatility and robust performance across various tasks position it as a valuable tool for driving both biological discovery and computational methodology development.
Transcriptome signature for the identification of bevacizumab responders in ovarian cancer by Olga Zolotareva, Karen Legler, Olga Tsoy, Anna Esteve, Alexey Sergushichev, Vladimir Sukhov, Jan Baumbach, Kathrin Eylmann, Minyue Qi, Malik Alawi, Stefan Kommoss, Barbara Schmalfeldt, Leticia Oliveira-Ferrer https://arxiv.org/abs/2501.04869
Ovarian cancer, with its high mortality rate and frequent late-stage diagnoses, presents a significant challenge in oncology. While cytoreductive surgery followed by platinum/taxane chemotherapy and maintenance therapy with bevacizumab and/or PARP inhibitors is the standard of care, there's a critical need for biomarkers to predict bevacizumab response and personalize treatment. This study aimed to address this gap by identifying a transcriptomic signature associated with bevacizumab benefit in ovarian cancer.
Researchers generated a novel RNA-seq dataset (UKE cohort, n=181) from patients treated at Universitätsklinikum Hamburg-Eppendorf, including 67 patients who received bevacizumab and 114 who received standard therapy. They also used a previously published microarray-based dataset (DASL cohort, n=377) for validation. Analyzing individual gene expression and known molecular subtypes yielded no statistically significant predictors of bevacizumab response. However, unsupervised patient stratification using the UnPaSt method, which identifies differentially expressed biclusters, revealed a promising candidate.
Out of 23 biclusters replicated in both UKE and DASL datasets, bicluster 84 emerged as the most promising predictor of overall survival (OS) under bevacizumab treatment. In both cohorts, patients with high expression of bicluster 84 genes showed improved OS with bevacizumab compared to standard therapy (UKE: HR=0.41 (0.23-0.74), adj.p-value=7.70e-03; DASL: HR=0.51 (0.34-0.75), adj.p-value=3.25e-03). No significant OS benefit was observed in patients with low bicluster 84 expression. The bicluster was also reproduced in the TCGA-OV dataset (n=426), demonstrating its reproducibility across datasets and platforms, although that cohort did not receive bevacizumab, so this step speaks to the reproducibility of the signature rather than to treatment benefit.
Bicluster 84 includes 14 genes significantly differentially expressed in both UKE and DASL cohorts, notably the cancer/testis antigen CTCFL/BORIS. This gene is known to regulate VEGF-A expression and maintain cancer stemness, suggesting a potential mechanism for bevacizumab's effect in this subgroup. While traditional pathway analysis provided limited insights, mining of public gene expression data linked bicluster 84 with stemness-related processes, further supporting the role of CTCFL. This signature offers a potential new classification of ovarian tumors independent of established molecular subtypes and could guide personalized bevacizumab treatment decisions, potentially also informing therapies targeting CTCFL. Further validation in larger, RNA-seq-based cohorts is needed to confirm these findings and refine the signature’s clinical utility.
A Pan-cancer Classification Model using Multi-view Feature Selection Method and Ensemble Classifier by Tareque Mohmud Chowdhury, Farzana Tabassum, Sabrina Islam, Abu Raihan Mostofa Kamal https://arxiv.org/abs/2501.06805
Caption: ROC curves for 12 cancer types (CHOL, COAD, ESCA, KICH, KIRC, KIRP, LIHC, PAAD, READ, STAD, UCEC, and UCS) against the rest, demonstrating the high performance of the avEns model. Each plot shows the ROC curves for 10 folds of cross-validation, with the mean ROC curve and AUC score indicated. The high AUC scores (mostly 1.00 or very close) across all folds and cancer types demonstrate the robustness and accuracy of the model in distinguishing these often-confused cancer types.
Accurately classifying cancer types is crucial for effective diagnosis and treatment. Traditional methods struggle with the high dimensionality and complexity of gene expression data. This study introduces a novel feature selection framework and ensemble classifiers designed specifically for transcriptome data, demonstrating significant improvements in pan-cancer classification accuracy. Existing methods often misclassify tumors with similar tissues of origin, a challenge this study directly addresses.
The researchers employed a multi-view feature selection method, partitioning the transcriptome dataset based on feature types (mRNA, miRNA, lncRNA, and other RNA). The Boruta feature selection algorithm was applied to each partition, the results combined, and Boruta applied again to the combined set. This process was iterated with varying Boruta parameters, generating multiple feature sets. Features were then ranked based on their frequency across these sets, with the best-performing set selected for further analysis. Two ensemble models, max voting parallel ensemble (mvEns) and average voting parallel ensemble (avEns), were constructed using Logistic Regression, Support Vector Machine, and XGBoost as base classifiers. The avEns model calculates the final prediction by averaging the prediction probabilities of the sub-models:
y = argmax<sub>i</sub> (1/m) Σ<sup>m</sup><sub>j=1</sub> W<sub>ij</sub>
where W<sub>ij</sub> is the probability that the j<sup>th</sup> classifier assigns to the i<sup>th</sup> class label, and m is the number of base classifiers.
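The averaging rule can be sketched in a few lines of NumPy. The array layout (classifiers × samples × classes) and the toy probabilities are assumptions made for illustration; in the study the probabilities would come from the fitted Logistic Regression, Support Vector Machine, and XGBoost models.

```python
import numpy as np


def average_voting(probabilities: np.ndarray) -> np.ndarray:
    """Soft voting over an array of shape (n_classifiers, n_samples, n_classes).

    Averages the class probabilities across classifiers and returns, for each
    sample, the index of the class with the highest mean probability.
    """
    mean_probabilities = probabilities.mean(axis=0)
    return mean_probabilities.argmax(axis=1)


# Hypothetical outputs of three base classifiers for two samples and three classes.
probabilities = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.5, 0.4]],
    [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]],
    [[0.2, 0.5, 0.3], [0.1, 0.3, 0.6]],
])
predictions = average_voting(probabilities)
```

For the second sample, the first classifier prefers class 1, but the averaged probabilities favor class 2, so the ensemble predicts class 2.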
The avEns model achieved an overall accuracy of 97.11% using a selected feature set of 3515 genes, outperforming state-of-the-art methods in classifying 33 cancer types from The Cancer Genome Atlas (TCGA) dataset. Critically, the model demonstrated over 90% accuracy in classifying 12 cancer types known to be difficult to differentiate due to their shared tissues of origin (CHOL, COAD, ESCA, KICH, KIRC, KIRP, LIHC, PAAD, READ, STAD, UCEC, and UCS). This is a marked improvement over existing literature, where accuracies for these specific cancers often fall below 90%. The mvEns model also performed well, achieving 96.88% accuracy. The superior performance of avEns is attributed to its use of probability averaging, which minimizes information loss compared to the majority voting approach of mvEns.
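A small example shows how majority voting can discard information that probability averaging keeps. The numbers are invented for illustration, and the tie-breaking rule (lowest class index) is an assumption rather than something stated in the paper.

```python
import numpy as np


def majority_vote(probabilities: np.ndarray) -> int:
    """Hard voting for one sample; input shape (n_classifiers, n_classes).

    Each classifier votes for its top class. Ties go to the lowest class index,
    which is an assumption of this sketch.
    """
    votes = probabilities.argmax(axis=1)
    counts = np.bincount(votes, minlength=probabilities.shape[1])
    return int(counts.argmax())


def average_vote(probabilities: np.ndarray) -> int:
    """Soft voting for one sample: argmax of the mean class probabilities."""
    return int(probabilities.mean(axis=0).argmax())


# Two classifiers lean weakly toward class 1; one is highly confident in class 0.
probabilities = np.array([
    [0.45, 0.55],
    [0.45, 0.55],
    [0.95, 0.05],
])
hard = majority_vote(probabilities)
soft = average_vote(probabilities)
```

Here hard voting returns class 1 (two votes to one), while soft voting returns class 0 because the confident classifier outweighs two near-coin-flip votes. Confidence information of this kind is lost once each probability vector is reduced to a single vote.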
Further analysis using Gene Ontology (GO) pathway enrichment revealed that the selected features significantly enriched pathways known to be associated with cancer development, including Maturity Onset Diabetes of the Young (MODY), Complement and Coagulation Cascades (CCC), and Extracellular Matrix (ECM)-receptor interaction. This finding supports the biological relevance of the selected features and strengthens the validity of the proposed feature selection framework. The study's results highlight the potential of multi-view feature selection and ensemble classifiers for improving pan-cancer classification, offering promising avenues for enhanced diagnostic accuracy and personalized treatment strategies.
This newsletter highlights significant advancements in computational biology, particularly in cancer research and single-cell omics. From innovative feature selection and ensemble methods for improved cancer classification to the development of novel deep learning architectures for understanding phenotypic plasticity and leveraging topological information for enhanced cell representations, these studies showcase the power of computational approaches in addressing complex biological questions. The introduction of Phenformer, a genetic language model capable of interpreting multi-megabase scale genome sequences, marks a significant leap forward in our ability to predict disease risk and understand underlying mechanisms directly from DNA. The application of deep learning to decode phenotypic plasticity in pHGGs offers new insights into tumor heterogeneity and potential therapeutic vulnerabilities. Similarly, TopoLa's innovative use of latent hyperbolic geometry enhances cell representations, improving performance across various single-cell and spatial omics tasks. The identification of a transcriptomic signature for predicting bevacizumab response in ovarian cancer underscores the potential of personalized medicine approaches, while the development of cliPE provides a powerful tool for functional genomics studies. Collectively, these advancements demonstrate the rapid pace of innovation in computational biology and its potential to transform our understanding and treatment of complex diseases.