Subject: Cutting-Edge Computational Biology: From Graphs to Genomes and Beyond
Hi Elman,
This newsletter covers recent exciting advancements in computational biology, spanning novel techniques for spatial transcriptomics, genomic variation analysis, multi-modal data integration, single-cell analysis, and the growing role of large language models.
Several new computational approaches are enhancing the analysis of complex biological data, particularly in genomics and transcriptomics. Xiao et al. (2024) introduce GATES, a graph attention network for spatial transcriptomics that integrates both local spatial proximity and global gene expression similarity, aiming to overcome the bias of existing methods toward purely local spatial information and thereby improve spatial domain identification. Similarly, Su et al. (2024) propose sisPCA, a supervised extension of Principal Component Analysis that leverages the Hilbert-Schmidt Independence Criterion (HSIC) to disentangle interpretable factors across multiple subspaces. They demonstrate its utility in diverse applications, including breast cancer diagnosis and single-cell analysis of malaria infection. Together, these methods highlight the growing importance of combining local and global information, and of supervised learning, for feature extraction and data interpretation.
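sisPCA's use of the Hilbert-Schmidt Independence Criterion is easy to illustrate: a minimal biased HSIC estimator with linear kernels might look like the sketch below. This is illustrative only, not the authors' implementation (which operates over multiple supervised subspaces).

```python
import numpy as np

def hsic(X, Y):
    """Biased empirical HSIC estimate with linear kernels.

    X: (n, p) features, Y: (n, q) features. Returns a scalar that is
    near zero when X and Y are independent and grows with dependence.
    """
    n = X.shape[0]
    K = X @ X.T                            # linear kernel on X
    L = Y @ Y.T                            # linear kernel on Y
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# a variable scores much higher against itself than against noise
print(hsic(X, X) > hsic(X, rng.normal(size=(200, 5))))  # → True
```

In sisPCA, an HSIC-style penalty of this kind encourages the learned subspaces to stay statistically independent of one another.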
Another key theme is the development of methods for analyzing genomic variation and structure. Mwaniki, Garrison, and Pisanti (2024) introduce the concept of "flubbles" to represent variants in pangenome graphs, offering a linear-time algorithm for their detection and for the construction of a hierarchical "flubble tree." This work provides new tools for investigating genomic structural variation, including hairpin inversions. Li et al. (2024) tackle the challenge of generating heterogeneous genomic sequences by combining autoregressive and diffusion models with a novel post-training sampling method called Absorb & Escape. Their approach addresses the limitations of each individual model type, improving the quality of generated sequences across multiple species.
The integration of multiple data modalities is also gaining traction. Krishna et al. (2024) present PathoGen-X, a cross-modal network that translates and aligns histopathology image features into the genomic feature space for enhanced survival prediction. This approach leverages the predictive power of genomic data during training while relying solely on readily available imaging data at test time. Rezapour et al. (2024) assess the concordance between RNA-Seq and NanoString technologies for gene expression analysis in Ebola-infected nonhuman primates, employing machine learning for cross-platform validation and identifying key markers of infection. Their findings highlight the complementary strengths of the two technologies for comprehensive gene expression profiling.
Furthermore, researchers are exploring new ways to address challenges in single-cell analysis. Wu, Zuo, and Xie (2024) develop eDOC, a method that pairs a transformer architecture with evidential learning to identify and interpret out-of-domain cell types in scRNA-seq data, including pinpointing cell type-specific gene drivers. Chen et al. (2024) introduce a fast and scalable Wasserstein-1 neural optimal transport solver for single-cell perturbation prediction, offering significant speed improvements over existing Wasserstein-2-based methods. These advancements contribute to a more nuanced understanding of cellular heterogeneity and responses to perturbations.
Finally, the role of Large Language Models (LLMs) in biomedical research is being actively investigated. Wang et al. (2024) evaluate the capacity of LLMs to perform real-world data science tasks in clinical research, finding that while LLMs are not yet ready for full automation, they can significantly enhance efficiency when integrated into expert workflows. Their work underscores the potential of LLMs as tools for augmenting, rather than replacing, human expertise in complex data analysis. Beyond LLMs, computational methods continue to reach new domains. V and P (2024) examine the societal implications of nanopore sequencing technology, offering a sociological perspective on its dissemination and potential impact. Powadi et al. (2024) introduce a compositional autoencoder that disentangles genotype- and environment-specific latent features for improved trait prediction in plants. Hatami et al. (2024) explore explainable convolutional neural networks for identifying SARS-CoV-2 mutations associated with phenotypic changes, offering an alternative to traditional GWAS approaches. Together, these applications demonstrate the broadening impact of computational methods across biological and medical research.
Can Large Language Models Replace Data Scientists in Clinical Research? by Zifeng Wang, Benjamin Danek, Ziwei Yang, Zheng Chen, Jimeng Sun https://arxiv.org/abs/2410.21591
Caption: This figure illustrates the evaluation pipeline for LLMs on clinical data science tasks using the CliniDSBench dataset. It shows the model input (question and dataset description), the answer generation process with various prompting methods, and the evaluation metrics based on unit tests and Pass@k scores. The figure also highlights the human-AI collaboration platform where users can integrate LLM-generated code into their workflows.
Data science is indispensable in clinical research but demands specialized expertise in coding and medical data analysis, a resource often in short supply. Large Language Models (LLMs) offer a potential solution, given their demonstrated proficiency in general coding and medical tasks. However, their practical utility in the specific context of clinical research data science remains largely unexplored. This study aimed to rigorously assess the capability of LLMs to automate real-world data science tasks in clinical research.
To achieve this, the researchers developed a benchmark dataset, CliniDSBench, comprising 293 coding tasks derived from 39 published clinical studies. These tasks, coded in both Python and R, realistically simulate clinical research scenarios using patient data. Six state-of-the-art LLMs (GPT-4, GPT-4-mini, Sonnet, Opus, Gemini-pro, and Gemini-flash) were evaluated using various adaptation strategies, including chain-of-thought prompting, few-shot prompting, and self-reflection. The primary performance metric was Pass@k, the probability of obtaining at least one correct solution within k attempts.
While LLMs struggled to consistently produce perfect solutions on the first attempt (Pass@1 varied considerably depending on task difficulty and the specific LLM), they often generated code that was close to being correct. Strategic adaptations like chain-of-thought prompting, providing a step-by-step analysis plan, significantly boosted performance, improving accuracy by 60%. Self-reflection, allowing LLMs to iteratively refine their code, also yielded a substantial 38% accuracy gain. Importantly, the study went beyond automated evaluation and explored the potential of human-AI collaboration by developing a platform that integrates LLMs into the data science workflow. A user study with five medical doctors demonstrated that while LLMs cannot fully automate coding, they significantly streamline the process. A remarkable 80% of the submitted code solutions incorporated LLM-generated code, with reuse rates reaching up to 96% in some cases. Code reuse was quantified using the Copy Ratio, calculated as the length of overlapping tokens between LLM-generated code and user-submitted code, divided by the length of the user-submitted code.
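Both metrics above are straightforward to compute. Below is a sketch of the standard unbiased pass@k estimator (as popularized in the code-generation evaluation literature) alongside a simple Copy Ratio; the whitespace tokenization in `copy_ratio` is an assumption for illustration, as the paper's exact tokenizer may differ.

```python
from collections import Counter
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, passes the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def copy_ratio(llm_code, user_code):
    """Fraction of the user's submitted tokens that also appear in the
    LLM-generated code (multiset overlap over whitespace tokens)."""
    llm, user = Counter(llm_code.split()), Counter(user_code.split())
    overlap = sum((llm & user).values())
    return overlap / max(sum(user.values()), 1)

print(round(pass_at_k(10, 3, 1), 2))  # → 0.3
```

With 3 correct solutions out of 10 generations, pass@1 is simply 3/10; the combinatorial form matters for k > 1, where naive averaging would be biased.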
This research underscores the current limitations of LLMs in fully automating complex clinical data science tasks. They frequently failed to accurately interpret instructions, understand the target data, and adhere to standard analysis practices, highlighting the continued need for human oversight. However, the study also reveals the immense potential of LLMs as valuable tools when integrated into expert workflows. The developed platform effectively showcased the power of human-AI collaboration, significantly enhancing productivity and facilitating data science tasks in clinical research.
PathoGen-X: A Cross-Modal Genomic Feature Trans-Align Network for Enhanced Survival Prediction from Histopathology Images by Akhila Krishna, Nikhil Cherian Kurian, Abhijeet Patil, Amruta Parulekar, Amit Sethi https://arxiv.org/abs/2411.00749
Caption: The PathoGen-X architecture uses a pathology encoder (with LN, MSA, and PPEG modules) to extract image features and a genomic decoder (with similar modules) to translate these into a genomic representation space. A projection module (PM) and survival prediction network then predict survival risk from the aligned latent representation, trained using a combined loss function (L<sub>t</sub> and L<sub>i</sub>) incorporating both genomic and imaging data.
Accurate cancer survival prediction is paramount for personalized treatment strategies. While genomic data often provides superior predictive power compared to pathology data, its acquisition can be costly and inaccessible. PathoGen-X, a novel cross-modal deep learning framework, addresses this challenge by leveraging both genomic and imaging data during training but relying solely on readily available histopathology images for testing. This innovative approach circumvents the limitations of existing methods that either require both data types during testing or project both modalities into a shared latent space, potentially leading to information loss. The core idea behind PathoGen-X is to translate the weaker imaging signals into the more informative genomic feature space, thereby enhancing the predictive capabilities of the imaging data.
PathoGen-X utilizes a transformer-based encoder-decoder architecture. A pathology encoder extracts relevant features from histopathology images, while a genomic decoder translates these features into a genomic representation space. A genomic projection network maps genomic embeddings into a latent space aligned with the output of the pathology encoder. This alignment is achieved without projecting both modalities into a shared latent space, preserving crucial modality-specific information. A survival prediction network, implemented as a multi-layer perceptron, then operates on this aligned latent representation to predict survival risk, trained using the Cox loss function (L<sub>Cox</sub> = −Σ<sub>i∈events</sub>(f(X<sub>i</sub>) − log Σ<sub>j∈R(T<sub>i</sub>)</sub>exp(f(X<sub>j</sub>)))). The entire model is trained using both pathology and genomic data, optimizing a combined loss function that incorporates both latent and translation losses to ensure effective cross-modal alignment and translation.
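As a concrete illustration, the negative Cox partial log-likelihood can be written in a few lines of NumPy. This is a sketch without tie handling, not the PathoGen-X implementation:

```python
import numpy as np

def cox_loss(risk, time, event):
    """Negative Cox partial log-likelihood.

    risk:  (n,) predicted log-risk scores f(X_i)
    time:  (n,) observed event/censoring times T_i
    event: (n,) 1 if the event occurred, 0 if censored
    """
    loss = 0.0
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]  # risk set R(T_i): still event-free at T_i
        loss -= risk[i] - np.log(np.exp(risk[at_risk]).sum())
    return loss

# toy cohort: the model assigns higher risk to patients who fail earlier
risk = np.array([2.0, 1.0, 0.5])
time = np.array([1.0, 2.0, 3.0])
event = np.array([1, 1, 0])       # last patient is censored
print(cox_loss(risk, time, event) > 0)  # → True
```

Minimizing this loss pushes the model to rank patients who experience events earlier as higher risk, which is exactly what the c-index reported in the evaluation measures.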
Evaluated on three publicly available cancer datasets (TCGA-BRCA, TCGA-LUAD, and TCGA-GBM), PathoGen-X consistently outperformed methods relying solely on pathology images for testing, achieving an average improvement of 0.05 in c-index. Remarkably, its performance approached that of models trained exclusively on genomic data, demonstrating the effectiveness of this cross-modal translation strategy. Ablation studies further confirmed the importance of both the latent and translation loss components for optimal performance. This research highlights the potential of leveraging genomic data during training to enhance the predictive power of readily available histopathology images for survival prediction. PathoGen-X’s ability to achieve genomic-level performance using only imaging data at test time makes it a promising tool for clinical settings where genomic data might be inaccessible.
Disentangling Genotype and Environment Specific Latent Features for Improved Trait Prediction using a Compositional Autoencoder by Anirudha Powadi, Talukder Zaki Jubery, Michael C. Tross, James C. Schnable, Baskar Ganapathysubramanian https://arxiv.org/abs/2410.19922
Caption: The top diagram shows a standard autoencoder processing high-dimensional sensor data from a plant, resulting in a latent representation that entangles genotype and environment. The bottom diagram illustrates the proposed compositional autoencoder, which disentangles these factors into structured latent representations, enabling improved trait prediction.
In plant breeding and genetics, predicting complex traits from high-dimensional sensor data, like hyperspectral reflectance, is increasingly common. However, this data reflects the intertwined effects of both genotype (G) and environment (E), making it challenging to isolate their individual contributions. Traditional dimensionality reduction methods, such as PCA and standard autoencoders, create compact representations but fail to disentangle these intertwined effects. This study introduces a novel compositional autoencoder (CAE) framework designed specifically to address this limitation, leading to significant improvements in trait prediction.
The CAE uses a hierarchical architecture to decompose high-dimensional plant data into distinct G and E latent features. The process begins by encoding individual plant data from multiple replicates across different environments. These encoded representations are then fused and partitioned into three components: genotype-specific features, macro-environment features (shared by plants in the same field/year), and micro-environment features (unique to each replicate). Finally, these disentangled features are reassembled and decoded to reconstruct the original plant data. The CAE is trained using a two-part loss function: a reconstruction loss (mean squared error) and a correlation loss to ensure that the disentangled latent space features remain uncorrelated. The correlation loss is defined as ΣᵢΣⱼ |CorrMatᵢⱼ − Iᵢⱼ|, where CorrMatᵢⱼ is the correlation coefficient between latent dimensions i and j, and Iᵢⱼ is the corresponding entry of the identity matrix.
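The correlation loss as described can be sketched in a few lines of NumPy; this is an illustration of the penalty, not the authors' code.

```python
import numpy as np

def correlation_loss(Z):
    """Decorrelation penalty Σᵢ Σⱼ |CorrMat_ij − I_ij| on a batch of
    latent codes Z (n samples × d dimensions). Diagonal entries
    contribute zero; off-diagonal correlations are penalized."""
    corr = np.corrcoef(Z, rowvar=False)        # d × d correlation matrix
    return np.abs(corr - np.eye(Z.shape[1])).sum()

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 4))                  # nearly uncorrelated codes
Z_dup = np.hstack([Z[:, :2], Z[:, :2]])        # duplicated dims → perfectly correlated
print(correlation_loss(Z) < correlation_loss(Z_dup))  # → True
```

Adding this penalty to the reconstruction loss discourages the genotype, macro-environment, and micro-environment partitions from encoding redundant information.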
The researchers applied the CAE to a maize diversity panel dataset, comprising hyperspectral reflectance data from 578 genotypes grown in two environments with two replicates each. They then used the disentangled latent representations to predict two key traits: "Days to Pollen" and "Yield." The CAE significantly outperformed traditional methods, including standard autoencoders and PCA with regression. For "Days to Pollen," the CAE achieved an R² of 0.74, a dramatic improvement over the near-zero R² obtained with the standard autoencoder. For "Yield," a notoriously difficult-to-predict trait, the CAE achieved an R² of 0.34, again significantly outperforming the standard autoencoder (R² of 0.026) and PCA (R² of 0.034).
The study also demonstrated the robustness of the CAE's performance by training the model with multiple initial conditions and different regression models. Further experiments explored the impact of hyperparameters such as input masking, network depth, and latent space dimension, revealing that a 20% masking fraction and a latent space dimension of 20 provided optimal performance. These findings highlight the potential of the CAE for enhancing trait prediction in plant breeding and genetics programs.
Fast and scalable Wasserstein-1 neural optimal transport solver for single-cell perturbation prediction by Yanshuo Chen, Zhengmian Hu, Wei Chen, Heng Huang https://arxiv.org/abs/2411.00614
Predicting single-cell responses to perturbations, such as drug treatments, is a critical task in biology, but computationally demanding. Existing methods often employ Wasserstein-2 (W₂) optimal transport (OT) to map unpaired control and perturbed cell distributions. However, W₂ OT necessitates solving a complex min-max optimization problem, which converges slowly, particularly with high-dimensional data. This study introduces a novel solver based on the Wasserstein-1 (W₁) dual formulation, offering substantially faster and more scalable performance.
The key advantage of the W₁ dual lies in its simplified optimization. Unlike W₂, which requires optimizing over two conjugate functions, W₁ involves only a single 1-Lipschitz function, eliminating the need for time-consuming min-max optimization. While the W₁ dual itself reveals only the transport direction and not the full transport map, the researchers ingeniously incorporate adversarial training to learn a sample-specific transport step size, effectively recovering the complete map. This two-step process—learning the transport direction via W₁ dual and then the step size via GANs—forms the core of the proposed W₁ OT solver. The transport map is defined as: T(x) = x – η(x)∇f(x)/||∇f(x)||, where f(x) is the Kantorovich potential (1-Lipschitz function) and η(x) is the learned step size function.
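The transport map above is simple to apply once a potential and step-size function are in hand. The sketch below uses an analytic 1-Lipschitz potential and a constant step size as stand-ins for the paper's learned networks, purely for illustration:

```python
import numpy as np

def w1_transport(x, grad_f, eta):
    """Apply T(x) = x − η(x) ∇f(x)/‖∇f(x)‖ to a batch of points.

    grad_f: returns ∇f(x) per row (here a stand-in for the learned
            Kantorovich potential's gradient)
    eta:    returns the per-sample step size (learned via GANs in the
            paper; constant here for illustration)
    """
    g = grad_f(x)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    return x - eta(x)[:, None] * g / np.maximum(norms, 1e-12)

# toy potential f(x) = ‖x‖ (1-Lipschitz): the normalized gradient points
# away from the origin, so each point moves one unit inward with η ≡ 1.
x = np.array([[3.0, 4.0], [0.0, 2.0]])
moved = w1_transport(x, grad_f=lambda x: x, eta=lambda x: np.ones(len(x)))
print(np.linalg.norm(moved, axis=1))  # → [4. 1.]
```

Because only the unit gradient direction enters the map, the step-size network η(x) carries all the information about how far each cell should move, which is why the adversarial step is needed to recover the full transport map.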
Validation on both synthetic and real single-cell datasets demonstrated the efficacy of the W₁ OT solver. On 2D synthetic data, it successfully learned "monotonic" transport maps, preserving the relative order of data points after transport, a crucial property for maintaining biological interpretability. On real single-cell perturbation datasets ("4i" imaging and "sciplex3" scRNA-seq), the W₁ OT solver achieved performance comparable to or exceeding existing W₂ OT solvers. Critically, the W₁ solver exhibited a remarkable 25-45x speedup compared to W₂ solvers, completing training in minutes versus hours on a CPU. Furthermore, it demonstrated superior scalability on high-dimensional scRNA-seq data, where W₂ solvers struggled. This improved scalability stems from the use of the W₁ dual formulation and 1-Lipschitz GroupSort networks.
This newsletter highlights the rapid advancements in computational biology, showcasing innovative approaches to address complex challenges in data analysis and interpretation. The development of GATES and sisPCA underscores the importance of integrating local and global information in spatial transcriptomics and feature extraction. The introduction of "flubbles" and the Absorb & Escape method provides new tools for analyzing and generating genomic sequences, respectively. PathoGen-X demonstrates the power of cross-modal learning for enhanced survival prediction, while the novel W₁ optimal transport solver offers a faster and more scalable solution for single-cell perturbation prediction. Finally, the exploration of LLMs in clinical research highlights their potential as valuable tools for augmenting human expertise in complex data analysis tasks, even if full automation remains a future goal. These diverse advancements point towards a future where computational methods play an even more central role in driving biological discovery and improving human health.