Subject: Cutting-Edge Advances in Biological Sequence Analysis with Machine Learning
Hi Elman,
This collection of preprints explores diverse applications of machine learning and bioinformatics in biological sequence analysis, ranging from multi-omics integration to single-cell analysis and disease biomarker discovery. A common theme across several studies is the need for explainable AI (xAI) to enhance the interpretability of complex models. Hussein, Prasad, and Braytee (2024) review the role of xAI in multi-omics research, emphasizing its potential to unlock the "black box" of deep learning models and provide clinicians with actionable insights. This focus on interpretability is crucial for translating the power of deep learning into clinical applications, particularly when integrating diverse data sources like genomics, proteomics, and metabolomics.
Another key area is developing novel methodologies for analyzing single-cell data and identifying disease biomarkers. Umar, Asif, and Mahmood (2024) introduce EnProCell, a reference-based method for single-cell RNA sequencing (scRNA-seq) analysis. Their approach leverages lower-dimensional projections obtained through an ensemble of principal component analysis and multiple discriminant analysis, coupled with a deep neural network for cell type classification. This method demonstrates improved accuracy and F1 scores compared to existing methods, offering a promising new tool for scRNA-seq research. Meanwhile, Khatun et al. (2024) employ a bioinformatic approach combined with machine learning algorithms—specifically support vector machines (SVM) and random forests (RF)—to identify biomarkers and pathways in gallbladder cancer (GBC). Their analysis reveals eleven hub genes, three of which (SLIT3, COL7A1, and CLDN4) are strongly implicated in GBC development and prediction.
Weinstein, Wood, and Blei (2024) tackle inferring the causal effects of T cell receptors (TCRs) on patient outcomes. Their method addresses unobserved confounders by utilizing the patient's pre-selection TCR repertoire, estimated from nonproductive TCR data. This innovative approach leverages the randomized nature of V(D)J recombination as a natural experiment, allowing for causal inferences about the impact of specific TCR sequences on disease outcomes. Their application to COVID-19 severity data highlights the method's potential to uncover therapeutically relevant TCRs.
Xiang et al. (2024) introduce BSM, a compact yet powerful multimodal biological sequence model. Trained on diverse datasets (RefSeq, gene-related sequences, and interleaved biological sequences from the web), BSM performs strongly on single-modal and mixed-modal tasks, outperforming larger models. Importantly, BSM exhibits in-context learning capabilities for mixed-modal tasks, absent from existing models. This work highlights the potential of mixed-modal training for improving the efficiency and cross-modal representation learning of biological sequence models. Finally, in a distinct line of work, Chantzi, Mouratidis, and Georgakopoulos-Soares (2024) investigate the distribution and properties of Zimin avoidmers (k-mers lacking Zimin patterns) across various genomes. Their findings reveal differences in Zimin avoidmer frequencies and genomic localization preferences across organisms, with higher densities in prokaryotes and lower densities in eukaryotes, contributing to our understanding of sequence complexity and its potential biological significance.
Estimating the Causal Effects of T Cell Receptors by Eli N. Weinstein, Elizabeth B. Wood, David M. Blei https://arxiv.org/abs/2410.14127
Caption: This figure demonstrates CAIRE's superior performance in identifying causal TCRs compared to an uncorrected method. CAIRE achieves a PR AUC of approximately 0.9 at a motif rate of 0.01, significantly outperforming the uncorrected method with a PR AUC of roughly 0.55. This highlights CAIRE's ability to accurately predict the causal effect of TCRs on disease outcomes, even in the presence of confounding factors.
A new method called CAIRE (causal adaptive immune receptor effect estimator) aims to revolutionize our understanding of how TCRs influence disease. Instead of merely associating TCRs with disease, CAIRE predicts the causal effect of adding a specific TCR to a patient's repertoire, potentially enabling targeted therapies and vaccine development. Analyzing observational TCR data is challenging due to unobserved confounders (e.g., patient environment, infection history) that create spurious correlations. CAIRE addresses this by leveraging the pre-selection repertoire—a patient's initial TCR set before antigen exposure—as an instrumental variable. This repertoire, estimated from nonproductive TCR data (TCRs with disabling mutations), acts as a natural experiment unaffected by confounders.
CAIRE's core methodology uses a biophysical model of antigen-dependent selection: q̃<sub>i</sub>(x) = (r<sub>i</sub>(x) / Σ<sub>x'∈X</sub> r<sub>i</sub>(x')q(x')) q(x). This describes how the mature repertoire q̃<sub>i</sub>(x) develops from the pre-selection repertoire q(x) based on the relative fitness r<sub>i</sub>(x) of each TCR x in patient i. Fitting this model reconstructs the selective pressures shaping a patient's repertoire, correcting for unobserved confounders. CAIRE then uses deep representation learning to embed high-dimensional TCR sequences into a lower-dimensional space, enabling computationally feasible causal effect estimation, even for rare TCRs.
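The selection step above can be sketched numerically. The function below is an illustrative stand-in, not CAIRE's implementation: it assumes a small discrete TCR space and simply re-weights the pre-selection frequencies by each TCR's relative fitness, then renormalizes (the frequencies and fitness values are made up).

```python
import numpy as np

def mature_repertoire(q_pre, fitness):
    """Re-weight pre-selection TCR frequencies by fitness and renormalize.

    The denominator of the selection model, sum_x' r_i(x') q(x'), is exactly
    the sum of the fitness-weighted pre-selection frequencies.
    """
    weighted = fitness * q_pre
    return weighted / weighted.sum()

q_pre = np.array([0.5, 0.3, 0.2])    # hypothetical pre-selection frequencies
fitness = np.array([1.0, 2.0, 0.5])  # hypothetical relative fitness per TCR
q_mature = mature_repertoire(q_pre, fitness)
```

As expected, the high-fitness TCR expands in the mature repertoire while the low-fitness one contracts, and the result remains a valid probability distribution.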
In a semisynthetic data study, CAIRE accurately identified causal TCRs despite confounding (PR AUC: 0.86 ± 0.04), outperforming a non-causal method (PR AUC: 0.56 ± 0.02) and a state-of-the-art repertoire classification method without confounding correction (PR AUC: 0.55 ± 0.03). Applied to COVID-19 severity data, CAIRE revealed diverse TCR effects on outcomes: strong positive, weak positive, and even negative effects. CAIRE's causal effect estimates aligned better with laboratory binding data than estimates from a method without confounding correction.
The COVID-19 analysis also revealed heterogeneous patient repertoires, with varying causal effect TCR mixes. Some severely ill patients had a higher burden of TCRs with negative effects, potentially driving overactive immune responses, suggesting that the balance of beneficial and detrimental TCRs influences disease outcome. CAIRE identified promising therapeutic candidates: TCR sequences observed in patients, binding SARS-CoV-2 antigens in vitro, and showing positive clinical effects. It also highlighted potential vaccine antigens enriched for TCRs with beneficial effects. While limitations exist (e.g., reliance on a simplified T cell development model, potential residual confounding), CAIRE's ability to leverage pre-selection repertoires provides a valuable strategy for disentangling correlation from causation.
BSM: Small but Powerful Biological Sequence Model for Genes and Proteins by Weixi Xiang, Xueting Han, Xiujuan Chai, Jing Bai https://arxiv.org/abs/2410.11499
Caption: This diagram illustrates the BSM training process, showcasing the three-round approach using increasingly complex datasets. Starting with single-modal biological sequence data (DNA, RNA, protein), the model progressively incorporates mixed-modal pairs from NCBI RefSeq and gene-related sequences, culminating in the integration of interleaved web data for a holistic understanding of biological information.
Modeling biological sequences (DNA, RNA, proteins) is crucial for understanding complex processes. Existing models often specialize in a single data type or treat multiple types separately. BSM (Biological Sequence Model) takes a different approach: a mixed-modal foundation model trained on diverse datasets capturing relationships between DNA, RNA, and proteins. This includes RefSeq data, Gene Related Sequences data capturing gene-gene and gene-protein relationships, and interleaved biological sequences from the web, mimicking the natural co-occurrence of different biological data types. This cross-modal focus allows BSM to gain a more holistic understanding of each modality.
BSM's architecture uses a single-nucleotide tokenizer and an autoregressive Transformer to capture long-range dependencies. Two sizes exist: BSM-110M (12 layers, 12 attention heads, 768 hidden dimensions) and BSM-270M (20 layers, 16 attention heads, 896 hidden dimensions). Both use rotary position embedding and flash-attention for efficient training. A three-round training approach is key. The first round uses single-modal data to establish a foundational understanding. Subsequent rounds progressively incorporate mixed-modal datasets, using simulated annealing to optimize the data mix.
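The two published model sizes can be summarized in a small configuration sketch. The dataclass and its field names are illustrative, not taken from the BSM codebase; only the numeric values come from the paper.

```python
from dataclasses import dataclass

@dataclass
class BSMConfig:
    """Hypothetical config mirroring the two BSM sizes described above."""
    n_layers: int
    n_heads: int
    hidden_dim: int
    position_embedding: str = "rotary"       # both sizes use rotary embeddings
    tokenizer: str = "single-nucleotide"     # per-nucleotide tokenization

bsm_110m = BSMConfig(n_layers=12, n_heads=12, hidden_dim=768)
bsm_270m = BSMConfig(n_layers=20, n_heads=16, hidden_dim=896)
```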
BSM's performance was evaluated on single-modal and mixed-modal tasks. Despite its small size, BSM achieved comparable or superior performance to larger models, including billion-parameter models. In the ncRNA-Protein Interaction (ncRPI) task, BSM outperformed LucaOne. In the Central Dogma task (DNA-protein associations), BSM performed on par with LucaOne. BSM also demonstrated few-shot learning on mixed-modal tasks, achieving near-supervised fine-tuning performance—a feat not observed in other models. In protein-specific tasks like Protein-Protein Interaction (PPI) and Prokaryotic Protein Subcellular Location (ProtLoc), BSM outperformed all baselines.
Scaling from 110M to 270M parameters further improved performance. Ablation studies confirmed the value of mixed-modal data, with each training round contributing to improvements. Perplexity evaluation on protein data demonstrated BSM's strong generative capabilities. These results highlight the effectiveness of BSM's mixed-modal approach, offering a powerful and efficient alternative to resource-intensive larger models.
Explainable AI Methods for Multi-Omics Analysis: A Survey by Ahmad Hussein, Mukesh Prasad, Ali Braytee https://arxiv.org/abs/2410.11910
Multi-omics, the integrated analysis of data from multiple biological levels (genomics, proteomics, etc.), offers a powerful approach to understanding complex biological systems. Deep learning (DL) is valuable for analyzing multi-omics data, but its "black box" nature hinders interpretability and clinical application. This survey explores how explainable AI (xAI) enhances DL model transparency in multi-omics research, highlighting its potential to provide actionable clinical insights. The shift from hypothesis-driven to data-driven methodologies requires transparent models for effective interpretation and application of complex data, particularly in clinical settings. The increasing use of DL in integrating multi-omics data for disease subtyping, biomarker discovery, and personalized medicine underscores the critical need for xAI.
The survey reviewed various xAI methods categorized by scope (global/local), implementation (ante-hoc/post-hoc), applicability (model-specific/agnostic), and explanation level (machine/human-interpretable). Model-agnostic approaches like SHAP (Shapley Additive exPlanations), using game theory to distribute prediction contributions among features (Φ<sub>i</sub> = Σ<sub>S⊆F\{i}</sub> |S|!(d−|S|−1)!/d! [F<sub>S∪{i}</sub>(x<sub>S∪{i}</sub>) − F<sub>S</sub>(x<sub>S</sub>)], where d is the number of features), and LIME (Local Interpretable Model-agnostic Explanations), providing local explanations, were discussed. Model-specific methods like attention mechanisms, Class Activation Mapping (CAM), Integrated Gradients, and Layerwise Relevance Propagation (LRP) were also examined. Visualization techniques (t-SNE, UMAP, heatmaps) and methods like Partial Dependence Plots (PDP) and Accumulated Local Effects (ALE) were highlighted for enhancing interpretability.
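The Shapley weighting can be made concrete with a brute-force computation over all feature subsets. This is only an illustration of the formula for a toy value function (real SHAP implementations use efficient approximations, since exact enumeration is exponential in the number of features).

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all subsets S of the other features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [f for f in range(n_features) if f != i]
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                s = len(subset)
                # Shapley weight |S|! (d - |S| - 1)! / d!
                weight = factorial(s) * factorial(n_features - s - 1) / factorial(n_features)
                phi[i] += weight * (value_fn(set(subset) | {i}) - value_fn(set(subset)))
    return phi

# Toy additive game: a coalition's value is the sum of its members' payoffs.
payoffs = {0: 1.0, 1: 2.0, 2: 3.0}
phi = shapley_values(lambda s: sum(payoffs[f] for f in s), 3)
```

For an additive game like this, each feature's Shapley value equals its own payoff, which makes the example easy to sanity-check.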
The review analyzed xAI applications in multi-omics research, including cancer subtyping, biomarker discovery, and drug response prediction. One study used xAI-guided deep learning to identify DNA methylation biomarkers for non-small cell lung cancer, achieving 84.95% classification accuracy. Another study integrated omics data with molecular networks, demonstrating superior performance in various classifications (single-cell embryonic stage, pan-cancer type). These studies highlight xAI's potential to improve DL model accuracy and interpretability in multi-omics analysis.
However, limitations in current xAI methodologies were identified, including challenges related to small sample sizes, data biases, computational intensity, and interpreting complex inter-omics interactions. The "black box" nature of deep learning, despite xAI integration, still poses interpretability challenges. The lack of standardized xAI evaluation metrics hinders comparison and assessment. Future research should address these limitations by developing more robust and scalable xAI techniques, improving complex interaction interpretability, and establishing standardized metrics.
A Bioinformatic Approach Validated Utilizing Machine Learning Algorithms to Identify Relevant Biomarkers and Crucial Pathways in Gallbladder Cancer by Rabea Khatun, Wahia Tasnim, Maksuda Akter, Md Manowarul Islam, Md. Ashraf Uddin, Md. Zulfiker Mahmud, Saurav Chandra Das https://arxiv.org/abs/2410.14433
Caption: This figure compares the performance of Support Vector Machine (SVM) and Random Forest (RF) classifiers in predicting gallbladder cancer based on different gene sets. The classifiers trained on the 11 identified hub genes achieved the highest accuracy (0.85 for SVM and 0.90 for RF), outperforming models trained on genes selected by Pearson correlation or Recursive Feature Elimination. This highlights the potential of these hub genes as robust biomarkers for gallbladder cancer.
Gallbladder cancer (GBC) is highly lethal, with a poor prognosis. Early detection is crucial, but the molecular mechanisms driving GBC progression remain poorly understood. This study integrated bioinformatics and machine learning to identify and validate novel GBC biomarkers, potentially improving diagnosis and treatment. Two microarray datasets (GSE100363, GSE139682) from NCBI GEO, containing GBC tumor and normal tissue samples, were analyzed.
Differentially expressed genes (DEGs) were identified by comparing GBC tumor and normal samples. Filtering by adjusted p-value (< 0.05) and log2 fold change (|log2FC| > 1) yielded 146 DEGs: 39 up-regulated and 107 down-regulated in GBC. Functional enrichment analysis (DAVID) revealed that up-regulated DEGs were involved in cell adhesion, epidermis development, and keratinization, while down-regulated DEGs were associated with cell differentiation, nervous system development, and cell adhesion. A protein-protein interaction (PPI) network was constructed (STRING database), and hub genes were identified using Degree, Maximum Neighborhood Component (MNC), and Closeness Centrality algorithms. The intersection yielded 11 hub genes. Feature selection methods (Pearson correlation, Recursive Feature Elimination (RFE)) identified significant gene subsets.
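The DEG filter above (adjusted p < 0.05 and |log2FC| > 1) is straightforward to express with pandas. The data frame below is a synthetic stand-in with illustrative values, not the actual GEO results; the column names are our choice.

```python
import pandas as pd

def filter_degs(df, padj_col="padj", lfc_col="log2FC"):
    """Split genes into up-/down-regulated DEGs by the thresholds in the text."""
    degs = df[(df[padj_col] < 0.05) & (df[lfc_col].abs() > 1)]
    up = degs[degs[lfc_col] > 1]
    down = degs[degs[lfc_col] < -1]
    return up, down

# Illustrative table; fold changes and p-values are invented for the sketch.
genes = pd.DataFrame({
    "gene":   ["SLIT3", "COL7A1", "CLDN4", "GENE_X"],
    "log2FC": [2.1,      1.8,      1.5,     0.3],
    "padj":   [0.001,    0.02,     0.04,    0.2],
})
up, down = filter_degs(genes)
```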
Machine learning models (SVM, RF) were trained on GSE100363 and validated on GSE139682 to assess the predictive performance of different gene sets. The 11 hub genes consistently outperformed other subsets. The RF model trained on hub genes achieved 0.90 accuracy, while the SVM model achieved 0.85 accuracy. Hub gene expression levels were validated using the GEPIA database, confirming significant up-regulation of SLIT3, COL7A1, and CLDN4 in GBC patients.
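The cross-dataset validation scheme (train on one cohort, test on an independent one) can be sketched with scikit-learn. The expression matrices below are random stand-ins for the 11-hub-gene data from GSE100363/GSE139682, so the accuracies here are meaningless; only the train/validate structure mirrors the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_genes = 11  # one feature per hub gene

# Synthetic stand-ins for the training and independent validation cohorts.
X_train = rng.normal(size=(60, n_genes)); y_train = rng.integers(0, 2, 60)
X_test  = rng.normal(size=(30, n_genes)); y_test  = rng.integers(0, 2, 30)

results = {}
for name, model in [("SVM", SVC()), ("RF", RandomForestClassifier(random_state=0))]:
    model.fit(X_train, y_train)                 # train on cohort 1
    preds = model.predict(X_test)               # validate on cohort 2
    results[name] = accuracy_score(y_test, preds)
```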
This study highlights the power of integrating bioinformatics and machine learning for biomarker discovery. The identified hub genes, particularly SLIT3, COL7A1, and CLDN4, are promising diagnostic and prognostic GBC biomarkers. Further investigation of these genes and their pathways could provide valuable insights into GBC development and progression. The study also underscores the importance of validating bioinformatics findings with independent datasets and machine learning models.
Lower-dimensional projections of cellular expression improves cell type classification from single-cell RNA sequencing by Muhammad Umar, Muhammad Asif, Arif Mahmood https://arxiv.org/abs/2410.09964
Caption: This figure illustrates the EnProCell workflow. It shows the process of generating ensembled projections from gene expression data using MDA and PCA (a-d), followed by training a deep neural network for cell type classification (e-h). This approach leverages both unsupervised and supervised dimensionality reduction for improved cell type identification in scRNA-seq data.
Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular heterogeneity, but accurate cell type identification remains challenging due to high dimensionality and data sparsity. Current methods often rely on unsupervised dimensionality reduction like PCA, which may not optimally separate cell types. This paper introduces EnProCell, a reference-based method leveraging both unsupervised and supervised dimensionality reduction for improved cell type classification.
EnProCell operates in two phases. First, it computes lower-dimensional projections capturing high variance and class separability by ensembling PCA with Multiple Discriminant Analysis (MDA). MDA maximizes inter-class scatter while minimizing intra-class scatter, improving cell type separation in lower-dimensional space. The ensembled components are represented by S<sub>mj</sub> = [VU] (V: MDA components, U: PCA components). This projects scRNA-seq data into a lower-dimensional space using P<sub>nj</sub> = X<sub>nm</sub> * S<sub>mj</sub> (X<sub>nm</sub>: gene expression matrix). Second, EnProCell trains a deep neural network (autoencoder-like architecture) on this representation to classify cell types, using ReLU activation for hidden layers, softmax for the output layer, and minimizing cross-entropy loss.
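The ensembled projection step can be sketched with scikit-learn, using LinearDiscriminantAnalysis as a stand-in for the paper's MDA: concatenate the supervised discriminant directions V with the unsupervised principal axes U into S = [V U], then project with P = X · S. The data and component counts are illustrative, not from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))      # synthetic cells x genes expression matrix
y = rng.integers(0, 3, size=100)    # synthetic cell type labels (3 classes)

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
V = lda.scalings_[:, :2]                          # discriminant directions (genes x 2)
U = PCA(n_components=10).fit(X).components_.T     # principal axes (genes x 10)

S = np.hstack([V, U])   # ensembled components S = [V U]
P = X @ S               # lower-dimensional projection P = X * S
```

In EnProCell, P then becomes the input to the deep neural network classifier; here it is simply a (cells × components) matrix combining class-separating and variance-capturing directions.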
EnProCell was evaluated on six datasets from different scRNA-seq technologies, compared to ACTINN, CaSTLe, SingleR, and scPred. For intra-dataset classification, EnProCell consistently outperformed others (accuracy: 98.91%, F1 score: 98.64%). It also excelled in inter-dataset classification (accuracy: 99.52%, F1 score: 99.07%). Varying the number of principal components showed that including PCA components generally improved performance, especially in intra-dataset classification. A five-layer autoencoder-like structure performed best. These results suggest that incorporating class information during dimensionality reduction leads to more discriminative features. EnProCell's robustness across datasets and technologies highlights its broad applicability. Beyond accuracy, EnProCell is computationally efficient, making it promising for large-scale scRNA-seq studies.
This newsletter highlights significant advancements in applying machine learning to biological sequence analysis. The emphasis on explainable AI (xAI), as exemplified by the survey on multi-omics research, underscores the growing need for interpretability in complex models. This theme resonates with the development of CAIRE, which tackles the challenging problem of causal inference in TCR repertoires, offering a potential breakthrough in understanding the adaptive immune system's role in disease. Similarly, BSM showcases the power of mixed-modal training for biological sequence models, achieving impressive performance with a compact architecture. The development of EnProCell for scRNA-seq analysis demonstrates how innovative dimensionality reduction techniques coupled with deep learning can significantly improve cell type classification. Finally, the application of machine learning in identifying GBC biomarkers exemplifies the potential of these techniques to accelerate disease research and improve clinical outcomes. These advancements collectively represent a significant step forward in leveraging the power of machine learning and bioinformatics to unlock the complexities of biological systems.