This newsletter covers several new computational approaches in genomics, spanning raw signal analysis, variant interpretation, gene expression modeling, and machine learning applications. Firtina et al. (2024) present Rawsamble, a tool for de novo assembly directly from raw nanopore signals that bypasses basecalling altogether. Its hash-based approach is faster and uses less memory than conventional pipelines that basecall with Dorado and then find overlaps with minimap2. Just over a third (36.57%) of Rawsamble's overlapping pairs match those of minimap2, and it assembles unitigs up to 2.7 million bases long, which suggests that assembly need not depend on basecalling.
Disease mechanism prediction and gene prioritization are also key areas of advancement. Saadat and Fellay (2024a) present a graph-of-graphs approach that combines protein-protein interaction networks and structural interactomics to predict the mode of inheritance and molecular mechanisms of genetic diseases, integrating graph neural networks and topological features. In a separate study, Saadat and Fellay (2024b) employ a DNA language model (HyenaDNA) and an interpretable graph neural network for gene prioritization and pathway identification in rare diseases, using dynamic gene embeddings derived from variant data.
New methods for analyzing single-cell RNA sequencing (scRNA-seq) data are also emerging. Zhou et al. (2024) introduce scPN, a dynamical modeling approach that simultaneously infers cell pseudotime, velocity field, and gene interaction networks from multi-branch scRNA-seq data. Their piecewise gene-gene interaction network model provides an interpretable framework for complex gene regulation. Separately, Pashay Ahi et al. (2024) investigate the role of m6A RNA methylation in Atlantic salmon maturation and report differential expression of m6A-related genes in the hypothalamus, which suggests an epitranscriptomic contribution to maturation.
Finally, several papers showcase machine learning applications. Tsui et al. (2024) present SHAP zero, an algorithm for efficiently estimating all-order Shapley feature interactions in black-box genomic models using the Fourier transform. Liang (2024) introduces DNAHLM, a hybrid large language model for DNA sequences and English text. Wickramarachchi et al. (2024) develop AskBeacon, a tool that uses LLMs to interact with genomic data via the GA4GH Beacon protocol. Chu et al. (2024) apply hierarchical classification with elastic-net regularization to predict metastasis from gene expression data. Golan and Shur (2024) explore the theory of minimizer schemes in biological string processing.
Rawsamble: Overlapping and Assembling Raw Nanopore Signals using a Hash-based Seeding Mechanism by Can Firtina, Maximilian Mordig, Harun Mustafa, Sayan Goswami, Nika Mansouri Ghiasi, Stefano Mercogliano, Furkan Eris, Joël Lindegger, Andre Kahles, Onur Mutlu https://arxiv.org/abs/2410.17801
Caption: This figure compares the output of Minimap2, a traditional overlapper operating on basecalled data (left), with that of Rawsamble, a novel overlapper working directly with raw nanopore signals (right). Rawsamble produces a more fragmented representation but demonstrates the feasibility of raw signal overlapping, achieving unitigs of substantial length (2.7 million bps) and showing significant overlap with minimap2's results.
Rawsamble enables de novo assembly directly from raw nanopore signals, bypassing the computationally intensive and time-consuming basecalling step. It builds on the existing RawHash2 algorithm, with several changes aimed at accuracy and efficiency. Rawsamble implements an aggressive filtering technique to remove non-distinct and adjacent signals, mitigating the impact of stay errors, a common issue in nanopore sequencing. In addition, the chaining mechanism is refined to construct longer and more accurate chains between noisy signals, and a deterministic ordering mechanism prevents trivial cyclic overlaps.
Evaluations across various genomes show a speedup (averaging 16.36x and up to 41.59x) and a reduction in peak memory usage (averaging 11.73x and up to 41.99x) compared to conventional pipelines that use Dorado for basecalling and minimap2 for overlapping. Of the overlapping pairs identified by Rawsamble, 36.57% are identical to those minimap2 generates after basecalling, which indicates that a sizable share of overlaps can be recovered from raw signals alone. For E. coli, Rawsamble assembles contiguous segments (unitigs) up to 2.7 million bases long, highlighting its potential for genome assembly.
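The overlap-agreement figure can be made concrete with a small sketch. It assumes an overlap is an unordered pair of read IDs and that the percentage is taken over the raw-signal overlapper's pairs. The function names and read IDs are hypothetical, and the paper's exact counting rules may differ.

```python
def normalize_pairs(pairs):
    """Treat each overlap as an unordered pair of read IDs; drop self-overlaps."""
    return {frozenset(pair) for pair in pairs if pair[0] != pair[1]}


def shared_fraction(candidate_pairs, reference_pairs):
    """Fraction of candidate overlap pairs that also appear in the reference set."""
    candidate = normalize_pairs(candidate_pairs)
    reference = normalize_pairs(reference_pairs)
    if not candidate:
        return 0.0
    return len(candidate & reference) / len(candidate)


# Hypothetical read IDs, for illustration only.
raw_signal_overlaps = [("r1", "r2"), ("r2", "r3"), ("r4", "r5"), ("r5", "r4")]
basecalled_overlaps = [("r2", "r1"), ("r3", "r4")]

fraction = shared_fraction(raw_signal_overlaps, basecalled_overlaps)
print(f"{fraction:.2%} of raw-signal overlap pairs also appear in the reference")
```

Using unordered pairs means that (r1, r2) and (r2, r1) count as the same overlap, so the result does not depend on which read an overlapper reports first.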
While Rawsamble marks a significant step forward, the study also identifies challenges for future research in real-time de novo assembly, including dynamic index update and real-time assembly graph construction. The increasing memory and computational demands of dynamic index updates necessitate the development of efficient stopping mechanisms. Similarly, current assembly graph algorithms struggle with streaming data, prompting the exploration of alternative graph structures. Addressing the current inability of Rawsamble to reverse complement raw signals, which can lead to assembly gaps, is another area for future improvement. Integrating overlapping information with basecalling for enhanced accuracy and enabling selective basecalling of assembly-relevant reads could further optimize efficiency.
SHAP zero Explains All-order Feature Interactions in Black-box Genomic Models with Near-zero Query Cost by Darin Tsui, Aryan Musharaf, Amirali Aghazadeh https://arxiv.org/abs/2410.19236
Caption: This figure illustrates the workflow of SHAP zero, a novel algorithm for explaining black-box genomic models. It leverages the Fourier transform to efficiently estimate model behavior and then maps these coefficients to Shapley-based explanations, enabling the identification of high-order feature interactions at near-zero cost per query sequence after an initial one-time setup cost.
SHAP zero addresses a computational bottleneck in explaining black-box genomic models: calculating Shapley interactions, particularly high-order ones. By exploiting a connection between Shapley interactions and the Fourier transform, SHAP zero offers near-zero cost explanations per query sequence after an initial model sketching phase. This one-time investment involves estimating the top s Fourier coefficients, with a sample complexity of O(sn²) and a computational complexity of O(sn³). Many query sequences can then be explained with minimal additional overhead, unlike traditional methods that require extensive model evaluations for each query.
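To see why per-query cost matters, the sketch below computes exact Shapley values by enumerating every feature subset, the textbook definition rather than SHAP zero itself. The toy model, its coefficients, and the call counter are hypothetical. Real genomic explainers also need a rule for what an "absent" position means, which this sketch ignores.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value_fn, n):
    """Exact Shapley values over n features by enumerating all subsets."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_model(present):
    """Hypothetical set function: two additive effects plus one pairwise interaction."""
    score = 0.0
    if 0 in present:
        score += 1.0
    if 2 in present:
        score += 3.0
    if 0 in present and 1 in present:
        score += 2.0
    return score


calls = {"n": 0}


def counting_model(present):
    """Wrap the toy model to count how many evaluations one explanation needs."""
    calls["n"] += 1
    return toy_model(present)


values = exact_shapley(counting_model, 3)
print(values, calls["n"])
```

The number of model evaluations grows as n·2ⁿ for a single query in this brute-force form, which is the kind of cost that amortized approaches aim to avoid.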
The efficiency of SHAP zero stems from mapping the Fourier coefficients to the Möbius transform and subsequently to Shapley-based explanations, all with complexities independent of the input sequence length $n$. The mapping from Fourier coefficients to the Möbius transform has a complexity of $O(s^2 2^q)$, where $q$ is the alphabet size (4 for DNA). The final mappings to the Shapley values $I^{SV}(i)$ and interactions $I^{FSI}(T)$ are given by:
$$I^{SV}(i) = \frac{1}{q} \sum_{k:\, k_i > 0} k_0 \, g^{k_0} M[k]$$
$$I^{FSI}(T) = (-1)^{|T|} \sum_{k:\, k_T > 0_{|T|}} \frac{1}{q^{\|k\|_0}} M[k]$$
where $M[k]$ is the Möbius transform, $k$ is the feature interaction vector, and $g$ is the primitive $q$-th root of unity. This efficient mapping enables SHAP zero to reveal high-order feature interactions at a fraction of the computational cost of existing methods.
The effectiveness of SHAP zero is demonstrated through its application to two genomic models: TIGER, for predicting CRISPR guide RNA binding, and inDelphi, for predicting DNA repair outcomes. SHAP zero achieves significant speed improvements over existing methods like KernelSHAP and SHAP-IQ, showcasing its practical utility for large-scale genomic analyses. Importantly, it successfully identifies biologically relevant high-order features, such as GC content in TIGER and microhomologous motifs in inDelphi, which were previously computationally inaccessible.
AskBeacon -- Performing genomic data exchange and analytics with natural language by Anuradha Wickramarachchi, Shakila Tonni, Sonali Majumdar, Sarvnaz Karimi, Sulev Kõks, Brendan Hosking, Jordi Rambla, Natalie A. Twine, Yatish Jain, Denis C. Bauer https://arxiv.org/abs/2410.16700
AskBeacon empowers researchers and clinicians to interact with global genomic data resources through the power of natural language, removing technical barriers associated with complex query construction. By leveraging Large Language Models (LLMs), AskBeacon translates natural language questions into structured queries for the GA4GH Beacon protocol, enabling users to access and analyze data without specialized expertise. This user-friendly interface simplifies data retrieval, analysis, and visualization, streamlining the process from initial inquiry to publication-ready insights. Furthermore, AskBeacon facilitates federated queries across the global Beacon Network, empowering smaller research groups, including those representing underrepresented populations, to securely share and analyze data on a global scale.
AskBeacon's architecture employs both parallel and multi-step workflows for information extraction. The parallel workflow uses separate extractor templates for different query components, which makes it more resilient but consumes more tokens. The multi-step workflow is more token-efficient but susceptible to chain termination. Evaluations of different LLMs, including open-source models like Gemma 2 and commercial models like GPT-4, highlighted Gemma 2's competitive performance in tasks like scope and granularity extraction, where it achieved F1-scores of 0.92 and 0.81 for the parallel and multi-step workflows, respectively. This suggests that smaller, fine-tuned models could perform comparably to larger ones.
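The resilience trade-off between the two workflows can be illustrated with a minimal sketch. It is not AskBeacon's code: the extractor functions are hypothetical stand-ins for LLM prompt templates, the returned strings are illustrative, one extractor is made to fail on purpose, and token usage is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_scope(question):
    """Hypothetical stand-in for an LLM template that extracts the query scope."""
    return "individuals" if "individual" in question.lower() else "unknown"


def extract_granularity(question):
    """Hypothetical stand-in for an LLM template that extracts granularity."""
    return "count" if "how many" in question.lower() else "record"


def extract_filters(question):
    """Simulates an extractor whose LLM output could not be parsed."""
    raise ValueError("simulated malformed LLM output")


EXTRACTORS = {
    "scope": extract_scope,
    "granularity": extract_granularity,
    "filters": extract_filters,
}


def run_parallel(question):
    """Run every extractor independently; a failure only loses its own field."""
    results = {}
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, question) for name, fn in EXTRACTORS.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except ValueError:
                results[name] = None
    return results


def run_multi_step(question, order):
    """Run extractors as a chain; the first failure terminates the chain."""
    results = {}
    for name in order:
        try:
            results[name] = EXTRACTORS[name](question)
        except ValueError:
            break
    return results


question = "How many individuals carry this variant?"
print(run_parallel(question))
print(run_multi_step(question, ["scope", "filters", "granularity"]))
```

In the parallel run the failing extractor leaves only its own field empty, while in the chained run every step after the failure is lost.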
Data security and human oversight are prioritized in AskBeacon. The platform maintains a clear separation between data extraction and analysis, allowing user validation and refinement. By utilizing sBeacon, a secure implementation of the Beacon protocol, data remains protected and is never directly exposed to the LLM. User access controls adhere to GA4GH guidelines, and security measures like static code analysis and sandboxing prevent malicious code execution.
DNA Language Model and Interpretable Graph Neural Network Identify Genes and Pathways Involved in Rare Diseases by Ali Saadat, Jacques Fellay https://arxiv.org/abs/2410.15367
Caption: This figure illustrates the framework for identifying causal genes and pathways in rare diseases using HyenaDNA and graph neural networks. Patient DNA sequences, with variants highlighted, are processed by HyenaDNA to generate gene embeddings, which are then used as input for both gene prioritization (case-vs-control and case-only) and pathway identification using GNNs and GNNExplainer. This approach enables the discovery of key genes and pathways associated with rare diseases, even with limited sample sizes.
This study presents a novel approach to gene prioritization and pathway identification in rare diseases, leveraging the power of DNA language models (DNA-LMs) and interpretable graph neural networks (GNNs). HyenaDNA, a long-range genomic foundation model, generates dynamic gene embeddings that capture the impact of deleterious variants, offering a granular view of their functional consequences. This approach addresses the challenges posed by small sample sizes in rare disease studies.
Two complementary methods for gene prioritization are introduced: case-vs-control and case-only. The case-vs-control approach uses logistic regression on gene embeddings to classify patients from controls, while the case-only approach calculates a distance score based on the Euclidean distance between mutant and non-mutant gene embeddings within the patient cohort. Pathway identification combines DNA-LMs, GNNs, and a Genetic Algorithm. Individual-specific protein-protein interaction (PPI) networks are constructed with gene embeddings as node features, and a GNN classifies patient graphs from controls. GNNExplainer scores node and edge importance, and a Genetic Algorithm identifies the subnetwork with the highest average explainability score. Pathway enrichment analysis then reveals over-represented pathways.
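A minimal sketch of the case-only idea follows, under one possible reading: each gene is scored by the mean Euclidean distance from its mutant embeddings to the centroid of its non-mutant embeddings. The centroid step, the gene names, and the two-dimensional vectors are assumptions for illustration; the study's exact scoring may differ.

```python
from math import dist


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def case_only_scores(embeddings, mutant_flags):
    """Score each gene by how far mutant embeddings sit from non-mutant ones.

    embeddings: gene -> list of per-patient embedding vectors
    mutant_flags: gene -> list of bools (True if that patient carries a variant)
    """
    scores = {}
    for gene, vectors in embeddings.items():
        flags = mutant_flags[gene]
        mutant = [v for v, is_mutant in zip(vectors, flags) if is_mutant]
        reference = [v for v, is_mutant in zip(vectors, flags) if not is_mutant]
        if not mutant or not reference:
            continue  # no contrast available for this gene
        center = centroid(reference)
        scores[gene] = sum(dist(v, center) for v in mutant) / len(mutant)
    return scores


# Hypothetical toy embeddings for three patients and three placeholder genes.
embeddings = {
    "GENE_A": [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]],
    "GENE_B": [[1.0, 1.0], [1.0, 1.0], [1.0, 2.0]],
    "GENE_C": [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]],
}
mutant_flags = {
    "GENE_A": [False, False, True],
    "GENE_B": [False, False, True],
    "GENE_C": [False, False, False],
}

scores = case_only_scores(embeddings, mutant_flags)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Genes without any mutant or any non-mutant embedding are skipped because the score needs both groups to be present.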
Validation on a cohort of children with severe respiratory illness and healthy controls identified IFIH1, a viral RNA sensor gene, as the top candidate in both prioritization methods. Pathway identification revealed a subnetwork involved in antiviral defense, with enrichment in interferon signaling and related pathways. This framework demonstrates the potential of DNA-LMs and GNNs for deciphering rare disease genetics, which could support targeted drug development and personalized medicine. Further research is needed to refine embedding interpretation and integrate multi-omics data. The probability of pathogenicity (PoP) is calculated as PoP = (OP * 0.1) / ((OP - 1) * 0.1 + 1), where OP (odds of pathogenicity) is a function of pathogenic and benign criteria.
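The PoP formula can be checked numerically with a short sketch. The function name is hypothetical, and exposing the constant 0.1 as a `prior` parameter is an assumption made for readability. How OP is derived from the pathogenic and benign criteria is not covered here.

```python
def probability_of_pathogenicity(odds_of_pathogenicity, prior=0.1):
    """PoP = (OP * prior) / ((OP - 1) * prior + 1), with prior = 0.1 as in the text."""
    if odds_of_pathogenicity < 0:
        raise ValueError("odds of pathogenicity must be non-negative")
    numerator = odds_of_pathogenicity * prior
    denominator = (odds_of_pathogenicity - 1.0) * prior + 1.0
    return numerator / denominator


# OP = 1 carries no evidence either way, so PoP stays at the 0.1 constant.
for op in (0.0, 1.0, 9.0, 100.0):
    print(op, round(probability_of_pathogenicity(op), 4))
```

With this form, odds below 1 pull the probability under 0.1 and odds above 1 push it toward 1, with OP = 9 landing at one half.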
Proteome-wide prediction of mode of inheritance and molecular mechanism underlying genetic diseases using structural interactomics by Ali Saadat, Jacques Fellay https://arxiv.org/abs/2410.17708
Caption: This figure illustrates the workflow for predicting the mode of inheritance (MOI) and functional effects of genetic variants. A graph neural network (GNN) predicts MOI from protein-protein interaction (PPI) networks, while AlphaFold-derived structures are used to predict functional effects (haploinsufficiency, gain-of-function, dominant-negative) through another GNN. The bar graph represents the distribution of predicted functional effects.
This research introduces a "graph-of-graphs" approach combining protein-protein interaction (PPI) networks and AlphaFold-derived protein structures to predict the mode of inheritance (MOI) and molecular mechanisms of genetic diseases. This framework utilizes graph neural networks (GNNs) and structural interactomics for proteome-wide predictions, offering a scalable method for understanding genetic disorders.
The methodology involves two prediction tasks. First, for MOI prediction, proteins are represented as nodes in the PPI network, and topological and protein-level features are used to classify proteins as autosomal dominant (AD), autosomal recessive (AR), or both (ADAR). Second, for molecular mechanism prediction, each protein is represented as a graph of amino acid residues, and structure-based features classify the functional effect of dominant-associated variants as haploinsufficiency (HI), gain-of-function (GOF), or dominant-negative (DN).
Using data from GenCC and OMIM for MOI and a curated dataset for molecular mechanisms, the study evaluated various GNN architectures. A GAT model achieved the highest recall for MOI prediction, while a GCN model performed best for functional effect prediction. Applying the trained GAT model to the entire PPI network revealed a distribution of predicted MOIs, and subsequent functional effect prediction on the AD/ADAR subset revealed a significant portion with combined effects.
Feature importance analysis highlighted the influence of constraint and conservation features for AD prediction, mitochondrial localization for AR prediction, and specific structural features for different functional effects. This work represents a significant advancement in predicting genetic disease inheritance and molecular mechanisms, paving the way for a deeper understanding of the complex interplay between genetic variation and disease.
This newsletter highlights significant advances in computational genomics, showcasing the transformative potential of novel approaches. Rawsamble's ability to perform de novo assembly directly from raw nanopore signals, bypassing basecalling, marks a potential paradigm shift in sequence analysis. The development of SHAP zero offers an efficient solution for explaining complex black-box genomic models, enabling the identification of crucial high-order feature interactions. AskBeacon democratizes access to genomic data by leveraging natural language processing, empowering researchers with varying technical expertise. The application of DNA language models and graph neural networks for rare disease gene discovery and pathway identification holds promise for advancing personalized medicine. Finally, the innovative use of structural interactomics for predicting genetic disease inheritance and molecular mechanisms offers a scalable and insightful approach to understanding the complex relationship between genotype and phenotype. These advancements collectively represent a substantial step forward in our ability to analyze, interpret, and ultimately utilize genomic data for improved healthcare and scientific discovery.