Several new studies explore innovative approaches to analyzing genomic and proteomic data, leveraging advances in AI, network science, and language models. He et al. (2024) introduce Genome Misclassification Network Analysis (GMNA), a framework that integrates misclassified instances from AI models with network analysis to refine genome classification and identify the drivers of misclassification. Applying GMNA to SARS-CoV-2 genomes using Naive Bayes, convolutional neural networks, and transformer models, they demonstrate its potential for investigating the impact of human mobility on viral spread. This work highlights the value of incorporating model errors into a network-based framework for deeper biological insight. Concurrently, Liang (2024) investigates the transferability of language models from natural language to DNA sequences, finding that models pre-trained on natural language, such as GPT-2, can achieve reasonable accuracy on DNA-pair classification tasks even when fine-tuned only on natural language data. This suggests a genuine, albeit limited, transfer of capabilities between these seemingly disparate domains, opening avenues for future research on cross-domain language model applications.
A separate line of research focuses on extracting and analyzing ancient DNA (aDNA) from unconventional sources. Zhao et al. (2024) present a groundbreaking method for extracting petroleum DNA (pDNA) using nanoparticle affinity bead technology, proposing that petroleum serves as a novel type of fossil containing valuable ecological and evolutionary information. In a parallel study, Zhao et al. (2024) report the successful extraction of aDNA from 120-million-year-old Lycoptera fossils, pushing the boundaries of aDNA research significantly beyond the previously accepted limits. However, Zhao et al. (2024) also raise crucial concerns about the challenges of distinguishing oriDNA (original in situ DNA) from eDNA (environmental DNA) contamination in fossil samples, advocating for more stringent methodological frameworks and questioning the reliability of deamination patterns as a definitive marker for aDNA. These studies collectively highlight both the exciting potential and the significant methodological challenges inherent in aDNA research.
Moving beyond aDNA analysis, Tamir & Yuan (2024) introduce ProtGO, a transformer-based fusion model for predicting Gene Ontology (GO) terms from protein sequences. They demonstrate state-of-the-art accuracy, particularly on clustered-split datasets, suggesting the model captures both short- and long-range dependencies within protein sequences. This work contributes to the growing field of automated protein annotation, leveraging transformer models for improved accuracy and efficiency. Patel et al. (2024) address the evaluation of DNA language models (DNALMs) by introducing DART-Eval, a benchmark suite focused on regulatory DNA. Their findings reveal inconsistent performance of current DNALMs relative to ab initio models, raising important questions about the current capabilities and computational costs of these large language models in genomics.
Finally, three studies address specific challenges in analyzing complex biological data. Wrobel & Song (2024) propose KAMP, a robust and scalable K-statistic for quantifying immune cell clustering in spatial proteomics data, addressing the spatial inhomogeneity that often biases traditional methods. Their application of KAMP to ovarian cancer data demonstrates its utility in identifying clinically relevant spatial patterns. Zhang & Roth (2024) introduce VEPerform, a web tool for evaluating the performance of variant effect predictors (VEPs) using balanced precision vs. recall curve analysis. This tool provides a valuable resource for assessing the accuracy and reliability of VEPs, which are increasingly important for interpreting the functional consequences of genetic variants. Downing (2024) explores the use of pangenome variation graphs (PVGs) for studying viral diversity, outlining tools for PVG construction, manipulation, and analysis. This work emphasizes the potential of PVGs for moving beyond traditional reference-based approaches and gaining deeper insights into viral evolution and population dynamics.
Ancient DNA from 120-Million-Year-Old Lycoptera Fossils Reveals Evolutionary Insights by Wan-Qian Zhao, Zhan-Yong Guo, Zeng-Yuan Tian, Tong-Fu Su, Gang-Qiang Cao, Zi-Xin Qi, Tian-Cang Qin, Wei Zhou, Jin-Yu Yang, Ming-Jie Chen, Xin-Ge Zhang, Chun-Yan Zhou, Chuan-Jia Zhu, Meng-Fei Tang, Di Wu, Mei-Rong Song, Yu-Qi Guo, Li-You Qiu, Fei Liang, Mei-Jun Li, Jun-Hui Geng, Li-Juan Zhao, Shu-Jie Zhang https://arxiv.org/abs/2412.06521
This groundbreaking study reports the extraction of ancient DNA (aDNA) from fossils of Lycoptera davidi, a ray-finned fish from the Early Cretaceous period (approximately 120 million years old). This achievement significantly pushes the boundaries of aDNA research, previously limited to fossils less than 1 million years old because of DNA degradation and contamination. The researchers developed a rigorous protocol, the "mega screen method," to isolate original in situ DNA (oriDNA) from environmental DNA (eDNA) contamination. The method involves two key steps: first, using a "minimum E-value mode" in BLAST to classify DNA fragments by lineage, and second, using an "MS mode" to pinpoint sequences unique to a single lineage. They also introduced the Affinity Index, the percentage of sequences in a subset exceeding a similarity threshold against a modern genome, to assess the evolutionary closeness of each identified subset to its respective modern genome.
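The Affinity Index is straightforward to operationalize. The sketch below is a minimal illustration rather than the authors' pipeline: it computes the index from standard BLAST tabular output (-outfmt 6) for one lineage subset, with the file name, identity threshold, and best-hit-per-query reduction all being illustrative assumptions.

```python
# Minimal sketch of the Affinity Index defined above: the percentage of
# sequences in a lineage subset whose best BLAST hit against a modern genome
# meets a percent-identity threshold. Input is standard BLAST tabular output
# (-outfmt 6); the threshold and best-hit reduction are assumptions, not the
# authors' exact protocol.
import csv
from collections import defaultdict

def affinity_index(blast_tsv_path: str, identity_threshold: float = 90.0) -> float:
    best_identity = defaultdict(float)  # query sequence id -> best % identity
    with open(blast_tsv_path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            qseqid, pident = row[0], float(row[2])  # outfmt 6: column 3 is pident
            best_identity[qseqid] = max(best_identity[qseqid], pident)
    if not best_identity:
        return 0.0
    n_above = sum(p >= identity_threshold for p in best_identity.values())
    return 100.0 * n_above / len(best_identity)

# Hypothetical usage: affinity_index("subset_vs_cypriniformes_hits.tsv")
```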
The researchers extracted DNA from both the textured (fish remains) and non-textured (surrounding rock matrix) portions of the fossils. Using next-generation sequencing, they obtained 1,258,901 DNA sequences, from which the mega screen method identified 243 oriDNA sequences likely originating from the Lycoptera genome. Remarkably, these sequences averaged over 100 base pairs in length and showed no signs of deamination, the form of DNA damage typically observed in fossils. The internal volume ratios of the fossils, computed as V1/V2 × 100%, ranged from 10.4% to 11.3%, where V1 = (W2 − W1) / density of water is the volume of absorbed water (W2 being the weight of the water-saturated fossil and W1 the weight of the dry fossil) and V2 is the total volume of the fossil.
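As a quick arithmetic illustration of that formula (the weights and total volume below are invented for the example, not the paper's measurements):

```python
# Worked sketch of the internal-volume-ratio formula above. Only the formula
# comes from the paper; the sample values are illustrative.
WATER_DENSITY_G_PER_CM3 = 1.0

def internal_volume_ratio(dry_weight_g: float, wet_weight_g: float,
                          total_volume_cm3: float) -> float:
    """Volume of absorbed water (V1) as a percentage of total fossil volume (V2)."""
    v1 = (wet_weight_g - dry_weight_g) / WATER_DENSITY_G_PER_CM3
    return 100.0 * v1 / total_volume_cm3

print(internal_volume_ratio(45.0, 46.1, 10.0))  # 11.0, inside the reported 10.4-11.3% range
```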
Analysis of the oriDNA sequences revealed fascinating insights into Lycoptera's evolutionary history. Of the 243 sequences, 180 aligned with local freshwater fish genomes (primarily Cypriniformes), 49 aligned with non-local (marine) fish genomes, and 14 aligned with other ray-finned fish but could not be classified at the order level. The average Affinity Indices for the local and non-local fish groups were 80.34% and 83.85%, respectively. Four sequences were identified as originating from prey or parasites of Lycoptera, including one from a fish parasite (Ichthyosporea) and three related to Macrobrachium nipponense (oriental river prawn). The researchers also discovered 10 transposase-encoding sequences, suggesting a novel mechanism for coding-region formation they call "coding region sliding replication and recombination," in which existing coding regions shift and recombine within the genome to generate new transposase sequences.

The findings challenge existing evolutionary theories, suggesting a closer genomic link between Lycoptera and modern Cypriniformes (carp fishes) than previously thought. The presence of marine fish DNA hints at a genomic connection between Lycoptera and marine species, possibly reflecting an earlier stage of fish evolution, while the "coding region sliding replication and recombination" mechanism offers a new account of how fish genomes diversified so rapidly during the Cretaceous period. Furthermore, the study proposes a novel perspective on the "Panspermia" theory, suggesting that DNA preserved within space-traveling fossils could potentially seed life on other planets.
DNA Fragments in Crude Oil Reveals Earth's Hidden History by Wan-Qian Zhao, Zhan-Yong Guo, Yu-Qi Guo, Mei-Jun Li, Gang-Qiang Cao, Zeng-Yuan Tian, Ran Chai, Li-You Qiu, Jin-Hua Zeng, Xin-Ge Zhang, Tian-Cang Qin, Jin-Yu Yang, Ming-Jie Chen, Mei-Rong Song, Fei Liang, Jun-Hui Geng, Chun-Yan Zhou, Shu-Jie Zhang, Li-Juan Zhao https://arxiv.org/abs/2412.06550
This study presents a groundbreaking approach to extracting and analyzing DNA from crude oil samples from the Nanyang Oilfield in central China, opening exciting possibilities for paleontology and petroleum geology. The research team employed a novel "nanoparticle affinity bead DNA extraction technology" to isolate DNA from the oil, a significant challenge given DNA's water solubility and oil's hydrophobic nature. After constructing and sequencing DNA libraries, they generated a dataset of 3,159,020 petroleum DNA (pDNA) sequences. As in the Lycoptera study, they applied the "mega screen method," aligning the pDNA sequences against the NCBI database using the minimum E-value mode followed by the more specific MS mode to identify the unique lineage origin of each sequence. Geochemical analysis of the oil samples was also conducted to assess maturity, alteration, depositional environment, and organic matter input.
The analysis revealed that the pDNA sequences were primarily environmental DNA (eDNA), with the majority (51.29%) aligning with bacterial genomes and a significant portion (33.16%) matching the human genome. Smaller percentages matched fungal, avian, and other vertebrate genomes. Surprisingly, very few sequences aligned with algal genomes (only 5), despite geochemical marker analysis (including n-alkanes, isoprenoids, β-carotene, and steranes) indicating that the oil originated primarily from algae and lower aquatic plants. This scarcity of algal DNA suggests that the original in situ DNA (oriDNA) from the oil-forming organisms was largely lost over geological time. However, the presence of ancient DNA (aDNA), including some potentially from ancient human lineages and marine organisms, offers a unique lens into past ecosystems and evolutionary history.
Among the human-aligned sequences, some exhibited variations from the modern human mitogenome, suggesting they may originate from ancient human lineages, potentially Homo erectus, whose fossils have been found near the oilfield. The presence of DNA from marine organisms, like ray-finned fish species that emerged during the mid-to-late Cretaceous period, supports previous findings of marine incursions in the region. Intriguingly, the study also identified DNA from non-native species, such as yaks and turkeys, which the authors hypothesize could be molecular remnants of ancient Pangaea ancestors that migrated due to continental drift. The study proposes a model for the preservation of biological indicators in oil, where oriDNA is substantially lost over time while eDNA from various sources gradually accumulates. This model suggests that oil, predominantly sourced from algae and lower aquatic plants, acts as a new type of fossil, preserving fragments of DNA that offer a glimpse into Earth's hidden history. The research highlights the potential of pDNA analysis to revolutionize our understanding of petroleum geology, paleontology, and even human evolution, emphasizing the need for a global pDNA database to facilitate further research in this emerging field.
A Misclassification Network-Based Method for Comparative Genomic Analysis by Wan He, Tina Eliassi-Rad, Samuel V. Scarpino https://arxiv.org/abs/2412.07051
Caption: This world map displays the communities identified through Genome Misclassification Network Analysis (GMNA) applied to SARS-CoV-2 genomes. Each colored node represents a group of countries whose viral genomes are frequently misclassified as each other, reflecting shared genomic features and the influence of human mobility. The connecting lines indicate the degree of "indistinguishability" between these communities, with thicker lines representing higher probabilities of misclassification.
This paper presents Genome Misclassification Network Analysis (GMNA), a novel alignment-free framework for comparative genomics that leverages information from misclassified instances in AI models. Rather than focusing solely on classification accuracy, GMNA uses the relationships among misclassified data to generate insights into the underlying processes driving these errors. The framework constructs a weighted misclassification network in which nodes represent ensembles of genome sequences grouped by metadata classes (e.g., geographic region) and edges connect ensembles based on the likelihood of misclassification between them. The edge weights, the "indistinguishability" scores, are computed from the symmetrized empirical misclassification probabilities: indistinguishability(Xrᵢ, Xrⱼ) = 1 − (Pf(Xrᵢ, rᵢ, rⱼ) + Pf(Xrⱼ, rⱼ, rᵢ)) / 2, where Pf(Xrᵢ, rᵢ, rⱼ) is the empirical probability of a sequence from region rᵢ being misclassified as coming from region rⱼ. This approach allows efficient comparison of large ensembles of genome sequences without the computational burden of traditional alignment-based methods.
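To make the construction concrete, here is a minimal sketch of the edge-weight computation, assuming per-sequence true and predicted region labels are already available from some classifier; the pandas column names are illustrative, not the paper's code.

```python
# Sketch of the GMNA misclassification-network construction described above.
# Assumes a DataFrame with one row per genome and columns 'true_region' and
# 'pred_region' (illustrative names) holding the classifier's labels.
import itertools
import networkx as nx
import pandas as pd

def misclassification_network(df: pd.DataFrame) -> nx.Graph:
    G = nx.Graph()
    regions = sorted(df["true_region"].unique())
    G.add_nodes_from(regions)
    for r_i, r_j in itertools.combinations(regions, 2):
        from_i = df[df["true_region"] == r_i]
        from_j = df[df["true_region"] == r_j]
        # Empirical misclassification probabilities Pf in each direction
        p_ij = (from_i["pred_region"] == r_j).mean()
        p_ji = (from_j["pred_region"] == r_i).mean()
        # Edge weight per the indistinguishability formula in the text
        G.add_edge(r_i, r_j, weight=1 - (p_ij + p_ji) / 2)
    return G
```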
The GMNA framework was evaluated using a dataset of over 500,000 SARS-CoV-2 genomes labeled by sampling location. Both Naive Bayes and Convolutional Neural Network (CNN) models were employed as classifiers within the framework, with supplementary experiments using transformer-based models like Enformer. To address the trade-off between prediction accuracy and generating sufficient misclassified data for analysis, a leave-one-class-out (LOCO) model was introduced. This model iteratively trains the classifier on a subset of the data excluding one class (the "centroid") and then uses the trained model to classify the excluded class, ensuring a substantial number of misclassifications for analysis.
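A compact sketch of the LOCO idea, using a Naive Bayes classifier over k-mer counts as a stand-in (the featurization is an assumption; the paper also uses CNNs and transformers):

```python
# Sketch of the leave-one-class-out (LOCO) scheme: train with one class held
# out, then classify that class, so every held-out sequence is forced into
# some other class and contributes a misclassification for GMNA to analyze.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def loco_predictions(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """X: k-mer count matrix (n_sequences x n_features); y: region labels."""
    preds = y.copy()
    for held_out in np.unique(y):
        mask = y == held_out
        clf = MultinomialNB().fit(X[~mask], y[~mask])
        preds[mask] = clf.predict(X[mask])  # necessarily some other region
    return preds
```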
Results from the SARS-CoV-2 analysis revealed strong spatial dependencies in the genome sequences. Community detection applied to the misclassification network uncovered clusters of geographically proximate regions, indicating that genomes from neighboring areas are more likely to be misclassified as each other because of shared genomic features. This observation was supported by comparison with a configuration model, which randomizes network connections while preserving the degree distribution. Furthermore, a significant correlation was found between the centrality of countries in a global flight network and the indistinguishability of their corresponding genome ensembles, suggesting that human mobility, particularly international travel, plays a significant role in shaping the genetic variation and evolution of SARS-CoV-2. Analysis of temporal trends showed that prediction accuracy increased initially but declined after the emergence of variants of concern (Alpha, Delta, Gamma), potentially due to increased mixing of viral lineages through travel.

The GMNA framework offers a computationally efficient and adaptable tool for large-scale comparative genomic analysis. By leveraging misclassifications, it provides a unique perspective on the relationships between genome sequences and associated metadata, offering insights into phylogenetic structure, evolutionary patterns, and the impact of factors like human mobility on genomic diversity. The framework's flexibility to incorporate various AI models makes it applicable to a wide range of genomic datasets and research questions, extending its potential beyond this study to other areas such as healthcare and medical imaging.
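The community-detection and flight-centrality analyses above are standard network-science operations. A hedged sketch, assuming the flight network and its weights are available as an input graph (they are not artifacts released with the paper):

```python
# Illustrative follow-on analysis of a misclassification network G (e.g. from
# misclassification_network above): detect communities of mutually confusable
# regions and correlate each country's flight-network centrality with its
# mean indistinguishability. The flight graph is an assumed input.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from scipy.stats import spearmanr

def gmna_downstream(G: nx.Graph, flight_graph: nx.Graph):
    communities = greedy_modularity_communities(G, weight="weight")
    mean_indist = {n: np.mean([d["weight"] for *_, d in G.edges(n, data=True)])
                   for n in G.nodes}
    centrality = nx.eigenvector_centrality(flight_graph, weight="weight")
    shared = sorted(set(mean_indist) & set(centrality))
    rho, p_value = spearmanr([mean_indist[n] for n in shared],
                             [centrality[n] for n in shared])
    return communities, rho, p_value
```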
DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA by Aman Patel, Arpita Singhal, Austin Wang, Anusri Pampari, Maya Kasowski, Anshul Kundaje https://arxiv.org/abs/2412.05430
Caption: Precision-Recall curves for variant effect prediction on African caQTL and Yoruban dsQTL datasets, demonstrating the superior performance of the baseline model (ChromBPNet) compared to various DNALMs.
Genomic DNA language models (DNALMs), inspired by the success of LLMs in other domains, hold the promise of learning generalizable representations of DNA sequences. However, their effectiveness on regulatory DNA, a crucial non-coding component of the genome, has remained largely untested. DART-Eval (DNA RegulaTory Evaluations), a new benchmark, addresses this gap by rigorously evaluating DNALMs on tasks specific to regulatory DNA and comparing them to simpler, established models. The benchmark focuses on biologically relevant tasks like motif discovery, cell-type-specific regulatory activity prediction, and variant effect prediction, assessing performance in zero-shot, probed, and fine-tuned settings.
DART-Eval evaluates several prominent DNALMs, including Caduceus, DNABERT-2, GENA-LM, HyenaDNA, Mistral-DNA, and Nucleotide Transformer, against ab initio models designed for specific tasks. The benchmark uses carefully curated datasets of regulatory elements, transcription factor binding motifs, and regulatory genetic variants, controlling for confounders like GC content and linkage disequilibrium that have affected previous evaluations. For example, in the regulatory element identification task, DNALMs are challenged to distinguish true regulatory sequences from compositionally matched controls. In the motif sensitivity task, they must identify 1,443 specific regulatory sequence features. The benchmark also includes tasks predicting quantitative measures of regulatory activity from sequence and the counterfactual effects of regulatory genetic variants.
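To give a feel for the zero-shot setting, here is one common likelihood-based recipe for scoring a regulatory variant with an autoregressive DNA language model. The checkpoint name is a placeholder, and DART-Eval's actual per-model scoring protocols are more involved than this sketch.

```python
# Hedged sketch of zero-shot variant-effect scoring: compare the model
# log-likelihoods of the reference and alternate sequences. The checkpoint
# name is a placeholder, not one of the specific benchmarked releases.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("example/dna-causal-lm")  # hypothetical
model = AutoModelForCausalLM.from_pretrained("example/dna-causal-lm").eval()

@torch.no_grad()
def sequence_log_prob(seq: str) -> float:
    ids = tok(seq, return_tensors="pt")["input_ids"]
    logits = model(ids).logits[:, :-1]        # token t predicts token t+1
    targets = ids[:, 1:]
    logp = logits.log_softmax(-1).gather(-1, targets.unsqueeze(-1))
    return logp.sum().item()

def variant_effect_score(ref_seq: str, alt_seq: str) -> float:
    """Positive values mean the model prefers the alternate allele."""
    return sequence_log_prob(alt_seq) - sequence_log_prob(ref_seq)
```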
The results paint a somewhat disappointing picture for current DNALMs. While they perform reasonably well on simpler tasks, such as distinguishing regulatory elements from background sequences in a zero-shot setting, their performance deteriorates on more complex tasks. Surprisingly, embedding-free methods generally outperform embedding-based methods. Furthermore, simpler ab initio supervised models often match or exceed the performance of much larger, fine-tuned DNALMs, especially in counterfactual prediction tasks, where DNALMs struggle significantly. In the variant effect prediction task, for instance, the ab initio baseline model, ChromBPNet, achieved substantially higher AUROC scores (0.77 and 0.89 on the two datasets) than the best-performing fine-tuned DNALMs (0.62 and 0.67, respectively).

These findings raise important questions about the current state of DNALMs and suggest avenues for future development. The authors highlight the limitations of current DNALM architectures and training strategies, particularly the challenges posed by the sparsity and uneven distribution of regulatory features in the genome. They propose potential solutions, such as balanced sampling of training examples, incorporating regulatory annotations, and training on functionally related subsets of the genome. The inconsistent performance of DNALMs and their lack of a clear advantage over simpler models, despite significantly higher computational costs, underscore the need for more focused research on model architecture, training data, and evaluation strategies. DART-Eval provides a valuable resource for driving this research and accelerating the development of truly effective regulatory DNA models.
ProtGO: A Transformer based Fusion Model for accurately predicting Gene Ontology (GO) Terms from full scale Protein Sequences by Azwad Tamir, Jiann-Shiun Yuan https://arxiv.org/abs/2412.05776
Caption: The ProtGO model processes a protein sequence by tokenizing it and feeding it into three specialized transformer blocks (Prot_CC, Prot_MF, and Prot_BP), each fine-tuned for a specific GO term aspect. The outputs of these blocks are then combined to generate a final, comprehensive set of GO term predictions for the input protein sequence.
The rapid growth of protein sequence data necessitates efficient and accurate annotation methods. Manual annotation is slow and resource-intensive, driving the development of automated systems. Existing methods, while promising, often struggle with long sequences, complex structural dependencies, and high computational costs. This paper introduces ProtGO, a novel transformer-based fusion model designed to address these limitations and accurately predict Gene Ontology (GO) terms from full-scale protein sequences.
ProtGO leverages transfer learning and a unique fusion architecture. The model employs three specialized transformer blocks (Prot_CC, Prot_MF, and Prot_BP), each pre-trained on the massive BFD100 dataset and fine-tuned on a specific GO aspect (Cellular Component, Molecular Function, and Biological Process, respectively). This selective fine-tuning allows efficient training while maintaining high accuracy. The protein sequence is tokenized and fed into each block in parallel, and the outputs of the three blocks are combined to generate the final GO term predictions. This fusion approach lets the model capture both short- and long-range dependencies within the protein sequence, leading to improved accuracy. Training uses a negative log-likelihood loss for pre-training and, for fine-tuning, the cross-entropy loss Loss = −(1/NM) Σᵢ₌₁ᴺ Σⱼ₌₁ᴹ yᵢⱼ log(ŷᵢⱼ).
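A minimal PyTorch sketch of this three-branch fusion pattern follows; the hidden size, mean pooling, concatenation strategy, and label count are illustrative assumptions, not ProtGO's published configuration.

```python
# Minimal sketch of the fusion described above: aspect-specific encoder
# blocks run in parallel on the same tokenized sequence, and their pooled
# outputs are fused into a single set of GO-term predictions.
import torch
import torch.nn as nn

class GOFusion(nn.Module):
    def __init__(self, encoder_cc: nn.Module, encoder_mf: nn.Module,
                 encoder_bp: nn.Module, hidden_dim: int = 1024,
                 n_go_terms: int = 5000):
        super().__init__()
        self.encoders = nn.ModuleList([encoder_cc, encoder_mf, encoder_bp])
        self.head = nn.Linear(3 * hidden_dim, n_go_terms)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Each fine-tuned block embeds the full sequence independently
        # (batch, length, hidden); mean-pool over residues, then concatenate.
        pooled = [enc(token_ids).mean(dim=1) for enc in self.encoders]
        return self.head(torch.cat(pooled, dim=-1))  # per-term logits
```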
The performance of ProtGO was evaluated on two datasets: a random split and a more challenging clustered split, where sequences in the training and testing sets originate from distinct distributions. On the random split, ProtGO achieved accuracies of 86.06%, 94.60%, and 78.30% for Biological Process, Molecular Function, and Cellular Component, respectively, significantly outperforming benchmark models like Proteinfer and Proteinfer_EN. The F1 scores also showed marked improvement. Importantly, ProtGO maintained its superior performance on the clustered split, demonstrating its robustness and ability to generalize to diverse protein structures. Accuracies on this split were 82.16%, 91.51%, and 73.28% for the three respective GO aspects. This resilience to distribution shifts highlights ProtGO's deeper understanding of protein structure and motifs associated with GO terms.
ProtGO offers several advantages over existing methods. Its single model architecture, as opposed to ensemble methods, enhances efficiency and usability. The selective fine-tuning strategy reduces computational requirements compared to full fine-tuning of large transformer models. Furthermore, ProtGO exhibits minimal dependence on input sequence length, making it suitable for diverse applications. The sequence length analysis showed stable accuracy up to 1000 tokens, with the potential for further improvement by adjusting the truncation threshold. The results demonstrate the potential of transformer-based fusion models for accurate and efficient GO term prediction, paving the way for enhanced protein annotation and downstream bioinformatics applications. Future work could focus on gathering more GO term annotation datasets and integrating multiple protein annotation models into a unified framework for further performance gains.
This newsletter highlights significant advancements in genomics and proteomics research, showcasing the growing influence of AI, network science, and language models in these fields. From leveraging misclassifications in AI models to analyze viral spread (GMNA) to extracting ancient DNA from unconventional sources like petroleum and 120-million-year-old fossils, these studies push the boundaries of what's possible. While the extraction of aDNA from Lycoptera fossils and petroleum offers exciting glimpses into ancient ecosystems and evolutionary history, it also underscores the methodological challenges of distinguishing oriDNA from eDNA contamination and the need for rigorous validation. The development of ProtGO, a transformer-based fusion model for predicting GO terms from protein sequences, demonstrates the potential of AI for automating and enhancing protein annotation. However, the evaluation of DNALMs using DART-Eval reveals that these complex models don't always outperform simpler ab initio methods, highlighting the need for careful benchmark development and evaluation in this rapidly evolving field. Collectively, these studies demonstrate both the immense potential and the inherent complexities of applying cutting-edge computational techniques to unlock the secrets of the genome and proteome. They offer valuable insights and pave the way for future research, emphasizing the importance of rigorous methodology, careful evaluation, and a nuanced understanding of the biological context.