Several recent publications explore novel computational approaches for analyzing genomic data, focusing on single-cell RNA sequencing (scRNA-seq) and genome-wide association studies (GWAS). Ma and Chen (2024) introduce Ut-SNE, an uncertainty-aware t-SNE variant designed to address noise in scRNA-seq visualizations. By incorporating a probabilistic representation for each sample, Ut-SNE aims to provide a more accurate depiction of transcriptomic variability and prevent misinterpretations of cell subsets. Pierotti, Fitzgerald, and Birney (2024) present FlexLMM, a Nextflow pipeline for GWAS using linear mixed models. The framework offers flexibility in model specification and incorporates a two-step permutation procedure to establish significance thresholds while accounting for population structure, a crucial consideration in GWAS analysis (a generic sketch of the idea follows below).
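To make the permutation idea concrete, here is a minimal numpy sketch of the generic two-step recipe for mixed-model GWAS thresholds: decorrelate ("whiten") the phenotype using the covariance fitted under the null LMM, then permute the whitened values and record the maximum association statistic across variants in each permutation. Everything here, including the r² test statistic, is an illustrative assumption rather than FlexLMM's exact procedure.

```python
import numpy as np

def permutation_threshold(G, y, V, n_perm=1000, alpha=0.05, seed=0):
    """G: (n_samples, n_snps) genotypes; y: phenotype vector;
    V: covariance fitted under the null LMM (relatedness + noise)."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(np.linalg.inv(V))  # whitening: cov(L.T @ y) = I
    Gw, yw = L.T @ G, L.T @ y                 # decorrelated genotypes/phenotype
    Gc = Gw - Gw.mean(axis=0)

    def max_assoc(yv):
        yc = yv - yv.mean()
        r = (Gc.T @ yc) / np.sqrt((Gc**2).sum(axis=0) * (yc**2).sum())
        return np.max(r**2)                   # strongest per-SNP association

    null_max = np.array([max_assoc(rng.permutation(yw)) for _ in range(n_perm)])
    return np.quantile(null_max, 1 - alpha)   # genome-wide r^2 threshold

# Toy usage with identity covariance (no relatedness structure).
rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(200, 500)).astype(float)
y = rng.normal(size=200)
print(permutation_threshold(G, y, np.eye(200), n_perm=200))
```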
Deep learning and large language models (LLMs) are also gaining traction in genomic analysis. Gustafsson and Rantalainen (2024) evaluate deep regression models for predicting gene expression directly from whole-slide images (WSIs), offering recommendations for training these models in high-dimensional genomic data settings. Honig et al. (2024) introduce GTA, a method using token alignment between genetic sequences and natural language tokens to leverage pre-trained LLMs for improved gene expression prediction. This approach incorporates symbolic reasoning and in-context learning to capture long-range regulatory grammar, outperforming existing models. Lee et al. (2024) propose FREEFORM, a knowledge-driven framework using LLMs for feature selection and engineering in genotype data. This approach leverages LLMs' intrinsic knowledge to improve phenotype prediction, particularly in low-shot regimes.
Standardized benchmarking is also addressed by Yang, Cole, and Li (2024) with OmniGenBench. This framework automates large-scale in-silico benchmarking across diverse genomic tasks, facilitating the development and evaluation of genomic foundation models (GFMs). Saggi et al. (2024) explore integrating multi-omic data and quantum machine learning for lung subtype classification, highlighting quantum computing's potential in high-dimensional biological datasets.
Finally, explainable AI and feature reduction are explored. Elborough, Taylor, and Humphries (2024) present an efficient method for computing Shapley values for large, multidimensional time-series data, borrowing superpixels from image processing to tame the computational cost in high-dimensional settings (see the sketch after this paragraph). Mu et al. (2024) introduce TemporalPaD, a framework that combines reinforcement learning with neural networks for temporal feature representation and dimension reduction, enabling efficient feature extraction in temporal datasets.
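Because each Shapley "player" multiplies the cost, the superpixel trick groups adjacent time points into segments and attributes importance at the segment level. Below is a hedged sketch of that grouping idea using plain Monte Carlo permutation sampling; the segmentation and estimator are assumptions, not the authors' algorithm.

```python
import numpy as np

def segment_shapley(model, x, baseline, n_segments=10, n_samples=200, seed=0):
    """Estimate Shapley values per segment ("superpixel") of a 1-D series."""
    rng = np.random.default_rng(seed)
    segments = np.array_split(np.arange(len(x)), n_segments)  # the "superpixels"
    phi = np.zeros(n_segments)
    for _ in range(n_samples):
        order = rng.permutation(n_segments)
        z, prev = baseline.copy(), model(baseline)
        for seg_idx in order:
            z[segments[seg_idx]] = x[segments[seg_idx]]  # reveal one segment
            cur = model(z)
            phi[seg_idx] += cur - prev                   # marginal contribution
            prev = cur
    return phi / n_samples

# Toy usage: a model that scores the mean of the last quarter of the series.
model = lambda v: float(v[-25:].mean())
x = np.random.default_rng(1).normal(size=100)
print(segment_shapley(model, x, baseline=np.zeros(100)))
```

Grouping 100 time steps into 10 segments cuts the number of players tenfold, which is where the computational savings come from.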
OmniGenBench: Automating Large-scale in-silico Benchmarking for Genomic Foundation Models by Heng Yang, Jack Cole, Ke Li https://arxiv.org/abs/2410.01784
Caption: OmniGenBench is an open-source framework for automated, large-scale benchmarking of Genomic Foundation Models (GFMs) across diverse genomic modalities and downstream tasks, including RNA/DNA design, structure prediction, and translation efficiency prediction. The framework provides standardized benchmark suites, open GFMs, custom metrics, and a comprehensive software toolkit with application interfaces for model wrappers, data processors, and an online leaderboard to facilitate community collaboration and accelerate GFM development. The image visually outlines the key components of OmniGenBench, illustrating its benchmarking capabilities, software toolkit, and online resources.
Genomic Foundation Models (GFMs) hold immense potential, but their development and adoption are hindered by a lack of standardized benchmarks and accessible software. OmniGenBench addresses this by providing an open-source framework for large-scale, automated benchmarking of GFMs. It tackles the unique challenges of genomic data, including data scarcity, metric inconsistency, reproducibility issues, and the need for adaptive benchmarking across diverse genomic modalities. Integrating four large-scale benchmarks encompassing 42 million genomic sequences from 75 datasets, and supporting over 10 open-source GFMs, OmniGenBench democratizes access to GFM evaluation for researchers.
The AutoBench pipeline within OmniGenBench standardizes benchmark suites, ensuring consistent evaluation and minimizing biases. It addresses data scarcity and bias by integrating diverse datasets and performing filtering for tasks like structure prediction to prevent data leakage. Metric reliability is enhanced through common metrics and automated performance recording. Reproducibility is prioritized by adhering to FAIR principles and providing detailed metadata and benchmark settings. OmniGenBench supports adaptive benchmarking, allowing evaluation across diverse genomes and species, revealing cross-genomic insights and potential novel applications.
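As one concrete illustration of the leakage-filtering step, the sketch below drops benchmark sequences that share too many k-mers with any training sequence. The k-mer Jaccard heuristic and thresholds are assumptions for illustration, not OmniGenBench's actual filtering logic.

```python
def kmer_set(seq, k=8):
    """All k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def filter_leaky(benchmark_seqs, train_seqs, k=8, max_jaccard=0.5):
    """Keep only benchmark sequences whose k-mer Jaccard similarity to every
    training sequence stays at or below max_jaccard."""
    train_kmers = [kmer_set(s, k) for s in train_seqs]
    kept = []
    for seq in benchmark_seqs:
        kms = kmer_set(seq, k)
        sim = max((len(kms & t) / len(kms | t) for t in train_kmers), default=0.0)
        if sim <= max_jaccard:
            kept.append(seq)
    return kept
```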
Beyond benchmarking, OmniGenBench offers a comprehensive genomic toolkit, featuring genome embedding extraction, data augmentation, and common genomic tasks like RNA design. User-friendly interfaces and tutorials simplify GFM implementation and fine-tuning. An online hub and leaderboard promote community collaboration and transparency, showcasing detailed task-wise performance for DNA and RNA downstream tasks. This centralized platform facilitates efficient testing and experimentation, accelerating GFM development.
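For a flavor of embedding extraction, here is a minimal sketch built on the Hugging Face transformers API with mean pooling over non-padding tokens; the checkpoint name is a placeholder and none of this reflects OmniGenBench's own interfaces.

```python
import torch
from transformers import AutoTokenizer, AutoModel

def embed_sequences(seqs, checkpoint="genomic-model-placeholder"):
    """Mean-pooled embeddings for a batch of DNA/RNA sequences."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint).eval()
    with torch.no_grad():
        batch = tokenizer(seqs, return_tensors="pt", padding=True, truncation=True)
        hidden = model(**batch).last_hidden_state            # (B, L, d)
        mask = batch["attention_mask"].unsqueeze(-1)         # zero out padding
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (B, d)
```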
Comprehensive benchmark results reported with the framework highlight the strong performance of the OmniGenome model, particularly in RNA structural modeling. On the RGB benchmark, OmniGenome achieved an RMSE of 0.7121 for mRNA degradation rate prediction and an AUC of 64.13 for SNMD. On the PGB, it achieved an F1 score of 87.55 for PolyA and 98.41 for Splice Site prediction. While specialized models performed competitively on specific tasks, OmniGenome consistently delivered strong results across diverse benchmarks, highlighting its versatility and the importance of structural modeling. Despite these strengths, limitations such as the lack of in-vivo validation data and model scale constraints remain to be addressed in future research.
Long-range gene expression prediction with token alignment of large language model by Edouardo Honig, Huixin Zhan, Ying Nian Wu, Zijun Frank Zhang https://arxiv.org/abs/2410.01858
Caption: The Genetic sequence Token Alignment (GTA) model predicts gene expression by aligning genetic sequence features, extracted by Sei, with the token embeddings of a frozen pre-trained large language model (LLM). This alignment, facilitated by a learnable set of text prototypes and a cross-attention mechanism, allows GTA to model long-range interactions within the genome and incorporate gene annotations as prompts for enhanced predictive power. The output head then generates predictions of gene expression levels.
Predicting gene expression from DNA is crucial but challenging due to complex gene regulation and long-range interactions within the genome. Existing models struggle to capture these distal regulatory elements due to limited input sequence length. The GTA model addresses this by leveraging pretrained LLMs, extending the modeled sequence context to 1 million base pairs, a five-fold increase. GTA aligns genetic sequence features, extracted using a pretrained genomic sequence model (Sei), with the token embeddings of a frozen LLM. This alignment uses learnable text prototypes, which are linear combinations of the LLM's token embeddings, and a cross-attention mechanism, calculated as A<sub>i</sub> = Attention(Q<sub>i</sub>, K<sub>i</sub>, V<sub>i</sub>) = softmax(Q<sub>i</sub>K<sub>i</sub><sup>T</sup> / √d<sub>k</sub>)V<sub>i</sub>.
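A minimal PyTorch sketch of this alignment step: genomic features (Sei emits 21,907 chromatin-profile predictions per input window) form the queries, while the keys and values come from a small set of learnable prototypes built as mixtures of a frozen LLM's token embeddings. Dimensions, the softmax over mixing weights, and all names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class TokenAlignment(nn.Module):
    def __init__(self, feat_dim, llm_embed, num_prototypes=64):
        super().__init__()
        vocab_size, d_model = llm_embed.shape
        self.register_buffer("llm_embed", llm_embed)  # frozen LLM embedding table
        # Each prototype is a learnable mixture over the LLM's token embeddings.
        self.proto_weights = nn.Parameter(0.01 * torch.randn(num_prototypes, vocab_size))
        self.q_proj = nn.Linear(feat_dim, d_model)  # queries from genomic features
        self.k_proj = nn.Linear(d_model, d_model)   # keys from text prototypes
        self.v_proj = nn.Linear(d_model, d_model)   # values from text prototypes
        self.scale = d_model ** -0.5

    def forward(self, seq_feats):  # seq_feats: (batch, seq_len, feat_dim)
        protos = self.proto_weights.softmax(dim=-1) @ self.llm_embed  # (P, d_model)
        q = self.q_proj(seq_feats)                                    # (B, L, d)
        k, v = self.k_proj(protos), self.v_proj(protos)               # (P, d)
        attn = (q @ k.T * self.scale).softmax(dim=-1)  # softmax(QK^T / sqrt(d_k))
        return attn @ v                                # aligned features, (B, L, d)

# Usage: align Sei-style features to a stand-in frozen embedding table.
llm_embed = torch.randn(50257, 768)              # placeholder, not a real LLM
align = TokenAlignment(feat_dim=21907, llm_embed=llm_embed)
aligned = align(torch.randn(2, 16, 21907))       # -> (2, 16, 768)
```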
Trained on lymphoblastoid cell data and evaluated on the Geuvadis consortium data, GTA significantly outperformed existing methods, achieving a Spearman correlation of 0.65, a 10% improvement over the best prior approach. This highlights the importance of incorporating long-range interactions. GTA also offers improved interpretability, identifying influential sections of the input genetic context through its attention mechanism, and learns biologically meaningful attention patterns that focus on regulatory features like enhancers and transcription factors.
GTA incorporates gene annotations from NCBI Gene as prompts, enabling in-context learning and enhancing prediction. This leverages biological knowledge encoded in these annotations. While GTA relies on pre-extracted features and uses a causal attention mechanism, potentially limiting bidirectional interaction capture, it paves the way for exploring alternative feature extractors and attention mechanisms. Overall, GTA represents a significant advance in gene expression prediction, offering improved accuracy, interpretability, and the potential to integrate diverse biological knowledge.
Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models by Joseph Lee, Shu Yang, Jae Young Baik, Xiaoxi Liu, Zhen Tan, Dawei Li, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Li Shen https://arxiv.org/abs/2410.01795
Caption: The FREEFORM framework uses LLMs for feature selection (A & B) and feature engineering to improve phenotype prediction from genotype data. Selected features are transformed into engineered features, which are then bagged and used to train an ensemble of classifiers. The final prediction is an ensemble of the classifier predictions.
Predicting phenotypes from genotypes is challenging due to high-dimensional data, interpretability issues, and limited samples. FREEFORM, a knowledge-driven framework, leverages LLMs for feature selection and engineering in genotype data. Motivated by the domain knowledge encoded in LLMs and their ability to handle complex biomedical concepts, FREEFORM enhances prediction accuracy while maintaining interpretability, especially in low-data regimes.
FREEFORM operates in two stages. First, it employs LLM-driven feature selection using Self-Consistent Hierarchical Selection and Self-Consistent Sequential Forward Selection. These leverage the LLM's knowledge to identify informative variants without relying on training data, addressing limitations of data-driven methods in few-shot settings. Second, FREEFORM uses LLMs for feature engineering, generating interpretable interaction terms. It employs "free-flow reasoning," allowing the LLM to generate unstructured output for enhanced creativity, then uses the LLM to parse this output and create executable Python code for feature generation. Ensembling with bagging and order shuffling improves robustness and mitigates overfitting. The final prediction averages class probabilities across the K ensemble members: p(x) = (1/K) Σ<sub>k=1</sub><sup>K</sup> p(f<sub>k</sub>(x)), with ŷ = arg max p(x) (a sketch of this step follows below).
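The ensembling step is simple enough to sketch directly: K classifiers are trained on bootstrap resamples of the engineered features and their class probabilities are averaged, implementing the equation above. The classifier choice and the stubbed-out feature engineering are assumptions for illustration, not FREEFORM's full pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def fit_bagged_ensemble(X, y, K=10, seed=0):
    """Train K classifiers on bootstrap resamples (bagging)."""
    rng = np.random.RandomState(seed)
    return [
        LogisticRegression(max_iter=1000).fit(*resample(X, y, random_state=rng.randint(0, 10**9)))
        for _ in range(K)
    ]

def predict_ensemble(models, X):
    """p(x) = (1/K) * sum_k p(f_k(x)); y_hat = argmax p(x)."""
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return probs.argmax(axis=1), probs

# Toy usage: random stand-ins for LLM-engineered genotype features.
X = np.random.RandomState(0).randn(40, 8)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
y_hat, probs = predict_ensemble(fit_bagged_ensemble(X, y), X)
```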
Evaluated on genomic ancestry and hereditary hearing loss datasets, FREEFORM outperformed data-driven baselines (LASSO, PCA, RF-based Gini Importance) in few-shot scenarios. In the ancestry task, for example, FREEFORM reached with just 10 shots the performance LASSO needed 80 shots to match. In feature engineering, FREEFORM consistently ranked at the top or matched baseline performance (Logistic Regression, Random Forest, XGBoost, TabPFN, FeatLLM), enhancing Logistic Regression and Random Forest in low-shot regimes.
FREEFORM's knowledge-driven approach excels in few-shot regimes, while its focus on interpretable interaction terms maintains transparency. The framework is model-agnostic and applicable to various downstream classifiers. Future work will focus on generating more complex features, enhancing knowledge retrieval, and integrating interpretability mechanisms for tasks like causal discovery.
This newsletter highlights the rapid advancements in computational genomics, particularly the innovative applications of LLMs and the growing importance of robust benchmarking. OmniGenBench provides a crucial framework for evaluating GFMs, paving the way for standardized comparisons and accelerating model development. GTA demonstrates the power of LLMs in capturing long-range interactions for enhanced gene expression prediction, while FREEFORM showcases their utility in feature selection and engineering, especially in low-data scenarios. These contributions collectively underscore the transformative potential of AI in unlocking the complexities of the genome and advancing our understanding of human health and disease.