Subject: Genomic Analysis: A Leap Forward with Foundation Models and Causal Reasoning
Hi Elman,
Several recent publications explore the application of foundation models and machine learning to genomic analysis, addressing challenges in multi-task learning, cross-modal integration, and experimental validation. Li et al. (2025) introduce Omni-DNA, a family of cross-modal, multi-task genomic foundation models (GFMs) trained using a two-stage approach: pretraining on DNA sequences with a next-token prediction objective, followed by multi-task finetuning. This approach achieves state-of-the-art performance on multiple benchmarks and demonstrates cross-modal capabilities by mapping DNA sequences to both textual functional descriptions and images (Li et al., 2025). Similarly, Wu et al. (2025) present GENERator, a long-context generative GFM trained on a massive eukaryotic DNA dataset. GENERator exhibits strong performance in sequence generation, including the creation of promoter sequences with specific activity profiles, and adheres to the central dogma by generating protein-coding sequences that translate into structurally plausible proteins (Wu et al., 2025). Liu et al. (2025) also leverage the central dogma with their Life-Code model, which uses a unified pipeline to integrate multi-omics data by reverse-transcribing RNA and reverse-translating amino acids into nucleotide sequences. This approach, combined with a codon tokenizer and hybrid long-sequence architecture, allows Life-Code to capture complex interactions between coding and non-coding regions and achieve state-of-the-art performance on various multi-omics tasks (Liu et al., 2025). These studies highlight the growing potential of foundation models for diverse genomic applications.
Beyond foundation models, several papers focus on specific challenges in genomic analysis. Higgins et al. (2025) introduce stIHC, a novel iterative hierarchical clustering method for identifying spatial gene co-expression modules from spatial transcriptomics data. stIHC outperforms existing methods in detecting rare spatial expression patterns and provides insights into the functional organization of complex tissues (Higgins et al., 2025). Hutton and Meyer (2025) provide a comprehensive overview of trajectory inference methods for single-cell omics data, discussing their strengths, weaknesses, best practices, and applications in studying cell differentiation, development, and disease (Hutton & Meyer, 2025). Huang et al. (2025) develop scGSL, a graph neural network model that enhances cell type prediction and cell interaction analysis within the tumor microenvironment by leveraging non-spatial scRNA-seq data (Huang et al., 2025). These contributions demonstrate the ongoing development of specialized computational tools for analyzing complex genomic data.
The importance of bridging computational predictions with experimental validation is emphasized by Wang et al. (2025). Their review explores various methods for validating bioinformatics findings, including gene expression analysis, protein-protein interaction verification, and pathway validation, highlighting the crucial role of collaboration between bioinformatics and experimental research (Wang et al., 2025). This theme of connecting in silico predictions to in vitro validation is also relevant to the challenges discussed by James et al. (2025), who examine the limitations of machine learning for whole-genome phenotype prediction in bacteria. They highlight the difficulty of extracting meaningful causal relationships from predictive models because of high dimensionality and spurious associations, emphasizing the need for approaches that go beyond simple pattern recognition (James et al., 2025).
Addressing the limitations of existing methods, Sadia et al. (2025) introduce CausalGeD, a model that combines diffusion and autoregressive processes to generate spatial gene expression patterns by leveraging causal relationships between genes. By incorporating causal attention, CausalGeD outperforms existing methods in integrating scRNA-seq and spatial transcriptomics data (Sadia et al., 2025). This work underscores the importance of considering causal relationships in genomic analysis, a theme echoed by Liang (2025), who demonstrates the ability of LLMs to rediscover the central dogma through language transfer capabilities. By training a GPT-like model on protein and DNA sequences, Liang shows that LLMs can learn the underlying principles of the genetic code without prior knowledge, opening new avenues for AI-driven biological research (Liang, 2025).
Omni-DNA: A Unified Genomic Foundation Model for Cross-Modal and Multi-Task Learning by Zehui Li, Vallijah Subasri, Yifei Shen, Dongsheng Li, Yiren Zhao, Guy-Bart Stan, Caihua Shan https://arxiv.org/abs/2502.03499
Caption: The architecture of Omni-DNA, a genomic foundation model, is shown. It undergoes pretraining with next-token prediction on DNA sequences and then multi-task fine-tuning with an expanded vocabulary that includes task-specific and cross-modal tokens, enabling it to perform diverse tasks like classification, function prediction, and DNA-to-image mapping. The diagram illustrates the data flow from raw DNA sequences through the BPE tokenizer to the Omni-DNA transformer model during both pretraining and fine-tuning stages.
Genomic Foundation Models (GFMs) hold great promise for automating genome annotation, but current models often require separate fine-tuning for each task, creating significant overhead as models scale. They also struggle with diverse output formats, limiting their application. Omni-DNA addresses these limitations with a family of cross-modal, multi-task models, ranging from 20 million to 1 billion parameters. Inspired by Large Language Models (LLMs), Omni-DNA uses a two-stage approach. First, it's pretrained on DNA sequences using next-token prediction. Then, its vocabulary is expanded with multi-modal, task-specific tokens for simultaneous multi-task fine-tuning.
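The vocabulary-expansion step described above can be pictured with a toy sketch. Everything here is an assumption made for illustration: the token names, the embedding size, and the choice to initialise new rows near the mean of the existing embeddings are hypothetical and are not taken from the Omni-DNA paper.

```python
import numpy as np

def expand_vocabulary(vocab, embeddings, new_tokens, rng):
    """Append new tokens to a vocabulary and grow the embedding matrix to match.

    New rows start near the mean of the existing embeddings. This
    initialisation is an illustrative assumption, not the paper's recipe.
    """
    vocab = dict(vocab)  # copy so the caller's mapping is left untouched
    added = [tok for tok in new_tokens if tok not in vocab]
    for tok in added:
        vocab[tok] = len(vocab)
    mean = embeddings.mean(axis=0, keepdims=True)
    noise = 0.01 * rng.standard_normal((len(added), embeddings.shape[1]))
    return vocab, np.vstack([embeddings, mean + noise])

rng = np.random.default_rng(0)
base_vocab = {"A": 0, "C": 1, "G": 2, "T": 3}   # toy stand-in for a pretrained vocabulary
base_emb = rng.standard_normal((4, 8))          # toy pretrained embedding table
task_tokens = ["<task:H3K4me3>", "<label:0>", "<label:1>"]  # hypothetical task-specific tokens
vocab, emb = expand_vocabulary(base_vocab, base_emb, task_tokens, rng)
```

The pretrained rows are kept unchanged and only the new rows are fresh, which is one reason the expanded model sees a distribution shift at the start of fine-tuning.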
The pretraining stage compares configurations such as non-parametric LayerNorm versus RMSNorm and RoPE versus ALiBi positional embeddings, and uses a deduplicated dataset of 30 billion nucleotides. This rigorous approach leads to state-of-the-art performance on 18 of 26 tasks in the Nucleotide Transformer (NT) and Genomic Benchmarks (GB) suites, even outperforming bidirectional models in a traditional single-task setting. The cross-modal, multi-task fine-tuning dynamically expands the vocabulary and addresses the resulting distribution shift using Important Key Repetition and NEFTune. This allows a single Omni-DNA model to handle multiple tasks concurrently, such as 10 acetylation and methylation tasks, outperforming models trained individually for each task.
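For readers unfamiliar with NEFTune, the idea is to add scaled uniform noise to the token embeddings during fine-tuning. The sketch below follows the commonly cited formulation, where the noise is scaled by alpha / sqrt(L * d) for sequence length L and embedding dimension d. The value of `alpha` and the array shapes are assumptions chosen for illustration, and the way Omni-DNA combines this with Important Key Repetition is not shown.

```python
import numpy as np

def neftune_noise(embeddings, alpha, rng):
    """Return embeddings perturbed with NEFTune-style uniform noise.

    embeddings: array of shape (seq_len, dim) for a single sequence.
    alpha: noise magnitude hyperparameter (the value used below is illustrative).
    """
    seq_len, dim = embeddings.shape
    scale = alpha / np.sqrt(seq_len * dim)
    noise = rng.uniform(-1.0, 1.0, size=embeddings.shape)
    return embeddings + scale * noise

rng = np.random.default_rng(0)
token_embeddings = np.zeros((4, 16))  # toy embeddings for a 4-token sequence
noisy = neftune_noise(token_embeddings, alpha=5.0, rng=rng)
```

The noise is applied only at training time; at inference the embeddings are used unchanged.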
Performance evaluation covers both conventional genomic tasks and novel cross-modal tasks. On the Nucleotide Transformer benchmark, Omni-DNA (1B) and Omni-DNA (116M) achieve the best (0.767) and second-best (0.755) average performance across 18 tasks, respectively. On the Genomic Benchmarks, Omni-DNA (116M) achieves the highest average score, exceeding DNABERT-2 in seven out of eight tasks. Multi-task fine-tuning on 10 related acetylation and methylation tasks demonstrates a synergistic effect, leveraging the relationships between tasks to improve performance beyond single-task models. Two novel tasks, DNA2Func (mapping DNA to textual functional descriptions) and Needle-in-DNA (mapping DNA to images representing embedded motifs), further demonstrate Omni-DNA's cross-modal capabilities. On DNA2Func, Omni-DNA achieves a weighted F1 score of 0.730 and MCC of 0.701, surpassing GPT-4o and OLMo-1B. On Needle-in-DNA, it achieves a Macro F1 score of 0.987 and an invalid percentage of 1%, indicating strong generalization. These results showcase Omni-DNA's potential as a unified GFM, capable of handling diverse tasks and modalities, paving the way for more complex genomic applications.
GENERator: A Long-Context Generative Genomic Foundation Model by Wei Wu, Qiuyi Li, Mingyang Li, Kun Fu, Fuli Feng, Jieping Ye, Hui Xiong, Zheng Wang https://arxiv.org/abs/2502.07272
Caption: Figure A illustrates the distribution of genes and nucleotides within eukaryotic DNA, categorized by organism type. Figure B depicts the 6-mer tokenization and transformer decoder architecture used in GENERator's pre-training for next-token prediction. Figure C compares GENERator to other large language models in terms of pre-training data volume and context length, showcasing its advantage in handling long DNA sequences. Figure D outlines GENERator's applications, including sequence comprehension, central dogma adherence (DNA to protein), and promoter design for targeted activity.
GENERator is a new generative genomic foundation model designed to predict and interpret DNA sequences. It addresses limitations in existing genomic language models, which often lack robustness and have limited applications due to constraints in model structure and training data. GENERator boasts a massive context length of 98k base pairs (bp) and 1.2B parameters, trained on a vast dataset of 386B bp of eukaryotic DNA. This extensive training allows GENERator to achieve state-of-the-art performance on both established and newly proposed benchmarks, including Genomic Benchmarks, NT tasks, and new Gener tasks focusing on gene and taxonomic classification with longer sequences.
A key innovation in GENERator is the use of "gene sequence training," which focuses on semantically rich gene regions rather than the entire genome. This contrasts with traditional "whole sequence training" and proves surprisingly effective, likely due to the functional significance and relative sparsity of gene segments within the vast genomic landscape. A 6-mer tokenizer was found to be optimal for next-token prediction pre-training, outperforming Byte Pair Encoding, a common choice in natural language processing. This highlights the unique characteristics of DNA data and the need for specialized approaches. The pre-training utilizes a transformer decoder architecture, similar to Llama, and incorporates optimizations like Flash Attention and Zero Redundancy Optimizer for efficient handling of long sequences.
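To make the 6-mer tokenizer concrete, here is a minimal sketch. It assumes non-overlapping 6-mers over the alphabet A, C, G, T, with any trailing bases that do not fill a full 6-mer dropped. Those choices, and the function names, are illustrative assumptions rather than GENERator's actual implementation.

```python
from itertools import product

K = 6
# All 4**6 = 4096 possible 6-mers, indexed in lexicographic A < C < G < T order.
KMER_TO_ID = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def tokenize_kmers(seq, k=K):
    """Split a DNA string into non-overlapping k-mers, dropping any remainder."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % k
    return [seq[i:i + k] for i in range(0, usable, k)]

def encode(seq):
    """Map a DNA string to 6-mer token ids (raises KeyError on non-ACGT bases)."""
    return [KMER_TO_ID[kmer] for kmer in tokenize_kmers(seq)]

tokens = tokenize_kmers("ACGTACGTACGTAC")  # 14 bases: two full 6-mers, 2 bases dropped
ids = encode("ACGTACGTACGTAC")
```

Unlike Byte Pair Encoding, this vocabulary is fixed in advance and every token covers the same number of bases, so a context of N tokens always spans 6N base pairs.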
GENERator goes beyond benchmark performance by adhering to the central dogma of molecular biology. It generates protein-coding DNA sequences that translate into proteins structurally similar to known families. This was demonstrated by generating sequences for the Histone and Cytochrome P450 families, with the resulting proteins exhibiting structural similarity (TM-score > 0.8) to known structures despite low sequence identity (< 0.3). Furthermore, GENERator shows significant promise in sequence optimization, particularly in designing promoter sequences with targeted activity profiles. Evaluated on the DeepSTARR dataset, a GENERator-based activity predictor outperforms existing methods, and generated promoters exhibit distinct activity profiles. This research highlights the potential of generative models in genomics, positioning GENERator as a key tool for advancing both genomic research and biotechnological applications. Future work includes a prokaryotic-plus-viral version of GENERator and a specialized model called Generanno for gene annotation. All materials, including data, code, and model weights, will be open-sourced to promote transparency and collaboration.
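The central-dogma check rests on translating generated coding sequences into amino acids before any structural comparison. A minimal sketch of that translation step, using the standard genetic code, is shown below. The function name and the choice to stop at the first stop codon are illustrative assumptions; the structure prediction and TM-score evaluation reported by the authors are not reproduced here.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        residue = CODON_TABLE[dna[i:i + 3]]
        if residue == "*":  # stop codon ends translation
            break
        protein.append(residue)
    return "".join(protein)

example_protein = translate("ATGGCCTAA")  # ATG -> M, GCC -> A, TAA -> stop
```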
CausalGeD: Blending Causality and Diffusion for Spatial Gene Expression Generation by Rabeya Tus Sadia, Md Atik Ahamed, Qiang Cheng https://arxiv.org/abs/2502.07751
Caption: The image illustrates the architecture of CausalGeD, a novel diffusion-based model for spatial gene expression prediction. It depicts both the training phase, which uses an encoder, noise injection, and a Causal Attention Transformer (CAT), and the inference phase, which incorporates a decoder and upsampling to generate spatial expression patterns. Crucially, the model leverages causal relationships between genes through a combination of autoregressive and diffusion processes.
Integrating single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) data is essential for understanding spatial gene expression, but existing methods often achieve limited structural similarity (below 60%). This limitation stems from their failure to account for causal relationships between genes. CausalGeD addresses this challenge by combining diffusion and autoregressive processes to leverage these causal relationships for enhanced spatial gene expression generation.
CausalGeD integrates causality by adapting the Causal Attention Transformer (CAT) from image generation to gene expression data. Unlike traditional causal analysis requiring predefined relationships, CausalGeD learns these relationships directly from the data. The model operates in two stages: training with an encoder and CAT module for forward diffusion, and inference integrating these components with a decoder for reverse diffusion to generate expression values. The core innovation lies in the combined diffusion and autoregressive approach, with the autoregressive component processing genes sequentially, reflecting regulatory mechanisms, and the diffusion process refining the predictions. This is captured by the formula:
q(x^s_{i,t:T} | k_s, z^s_0, k_{1:s-1}) = q(x^s_{i,t} | k_s) Π_{t'=t}^{T-1} q(x^s_{i,t'} | x^s_{i,t'+1}, k_s, z^s_0, k_{1:s-1})
where x^s_{i,t} represents the latent representation of spatial data for gene i at diffusion step t and autoregressive step s, k_s is the subset of genes processed at step s, and z^s_0 represents the initial latent representation.
CausalGeD was evaluated on ten diverse tissue datasets against nine state-of-the-art baselines, using Pearson Correlation Coefficient (PCC), Structural Similarity Index Measure (SSIM), Root Mean Square Error (RMSE), and Jensen-Shannon divergence (JS). It consistently outperforms the baselines, achieving improvements of 5-32%. For example, on the human breast cancer (HBC) dataset, CausalGeD improves SSIM by 21.8% compared to the best baseline. On the mouse embryo (ME) dataset, it shows a 34.9% improvement in PCC. Ablation studies validate the importance of CausalGeD's key components, including the decoder training strategy, encoder variation, AR step decay, transformer block depth, and diffusion time steps. The superior performance of CausalGeD translates to more accurate and biologically meaningful insights, potentially revealing previously unclear cell-cell communication patterns in tumor microenvironments and enabling better understanding of developmental gene regulation. CausalGeD represents a significant advance in spatial gene expression prediction, offering a powerful tool for uncovering complex biological mechanisms in spatial contexts.
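Three of the metrics named above are simple to compute directly. The sketch below shows PCC, RMSE, and Jensen-Shannon divergence with NumPy. It assumes expression values are non-negative vectors that are normalised to sum to one before the JS computation, and it uses base-2 logarithms so the divergence lies in [0, 1]. How the authors preprocess data for each metric is not stated in this summary, so treat those details as assumptions. SSIM is omitted because it needs a windowed image comparison.

```python
import numpy as np

def pcc(pred, true):
    """Pearson correlation coefficient between two flattened arrays."""
    return float(np.corrcoef(np.ravel(pred), np.ravel(true))[0, 1])

def rmse(pred, true):
    """Root mean square error."""
    diff = np.asarray(pred, dtype=float) - np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two non-negative vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()  # normalise to probability distributions
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

true_expr = np.array([1.0, 2.0, 3.0, 4.0])  # toy measured expression
pred_expr = 2.0 * true_expr + 1.0           # toy prediction, perfectly correlated
```

Note that a prediction can reach a PCC of 1 while still having a large RMSE, as in the toy example, which is why the metrics are reported together.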
Controllable Sequence Editing for Counterfactual Generation by Michelle M. Li, Kevin Li, Yasha Ektefaie, Shvat Messica, Marinka Zitnik https://arxiv.org/abs/2502.03569
Caption: Panel (a) illustrates controllable sequence editing of a patient's health journey, allowing for "what if" scenarios with different intervention timings (2 hours, 2 days, 1 week) and observing their impact on the trajectory, highlighting the reversibility of edits. Panel (b) contrasts traditional sequence editing (original) with CLEF's predicted counterfactual, demonstrating how CLEF preserves historical data while enabling time-sensitive interventions. The dotted purple lines in both panels represent counterfactual trajectories generated by modifying the original sequence.
Counterfactual thinking, the ability to reason about "what if" scenarios, is crucial in fields like biology and medicine. Existing sequence models for counterfactual generation often lack fine-grained control over when and where edits occur, limiting their usefulness in situations requiring precise, localized modifications. Current approaches typically focus on univariate sequences or assume global interventions, neglecting cases where interventions have delayed and localized effects. This limitation hinders their application in complex biological processes like cellular reprogramming or modeling patient immune dynamics, where interventions (e.g., drug administration or genetic perturbations) have time-delayed, localized effects.
CLEF (ControLlable sequence Editing for counterFactual generation) addresses this gap, enabling precise counterfactual reasoning about both immediate and delayed effects. CLEF learns temporal concepts that encode trajectory patterns of sequences, allowing accurate counterfactual generation based on a given condition. Unlike existing methods, CLEF selectively edits relevant time steps while preserving unaffected portions of the sequence, maintaining both temporal causality and the consistency of inherent dependencies. CLEF includes a sequence encoder, a condition adapter, a concept encoder (learning temporal concepts c = GELU(FFN(h_x ⊕ h)), where h_x are sequence features and h combines time delta and condition embeddings), and a concept decoder (generating counterfactuals x̂_{:,t_j} = c ⊗ x_{:,t_i}). Trained using Huber loss, CLEF minimizes the difference between predicted and actual counterfactual sequences.
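The concept encoder and decoder formulas can be sketched in a few lines of NumPy. This is a toy illustration under several assumptions that the summary does not confirm: ⊕ is treated as concatenation, ⊗ as an element-wise product, the FFN is a single linear layer, and GELU uses the tanh approximation. All names, shapes, and weights are hypothetical.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def concept_encoder(h_x, h, W, b):
    """c = GELU(FFN(h_x (+) h)), with (+) assumed to be concatenation."""
    return gelu(np.concatenate([h_x, h]) @ W + b)

def concept_decoder(c, x_ti):
    """x_hat at t_j = c (x) x at t_i, with (x) assumed to be element-wise."""
    return c * x_ti

def huber_loss(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear for large ones."""
    r = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    return float(np.mean(np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))))

rng = np.random.default_rng(0)
n_features, d_seq, d_cond = 5, 8, 4
h_x = rng.standard_normal(d_seq)      # sequence features (toy)
h = rng.standard_normal(d_cond)       # time-delta plus condition embedding (toy)
W = rng.standard_normal((d_seq + d_cond, n_features))
b = np.zeros(n_features)
x_ti = rng.standard_normal(n_features)  # observed state at time t_i

c = concept_encoder(h_x, h, W, b)
x_hat_tj = concept_decoder(c, x_ti)     # counterfactual state at time t_j
```

Under this reading, the concept c acts as a per-feature scaling of the current state, so a concept of all ones would leave the state unchanged.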
Evaluated on cellular and patient trajectory datasets and compared to baselines like VAR, Transformer, xLSTM, and MOMENT, CLEF demonstrates superior performance. For immediate sequence editing, CLEF improves MAE by up to 36.01% compared to baselines. CLEF also enables one-step generation of counterfactual sequences at any future time step (delayed sequence editing), outperforming baselines by up to 65.71% in MAE. Furthermore, CLEF shows improved performance in zero-shot counterfactual generation of cellular trajectories, with MAE improvements up to 14.45% and 63.19% for immediate and delayed editing, respectively. A case study on type 1 diabetes mellitus patients shows CLEF's ability to simulate realistic "healthy" counterfactual trajectories by intervening on specific temporal concepts related to glucose and white blood cell levels, opening avenues for personalized medicine and treatment optimization. These results highlight CLEF's potential for generating precise, context-specific counterfactuals while preserving temporal and structural constraints, representing a significant advancement in counterfactual sequence editing for biomedical applications. While promising, the authors acknowledge limitations like the current definition of temporal concepts and the potential for incorporating real-world causal models for further improvement.
This newsletter showcases the rapid advancement of computational methods in genomic research. The emergence of powerful foundation models like Omni-DNA and GENERator offers a unified approach to diverse genomic tasks, demonstrating impressive performance in multi-task learning and cross-modal integration. Furthermore, the ability of GENERator to adhere to the central dogma by generating structurally plausible proteins from DNA sequences highlights the potential of these models to bridge the gap between genotype and phenotype.
Beyond foundation models, the development of specialized algorithms like CausalGeD and CLEF addresses specific challenges in genomic analysis. CausalGeD's integration of diffusion and autoregressive processes with causal relationships between genes significantly improves spatial gene expression prediction. Similarly, CLEF's ability to perform controllable sequence editing for counterfactual generation offers a powerful tool for exploring "what if" scenarios in biological systems, particularly in applications requiring precise and localized modifications.
A recurring theme across these papers is the importance of incorporating causal reasoning into genomic analysis. Whether through explicit modeling, as in CausalGeD, or implicit learning by LLMs, as demonstrated by Liang (2025), the integration of causality promises to unlock a deeper understanding of complex biological processes. Moreover, the emphasis on bridging in silico predictions with in vitro experiments underscores the need for combining computational power with experimental validation to advance biological knowledge. These advancements pave the way for future discoveries and highlight the transformative potential of computational methods in shaping the future of genomic research.