Subject: Genomics & Transcriptomics Advancements: LLMs, Spatial Analysis, and Novel Tools
Hi Elman,
This collection of preprints highlights advancements in genomics, transcriptomics, and bioinformatics, with a particular emphasis on leveraging computational approaches for analyzing complex biological data. Tauer, Trembath-Reichert, and Ward (2024) present metagenomic data from microbial mats in New Zealand's Waikite Valley hot springs, offering a valuable resource for studying thermophilic microbial communities across a temperature gradient. This work complements the growing body of research exploring microbial diversity in extreme environments and provides a foundation for genomic analyses of novel phototrophic bacteria. Concurrently, Liang (2024) introduces LLaMA-Gene, a large language model adapted for gene tasks. By expanding the vocabulary and pre-training on DNA and protein sequences, LLaMA-Gene achieves results comparable to state-of-the-art models in tasks like gene classification and interaction prediction, while still trailing on protein-related tasks, showcasing the potential of LLMs in genomics research.
Several preprints focus on spatial transcriptomics and its analytical challenges. Emons et al. (2024) introduce pasta, an R package applying spatial statistics to spatially-resolved omics data, bridging the gap between imaging-based and sequencing-based approaches. Zhu et al. (2024) propose SUICA, a deep learning model using implicit neural representations to enhance spatial resolution and gene expression prediction in ST data. Similarly, Han et al. (2024) present UMPIRE, a framework integrating spatial transcriptomics with pathology image representation learning, demonstrating improved performance in molecular-related downstream tasks. These contributions collectively advance the field by providing novel computational tools and frameworks for analyzing and interpreting complex spatial omics data.
Beyond spatial transcriptomics, the application of computational methods to other genomic challenges is also evident. Kapun (2024) investigates the influence of chromosomal inversions on genetic variation in Drosophila melanogaster, utilizing population genomics analysis tools to explore their impact on clinal variation. Chen et al. (2024) introduce ScPace, a timestamp calibration model for time-series single-cell RNA-seq data, addressing the issue of noisy timestamps and improving the accuracy of pseudotime analysis. Jiang and Wong (2024) present gghic, an R package extending the ggplot2 framework for visualizing 3D genome organization, facilitating the exploration of chromatin interactions and genomic annotations.
The power of large language models is further explored by Liu et al. (2024) in their Single-Cell Omics Arena (SOAR) benchmark. SOAR evaluates the performance of various LLMs in cell type annotation tasks using single-cell and multiomics data, demonstrating the potential of LLMs for automated cell type identification and cross-modality translation. Benedetti et al. (2024) introduce iSEEtree, an interactive Shiny app for exploring hierarchical data, particularly relevant for microbiome analysis using the TreeSummarizedExperiment data structure. This tool lowers the barrier to complex data analysis by providing a user-friendly visual interface. Liang and colleagues (2024) explore the role of chromatin interactions in interpreting non-coding genomic variants, while Patsakis et al. (2024) introduce MAFcounter, a tool for efficient k-mer counting in multiple alignment format files, further enhancing the bioinformatics toolkit for sequence analysis. Finally, Ge et al. (2024) provide a comprehensive review of deep learning in single-cell and spatial transcriptomics, offering valuable insights into the current state of the field and future directions.
Deep Learning in Single-Cell and Spatial Transcriptomics Data Analysis: Advances and Challenges from a Data Science Perspective by Shuang Ge, Shuqing Sun, Huan Xu, Qiang Cheng, Zhixiang Ren https://arxiv.org/abs/2412.03614
Caption: This figure illustrates the experimental workflows for microfluidic-based and spatial barcoding-based single-cell and spatial transcriptomics techniques. It then visually depicts the key data science challenges (data sparsity, diversity, scarcity, and correlation) in analyzing these data and highlights deep learning-based solutions, such as dimensionality reduction, noise reduction, data integration, and modeling spatiotemporal dependencies, to address these challenges. Finally, it showcases the application of AI tools like foundational models for tasks such as planning and action generation in this domain.
Single-cell and spatial transcriptomics have revolutionized our understanding of cellular heterogeneity and spatial organization within tissues. However, analyzing these complex datasets presents significant challenges, which this review comprehensively addresses. The authors identify four key data science challenges: data sparsity, data diversity, data scarcity, and data correlation, and demonstrate how deep learning (DL) offers effective solutions. Traditional methods struggle with the inherent high-dimensionality, sparsity, and noise present in these data, while DL excels at automatically identifying meaningful patterns.
Data sparsity, stemming from the vast number of genes and the variability in their expression across cells, leads to the curse of dimensionality, noise, and uncertainty. DL addresses these issues through dimensionality reduction techniques like scvis, a variational autoencoder (VAE) that preserves global data structure while mapping high-dimensional data to a lower-dimensional space. For noise reduction and imputation, DL methods like CLEAR utilize contrastive learning, and DCA employs a zero-inflated negative binomial (ZINB) noise model within an autoencoder framework. scVI tackles uncertainty by explicitly incorporating batch annotations and addressing batch effects through conditional independence assumptions, modeling gene expression with a ZINB distribution: p(x<sub>ng</sub> | z<sub>n</sub>, s<sub>n</sub>, l<sub>n</sub>), where x<sub>ng</sub> is the observed count of gene g in cell n, z<sub>n</sub> represents biological differences, l<sub>n</sub> accounts for capture efficiency and sequencing depth, and s<sub>n</sub> denotes batch annotation.
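To make the ZINB noise model concrete, here is a minimal Python sketch of its log-probability for a single count. It assumes the common mean / inverse-dispersion parameterization of the negative binomial plus a dropout probability `pi`; the function names and parameter values are illustrative and are not taken from the scVI or DCA code.

```python
import math


def nb_log_pmf(x: int, mu: float, theta: float) -> float:
    """Log-probability of count x under a negative binomial with
    mean mu > 0 and inverse dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x: int, mu: float, theta: float, pi: float) -> float:
    """Log-probability of count x under a zero-inflated negative binomial.

    pi (0 <= pi < 1) is the probability of an extra "dropout" zero.
    """
    if x == 0:
        # A zero is either a dropout or a genuine zero drawn from the NB.
        return math.log(pi + (1.0 - pi) * math.exp(nb_log_pmf(0, mu, theta)))
    # A positive count can only come from the NB component.
    return math.log(1.0 - pi) + nb_log_pmf(x, mu, theta)


if __name__ == "__main__":
    # Illustrative parameters: mean 2 counts, moderate dispersion, 30% dropout.
    for count in (0, 1, 5):
        print(count, zinb_log_pmf(count, mu=2.0, theta=1.5, pi=0.3))
```

In autoencoder-based models the parameters of this distribution are learned from the data, typically as outputs of the decoder network, rather than fixed by hand; the sketch only shows how the extra zeros enter the likelihood.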
Data diversity refers to the integration of multiple data modalities, including multi-omics data and paired single-cell and spatial transcriptomics data. Methods like LIGER use non-negative matrix factorization (NMF) to uncover shared and modality-specific gene expression patterns while minimizing distances between datasets. Seurat, LIGER, and Harmony leverage shared latent spaces and mutual nearest neighbors for aligning single-cell and spatial transcriptomics data. Integrating multi-source data, often from different samples or platforms, requires aligning independent feature spaces. Seurat v3 identifies common anchors across datasets, while DAVAE integrates large-scale unpaired data using a variational approximation network, a generative Bayesian neural network, and a domain adversarial classifier.
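The mutual-nearest-neighbor idea mentioned above can be illustrated with a small Python sketch. It assumes both datasets have already been projected into a shared low-dimensional space and uses plain Euclidean distance; the function names are hypothetical, and real anchor-finding pipelines add scoring and filtering steps that are omitted here.

```python
import math


def k_nearest(query, reference, k):
    """Indices of the k points in `reference` closest to `query` (Euclidean)."""
    order = sorted(range(len(reference)),
                   key=lambda j: math.dist(query, reference[j]))
    return set(order[:k])


def mutual_nearest_neighbors(a, b, k):
    """Pairs (i, j) where b[j] is among the k nearest neighbors of a[i]
    in dataset b AND a[i] is among the k nearest neighbors of b[j] in a."""
    nn_ab = [k_nearest(x, b, k) for x in a]
    nn_ba = [k_nearest(y, a, k) for y in b]
    return [(i, j)
            for i in range(len(a))
            for j in sorted(nn_ab[i])
            if i in nn_ba[j]]


if __name__ == "__main__":
    # Toy embeddings for two datasets (values are made up for illustration).
    dataset_a = [[0.0, 0.0], [10.0, 10.0]]
    dataset_b = [[0.5, 0.0], [10.0, 9.5], [100.0, 100.0]]
    print(mutual_nearest_neighbors(dataset_a, dataset_b, k=1))
```

Requiring the neighbor relation to hold in both directions is what keeps an outlying cell in one dataset from being paired with whatever happens to be closest in the other.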
Data scarcity, particularly the lack of high-quality annotations and missing modalities, hinders analysis. DL offers solutions through data simulation and modality completion. scDesign3 uses statistical modeling to generate realistic single-cell multi-omics and spatial transcriptomics data with known cell proportions, following a generalized additive model for location scale and shape (GAMLSS): θ<sub>j</sub>(μ<sub>ij</sub>) = α<sub>j0</sub> + α<sub>jb<sub>i</sub></sub> + α<sub>jc<sub>i</sub></sub> + f<sub>jc<sub>i</sub></sub>(x<sub>i</sub>), where θ<sub>j</sub> is the link function for the mean μ<sub>ij</sub> of feature j in cell i, α<sub>j0</sub> is an intercept, α<sub>jb<sub>i</sub></sub> and α<sub>jc<sub>i</sub></sub> are batch and condition effects, and f<sub>jc<sub>i</sub></sub> is a smooth function of the cell's covariates x<sub>i</sub>. For missing modalities, DL methods like totalVI and UniPort utilize VAE architectures for shared latent space modeling.
Data correlation, encompassing spatial and temporal dependencies, requires capturing complex interactions. DL architectures, including graph neural networks (GNNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs), offer flexible solutions. DeepLinc constructs a cell adjacency graph and learns embedding features reflecting interaction likelihood. Integrating prior knowledge, such as pathway information and regulatory networks, further enhances analysis. GLUE integrates multi-omics data through a guidance graph, while DeepCCI uses the LRIDB database to define receptors and predict interactions. Benchmark evaluations demonstrated the superior performance of DL methods in imputation, data integration, and cell-cell interaction prediction.
The review concludes by highlighting future directions, including developing innovative AI methods like foundational models and agent-based approaches, the need for robust benchmark datasets and biologically relevant evaluation metrics, and exploring DL applications in practical biological and medical scenarios. These advancements promise to further unlock the potential of single-cell and spatial transcriptomics, driving deeper insights into complex biological systems and accelerating biomedical discoveries.
Towards Unified Molecule-Enhanced Pathology Image Representation Learning via Integrating Spatial Transcriptomics by Minghao Han, Dingkang Yang, Jiabei Cheng, Xukun Zhang, Linhao Qu, Zizhi Chen, Lihua Zhang https://arxiv.org/abs/2412.00651
The Unified Molecule-enhanced Pathology Image REpresentation Learning (UMPIRE) framework aims to revolutionize computational pathology by integrating spatial transcriptomics data with pathology image analysis. Current approaches often rely heavily on visual-language models, which may not fully capture the molecular intricacies of diseases. UMPIRE addresses this limitation by leveraging gene expression profiles to guide multimodal pre-training, enhancing the molecular awareness of learned pathology image representations. This approach provides a robust, task-agnostic training signal, moving beyond the limitations of image augmentation or text descriptions.
UMPIRE utilizes a two-stage pre-training process. First, a BERT-like gene encoder, termed Visiumformer, is pre-trained on a massive dataset of approximately 4 million spatial transcriptomics gene expression entries, called ViSTomics-4M. This dataset was compiled from various public sources. Second, a pre-trained pathology vision transformer (either Phikon or UNI) is aligned with the gene encoder using symmetric contrastive learning (SCL) on paired pathology image-gene expression samples. The SCL loss function, L<sub>SCL</sub>, aims to minimize the distance between paired image and gene embeddings (h<sub>i</sub>, g<sub>i</sub>) while maximizing the distance between unpaired embeddings:
L<sub>SCL</sub> = -(1/(2M)) Σ<sub>i=1</sub><sup>M</sup> [log(exp(τ h<sub>i</sub>·g<sub>i</sub>) / Σ<sub>m=1</sub><sup>M</sup> exp(τ g<sub>i</sub>·h<sub>m</sub>)) + log(exp(τ g<sub>i</sub>·h<sub>i</sub>) / Σ<sub>m=1</sub><sup>M</sup> exp(τ h<sub>i</sub>·g<sub>m</sub>))]
where τ is a temperature parameter. A reconstruction loss was also explored.
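The symmetric contrastive objective can be sketched in plain Python as follows. This is a minimal illustration, assuming dot-product similarity between embeddings, τ used as a multiplicative scale as in the formula, and a leading minus sign so that better alignment gives a lower loss. The function names are hypothetical and this is not the UMPIRE implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def symmetric_contrastive_loss(h, g, tau):
    """Symmetric contrastive loss over M paired embeddings.

    h[i] is the image embedding and g[i] the gene embedding of pair i.
    Each pair is scored against all M candidates in both directions.
    """
    m_pairs = len(h)
    total = 0.0
    for i in range(m_pairs):
        positive = tau * dot(h[i], g[i])
        # Gene-to-image direction: g_i compared with every image embedding.
        gene_to_image = logsumexp([tau * dot(g[i], h[m]) for m in range(m_pairs)])
        # Image-to-gene direction: h_i compared with every gene embedding.
        image_to_gene = logsumexp([tau * dot(h[i], g[m]) for m in range(m_pairs)])
        total += (positive - gene_to_image) + (positive - image_to_gene)
    return -total / (2 * m_pairs)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    genes_aligned = [[1.0, 0.0], [0.0, 1.0]]
    genes_shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive_loss(images, genes_aligned, tau=10.0))
    print(symmetric_contrastive_loss(images, genes_shuffled, tau=10.0))
```

The loss is small when each image embedding is most similar to its own gene embedding and large when the pairing is scrambled, which is the behavior the alignment stage relies on.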
UMPIRE's performance was evaluated on various molecular-related downstream tasks, including gene expression prediction, spot classification, and mutation state prediction in whole slide images (WSIs). For gene expression prediction, UMPIRE significantly outperformed existing methods. In classification tasks, UMPIRE consistently enhanced performance. Finally, in WSI classification for gene mutation status, UMPIRE outperformed the baseline in most sub-tasks.
The results highlight the effectiveness of UMPIRE's multimodal data integration approach. The framework's ability to leverage molecular information significantly improves performance across various downstream tasks, offering a promising new direction for computational pathology research.
LLaMA-Gene: A General-purpose Gene Task Large Language Model Based on Instruction Fine-tuning by Wang Liang https://arxiv.org/abs/2412.00471
The development of general-purpose task models like ChatGPT has become a significant research direction in gene-based large language models (LLMs). While instruction fine-tuning is crucial for such models, existing methods primarily rely on natural language instructions, which differ significantly from gene sequences. This paper introduces LLaMA-Gene, a novel LLM designed to bridge this gap by extending the capabilities of the LLaMA model to encompass gene language.
The construction of LLaMA-Gene involves expanding the vocabulary using Byte Pair Encoding (BPE), tailored for both DNA and protein sequences. This is followed by continuous pre-training on these sequences. Downstream gene task data is then converted into a unified instruction format for fine-tuning. The model utilizes an Alpaca-style prompt formatting method, including the instruction, input sequence, and expected output. Specific prompt templates were designed for different task types, including classification, structure prediction, and regression.
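As an illustration of the Alpaca-style format, the sketch below assembles a prompt from an instruction, an input sequence, and an optional expected output. The template text follows the widely used Stanford Alpaca layout; the instruction wording, the example sequence, and the label are hypothetical and are not taken from the LLaMA-Gene paper.

```python
# Widely used Alpaca prompt layout (the variant with an input field).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str, sequence: str, output: str = "") -> str:
    """Format one example; pass `output` for training, omit it for inference."""
    return ALPACA_TEMPLATE.format(instruction=instruction, input=sequence) + output


if __name__ == "__main__":
    # Hypothetical classification task and toy DNA sequence.
    example = build_prompt(
        instruction="Determine whether the following DNA sequence is a promoter.",
        sequence="TATAAAAGGCGCGATCGATCG",
        output="promoter",
    )
    print(example)
```

Casting every downstream task into one such template is what lets a single fine-tuned model handle classification, structure prediction, and regression without task-specific heads.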
LLaMA-Gene was trained using the LoRA method, focusing on specific network layers related to the attention mechanism and feed-forward networks. Evaluation was performed on datasets covering various gene-related downstream tasks. LLaMA-Gene achieved comparable results to state-of-the-art models on tasks like gene classification and gene sequence interaction. While showing a performance gap in protein-related tasks compared to current SOTA methods, the results validate the effectiveness of the proposed method.
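LoRA, as generally formulated, freezes a pretrained weight matrix W and learns a low-rank update, so the effective weight is W + (alpha / r) · B A. The sketch below illustrates that arithmetic in plain Python. The matrix sizes, `alpha`, and the function names are assumptions for illustration and do not reflect the configuration used for LLaMA-Gene.

```python
def matmul(a, b):
    """Multiply matrix a (n x r) by matrix b (r x k), both lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight (d x k); b: trainable (d x r); a: trainable (r x k).
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def lora_param_count(d, k, rank):
    """Trainable parameters in the adapter, versus d * k for full fine-tuning."""
    return rank * (d + k)


if __name__ == "__main__":
    frozen = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    adapter_b = [[1.0], [2.0]]        # d x r, with r = 1
    adapter_a = [[1.0, 0.0, -1.0]]    # r x k
    print(lora_effective_weight(frozen, adapter_a, adapter_b, alpha=2.0))
    print(lora_param_count(4096, 4096, rank=8), "vs", 4096 * 4096)
```

Because only the two small factor matrices are trained, the adapter adds a small fraction of the parameters that updating the full matrix would require.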
This research demonstrates a promising direction for building unified large language models for gene tasks, moving beyond task-specific models. The instruction-tuned LLaMA-Gene model can interact conversationally and leverage techniques like prompt engineering, opening new possibilities for interpreting biological sequences.
Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data by Junhao Liu, Siwei Xu, Lei Zhang, Jing Zhang https://arxiv.org/abs/2412.02915
Caption: This figure compares the zero-shot and zero-shot chain-of-thought (CoT) performance of various large language models (LLMs) on the SOAR-RNA benchmark using Exact Match (EM) and F1 scores. Mixtral-8x22B with zero-shot CoT demonstrates competitive performance against the specialized model Cell2Sentence, highlighting the potential of LLMs for automated cell type annotation.
Single-cell sequencing has revolutionized our understanding of cellular heterogeneity, but cell type annotation remains labor-intensive. Large language models (LLMs) offer a potential solution for automating this process. This paper introduces SOAR, a benchmark study evaluating LLM performance on cell type annotation using single-cell data. The benchmark encompasses 11 datasets, 8 instruction-tuned LLMs, and 1226 annotation tasks. The study also explores chain-of-thought (CoT) prompting and extends LLM application to multiomics data.
SOAR comprises two components: SOAR-RNA for single-cell RNA sequencing (scRNA-seq) data, and SOAR-MultiOmics incorporating other modalities. For scRNA-seq, differentially expressed genes (DEGs) were used as LLM input. Two prompting strategies were employed: zero-shot and zero-shot CoT. For multiomics data, a cross-modality alignment module mapped ATAC-seq data to the RNA-seq modality using a variational autoencoder (VAE). The objective function for the multiomics alignment is defined as: L<sub>Int</sub> = L<sub>Rec</sub> + L<sub>Adv</sub>, where L<sub>Rec</sub> is the reconstruction loss and L<sub>Adv</sub> is the adversarial loss.
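The two prompting strategies can be sketched as a small helper that turns a list of DEGs into either a zero-shot or a zero-shot CoT prompt. The prompt wording and the example marker genes are hypothetical, not the benchmark's actual templates; the only element taken from the general literature is the standard zero-shot CoT trigger phrase "Let's think step by step."

```python
def build_annotation_prompt(degs, tissue=None, chain_of_thought=False):
    """Build a cell type annotation prompt from differentially expressed genes.

    degs: list of gene symbols for one cluster.
    tissue: optional tissue name added as context.
    chain_of_thought: if True, append the zero-shot CoT trigger phrase.
    """
    gene_list = ", ".join(degs)
    context = f" from {tissue}" if tissue else ""
    prompt = (
        f"The following genes are differentially expressed in a cell cluster"
        f"{context}: {gene_list}.\n"
        "Which cell type does this cluster most likely represent?"
    )
    if chain_of_thought:
        prompt += "\nLet's think step by step."
    else:
        prompt += "\nAnswer with the cell type name only."
    return prompt


if __name__ == "__main__":
    markers = ["CD3D", "CD3E", "IL7R"]  # illustrative marker genes
    print(build_annotation_prompt(markers, tissue="blood"))
    print()
    print(build_annotation_prompt(markers, tissue="blood", chain_of_thought=True))
```

The only difference between the two variants is the final line, which makes it straightforward to compare them on the same set of annotation tasks.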
Evaluation on SOAR-RNA revealed that open-source LLMs, particularly Mixtral-8x22B, achieved comparable performance to the domain-specific model Cell2Sentence. Zero-shot CoT prompting significantly improved performance across all open-source LLMs. On SOAR-MultiOmics, several LLMs demonstrated comparable performance to Cell2Sentence on both RNA-seq and ATAC-seq data.
The study demonstrates that LLMs can robustly interpret single-cell data without requiring additional fine-tuning. Zero-shot CoT prompting enhances their reasoning capabilities. The successful application of LLMs to multiomics data further expands their potential in genomics research.
SUICA: Learning Super-high Dimensional Sparse Implicit Neural Representations for Spatial Transcriptomics by Qingtian Zhu, Yumin Zheng, Yuling Sang, Yifan Zhan, Ziyan Zhu, Jun Ding, Yinqiang Zheng https://arxiv.org/abs/2412.01124
Caption: SUICA uses a graph-augmented autoencoder and implicit neural representations to create a continuous, compact representation of discrete spatial transcriptomics data. This approach enables super-resolution analysis, imputation of missing values, and improved performance in downstream tasks like cell type identification, as shown by the enhanced resolution and imputation in the processed images. The framework leverages a combination of regression and classification losses, including a Dice loss, to optimize the model's ability to capture and amplify biologically relevant signals.
Spatial Transcriptomics (ST) offers powerful insights into gene expression within tissues, but the discrete nature and super-high dimensionality of the data present modeling challenges. SUICA, a novel computational framework, leverages Implicit Neural Representations (INRs) to transform discrete ST data into a continuous and compact form, enabling super-resolution analysis.
SUICA incorporates a Graph-augmented Autoencoder (GAE) to address the dimensionality challenge. This GAE leverages contextual information from unstructured spots and generates structure-aware embeddings. SUICA addresses the sparsity issue by adopting a regression-by-classification approach. Pseudo-probabilities are constructed, and a classification-based Dice loss (L<sub>dice</sub>) is employed alongside conventional regression losses (L<sub>recons</sub>) to optimize the model:
L<sub>dice</sub> = 1 - (2∑(tanh(y) sgn(y<sub>gt</sub>)) + ε) / (∑tanh(y) + ∑ sgn(y<sub>gt</sub>) + ε')
L<sub>recons</sub> = (1/M<sub>y</sub>)∑(1/M<sup>+</sup><sub>y</sub>)(y - y<sub>gt</sub>)² + (1/M<sub>y</sub>*)|y - y<sub>gt</sub>| + λL<sub>dice</sub>
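The Dice term can be computed directly from its definition. The sketch below is a minimal plain-Python illustration, assuming non-negative predictions, a flat list of values per sample, and the same small constant in numerator and denominator; the function names are hypothetical and this is not the SUICA code.

```python
import math


def sgn(value):
    """Sign of a number: 1 if positive, -1 if negative, 0 if zero."""
    return (value > 0) - (value < 0)


def dice_loss(y, y_gt, eps=1e-6):
    """Dice loss between predicted expression y and ground truth y_gt.

    tanh(y) acts as a soft indicator that a gene is expressed, and
    sgn(y_gt) is the hard indicator taken from the ground truth.
    """
    overlap = sum(math.tanh(p) * sgn(t) for p, t in zip(y, y_gt))
    total = sum(math.tanh(p) for p in y) + sum(sgn(t) for t in y_gt)
    return 1.0 - (2.0 * overlap + eps) / (total + eps)


if __name__ == "__main__":
    truth = [3.0, 0.0, 5.0]
    good_prediction = [10.0, 0.0, 10.0]   # non-zero where the truth is non-zero
    bad_prediction = [0.0, 10.0, 0.0]     # non-zero only where the truth is zero
    print(dice_loss(good_prediction, truth))
    print(dice_loss(bad_prediction, truth))
```

The loss depends only on which entries are non-zero, not on their magnitudes, which is why it is paired with regression terms that handle the actual expression values.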
Benchmarking experiments demonstrate SUICA's superior performance. Importantly, SUICA also enhanced bio-conservation, showcasing its ability to preserve and amplify biologically relevant signals. SUICA's ability to impute gene expression values is a key advantage, successfully restoring biologically relevant signals in regions with low or missing ground-truth expression.
This newsletter highlights the growing influence of computational methods, particularly deep learning and LLMs, in genomics and transcriptomics. The development of tools like SUICA for enhanced spatial resolution and LLaMA-Gene for general-purpose gene tasks showcases the innovative application of these technologies. The SOAR benchmark underscores the potential of LLMs for automated cell type annotation, while UMPIRE's integration of spatial transcriptomics with pathology images demonstrates the power of multimodal approaches. The comprehensive review by Ge et al. provides a valuable overview of the field, emphasizing the importance of addressing data sparsity, diversity, scarcity, and correlation in single-cell and spatial transcriptomics analysis. The trend towards more integrative and accessible tools promises to accelerate research and deepen our understanding of complex biological systems.