Several recent studies have explored novel computational approaches for analyzing and interpreting complex genomic data, focusing on improving the efficiency and accuracy of gene editing and variant effect prediction. Qin et al. (2024) introduced NAIAD, an active learning framework for optimizing combinatorial CRISPR screens. This approach leverages single-gene perturbation effects and adaptive gene embeddings to efficiently identify synergistic gene pairs, outperforming existing models by up to 40% in identifying optimal combinations. Concurrently, Weinberger et al. (2024) developed ContrastiveVI+, a generative model addressing the challenge of variable guide efficiency in pooled CRISPR screens. This model disentangles perturbation-induced variations from background noise while simultaneously inferring the efficacy of genomic edits, leading to improved recovery of known perturbation effects.
Beyond gene editing, several studies focused on leveraging deep learning for variant effect prediction and genomic sequence understanding. Kathail et al. (2024) reviewed the application of genomic deep learning models, including supervised sequence-to-activity models and self-supervised language models, for predicting the effects of non-coding variants. They highlighted the importance of ground truth data for model evaluation and discussed downstream applications in understanding disease-relevant variants. Complementing this, Jeon et al. (2024) introduced LoRA-BERT, a pre-trained bidirectional encoder representation for robust and accurate prediction of long non-coding RNAs (lncRNAs). This model outperforms existing tools in lncRNA and mRNA prediction by capturing nucleotide-level information, offering potential for understanding diseases linked to lncRNAs. Further expanding the application of AI in genomics, Nelson et al. (2024) presented LA4SR, a re-engineered language model for microbial sequence classification. This model demonstrated high accuracy and speed in classifying the algal dark proteome, highlighting the potential of AI language models for analyzing uncharacterized proteins.
Addressing the critical issue of reproducibility in genomic studies, Jiang and Ayday (2024) proposed a novel method for validating GWAS findings without requiring access to the original datasets. Their approach uses p-values to estimate contingency tables and calculates the Hamming distance between derived and publicly available minor allele frequencies (MAFs), enabling the detection of unintentional errors in reported GWAS results.
The application of deep learning to single-cell RNA sequencing (scRNA-seq) data analysis was also explored. Cui et al. (2024) introduced a White-Box Diffusion Transformer for generating synthetic scRNA-seq data. This hybrid model combines the generative capabilities of diffusion models with the interpretability of white-box transformers, offering a promising approach for addressing data limitations in scRNA-seq studies. Similarly, Andrade et al. (2024) developed a Mixed Effects Deep Learning (MEDL) Autoencoder for interpretable analysis of scRNA-seq data. This framework models both batch-invariant and batch-specific components, improving visualization and predictive accuracy while capturing cellular heterogeneity across diverse datasets.
Finally, Xiao et al. (2024) presented RNA-GPT, a multimodal generative system for RNA sequence understanding. This model integrates RNA sequence encoders with large language models, enabling it to process user-uploaded RNA sequences and provide concise, accurate responses to complex queries based on extensive RNA literature. This work, along with the introduction of the RNA-QA dataset, represents a significant step towards leveraging AI for streamlining RNA research and discovery. Collectively, these studies demonstrate the growing potential of AI and deep learning for addressing critical challenges in genomics research, from optimizing gene editing experiments to enhancing the interpretability and reproducibility of complex genomic data.
Active learning for efficient discovery of optimal gene combinations in the combinatorial perturbation space by Jason Qin, Hans-Hermann Wessels, Carlos Fernandez-Granda, Yuhan Hao https://arxiv.org/abs/2411.12010
Caption: The figure illustrates the NAIAD active learning framework for combinatorial CRISPR screening. It cycles through training a predictive model (using gene embeddings and single-gene effects), recommending promising gene pairs for experimental validation, and incorporating new experimental data to refine the model. The right panel details NAIAD's model architecture, showing how single-gene effects and gene embeddings are combined to predict combinatorial effects, which are then used for recommendation.
The vast potential of combinatorial CRISPR screening for identifying synergistic gene interactions that influence cellular phenotypes is tempered by the sheer scale of possible gene combinations, making exhaustive experimentation impractical. NAIAD, an active learning framework, offers a solution by efficiently identifying optimal gene pairs in these screens. This framework tackles two critical challenges: achieving high predictive accuracy with minimal initial training data and providing a smart recommendation system to guide subsequent experimental rounds.
NAIAD's innovative architecture incorporates adaptive gene embeddings that scale with the growing dataset, mitigating overfitting and capturing complex gene interactions as more data becomes available. The model also utilizes an overparametrized representation of single-gene perturbation effects, effectively conditioning combinatorial predictions on established single-gene impacts.
The model predicts the combined effect (Y<sub>i+j</sub>) of a two-gene perturbation (i + j) using the following formula:
Y<sub>i+j</sub> = φ([Y<sub>i</sub>, Y<sub>j</sub>]W<sub>1</sub>)A + f(φ(W<sub>2</sub>X<sub>gene</sub><sup>i</sup>), φ(W<sub>2</sub>X<sub>gene</sub><sup>j</sup>))A
where Y<sub>i</sub> and Y<sub>j</sub> represent single-gene perturbation effects, X<sub>gene</sub><sup>i</sup> and X<sub>gene</sub><sup>j</sup> are gene embeddings, W<sub>1</sub>, W<sub>2</sub>, and A are learnable parameters, φ is an activation function, and f is a permutation-invariant function. This structure allows NAIAD to learn both additive and synergistic effects of gene combinations.
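As a minimal NumPy sketch of the prediction above, assuming a ReLU activation for φ, an elementwise sum as the permutation-invariant f, and illustrative dimensions (none of these specific choices are stated in the summary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper)
d_pheno, d_embed, d_hidden = 1, 16, 32

# Learnable parameters W1, W2, A (random placeholders standing in for trained weights)
W1 = rng.normal(size=(2 * d_pheno, d_hidden))  # acts on the concatenated [Y_i, Y_j]
W2 = rng.normal(size=(d_hidden, d_embed))      # projects gene embeddings
A = rng.normal(size=(d_hidden, d_pheno))       # shared readout

def phi(x):
    """Activation function (ReLU assumed here)."""
    return np.maximum(x, 0.0)

def f(a, b):
    """Permutation-invariant combiner; an elementwise sum is one simple choice."""
    return a + b

def predict_pair(Y_i, Y_j, X_i, X_j):
    # Single-gene-effect branch: phi([Y_i, Y_j] W1) A
    additive = phi(np.concatenate([Y_i, Y_j]) @ W1) @ A
    # Embedding branch: f(phi(W2 X_i), phi(W2 X_j)) A
    interaction = f(phi(W2 @ X_i), phi(W2 @ X_j)) @ A
    return additive + interaction

Y_i, Y_j = rng.normal(size=d_pheno), rng.normal(size=d_pheno)
X_i, X_j = rng.normal(size=d_embed), rng.normal(size=d_embed)
y_ij = predict_pair(Y_i, Y_j, X_i, X_j)
```

Because f is symmetric, the embedding branch returns the same value for (i, j) and (j, i); a fully order-invariant predictor would also need to handle the ordering of the concatenated [Y<sub>i</sub>, Y<sub>j</sub>] branch, for example by averaging both orderings.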
A key component of NAIAD is its Maximum Predicted Effects (MPE) based recommendation system. This system prioritizes gene pairs predicted to have the largest phenotypic effects for subsequent experimental testing, maximizing information gain in each round and accelerating the discovery of potent gene combinations. Alternative acquisition functions, such as uncertainty sampling and Upper Confidence Bound (UCB) sampling, were also explored.
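The acquisition rules above can be sketched as follows, assuming a scalar predicted effect per candidate pair; the function names and the UCB weight β are illustrative, not from the paper:

```python
import numpy as np

def recommend_mpe(pred_effects, batch_size):
    """Maximum Predicted Effects: select the pairs with the largest
    predicted phenotypic effect magnitudes for the next round."""
    return np.argsort(-np.abs(pred_effects))[:batch_size]

def recommend_ucb(pred_mean, pred_std, batch_size, beta=1.0):
    """Upper Confidence Bound: trade predicted effect size off against
    model uncertainty when ranking candidates."""
    return np.argsort(-(np.abs(pred_mean) + beta * pred_std))[:batch_size]

pred_effects = np.array([0.1, -2.3, 0.7, 1.9, -0.2])
batch = recommend_mpe(pred_effects, batch_size=2)  # the two strongest-effect pairs
```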
Evaluated on four combinatorial CRISPR datasets encompassing over 350,000 genetic interactions, NAIAD, trained on relatively small datasets, outperformed existing models significantly. With an average of four observations per gene in the training data, NAIAD achieved a >40% improvement in Root Mean Square Error (RMSE) compared to the next best model. In active learning simulations, MPE sampling proved highly effective in identifying strong gene pairs, uncovering more than twice as many strong perturbations as uniform sampling after four rounds. While MPE sampling may not always achieve the lowest overall MSE, its focus on rapidly identifying the most potent combinations is crucial for therapeutic development. This research demonstrates the power of active learning and sophisticated modeling for efficiently navigating the complex combinatorial perturbation space.
Modeling variable guide efficiency in pooled CRISPR screens with ContrastiveVI+ by Ethan Weinberger, Ryan Conrad, Tal Ashuach https://arxiv.org/abs/2411.08072
Caption: Figure: ContrastiveVI+ improves disentanglement of perturbation effects in pooled CRISPR screens. (a-c) UMAP visualizations of cell cycle phase, gene program, and perturbation probability, respectively, highlighting ContrastiveVI+'s ability to separate confounding factors. (d) ContrastiveVI+ outperforms existing methods in disentangling cell cycle effects and recovering gene program annotations. (e-h) Perturbation labels and gene expression visualizations demonstrate ContrastiveVI+'s ability to identify subtle, single-cell-level perturbation effects.
Pooled CRISPR screens coupled with single-cell RNA sequencing offer a powerful method for studying gene function. However, analyzing this data presents significant computational hurdles. Perturbation-induced variations can be subtle compared to other sources of variation, such as cell cycle effects, and inconsistent guide RNA efficiency means some cells escape perturbation even when expressing a guide. Existing methods like contrastive latent variable models (cLVMs) address the former by disentangling perturbation effects, but they often rely on a single prior for all perturbations, potentially masking subtle effects and failing to explicitly model guide efficiency.
ContrastiveVI+, a novel generative model, tackles both these challenges. For a cell i with guide RNA c<sub>i</sub>, ContrastiveVI+ models gene expression x<sub>i</sub> using background latent variables z<sub>i</sub> (shared with controls) and salient latent variables t<sub>i</sub> (perturbation-specific). Importantly, it incorporates a binary variable y<sub>i</sub> indicating whether a cell underwent perturbation (y<sub>i</sub> = 1) or escaped (y<sub>i</sub> = 0). The salient variables are then drawn from a mixture model: t<sub>i</sub> | y<sub>i</sub>, c<sub>i</sub> ~ y<sub>i</sub> · N(µ<sub>c<sub>i</sub></sub>, I) + (1 − y<sub>i</sub>) · N(µ<sub>∅</sub>, I). This allows for perturbation-specific effects (µ<sub>c<sub>i</sub></sub>) and a shared "null" effect (µ<sub>∅</sub>) for escaping cells. Inference is performed using variational inference with specific regularization terms to promote disentanglement and accurate mapping of non-perturbed cells.
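The generative side of this mixture can be sketched as a simple sampler; the guide names, perturbation probability, and latent dimensionality below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_salient = 8

# Hypothetical perturbation-specific means mu_c and the shared "null" mean mu_null
mu_c = {"IFNGR1": rng.normal(size=d_salient), "JAK2": rng.normal(size=d_salient)}
mu_null = np.zeros(d_salient)

def sample_salient(guide, p_perturbed, rng):
    """Draw t_i from the mixture: y_i = 1 (perturbed) centers the salient
    latent on mu_c[guide]; y_i = 0 (escaped) uses the shared null mean."""
    y_i = bool(rng.random() < p_perturbed)
    mean = mu_c[guide] if y_i else mu_null
    return y_i, rng.normal(loc=mean, scale=1.0)  # t_i ~ N(mean, I)

y_i, t_i = sample_salient("JAK2", p_perturbed=0.8, rng=rng)
```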
Evaluated on three publicly available pooled CRISPR screening datasets, ContrastiveVI+ demonstrated superior performance. On an ECCITE-seq dataset, it improved the separation of confounding factors like cell cycle and batch effects compared to existing methods, as measured by entropy of mixing. It also highlighted biologically relevant clusters corresponding to known components of the interferon-gamma pathway. On a larger CRISPRi dataset, ContrastiveVI+ again showed improved disentanglement and better captured known pathway annotations. Furthermore, it outperformed Mixscape in identifying perturbed versus escaping cells, as measured by changes in maximum mean discrepancy (MMD) between perturbed and control cells after filtering.
On a CRISPRa dataset, ContrastiveVI+ successfully separated cells by gene program labels while controlling for cell cycle. Crucially, it revealed heterogeneity within gene program clusters, highlighting instances where some perturbations had more variable effects than others. For instance, cells perturbed to activate CEBPE alone showed variable responses, while cells with CEBPE activated in combination with other genes often clustered with the CEBPE-only cells, suggesting that CEBPE activation alone might be sufficient to drive a specific cellular state. These subtle effects, missed by previous pseudobulk analyses, showcase the power of ContrastiveVI+'s single-cell resolution. Overall, ContrastiveVI+ offers improved disentanglement, identification of escaping cells, and the ability to uncover subtle, single-cell-level perturbation effects, making it a valuable tool for analyzing pooled CRISPR screens.
LA4SR: illuminating the dark proteome with generative AI by David R. Nelson, Ashish Kumar Jaiswal, Noha Ismail, Alexandra Mystikou, Kourosh Salehi-Ashtiani https://arxiv.org/abs/2411.06798
The LA4SR framework (language modeling with artificial intelligence for algal amino acid sequence representation) utilizes generative AI, specifically language models (LMs) and large language models (LLMs), to transform microbial sequence classification. Researchers re-engineered open-source LMs such as GPT-2, BLOOM, and Mamba for microbial sequence analysis, employing two distinct training approaches: one using full-length protein sequences (TI-inclusive) and the other using scrambled start/stop sites (TI-free) to assess the impact of terminal information (TI). Trained on ~77 million sequences from 161 microalgal genomes and known contaminants, representing ten microalgal phyla, the models used pre-training via architecture mimicry and fine-tuning with parameter-efficient techniques such as LoRA and QLoRA.
LA4SR achieved impressive results, with F1 scores up to 95 and significantly improved recall and speed compared to BLASTP (up to 16,580x faster and ~3x the recall rate). Notably, these models effectively classified the algal "dark proteome" – the ~65% of proteins typically yielding no hits in alignment-based searches. Larger models (>1B parameters) achieved high accuracy (F1 > 86) with minimal training data (<2% of the total dataset), demonstrating robust generalization. High accuracy was maintained even with scrambled TI, indicating the models learned position-independent amino acid patterns.
Custom explainability tools, including HELIX, DeepLift LA4SR, and Deep Motif Miner Pro (DMMP), were developed to understand the models' decision-making. Analysis revealed the dominance of glycine and glutamine in distinguishing algal and bacterial sequences, likely reflecting evolutionary adaptations related to nitrogen metabolism. Specific motifs associated with ATP/GTP-binding sites were identified in bacterial-like algal sequences, suggesting potential horizontal gene transfer. Layer-wise analysis revealed increasing representation complexity up to the fifth layer, followed by convergence in the final layer, indicating feature refinement as information propagates through the network.
Validation with real-world data from axenic and xenic algal cultures further demonstrated LA4SR's ability to differentiate true contaminants from algal sequences with bacterial-like characteristics, crucial for accurate genome analysis. The TI-free approach enhanced performance, enabling faster token generation. These findings represent a paradigm shift in high-throughput bioinformatics, offering new possibilities for understanding uncharacterizable protein sequences and highlighting the potential of transfer learning to bridge general language understanding and biological sequence analysis. While the underrepresentation of certain algal lineages in existing databases presents a limitation, the success of LA4SR paves the way for more transparent, reliable, and biologically meaningful microbial genomics analysis.
Validating GWAS Findings through Reverse Engineering of Contingency Tables by Yuzhou Jiang, Erman Ayday https://arxiv.org/abs/2411.11169
Caption: This figure illustrates the novel validation technique for GWAS findings. It shows the process of reconstructing contingency tables using public data and reported p-values, calculating MAFs, and comparing them to public MAFs to determine the Hamming distance and assess the reliability of the results. This method enables validation without requiring access to the original sensitive genomic data.
Reproducibility is essential in GWAS, but data sharing restrictions due to privacy concerns often impede validation efforts. This novel method addresses this challenge by leveraging publicly available data and reported study outcomes to validate GWAS findings without needing the original dataset. The focus is on detecting unintentional errors that can arise during research, rather than deliberate data fabrication.
The method uses reported p-values of SNPs in GWAS studies. By leveraging public reference datasets, a partial contingency table for each SNP can be constructed. The missing part of the contingency table, representing the case group, is then reverse-engineered using a grid search to find the case group distribution that best replicates the reported p-values. Once reconstructed, MAFs are calculated for the case group using the formula MAF = (2T₀ + T₁) / (2(T₀ + T₁ + T₂)), where T₀, T₁, and T₂ represent genotype counts in the case group. These calculated MAFs are compared to publicly available phenotype-specific MAF data, and the average Hamming distance between them serves as the basis for validation.
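A toy version of the reverse-engineering step, assuming a 2×3 genotype contingency table and a Pearson chi-square association test (with 2 degrees of freedom the survival function is exactly exp(−x/2), so no stats library is needed); the control counts and case-group size below are illustrative, not taken from the paper:

```python
import numpy as np
from itertools import product

def chi2_stat(table):
    """Pearson chi-square statistic for a contingency table."""
    table = np.asarray(table, dtype=float)
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / table.sum()
    return ((table - expected) ** 2 / expected).sum()

def grid_search_case(control_counts, n_case, reported_p):
    """Brute-force search for case genotype counts (T0, T1, T2) summing to
    n_case whose chi-square p-value best matches the reported p-value (df = 2)."""
    best, best_err = None, np.inf
    for t0, t1 in product(range(n_case + 1), repeat=2):
        t2 = n_case - t0 - t1
        if t2 < 0:
            continue
        p = np.exp(-chi2_stat([list(control_counts), [t0, t1, t2]]) / 2)
        if abs(p - reported_p) < best_err:
            best, best_err = (t0, t1, t2), abs(p - reported_p)
    return best

# A reported p-value of 1.0 is matched exactly by case counts proportional
# to the controls
case = grid_search_case(control_counts=(30, 50, 20), n_case=10, reported_p=1.0)
```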
The average Hamming distance between calculated and public MAFs across all SNPs is computed. If this distance exceeds a predefined threshold, the GWAS findings are flagged for further scrutiny. Conversely, if the distance falls within the acceptable range, the findings are considered reliable. This method offers a practical way for researchers to validate both their own and others' findings, promoting trust and accuracy in genomic research.
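Putting the MAF formula and the threshold rule together, under one simple reading of the comparison (per-SNP disagreement of rounded MAFs as the Hamming distance; the rounding precision and threshold below are illustrative assumptions):

```python
import numpy as np

def maf_from_counts(t0, t1, t2):
    """MAF = (2*T0 + T1) / (2*(T0 + T1 + T2)) for case-group genotype counts
    T0 (homozygous minor), T1 (heterozygous), T2 (homozygous major)."""
    return (2 * t0 + t1) / (2 * (t0 + t1 + t2))

def validate_findings(case_counts, public_mafs, threshold=0.05, ndigits=2):
    """Flag findings when the average Hamming distance between derived and
    public MAFs (rounded to ndigits) exceeds the threshold."""
    derived = np.round([maf_from_counts(*c) for c in case_counts], ndigits)
    public = np.round(public_mafs, ndigits)
    avg_hamming = float(np.mean(derived != public))
    return avg_hamming <= threshold, avg_hamming

# Two SNPs whose reconstructed case counts reproduce the public MAFs exactly
reliable, dist = validate_findings([(10, 40, 50), (5, 30, 65)], [0.30, 0.20])
```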
Evaluation using three real-world SNP datasets from OpenSNP demonstrated the method's effectiveness in detecting errors, even with minor discrepancies (e.g., 1% of SNPs reported incorrectly). The method is robust even with slight noise in reported p-values or minor MAF deviations between the private target dataset and the public dataset. The study also emphasized the importance of p-value precision, showing that providing p-values with at least 7 decimal places ensures optimal performance. This technique offers a promising solution to the GWAS reproducibility challenge by balancing the need for rigorous validation with data privacy, thereby enhancing transparency and trust in genomic research. It also provides a valuable self-check tool for researchers before publication, further strengthening the reliability of GWAS outcomes.
This newsletter highlights significant advancements in computational genomics, spanning gene editing, variant prediction, reproducibility, and single-cell analysis. NAIAD offers an efficient approach to optimizing combinatorial CRISPR screens, addressing the vast search space through active learning and adaptive gene embeddings. ContrastiveVI+ tackles the challenge of variable guide efficiency in pooled CRISPR screens by disentangling perturbation effects and modeling escaping cells, leading to more accurate identification of perturbation-induced variations. LA4SR leverages the power of language models to illuminate the "dark proteome," providing rapid and accurate microbial sequence classification and offering insights into previously uncharacterized proteins. Finally, the novel GWAS validation method using p-values and MAF comparisons addresses the critical need for reproducibility without compromising data privacy. These innovative approaches demonstrate the transformative potential of AI and deep learning in accelerating genomic discovery and enhancing the reliability of research findings.