Subject: Cutting-Edge Statistical Modeling and Inference: A Roundup of Recent Preprints
Hi Elman,
This newsletter delves into a fascinating collection of preprints exploring diverse applications of statistical modeling and inference. From spatial confounding in geostatistics to uncertainty quantification in nuclear physics and optimized clinical trial design, these papers showcase the expanding reach and sophistication of statistical methodologies across various scientific disciplines.
The topics covered span diverse fields, including geostatistics, nuclear physics, clinical trial design, and environmental health. A key trend across many papers is the increasing use of Bayesian approaches. For example, Lamouroux et al. (2024) tackle spatial confounding in geostatistical data using Gaussian Markov Random Fields (GMRFs) within the R-INLA framework. Similarly, Lartaud et al. (2024) leverage Bayesian inverse problems and surrogate models for uncertainty quantification in neutron and gamma noise analysis, demonstrating how incorporating gamma correlations can reduce uncertainty. Qu (2024) also employs a Bayesian framework, using Poisson regression with a hierarchical structure to model the relationship between air pollution and health outcomes while accounting for measurement error in pollution data.
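To make that last idea concrete, a classical measurement-error Poisson regression can be written in a few lines of PyMC. This is a generic sketch under our own assumptions (simulated data, made-up priors, a known measurement-error scale), not Qu's exact specification:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n = 200
true_exposure = rng.normal(10, 2, size=n)                  # latent "true" pollution level
obs_exposure = true_exposure + rng.normal(0, 1, size=n)    # noisy monitor measurements
counts = rng.poisson(np.exp(0.1 + 0.05 * true_exposure))   # health outcome counts

with pm.Model():
    # Latent true exposure, inferred jointly with the regression
    x_true = pm.Normal("x_true", mu=10, sigma=5, shape=n)
    pm.Normal("x_obs", mu=x_true, sigma=1.0, observed=obs_exposure)

    # Weakly informative priors on the regression coefficients
    alpha = pm.Normal("alpha", 0, 1)
    beta = pm.Normal("beta", 0, 1)

    # Poisson outcome with a log link in the true (not observed) exposure
    pm.Poisson("y", mu=pm.math.exp(alpha + beta * x_true), observed=counts)

    idata = pm.sample(1000, tune=1000, target_accept=0.9)
```

Regressing directly on obs_exposure would attenuate the estimated beta; inferring the latent exposure inside the hierarchy is what avoids that bias.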
Beyond Bayesian methods, the development of novel statistical models for complex data emerges as another prominent theme. Kang and Gu (2024) introduce a blockwise mixed membership model (BM3) for analyzing multivariate longitudinal data, applying it to identify Parkinson's Disease subtypes. Miller et al. (2024) propose Diverse Expected Improvement (DEI), a novel Bayesian optimization method for finding diverse optimal solution sets, particularly relevant for applications like engine control. Soale et al. (2024) explore how metric choice impacts dimension reduction in Fréchet regression, emphasizing the critical role of metric selection when analyzing complex data types. Finally, Manna et al. (2024) present BigVAR, a statistical model for predicting daily water table depth, incorporating time series autoregression and lagged variables.
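For orientation, DEI builds on the standard expected improvement (EI) acquisition function from Bayesian optimization. Below is a minimal sketch of plain EI for maximization, given a Gaussian process posterior mean and standard deviation at candidate points; the diversity term that distinguishes DEI is not reproduced here, and the exploration parameter xi is a conventional default, not a value from the paper:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Plain EI for maximization: E[max(f(x) - f_best - xi, 0)]
    under a Gaussian posterior f(x) ~ N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)   # guard against zero predictive variance
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```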
Methodological advancements in specific statistical techniques also feature prominently. Asikanius et al. (2024) offer guidelines for planning and interpreting clinical trials involving interim analyses, addressing operational and interpretational challenges. Zafar and Nicholls (2024) investigate using Posterior Predictive Checks (PPCs) to select the learning rate in Generalized Bayesian Inference (GBI), a novel approach to handling model misspecification. Wei (2024) proposes nonparametric covariance regression models for high-dimensional neural data with restricted covariates, employing Gaussian processes and graph information. Wang et al. (2024) provide a comprehensive guide to Simulation-Based Inference (SBI) in computational biology, comparing neural and statistical SBI methods. Zeileis (2024) demonstrates the use of Rasch models and measurement invariance assessment for analyzing exam data.
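The learning-rate question in GBI is easy to state in code. In a generalized ("Gibbs") posterior the likelihood is tempered by a learning rate η, so that p_η(θ | y) ∝ p(θ) p(y | θ)^η, and the idea is to pick η by how well posterior predictive replicates match the observed data. The toy below is our own construction with a deliberately simple normal-mean model on a grid, meant only to illustrate that mechanic, not Zafar and Nicholls' procedure:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=50)

theta_grid = np.linspace(-5, 5, 2001)
log_prior = norm.logpdf(theta_grid, 0, 10)

def gibbs_posterior(eta):
    # Generalized (Gibbs) posterior on a grid: prior x likelihood^eta
    loglik = norm.logpdf(y[:, None], theta_grid[None, :], 1.0).sum(axis=0)
    logpost = log_prior + eta * loglik
    w = np.exp(logpost - logpost.max())
    return w / w.sum()

def ppc_discrepancy(eta, n_draws=2000):
    # Draw theta from the generalized posterior, simulate replicate sample
    # means (mean of 50 N(theta, 1) draws has sd 1/sqrt(50)), and compare
    # against the observed sample mean.
    post = gibbs_posterior(eta)
    thetas = rng.choice(theta_grid, size=n_draws, p=post)
    y_rep_means = rng.normal(thetas, 1.0 / np.sqrt(len(y)))
    return np.mean(np.abs(y_rep_means - y.mean()))

for eta in [0.1, 0.5, 1.0]:
    print(eta, round(ppc_discrepancy(eta), 3))
```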
The remaining preprints cover a diverse range of applications and methodological contributions. Several focus on specific applications, such as evaluating the impact of reduced malaria prevalence on birthweight using a pair-of-pairs design (Wang et al., 2024), calibrating microscopic traffic models with macroscopic data (Wang et al., 2024), and discovering governing equations from experimental data using laser vibrometry and WSINDy (Schmid et al., 2024). Others introduce novel methods, like a robust method for multi-view co-expression network inference (Pandeva et al., 2024), a framework for estimating heterogeneous treatment effects in survival outcomes (Bo & Ding, 2024), and a method for learning non-Gaussian spatial distributions using Bayesian transport maps (Chakraborty & Katzfuss, 2024). Theoretical considerations are also addressed, with Hemerik and Koning (2024) discussing the pitfalls of post hoc significance level selection, and Regueiro et al. (2024) evaluating stochastic gradient variational Bayes in stochastic blockmodels. Finally, Melikechi et al. (2024) introduce a fast nonparametric feature selection method with error control, and Chance (2024) presents a model-based temporal decorrelation process for state estimates.
These preprints collectively highlight the increasing sophistication and breadth of statistical methods being applied across diverse scientific fields. The emphasis on Bayesian methods, novel model development, and rigorous uncertainty quantification reflects a broader trend towards more robust and data-driven scientific inquiry.
Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments by Kosuke Imai, Kentaro Nakamura https://arxiv.org/abs/2410.00903
Caption: This figure compares the performance of a novel causal inference method using LLMs (Proposed) against existing methods (Diff-in-Means, T-Learner with BERT, DR-Learner with BERT) across varying levels of confounding (weak, moderate, strong). The metrics evaluated include bias, RMSE, 95% CI coverage, and average runtime, demonstrating the superior performance of the proposed method under the separability assumption, particularly in scenarios with strong confounding. The results are presented for both "Under separability" and "No separability" conditions.
Causal inference with text data, especially when texts are treatments, presents significant challenges due to the high-dimensional and unstructured nature of the data. Disentangling true treatment features from confounding factors is difficult, often leading to biased estimates. This paper introduces a groundbreaking approach that leverages the power of generative AI, specifically large language models (LLMs), to enhance causal representation learning. The key innovation lies in using LLMs not only to generate realistic text treatments but also to exploit their internal representations for more accurate causal effect estimation. This eliminates the need to learn a causal representation from the data, a significant advantage over existing methods that leads to more efficient and precise estimates.
The methodology involves an experimental design where treatment and control prompts are fed into an LLM to generate texts. These generated texts are then presented to survey respondents, and their reactions are measured. Critically, the LLM's internal representation of the generated texts is extracted. This internal representation is then analyzed using a neural network architecture based on TarNet. This network simultaneously learns a deconfounder, denoted as f(Rᵢ), which is a lower-dimensional representation of the LLM's internal representation (Rᵢ), and the conditional potential outcome function, μₜ(f(Rᵢ)) = E[Yᵢ(t, Uᵢ) | f(Rᵢ)], for treatment levels t = 0, 1. Here, Yᵢ(t, Uᵢ) represents the outcome for individual i under treatment t and unobserved confounders Uᵢ. The propensity score, π(f(Rᵢ; Â)) = Pr(Tᵢ = 1 | f(Rᵢ; Â)), is estimated as a function of the learned deconfounder. Finally, the average treatment effect (ATE), τ = E[Yᵢ(1,Uᵢ) – Yᵢ(0, Uᵢ)], is estimated using the double machine learning (DML) framework. The paper extends this methodology to estimate the local average treatment effect (LATE) of perceived treatment features using an instrumental variable approach, where the actual treatment feature acts as the instrument.
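A minimal PyTorch sketch of the architecture just described: a shared deconfounder f(R) feeding two treatment-specific outcome heads and a propensity head, plus the standard doubly robust (AIPW) ATE estimator that DML-style procedures are built on. The layer sizes, the training loop, and cross-fitting are our assumptions and omissions; rep_dim stands for the dimension of the LLM's internal representation Rᵢ:

```python
import torch
import torch.nn as nn

class TarNetDeconfounder(nn.Module):
    """TarNet-style network: shared deconfounder f(R), two outcome heads, propensity head."""
    def __init__(self, rep_dim: int, hidden: int = 64, deconf_dim: int = 16):
        super().__init__()
        self.deconfounder = nn.Sequential(
            nn.Linear(rep_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, deconf_dim), nn.ReLU(),
        )
        # mu_t(f(R)) for t = 0, 1 and the propensity pi(f(R))
        self.mu0 = nn.Sequential(nn.Linear(deconf_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.mu1 = nn.Sequential(nn.Linear(deconf_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.prop = nn.Linear(deconf_dim, 1)

    def forward(self, r: torch.Tensor):
        f = self.deconfounder(r)
        return (self.mu0(f).squeeze(-1),
                self.mu1(f).squeeze(-1),
                torch.sigmoid(self.prop(f)).squeeze(-1))

def aipw_ate(y, t, mu0, mu1, e, eps=0.01):
    """Doubly robust (AIPW) ATE estimate from 1-D tensors of outcomes,
    binary treatments, outcome-head predictions, and propensities."""
    e = e.clamp(eps, 1 - eps)  # trim extreme propensities for stability
    return (mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e)).mean()
```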
The researchers conducted simulation studies based on a candidate profile experiment, using the open-source LLM Llama3 to generate candidate biographies. They compared their proposed estimator's performance against established methods, including a difference-in-means estimator and two BERT-based approaches. Under the crucial assumption of separability between treatment and confounding features, the proposed estimator significantly outperformed the competitors, demonstrating lower bias and root mean squared error (RMSE) while maintaining accurate nominal coverage for its 95% confidence intervals. These improvements were especially pronounced in scenarios with strong confounding. Moreover, the proposed estimator proved to be considerably more computationally efficient, running over ten times faster than the BERT-based methods. However, when the separability assumption was violated, all estimators performed poorly, emphasizing the importance of this assumption for accurate causal inference. This study powerfully demonstrates the transformative potential of generative AI in causal inference with unstructured data.
A Comprehensive Guide to Simulation-based Inference in Computational Biology by Xiaoyu Wang, Ryan P. Kelly, Adrianne L. Jenner, David J. Warne, Christopher Drovandi https://arxiv.org/abs/2409.19675
Caption: This figure illustrates the three-stage workflow for Simulation-Based Inference (SBI) in computational biology: pre-analysis (model suitability and computational cost assessment), SBI stage (algorithm selection and model refinement), and uncertainty quantification (posterior checks and iterative improvement). The workflow emphasizes an iterative process, incorporating feedback from biologists and evaluating various SBI methods like ABC, SNPE, and others based on the specific model and data characteristics.
Computational models are indispensable for understanding complex biological processes. However, parameter inference, especially with real-world data, remains a formidable challenge. Simulation-Based Inference (SBI) methods, encompassing statistical approaches like Approximate Bayesian Computation (ABC) and Bayesian Synthetic Likelihood (BSL), and neural approaches like Neural Posterior Estimation (NPE) and Neural Likelihood Estimation (NLE), offer valuable solutions. Choosing the right SBI method for a specific scenario, however, can be daunting. This paper provides a comprehensive guide for computational biologists navigating the complexities of SBI, offering a practical three-stage workflow: pre-analysis, SBI application, and uncertainty analysis, with an emphasis on iterative refinement based on the results.
The pre-analysis stage focuses on assessing computational cost and model suitability. This involves estimating simulation times and performing inference on synthetic datasets to evaluate parameter identifiability and sensitivity. The SBI stage involves selecting the appropriate algorithm based on the pre-analysis results and addressing potential model misspecification. For real-world data, where misspecification is likely, the authors recommend robust versions of the chosen algorithms, such as robust BSL (RBSL) and robust SNLE (RSNLE). The uncertainty analysis stage involves posterior predictive checks to evaluate model performance and identify areas for improvement. The authors stress the importance of collaboration with biologists to refine the model based on these insights.
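As a concrete anchor for readers new to SBI, the accept/reject core that SMC ABC refines sequentially looks like the minimal rejection-ABC sketch below. The toy Poisson simulator, summary statistics, and tolerance rule are all our own choices, standing in for an expensive biological simulator:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, n=100):
    # Stand-in for an expensive stochastic biological simulator
    return rng.poisson(theta, size=n)

def summary(x):
    # Low-dimensional summary statistics of a simulated dataset
    return np.array([x.mean(), x.var()])

def rejection_abc(y_obs, n_sims=50_000, quantile=0.001):
    s_obs = summary(y_obs)
    thetas = rng.uniform(0, 20, size=n_sims)   # draws from a uniform prior
    dists = np.array([np.linalg.norm(summary(simulator(th)) - s_obs)
                      for th in thetas])
    eps = np.quantile(dists, quantile)         # keep only the closest simulations
    return thetas[dists <= eps]

y_obs = rng.poisson(7.0, size=100)
posterior_draws = rejection_abc(y_obs)
print(posterior_draws.mean(), posterior_draws.std())
```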
The guidelines were applied to two agent-based models: a biphasic tumor growth model (BVCBM) and a stochastic cell invasion model. For BVCBM, using real pancreatic tumor data, Sequential Monte Carlo ABC (SMC ABC) and RSNLE were selected. Results showed SMC ABC performed reasonably well, while RSNLE struggled due to the noise inherent in real-world data. For the cell invasion model, SMC ABC, Sequential Neural Posterior Estimation (SNPE), and SNLE were chosen. BSL, while potentially accurate, proved computationally prohibitive. SMC ABC performed best for cell proliferation, while all three performed similarly for cell movement, highlighting the need for improved experimental design to address non-identifiability of movement parameters.
This study reveals a key trade-off: neural SBI methods require fewer simulations but can produce biased estimations, especially with real-world data. Statistical methods, while computationally more demanding, offer improved accuracy with increasing simulation numbers. This suggests that with sufficient computational resources, statistical SBI may outperform neural SBI. The authors emphasize that no single algorithm is universally superior, highlighting the importance of the proposed guidelines. The paper also underscores the critical role of model refinement in the calibration process, advocating for close collaboration with biologists to ensure models accurately reflect the underlying biology.
Robust Multi-view Co-expression Network Inference by Teodora Pandeva, Martijs Jonker, Leendert Hamoen, Joris Mooij, Patrick Forré https://arxiv.org/abs/2409.19991
Caption: This graph showcases the performance of MVTLASSO compared to Glasso combined with ICA or standardization on Bacillus subtilis gene expression data. MVTLASSO demonstrates superior performance, identifying a significantly higher number of true positive edges in the inferred gene co-expression network at any given number of false positives. This highlights MVTLASSO's ability to more accurately reconstruct gene regulatory networks from multi-view transcriptome data.
Inferring gene co-expression networks (GCNs) from transcriptome data is crucial for understanding cellular processes, but it is a complex task plagued by challenges like spurious correlations and batch effects due to variations in experimental designs and data sources. Existing methods often rely on simplifying assumptions like Gaussian models or clustering techniques, limiting their robustness and accuracy. This paper introduces MVTLASSO, a robust method for high-dimensional graph inference from multiple independent studies, offering a more sophisticated approach to GCN inference.
MVTLASSO is based on the premise that each dataset is a noisy linear mixture of gene loadings that follow a multivariate t-distribution with a shared, sparse precision matrix (Θ) across all studies. This model, represented by X_d = S_d A_d + Z_d B_d (where X_d is the gene expression matrix for study d, S_d the gene loading matrix, A_d the sample loading matrix, Z_d the noise random variables, and B_d the corresponding sample loading matrix for the noise), allows for the identification of the co-expression matrix up to a scaling factor. The method uses an Expectation-Maximization (EM) algorithm to estimate the model parameters, including the crucial precision matrix Θ, which defines the GCN structure.
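MVTLASSO itself is not, to our knowledge, available in a standard library, but the object at its core, a sparse precision matrix Θ estimated under an L1 penalty, is exactly what the graphical lasso baseline computes. A brief scikit-learn sketch of that single-view Gaussian baseline, on stand-in data:

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

# X: samples-by-genes expression matrix from a single study (toy stand-in)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))

model = GraphicalLassoCV().fit(X)
Theta = model.precision_                                   # sparse precision estimate
edges = np.argwhere(np.triu(np.abs(Theta) > 1e-6, k=1))    # nonzero partial correlations
print(f"{len(edges)} edges in the inferred network")
```

MVTLASSO generalizes this step by replacing the Gaussian assumption with t-distributed gene loadings shared across views and estimating Θ inside an EM loop.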
The researchers evaluated MVTLASSO on both synthetic and real-world gene expression data. In synthetic experiments, MVTLASSO consistently outperformed established methods like GLASSO and TLASSO across varying signal-to-noise ratios and numbers of views (datasets). For instance, with a 50:50 signal-to-noise ratio, increasing the number of views significantly improved MVTLASSO's performance. The method was then applied to gene expression data from Bacillus subtilis. Using two datasets comprising four views after splitting, MVTLASSO demonstrated superior performance in identifying true positive edges compared to GLASSO combined with Independent Component Analysis (ICA) or standardization. Across a range of penalty parameters used in stability selection, MVTLASSO consistently inferred a higher number of true positive edges, validated against existing biological knowledge. MVTLASSO presents a promising new approach for robust GCN inference from multi-view transcriptome data. Its ability to handle non-Gaussian data and account for sample correlations and batch effects makes it a valuable tool for researchers seeking to unravel complex gene regulatory networks.
Estimating Interpretable Heterogeneous Treatment Effect with Causal Subgroup Discovery in Survival Outcomes by Na Bo, Ying Ding https://arxiv.org/abs/2409.19241
Estimating heterogeneous treatment effects (HTE) in survival outcomes is vital for understanding how treatment efficacy varies across patients. Existing methods often focus on post-hoc subgroup identification or lack interpretability, hindering their clinical applicability. This paper introduces an interpretable HTE estimation framework that simultaneously estimates the conditional average treatment effect (CATE) and selects relevant subgroups, addressing the need for clinically actionable insights into treatment heterogeneity, especially in areas like disease progression and targeted therapies.
The method estimates the CATE at a specific time t*, defined as τ(x; t*) = E[I(T(1) > t*) − I(T(0) > t*) | X = x], where T(1) and T(0) are the potential survival times under treatment and control, respectively, given patient characteristics X. The framework uses the concept of "pseudo-individualized treatment effect" (pseudo-ITE) to address the challenge of observing only one potential outcome per patient. It integrates three meta-learners with double robustness properties (the DR-learner, DEA-learner, and R-learner) to construct pseudo-ITEs. These meta-learners utilize inverse probability censoring weighting (IPCW) to handle censored data. Inspired by the RuleFit algorithm, the framework generates interpretable "candidate subgroups" from tree-based methods, specifically conditional inference trees (CTree), and uses them as new covariates in a penalized regression model to predict the CATE.
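A heavily simplified sketch of that pipeline (our own construction, not the authors' code): compute DR-learner pseudo-ITEs for the survival indicator I(T > t*), ignoring censoring, which the paper handles with IPCW, then harvest tree leaves as candidate subgroups, RuleFit-style, and run a lasso over their indicators. The simulated data, the single CART tree in place of CTree, and the fixed penalty are all illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, p, t_star = 1000, 5, 2.0
X = rng.normal(size=(n, p))
W = rng.binomial(1, 0.5, n)                            # randomized treatment
T = rng.exponential(np.exp(0.3 * W * (X[:, 0] > 0)))   # survival times, effect in a subgroup
B = (T > t_star).astype(float)                         # I(T > t*), censoring ignored here

# Nuisance models: treatment-arm outcome regressions and propensity score
mu1 = RandomForestClassifier().fit(X[W == 1], B[W == 1]).predict_proba(X)[:, 1]
mu0 = RandomForestClassifier().fit(X[W == 0], B[W == 0]).predict_proba(X)[:, 1]
e = LogisticRegression().fit(X, W).predict_proba(X)[:, 1].clip(0.05, 0.95)

# DR-learner style pseudo-ITE for the survival-probability contrast
pseudo = mu1 - mu0 + W * (B - mu1) / e - (1 - W) * (B - mu0) / (1 - e)

# RuleFit-style step: harvest leaf-membership indicators from a shallow tree
tree = DecisionTreeRegressor(max_depth=3).fit(X, pseudo)
rules = np.eye(tree.tree_.node_count)[tree.apply(X)]   # one-hot leaf indicators

# Penalized regression over candidate subgroups
fit = Lasso(alpha=0.01).fit(rules, pseudo)
print("selected subgroups:", np.flatnonzero(fit.coef_))
```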
The researchers evaluated their method through extensive simulations mimicking randomized clinical trials, considering both low-dimensional independent and high-dimensional weakly-correlated signal settings. Results showed that using DEA-learner to construct pseudo-ITE generally yielded the best performance in both prediction accuracy and subgroup identification, especially in selecting sparse and relevant subgroups. While DR-learner also demonstrated some sparseness, BAFT and CSF (using predicted CATE as pseudo-ITE) tended to select numerous subgroups, many of which were irrelevant. In high-dimensional correlated settings, relaxing Bonferroni correction or omitting multiple testing adjustments in CTree was recommended for better performance.
The framework was then applied to a real-world dataset from the Age-related Eye Disease Study (AREDS) trials, focusing on the effect of an antioxidant and mineral supplement on age-related macular degeneration (AMD) progression. Using AREDS as training data and AREDS2 as test data, the method identified fifteen subgroups based on single nucleotide polymorphisms (SNPs), baseline AMD severity, and education. These subgroups exhibited significant treatment heterogeneity, with some experiencing enhanced treatment effects while others showed adverse effects. Validation in the AREDS2 dataset confirmed these findings, demonstrating the method's ability to identify clinically relevant subgroups and quantify treatment effect variations. This research offers a promising approach to interpretable HTE estimation in survival outcomes, potentially leading to more personalized treatment strategies.
This newsletter highlights a wide range of advancements in statistical modeling and inference. From leveraging the power of generative AI for causal inference with text data to developing robust methods for multi-view co-expression network inference and providing practical guidelines for simulation-based inference in computational biology, these preprints showcase the increasing sophistication and applicability of statistical methodologies across scientific disciplines. A common thread across several of these works is the emphasis on addressing real-world complexities in data, whether handling spatial confounding, accounting for model misspecification, or dealing with censored survival outcomes. The focus on interpretability, as seen in the HTE estimation framework, underscores the growing need for statistical methods that not only provide accurate predictions but also offer actionable insights for researchers and practitioners. These advancements collectively contribute to a more robust and data-driven approach to scientific inquiry, paving the way for more impactful discoveries and more effective interventions across fields.