Subject: Cutting-Edge Statistical Methodologies and Applications
Hi Elman,
This newsletter covers a set of preprints that apply statistical methodology in environmental science, healthcare, and engineering. A key focus is improving model performance and robustness in the face of real-world data challenges such as missing data, censored observations, and complex dependencies. For instance, Hakvoort et al. (2025) introduce tuned weighted scoring rules to enhance probabilistic wind speed forecasts, especially for extreme events. Van Biesbroeck et al. (2025) propose a Bayesian sequential design of experiments approach with a constrained reference prior to robustly estimate seismic fragility curves, addressing the likelihood degeneracy that is prevalent in small datasets. Doser et al. (2025) develop a multivariate spatial model for small area estimation of forest inventory parameters, accounting for zero-inflation, correlations among species, and spatial autocorrelation.
The development of novel statistical methods for specific applications is another recurring theme. P. & S.M. (2025) explore extropy-based divergence measures and their applications, while Baghel & Mondal (2025) employ exponential-polynomial divergence for robust inference in nondestructive one-shot device testing. Jingyuan et al. (2025) introduce a joint regularized deep neural network for heterogeneous network estimation in single-cell transcriptomic data, addressing cellular heterogeneity and nonlinear relationships among genes. Hahn et al. (2025) present an adaptive multi-wave sampling method for efficient chart validation in healthcare studies.
Causal inference and prediction are also prominent. Sun et al. (2025) develop a hybrid Markov model for real-time bus travel time prediction, while Wójcik & Macek (2025) investigate the Markovian character of fluctuations in solar wind turbulence. Wu et al. (2025) propose WOLCAN, a Bayesian latent class analysis method for non-probability samples, to study dietary behaviors. Gao et al. (2025) develop a doubly robust omnibus sensitivity analysis for externally controlled trials.
Advances in Bayesian statistics are also highlighted. Xu & Zhou (2025) introduce a Bayesian synthetic control method, while Zhou et al. (2025) present Shiny-MAGEC, a Bayesian R Shiny application for meta-analysis. Kral et al. (2025) develop a model-based bi-clustering approach using a multivariate Poisson-lognormal model.
Finally, several papers address specific data analysis challenges. Lukić et al. (2025) propose a variance-based segmentation algorithm for analyzing software system signals. Sutton et al. (2025) evaluate spatial multilevel regression and poststratification (MRP). Rethwisch & Hofmann (2025) present a visualization framework for forensic bullet comparisons.
A Spatiotemporal, Quasi-experimental Causal Inference Approach to Characterize the Effects of Global Plastic Waste Export and Burning on Air Quality Using Remotely Sensed Data by Ellen M. Considine, Rachel C. Nethery https://arxiv.org/abs/2503.04491
Caption: Effect of 2018 Policy, Varying by Port Proximity Index
This study uses a spatiotemporal, quasi-experimental approach to quantify the impact of China's 2018 plastic waste import ban on Indonesia's air quality. The ban diverted substantial plastic waste to Indonesia, raising concerns about air quality degradation from open burning. The researchers rely on remotely sensed data, including PM₂.₅ estimates, dump site locations, shipping activity (a proxy for plastic waste imports), and meteorological data. This approach works around the lack of ground-level monitoring data that is common in many countries.
The core of the analysis is a multiply-robust estimator for causal exposure-response curves. This estimator addresses potential confounding by using pre-intervention years as controls and by accounting for varying levels of exposure to the intervention (proxied by dump site proximity to ports). Built upon the efficient influence function (EIF), the estimator is robust to misspecification of outcome or propensity score models. A spatial weighted bootstrap procedure quantifies uncertainty, considering spatial and temporal correlations.
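To make the robustness property concrete, here is a minimal sketch of a simpler relative of the paper's estimator: the augmented inverse-probability-weighted (AIPW) estimator for a binary treatment. It is not the exposure-response estimator from the study; the simulated data, the function name `aipw_ate`, and the deliberately misspecified nuisance models are all assumptions made for illustration. The point is only that the estimate stays close to the truth when either the outcome model or the propensity model is wrong, as long as the other is right.

```python
import numpy as np

def aipw_ate(y, a, p_hat, mu1_hat, mu0_hat):
    """Augmented IPW estimate of E[Y(1)] - E[Y(0)] for a binary treatment."""
    y, a = np.asarray(y, float), np.asarray(a, float)
    # Outcome-model prediction plus an inverse-probability-weighted residual correction.
    psi1 = mu1_hat + a * (y - mu1_hat) / p_hat
    psi0 = mu0_hat + (1 - a) * (y - mu0_hat) / (1 - p_hat)
    return float(np.mean(psi1 - psi0))

# Simulated data (hypothetical): one confounder x, true treatment effect 2.0.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-x))          # true propensity score
a = rng.binomial(1, p_true)
y = 2.0 * a + x + rng.normal(size=n)

# Case 1: correct propensity, deliberately wrong outcome model (all zeros).
est_wrong_outcome = aipw_ate(y, a, p_true, np.zeros(n), np.zeros(n))
# Case 2: correct outcome model, deliberately wrong propensity (constant 0.5).
est_wrong_propensity = aipw_ate(y, a, np.full(n, 0.5), 2.0 + x, x)

print(est_wrong_outcome, est_wrong_propensity)
```

Both printed values should land near the simulated effect of 2.0. The study's estimator extends this idea to a continuous exposure and adds a spatial weighted bootstrap for uncertainty, neither of which is reproduced here.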
The study found a statistically significant increase in monthly PM₂.₅ near Indonesian dump sites post-ban (2018-2019) compared to pre-ban levels (2012-2017). The increase ranged from 0.76-1.72 µg/m³ (15-34% of the WHO annual limit) at sites with high port proximity. The average increase across all dump sites was 1.14 µg/m³ (4.4%). A post-hoc analysis of fire data corroborated the findings, showing increased fires at dump sites after the ban. This research demonstrates the power of combining causal inference with remote sensing for policy evaluation in data-scarce environments.
Heterogeneous network estimation for single-cell transcriptomic data via a joint regularized deep neural network by Yang Jingyuan, Li Tao, Wang Tianyi, Shuangge Ma, Mengyun Wu https://arxiv.org/abs/2503.06389
Caption: The figure illustrates the JRDNN-KM framework for single-cell network estimation.
Single-cell transcriptomics reveals cellular heterogeneity, but the resulting data are challenging to analyze. Existing network estimation methods struggle with non-linearity, zero-inflation, and cell population diversity. The Joint Regularized Deep Neural Network incorporating Mahalanobis distance-based K-means clustering (JRDNN-KM) addresses these issues by combining deep learning with clustering to estimate separate networks for different cell subgroups.
JRDNN-KM uses deep neural networks to model non-linear gene relationships. For each cell subgroup k and gene j, it uses a zero-inflated conditional Gaussian distribution:
$X_{ij} \mid X_{i,\setminus j},\, C_i = k \;\sim\; \pi_j\, N\big(f_{kj}(W_{k,j} * X_{i,\setminus j}),\, 1\big) + (1 - \pi_j)\, \delta_0(X_{ij})$
where $f_{kj}(\cdot)$ is a non-parametric function (the neural network), $W_{k,j}$ are sparse weight vectors (network connections), and $\pi_j$ is the non-zero expression probability. Regularization encourages shared network structures across subgroups. The estimated networks refine cell subgroup assignments through Mahalanobis distance-based K-means clustering.
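As a rough illustration of the zero-inflated conditional Gaussian above, the sketch below evaluates its log-likelihood for one gene. It assumes that exact zeros are attributed entirely to the point mass (the Gaussian component has probability zero of hitting exactly zero) and that the conditional mean `mu` has already been produced by some predictor standing in for the neural network. The function name and argument layout are hypothetical, not taken from the paper.

```python
import numpy as np

def zi_gaussian_loglik(x, mu, pi_nonzero):
    """Log-likelihood of observations x under a zero-inflated N(mu, 1) model.

    x          : observed expression values (exact zeros are treated as dropouts)
    mu         : conditional means, standing in for the network output
    pi_nonzero : probability of non-zero expression, strictly between 0 and 1
    """
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    is_zero = (x == 0.0)
    # Log density of N(mu, 1) evaluated at x.
    log_gauss = -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2
    # Zeros contribute log(1 - pi); non-zeros contribute log(pi) + log N(x; mu, 1).
    per_obs = np.where(is_zero, np.log1p(-pi_nonzero), np.log(pi_nonzero) + log_gauss)
    return float(per_obs.sum())

# Toy example: four cells, two of them dropouts.
x_obs = [0.0, 1.2, 0.0, 0.8]
print(zi_gaussian_loglik(x_obs, [1.0] * 4, 0.5))
```

In a full method this quantity would be maximized jointly over the network weights and the non-zero probability, with a sparsity penalty on the weights; that optimization is omitted here.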
JRDNN-KM outperformed existing methods in simulations, achieving high ARI (Adjusted Rand Index) and F1 scores for subgroup identification and network reconstruction, even with high non-linearity and dropout. Applied to real datasets, JRDNN-KM accurately identified cell subgroups and revealed biologically meaningful connections, including hub genes and shared pathways. For example, in lung adenocarcinoma data, it identified known cancer-related hub genes (ASPM, RPL11) and unique hub genes in specific cell lines (FTL, AKR1C3). This method offers powerful insights into gene interplay within heterogeneous cell populations.
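For readers unfamiliar with the Adjusted Rand Index mentioned above, here is a small standard-library sketch of the usual pair-counting formula. It is a generic implementation for illustration and is not code from the paper; the label vectors in the example are made up.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same items."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two label lists of equal length >= 2")
    # Pairs of items placed together in both partitions (contingency table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together within each partition separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:                     # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Same grouping with swapped label names scores 1.0; label names do not matter.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

A value of 1 means the two partitions agree perfectly, values near 0 mean agreement no better than chance, and negative values mean worse than chance.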
VACT: A Video Automatic Causal Testing System and a Benchmark by Haotong Yang, Qingyuan Zheng, Yunjian Gao, Yongkun Yang, Yangbo He, Zhouchen Lin, Muhan Zhang https://arxiv.org/abs/2503.06163
Caption: This diagram illustrates the VACT (Video Automatic Causal Testing) system.
Text-conditioned video generation models (VGMs) are evolving rapidly, but their grasp of causality remains a weak point. Generated videos can be visually impressive yet lack physical realism. VACT, a new automated testing system and benchmark, addresses this by systematically evaluating the causal reasoning of VGMs.
VACT uses a large language model (LLM) to automatically identify causal factors and rules from scenario descriptions, creating causal graphs and systems without human input. This automation enables scalable testing across diverse scenarios. VACT performs intervention experiments, manipulating causal factors in text prompts and analyzing the generated videos for deviations from real-world physics.
The benchmark introduces three levels of causal consistency: text consistency (accurate variable generation), generation consistency (stable outputs for identical inputs), and rule consistency (adherence to causal relationships). These are evaluated using accuracy and variance metrics. For example:
Text Consistency: $s_{\text{all}} = \frac{1}{n_1} \sum_{i=1}^{n_1} \sum_{V \in \mathcal{V}} \mathbf{1}\big(V^{(i)} = \hat{V}^{(i)}\big)$
Generation Consistency: $s_2^{\text{truth}} = \frac{1}{n_2 |\mathcal{Y}|} \sum_{k=1}^{n_2} \sum_{Y \in \mathcal{Y}} d(Y, S_k)$
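Taking the text-consistency formula exactly as printed above, the score is the number of variables whose detected value matches the prompted value, averaged over the generated videos. The sketch below computes that quantity. The data layout (one dict of variable values per video) and the variable names are assumptions made for illustration, and the paper may normalize or aggregate differently.

```python
def text_consistency(specified, detected):
    """Average number of variables per video whose detected value matches the prompt.

    specified, detected: lists of dicts mapping variable name -> value,
    one dict per generated video (hypothetical data layout).
    """
    if len(specified) != len(detected) or not specified:
        raise ValueError("need two non-empty lists of equal length")
    total = 0
    for spec, det in zip(specified, detected):
        # Indicator 1(V = V_hat), summed over the variables named in the prompt.
        total += sum(1 for name, value in spec.items() if det.get(name) == value)
    return total / len(specified)

# Two videos, two prompted variables each; one mismatch in the first video.
prompted = [{"ball": "red", "ramp": "steep"}, {"ball": "blue", "ramp": "flat"}]
observed = [{"ball": "red", "ramp": "flat"}, {"ball": "blue", "ramp": "flat"}]
print(text_consistency(prompted, observed))  # (1 + 2) / 2 = 1.5
```

Dividing the result by the number of variables would turn it into a per-variable accuracy, which is easier to compare against percentage figures.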
Benchmarking leading VGMs revealed widespread struggles with causal consistency. Text accuracy ranged from 55% to 65%. While some generation consistency was observed, it often resulted from "degenerative" rules producing fixed outcomes. Rule consistency remained low (around 70%). VACT provides a crucial tool for assessing and improving causal reasoning in VGMs, paving the way for more realistic video generation.
Multivariate spatial models for small area estimation of species-specific forest inventory parameters by Jeffrey W. Doser, Malcolm S. Itter, Grant M. Domke, Andrew O. Finley https://arxiv.org/abs/2503.07118
Caption: Comparison of Coefficient of Variation (CV) between Model-Based and Design-Based Estimates of County-Level Biomass for 20 Tree Species
National Forest Inventories (NFIs) are essential, but their design-based estimation methods often lack precision at management-relevant scales, particularly for individual species. This research introduces a multivariate spatial model for small area estimation (SAE) to address this. The model handles the complexities of species-specific data, including zero-inflation, interspecies correlations, and spatial autocorrelation. By fitting to plot-level data, it enables species-level parameter estimation with associated uncertainty across any user-defined small area.
The model uses a two-stage hierarchical Bayesian framework. Stage 1 jointly estimates species presence/absence using a Bernoulli model with a logit link, incorporating climate variables and a spatially varying intercept (modeled using Nearest Neighbor Gaussian Processes – NNGP for computational efficiency). Stage 2 estimates species-specific biomass (conditional on presence) using a log-normal sub-model with predictors like canopy cover, climate variables, elevation, and another NNGP-modeled spatial intercept.
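The two stages combine in a simple way when predicting mean biomass at a location: the presence probability from Stage 1 multiplies the log-normal mean from Stage 2. The sketch below shows that calculation for given linear predictors. It leaves out the spatial random effects, the multivariate species structure, and all Bayesian fitting; the function and argument names are hypothetical.

```python
import numpy as np

def expected_biomass(eta_presence, mu_log, sigma_log):
    """Mean biomass under a Bernoulli (logit link) times log-normal two-stage model.

    eta_presence : Stage 1 linear predictor on the logit scale
    mu_log       : Stage 2 mean of log biomass, conditional on presence
    sigma_log    : Stage 2 standard deviation of log biomass
    """
    # Inverse logit gives the probability that the species is present.
    p_present = 1.0 / (1.0 + np.exp(-np.asarray(eta_presence, float)))
    # Mean of a log-normal variable is exp(mu + sigma^2 / 2).
    mean_if_present = np.exp(np.asarray(mu_log, float) + 0.5 * sigma_log ** 2)
    return p_present * mean_if_present

# Three hypothetical plots with increasing presence probability.
print(expected_biomass([-2.0, 0.0, 2.0], [1.0, 1.0, 1.0], 0.5))
```

In the full model, predictions like this would be computed for each posterior draw and then averaged over the plots or grid cells in a county, which is what yields small area estimates with uncertainty.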
Tested on FIA data from the southern US, the model estimated county-level biomass for 20 tree species. Compared to design-based and k-nearest neighbor (kNN) methods, the model-based estimates showed high correlation with both (average 0.85 with design-based) and substantially improved precision (91.5% of county-level estimates had lower CVs than design-based). A simulation study confirmed the model's superior accuracy (lower RMSE). This model allows for reliable estimation of species-level parameters at management-relevant scales, crucial for various forest management activities.
This newsletter highlights statistical approaches to concrete problems across several fields: sharpening probabilistic forecasts, robustly estimating fragility curves, untangling complex cellular networks, and evaluating the causal reasoning of AI. A common thread is the handling of real-world data complexities such as missing data, censored observations, and spatial autocorrelation. The new Bayesian and deep learning methods tailored to specific applications show how statistical methodology is evolving, with potential impacts on environmental policy, healthcare interventions, and the development of more reliable AI systems.