Subject: Cutting-Edge Statistical and Machine Learning Research
Hi Elman,
This collection of preprints explores diverse applications of statistical and machine learning methods, with a notable emphasis on Bayesian techniques. Several papers focus on improving model robustness and uncertainty quantification. Jacobson et al. (2025) introduce WOMBAT v2.S, a Bayesian inversion framework that integrates solar-induced fluorescence (SIF) data with CO$_2$ concentration measurements to refine estimates of natural carbon fluxes. Similarly, Fazeliasl et al. (2025) propose a Bayesian nonparametric approach for robust mutual information estimation, enhancing the training of generative models. Paul et al. (2025) develop methods for quantifying uncertainty in cell tracking algorithms, drawing inspiration from both Bayesian inference and classification. These works collectively demonstrate the growing importance of incorporating uncertainty awareness into complex models.
Another recurring theme is the development of novel statistical models for specific applications. Ben Nasr et al. (2025) challenge the assumption of log-normality for wavelet leaders in multifractal analysis, proposing a new model based on log-concave distributions. Diebold et al. (2025) investigate the "wisdom of crowds" effect in macroeconomic forecasts, analyzing the performance of averaged forecasts as the number of respondents grows. Yang et al. (2025) introduce the beta-generalized Lindley distribution for modeling wind speed, demonstrating its superior fit compared to existing distributions. These papers highlight the ongoing need for tailored statistical models that accurately capture the nuances of specific data types and research questions.
Several preprints leverage machine learning for diverse tasks. Geng and Michailidis (2025) present a neural network-based change point detection method for large-scale time-evolving data. Oexner et al. (2025) benchmark various machine learning methods for risk prediction modeling from large-scale survival data, finding that penalized Cox proportional hazards models remain highly effective. Alyaev et al. (2025) introduce the DISTINGUISH workflow, a real-time, AI-driven system for geosteering during directional drilling, incorporating generative adversarial networks (GANs) and dynamic programming. These contributions showcase the versatility of machine learning in addressing complex problems across different domains.
Beyond these core themes, several papers explore specialized applications of statistical methods. Durham et al. (2025) propose a multilevel modeling approach for analyzing clustered SMARTs in health policy research. Bocchi et al. (2025) demonstrate the explainability and trustworthiness of GENEOnet, a group equivariant non-expansive operator network, in computational biochemistry. Shannon et al. (2025) leverage national forest inventory data and Bayesian spatio-temporal modeling to estimate forest carbon density. These diverse applications underscore the broad relevance of statistical and machine learning methods in addressing real-world challenges.
Finally, a subset of papers focuses on causal inference and its applications. Scutari et al. (2025) employ causal networks to model the interplay of environmental and mental factors in dermatitis using infodemiological data. Ahn et al. (2025) introduce SMAHP, a method for survival mediation analysis of high-dimensional proteogenomic data. Zhu et al. (2025) use causal inference to explore the impact of government policy on computer usage during the COVID-19 pandemic. These studies demonstrate the increasing use of causal inference techniques to disentangle complex relationships and inform policy decisions.
On the Wisdom of Crowds (of Economists) by Francis X. Diebold, Aaron Mora, Minchul Shin https://arxiv.org/abs/2503.09287
Caption: Crowd Size Signature Plot for Forecast Combination
The "wisdom of crowds" posits that aggregating diverse opinions can lead to accurate predictions. This concept has significant implications for economic forecasting, particularly when using surveys like the U.S. Survey of Professional Forecasters (SPF), which influences policy decisions. This research investigates how the predictive power of combined forecasts changes as the number of forecasters increases, focusing on real GDP growth and inflation forecasts.
The study analyzes SPF data using "crowd size signature plots," which track the mean squared error (MSE) of k-forecast averages as k (the number of forecasts in the average) grows. They also analyze the change in MSE from adding one more forecast to the average, as well as the overall performance of k-averaging relative to using no averaging at all. Complementing the empirical analysis, the researchers develop a theoretical model based on "equicorrelation" of forecast errors—all forecast error variances are identical, and all pairwise correlations are identical. The model's parameters are estimated by minimizing the difference between the empirical signature plots from the SPF and the theoretically derived signature plots from the equicorrelation model.
The results show a remarkable fit of the equicorrelation model to both growth and inflation forecasts. This suggests that simple averaging is a near-optimal or optimal strategy in this context. The study reveals that the gains from diversification (increasing k) diminish rapidly. For realistic correlation levels (around 0.5), most of the benefit is achieved by averaging just five forecasts. Interestingly, the benefits of diversification are more pronounced for inflation forecasts than for growth forecasts, likely due to the lower correlation among individual inflation forecasts. The analytical results for the equicorrelation model demonstrate that the MSE of a k-forecast average is given by: MSE(k; ρ, σ) = (σ²/k)[1 + (k-1)ρ], where ρ is the correlation between forecast errors and σ² is the variance of individual forecast errors.
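The diminishing-returns pattern implied by that formula is easy to verify numerically. The sketch below evaluates the equicorrelation MSE at the paper's "realistic" correlation of roughly 0.5 (a minimal illustration of the formula, not the authors' estimation code; the variance is normalized to 1):

```python
def mse_k_average(k, rho, sigma2=1.0):
    """MSE of an equal-weight average of k forecasts under equicorrelation:
    identical error variances sigma2 and identical pairwise correlations rho."""
    return (sigma2 / k) * (1 + (k - 1) * rho)

rho = 0.5  # roughly the correlation level the paper calls realistic
mse1, mse5 = mse_k_average(1, rho), mse_k_average(5, rho)
limit = rho  # MSE as k -> infinity, with sigma2 = 1
share = (mse1 - mse5) / (mse1 - limit)
print(f"MSE(1)={mse1:.2f}, MSE(5)={mse5:.2f}, limit={limit:.2f}; "
      f"k=5 captures {share:.0%} of the achievable reduction")
```

At ρ = 0.5, averaging five forecasts already cuts the MSE from 1.0 to 0.6, against a best-possible limit of 0.5, i.e. 80% of the achievable reduction.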
While the findings suggest that fewer forecasters than currently used in the SPF might suffice to capture most diversification benefits, the authors advise against prematurely reducing the panel size. They emphasize that the average MSE masks variation in performance across different combinations of forecasters. In any given period, the best-performing k-average could deviate significantly from the average performance. Future research could explore the performance of "best k-average" forecasts, potentially refining survey design. Furthermore, incorporating time-varying correlation structures into the model could enhance realism and provide deeper insights into forecast combination dynamics.
Multilevel Primary Aim Analyses of Clustered SMARTs: With Applications in Health Policy by Gabriel Durham, Anil Battalahalli, Amy Kilbourne, Andrew Quanbeck, Wenchu Pan, Tim Lycurgus, Daniel Almirall https://arxiv.org/abs/2503.08987
Caption: Comparison of Longitudinal and Static Methods for Estimating Causal Effects of Embedded cAIs on CBT Delivery
Clustered sequential multiple assignment randomized trials (cSMARTs) are valuable for optimizing adaptive interventions in health policy. However, current analytical methods for cSMARTs often focus on static, end-of-study outcomes, potentially overlooking important temporal dynamics. This paper introduces a novel three-level marginal mean modeling approach specifically designed for analyzing longitudinal, nested outcomes in cSMARTs. This innovation allows researchers to examine the dynamic effects of clustered adaptive interventions (cAIs) more thoroughly, enabling a broader range of causal inquiries.
The proposed methodology addresses the complexities of nested data structures common in health policy interventions, where repeated measurements are nested within individuals, who are further nested within clusters (e.g., schools). The method integrates established techniques for analyzing adaptive interventions with longitudinal outcomes and methods for comparing clustered adaptive interventions. It uses a three-level marginal mean model, denoted as μ<sub>t</sub>(d, X; θ), where d represents the embedded cAI, X denotes baseline covariates, and θ represents the parameters to be estimated. This model allows for flexible specification of temporal trends and treatment effects, accommodating the sequential nature of SMARTs. Estimation is performed using a weighted estimating equation that accounts for the fact that responding clusters in prototypical SMARTs are consistent with multiple embedded cAIs.
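The "replicate and weight" idea behind that estimating equation can be sketched with toy data. The cluster IDs, the 1/2 randomization probabilities, and the resulting weights of 2 and 4 below describe a generic prototypical design, not the ASIC study:

```python
# Toy prototypical clustered SMART: clusters are randomized to a first-stage
# option A1 in {-1, +1} with probability 1/2; responding clusters receive no
# second randomization, so their data are consistent with BOTH second-stage
# options; non-responding clusters are re-randomized to A2 with probability 1/2.
clusters = [
    {"cluster": 1, "a1": +1, "responder": True,  "a2": None},
    {"cluster": 2, "a1": +1, "responder": False, "a2": +1},
    {"cluster": 3, "a1": -1, "responder": True,  "a2": None},
]

analysis = []
for c in clusters:
    if c["responder"]:
        # Replicate the cluster under both consistent second-stage options;
        # each replicate carries weight 1 / P(A1 = a1) = 2.
        for a2 in (-1, +1):
            analysis.append({**c, "a2": a2, "weight": 2.0})
    else:
        # One observed path; weight 1 / (P(A1 = a1) * P(A2 = a2)) = 4.
        analysis.append({**c, "weight": 4.0})

print(len(analysis), sum(r["weight"] for r in analysis))  # 5 rows, total weight 12
```

The weighted estimating equation is then solved over this expanded dataset, so each embedded cAI's marginal mean draws on every cluster whose observed path is consistent with it.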
The method's effectiveness was evaluated using data from the Adaptive School-Based Implementation of CBT (ASIC) study, a cSMART designed to improve CBT adoption in Michigan high schools. The analysis showed that incorporating repeated measurements significantly improved statistical efficiency, even when the primary outcome was a static, end-of-study measure. Specifically, the 95% confidence intervals for pairwise comparisons of embedded cAIs were approximately 26% narrower using the longitudinal approach compared to the existing static approach. Moreover, the longitudinal analysis revealed dynamic treatment effects masked by the static analysis, suggesting that adding a facilitation component for slower-responding schools accelerated CBT delivery growth.
Simulation studies further validated the method's performance, demonstrating its consistency and negligible bias across various sample sizes. The simulations confirmed the efficiency gains observed in the ASIC data analysis, particularly with high within-unit correlation. Importantly, incorporating repeated measurements did not compromise statistical performance, even with linear underlying temporal trends. The study also explored different working variance modeling choices, suggesting that more flexible models, particularly heteroscedastic ones with respect to time, can further enhance estimator efficiency.
A Deep Bayesian Nonparametric Framework for Robust Mutual Information Estimation by Forough Fazeliasl, Michael Minyi Zhang, Bei Jiang, Linglong Kong https://arxiv.org/abs/2503.08902
Caption: This diagram illustrates the DPMINE framework, which leverages a Dirichlet Process prior to improve Mutual Information estimation. The DPMINE module interacts with both the encoder and generator of a generative model, enhancing stability and performance by maximizing MI between data and latent spaces. This leads to improved sample quality and mitigates mode collapse, as demonstrated in experiments with VAE-GAN models.
Estimating mutual information (MI) is essential for capturing dependencies between variables, but traditional methods struggle with high dimensionality and intractable likelihoods. Existing neural MI estimators, while promising, can be unstable and sensitive to sample variability. This paper presents a novel approach—the Dirichlet process mutual information neural estimation (DPMINE)—leveraging Bayesian nonparametric techniques for more robust and accurate MI estimation. DPMINE addresses the limitations of both Jensen-Shannon (JS)-based and Kullback-Leibler (KL)-based estimators by incorporating a Dirichlet process (DP) prior on the data distribution. This prior acts as a regularizer, smoothing the resulting distribution and reducing sensitivity to fluctuations and outliers, especially in small sample settings like mini-batches.
The core of DPMINE lies in constructing the MI loss with a finite representation of the DP posterior. This integrates prior knowledge and empirical data, creating a more stable and robust loss function. The approach effectively reduces variance in the estimation process, stabilizing gradients during training and improving the convergence of the MI approximation. Specifically, for KL-based MINE, the paper proves that the expected DP-based Donsker–Varadhan (DV) lower bound, E[L<sup>DP</sup><sub>DV</sub>], is asymptotically at least as large as the standard DV lower bound, L<sub>DV</sub>, providing a tighter bound on the true MI:
lim<sub>n,N→∞</sub> E[L<sup>DP</sup><sub>DV</sub>(f₁(X), f₂(X))] ≥ L<sub>DV</sub>(X₁, X₂)
This theoretical advantage translates to practical improvements.
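For orientation, here is a minimal NumPy sketch of the standard (non-DP) DV lower bound that DPMINE builds on. A closed-form critic for a bivariate Gaussian stands in for the trained neural critic, and the DP posterior weighting that distinguishes DPMINE is not shown:

```python
import numpy as np

def dv_lower_bound(t_joint, t_marginal):
    """Donsker-Varadhan bound: I(X;Y) >= E_p(x,y)[T] - log E_p(x)p(y)[exp(T)]."""
    return t_joint.mean() - np.log(np.exp(t_marginal).mean())

rng = np.random.default_rng(0)
rho, n = 0.8, 200_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
y_perm = rng.permutation(y)  # samples from the product of marginals

# Optimal critic (log density ratio, up to a constant) for this Gaussian pair,
# standing in for the neural critic a MINE-style estimator would train.
def critic(a, b):
    return (rho * a * b - 0.5 * rho**2 * (a**2 + b**2)) / (1 - rho**2)

est = dv_lower_bound(critic(x, y), critic(x, y_perm))
true_mi = -0.5 * np.log(1 - rho**2)  # about 0.511 nats for rho = 0.8
print(f"DV estimate: {est:.3f}  true MI: {true_mi:.3f}")
```

With a trained critic and small mini-batches, the `log`-of-sample-mean term is exactly where the instability the paper targets arises; the DP prior smooths the weights in that average.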
Experiments on synthetic and real-world datasets demonstrate DPMINE's effectiveness. In simulations, DPMINE consistently outperforms traditional empirical distribution function (EDF)-based methods, demonstrating better convergence and fewer estimation fluctuations, even in high-dimensional settings. This robustness to dimensionality is a key advantage. The paper also showcases DPMINE's application in refining the training of a Bayesian nonparametric VAE-GAN model. By maximizing MI between the data and latent spaces, DPMINE significantly reduces mode collapse, a common issue in generative models, and improves the quality and diversity of generated samples. For example, in 3D CT image generation using a COVID-19 dataset, DPMINE achieved a superior MS-SSIM score of 0.435 compared to α-WGAN+MINE (0.398) and BiGAN+MINE (0.023). Similar improvements were observed in FID and KID scores.
While the paper focuses on generative models, the DPMINE framework has broader implications for BNP learning procedures. Its ability to provide robust and accurate MI estimates makes it a valuable tool for representation learning, reinforcement learning, and Bayesian decision-making.
Leveraging national forest inventory data to estimate forest carbon density status and trends for small areas by Elliot S. Shannon, Andrew O. Finley, Paul B. May, Grant M. Domke, Hans-Erik Andersen, George C. Gaines III, Arne Nothdurft, Sudipto Banerjee https://arxiv.org/abs/2503.08653
Caption: This figure compares the performance of a Bayesian spatio-temporal model ("Full Model") against a direct estimator for forest carbon density across varying sample sizes (n<sub>j,t</sub>). The full model consistently demonstrates lower bias, RMSE, and narrower confidence intervals, particularly for small sample sizes, indicating improved precision and accuracy in estimating forest carbon.
Estimating forest parameters for small areas is challenging due to the cost of data collection and resulting data sparsity. Traditional design-based estimators are unreliable in these situations, and model-based approaches like the Fay-Herriot (FH) model often require unavailable direct estimates for small sample sizes. This paper introduces a novel Bayesian spatio-temporal small area estimation (SAE) model that directly uses plot-level National Forest Inventory (NFI) measurements and auxiliary data, bypassing the need for direct estimates and addressing the limitations of existing methods.
The proposed model is applied to estimate live forest carbon density (LFCD) across the contiguous United States (CONUS) using data from the U.S. Forest Service Forest Inventory and Analysis (FIA) program. The model incorporates plot-level LFCD measurements from FIA and remotely sensed tree canopy cover (TCC) as an auxiliary covariate. It accounts for spatial and temporal variability through temporally-varying and spatially-varying regression coefficients, and integrates a dynamic spatio-temporal intercept term, u<sub>j,t</sub>, modeled as a dynamically evolving Conditional Autoregressive (CAR) spatial random effect:
y<sub>i,j,t</sub> = x<sup>T</sup><sub>j,t</sub>β<sub>t</sub> + š<sup>T</sup><sub>j,t</sub>η<sub>j</sub> + u<sub>j,t</sub> + ε<sub>i,j,t</sub>
where y<sub>i,j,t</sub> is the LFCD observation at plot i in county j at time t, x<sub>j,t</sub> and š<sub>j,t</sub> are vectors of covariates, β<sub>t</sub> are temporally-varying regression coefficients, η<sub>j</sub> are spatially-varying regression coefficients, and ε<sub>i,j,t</sub> are residual error terms.
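To make the notation concrete, the snippet below simulates data from a model of this general form. The array sizes, the use of tree canopy cover as the single covariate for both varying-coefficient terms, and the random-walk stand-in for the dynamically evolving CAR intercept are all illustrative assumptions, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(42)
I, J, T = 8, 5, 10  # plots per county, counties, time points (toy sizes)

tcc = rng.uniform(0.0, 100.0, size=(J, T))   # county-level tree canopy cover
beta = 0.4 + 0.02 * rng.standard_normal(T)   # temporally varying TCC slope
eta = 0.05 * rng.standard_normal(J)          # spatially varying TCC slope
# Dynamic county-level intercept u_{j,t}; a simple random walk stands in for
# the dynamically evolving CAR spatial random effect used in the paper.
u = np.cumsum(0.5 * rng.standard_normal((J, T)), axis=1)
eps = 2.0 * rng.standard_normal((I, J, T))   # plot-level residual error

# y_{i,j,t} = x_{j,t} beta_t + s_{j,t} eta_j + u_{j,t} + eps_{i,j,t}
y = tcc * beta + tcc * eta[:, None] + u + eps
print(y.shape)  # (8, 5, 10): plots x counties x time points
```

Because the model is written at the plot level, every observed plot contributes directly, which is what lets the approach skip the intermediate direct estimates that FH-type models require.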
A simulation study using 100 replicates sampled from a simulated population mimicking FIA data characteristics compared the model to a design-based direct estimator. The proposed model consistently showed improved precision and accuracy, especially with small sample sizes. While both the full model and direct estimates exhibited some bias for small sample sizes, the overall bias and root mean squared error (RMSE) were consistently lower for the full model. Coverage percentages were similar, but the full model produced substantially narrower coverage intervals, indicating greater precision.
Applying the model to FIA data revealed valuable insights into county-level forest carbon dynamics across the CONUS. The model effectively captured abrupt changes in LFCD, such as those caused by disturbances like the 2013 Rim Fire in California, and provided more stable and informative estimates compared to the highly variable direct estimates. It also successfully leveraged data unusable in traditional design-based frameworks and FH models, such as cases with very small sample sizes or identical plot measurements. Estimated trends in LFCD revealed significant increases in areas like Northern Maine, the Southeast, and coastal western regions, while decreases were observed in the Sierra Nevadas, Northern Rockies, and parts of the Appalachians.
This newsletter highlights a convergence of innovative approaches in statistical and machine learning methodologies. From refining the "wisdom of crowds" in economic forecasting to enhancing the robustness of mutual information estimation using Bayesian nonparametrics, the preprints discussed showcase advancements in handling uncertainty and complexity. The development of specialized models, like the three-level marginal mean model for analyzing clustered SMARTs and the spatio-temporal model for estimating forest carbon density, demonstrates the increasing sophistication of statistical tools for addressing real-world challenges. The common thread weaving through these diverse applications is a focus on improving the reliability, interpretability, and practical utility of statistical and machine learning models. These advancements promise to significantly impact various fields, from health policy and environmental science to finance and artificial intelligence.