Subject: Cutting-Edge Advances in Statistical Methodology and Applications
Hi Elman,
This newsletter explores a collection of preprints showcasing diverse methodological advancements and applications across various domains, including imaging genetics, time series analysis, and network science. Su et al. (2024) introduce a novel linear model for imaging genetics based on canonical correlation analysis, tackling the challenges of high-dimensional data and spatial information inherent in brain imaging. Their method allows for simultaneous detection of significant brain regions and selection of relevant genetic variants associated with a specific phenotype, demonstrated through an analysis of reaction speed using UK Biobank data. For visualization and clustering, Bartoszek and Luo (2024) present the RMaCzek package for identifying clusters within Czekanowski's diagrams, providing a valuable tool for precise delineation of visually apparent clusters in data relationships. Kinoshita et al. (2024) leverage Jensen-Shannon divergence and rank distribution analysis to investigate spatio-temporal patterns in bike-sharing usage across six major cities, revealing consistent weekday/weekend usage patterns and highlighting the potential of such data for understanding urban mobility.
Several papers focus on novel approaches to time series analysis. GrandPre et al. (2024) propose a model-free method for directly estimating irreversibility from time series data, applying it to neural activity in the retina. Gao et al. (2024) revisit Principal Component Analysis (PCA) for temporal dimension reduction in time series, demonstrating its effectiveness in enhancing the computational efficiency of various deep learning models without compromising accuracy. Asgari et al. (2024) introduce functional structural equation modeling with latent variables, addressing the challenges of sparse data by modeling latent variables as Gaussian processes. Ahn et al. (2024) develop a Gamma-Gamma observation-driven state-space model for claim size modeling, extending previous work by allowing for flexible variance behavior and ensuring consistency with evolutionary credibility. Ng et al. (2024) employ ARIMA models with covariates to analyze the impact of the COVID-19 pandemic and the Great Recession on the U.S. rail freight industry, providing a framework for scenario construction and parameter selection.
Further contributions include advancements in statistical methodology and their applications. Klauenberg et al. (2024) identify key training needs in measurement uncertainty across various disciplines, emphasizing the importance of addressing technical topics such as Monte Carlo methods and multivariate measurands. Chen et al. (2024) utilize a recurrent neural network approach for predicting customer lifetime value in SaaS applications, incorporating multiple time dimensions and demonstrating improved accuracy compared to traditional models. Smolyak et al. (2024) propose FAIR (Functionally Adaptive Interaction Regularization), a novel framework for maximizing predictive performance across imbalanced subgroups while maintaining model interpretability, which is particularly relevant for healthcare applications. Ghosh et al. (2024) develop time series models for forecasting malaria cases in Indian states, integrating their findings into an interactive R Shiny tool for practical application.
Several preprints explore specific applications of statistical methods. Santucci and Lax (2024) investigate the "hangover effect" in professional sports, using bookmaker spreads to assess the impact of visiting "party cities" on subsequent game performance. Wu et al. (2024) examine the robustness of amortized Bayesian inference for cognitive models, proposing a data augmentation approach to improve robustness against contaminant observations. Kuchibhotla (2024) refines concentration inequalities for martingales with bounded increments, providing near-optimal results. Wang et al. (2024) explore the Magnitude-Shape Plot framework for anomaly detection in crowded video scenes, demonstrating its effectiveness compared to traditional functional detectors.
Finally, the collection also includes studies on social networks, econometrics, and other specialized areas. Jokiel-Rokita et al. (2024) present methods for estimating conditional inequality measures using quantile regression, demonstrating their application in analyzing salary inequalities. Oliveira et al. (2024) analyze the impact of homophily in networks, revealing "homophily traps" for minority groups. Lan (2024) investigates dynamic spillover effects in the cryptocurrency market before and after the pandemic. Several other preprints delve into specific applications ranging from renewable energy impact analysis (Suri et al., 2025) to network reconstruction limits (Murphy et al., 2025), COVID-19 vaccination rates (Hegde et al., 2025), quantum dot intensity fluctuations (Yang et al., 2025), Bayesian marketing mix modeling (Ravid, 2025), mobile health data analysis (Sun et al., 2025), mortality forecasting (Lim et al., 2025), agent-based modeling (O'Gara et al., 2025), Formula 1 competitiveness (Pedroche, 2025), family planning tools (Alkema et al., 2025), high-dimensional dynamical systems (Lin et al., 2025), LLM-generated text analysis (Park et al., 2025), outlier detection (Hu et al., 2025), optimal sampling strategies (Shen and Ning, 2025), novel statistical distributions (Vila and Quintino, 2025), robust hypothesis testing (Hilbert et al., 2025), tensor topic modeling (Liu and Donnat, 2025), AI-driven demand analysis (Bach et al., 2025), rail freight response analysis (Ng et al., 2025), epidemic modeling (Wairimu et al., 2025), policy evaluation (Nassiri et al., 2025), Markov-switching processes (Tsai et al., 2025), bias control in prior event rate ratio methods (Ma et al., 2024), network models of expertise (Rahman et al., 2024), and electrocardiogram classification (Frausto-Avila et al., 2024).
Direct estimates of irreversibility from time series by Trevor GrandPre, Gianluca Teza, William Bialek https://arxiv.org/abs/2412.19772
Caption: Extrapolation of D̅N(T) to the N → ∞ limit for accurate estimation of KL divergence.
The arrow of time, a fundamental concept in physics, can be quantified by the Kullback-Leibler (KL) divergence (D<sub>KL</sub>) between the distributions of forward and reverse trajectories in a system. Traditional methods for estimating this divergence often rely on specific models, which can introduce errors if the model is incorrect. This paper presents a model-free method to directly estimate irreversibility from time series data, circumventing the limitations of model-dependent approaches. Crucially, the method addresses the challenge of finite sample sizes by correcting for systematic errors that arise from limited data.
The method focuses on analyzing trajectories over a time window T. A trajectory, denoted γ<sub>T</sub>, is drawn from a distribution P<sub>T</sub>(γ<sub>T</sub>); the time-reversed trajectory, γ̃<sub>T</sub>, has a potentially different distribution. The KL divergence, D<sub>KL</sub>[P<sub>T</sub>(γ<sub>T</sub>)||P<sub>T</sub>(γ̃<sub>T</sub>)], quantifies the evidence for the arrow of time. The method discretizes the trajectory into "words" and uses the frequency of these words to estimate the underlying probability distributions. However, plugging these empirical estimates directly into the D<sub>KL</sub> formula leads to systematic errors, especially for finite datasets.
The paper addresses this by identifying and correcting for the expected dependence of these errors on the dataset size N, extrapolating to the N → ∞ limit. This extrapolation is justified by the decreasing correlation between consecutive words as word length increases. The systematic error in the D<sub>KL</sub> estimate takes the form A/N + B/N<sup>2</sup> + ..., where A depends on the number of possible words and their time-reversed counterparts, and B is non-universal. The method's accuracy is validated using simulated data from both equilibrium and non-equilibrium systems. In a two-state Markovian system, where the true D<sub>KL</sub> is zero, the method correctly recovers D<sub>KL</sub> = 0 within error bars. For a three-state system with broken detailed balance, the method accurately estimates the entropy production rate σ, consistent with theoretical predictions.
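The pipeline described above can be sketched compactly: discretize the series into words, form the plug-in D<sub>KL</sub> between forward and time-reversed word frequencies, and fit away the leading 1/N bias. The code below is an illustrative reconstruction, not the authors' implementation; the binary alphabet, word length, and subsample fractions are arbitrary choices.

```python
import numpy as np
from collections import Counter

def kl_words(seq, L):
    """Plug-in KL divergence between the empirical distributions of
    forward words of length L and their time-reversed counterparts."""
    fwd = Counter(tuple(seq[i:i + L]) for i in range(len(seq) - L + 1))
    n = sum(fwd.values())
    # Words whose reversal was never observed are skipped (they would
    # contribute +inf to the true divergence).
    return sum((c / n) * np.log(c / fwd[w[::-1]])
               for w, c in fwd.items() if fwd[w[::-1]] > 0)

def kl_extrapolated(seq, L, fracs=(0.25, 0.5, 1.0)):
    """Fit D_N = D_inf + A/N over subsamples by least squares and
    return the intercept, i.e. the N -> infinity extrapolation."""
    Ns = [int(len(seq) * f) for f in fracs]
    Ds = [kl_words(seq[:n], L) for n in Ns]
    M = np.column_stack([np.ones(len(Ns)), 1.0 / np.asarray(Ns)])
    coef, *_ = np.linalg.lstsq(M, np.asarray(Ds), rcond=None)
    return coef[0]
```

For a reversible (detailed-balance) two-state chain the extrapolated estimate should sit near zero, while the raw plug-in value carries the small positive A/N bias the paper corrects for.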
Applying this method to experimental data from salamander retinal ganglion cells responding to naturalistic movies reveals evidence for irreversibility in single neurons. This finding highlights the non-Markovian nature of neural activity, as Markovian models of binary spike trains would necessarily exhibit D<sub>KL</sub> = 0. Furthermore, the study observes a super-linear growth of D<sub>KL</sub>(T) with increasing time window T on perceptually relevant timescales (40-200 ms), indicating a synergistic accumulation of evidence for the arrow of time. This model-free approach provides a robust way to quantify irreversibility directly from time series data, avoiding the pitfalls of potentially incorrect models.
Maximizing Predictive Performance for Small Subgroups: Functionally Adaptive Interaction Regularization (FAIR) by Daniel Smolyak, Courtney Paulson, Margrét V. Bjarnadóttir https://arxiv.org/abs/2412.20190
Caption: This bar chart compares the Mean Squared Error (MSE) of FAIR against other linear modeling approaches (group indicator, joint lasso, separate linear) across subgroups (0, 1) and overall (total) in a prediction task. FAIR demonstrates superior performance, particularly for subgroup 1, highlighting its ability to improve predictions for smaller, potentially underrepresented groups while maintaining comparable performance overall.
In healthcare, applying machine learning models requires a strong emphasis on fairness, ensuring that models perform equally well across different patient groups regardless of size or characteristics. Traditional "one-size-fits-all" models, created by pooling the entire population, can lead to suboptimal care for certain groups, especially those with markedly different outcomes. The need for interpretable models in clinical settings often restricts modelers to linear models, even when more complex models might offer improved accuracy. However, maximizing performance across groups while maintaining interpretability presents several challenges, including heterogeneous covariate effects, a large number of covariates, and imbalanced group representation in datasets.
This paper introduces FAIR (Functionally Adaptive Interaction Regularization), a novel modeling framework designed to address these challenges and maximize performance across imbalanced subgroups. FAIR builds upon familiar linear regression approaches commonly used in healthcare. It employs a full linear interaction model between group membership and all other covariates, incorporating sample weighting by group size and independent regularization penalties for each group. This approach balances learning from larger groups with tailoring predictions to smaller focal groups, while preserving model interpretability. The FAIR objective function is:
min<sub>β=[β₁...β<sub>K</sub>]</sub> { (1/n₁)||y₁ − X₁β₁||² + Σ<sub>k=2</sub><sup>K</sup> (1/nₖ)||yₖ − Xₖ(β₁ + βₖ)||² + Σ<sub>k=1</sub><sup>K</sup> λₖ||βₖ||ᵣ }
where y<sub>k</sub> and X<sub>k</sub> represent outcomes and features for group k, β₁ are base coefficients, β<sub>k</sub> are interaction coefficients for group k, n<sub>k</sub> is the size of group k, and λ<sub>k</sub> is the regularization penalty for group k.
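To make the objective concrete, here is a minimal gradient-descent sketch of the FAIR fit. It is not the authors' implementation: a squared (ridge) penalty stands in for the generic ||·||ᵣ norm so the objective is smooth, and the step size and iteration count are illustrative.

```python
import numpy as np

def fair_fit(Xs, ys, lams, steps=2000, lr=0.05):
    """Minimize sum_k (1/n_k)||y_k - X_k(b_1 + b_k)||^2 + sum_k lam_k||b_k||^2
    by gradient descent. B[0] holds the base coefficients beta_1; B[k] for
    k >= 1 holds the interaction offsets for groups 2..K."""
    K, p = len(Xs), Xs[0].shape[1]
    B = np.zeros((K, p))
    for _ in range(steps):
        G = np.zeros_like(B)
        for k in range(K):
            nk = len(ys[k])
            coef = B[0] + (B[k] if k > 0 else 0.0)  # group 1 uses beta_1 alone
            g = (2.0 / nk) * Xs[k].T @ (Xs[k] @ coef - ys[k])
            G[0] += g              # base coefficients see every group's loss
            if k > 0:
                G[k] += g          # offsets see only their own group's loss
        G += 2.0 * lams[:, None] * B  # per-group ridge penalties
        B -= lr * G
    return B
```

Shrinking βₖ toward zero pulls group k toward the pooled fit, which is exactly the balance between learning from large groups and tailoring to small focal groups that the paper describes.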
The performance of FAIR was evaluated through numerical experiments and a real-world application using a diabetes patient dataset. In simulations, FAIR consistently outperformed baseline models (separate models and group indicator) and often outperformed the joint Lasso, a related method, across various data conditions. Specifically, FAIR maintained superior performance even with variations in group size, noise levels, and the number of differing coefficients between groups. In the real-world application predicting hospital length of stay, FAIR again outperformed the comparison models for the smaller "Injury" diagnosis group while maintaining comparable performance for the larger "Respiratory" diagnosis group. Furthermore, FAIR demonstrated a significant speed advantage, being 10-19 times faster than the joint Lasso implementation.
Testing and Improving the Robustness of Amortized Bayesian Inference for Cognitive Models by Yufei Wu, Stefan Radev, Francis Tuerlinckx https://arxiv.org/abs/2412.20586
Caption: Empirical influence functions for different t-distributions.
Outliers pose a significant challenge in data analysis, and cognitive modeling is no exception. Traditional methods for handling outliers, such as hard cutoffs or data transformations, can be arbitrary and reduce statistical power. This paper addresses the challenge of outliers in the context of amortized Bayesian inference (ABI), a powerful technique that uses deep learning to accelerate Bayesian parameter estimation. The authors focus on the Drift Diffusion Model (DDM), a widely used model for analyzing reaction time data in cognitive psychology, which is known to be sensitive to outliers.
The core idea is to enhance the robustness of ABI by incorporating contaminants directly into the training process. Instead of training the neural density estimator solely on pristine simulated data, the authors inject outliers into the simulated data, effectively exposing the network to corrupted data during training. Several contamination distributions were tested, including t-distributions with varying degrees of freedom and a uniform distribution. The performance of these "robust" estimators was then evaluated using tools from robust statistics: the empirical influence function (EIF) and the breakdown point (BP). The EIF quantifies the influence of an outlier on parameter estimates, while the BP represents the minimum proportion of contamination that can cause the estimator to break down.
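The training-time contamination step is easy to sketch. The snippet below is an assumed, simplified version of the authors' augmentation: a fraction eps of simulated reaction times is perturbed by draws from a t-distribution (df=1 recovers the Cauchy case), with a positivity clip since reaction times cannot be negative. The function name and parameter defaults are illustrative.

```python
import numpy as np

def contaminate(rts, eps=0.05, df=1, scale=0.5, rng=None):
    """Inject outliers into simulated reaction times: each value is
    replaced with probability eps by itself plus a scaled t-distributed
    shift (df=1 -> Cauchy), clipped to stay positive."""
    rng = np.random.default_rng(rng)
    rts = np.asarray(rts, dtype=float).copy()
    mask = rng.random(rts.shape) < eps
    noise = rng.standard_t(df, size=int(mask.sum())) * scale
    rts[mask] = np.clip(rts[mask] + noise, 1e-3, None)
    return rts
```

Training the neural density estimator on such contaminated draws, rather than pristine simulations, is what exposes the network to corrupted data and yields the robust estimators evaluated below.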
The results demonstrate that training with contaminated data significantly improves the robustness of ABI for both a simple toy example (estimating the mean μ of a normal distribution) and the more complex DDM. In both cases, using a Cauchy distribution (a t-distribution with 1 degree of freedom) for contamination yielded the most robust estimator. For the toy example, the EIF of the robust estimator closely resembled that of Tukey's biweight function, a well-established robust estimator in classical statistics. For the DDM, the robust estimators showed substantially less sensitivity to both short and long outliers compared to the standard estimator. For instance, the BP for the non-decision time parameter T<sub>er</sub> increased from ~5% to ~20% when using the Cauchy contamination.
This robustness, however, comes at a cost. The robust estimators exhibited some loss in efficiency, meaning they had higher posterior variance compared to the standard estimator when applied to uncontaminated data. For the toy example, the efficiency loss was around 9% for the Cauchy-based estimator, comparable to efficiency losses observed with traditional robust estimators. For the DDM, the efficiency losses were higher, ranging up to 43% for certain parameters, highlighting the inherent trade-off between robustness and efficiency in statistical estimation.
Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs by Harit Vishwakarma, Alan Mishler, Thomas Cook, Niccolò Dalmasso, Natraj Raman, Sumitra Ganesh https://arxiv.org/abs/2501.00555
Caption: The plots show the accuracy of LLMs on MMLU benchmarks with varying coverage parameters (α) after applying CROQ with CP-OPT and logit scores, compared to baseline accuracy. CROQ with CP-OPT generally improves accuracy across different model sizes and MMLU benchmark levels (4, 10, 15) compared to the baseline and using logits, especially for Llama-3 on MMLU-15 with a maximum improvement of 7.24%.
Large language models (LLMs) are increasingly utilized for decision-making tasks, such as answering multiple-choice questions (MCQs) and selecting appropriate tools/APIs. However, LLMs can be overconfident in their predictions, even when incorrect, which poses risks in high-stakes applications. This paper introduces a two-pronged approach to enhance both the safety and accuracy of LLM-driven decision-making: CP-OPT, a framework for optimizing conformal prediction scores, and CROQ, a novel method inspired by the Monty Hall problem.
Conformal prediction (CP) is a model-agnostic technique for quantifying uncertainty by generating prediction sets that contain the true answer with a user-specified probability (e.g., 95%). While CP offers coverage guarantees regardless of the score function used, the quality of the score significantly impacts the size of the prediction sets. CP-OPT addresses this by learning scores that minimize set sizes while maintaining coverage. The authors formulate an optimization problem to find the optimal score function g and threshold τ:
ĝ, τ̂ = arg min<sub>g:X×Y→ℝ, τ∈ℝ</sub> S(g, τ)  subject to  P(g, τ) ≥ 1 − α
where S(g, τ) is the expected prediction-set size and P(g, τ) is the coverage probability at threshold τ. They use differentiable surrogates and empirical estimates to solve this problem in practice.
CROQ (Conformal Revision Of Questions), inspired by the Monty Hall problem, leverages the prediction sets generated by CP to improve accuracy. The method revises the MCQ or tool usage prompt by narrowing down the available choices to those within the prediction set. The smaller number of choices increases the LLM's chances of selecting the correct answer, analogous to the Monty Hall problem where eliminating incorrect choices (opening doors with goats) increases the probability of selecting the correct choice (the door with the car).
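A split-conformal sketch of the two steps: calibrate a threshold τ so that prediction sets cover the true option with probability at least 1 − α, then, as in CROQ, restrict the revised prompt to the options in the set. This is an illustration under assumptions, not the paper's code: synthetic softmax scores stand in for LLM outputs, and the standard 1 − p̂ nonconformity score replaces CP-OPT's learned score.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration: nonconformity is 1 - p(true option);
    tau is the finite-sample-corrected (1 - alpha) empirical quantile."""
    n = len(cal_labels)
    nonconf = 1.0 - cal_probs[np.arange(n), cal_labels]
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(nonconf, q, method="higher")

def prediction_set(probs, tau):
    """Options whose nonconformity is at most tau; in CROQ these become
    the reduced choice list presented to the LLM in the revised prompt."""
    return np.flatnonzero(1.0 - probs <= tau)
```

On exchangeable calibration and test data, the sets cover the true option at roughly the nominal 1 − α rate while typically containing far fewer than all the choices, which is what makes the Monty Hall-style re-asking step useful.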
This newsletter highlights a diverse range of advancements in statistical methodology and their applications. From novel techniques for estimating irreversibility in time series data to innovative frameworks for maximizing predictive performance in imbalanced subgroups, the preprints discussed showcase the evolving landscape of statistical research. The development of FAIR regression offers a practical solution for improving predictions in healthcare settings, while the robustness enhancements to amortized Bayesian inference address the pervasive challenge of outliers in cognitive modeling. Furthermore, the application of conformal prediction and the Monty Hall problem-inspired CROQ method demonstrates a promising avenue for improving the safety and accuracy of LLM-driven decision-making. These advancements collectively contribute to a more robust and reliable application of statistical methods across diverse domains.