Subject: Cutting-Edge Advancements in Statistical Methodology and AI
Hi Elman,
This collection of preprints showcases a vibrant research landscape, pushing the boundaries of statistical methodology, machine learning, and their applications across various domains. A notable emphasis on Bayesian methods, network analysis, and handling complex data structures pervades the collection. Several papers focus on improving model accuracy and robustness in challenging scenarios. For instance, Zhu et al. (2025) introduce BISON, a unified Bayesian approach for bi-clustering spatial omics data that addresses the double-dipping issue in feature selection. Similarly, Han et al. (2025) delve into the theoretical properties of Winsorized PCA for subspace recovery, demonstrating its robustness to outliers. In the realm of causal inference, Montoya et al. (2025) apply the resource-constrained optimal dynamic treatment rule (RC ODTR) SuperLearner algorithm to HIV care retention, offering practical guidance and novel presentations of the RC ODTR. Meanwhile, Philipps et al. (2025) compare methods for handling intermittently measured, error-prone covariates in survival analysis, advocating for multiple imputation and joint modeling.
Another recurring theme is developing novel statistical models for specific applications. Lu et al. (2025) propose a zero-inflated Poisson latent position cluster model for analyzing network data with missing values, introducing a novel partially collapsed Markov chain Monte Carlo algorithm. Genest et al. (2025) extend previous work on noncentral Wishart mixtures to test random effects in factorial design models. In environmental epidemiology, Zhang et al. (2025) introduce a faster version of Bayesian Kernel Machine Regression (BKMR) using random Fourier features, enabling efficient estimation of joint health effects from multiple exposures. Addressing a different application, Ning et al. (2025) investigate imbalanced regression loss functions for forecasting marine heatwaves, highlighting the improved performance of specialized losses like balanced MSE.
Several contributions focus on practical applications and real-world data analysis. Glynn et al. (2025) explore multi-indication meta-analysis methods for health technology assessment, demonstrating the potential for reducing uncertainty in decision-making. Orme et al. (2025) propose a novel multi-view biclustering approach based on non-negative matrix tri-factorization, introducing the bisilhouette score as a bicluster-specific evaluation metric. Reluga et al. (2025) develop a causal small area estimation framework to analyze the impact of job stability on poverty in Italy. Furthermore, Gabashvili and Allsup (2025) analyze resident turnover and community satisfaction in active lifestyle communities, revealing complex patterns and highlighting the need for sophisticated longitudinal tracking.
Beyond traditional statistical modeling, several papers explore innovative applications of machine learning and large language models (LLMs). Yardimci and Cavus (2025) introduce a Rashomon perspective for uncertainty quantification in survival predictive maintenance models. Seri et al. (2025) compare recurrent and graph neural networks for sustainable greenhouse management. Zhou et al. (2025) explore Chain-of-Thought (CoT) reasoning with LLMs in chemical engineering. Kuzmanko (2025) evaluates the performance of LLMs in quantitative management problem-solving, revealing limitations in precision despite promising capabilities. Finally, Chuang et al. (2025) investigate the use of LLMs for scoring corporate climate disclosures and detecting greenwashing.
Beyond Words: How Large Language Models Perform in Quantitative Management Problem-Solving by Jonathan Kuzmanko https://arxiv.org/abs/2502.16556
Large language models (LLMs) are transforming various fields, but their effectiveness in quantitative management decisions requires further investigation. Jonathan Kuzmanko's study examines the zero-shot performance of five leading LLMs—Llama 3.3 70b, Gemini 1.5 Pro, Grok, GPT-4o, and Claude 3.5 Sonnet—on quantitative management problems involving calculations and constraints. Analyzing 900 responses across 20 diverse scenarios, the research explores the impact of presentation format, complexity, and repeated attempts on accuracy.
Contrary to prior findings, the format (direct, narrative, or tabular) and length of the provided information did not significantly affect the LLMs' accuracy. However, complexity, particularly constraints and irrelevant parameters, considerably hampered performance. Surprisingly, tasks requiring multiple solution steps were handled more effectively than anticipated. A key finding is the precision issue: only 28.8% of responses were exactly correct. While some models demonstrated slightly better binary accuracy, no significant "learning effect" emerged across iterations. Specifically, Claude 3.5 Sonnet had the highest overall accuracy with a mean logarithmic distance of 0.205 (SD=0.251), followed by GPT-4o with 0.228 (SD=0.264). Interestingly, irrelevant parameters negatively correlated with accuracy, while solution steps positively correlated with binary accuracy.
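To make the headline metric concrete, here is a minimal sketch of a mean logarithmic distance between numeric LLM answers and ground truth. The paper's exact definition (log base, handling of sign) is not reproduced in this summary, so the base-10 ratio form below is an assumption for illustration only:

```python
import math

def mean_log_distance(truths, preds):
    """Mean logarithmic distance between predicted and true numeric answers.

    Assumption: distance = |log10(pred / truth)| with both values > 0;
    the study's precise definition may differ. 0.0 means exact agreement,
    1.0 means off by a factor of ten on average.
    """
    dists = [abs(math.log10(p / t)) for t, p in zip(truths, preds)]
    return sum(dists) / len(dists)
```

Under this definition, a reported mean distance of ~0.2 corresponds to answers that are, on average, within a factor of about 1.6 of the correct value.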
These findings underscore both the potential and the limitations of LLMs in quantitative decision-making. Their ability to handle multi-step reasoning is promising, but the lack of precision poses a significant challenge. The study suggests that organizations should carefully evaluate LLMs before deployment, considering model capabilities and task characteristics. The robustness to presentation format is encouraging, but the sensitivity to complexity highlights the need for meticulous prompt engineering and structured information. Further research is necessary to explore the relationship between task complexity and LLM performance and investigate task-specific training and advanced prompting techniques for accuracy improvement.
A Study on Monthly Marine Heatwave Forecasts in New Zealand: An Investigation of Imbalanced Regression Loss Functions with Neural Network Models by Ding Ning, Varvara Vetrova, Sébastien Delaux, Rachael Tappenden, Karin R. Bryan, Yun Sing Koh https://arxiv.org/abs/2502.13495
Caption: Location: CR, Lead Month: 1, Data: Training
Marine heatwaves (MHWs), prolonged periods of extremely high ocean temperatures, pose significant threats to marine ecosystems and industries. Accurate forecasting, especially months in advance, is crucial for mitigation. However, predicting MHWs presents a challenging imbalanced regression problem due to the rarity of extreme temperature anomalies compared to moderate conditions. This study addresses this challenge by investigating the performance of various loss functions with a fully-connected neural network (FCN) for monthly MHW forecasts at 12 locations around New Zealand.
Using sea surface temperature anomaly (SSTA) data from the Simple Ocean Data Assimilation (SODA) dataset, the research employed SSTAs as both predictors and the target variable. MHWs were defined as events where SSTAs exceeded the 90th percentile of monthly climatology. The study compared standard loss functions like Mean Squared Error (MSE) and Mean Absolute Error (MAE) with specialized imbalanced regression loss functions, including Huber loss, weighted MSE, focal-R, balanced MSE, and a novel scaling-weighted MSE. The proposed scaling-weighted MSE incorporates anomaly magnitude and controllable hyperparameters (α, β, w<sub>90%</sub>, w<sub>80%</sub>) to balance false alarms and missed detections. The formula is:
L(y, ŷ) = (1/N) Σ<sub>i=1</sub><sup>N</sup> l<sub>i</sub>
where:
l<sub>i</sub> = (w<sub>90%</sub>· |y<sub>i</sub>|)<sup>β</sup>(y<sub>i</sub> - ŷ<sub>i</sub>)<sup>2</sup>; y<sub>i</sub> > y<sub>90%</sub>
l<sub>i</sub> = (w<sub>80%</sub>· |y<sub>i</sub>|)<sup>β</sup>(y<sub>i</sub> - ŷ<sub>i</sub>)<sup>2</sup>; y<sub>80%</sub> < y<sub>i</sub> ≤ y<sub>90%</sub>
l<sub>i</sub> = (|y<sub>i</sub>|)<sup>β</sup>(y<sub>i</sub> - ŷ<sub>i</sub>)<sup>2</sup>; otherwise
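The piecewise loss displayed above can be sketched in a few lines. This implements only the displayed terms (β, w<sub>90%</sub>, w<sub>80%</sub>, and the percentile thresholds); the paper's α hyperparameter does not appear in the displayed formula and is presumably part of the wider training setup, so it is omitted here. Default weight values are illustrative, not the authors':

```python
def scaling_weighted_mse(y, y_hat, y80, y90, w80=2.0, w90=4.0, beta=0.5):
    """Scaling-weighted MSE as displayed in Ning et al. (2025): squared errors
    are weighted by the anomaly magnitude |y_i|, up-weighted by w90 above the
    90th-percentile threshold (MHWs) and by w80 in the 80th-90th percentile
    band (suspected MHWs). w80/w90 defaults here are illustrative.
    """
    def weight(yi):
        if yi > y90:                       # MHW band: (w90 * |y_i|)^beta
            return (w90 * abs(yi)) ** beta
        if yi > y80:                       # suspected-MHW band: (w80 * |y_i|)^beta
            return (w80 * abs(yi)) ** beta
        return abs(yi) ** beta             # otherwise: |y_i|^beta
    losses = [weight(yi) * (yi - yh) ** 2 for yi, yh in zip(y, y_hat)]
    return sum(losses) / len(losses)
```

The design intent is visible directly in the code: larger anomalies and anomalies above the MHW thresholds contribute disproportionately to the loss, pushing the network to fit extremes rather than just average conditions.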
Models were trained to predict SSTAs one, two, three, and six months ahead, evaluated using MSE for overall SSTA prediction and Critical Success Index (CSI) for MHW prediction. Short-term (one-month) forecasts were significantly more accurate than longer lead times. Standard MSE and MAE excelled at predicting average conditions but struggled with extremes. Specialized loss functions, particularly balanced MSE and the scaling-weighted MSE, significantly improved MHW and suspected MHW (SSTAs between 80th and 90th percentiles) prediction. For one-month lead time, balanced MSE achieved an average CSI of 0.38, while scaling-weighted MSE (α = 2, β = 0.5) achieved 0.37. Longer lead times saw performance degradation across all models, with high "perfect underfitting" rates. The scaling-weighted MSE was the only loss function capable of producing somewhat meaningful long-term forecasts, albeit with lower accuracy.
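The Critical Success Index used for evaluation is a standard categorical verification score; a small sketch (not the authors' code) shows how it is computed from binary event/forecast pairs:

```python
def critical_success_index(obs_event, pred_event):
    """Critical Success Index (threat score): hits / (hits + misses + false alarms).

    obs_event / pred_event are iterables of booleans marking whether an MHW
    occurred / was forecast in each month. Correct negatives do not enter the
    score, which is why CSI suits rare events like MHWs.
    """
    hits = misses = false_alarms = 0
    for o, p in zip(obs_event, pred_event):
        if o and p:
            hits += 1
        elif o and not p:
            misses += 1
        elif p and not o:
            false_alarms += 1
    denom = hits + misses + false_alarms
    return hits / denom if denom else float("nan")
```

A CSI of 0.38, as reported for balanced MSE at one-month lead, means that of all months where an MHW was either observed or forecast, 38% were correctly forecast hits.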
This study underscores the importance of tailored loss functions for imbalanced regression, especially for rare, high-impact events like MHWs. While standard loss functions may suffice for average conditions, capturing extremes requires specialized approaches. The scaling-weighted MSE offers a promising avenue for improved MHW prediction, especially in balancing extreme capture and average condition accuracy. Future research should explore incorporating additional predictors and more sophisticated network architectures to enhance long-lead MHW forecasts, crucial for effective climate change adaptation.
Bridging the Data Gap in AI Reliability Research and Establishing DR-AIR, a Comprehensive Data Repository for AI Reliability by Simin Zheng, Jared M. Clark, Fatemeh Salboukh, Priscila Silva, Karen da Mata, Fenglian Pan, Jie Min, Jiayi Lian, Caleb B. King, Lance Fiondella, Jian Liu, Xinwei Deng, Yili Hong https://arxiv.org/abs/2502.12386
Caption: (a) MAE comparison for Setting I (b) MAE comparison for Setting II
The rapid progress of AI technology necessitates ensuring the reliability of these systems, as public trust hinges on their dependable performance. However, AI reliability research, particularly in academia, faces a major obstacle: the scarcity of readily available data. This paper addresses this gap by reviewing existing data, introducing key measurements and data types, and establishing DR-AIR, a comprehensive public data repository specifically for AI reliability research.
The paper outlines crucial measurements and data types for assessing AI reliability, including binary data (pass/fail), count data (number of failures), continuous measurement data (e.g., mean AUC), time-to-event data, recurrent event data (e.g., disengagements in autonomous vehicles), and degradation data. Covariates, such as algorithm type, dataset source, and simulation settings, are essential for AI reliability analysis and are incorporated into models using techniques like generalized linear models (GLMs) for binary and count data, and accelerated failure time (AFT) models for time-to-event data. For recurrent event data, the intensity function is modeled, often using a non-homogeneous Poisson process (NHPP).
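For the recurrent-event case, a common concrete choice of NHPP is the power-law (Crow-AMSAA) process; the sketch below illustrates the intensity/mean-function relationship for such data (e.g., autonomous-vehicle disengagements). The power-law form and parameter values are illustrative assumptions, not necessarily the paper's specification:

```python
def power_law_mean_events(t, beta, eta):
    """Expected cumulative number of events by time t under a power-law NHPP
    with intensity lambda(t) = (beta/eta) * (t/eta)**(beta - 1), giving the
    mean function Lambda(t) = (t/eta)**beta. beta < 1 implies a decreasing
    intensity (reliability growth); beta > 1 implies deterioration.
    """
    return (t / eta) ** beta

def power_law_intensity(t, beta, eta):
    """Instantaneous event intensity lambda(t) of the same process."""
    return (beta / eta) * (t / eta) ** (beta - 1)
```

Covariates such as algorithm type or simulation setting would typically enter by making eta (or the whole intensity) a log-linear function of the covariate vector.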
The review of existing datasets suitable for AI reliability research includes examples like the AI Incident Database, which catalogs reported AI failures; datasets from controlled experiments testing algorithm robustness under different class imbalance scenarios; and datasets from physics-based simulations of autonomous vehicles, focusing on error propagation within the perception system. These examples illustrate the diverse nature of AI reliability data and the various collection methods, including laboratory tests, field tracking studies, and virtual simulations.
The central contribution is the establishment of DR-AIR, a publicly accessible repository curated specifically for AI reliability research. DR-AIR provides a centralized platform for researchers to access and share valuable reliability data, fostering collaboration and promoting new methods in this critical field. The repository includes detailed descriptions of each dataset and its variables, facilitating data understanding and usage. The datasets are freely available under the GPL-3.0 license, encouraging open access and collaboration.
The paper concludes with a call for continued contribution and sharing of AI reliability data within the research community. The authors emphasize the field's rapid evolution and the importance of a collective effort to build a robust and accessible data resource. This collaborative approach is crucial for advancing AI reliability research, leading to improved models, methodologies, and a deeper understanding of the factors influencing AI system dependability, ultimately contributing to greater public trust in AI technology.
BISON: Bi-clustering of spatial omics data with feature selection by Bencong Zhu, Alberto Cassese, Marina Vannucci, Michele Guindani, Qiwei Li https://arxiv.org/abs/2502.13453
Caption: This figure displays the performance of BISON compared to other spatial clustering methods (SpaRTaCo, BC, sparseBC, and K-means) across simulated datasets with varying gene counts (p = 500, 1000), signal strengths (Δ = 0.5, 1, 1.5), and proportions of non-discriminating genes. BISON consistently achieves the highest Adjusted Rand Index (ARI), a measure of clustering accuracy, demonstrating its superior performance in identifying spatial domains.
Spatially resolved transcriptomics (SRT) has revolutionized genomic studies, but analyzing the high-dimensional data it generates remains a challenge. Identifying spatially variable genes (SVGs), particularly spatial domain-marker SVGs, is crucial for understanding biological mechanisms. Existing methods, often two-stage approaches, suffer from "double-dipping," leading to inflated false positives. These methods can also overlook locally expressed genes. BISON (Bi-clustering of spatial omics data with feature selection), a new Bayesian method, addresses these challenges by simultaneously identifying informative genes and clustering both genes and spatial spots.
BISON utilizes a multivariate Poisson model, Y<sub>ji</sub>|z<sub>i</sub> = k, p<sub>j</sub> = r ~ Poi(s<sub>i</sub>g<sub>j</sub>μ<sub>rk</sub>), to directly model SRT count data, eliminating the need for ad-hoc normalization. It incorporates feature selection to identify spatial domain-specific discriminating genes (DGs), providing a lower-dimensional, biologically interpretable representation. A Markov random field (MRF) prior accounts for the spatial structure of SRT data, promoting contiguous spatial domain identification. A modified integrated complete likelihood (mICL) criterion determines the optimal number of gene and spot clusters.
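The likelihood at the core of the model above is easy to write down. This is a minimal sketch of a single count's log-likelihood under Y<sub>ji</sub> ~ Poi(s<sub>i</sub>g<sub>j</sub>μ<sub>rk</sub>); BISON itself places priors on these terms (including the MRF prior on spot labels) and samples them via MCMC, none of which is shown here:

```python
from math import lgamma, log

def poisson_loglik(y, s_i, g_j, mu_rk):
    """Log-likelihood of one SRT count Y_ji under the BISON-style model
    Y_ji ~ Poisson(s_i * g_j * mu_rk), where s_i is a spot-level size factor,
    g_j a gene-level factor, and mu_rk the mean for gene cluster r in spatial
    domain k. log P(y) = y*log(rate) - rate - log(y!).
    """
    rate = s_i * g_j * mu_rk
    return y * log(rate) - rate - lgamma(y + 1)
```

Because the size factors s<sub>i</sub> and g<sub>j</sub> enter the rate multiplicatively, the raw counts are modeled directly and no ad-hoc normalization step is needed.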
In extensive simulations, BISON outperformed existing methods like SpaRTaCo, biclustering algorithms (BC and sparseBC), and a two-directional K-means approach. Across various scenarios with varying proportions of non-DGs (π<sub>0</sub>), signal strengths (Δ), and gene counts (p), BISON consistently achieved the highest Adjusted Rand Index (ARI) for spot and gene clustering. For example, with a weak signal (Δ = 0.5) and p = 500, BISON's ARI for spot clustering ranged from ~0.95 (π<sub>0</sub> = 0) to ~0.2 (π<sub>0</sub> = 0.8), consistently outperforming other methods. Similar trends were observed for gene clustering and with p = 1000. The mICL criterion effectively selected the correct number of clusters, particularly for spot clusters, though limitations were observed with small gene counts, weak signals, and high proportions of non-DGs. BISON also demonstrated robustness under model misspecification using Negative Binomial simulated data.
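The Adjusted Rand Index used throughout these comparisons is a standard clustering-agreement measure; a self-contained version of the usual pairwise formula (equivalent to sklearn's `adjusted_rand_score`, not the authors' code) is:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two clusterings of the same items:
    1.0 = identical partitions (up to label permutation), ~0 = chance-level
    agreement. Uses the standard pair-counting formula."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_comb = sum(comb(c, 2) for c in contingency.values())   # agreeing pairs
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)                      # chance agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_comb - expected) / (max_index - expected)
```

Note that ARI is invariant to label names, which is why the reported spot-clustering ARIs can be computed directly against manual tissue annotations.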
Applications to real datasets, a mouse olfactory bulb ST dataset and a human breast cancer 10x Visium dataset, further showcased BISON's effectiveness. In the mouse olfactory bulb dataset, BISON identified four spot clusters and three gene clusters (ARI = 0.53 compared to manual annotation). The identified DG groups showed distinct spatial expression patterns, corresponding to different tissue layers. In the human breast cancer dataset, BISON identified five spot clusters and four gene clusters (ARI = 0.487), with DG groups showing high expression in specific spatial domains. Gene ontology enrichment analysis of the DGs revealed enrichment for "extracellular exosome" terms in tumor domains and immune-related terms in others, providing biological insights. These results highlight BISON's ability to identify biologically relevant spatial domains and marker genes.
Including an infrequently measured time-dependent error-prone covariate in survival analyses: a simulation-based comparison of methods by Viviane Philipps, Laurence Freedman, Veronika Deffner, Catherine Helmer, Hendriek Boshuizen, Anne C.M. Thiébaut, Cécile Proust-Lima (on behalf of Measurement Error and Misclassification Topic Group (TG4) of the STRATOS Initiative) https://arxiv.org/abs/2502.16362
Caption: No association
Epidemiological studies frequently explore the relationship between time-varying exposures and event risks using survival analysis methods like the Cox model. However, challenges like infrequent measurement, measurement error, and informative truncation (where the event influences subsequent measurements) can complicate these analyses. This paper compares the naive last observation carried forward (LOCF) approach with more sophisticated methods, aiming to guide researchers toward accurate and reliable techniques.
The study used simulations to evaluate five methods: LOCF, classical two-stage regression calibration (RC) with or without post-event information (PE-RC), multiple imputation (MI), and joint modeling (JM). The target model was λ<sub>i</sub>(t) = λ<sub>0</sub>(t)exp(X<sub>i</sub>(t)γ), where λ<sub>i</sub>(t) is the hazard, X<sub>i</sub>(t) the true exposure, and γ the association parameter. Simulations varied the association strength, event frequency, measurement error magnitude, and exposure trajectory.
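The naive LOCF baseline amounts to holding each covariate measurement constant until the next one when building the (start, stop, value] risk intervals that a time-varying Cox fit consumes. A minimal sketch of that data preparation step (an illustration of the approach, not the authors' code):

```python
def locf_intervals(measure_times, values, event_time):
    """Build (start, stop, covariate) intervals for a time-varying Cox model by
    last observation carried forward (LOCF): each measured value is held
    constant from its measurement time until the next measurement or the
    event/censoring time. measure_times must be sorted, starting at time 0.
    """
    intervals = []
    for k, (t, v) in enumerate(zip(measure_times, values)):
        if t >= event_time:                 # measurements after follow-up end are unused
            break
        nxt = measure_times[k + 1] if k + 1 < len(measure_times) else event_time
        stop = min(nxt, event_time)
        intervals.append((t, stop, v))
    return intervals
```

The bias the paper documents arises precisely here: between visits the true X<sub>i</sub>(t) keeps evolving (with measurement error on top), while LOCF feeds the hazard model a stale, error-prone value.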
LOCF and, to a lesser extent, classical RC exhibited substantial bias in most scenarios, especially with increasing event censoring, larger measurement error, or stronger associations. Including post-event information (PE-RC) largely mitigated the bias in RC. MI performed relatively well, with minor bias in some scenarios, while JM provided accurate estimates in all scenarios. However, both MI and JM resulted in wider confidence intervals compared to other methods. Generally, MI and JM yielded lower mean squared errors.
The methods were illustrated using data from the Bordeaux 3C cohort, examining the association of body mass index (BMI) and Trail Making Test Part A (TMT-A) with dementia risk. All methods produced similar results for BMI, but LOCF and RC underestimated the TMT-A association compared to MI and JM. This aligns with simulation findings, suggesting that LOCF and RC may only be appropriate in specific cases with weak associations and minimal measurement error. The study emphasizes carefully considering simpler methods' limitations and advocates for MI or JM when feasible for analyzing time-varying exposures in survival analysis. These methods, though requiring more expertise, offer greater accuracy and reliability in handling data complexities.
This newsletter highlights impactful advancements across diverse statistical and AI domains. From refining Bayesian methods for spatial omics data analysis with BISON to addressing the intricacies of time-varying exposures in survival analysis using MI and JM, the preprints showcase a commitment to methodological rigor. The exploration of LLMs for quantitative management problem-solving reveals both promise and limitations, emphasizing the need for further research into improving precision. The introduction of DR-AIR, a repository for AI reliability data, addresses a crucial gap in the field, fostering collaboration and promoting the development of robust AI systems. The development of specialized loss functions, such as the scaling-weighted MSE for predicting marine heatwaves, underscores the importance of tailoring methods to specific applications. Collectively, these contributions demonstrate a dynamic research landscape focused on tackling complex data challenges and providing practical solutions for informed decision-making.