Subject: Statistical Modeling and Analysis Advancements
This collection of papers showcases a diverse range of advancements in statistical modeling and analysis across various domains. Several studies introduce novel methodologies for addressing complex data structures and challenges. Escobar-Hernández et al. (2024) propose a new approach for comparing spatial risk patterns between subgroups using multivariate spatial disease mapping, applied to suicide risk. Guha Niyogi et al. (2024) explore different distributional representations of accelerometry data, finding that hazard functions provide the highest discriminatory accuracy for multiple sclerosis disability. Brindle et al. (2024) introduce VISTA, a clustering approach for irregularly sampled time series using state space mixture models, particularly relevant for healthcare and psychology data. These innovative methods address specific challenges in their respective fields, contributing to more robust and nuanced analyses.
Another prominent theme is the application of Bayesian methods to diverse problems. Medrano et al. (2024) introduce BSD, a Bayesian framework for analyzing neural spectral power, demonstrating its efficacy in model selection and parameter estimation. Zhou et al. (2024) present a Bayesian approach for learning causal graphs with limited interventional samples, leveraging uniform DAG sampling. Saha et al. (2024) explore model compression for Bayesian neural networks using posterior inclusion probabilities for pruning and feature selection. These studies highlight the versatility and power of Bayesian methods in handling uncertainty and complex data structures.
Several papers focus on specific applications and empirical analyses. Muluken Liyew et al. (2024) analyze diurnal air temperature trends using Mann-Kendall trend tests and hierarchical clustering with dynamic time warping, revealing significant warming trends at Italian stations. Ma et al. (2024) develop a Power System Vulnerability Index (PSVI) using XGBoost and SHAP, identifying hotspots of vulnerability across US counties. Budko et al. (2024) investigate the association between potato plant vigor and seed tuber biochemistry, finding that vigor is dominated by genotype. These studies demonstrate the practical utility of statistical methods in addressing real-world problems.
Further contributions include advancements in specific statistical techniques. Baíllo and Cárcamo (2024) introduce the almost goodness-of-fit test, a procedure for validating parametric models up to a pre-specified margin of error. Sharpnack et al. (2024) present BanditCAT and AutoIRT, machine learning approaches for computerized adaptive testing and item calibration. Fusco et al. (2024) propose an extension of spatio-temporal stochastic frontier models to account for spatial and temporal effects in inefficiency. These papers offer valuable tools and techniques for researchers working with diverse statistical problems.
Finally, several papers explore the intersection of statistics and other disciplines. Ghosh et al. (2024) investigate the estimation of distances from parallaxes in astronomy using robust Bayesian inference. Huang et al. (2024) introduce the R package CDsampling for constrained D-optimal sampling in paid research studies. Miller (2024) proposes a statistical approach to language model evaluations, emphasizing the importance of experiment analysis and planning. These interdisciplinary applications showcase the broad relevance and impact of statistical methodology.
Reconstructing East Asian Temperatures from 1368 to 1911 Using Historical Documents, Climate Models, and Data Assimilation by Eric Sun, Kuan-hui Elaine Lin, Wan-Ling Tseng, Pao K. Wang, Hsin-Cheng Huang https://arxiv.org/abs/2410.21790
Caption: Comparison of Celsius-Scaled Kriged REACHES and GHCN Temperatures in Beijing
This paper presents a significant advancement in historical climatology, addressing the challenge of reconstructing annual temperatures in East Asia from 1368 to 1911, a period lacking instrumental data. The researchers ingeniously leverage the Reconstructed East Asian Climate Historical Encoded Series (REACHES), a digitized collection of historical Chinese documents, to bridge this data gap. REACHES translates qualitative temperature descriptions into a four-level ordinal scale. However, its inherent bias towards extreme weather events and the resulting missing data, presumably representing normal conditions, presented significant hurdles.
To overcome these limitations and reconstruct spatially continuous temperatures, the researchers employed a sophisticated three-tiered statistical framework. The first tier utilizes kriging, a geostatistical interpolation technique. Assuming a zero-mean spatial Gaussian process, the method infers temperatures at locations without direct historical records, treating normal weather as the baseline. The interval-censored nature of the REACHES index data is explicitly addressed, leading to a more robust interpolation. The kriging model uses the best linear predictor, formulated as Ŷ(s₀) = Cᵧ₂(s₀) Σ₂⁻¹ (Z − μ₂), where Ŷ(s₀) is the predicted value at location s₀, Cᵧ₂(s₀) is the covariance between the process at s₀ and the observed data Z, Σ₂ is the covariance matrix of Z, and μ₂ is its mean. A two-step estimation procedure accounts for interval censoring and provides bias-corrected estimates.
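To make the kriging step concrete, here is a minimal sketch of the best linear predictor under a zero-mean Gaussian process with an exponential covariance; the covariance function, its parameters, and the toy data are illustrative assumptions rather than the paper's fitted model.

```python
import numpy as np

def exp_cov(d, sill=1.0, range_=500.0):
    """Exponential covariance as a function of distance (illustrative choice)."""
    return sill * np.exp(-d / range_)

def krige(s0, sites, z, mu, sill=1.0, range_=500.0):
    """Best linear predictor Y_hat(s0) = c0' Sigma^{-1} (z - mu).

    sites : (n, 2) coordinates of observed locations
    z     : (n,) observed (censoring-adjusted) index values
    mu    : (n,) mean of the observations
    """
    d_obs = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
    d_0 = np.linalg.norm(sites - s0, axis=-1)
    Sigma = exp_cov(d_obs, sill, range_) + 1e-8 * np.eye(len(z))  # jitter for numerical stability
    c0 = exp_cov(d_0, sill, range_)
    return c0 @ np.linalg.solve(Sigma, z - mu)

# toy usage: three stations, predict at a new location
sites = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 150.0]])
z = np.array([-1.0, 0.0, 1.0])          # ordinal-index anomalies
mu = np.zeros(3)
print(krige(np.array([50.0, 50.0]), sites, z, mu))
```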
The second tier employs quantile mapping to calibrate the kriged REACHES data to Celsius scales. This crucial step aligns the historical data with temperature distributions from the Last Millennium Ensemble (LME), a comprehensive global climate model. By matching quantiles, the method effectively transforms the ordinal REACHES data into physically meaningful temperature values.
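A minimal sketch of quantile mapping, assuming we have a sample of kriged index values and a co-located sample of LME temperatures; the empirical quantile matching below is a generic implementation, not the paper's exact calibration.

```python
import numpy as np

def quantile_map(source, reference):
    """Map each source value to the reference value at the same empirical quantile."""
    src_sorted = np.sort(source)
    ref_sorted = np.sort(reference)
    # empirical CDF of each source value within its own sample
    probs = np.searchsorted(src_sorted, source, side="right") / len(source)
    # evaluate the reference quantile function at those probabilities
    ref_probs = np.linspace(1 / len(reference), 1.0, len(reference))
    return np.interp(probs, ref_probs, ref_sorted)

rng = np.random.default_rng(0)
kriged_index = rng.normal(0.0, 1.0, size=544)     # kriged REACHES values, one per year
lme_temps = rng.normal(12.0, 1.5, size=544)       # LME simulated annual temperatures (deg C)
celsius_scaled = quantile_map(kriged_index, lme_temps)
print(celsius_scaled[:5])
```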
Finally, the third tier implements Bayesian data assimilation, integrating the Celsius-scaled kriged data with LME simulations. A nonstationary autoregressive time series model, estimated using regularized maximum likelihood with a fused lasso penalty, captures the temporal dynamics of temperature at each location. This model serves as the prior distribution in a Bayesian framework. The Kalman filter and smoother then refine this prior by incorporating the Celsius-scaled REACHES data, generating posterior temperature estimates that benefit from both the climate model and the historical information.
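The assimilation step can be illustrated with a univariate Kalman filter that treats an AR(1) model (standing in for the LME-based prior) as the state equation and the Celsius-scaled kriged series as noisy observations; the coefficients, noise variances, and missing-data pattern below are placeholders, and the fused-lasso estimation of time-varying parameters is omitted.

```python
import numpy as np

def kalman_filter(y, phi, mu, q, r, x0, p0):
    """Univariate Kalman filter for x_t = mu + phi*(x_{t-1} - mu) + w_t, y_t = x_t + v_t.

    y   : observations (np.nan where the historical record is missing)
    phi : AR(1) coefficient; q: state noise variance; r: observation noise variance
    """
    n = len(y)
    x, p = np.empty(n), np.empty(n)
    x_prev, p_prev = x0, p0
    for t in range(n):
        # predict from the climate-model-based prior
        x_pred = mu + phi * (x_prev - mu)
        p_pred = phi**2 * p_prev + q
        if np.isnan(y[t]):                     # no usable historical observation this year
            x[t], p[t] = x_pred, p_pred
        else:                                  # update with the kriged, Celsius-scaled value
            k = p_pred / (p_pred + r)
            x[t] = x_pred + k * (y[t] - x_pred)
            p[t] = (1 - k) * p_pred
        x_prev, p_prev = x[t], p[t]
    return x, p

rng = np.random.default_rng(1)
obs = 12.0 + rng.normal(0, 1.0, 544)
obs[rng.random(544) < 0.3] = np.nan            # years with no usable record
post_mean, post_var = kalman_filter(obs, phi=0.6, mu=12.0, q=0.5, r=1.0, x0=12.0, p0=1.0)
print(post_mean[:5])
```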
Validation using early instrumental data from the Global Historical Climatology Network (GHCN) demonstrates the effectiveness of this approach. Correlations between the reconstructed temperatures and GHCN records, particularly in Beijing (correlation coefficient of 0.50), showcase the method's accuracy. The Bayesian data assimilation further enhances correlations in other locations like Shanghai and Hong Kong. Importantly, the reconstruction captures the cooling trend during the late Ming dynasty, corroborating historical accounts. This innovative integration of historical documentation, climate models, and advanced statistical methods provides a valuable resource for understanding past climate variability and informing future climate studies.
A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution by Zhengmian Hu, Tong Zheng, Heng Huang https://arxiv.org/abs/2410.21716
This paper explores the exciting potential of Large Language Models (LLMs) in revolutionizing authorship attribution, a critical task in forensic linguistics. Traditional approaches, relying on manual feature engineering, often struggle to capture the nuanced stylistic variations that distinguish authors. While deep learning offered some improvements, issues with interpretability and generalization persisted. This research introduces a novel Bayesian approach that leverages the probabilistic outputs of pre-trained LLMs for one-shot authorship attribution, bypassing the need for extensive fine-tuning or manual feature crafting.
The core of the proposed "Logprob method" lies in calculating the probability of a given text under the writing style of each candidate author, given example texts from the candidates. This probabilistic assessment is achieved by carefully constructing prompts that guide the LLM to evaluate the likelihood of authorship based on stylistic cues. The method calculates the text-level log probability P(u|t(aᵢ)), the probability that a new text u was written by the author of a given set of texts t(aᵢ). This probability is estimated using the LLM's probability distribution over the vocabulary: P(u|t(aᵢ)) = P_LLM(u|prompt_construction(t(aᵢ))). Bayes' theorem is then applied to infer the most likely author.
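A schematic of the Logprob scoring, assuming a hypothetical llm_logprob(text, context) helper that returns the log-probability an LLM assigns to a text given a constructed prompt; the prompt template and the uniform prior are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def llm_logprob(text: str, context: str) -> float:
    """Placeholder for the sum of token log-probabilities of `text` given `context`
    under a pre-trained LLM (e.g., read from a local model's per-token logprobs)."""
    raise NotImplementedError

def attribute(query_text: str, candidate_examples: dict) -> dict:
    """Return a posterior over candidate authors for `query_text`.

    candidate_examples maps author id -> one example text t(a_i). Assumes a
    uniform prior over candidates, so the posterior reduces to a softmax over
    the text-level log-probabilities log P(u | t(a_i)).
    """
    log_scores = {}
    for author, example in candidate_examples.items():
        prompt = (f"Here is a text written by an author:\n{example}\n\n"
                  f"Another text by the same author:\n")
        log_scores[author] = llm_logprob(query_text, prompt)
    # Bayes' rule with a uniform prior is a softmax over log-likelihoods
    vals = np.array(list(log_scores.values()))
    post = np.exp(vals - vals.max())
    post /= post.sum()
    return dict(zip(log_scores.keys(), post))
```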
The researchers evaluated the Logprob method on the IMDb and Blog datasets using powerful pre-trained models like Llama-3-70B. In a challenging one-shot, 10-author scenario, the method achieved remarkable results, with a top-1 accuracy of 85% on the IMDb dataset and 82% on the Blog dataset. These results significantly outperform existing QA-based LLM methods, which often suffer from accuracy and stability issues. Furthermore, the Logprob method maintains robust performance even with a larger pool of 50 candidate authors, achieving a respectable 76% top-1 accuracy. Interestingly, the study also revealed potential gender bias in the Blog dataset, with higher accuracy for female-authored blogs, suggesting more distinct stylistic features in their writing.
This research highlights the effectiveness of the Logprob method in capturing subtle linguistic cues for authorship attribution. Its training-free nature not only simplifies the process but also reduces computational overhead. While acknowledging the dependence on LLM capabilities and potential biases in the training data, this work opens exciting new avenues for applying LLMs in forensic linguistics and beyond. Future research directions include refining prompt engineering techniques and mitigating potential biases to further enhance the method's performance and fairness.
BanditCAT and AutoIRT: Machine Learning Approaches to Computerized Adaptive Testing and Item Calibration by James Sharpnack, Kevin Hao, Phoebe Mulcaire, Klinton Bicknell, Geoff LaFlair, Kevin Yancey, Alina A. von Davier https://arxiv.org/abs/2410.21033
This paper introduces a powerful framework for developing and deploying robust large-scale computerized adaptive tests (CATs), addressing the challenges posed by limited data and the need for efficient item calibration and test administration. The framework comprises two key components: AutoIRT for item calibration and BanditCAT for adaptive test administration.
AutoIRT tackles the crucial task of calibrating item parameters, such as difficulty and discrimination, even with scarce response data. It leverages automated machine learning (AutoML) to train a grading model using item features, including BERT embeddings and linguistic features. This non-parametric AutoML model is then used to fit an explanatory Item Response Theory (IRT) model, providing interpretable item parameters essential for adaptive testing. This approach is particularly valuable in "cold-start" and "jump-start" scenarios where response data is limited or non-existent.
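As a simplified stand-in for the AutoML-plus-IRT pipeline, the sketch below fits an explanatory (LLTM-style) IRT model in which item difficulty is a linear function of item features; the synthetic data, feature dimensions, and plain logistic-regression fit are assumptions for illustration, not the paper's AutoML approach.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic response data: each (person, item) pair yields a correct/incorrect response,
# and each item carries a feature vector (standing in for BERT/linguistic features).
rng = np.random.default_rng(0)
n_persons, n_items, n_feat = 200, 30, 5
item_feats = rng.normal(size=(n_items, n_feat))
true_b = item_feats @ rng.normal(size=n_feat)            # difficulty driven by item features
true_theta = rng.normal(size=n_persons)

persons = np.repeat(np.arange(n_persons), n_items)
items = np.tile(np.arange(n_items), n_persons)
p_correct = 1.0 / (1.0 + np.exp(-(true_theta[persons] - true_b[items])))
correct = rng.binomial(1, p_correct)

# Explanatory IRT (LLTM-style): person one-hots act as abilities, item features drive difficulty
person_design = np.eye(n_persons)[persons]
X = np.hstack([person_design, -item_feats[items]])       # minus sign so weights read as difficulty
model = LogisticRegression(C=1e6, fit_intercept=False, max_iter=5000).fit(X, correct)

feature_weights = model.coef_[0][n_persons:]
est_difficulty = item_feats @ feature_weights            # interpretable per-item difficulties
print("correlation with true difficulty:", np.corrcoef(est_difficulty, true_b)[0, 1])
```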
BanditCAT reimagines CAT administration as a contextual bandit problem. The key innovation lies in defining the bandit reward as the Fisher information F_{Iₜ}(θ) of the selected item Iₜ at the latent test taker ability θ, derived from IRT assumptions. Thompson sampling is employed to strike a balance between exploring items with diverse psychometric properties and exploiting highly discriminative items that provide precise information about θ. To prevent overexposure of specific items, a randomization step is introduced before computing the Fisher information, injecting noise into the selection process. The total reward for a test session is the sum of the Fisher information over all administered items: F_T(θ) = Σₜ₌₁ᵀ F_{Iₜ}(θ).
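A minimal sketch of the selection step under a 2PL model, assuming a Gaussian posterior over θ is maintained from earlier responses: Thompson sampling draws an ability value, exposure-control noise is injected, and the item maximizing Fisher information at that draw is administered. The item bank, noise scale, and posterior form are illustrative assumptions, not the DET implementation.

```python
import numpy as np

def fisher_info(theta, a, b):
    """Fisher information of 2PL items at ability theta: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def select_item(post_mean, post_sd, a, b, administered, rng, jitter_sd=0.1):
    """Thompson-sampling item selection with a simple exposure-control randomization.

    post_mean, post_sd : current Gaussian posterior over the test taker's theta
    a, b               : discrimination and difficulty parameters of the item bank
    administered       : boolean mask of items already given
    jitter_sd          : scale of the noise injected before ranking by Fisher information
    """
    theta_draw = rng.normal(post_mean, post_sd)                        # Thompson sample of ability
    info = fisher_info(theta_draw, a, b)
    info = info * np.exp(rng.normal(0.0, jitter_sd, size=info.shape))  # exposure-control noise
    info[administered] = -np.inf                                       # never repeat an item
    return int(np.argmax(info))

rng = np.random.default_rng(0)
a = rng.uniform(0.8, 2.0, size=100)        # illustrative 2PL item bank
b = rng.normal(0.0, 1.0, size=100)
administered = np.zeros(100, dtype=bool)
item = select_item(post_mean=0.0, post_sd=1.0, a=a, b=b, administered=administered, rng=rng)
print("next item:", item)
```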
This framework was successfully applied to launch two new item types on the Duolingo English Test (DET) practice test. Five practice test experiments compared the new adaptive approach with the existing control condition, which used four times the number of items for one item type. Results showed that while the control condition had higher reliability (RR) and score correlations due to the larger number of items, the adaptive approach achieved respectable reliability with significantly fewer items and shorter test duration. For instance, reducing the number of items from 72 to 18 for one item type did not result in a proportional decrease in reliability, demonstrating the efficiency of the adaptive approach. The adaptive approach also showed promising results when combining different item types. The maximum exposure rates remained comparable across conditions, indicating effective exposure control. This initial implementation, BanditCAT V1, provides a strong foundation for future development, including full Thompson sampling of item parameters and extensions to multidimensional ability estimation.
VISTA-SSM: Varying and Irregular Sampling Time-series Analysis via State Space Models by Benjamin Brindle, Thomas Derrick Hull, Matteo Malgaroli, Nicolas Charon https://arxiv.org/abs/2410.21527
Caption: This diagram outlines the VISTA algorithm for clustering irregularly sampled time series data. It uses an Expectation-Maximization (EM) algorithm to fit a mixture of linear Gaussian state space models, leveraging Kalman filtering and Rauch-Tung-Striebel (RTS) smoothing. The iterative process alternates between the E-step (computing expected values) and the M-step (updating model parameters) until convergence, ultimately returning cluster labels and model parameters.
This paper introduces VISTA (Varying and Irregular Sampling Time-series Analysis), a novel clustering approach tailored for the complexities of real-world time-series data, particularly prevalent in healthcare and psychology. These data often exhibit irregular sampling, missing data points, and varying temporal structures, posing challenges for traditional clustering methods that assume regular sampling. VISTA addresses these challenges head-on by utilizing a parametric state space mixture model.
At the heart of VISTA are linear Gaussian state space models (LGSSMs), providing a flexible framework to capture a wide range of time series dynamics. The method assumes that the population can be represented as a mixture of a given number of LGSSMs, allowing for an explicit derivation of the log-likelihood function. This facilitates the development of an expectation-maximization (EM) scheme for efficiently fitting model parameters to the observed data. Critically, VISTA handles irregularly sampled time series by utilizing an underlying continuous stochastic process of the form

dx(t)/dt = A x(t) + w(t),    y(t) = C x(t) + v(t),
where y(t) represents the observed variable, x(t) is the latent state variable, A is the system matrix, C is the output matrix, and w(t) and v(t) are white noise processes. This continuous formulation is then discretized to accommodate irregular time steps.
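The discretization over an irregular gap Δt can be sketched with the matrix exponential for the state transition and a numerically integrated process-noise covariance; the dynamics matrix and noise levels below are illustrative, and this is a generic construction rather than VISTA's own implementation.

```python
import numpy as np
from scipy.linalg import expm

def discretize(A, Q, dt, n_grid=200):
    """State transition and process-noise covariance for dx/dt = A x + w
    over an irregular step of length dt (w has spectral density Q)."""
    Ad = expm(A * dt)
    # Qd = integral_0^dt expm(A s) Q expm(A s)^T ds, via a midpoint Riemann sum
    ds = dt / n_grid
    s = (np.arange(n_grid) + 0.5) * ds
    Qd = sum(expm(A * si) @ Q @ expm(A * si).T for si in s) * ds
    return Ad, Qd

# toy 2D latent state with damped oscillatory dynamics (illustrative values)
A = np.array([[0.0, 1.0], [-1.0, -0.2]])
Q = 0.1 * np.eye(2)
for dt in [0.5, 1.3, 4.0]:                 # irregular gaps between observations
    Ad, Qd = discretize(A, Q, dt)
    print(f"dt={dt}: transition=\n{np.round(Ad, 3)}")
```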
The authors rigorously validated VISTA's performance on both simulated and real-world datasets. On simulated data with known ground truth, VISTA successfully recovered the underlying cluster structure, achieving perfect cluster similarity with correct model specifications. The method was further tested on datasets related to state population growth, epilepsy seizure detection, COVID-19 epidemiological trends, and ecological momentary assessments of depression. In datasets with available ground truth labels, VISTA outperformed benchmark methods. For datasets without ground truth, model selection using the adjusted Bayesian Information Criterion (ABIC) determined the optimal number of clusters.
VISTA's ability to leverage the inherent irregularity of the data as a source of insight, rather than a nuisance, represents a significant advance. The open-source Python implementation further enhances its accessibility and potential for broad adoption. While further research is warranted to address potential parameter identifiability issues and the EM algorithm's sensitivity to initialization, VISTA provides a valuable new tool for uncovering hidden patterns in complex temporal data.
Hazard and Beyond: Exploring Five Distributional Representations of Accelerometry Data for Disability Discrimination in Multiple Sclerosis by Pratim Guha Niyogi, Muraleetharan Sanjayan, Kathryn C. Fitzgerald, Ellen M. Mowry, Vadim Zipunnikov https://arxiv.org/abs/2410.20620
Caption: Distributional Representations of Accelerometry Data for Predicting MS Disability
This study delves into the rich information embedded within accelerometry data for predicting disability in multiple sclerosis (MS), moving beyond traditional summary statistics. The researchers investigated five individual-level distributional representations of minute-level activity counts: density, survival, hazard, quantile, and total time on test (TTT) functions. Using data from the HEAL-MS project, encompassing 246 participants with MS, the study aimed to predict both binary disability status (low vs. high EDSS) and continuous EDSS scores, utilizing both original and log-transformed activity counts.
Scalar-on-function regression models were employed, with the distributional representations serving as predictors. The models followed the form: E{Yᵢ|Xᵢ₁, …, Xᵢmᵢ} = μᵢ, where g(μᵢ) = α + νᵢ with νᵢ = ∫ Dᵢ(z)β(z)dz. Here, Yᵢ represents the outcome (binary MS status or continuous EDSS score), Xᵢⱼ are the minute-level activity counts for individual i, g(·) is a link function, Dᵢ is the chosen distributional representation, and β(z) is a smooth coefficient function representing the functional effect of the distribution at level z. Smoothing splines were used to estimate β(z).
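A hedged sketch of this scalar-on-function regression: each Dᵢ(z) is evaluated on a grid, β(z) is expanded in a small Gaussian basis (the paper uses penalized smoothing splines), and the integral reduces to an ordinary logistic regression in the basis coefficients; all data and basis choices below are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# z-grid on which each subject's distributional representation D_i(z) is evaluated
z = np.linspace(0.0, 1.0, 101)
dz = z[1] - z[0]

rng = np.random.default_rng(0)
n = 246
D = np.abs(rng.normal(1.0, 0.3, size=(n, z.size)))       # stand-in for hazard-type curves
logits = 5.0 * (D[:, 80:].mean(axis=1) - 1.0)            # synthetic signal in the upper tail
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))       # synthetic binary disability status

# Expand beta(z) in a small Gaussian basis; the paper uses penalized smoothing splines instead.
centers = np.linspace(0.0, 1.0, 10)
B = np.exp(-0.5 * ((z[:, None] - centers[None, :]) / 0.15) ** 2)   # (len(z), n_basis)

# nu_i = int D_i(z) beta(z) dz  ~=  dz * D_i @ B @ c, so the fit is a logistic regression in c
Phi = dz * (D @ B)
fit = LogisticRegression(max_iter=2000).fit(Phi, y)
beta_hat = B @ fit.coef_[0]                               # estimated coefficient function beta(z)
print("beta(z) at the first five grid points:", np.round(beta_hat[:5], 3))
```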
The results revealed the surprising power of the hazard function, which captures the conditional likelihood of activity at a given level, given that activity has reached at least that level. It achieved the highest discriminatory accuracy for both binary and continuous outcomes. For binary disability status, the cross-validated area under the curve (AUC) using the hazard function was 0.77 with original activity counts and 0.77 with log-transformed counts, clearly outperforming the density functions (AUCs of 0.71 and 0.67, respectively). Quantile functions also performed well, with cross-validated AUCs of 0.76 and 0.74. For continuous EDSS scores, the hazard function achieved a cross-validated R² of 0.26 (original counts) and 0.25 (log-transformed counts), again outperforming the other representations.
These findings underscore the importance of considering tail behavior when analyzing digital health data. The hazard function's focus on the risk of an event at a given activity level, conditional on having "survived" up to that level, proves particularly insightful in the context of MS. This may reflect the increased risk of disability worsening associated with lower activity levels. The study suggests that using hazard functions, and to a lesser extent quantile functions, can substantially improve the prediction of disability progression in MS compared to traditional summary measures or other distributional representations. This opens new avenues for utilizing wearable sensor data to monitor disease progression and potentially personalize interventions. Future research will focus on incorporating temporal information into the distributional representations to further enhance predictive power.
This newsletter highlights significant progress in statistical methodology and its application across diverse domains. From reconstructing historical temperatures using Bayesian data assimilation to leveraging the power of LLMs for authorship attribution and developing novel clustering approaches for irregular time series, the featured papers showcase innovative solutions to complex data challenges. The emphasis on Bayesian methods, the exploration of novel distributional representations for digital health data, and the development of advanced machine learning techniques for adaptive testing demonstrate the ongoing evolution of the field. These advancements not only contribute to a deeper understanding of specific phenomena, such as climate change and disease progression, but also provide valuable tools and techniques for researchers across various disciplines. The interdisciplinary nature of these contributions further underscores the broad relevance and impact of statistical methodology in addressing real-world problems.