This collection of papers delves into various statistical methodologies and their applications across diverse fields, ranging from sports analytics to public health. A recurring theme is the refinement of existing models to address their inherent limitations and enhance their applicability. For instance, Rodriguez Avellaneda et al. (2024) https://arxiv.org/abs/2409.05036 introduce a spatio-temporal model grounded in log-Gaussian Cox processes to estimate the velocity of infectious disease spread, using it to analyze the dynamics of COVID-19 transmission in Colombia. In a similar vein, Brill et al. (2024) https://arxiv.org/abs/2409.04889 address the shortcomings of machine learning in predicting expected points in American football. Their proposed models mitigate selection bias, overfitting, and data dependence, and add uncertainty quantification, improving prediction accuracy.
Beyond refining existing models, this collection also presents novel methodologies tailored for specific applications. Holý (2024) https://arxiv.org/abs/2409.05714 employs a dynamic ranking model based on the Plackett-Luce distribution to analyze and forecast success in men's ice hockey championships, factoring in elements like home advantage and player characteristics. Shifting to the realm of telecommunications, Gontier et al. (2024) https://arxiv.org/abs/2409.05468 present a methodology for modeling the spatial distribution of macro base stations using repulsive point processes, specifically determinantal point processes. They validate their approach using real-world cell tower data.
Innovation in data analysis techniques is another highlight of this collection. Nghiem et al. (2024) https://arxiv.org/abs/2409.05343 introduce a penalized functional alignment method to enhance empathic accuracy measurement by rectifying misalignments in emotional perception. Zhang et al. (2024) https://arxiv.org/abs/2409.05429 develop a comprehensive framework for estimating aircraft fuel consumption from flight trajectories, leveraging spectral transformation techniques and deep neural networks. Addressing the critical concern of data privacy, Hu et al. (2024) https://arxiv.org/abs/2409.04716 propose a collaborative Cox model for distributed data. Their approach eliminates the need for a centralized database, thereby enhancing privacy in multicenter clinical studies.
Finally, several papers focus on specific applications with significant societal implications. Sandeep & Mukhopadhyay (2024) https://arxiv.org/abs/2409.04673 present a multi-objective economic statistical design for the CUSUM control chart, aiming to minimize both cost and out-of-control run length. Nnyaba et al. (2024) https://arxiv.org/abs/2409.04642 introduce MuyGPs, a robust Gaussian Process approach for confident classification of electrocardiography data, with the potential to improve the reliability of automated heart disease diagnosis. Wood et al. (2024) https://arxiv.org/abs/2409.06473 provide a critical statistical analysis of the UK's COVID-19 response, highlighting biases in risk presentation and data interpretation.
This collection underscores the wide-ranging applicability of statistical methods in tackling real-world challenges across various domains.
Privacy enhanced collaborative inference in the Cox proportional hazards model for distributed data by Mengtong Hu, Xu Shi, Peter X.-K. Song https://arxiv.org/abs/2409.04716
Caption: Comparison of survival curves obtained using the privacy-preserving "Renew" method and the centralized "Oracle" method.
This paper introduces COLSA (Collaborative Operation of Linked Survival Analysis), a novel method for analyzing time-to-event data from multiple clinical sites without compromising patient privacy. Recognizing the limitations of traditional methods that require sharing individual-level data or constructing risk sets, COLSA offers a more secure approach.
Instead of relying on sensitive individual-level information, COLSA focuses on directly estimating the baseline hazard function in the Cox proportional hazards model using Bernstein basis polynomials. This enables the method to operate by sharing only aggregated summary statistics at the site level, significantly reducing the risk of disclosing individual patient data.
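To make the baseline-hazard idea concrete, here is a minimal sketch of a Bernstein polynomial expansion for a positive baseline hazard. The exp(θ) parameterization, the function names, and the example values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.special import comb

def bernstein_basis(t, degree, t_max):
    """Evaluate the Bernstein basis B_{k, degree} at times t rescaled to [0, 1]."""
    u = np.clip(np.atleast_1d(t) / t_max, 0.0, 1.0)
    k = np.arange(degree + 1)
    return comb(degree, k) * u[:, None] ** k * (1.0 - u[:, None]) ** (degree - k)

def baseline_hazard(t, theta, t_max):
    """Baseline hazard as a Bernstein expansion; exp(theta) keeps it positive."""
    B = bernstein_basis(t, len(theta) - 1, t_max)
    return B @ np.exp(theta)

# Example: a degree-4 expansion over a 10-year follow-up window.
theta = np.array([-2.0, -1.5, -1.0, -1.2, -0.8])
print(baseline_hazard(np.array([0.5, 2.0, 8.0]), theta, t_max=10.0))
```

With a low-dimensional parametric baseline of this kind, each site can, for example, contribute derivatives of its local likelihood with respect to the regression and basis coefficients, and only those aggregated quantities need to leave the site.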
The researchers validated COLSA's effectiveness through both simulated and real-world data from the Scientific Registry of Transplant Recipients (SRTR). In simulations, COLSA demonstrated performance comparable to the gold standard of centralized analysis, showing negligible differences in bias and standard errors. Notably, traditional meta-analysis approaches exhibited significantly higher bias, particularly when handling categorical variables.
When applied to the SRTR data, COLSA yielded results consistent with the oracle method, demonstrating its ability to identify significant predictors of death-censored graft failure in kidney transplant recipients.
COLSA presents a practical and effective solution for conducting survival analysis in privacy-preserving settings. By eliminating the need for individual-level data sharing, COLSA facilitates collaborative research while ensuring the protection of patient privacy. This has the potential to accelerate research discoveries and improve healthcare outcomes by enabling researchers to leverage data from multiple sources without compromising data security.
Causal inference and racial bias in policing: New estimands and the importance of mobility data by Zhuochao Huang, Brenden Beck, Joseph Antonelli https://arxiv.org/abs/2409.08059
Caption: The effect of incorporating mobility data on estimates of racial bias in police stops. The left panel shows the relationship between the proportion of Black residents in a precinct and the estimated risk ratio of police encounters using administrative data alone; the right panel shows how incorporating mobility data substantially alters this relationship, suggesting a more nuanced picture of racial disparities in policing.
This study tackles the complex issue of quantifying racial bias in policing, highlighting the limitations of administrative data and advocating for the integration of mobility data. The authors argue that relying solely on administrative data, which primarily captures encounters resulting in police stops, can lead to biased estimates.
The study introduces a novel "race and place" estimand, denoted as Ψ(r, x), to investigate whether individuals of a certain race experience different policing patterns based on the racial demographics of their location. This nuanced approach aims to capture potential disparities that might be overlooked by traditional metrics like the causal risk ratio (CRR).
To address the limitations of administrative data, the authors propose leveraging anonymized and aggregated cell phone mobility data. This data, they argue, provides a more comprehensive view of the population interacting with police throughout the day, offering a more accurate representation of potential police encounters.
Analyzing the NYPD "Stop-and-Frisk" data, the study uncovers a significant "race and place" effect. Black individuals are found to be 1.39 times more likely to experience police force in a precinct with 20% Black residents compared to one with 80% Black residents. Notably, incorporating mobility data significantly alters the estimated CRR, underscoring the importance of accounting for population movement in racial bias research.
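Read concretely, the reported 1.39 corresponds to a contrast of the form Ψ(Black, x = 0.20) / Ψ(Black, x = 0.80) ≈ 1.39, where x denotes the precinct's proportion of Black residents; the formal potential-outcomes definition of Ψ(r, x), and the assumptions under which it is identified, are laid out in the paper.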
While acknowledging the limitations of their study, particularly the reliance on untestable assumptions regarding unmeasured confounding, the authors conduct sensitivity analyses to demonstrate the robustness of their findings. The study concludes by emphasizing the need for improved data sources and further research to effectively address the multifaceted issue of racial bias in policing.
Unsupervised anomaly detection in spatio-temporal stream network sensor data by Edgar Santos-Fernandez, Jay M. Ver Hoef, Erin E. Peterson, James McGree, Cesar A. Villa, Catherine Leigh, Ryan Turner, Cameron Roberts, Kerrie Mengersen https://arxiv.org/abs/2409.07667
Caption: This figure compares the performance of different unsupervised anomaly detection methods using Matthews Correlation Coefficient (MCC). The results indicate that methods incorporating spatio-temporal information (Method 1 PPD, Finite mixtures, HMM) outperform the benchmark ARIMA model, with the iterative posterior predictive distribution (Method 1 PPD iter 2) and HMM demonstrating superior performance.
This paper presents a novel framework for identifying technical anomalies in high-frequency water quality data collected from in-situ sensors deployed in stream networks. Recognizing the limitations of traditional methods that often fail to account for the spatio-temporal dependencies inherent in such data, the authors propose an unsupervised anomaly detection framework that leverages these dependencies to improve detection accuracy.
The proposed framework consists of three key steps: data pre-processing, spatio-temporal modeling, and anomaly detection. The pre-processing step involves handling missing data, aligning time series from different sensors, and identifying water quality events. The spatio-temporal modeling step utilizes a Bayesian vector autoregressive spatial linear model, represented as:
[Y<sub>t</sub> | Y<sub>t−1</sub>, X<sub>t</sub>, X<sub>t−1</sub>, β, Φ<sub>1</sub>, Σ, σ²] = N(μ<sub>t</sub>, Σ + σ²I),
where Y<sub>t</sub> is the vector of observations at time t, X<sub>t</sub> is the design matrix of covariates, β is the vector of regression coefficients, Φ<sub>1</sub> captures temporal autocorrelation, Σ is the spatial covariance matrix, and σ²I is the covariance of the unstructured (nugget) error.
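As a minimal sketch of how such a model generates data, assuming (purely for illustration) the common VAR(1) mean structure μ<sub>t</sub> = X<sub>t</sub>β + Φ<sub>1</sub>(Y<sub>t−1</sub> − X<sub>t−1</sub>β) and an exponential spatial covariance in place of the stream-network covariance used by the authors:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, n_cov = 5, 2

# Illustrative covariates, coefficients, and temporal autocorrelation matrix.
X_t = rng.normal(size=(n_sites, n_cov))
X_prev = rng.normal(size=(n_sites, n_cov))
beta = np.array([0.5, -0.3])
Phi1 = 0.6 * np.eye(n_sites)

# Stand-in exponential spatial covariance from 1-D site coordinates, plus a nugget.
coords = rng.uniform(0.0, 10.0, size=n_sites)
D = np.abs(coords[:, None] - coords[None, :])
Sigma = np.exp(-D / 2.0)
sigma2 = 0.1

# One draw of Y_t given the previous observation Y_{t-1}.
Y_prev = X_prev @ beta + rng.normal(size=n_sites)
mu_t = X_t @ beta + Phi1 @ (Y_prev - X_prev @ beta)
Y_t = rng.multivariate_normal(mu_t, Sigma + sigma2 * np.eye(n_sites))
```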
The anomaly detection step employs four unsupervised methods: posterior predictive distribution, finite mixture models, Hidden Markov Models (HMMs), and autoregressive integrated moving average (ARIMA) as a benchmark. The performance of these methods was evaluated using simulated and real-world sensor data from the Herbert River in Queensland, Australia.
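The posterior predictive route, in particular, lends itself to a simple flagging rule. The sketch below assumes posterior predictive draws for each observation are available from a fitted model and flags points falling outside a central credible interval; the interval rule and the names are illustrative, not the authors' exact criterion.

```python
import numpy as np

def flag_anomalies(y_obs, ppd_draws, prob=0.99):
    """Flag observations outside a central credible interval of the
    posterior predictive distribution.

    y_obs     : (n,) observed sensor values
    ppd_draws : (n_draws, n) posterior predictive samples per observation
    prob      : coverage of the credible interval
    """
    alpha = (1.0 - prob) / 2.0
    lower = np.quantile(ppd_draws, alpha, axis=0)
    upper = np.quantile(ppd_draws, 1.0 - alpha, axis=0)
    return (y_obs < lower) | (y_obs > upper)
```

Here prob trades off sensitivity against false alarms.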
Results from the simulation study revealed that all four spatio-temporal methods outperformed the benchmark ARIMA model on accuracy, Matthews correlation coefficient, and other performance metrics. Notably, the posterior predictive distribution method, when applied iteratively, and the HMM emerged as the top performers. In the case study, the posterior predictive distribution method effectively identified various types of anomalies, including drift, sudden shifts, periods of high variability, and constant offsets, with high accuracy.
This study highlights the importance of incorporating spatial and temporal dependencies in anomaly detection for stream network sensor data. The proposed unsupervised framework demonstrates its effectiveness in identifying technical anomalies, ultimately improving data quality and facilitating more informed water resource management decisions.
Toward Model-Agnostic Detection of New Physics Using Data-Driven Signal Regions by Soheun Yi, John Alison, Mikael Kuusela https://arxiv.org/abs/2409.06960
Caption: Visualization of the signal region selection method. The top row shows the original density ratio (γ) of 4b to 3b events and its individual components. The middle row displays the smoothed density ratio (γ̂) and its components after applying a convolution kernel. The bottom row highlights the ratio between the original and smoothed density ratios (γ/γ̂), revealing a peak indicative of a potential signal region.
This paper introduces a novel method for identifying signal regions (SRs) in high-energy physics data, addressing the challenge of searching for new particles without prior knowledge of their properties. The approach departs from traditional methods that rely heavily on pre-existing models and instead rests on the assumption that signal events are localized in feature space.
The researchers utilize a classifier trained to distinguish between background events with three b-jets (3b) and four b-jets (4b), with the classifier's output serving as a representation of events in a simplified feature space. The core of the method lies in analyzing the density ratio of 4b events to 3b events. By smoothing this density ratio using a convolution operation, the researchers aim to isolate localized high-frequency features indicative of signal events, which would otherwise be masked by background noise. Mathematically, this involves comparing the original density ratio, γ(ζ) := P<sub>4b</sub>(ζ) / P<sub>3b</sub>(ζ), to the smoothed density ratio, γ̂(ζ) := (P<sub>4b</sub> * K)(ζ) / (P<sub>3b</sub> * K)(ζ), where P<sub>4b</sub> and P<sub>3b</sub> represent the densities of 4b and 3b events, respectively, and K is a convolution kernel.
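A one-dimensional toy version of this comparison might look like the following; the histogram binning, the Gaussian kernel, and the function below are illustrative assumptions (the paper works with a richer event representation and a kernel chosen to match the expected signal width).

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def signal_region_score(scores_4b, scores_3b, n_bins=200, sigma=5.0):
    """Compare a binned density ratio gamma = p_4b / p_3b with a smoothed
    version obtained by convolving numerator and denominator separately;
    large values of gamma / gamma_smooth point to a localized excess.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    p4, _ = np.histogram(scores_4b, bins=bins, density=True)
    p3, _ = np.histogram(scores_3b, bins=bins, density=True)
    eps = 1e-12
    gamma = (p4 + eps) / (p3 + eps)
    gamma_smooth = (gaussian_filter1d(p4, sigma) + eps) / (gaussian_filter1d(p3, sigma) + eps)
    return gamma / gamma_smooth  # peaks indicate candidate signal regions
```

A candidate signal region would then be the set of bins where this score exceeds a chosen threshold.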
Applying this method to simulated data for HH → 4b events, the researchers demonstrated its effectiveness in identifying a data-driven SR within a high-dimensional feature space. The results show that a higher signal ratio and larger sample size lead to a more pronounced concentration of signal events within the identified SR, suggesting the method's potential for uncovering subtle signals, particularly with increasing data availability.
The study highlights the importance of selecting an appropriate convolution kernel. Using a kernel with a scale too large compared to the signal peak can lead to over-smoothing and loss of signal information. Conversely, a kernel with too small a scale may not effectively distinguish the signal from the background. Future work will focus on incorporating background estimation and hypothesis testing within the identified SR to confirm the presence of new physics signals.
This newsletter highlights the diverse applications of statistical methodologies in addressing real-world problems. From enhancing data privacy in multicenter clinical studies to uncovering hidden patterns of racial bias in policing and detecting anomalies in complex datasets, these papers showcase the power of statistical thinking. The development of novel methods, like COLSA for privacy-preserving survival analysis and the use of mobility data to improve estimates of racial bias in policing, demonstrates the continuous evolution of the field. Furthermore, the emphasis on model-agnostic approaches, such as the data-driven signal region selection for new physics detection, reflects the growing need for methods that can handle complex data without relying on strong prior assumptions. These advancements underscore the importance of interdisciplinary collaborations and the continued exploration of cutting-edge statistical techniques to address pressing societal challenges.