This collection of preprints explores diverse applications of statistical modeling and machine learning across a spectrum of fields, from environmental science and public health to economics and artificial intelligence. Several papers introduce novel methodologies for tackling complex data challenges. For instance, Rezvani et al. (2024) https://arxiv.org/abs/2409.13688 present MiNa, a new open-source dataset for automated microplastic and nanoplastic detection and classification. Using simulated scanning electron microscopy images and object detection algorithms, this resource aims to accelerate research in this critical area by providing a standardized benchmark for algorithm development. Similarly, focusing on logistical challenges, Wouda et al. (2024) https://arxiv.org/abs/2409.13386 develop an integrated selection and routing (ISR) policy for urban waste collection. Their prize-collecting vehicle routing problem approach demonstrates significant cost savings, highlighting the potential of optimization techniques for improving resource allocation in real-world scenarios. Finally, Liu et al. (2024) https://arxiv.org/abs/2409.13224 tackle the analysis of correlated time series by proposing a variational inference spectral density estimation method. This method employs a blocked Whittle likelihood approximation and a discounted regularized horseshoe prior to address the challenges of correlated noise in gravitational wave detector data, crucial for accurate signal interpretation.
The application of advanced statistical methods for improved prediction and decision-making is another key theme. Jia et al. (2024) https://arxiv.org/abs/2409.13655 introduce Adaptive Mixture Importance Sampling (AMIS) for optimizing key performance indicators in recommender systems. Their work demonstrates improved performance over traditional importance sampling methods in both simulated and real-world ad auction scenarios. Bassolas et al. (2024) https://arxiv.org/abs/2409.13362 analyze spatiotemporal variability in e-bike battery levels in bike-sharing systems. They develop a Markov-chain approach for predicting bike availability and battery levels, contributing to more efficient management of these systems. In healthcare, Sahu et al. (2024) https://arxiv.org/abs/2409.13000 introduce the Large Medical Model (LMM), a generative pre-trained transformer trained on patient claims data. This model demonstrates improved performance in healthcare cost and risk prediction, showcasing the power of large language models for enhanced healthcare analytics.
Methodological advancements within specific statistical domains are also prominent. Bhattacharya et al. (2024) https://arxiv.org/abs/2409.13053 propose "Balanced point adjustment (BA)" for unbiased evaluation of time-series anomaly detectors, addressing limitations of existing methods. Borgert et al. (2024) https://arxiv.org/abs/2409.13938 apply elastic shape analysis to movement data for studying osteoarthritis, demonstrating the added value of analyzing full movement curves over discrete summaries. Pascal and Vaiter (2024) https://arxiv.org/abs/2409.14937 introduce a nonstationary autoregressive model for data-driven reproduction number estimation in epidemiology, using a novel risk estimator based on Stein's Unbiased Risk Estimate formalism. Flood and Mostafa (2024) https://arxiv.org/abs/2409.14284 propose a novel cumulative distribution function (CDF) estimator that integrates probability and non-probability samples to improve estimation from survey data.
Several preprints explore the application of statistical methods to specific domains. Baratela et al. (2024) https://arxiv.org/abs/2409.13098 use complex network metrics and match statistics to predict soccer match outcomes. Zammarchi and Maranzano (2024) https://arxiv.org/abs/2409.13760 employ spatial hierarchical clustering to map climate change awareness across countries. Dorta-González et al. (2024) https://arxiv.org/abs/2409.14570 investigate factors influencing generative AI usage among researchers. Finally, Ederer et al. (2024) https://arxiv.org/abs/2409.15948 use statistical properties of usernames to de-anonymize posts on the online forum EJMR, raising important ethical considerations regarding online anonymity. These contributions collectively demonstrate the wide-ranging impact of statistical research.
Introducing the Large Medical Model: State of the art healthcare cost and risk prediction with transformers trained on patient event sequences by Ricky Sahu, Eric Marriott, Ethan Siegel, David Wagner, Flore Uzan, Troy Yang, Asim Javed https://arxiv.org/abs/2409.13000
The Large Medical Model (LMM) presents a significant advancement in healthcare analytics by leveraging the power of generative pre-trained transformers (GPTs). Trained on a massive dataset of over 140 million longitudinal patient claims records, the LMM offers a novel approach to predicting healthcare costs and identifying potential risk factors. Unlike traditional statistical models or even large language models, the LMM focuses on sequences of medical event codes and taxonomies, capturing the intricate temporal relationships within patient medical histories.
The LMM's methodology involves constructing sequences of historical patient events using structured medical data, including diagnostic codes, procedures, medications, and costs. By employing a specialized vocabulary derived from medical terminology systems, the LMM achieves higher information density and reduced computational requirements compared to text-based models. Furthermore, the use of Monte Carlo simulations during inference allows the LMM to generate multiple possible future event sequences for each patient, enabling probability estimation for various events and the identification of potential causal relationships.
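The Monte Carlo step described above can be illustrated with a minimal sketch. It is not the LMM itself: `sample_future_events` is a hypothetical stand-in for a trained sequence model, and the event codes and sampling weights are invented for illustration. The only idea carried over from the text is that an event's probability can be estimated as the share of simulated future sequences in which it appears.

```python
import random

# Hypothetical stand-in for a trained sequence model. It returns one sampled
# future sequence of event codes; the codes and weights are made up.
def sample_future_events(history, rng, horizon=5):
    vocabulary = ["DX_DIABETES", "RX_METFORMIN", "PROC_LAB", "DX_CKD"]
    weights = [0.2, 0.3, 0.4, 0.1]
    return rng.choices(vocabulary, weights=weights, k=horizon)

def estimate_event_probability(history, target_code, n_simulations=2000, seed=0):
    """Fraction of simulated futures that contain target_code at least once."""
    rng = random.Random(seed)
    hits = sum(
        target_code in sample_future_events(history, rng)
        for _ in range(n_simulations)
    )
    return hits / n_simulations

history = ["DX_HYPERTENSION", "RX_LISINOPRIL"]
p_ckd = estimate_event_probability(history, "DX_CKD")
print(f"Estimated probability of DX_CKD in the next 5 events: {p_ckd:.3f}")
```

In this toy setup the events are drawn independently, so the estimate should land near 1 - 0.9^5, or roughly 0.41. A real model would condition each draw on the patient history and on the events generated so far.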
The reported results improve on existing benchmarks in both cost and disease prediction. In cost prediction, the LMM achieves a Normalized Mean Absolute Error (NMAE) of 78.3%, a 14.1% improvement over the best commercial models. Its R-Squared value of 25.3% also represents a 2% improvement over existing benchmarks. The R-Squared formula, R² = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)², where yᵢ is the actual cost for patient i, ŷᵢ is the predicted cost, and ȳ is the mean actual cost, quantifies the proportion of variance in actual costs explained by the model. In chronic disease prediction, the LMM demonstrates an average AUROC of 0.897 across 19 conditions, a 1.9% improvement over the state-of-the-art BEHRT model. The LMM's ability to generate sequences of predicted events, including diagnoses, procedures, and costs, offers a level of granularity and actionability not seen in current models. Its capacity for in-silico research, simulating the effects of interventions on patient timelines, opens possibilities for personalized medicine and drug discovery.
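For readers who want to see how the two cost metrics behave, here is a small worked sketch on made-up numbers. The R-Squared function follows the formula quoted above. The NMAE normalization (total absolute error divided by total actual cost) is an assumption, since the summary does not spell out the definition used in the paper.

```python
def r_squared(actual, predicted):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

def nmae(actual, predicted):
    # Assumed normalization: total absolute error over total actual cost.
    abs_err = sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted))
    return abs_err / sum(actual)

# Illustrative costs only, not data from the paper.
actual_costs = [100.0, 200.0, 300.0, 400.0]
predicted_costs = [150.0, 150.0, 350.0, 350.0]

r2 = r_squared(actual_costs, predicted_costs)
err = nmae(actual_costs, predicted_costs)
print(f"R-Squared: {r2:.3f}, NMAE: {err:.3f}")
```

On these toy values the residual sum of squares is 10,000 against a total of 50,000, giving an R-Squared of 0.8, and the absolute errors total 200 against 1,000 in actual cost, giving an NMAE of 0.2.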
An integrated selection and routing policy for urban waste collection by Niels A. Wouda, Marjolein Aerts-Veenstra, Nicky van Foreest https://arxiv.org/abs/2409.13386
Caption: Impact of ISR policy on average daily distance traveled by waste collection vehicles in Groningen for varying service levels and fleet sizes.
The municipality of Groningen has implemented a novel integrated selection and routing (ISR) policy for urban waste collection, resulting in significant cost savings. The challenge lies in efficiently emptying numerous underground waste container clusters while minimizing travel distance and preventing overflows. The ISR policy addresses this by first calculating a "prize" for each cluster, reflecting the urgency of service based on estimated overflow probability: p = ρ Pr(Z > V), where p is the prize, Z is the predicted fill level, V is the cluster capacity, and ρ is a weighting parameter. This prize-collecting vehicle routing problem (PCVRP) approach balances operational costs with the risk of overflows.
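The prize calculation can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the fill level Z is assumed to be normally distributed, and the cluster names, fill statistics, capacities, and weighting value are all hypothetical.

```python
from statistics import NormalDist

def overflow_probability(mean_fill, std_fill, capacity):
    """Pr(Z > V), assuming Z ~ Normal(mean_fill, std_fill) for illustration."""
    return 1.0 - NormalDist(mu=mean_fill, sigma=std_fill).cdf(capacity)

def cluster_prize(weight, mean_fill, std_fill, capacity):
    """Prize = weighting parameter times the estimated overflow probability."""
    return weight * overflow_probability(mean_fill, std_fill, capacity)

# Hypothetical clusters: (predicted mean fill, standard deviation, capacity).
clusters = {
    "A": (900.0, 100.0, 1000.0),
    "B": (500.0, 100.0, 1000.0),
    "C": (1000.0, 100.0, 1000.0),
}
weight = 1000.0  # hypothetical weighting parameter

prizes = {name: cluster_prize(weight, *params) for name, params in clusters.items()}
ranked = sorted(prizes, key=prizes.get, reverse=True)
print(ranked)  # clusters ordered from most to least urgent
```

Cluster C, whose predicted fill already equals its capacity, receives half of the maximum prize, while the half-empty cluster B receives almost none. A prize-collecting routing solver would then trade these prizes off against the travel cost of visiting each cluster.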
A simulation study using real-world data demonstrated the ISR policy's effectiveness. Maintaining a high service level (99.9%), the ISR policy reduced route lengths by 40% and shift durations by 44% compared to current practice. Overflow volume was also dramatically reduced. Furthermore, the study found that installing expensive fill-level sensors provided minimal additional benefit, suggesting the current deposit-counting system is sufficient for effective route planning. The reduced workload under the ISR policy even suggests a potential 25% reduction in fleet size while maintaining near-optimal service levels. This data-driven approach optimizes urban services and challenges the assumption that more advanced technology always translates to significant performance gains.
Anonymity and Identity Online by Florian Ederer, Paul Goldsmith-Pinkham, Kyle Jensen https://arxiv.org/abs/2409.15948
Caption: Distribution of p-values from Poisson-Binomial Test for IP Address Attribution on EJMR
This study reveals a critical vulnerability in the anonymity of posters on the Economics Job Market Rumors (EJMR) forum. By exploiting statistical properties of the platform's username generation algorithm, researchers successfully linked posts to specific IP addresses. The algorithm, used until May 2023, combined a user's IP address with a topic ID and applied a SHA-1 hash function. Researchers reversed this process by computing a vast number of hashes, narrowing down possible IP addresses and using a statistical test based on frequency within short time windows to identify the most likely "true" IP address. The formula used was u = S(H(M(t, a, o))), where u is the username, S is a substring function, H is the SHA-1 hash, M is a concatenation function, t is the topic ID, a is the IP address, and o is other data.
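The username formula can be made concrete with a toy sketch. Only the overall structure (concatenate, hash with SHA-1, take a substring) comes from the description above. The concatenation format, the substring position and length, and the addresses, which are drawn from a range reserved for documentation, are assumptions chosen for illustration and do not reproduce the forum's actual scheme.

```python
import hashlib

def username(topic_id, ip_address, other="", length=4):
    # M: concatenation (this separator-free format is an assumption)
    message = f"{topic_id}{ip_address}{other}"
    # H: SHA-1 hash, rendered as a hexadecimal string
    digest = hashlib.sha1(message.encode("utf-8")).hexdigest()
    # S: substring (taking the first `length` characters is an assumption)
    return digest[:length]

def consistent_candidates(observed_posts, candidate_ips):
    """Keep candidates that reproduce every observed (topic, username) pair."""
    return [
        ip for ip in candidate_ips
        if all(username(topic, ip) == name for topic, name in observed_posts)
    ]

# Toy data: three posts generated from one address in a documentation range.
true_ip = "192.0.2.17"
posts = [(topic, username(topic, true_ip)) for topic in (101, 102, 103)]
candidates = [f"192.0.2.{i}" for i in range(256)]

matches = consistent_candidates(posts, candidates)
print(matches)
```

A short substring leaves many addresses consistent with any single post, which is why the approach described above relies on observing several posts and on a statistical test rather than on one hash match.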
This methodology attributed 66.1% of the roughly 7 million posts over 12 years to 47,630 distinct IP addresses. The analysis revealed posting activity concentrated in major US cities and developed countries with prominent economics institutions, demonstrating EJMR's pervasiveness throughout the economics profession. Furthermore, using transformer models, the researchers categorized post content, finding 11.8% toxic, 3.3% misogynistic, and 3.1% hate speech. While university IP addresses were slightly less likely to be problematic, the difference was minimal, indicating widespread problematic content. By examining university-level posting patterns and applying an empirical design based on attention-seeking behavior, the study also found that larger, higher-ranked universities post more frequently, and receiving more attention on initial posts leads to increased subsequent posting activity, suggesting attention as a motivator even in anonymous environments.
This newsletter highlights a diverse range of applications and advancements in statistical modeling and machine learning. The development of the Large Medical Model (LMM) offers a promising new tool for healthcare analytics, demonstrating significant improvements in cost and risk prediction through its innovative use of transformer models trained on patient event sequences. Shifting to urban logistics, the integrated selection and routing (ISR) policy for waste collection in Groningen showcases the practical impact of optimization techniques, achieving substantial cost savings while maintaining service levels. Finally, the de-anonymization study of EJMR posts underscores the complexities of online anonymity and the ethical considerations surrounding data privacy in the digital age. These studies collectively demonstrate the power of statistical methods to address real-world challenges and raise important questions about the responsible use of data and algorithms.