This collection of preprints explores a wide range of applications for statistical modeling and machine learning. These applications span diverse fields, from ecological forecasting and bioacoustics to personalized medicine and causal inference. A notable trend is the emergence of novel Bayesian methodologies designed to tackle complex data challenges. For example, Bae et al. (2024) propose a Bayesian framework for high-dimensional mediation analysis. This framework uses a multivariate stochastic search variable selection method with Markov random field and Bernoulli priors to pinpoint active pathways and estimate indirect effects in metabolomics data. In a similar vein, Zorzetto et al. (2024) introduce a Bayesian nonparametric model for principal stratification. Their approach employs dependent Dirichlet processes to investigate the causal relationship between air pollution and social mobility, mediated by education level. Diana et al. (2024) contribute a unified Bayesian framework for mortality model selection, integrating model selection and parameter estimation using reversible jump Markov chain Monte Carlo. These Bayesian approaches offer robust solutions for managing complex dependencies and uncertainties inherent in various research domains.
Beyond Bayesian methods, the application of machine learning for enhanced prediction and classification is a recurring theme. Jin et al. (2024) investigate the predictive power of large language models (LLMs) in forecasting clinical trial outcomes. Their findings suggest that GPT-4 excels in early trial phases, while HINT demonstrates superior performance in later phases. Gibbons et al. (2024) explore the use of generative AI models, specifically ACGAN and DDPMs, for data augmentation in bioacoustic classification, achieving improved accuracy in noisy environments. Sadeghi (2024) employs time series forecasting techniques, such as ARIMA and LSTM networks, to predict project performance metrics in urban road reconstruction. These studies collectively showcase the potential of machine learning to refine prediction and decision-making across diverse fields.
Several contributions delve into specific applications and methodological advancements in statistical modeling. Xia et al. (2024) investigate dolphin swimming performance using a particle filter to fuse external and internal kinematic measurements, enabling precise trajectory estimation and analysis of energetic cost during cornering maneuvers. Chen et al. (2024) develop statistical generative models for human operational motions, leveraging Riemannian geometry and functional PCA to emulate realistic human shape sequences. Amaral et al. (2024) address the complexities of spatio-temporal modeling of Antarctic krill abundance by employing a Hurdle-Gamma model to account for zero-inflated data and misaligned covariates. These studies highlight the importance of tailoring statistical methods to the specific characteristics of the data and the research question at hand.
Furthermore, several papers address crucial methodological considerations in statistical analysis. Wesselkamp et al. (2024) revisit the concept of the ecological forecast horizon, proposing a unified framework for evaluating predictability and defining empirical forecast horizons. Lazic (2024) discusses the ultimate issue error in statistical inference, emphasizing the importance of differentiating between parameter testing and hypothesis testing. Shchetkina & Berman (2024) investigate the conditions under which heterogeneity of treatment effects becomes actionable for personalization, introducing the concept of actionable heterogeneity. These contributions underscore the need for careful methodological considerations and rigorous interpretation of statistical results.
Finally, a number of papers explore specific applications across diverse fields. Groll et al. (2024) develop a machine learning-based anomaly detection framework for life insurance contracts. Ferstad et al. (2024) propose a pipeline for learning explainable treatment policies for digital health interventions. Maity et al. (2024) introduce a Fragility Index for time-to-event endpoints in single-arm clinical trials. Cutuli et al. (2024) use Bayesian hierarchical models to capture preference heterogeneity in migration flows. These diverse applications highlight the broad relevance of statistical modeling and machine learning in a variety of domains.
Can artificial intelligence predict clinical trial outcomes? by Shuyi Jin, Lu Chen, Hongru Ding, Meijie Wang, Lun Yu https://arxiv.org/abs/2411.17595
This study investigates whether artificial intelligence can effectively predict clinical trial outcomes, a potentially transformative development for the expensive and intricate process of drug development. Researchers assessed the predictive capabilities of various large language models (LLMs), including GPT-3.5, GPT-4, GPT-4o, and GPT-4o-mini, alongside the hierarchical interaction network model (HINT). Using data from ClinicalTrials.gov, the models were tasked with predicting trial success or failure based on information such as trial title, summary, and outcome measures. Performance was evaluated using balanced accuracy, Matthews Correlation Coefficient (MCC), recall, and specificity.
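For readers who want to reproduce this style of evaluation, here is a minimal sketch of the four metrics using scikit-learn; the labels below are illustrative placeholders, not the study's data.

```python
# Computing balanced accuracy, MCC, recall (sensitivity), and specificity
# for binary trial-outcome predictions. y_true/y_pred are toy placeholders.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef, recall_score

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # 1 = trial succeeded
y_pred = np.array([1, 1, 1, 0, 1, 1, 1, 0])  # model predictions

print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("recall (sensitivity):", recall_score(y_true, y_pred))
print("specificity:", recall_score(y_true, y_pred, pos_label=0))
```

Note how this toy example mirrors the pattern reported below: perfect recall paired with low specificity signals a model that overpredicts success.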
The results presented a nuanced picture. GPT-4o emerged as the top-performing LLM, achieving the highest balanced accuracy (0.573) and MCC (0.212) across the dataset. While its recall was impressive (0.931), indicating a high sensitivity to positive outcomes, its specificity was considerably lower (0.214), suggesting a tendency to overpredict success. Other LLMs, particularly GPT-4o-mini and GPT-3.5, exhibited similar biases, with perfect or near-perfect recall but extremely low specificity. In contrast, HINT offered a more balanced performance, demonstrating the highest specificity (0.541) and respectable balanced accuracy (0.563) and MCC (0.111). This suggests HINT's strength lies in correctly identifying trial failures, a critical aspect often overlooked by the LLMs.
Further analysis revealed intriguing trends across different trial phases and disease categories. GPT-4o maintained relatively consistent performance across all phases, while HINT's specificity improved in later phases, especially Phase III. Notably, all models struggled with oncology trials, the most complex and lengthy category in the dataset, underscoring the challenges AI faces in navigating the intricacies of cancer research. Trial duration also played a significant role, with accuracy decreasing as trial length increased for all models. HINT, in particular, experienced a sharp decline in performance for longer trials. Additionally, the study found LLMs performed poorly in predicting the outcomes of terminated trials, highlighting their limitations in capturing external factors influencing trial termination. HINT, however, demonstrated consistent accuracy in identifying terminated trials.
Learning Explainable Treatment Policies with Clinician-Informed Representations: A Practical Approach by Johannes O. Ferstad, Emily B. Fox, David Scheinker, Ramesh Johari https://arxiv.org/abs/2411.17570
Caption: This figure compares the performance of different CATE estimators and state/action representations for a digital health intervention targeting glycemic control in T1D youth. Clinician-informed representations consistently outperform learned representations, with the best policy (T-Learner, clinician-informed actions, TIDE states) achieving an ATT@25% of 6.6. This highlights the importance of incorporating clinical domain knowledge into AI-driven treatment policies for improved efficacy and explainability.
Digital health interventions (DHIs) and remote patient monitoring (RPM) hold substantial promise for personalized chronic disease management. However, their practical implementation has been hindered by concerns about efficacy, clinical workload, and the "black box" nature of many AI-driven solutions. This research addresses these challenges by presenting a pipeline for developing explainable and effective treatment policies for RPM-enabled DHIs, specifically targeting glycemic control in youth with type 1 diabetes (T1D).
The core contribution of this study lies in demonstrating the vital role of clinical domain knowledge in crafting state and action representations for these AI models. Instead of relying solely on black-box machine learning to analyze raw data (e.g., continuous glucose monitor (CGM) readings and treatment messages), the researchers integrated clinician-informed representations. These representations leverage established clinical practices, such as summarizing CGM data into clinically relevant metrics like time-in-range (TIR) and employing clinician-defined features for treatment messages. These representations are then used to train targeting policies that prioritize patients for intervention based on estimated conditional average treatment effects (CATEs): τ(s, a) = ρ(s, a) − ρ(s, 0), where ρ(s, a) is the expected outcome for a patient in state s under action a, and a = 0 denotes no intervention.
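As an illustration of the estimator family used here, below is a minimal T-Learner sketch; the gradient-boosting models and function names are illustrative assumptions, not the authors' pipeline.

```python
# A minimal T-Learner sketch for CATEs of the form tau(s, a) = rho(s, a) - rho(s, 0):
# fit one outcome model per arm, then take the difference of predictions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_t_learner(X, a, y):
    """Fit separate outcome models on treated (a == 1) and control (a == 0) patients."""
    model_treated = GradientBoostingRegressor().fit(X[a == 1], y[a == 1])
    model_control = GradientBoostingRegressor().fit(X[a == 0], y[a == 0])
    return model_treated, model_control

def estimate_cate(model_treated, model_control, X):
    """CATE estimate: predicted outcome under treatment minus under no intervention."""
    return model_treated.predict(X) - model_control.predict(X)
```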
The results clearly demonstrate the advantages of clinician-informed representations. Policies derived from these representations significantly outperformed those learned from black-box methods in terms of both efficacy and efficiency. In fact, only the clinician-informed policies consistently surpassed random targeting. The most effective policy, leveraging clinician-informed action representations, TIDE-derived state representations, and a T-Learner CATE estimator, achieved an ATT@25% of 6.6 [95% CI: 5.6-7.6] on the held-out test set. This metric represents the average treatment effect on the treated when 25% of the population receives an intervention, reflecting a realistic clinical capacity constraint. Moreover, the clinician-informed policies proved to be more interpretable and aligned with established clinical guidelines. They appropriately prioritized patients with lower TIR, larger drops in TIR, higher mean glucose, and those not using insulin pumps—all factors known to influence the effectiveness of interventions. This alignment with clinical practice is crucial for building trust and encouraging adoption among healthcare professionals.
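The ATT@25% metric can be approximated in a naive plug-in fashion by ranking patients on estimated CATE and averaging the top quarter. The sketch below illustrates the capacity-constrained idea only; the paper evaluates policies on held-out experimental data rather than trusting the CATE estimates themselves.

```python
# Plug-in approximation of ATT@k: "treat" the top-k fraction of patients ranked
# by estimated CATE and average their estimated effects.
import numpy as np

def att_at_k(cate_hat, k=0.25):
    """Mean estimated CATE among the top-k fraction of patients (capacity constraint)."""
    n_treat = max(1, int(np.ceil(k * len(cate_hat))))
    return np.sort(cate_hat)[::-1][:n_treat].mean()
```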
Multimodal Whole Slide Foundation Model for Pathology by Tong Ding, Sophia J. Wagner, Andrew H. Song, Richard J. Chen, Ming Y. Lu, Andrew Zhang, Anurag J. Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, Drew F.K. Williamson, Bowen Chen, Cristina Almagro-Perez, Paul Doucet, Sharifa Sahai, Chengkuan Chen, Daisuke Komura, Akihiro Kawabe, Shumpei Ishikawa, Georg Gerber, Tingying Peng, Long Phi Le, Faisal Mahmood https://arxiv.org/abs/2411.19666
Caption: This figure illustrates the architecture and training process of TITAN, a novel multimodal foundation model for whole-slide images (WSIs). The model is trained in three stages, incorporating visual data from WSIs and textual data from synthetic captions and pathology reports, enabling it to learn rich slide representations and perform various clinical tasks. Panel A shows the distribution of WSIs and their characteristics in the Mass-340k dataset. Panels B and C depict the process of generating slide embeddings from WSIs. Panel D shows the caption generation process. Panel E visualizes the UMAP embedding of TITAN's slide representations, demonstrating its ability to cluster different organ tissues.
Computational pathology has been significantly advanced by foundation models trained on histopathology regions-of-interest (ROIs). However, applying these models to patient- and slide-level clinical challenges, particularly with limited clinical data or rare conditions, remains a significant obstacle. Researchers introduce TITAN, a multimodal whole slide foundation model designed to overcome these limitations. Unlike previous patch-based models, TITAN is pretrained on a massive dataset of 335,645 whole-slide images (WSIs) across 20 organ types and incorporates both visual and textual data. This approach allows TITAN to learn rich, general-purpose slide representations without requiring clinical labels, enabling its application to diverse clinical tasks, including rare disease retrieval and cancer prognosis.
TITAN's pretraining involves three stages. Stage 1 focuses on vision-only unimodal pretraining using millions of high-resolution ROIs and a student-teacher knowledge distillation approach adapted for slide-level learning. Stage 2 incorporates cross-modal alignment by contrasting slide embeddings with 423,122 synthetic captions generated by PathChat, a multimodal generative AI for pathology. This step enhances TITAN's ability to capture fine-grained morphological descriptions. Finally, Stage 3 refines the model through cross-modal alignment with 182,862 pathology reports at the slide level, enabling it to understand coarser clinical descriptions and generate pathology reports.
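To make the cross-modal alignment in Stages 2 and 3 concrete, here is a sketch of a symmetric CLIP-style contrastive loss over paired slide and text embeddings. This is a conceptual illustration under assumed dimensions and temperature, not TITAN's exact training objective.

```python
# Symmetric InfoNCE loss: each slide embedding is pulled toward its paired
# caption/report embedding and pushed away from the other pairs in the batch.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(slide_emb, text_emb, temperature=0.07):
    slide_emb = F.normalize(slide_emb, dim=-1)     # (B, D) slide embeddings
    text_emb = F.normalize(text_emb, dim=-1)       # (B, D) text embeddings
    logits = slide_emb @ text_emb.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(slide_emb.size(0))      # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```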
The researchers evaluated TITAN on a wide array of tasks, demonstrating its superior performance compared to existing ROI and slide foundation models. In linear probing experiments for morphological subtyping, TITAN and its vision-only variant, TITAN<sup>V</sup>, achieved average performance improvements of +8.4% and +6.7%, respectively, over the next-best model. In cross-modal zero-shot classification, TITAN significantly outperformed the baseline, achieving a +56.52% increase in balanced accuracy for multi-class tasks and +13.8% for binary tasks. TITAN also excelled in rare cancer retrieval, highlighting its potential to aid clinicians in diagnosing challenging cases. In report generation tasks, TITAN outperformed the baseline by a remarkable 161% across three evaluation metrics.
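Linear probing of this kind amounts to training a simple classifier on frozen slide embeddings. A minimal sketch follows; the random embeddings and labels are placeholders, not TITAN's actual outputs.

```python
# Linear probe: logistic regression on frozen embeddings, scored with
# balanced accuracy as in the subtyping evaluations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
emb_train, emb_test = rng.normal(size=(200, 768)), rng.normal(size=(50, 768))
y_train, y_test = rng.integers(0, 3, 200), rng.integers(0, 3, 50)

probe = LogisticRegression(max_iter=1000).fit(emb_train, y_train)
print(balanced_accuracy_score(y_test, probe.predict(emb_test)))
```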
A Machine Learning-based Anomaly Detection Framework in Life Insurance Contracts by Andreas Groll, Akshat Khanna, Leonid Zeldin https://arxiv.org/abs/2411.17495
Caption: This figure compares the performance of various unsupervised anomaly detection methods on two insurance datasets. It shows the proportion of detected anomalies and the number of correctly identified synthetic anomalies (out of four) for each method, highlighting the superior performance of autoencoders and variational autoencoders, particularly on the larger dataset. Classical methods struggled with the larger dataset, while Isolation Forest exhibited oversensitivity.
Life insurance, like other insurance sectors, relies heavily on data integrity. Anomaly detection is therefore essential for identifying fraud, errors, or emerging trends. However, the frequent lack of labeled data necessitates unsupervised approaches. This study benchmarks various classical and modern unsupervised anomaly detection methods on two health insurance datasets, a proxy for life insurance data, augmented with synthetic anomalies. The methods tested include proximity-based approaches (k-nearest neighbors, k-means, DBSCAN, HDBSCAN, and One-Class SVM), a tree-based method (Isolation Forest), and deep learning techniques (autoencoders and variational autoencoders). The goal was to evaluate their effectiveness in detecting anomalies without prior knowledge and with minimal human intervention, particularly addressing the challenges posed by larger, more complex datasets.
Data preprocessing involved calculating BMI (BMI = Weight/(Height^2)) for Dataset 1 and handling missing values in both datasets. Categorical variables in Dataset 2 were one-hot encoded. Classical methods were primarily evaluated using the Silhouette Score (for clustering quality) and Anomaly Score (for Isolation Forest), with grid search employed for automatic parameter tuning where applicable. Deep learning models, lacking a direct scoring mechanism, employed ensemble learning with three models each, varying hidden layer and latent space dimensions. Performance was assessed based on runtime, the number of detected synthetic anomalies (out of four), and the overall number of data points flagged as anomalous.
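For intuition, here is a minimal sketch of the two approaches that fared best in the benchmark: Isolation Forest and a reconstruction-error autoencoder ensemble. The architectures, thresholds, and placeholder data are illustrative assumptions, not the paper's exact configurations.

```python
# Isolation Forest flags points isolated by few random splits; the autoencoder
# ensemble flags points with high average reconstruction error across models
# with different bottleneck sizes.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

def autoencoder_ensemble_scores(X, latent_dims=(4, 6, 8)):
    """Average reconstruction error across autoencoders with varying bottlenecks."""
    X = StandardScaler().fit_transform(X)
    scores = np.zeros(len(X))
    for d in latent_dims:
        # An MLPRegressor trained to reproduce its input acts as a simple autoencoder.
        ae = MLPRegressor(hidden_layer_sizes=(16, d, 16), max_iter=500,
                          random_state=0).fit(X, X)
        scores += ((X - ae.predict(X)) ** 2).mean(axis=1)
    return scores / len(latent_dims)

X = np.random.default_rng(0).normal(size=(986, 13))               # placeholder for Dataset 1
iso_flags = IsolationForest(random_state=0).fit_predict(X) == -1  # -1 = anomaly
ae_scores = autoencoder_ensemble_scores(X)
ae_flags = ae_scores > np.quantile(ae_scores, 0.99)               # flag top 1% by error
```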
The results revealed mixed performance for classical methods. While most performed reasonably well on the smaller Dataset 1 (986 rows, 13 columns), they struggled with the larger, more complex Dataset 2 (25,000 rows, 51 columns after preprocessing). DBSCAN and HDBSCAN failed to complete on Dataset 2, while others exhibited limited anomaly detection capabilities. Isolation Forest, while detecting all four anomalies in both datasets, flagged a large number of normal data points as anomalous (342 out of 990 in Dataset 1 and 1974 out of 24010 in Dataset 2), raising concerns about its reliability. In contrast, both autoencoder and variational autoencoder ensembles successfully detected all four synthetic anomalies in both datasets within reasonable runtimes. However, VAEs marked over 50% of Dataset 1 as anomalous, suggesting potential oversensitivity.
When Is Heterogeneity Actionable for Personalization? by Anya Shchetkina, Ron Berman https://arxiv.org/abs/2411.16552
Caption: Visualizing Actionable Heterogeneity
Targeting and personalization policies are increasingly popular strategies to enhance outcomes in various domains. While the presence of heterogeneous treatment effects is considered essential for personalization to be effective, this research from the Wharton School argues that heterogeneity alone is insufficient. The study introduces the concept of "actionable heterogeneity," requiring that the most effective intervention varies across subgroups, visualized as crossover interactions in outcomes across treatments. The magnitude of these crossovers, and thus the potential gain from personalization, is determined by three factors: within-treatment heterogeneity (σ), cross-treatment correlation (ρ), and variation in average outcomes (s). The study develops a statistical model to quantify this gain, expressed as E[I(Y<sub>A</sub> − Y<sub>B</sub> > 0)(Y<sub>A</sub> − Y<sub>B</sub>)], where Y<sub>A</sub> and Y<sub>B</sub> are potential outcomes for treatments A and B. Writing μ<sub>D</sub> = μ<sub>A</sub> − μ<sub>B</sub> for the difference in population-level average outcomes and σ<sub>D</sub> = σ√(2(1 − ρ)) for the standard deviation of Y<sub>A</sub> − Y<sub>B</sub>, this simplifies to μ<sub>D</sub>Φ(μ<sub>D</sub>/σ<sub>D</sub>) + σ<sub>D</sub>φ(μ<sub>D</sub>/σ<sub>D</sub>), where Φ and φ are the CDF and PDF of the standard normal distribution, respectively.
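The closed form above is easy to compute directly. Here is a hedged sketch, assuming two treatments with equal within-treatment heterogeneity σ and cross-treatment correlation ρ, so that Y<sub>A</sub> − Y<sub>B</sub> is normal with standard deviation σ√(2(1 − ρ)).

```python
# Expected gain E[(Y_A - Y_B)^+] for the two-treatment normal model above.
import numpy as np
from scipy.stats import norm

def personalization_gain(mu_a, mu_b, sigma, rho):
    """Gain when Y_A - Y_B ~ N(mu_a - mu_b, 2 * sigma**2 * (1 - rho))."""
    mu_d = mu_a - mu_b
    sd_d = sigma * np.sqrt(2.0 * (1.0 - rho))
    return mu_d * norm.cdf(mu_d / sd_d) + sd_d * norm.pdf(mu_d / sd_d)

# Example: identical average outcomes, so all gain comes from heterogeneity.
print(personalization_gain(mu_a=0.0, mu_b=0.0, sigma=1.0, rho=0.5))  # sigma*sqrt(1-rho)/sqrt(pi)
```

The example makes the paper's intuition visible: with equal means, the gain scales with σ√(1 − ρ), so lower cross-treatment correlation raises the value of personalizing.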
The researchers applied their model and five common personalization methods to two large-scale field experiments focused on encouraging flu vaccination. The first, the "Walmart study," involved 22 behavioral nudges, while the second, the "Penn-Geisinger study," tested 19 nudges. The results showed an 18% gain from personalization in the Penn-Geisinger study and a more modest 4% gain in the Walmart study, consistent with the model's predictions. The study also found that having more treatments can sometimes reduce the potential gain from personalization, especially with peaked distributions of average treatment outcomes. A counterfactual analysis using the model revealed that the primary driver of the difference in personalization gains between the two studies was the significantly higher within-treatment heterogeneity in the Penn-Geisinger study, largely due to richer covariate data. However, a sensitivity analysis showed that a 1% reduction in cross-treatment correlation would lead to a greater increase in the gain from personalization than a 1% increase in within-treatment heterogeneity in both studies. This highlights the importance of considering cross-treatment correlation when designing and evaluating personalization strategies.
This newsletter showcases a diverse range of advancements in statistical modeling and machine learning across various domains. The highlighted papers demonstrate the growing sophistication of Bayesian methods for causal inference and the increasing reliance on machine learning for prediction and classification tasks. From predicting clinical trial outcomes with LLMs and enhancing bioacoustic classification with generative AI to detecting anomalies in insurance contracts and personalizing treatment policies for digital health interventions, these studies highlight the practical impact of these fields. A crucial theme emerging from this collection is the importance of incorporating domain expertise and tailoring statistical methods to the specific characteristics of the data and research question. The research on actionable heterogeneity further emphasizes the need for a nuanced understanding of treatment effects and the importance of considering factors beyond simple heterogeneity when designing personalized interventions. The advancements presented in these preprints pave the way for more robust, accurate, and impactful applications of statistical modeling and machine learning in the future.