This collection of papers presents a diverse range of advancements in statistical modeling and inference, spanning applications from insurance claim prediction to macroeconomic forecasting. Several papers focus on enhancing existing models by incorporating richer data sources or addressing specific limitations. For instance, Hou et al. (2024) introduce a novel topic-based finite mixture model for insurance claims, integrating both claim amounts and textual descriptions to improve prediction accuracy and clustering. Similarly, Shi et al. (2024) propose a mixture hidden Markov model for intermittently observed disease processes, accounting for disease subtype heterogeneity and partially known disease type information. In the realm of causal inference, Du et al. (2024) develop an assumption-lean post-integrated inference method using negative control outcomes to address bias in multiple hypothesis testing after data integration, and Deliorman et al. (2024) investigate the impact of model misspecification in surrogate endpoint evaluation using the information-theoretic causal inference framework.
Another prominent theme is the development of novel methodologies for improved estimation and inference. Fogliato et al. (2024) present an empirical Bayes estimator for precise model benchmarking with limited observations, demonstrating improved precision in subgroup-level estimates of large language model performance. Gao et al. (2024) propose a penalized sparse covariance regression approach for high-dimensional covariates, establishing non-asymptotic error bounds and demonstrating the oracle property of folded concave penalized estimators. Wang et al. (2024) introduce a Monte Carlo expectation-maximization method for acoustic spatial capture-recapture with unknown identities, addressing the challenge of constructing capture histories. Saxena et al. (2024) develop Fenrir, a scalable inference approach for Bayesian multinomial logistic-normal dynamic linear models that substantially improves computational efficiency for longitudinal count compositional data analysis.
Several applications of these statistical methods to real-world problems are presented. Dai et al. (2024) investigate the impact of resource allocation on economic growth in Chinese cities, quantifying the cost of misallocation and proposing policy implications. Ebrahimzadeh et al. (2024) develop a framework for learning ranking policies in e-commerce marketplaces, formulating the expected reward of the marketplace and demonstrating trade-offs governed by different context value distributions. Sisti et al. (2024) propose a Bayesian method for estimating adverse effects in observational studies with truncation by death, introducing a composite ordinal outcome that combines death and adverse events. Yung et al. (2024) provide a framework for understanding and discussing strategies for overall survival safety monitoring in clinical trials, highlighting different approaches and their practical implications.
A subset of papers focuses on specific data types and modeling challenges. Chen et al. (2024) introduce efficient Bayesian additive regression models for microbiome studies, addressing the computational intractability of traditional models. Hamidi (2024) analyzes spatial and temporal land clutter statistics in SAR imaging, demonstrating the effectiveness of Weibull and Rayleigh distributions for modeling clutter characteristics. Pherwani et al. (2024) employ dynamic generalized linear models for scalable spatiotemporal modeling and forecasting of human mobility, demonstrating accurate occupancy count forecasts. Finally, three papers by Chen and colleagues (2024), summarized in detail below, address the increasing use of real-world data (RWD) and real-world evidence (RWE) in drug development and clinical trials, particularly for rare diseases.
Decentralized Clinical Trials in the Era of Real-World Evidence: A Statistical Perspective by Jie Chen, Junrui Di, Nadia Daizadeh, Ying Lu, Hongwei Wang, Yuan-Li Shen, Jennifer Kirk, Frank W. Rockhold, Herbert Pang, Jing Zhao, Weili He, Andrew Potter, Hana Lee https://arxiv.org/abs/2410.06591
Decentralized clinical trials (DCTs), where some or all trial-related activities occur outside traditional sites, are gaining traction due to their potential to broaden participation, improve data capture in real-world settings, and accelerate treatment implementation. While offering numerous advantages, DCTs also present unique statistical challenges. This paper offers a statistical perspective on DCT design, conduct, and analysis, focusing on the implications of decentralized elements like digital health technologies (DHTs) and diverse participant populations.
The use of DHTs for remote data acquisition and monitoring requires careful consideration of their fitness-for-purpose, meaning they must be validated for their intended use in evaluating endpoints. The increasing integration of artificial intelligence (AI) in DHTs presents both opportunities and challenges, especially in participant recruitment and safety monitoring. The reliance on DHTs necessitates careful definition of estimands within the ICH E9(R1) framework, accounting for potential DHT malfunctions. The diverse participant populations often enrolled in DCTs can lead to greater endpoint heterogeneity compared to traditional trials. This heterogeneity must be accounted for in the statistical analysis to ensure valid and reliable results.
Statistically relevant aspects of DCTs include participant screening, recruitment, and retention, often facilitated by DHTs. Remote informed consent, medication dispensing, and outcome/endpoint assessment are key considerations. Data management and monitoring are more complex in DCTs due to the volume and variety of data from multiple sources. A comprehensive data management plan (DMP) is crucial, addressing data origin and flow, remote data acquisition methods, and vendor management. Real-time analytics integrated into DHT systems can enhance data management efficiency. Safety monitoring is paramount, with DHTs enabling real-time monitoring and alerting for safety issues. The trial protocol should include a detailed safety monitoring plan outlining participant response and adverse event reporting procedures.
The statistical analysis plan (SAP) should address data collection methods, the appropriateness of decentralized non-inferiority trials, pre-specified endpoints, the estimand framework, and potential data quality issues. Missing data is a significant concern in DCTs, potentially arising from device malfunction, participant non-compliance, or data transfer issues. Strategies for handling missing data, such as imputation or maximum likelihood-based methods, should be detailed in the SAP. This paper underscores the importance of addressing the unique statistical challenges associated with decentralized designs, remote data acquisition, and diverse participant populations throughout the trial lifecycle. Careful consideration of estimands, trial design, potential biases, and the development of a comprehensive SAP are critical for ensuring the validity and reliability of study results, ultimately contributing to the generation of meaningful real-world evidence.
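To make one such strategy concrete, here is a minimal sketch (not taken from the paper) of model-based imputation for a DHT-derived variable using scikit-learn's IterativeImputer. The variable names are hypothetical, and the sketch assumes missingness is ignorable given the observed covariates, an assumption a real SAP would need to justify and probe with sensitivity analyses.

```python
# Illustrative model-based imputation for DHT data with gaps
# (hypothetical variable names; assumes missingness is ignorable given
# the observed covariates, which a real SAP would need to justify).
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(55, 10, n),
    "baseline_score": rng.normal(50, 8, n),
    "weekly_steps": rng.normal(40_000, 12_000, n),
})
# Simulate device malfunction: roughly 15% of step counts are missing.
df.loc[rng.random(n) < 0.15, "weekly_steps"] = np.nan

# Impute the missing values using the other covariates.
imputer = IterativeImputer(sample_posterior=True, random_state=0)
completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# In practice one would draw several completed data sets, analyse each,
# and pool the estimates (Rubin's rules) rather than impute once.
print(completed["weekly_steps"].isna().sum())  # 0 missing after imputation
```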
Causal machine learning for predicting treatment outcomes by Stefan Feuerriegel, Dennis Frauen, Valentyn Melnychuk, Jonas Schweisthal, Konstantin Hess, Alicia Curth, Stefan Bauer, Niki Kilbertus, Isaac S. Kohane, Mihaela van der Schaar https://arxiv.org/abs/2410.08770
Caption: This graph illustrates the concept of individualized treatment effects, a key aspect of causal machine learning. The purple line represents the varying treatment effect across a patient's age, with the dashed line indicating the average treatment effect. Causal ML aims to estimate these individualized effects, enabling personalized treatment strategies that deviate from the average effect based on patient characteristics like age.
Causal machine learning (ML) offers a powerful approach to predicting treatment outcomes and personalizing patient care by focusing on the causal relationship between treatment and outcome, unlike traditional ML which primarily predicts outcomes without establishing causality. This involves estimating causal quantities like the average treatment effect (ATE) or the conditional average treatment effect (CATE), representing the expected difference in outcomes under different treatment regimes. Causal ML can leverage both experimental data from randomized controlled trials (RCTs) and observational data from real-world data (RWD) sources, offering the significant advantage of estimating individualized treatment effects for personalized predictions and more precise decision-making.
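In potential-outcome notation (a standard formulation rather than anything specific to this paper), these two estimands can be written as

```latex
\mathrm{ATE} = \mathbb{E}\big[Y(1) - Y(0)\big], \qquad
\mathrm{CATE}(x) = \mathbb{E}\big[Y(1) - Y(0) \mid X = x\big],
```

where Y(1) and Y(0) denote the potential outcomes under treatment and control, and X collects patient characteristics such as the age shown in the figure above.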
The foundation of causal ML lies in addressing the fundamental problem of causal inference: we can only observe the factual outcome under a given treatment, not the counterfactual outcome under a different treatment. This requires specific assumptions for identifiability, including the stable unit treatment value assumption (SUTVA), positivity (overlap), and unconfoundedness (ignorability). SUTVA assumes no interference between patients and consistent potential outcomes. Positivity requires a non-zero probability of receiving any treatment for all patient characteristic combinations. Unconfoundedness assumes that, given observed covariates, treatment assignment is independent of potential outcomes. Validating these assumptions, particularly unconfoundedness, is crucial but challenging in RWD settings, necessitating strategies like leveraging domain knowledge, instrumental variable approaches, and causal sensitivity analysis to assess robustness to unobserved confounding.
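One routine diagnostic for the positivity assumption is to estimate propensity scores and inspect their overlap between treatment arms. The sketch below (simulated data and illustrative covariate names, not the authors' code) shows the idea with a simple logistic regression.

```python
# Illustrative positivity/overlap diagnostic on simulated RWD-like data
# (hypothetical covariates; a real analysis would use the study's own
# confounder set and formal trimming or weighting rules).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5_000
age = rng.normal(60, 12, n)
severity = rng.normal(0, 1, n)
# Treatment assignment depends on covariates (confounded, as in RWD).
logit = -0.03 * (age - 60) + 0.8 * severity
treat = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([age, severity])
ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]

# Positivity is questionable where estimated scores pile up near 0 or 1.
for arm, label in [(1, "treated"), (0, "control")]:
    scores = ps[treat == arm]
    print(f"{label}: min={scores.min():.3f}, max={scores.max():.3f}")
print("share of units with ps outside [0.05, 0.95]:",
      np.mean((ps < 0.05) | (ps > 0.95)).round(3))
```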
Various methods exist for estimating treatment effects within the causal ML framework. Meta-learners like S-learner, T-learner, DR-learner, and R-learner offer model-agnostic approaches for CATE estimation with binary treatments. Model-specific methods like causal trees and causal forests adapt existing ML models for enhanced performance. Evaluating causal ML models is complex due to the unobservability of counterfactuals and ground-truth treatment effects. Current approaches, including heuristics like comparing performance on factual outcomes or using pseudo-outcomes as surrogates, have limitations. Despite these challenges, causal ML holds immense potential for clinical translation. It can generate new clinical evidence by identifying patient subgroups with positive or negative treatment responses, personalize treatment strategies based on individual characteristics, and analyze treatment effect heterogeneity in RWD. However, challenges like the technical difficulty of estimating heterogeneous treatment effects and predicting outcomes, the need for robust uncertainty quantification, and the development of standardized protocols and regulatory frameworks remain. Future research should prioritize addressing these technical challenges, demonstrating clinical insights, and integrating causal ML into clinical decision support systems. A cautious and rigorous implementation approach, including validation with RCTs where possible, is essential to realize the full potential of causal ML in improving patient care.
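As a concrete, deliberately simplified illustration of the meta-learners mentioned above, the sketch below implements a T-learner: separate outcome models are fit on treated and control units, and the difference of their predictions is the CATE estimate. The simulated data and the choice of gradient-boosted trees are illustrative only.

```python
# Minimal T-learner sketch for CATE estimation with a binary treatment.
# Simulated data; gradient-boosted trees stand in for any outcome model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
n = 4_000
X = rng.normal(size=(n, 3))                  # patient covariates
A = rng.binomial(1, 0.5, n)                  # randomized treatment
tau = 1.0 + 0.5 * X[:, 0]                    # true effect varies with X[:, 0]
Y = X @ np.array([0.3, -0.2, 0.1]) + A * tau + rng.normal(0, 1, n)

# T-learner: fit separate outcome models on treated and control units.
mu1 = GradientBoostingRegressor().fit(X[A == 1], Y[A == 1])
mu0 = GradientBoostingRegressor().fit(X[A == 0], Y[A == 0])
cate_hat = mu1.predict(X) - mu0.predict(X)

print("mean estimated CATE:", cate_hat.mean().round(2))   # close to ATE = 1.0
print("correlation with true effects:",
      np.corrcoef(tau, cate_hat)[0, 1].round(2))
```

By contrast, an S-learner would fit a single model with the treatment indicator as a feature, while the DR- and R-learners add propensity-score-based corrections for greater robustness to confounding.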
Use of Real-World Data and Real-World Evidence in Rare Disease Drug Development: A Statistical Perspective by Jie Chen, Susan Gruber, Hana Lee, Haitao Chu, Shiowjen Lee, Haijun Tian, Yan Wang, Weili He, Thomas Jemielita, Yang Song, Roy Tamura, Lu Tian, Yihua Zhao, Yong Chen, Mark van der Laan, Lei Nie https://arxiv.org/abs/2410.06586
Real-world data (RWD) and real-world evidence (RWE) are increasingly crucial in drug development, especially for rare diseases. This paper provides a statistical perspective on leveraging RWD/RWE in this challenging area, reviewing existing regulatory guidance, current practices, and proposing a targeted learning roadmap. The authors emphasize the importance of addressing the unique challenges of rare disease drug development, such as small patient populations, limited disease understanding, and diagnostic difficulties.
Natural history studies (NHS) are essential for understanding disease progression and identifying suitable endpoints. Precisely defining the study population, treatments, endpoints, and population-level summaries is crucial in NHS design. Consideration of competing risks/events, like death, is also necessary, as they can significantly impact the disease's natural history. For clinical trials, RWD/RWE can inform trial design, including patient identification, site selection, and endpoint selection. RWD/RWE can also serve as external controls in single-arm trials (SATs), a common design in rare disease research due to limited sample sizes. The paper also discusses hybrid control designs for RCTs, where RWD/RWE augments internal control data, allowing broader patient access to promising therapies.
The paper advocates for a targeted learning (TL) roadmap for rare disease trials. This structured approach begins with a well-defined research question and corresponding estimands (Step 0). It then proceeds through defining the observed data distribution and statistical model (Step 1), mapping causal relationships (Step 2), estimating the targeted parameter (Step 3), measuring uncertainty (Step 4), and performing sensitivity analysis (Step 5). Step 3 introduces key causal parameters: the average treatment effect (ATE), given by E(Y¹) - E(Y⁰); the average treatment effect among the treated (ATT), given by E(Y¹|A=1) - E(Y⁰|A=1); and the average treatment effect among the comparators (ATC), given by E(Y¹|A=0) - E(Y⁰|A=0). These parameters, defined in terms of the full data, provide clear interpretations for causal inference.
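To make these estimands concrete, here is a minimal plug-in (G-computation) sketch on simulated data. It is illustrative only: a full targeted-learning analysis would use a targeted estimator such as TMLE with carefully chosen nuisance models rather than this naive plug-in, and the variable names are hypothetical.

```python
# Plug-in (G-computation) estimates of ATE, ATT and ATC on simulated data.
# Illustration only; a targeted-learning analysis would instead use TMLE
# with cross-validated nuisance estimators.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 10_000
X = rng.normal(size=(n, 2))                           # baseline covariates
propensity = 1 / (1 + np.exp(-0.6 * X[:, 0]))         # confounded assignment
A = rng.binomial(1, propensity)
Y = 2.0 * A + X @ np.array([1.0, -0.5]) + rng.normal(0, 1, n)

# Outcome regression E[Y | A, X], then predict both potential outcomes.
model = LinearRegression().fit(np.column_stack([A, X]), Y)
Y1 = model.predict(np.column_stack([np.ones(n), X]))   # predicted Y under A=1
Y0 = model.predict(np.column_stack([np.zeros(n), X]))  # predicted Y under A=0

ate = np.mean(Y1 - Y0)                    # E(Y1) - E(Y0)
att = np.mean(Y1[A == 1] - Y0[A == 1])    # effect among the treated
atc = np.mean(Y1[A == 0] - Y0[A == 0])    # effect among the comparators
print(f"ATE={ate:.2f}, ATT={att:.2f}, ATC={atc:.2f}")   # all close to 2.0
```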
Case studies presented in the paper illustrate the practical application of RWD/RWE. The approval of SKYCLARYS for Friedreich ataxia demonstrates the use of NHS and open-label extension studies to support product effectiveness. Other examples showcase RWD/RWE use in various regulatory contexts, including original marketing applications and label expansions. These examples highlight the importance of data quality, adherence to regulatory guidance, and the application of causal inference principles in generating robust RWE. The TL roadmap is recommended as a systematic and transparent approach to generating reliable RWE for rare disease drug development.
Challenges and Possible Strategies to Address Them in Rare Disease Drug Development: A Statistical Perspective by Jie Chen, Lei Nie, Shiowjen Lee, Haitao Chu, Haijun Tian, Yan Wang, Weili He, Thomas Jemielita, Susan Gruber, Yang Song, Roy Tamura, Lu Tian, Yihua Zhao, Yong Chen, Mark van der Laan, Hana Lee https://arxiv.org/abs/2410.06585
Developing drugs for rare diseases presents significant statistical challenges. Small patient populations make traditional RCTs difficult to power, often leading to early termination due to low accrual rates (over 25% of rare disease trials between 2016 and 2020 were terminated early for this reason). Rarity and complexity often cause diagnostic delays and inaccuracies, further complicating trial design. The lack of well-understood natural history makes identifying appropriate endpoints and developing informative biomarkers difficult, compounded by the frequent lack of consensus on clinically meaningful endpoints and the need for carefully validated surrogate endpoints. The often slow progression of rare diseases necessitates long-term trials, posing recruitment and retention challenges.
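To give a sense of scale with purely hypothetical numbers (not figures from the paper), the sketch below computes the sample size needed to detect a 30% versus 50% response-rate difference under standard operating characteristics; even this fairly large effect requires roughly 186 patients, more than many rare-disease populations can supply within a feasible accrual window.

```python
# Illustrative sample-size calculation for a two-arm trial comparing
# response rates of 30% vs 50% (hypothetical numbers, not from the paper).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.50, 0.30)          # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0,
    alternative="two-sided",
)
print(round(n_per_arm))  # roughly 93 patients per arm, about 186 in total
```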
This paper reviews regulatory guidance from agencies worldwide, including the FDA, EMA, NMPA, MHLW, and CADTH, highlighting the increasing focus on rare disease drug development. It then discusses the statistical challenges in trial design, conduct, and analysis, including the difficulties of externally controlled trials due to potential biases, the impact of patient heterogeneity on study power, and the complexities of benefit-risk assessment (BRA) with small sample sizes and short follow-up periods.
The authors propose several strategies to address these challenges. Adaptive designs, like the snSMART design, are particularly suitable for rare disease trials, offering flexibility and efficiency in dose-finding and confirming therapeutic effects. Other strategies include alternative trial designs (randomized withdrawal, delayed start), expansion from adult to pediatric populations, and incorporating biomarkers. Careful selection of clinically meaningful endpoints and appropriate external controls for single-arm trials is emphasized. For statistical analysis, methods like Bayesian borrowing, propensity score matching, and g-methods are suggested for externally controlled trials, while approaches such as the win ratio and desirability of outcome ranking are proposed for integrating evidence from multiple endpoints. Hierarchical models are discussed for investigating treatment effects using surrogate endpoints and their relationship to overall survival.
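As one simplified example of these multi-endpoint approaches (not the authors' implementation), the sketch below computes an unmatched win ratio with a two-level hierarchy, death status first and a functional score second; censoring and formal inference are deliberately ignored to keep the illustration short.

```python
# Simplified unmatched win-ratio sketch: every treated patient is compared
# with every control patient, first on death status, then on a functional
# score (higher is better) when death status ties. Censoring is ignored
# here; a real analysis would handle it explicitly.
import numpy as np

rng = np.random.default_rng(4)
n_t, n_c = 60, 60
died_t = rng.binomial(1, 0.20, n_t)
died_c = rng.binomial(1, 0.35, n_c)
score_t = rng.normal(55, 10, n_t)
score_c = rng.normal(50, 10, n_c)

wins = losses = 0
for i in range(n_t):
    for j in range(n_c):
        if died_t[i] != died_c[j]:            # hierarchy level 1: death
            if died_t[i] < died_c[j]:
                wins += 1
            else:
                losses += 1
        elif score_t[i] != score_c[j]:        # hierarchy level 2: function
            if score_t[i] > score_c[j]:
                wins += 1
            else:
                losses += 1
        # exact ties on both levels contribute to neither count

print(f"wins={wins}, losses={losses}, win ratio={wins / losses:.2f}")
```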
Beyond trial design and analysis, the paper addresses broader challenges, including limitations of market incentives and the lack of high-quality RWD. The unique challenges posed by pediatric populations are also discussed, emphasizing the need for tailored development pathways. The paper stresses the importance of coordinated efforts in rare disease research, including data sharing and collaborative outcome development, and the potential role of RWD/RWE in supporting regulatory decision-making. Collaboration among stakeholders – researchers, clinicians, regulatory agencies, biopharmaceutical companies, and patient organizations – is crucial to overcome these challenges and advance the development of effective therapies for rare diseases.
This newsletter highlights the increasing importance of sophisticated statistical methodologies in addressing complex real-world problems. From enhancing insurance claim predictions with topic modeling to improving macroeconomic forecasts, the papers discussed showcase the power of innovative statistical approaches. A key theme is the need for robust and scalable methods that can handle complex data structures, address model misspecification, and provide precise estimates even with limited data. The focus on real-world data (RWD) and real-world evidence (RWE), particularly in the context of rare disease drug development, underscores the growing need for statistically sound methods to analyze and interpret data from non-traditional sources. The challenges and opportunities presented by decentralized clinical trials (DCTs) further emphasize the importance of careful statistical consideration throughout the trial lifecycle, from estimand definition and data management to analysis and interpretation. The development of targeted learning roadmaps and causal machine learning techniques offers promising avenues for generating robust evidence and personalizing treatment strategies. Overall, the papers in this newsletter demonstrate the vital role of statistical advancements in improving decision-making and driving progress across diverse fields, including healthcare, finance, and environmental science. The continued development and application of these methods hold great promise for addressing complex challenges and improving outcomes in critical domains.