This collection of preprints explores diverse applications of statistical modeling and machine learning across various domains, including geophysics, healthcare, and climate science. Several papers introduce novel methodologies for enhancing prediction and inference in complex systems. Shibata et al. (2024) propose an efficient Bayesian inversion method for simultaneous geometry and spatial field estimation, leveraging the Karhunen-Loève expansion to overcome computational bottlenecks in previous approaches. Similarly, Namdari et al. (2024) introduce P3LS (Point Process Partial Least Squares) for analyzing latent time-varying intensity functions from inhomogeneous point processes, with applications in medical imaging. For time series analysis, Bonas et al. (2024) present CESAR (Convolutional Echo State AutoencodeR), a deep learning model combining convolutional autoencoders and echo state networks for high-resolution wind forecasting, while Gao et al. (2024) introduce ARMD (Auto-Regressive Moving Diffusion), a novel diffusion-based model inspired by ARMA theory for improved time series forecasting. These contributions highlight the ongoing development of specialized statistical and machine learning techniques for specific data types and applications.
Another recurring theme is the importance of uncertainty quantification and robust inference. Giertych et al. (2024) adapt conformal prediction to astronomical data with measurement error, providing finite sample coverage guarantees for prediction intervals. Legenkaia et al. (2024) investigate the impact of heterogeneities and convolution on Principal Component Analysis for time series, deriving analytical predictions for reconstruction errors. Cotoarbă et al. (2024) advocate for probabilistic digital twins in geotechnical engineering, incorporating uncertainties through Bayesian model updating. These works emphasize the need for methods that not only provide accurate predictions but also quantify the associated uncertainty, particularly in noisy or complex data settings.
Several papers focus on specific applications in healthcare and epidemiology. Cuellar (2024) discusses the statistical challenges in diagnosing Shaken Baby Syndrome/Abusive Head Trauma, emphasizing the need for reliable data and rigorous validation. Chatton et al. (2024) compare the performance of a survival super learner against a Cox model for kidney transplant failure prediction, finding superior discrimination but similar calibration drift over time. Plank et al. (2024) estimate excess mortality during the COVID-19 pandemic in New Zealand using a quasi-Poisson regression model, highlighting the importance of age-stratified data. These studies showcase the diverse applications of statistical methods in healthcare, from diagnosis and prognosis to epidemiological analysis.
Beyond model development, several papers address practical considerations for data analysis and software development. Panavas et al. (2024) offer design recommendations for differentially private interactive systems, emphasizing the importance of usability alongside privacy and utility. Song and Messier (2024) introduce the chopin R package for scalable spatial analysis on parallelizable infrastructure, addressing the computational challenges of large geospatial datasets. These contributions underscore the growing need for tools and frameworks that facilitate the practical application of statistical methods in real-world settings.
Finally, several papers explore the intersection of statistics with other disciplines. Bennedsen et al. (2024) analyze the Global Carbon Budget as a cointegrated system, providing insights into the dynamics of Earth's carbon cycle. Min et al. (2024) review the evolving role of applied statistics in the era of artificial intelligence, highlighting the symbiotic relationship between the two fields. These works demonstrate the broad applicability of statistical thinking and methods across a wide range of scientific and societal challenges.
But Can You Use It? Design Recommendations for Differentially Private Interactive Systems by Liudas Panavas, Joshua Snoke, Erika Tyagi, Claire McKay Bowen, Aaron R. Williams https://arxiv.org/abs/2412.11794
Caption: This diagram illustrates the proposed infrastructure for a user-friendly differentially private interactive query system. It emphasizes the use of synthetic data for query exploration, human review of research proposals, and clear communication of accuracy and uncertainty in the results, aiming to balance privacy, utility, and usability. The system guides users through query generation, accuracy specification, and project justification before executing queries on private data and disseminating noisy results.
Accessing sensitive government data is crucial for impactful public policy research, but privacy concerns often restrict access. Differentially private (DP) interactive systems, also known as validation servers, offer a promising solution by allowing researchers to query specific statistics without direct data access. While theoretically robust, the practical implementation of these systems has been hampered by usability challenges. This paper argues that previous efforts have overemphasized strict privacy guarantees at the expense of usability, hindering widespread adoption. It proposes a paradigm shift that prioritizes usability while carefully balancing privacy assurance and statistical utility.
The authors outline three core design considerations: privacy assurance (measurable and trackable privacy expenditure, transparently articulated unprotected information, and threat modeling capabilities), statistical utility (scope of analyses, valid statistical inference, and level of uncertainty), and system usability (ease of use for both data users and administrators). Existing interactive DP systems often fall short in these areas, particularly with respect to exploratory data analysis (EDA), privacy parameter setting, fixed privacy budgets, and the interpretation of private results. These limitations arise from fundamental incompatibilities between the DP framework and standard statistical workflows, coupled with unrealistic assumptions about users' DP knowledge.
To overcome these challenges, the paper presents five key recommendations. First, it suggests providing synthetic data for exploration, enabling users to familiarize themselves with the data structure and formulate meaningful queries without consuming their privacy budget. Second, it advocates for removing privacy parameter language from user inputs, replacing it with more intuitive accuracy levels. Third, it proposes allocating privacy budgets per research proposal rather than per user, allowing for more flexible and efficient use of the privacy budget. Fourth, it suggests incorporating light human review of research proposals to ensure responsible data use and prevent unintended privacy breaches. Finally, it emphasizes the importance of providing comprehensive output documentation, including automatically generated uncertainty measures and example publication language to facilitate the interpretation and dissemination of DP results. These recommendations aim to bridge the gap between DP principles and the practical needs of data analysts, making the system more accessible to users without extensive DP expertise.
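To make the second recommendation concrete, here is a minimal sketch (ours, not from the paper) of how a user-facing accuracy target could be translated into a Laplace-mechanism privacy parameter behind the scenes; the function name and interface are hypothetical.

```python
import math

def epsilon_for_accuracy(sensitivity: float, margin_of_error: float, alpha: float = 0.05) -> float:
    """Translate a user-facing accuracy target into a Laplace-mechanism epsilon.

    For Laplace noise with scale b, |noise| <= b * ln(1/alpha) holds with
    probability 1 - alpha, so requiring |noise| <= margin_of_error at
    confidence 1 - alpha gives b = margin_of_error / ln(1/alpha) and
    epsilon = sensitivity / b.
    """
    b = margin_of_error / math.log(1.0 / alpha)
    return sensitivity / b

# Example: a counting query (sensitivity 1) answered to within +/- 10
# with 95% probability costs roughly epsilon ~ 0.3.
print(epsilon_for_accuracy(sensitivity=1.0, margin_of_error=10.0))
```

A mapping like this lets the system speak to users in terms of margins of error while the privacy accounting happens internally.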
The paper also outlines a proposed infrastructure for a private interactive query system based on these recommendations. This infrastructure includes components like a landing page, query generation page, accuracy specification page, project justification page, human review process, and data results/release page. It emphasizes leveraging tested, open-source DP libraries for faster deployment and reproducibility. While this infrastructure is presented as a theoretical framework, it provides a concrete basis for evaluating the feasibility of the recommendations and identifying areas for future research.
Finally, the paper underscores the critical need for user-centered research, particularly using methods from human-computer interaction (HCI), to evaluate the usability of DP systems. It proposes several research questions focusing on the impact of synthetic data, user preferences for accuracy metrics, guidelines for evaluating research proposals, and best practices for reporting DP results. By incorporating user feedback and conducting empirical evaluations, future DP systems can be designed to be both secure and practical, ultimately empowering researchers and policymakers to make more effective use of sensitive government data.
Beyond Reweighting: On the Predictive Role of Covariate Shift in Effect Generalization by Ying Jin, Naoki Egami, Dominik Rothenhäusler https://arxiv.org/abs/2412.08869
Generalizing statistical findings across diverse populations under distribution shift is a persistent challenge in statistical inference. Traditional methods often rely on the covariate shift assumption, which posits that differences between populations are fully explained by observed covariates. However, recent research has demonstrated that adjusting for covariate shift alone is often inadequate, leaving a significant unexplained component attributed to conditional shift – the change in the conditional distribution of outcomes given the covariates. This paper explores a novel perspective on the role of covariate shift, revealing its predictive power in bounding and informing about the often unobservable conditional shift.
The authors utilize two large-scale, multi-site replication projects from the social sciences: the Pipeline Project and Many Labs 1. These projects encompass 680 studies across 65 sites and 25 hypotheses, providing a rich dataset for investigating the interplay between covariate and conditional shift. They introduce standardized, "pivotal" measures for quantifying both types of shift. The covariate shift measure is a Mahalanobis-type statistic: $\frac{1}{L} \sum_{l=1}^{L} \frac{(E_Q[X_l] - E_P[X_l])^2}{Var_P(X_l)}$, where L is the number of covariates, and P and Q denote the source and target distributions, respectively. The conditional shift measure is the standardized difference in conditional expectations: $\frac{|E_Q[\phi - \phi_P(X)]|}{sd_P(\phi - \phi_P(X))}$, where $\phi$ is the influence function and $\phi_P(X)$ is its conditional expectation under the source distribution. These standardized measures allow for meaningful comparisons of shift magnitude across different studies and variables.
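As a reading aid, the following numpy sketch implements the two standardized measures as we read the formulas above; the function names are ours, and the influence-function values φ and the source-fitted conditional mean φ_P(X) are assumed to be supplied.

```python
import numpy as np

def covariate_shift(X_source: np.ndarray, X_target: np.ndarray) -> float:
    """Mahalanobis-type standardized covariate shift:
    (1/L) * sum_l (E_Q[X_l] - E_P[X_l])^2 / Var_P(X_l)."""
    mean_p, mean_q = X_source.mean(axis=0), X_target.mean(axis=0)
    var_p = X_source.var(axis=0, ddof=1)
    return float(np.mean((mean_q - mean_p) ** 2 / var_p))

def conditional_shift(phi_target: np.ndarray, phi_hat_target: np.ndarray,
                      phi_source: np.ndarray, phi_hat_source: np.ndarray) -> float:
    """Standardized conditional shift: |E_Q[phi - phi_P(X)]| / sd_P(phi - phi_P(X)),
    where phi_hat_* are predictions of the source-fitted regression phi_P(X)."""
    resid_q = phi_target - phi_hat_target
    resid_p = phi_source - phi_hat_source
    return float(abs(resid_q.mean()) / resid_p.std(ddof=1))
```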
Empirical analysis of the replication data reveals a crucial finding: while conditional shift is non-negligible, its magnitude is often bounded by the observable covariate shift. Importantly, this pattern only emerges when using the proposed standardized measures, highlighting their importance in capturing the relative strength of the two shifts. Furthermore, the ratio of conditional shift to covariate shift exhibits remarkable stability across different hypotheses and sites, suggesting a predictable relationship between the two. This empirical observation is further supported by theoretical analysis based on a random distribution shift model, which assumes that the underlying probability distribution is subject to numerous small, random perturbations. Under this model, the conditional shift is expected to be smaller than the covariate shift, particularly when the treatment assignment mechanism is invariant across populations.
The authors then demonstrate how this predictive relationship can be leveraged for improved effect generalization. They construct prediction intervals for target population estimates by utilizing the empirically observed ratio of conditional to covariate shift. These intervals maintain valid coverage, significantly outperforming methods that ignore distribution shift (i.i.d. assumption) and offering substantial improvements over existing worst-case bounds, which tend to be overly conservative. Specifically, in scenarios without auxiliary data, the proposed method achieved near-nominal 95% coverage across both replication datasets. With auxiliary data from other hypotheses within the same sites, the method further improved, achieving even closer to nominal coverage and significantly shorter intervals compared to worst-case approaches. This demonstrates the practical utility of exploiting the predictive relationship between covariate and conditional shift for more accurate and efficient generalization.
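The paper's exact interval construction is not reproduced here; the sketch below only illustrates the general idea under an assumption we introduce for exposition: the unobservable conditional shift is bounded by the empirically estimated ratio times the observed covariate shift, which translates into an additive bias allowance on top of a standard Wald interval.

```python
import numpy as np
from scipy import stats

def shift_aware_interval(theta_hat: float, se: float, cov_shift: float,
                         ratio_hat: float, resid_sd: float, alpha: float = 0.05):
    """Widen a Wald interval by a bias allowance for conditional shift.

    Assumption (ours, for illustration): conditional shift <= ratio_hat * cov_shift,
    so the worst-case bias of the covariate-adjusted estimate is
    ratio_hat * cov_shift * resid_sd, where resid_sd = sd_P(phi - phi_P(X)).
    """
    z = stats.norm.ppf(1 - alpha / 2)
    bias_bound = ratio_hat * cov_shift * resid_sd
    half_width = z * se + bias_bound
    return theta_hat - half_width, theta_hat + half_width
```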
I See, Therefore I Do: Estimating Causal Effects for Image Treatments by Abhinav Thorat, Ravi Kolla, Niranjan Pedanekar https://arxiv.org/abs/2412.06810
Estimating individual treatment effects (ITE) is essential for personalized interventions across various domains. While existing methods often simplify treatments to scalar values, this research addresses the challenge of using images as treatments, recognizing the wealth of information they contain. The authors introduce NICE (Network for Image treatments Causal effect Estimation), a novel neural network architecture designed to leverage the multidimensional nature of image treatments for improved ITE estimation. This fills a significant gap in the existing literature, which has primarily focused on text or graph-based treatments.
NICE operates through a three-step process. First, it generates representations for both user covariates and treatment images using separate networks. These representations are then concatenated to form a joint embedding that captures the interaction between user characteristics and treatment features. Second, individual treatment head networks are employed to estimate potential outcomes for each treatment category. This allows the model to learn treatment-specific effects, capturing the heterogeneity of responses to different image treatments. Finally, a combined loss function, incorporating both regression loss and a treatment regularization loss based on Maximum Mean Discrepancy (MMD), is used to optimize the model. The MMD loss is crucial for mitigating confounding bias, a major challenge in observational studies, especially when dealing with complex treatments like images. The ITE is then estimated as the difference between the predicted potential outcomes for different treatments: τ<sub>a,b</sub>(x<sub>i</sub>) = E[y<sup>a</sup>|x=x<sub>i</sub>] - E[y<sup>b</sup>|x=x<sub>i</sub>], where τ<sub>a,b</sub>(x<sub>i</sub>) represents the ITE of treatment a with respect to treatment b for user i with covariates x<sub>i</sub>.
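A compact PyTorch-style sketch of this three-step idea is given below; it is our simplification, not the authors' implementation: precomputed image embeddings stand in for an image encoder, and the MMD term is shown as a generic RBF-kernel penalty between embeddings of differently treated units.

```python
import torch
import torch.nn as nn

class NICESketch(nn.Module):
    """Sketch of the NICE idea: covariate and image-treatment encoders feed a
    joint embedding, and per-treatment heads predict potential outcomes."""

    def __init__(self, cov_dim: int, img_emb_dim: int, n_treatments: int, hidden: int = 64):
        super().__init__()
        self.cov_net = nn.Sequential(nn.Linear(cov_dim, hidden), nn.ReLU())
        self.img_net = nn.Sequential(nn.Linear(img_emb_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
             for _ in range(n_treatments)]
        )

    def forward(self, x_cov, x_img, treatment_idx):
        # Joint embedding of user covariates and the image treatment.
        joint = torch.cat([self.cov_net(x_cov), self.img_net(x_img)], dim=-1)
        # Predict all potential outcomes, then pick the head of the assigned treatment.
        all_outcomes = torch.stack([head(joint) for head in self.heads], dim=1).squeeze(-1)
        return all_outcomes.gather(1, treatment_idx.unsqueeze(1)).squeeze(1)

def rbf_mmd(a: torch.Tensor, b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Squared MMD with an RBF kernel between two batches of embeddings."""
    def k(u, v):
        return torch.exp(-torch.cdist(u, v) ** 2 / (2 * sigma ** 2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()
```

In training, the factual-outcome regression loss would be combined with the MMD penalty between embeddings of differently treated units, weighted by a regularization coefficient.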
Due to the scarcity of datasets with image treatments, the researchers developed a novel semi-synthetic data simulation framework. This framework uses movie posters from the PosterLens dataset as treatments and generates synthetic potential outcomes based on user covariates and treatment embeddings. This allows for controlled experiments and evaluation of NICE's performance under various conditions. The performance is evaluated using the rooted Precision in Estimation of Heterogeneous Effect (PEHE) metric: √(1/(k(k-1)) ∑<sub>a=1</sub><sup>k</sup> ∑<sub>b≠a</sub> (1/n) ∑<sub>i=1</sub><sup>n</sup> (τ̂<sub>a,b</sub>(x<sub>i</sub>) - τ<sub>a,b</sub>(x<sub>i</sub>))<sup>2</sup>), where τ̂<sub>a,b</sub>(x<sub>i</sub>) is the estimated ITE of treatment a with respect to treatment b for user i. This metric captures the average error in estimating the heterogeneous treatment effects across different treatment pairs.
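The rooted PEHE above can be computed directly once estimated and true pairwise ITEs are available; a small numpy helper (ours, with a hypothetical array layout) is:

```python
import numpy as np

def rooted_pehe(tau_hat: np.ndarray, tau_true: np.ndarray) -> float:
    """Rooted PEHE over all ordered treatment pairs.

    tau_hat, tau_true: arrays of shape (k, k, n) where entry [a, b, i] holds the
    estimated / true ITE of treatment a vs. b for unit i (the diagonal is ignored).
    """
    k, _, n = tau_true.shape
    off_diag = ~np.eye(k, dtype=bool)
    sq_err = ((tau_hat - tau_true) ** 2)[off_diag]   # shape (k*(k-1), n)
    return float(np.sqrt(sq_err.mean(axis=1).sum() / (k * (k - 1))))
```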
Experiments were conducted across various scenarios, including different numbers of treatments (4, 8, and 16) and varying levels of treatment assignment bias. The results demonstrate that NICE significantly outperforms baseline methods, including adaptations of existing algorithms that incorporate treatment attributes. Moreover, NICE exhibits robust performance in zero-shot settings, accurately estimating ITEs for treatments unseen during training. Quantitatively, NICE achieved a substantial reduction in the PEHE metric compared to baselines, with improvements ranging from approximately 10% to 40% across different experimental settings. These results highlight the effectiveness of NICE in leveraging the rich information embedded in image treatments for improved ITE estimation.
Probabilistic digital twins for geotechnical design and construction by Dafydd Cotoarbă, Daniel Straub, Ian FC Smith https://arxiv.org/abs/2412.09432
Caption: This diagram illustrates the Probabilistic Digital Twin (PDT) framework for geotechnical design, showcasing the Bayesian updating process that integrates data (Z<sub>t</sub>) to refine the digital state (d<sub>t</sub>), representing beliefs about the physical state (X<sub>t</sub>). The framework incorporates a probabilistic settlement model, subsoil model, and cost model to inform decision optimization (a<sub>t</sub>) for minimizing total cost (C<sub>tot</sub>) while achieving desired quantities of interest like settlement and overconsolidation ratio.
Traditional digital twins (DTs) in the Architecture, Engineering, Construction, Operations, and Management (AECOM) sector often rely on deterministic models, failing to account for the inherent uncertainties that are particularly prevalent in geotechnical engineering. This paper introduces a Probabilistic Digital Twin (PDT) framework specifically tailored for geotechnical design and construction. This framework addresses the limitations of traditional DTs by explicitly incorporating and propagating uncertainties throughout the modeling process. The PDT distinguishes between two key data types: property data (direct measurements of physical attributes like shear strength) and behavior data (time-dependent observations like settlement). This distinction allows for a more scalable and nuanced framework that integrates data across the entire project lifecycle.
The PDT framework utilizes Bayesian methods for model updating, ensuring that the model accurately reflects site-specific conditions as new information becomes available. The framework is formally represented using an influence diagram, which clearly depicts the conditional dependencies between the physical state, data, quantities of interest, decisions, and rewards. The digital state, which represents the current belief about the physical state, evolves dynamically through a Bayesian update equation: d<sub>t</sub> ∝ Σ<sub>X<sub>t-1</sub></sub> Σ<sub>Q<sub>t</sub></sub> d<sub>t-1</sub> × p(X<sub>t</sub>|X<sub>t-1</sub>, U<sub>t-1</sub> = u<sub>t-1</sub>) × p(Z<sub>t</sub> = z<sub>t</sub>|X<sub>t</sub>, Q<sub>t</sub>, U<sub>t-1</sub> = u<sub>t-1</sub>) × p(Q<sub>t</sub>|X<sub>t</sub>), where d<sub>t</sub> is the digital state at time t, X<sub>t</sub> is the physical state, U<sub>t</sub> are decisions, Z<sub>t</sub> is data (property and behavior), and Q<sub>t</sub> are quantities of interest. Due to the complexity of direct numerical integration, a particle filter approach is employed to approximate the posterior distribution of the state.
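For intuition, a generic bootstrap particle-filter step of the kind the framework relies on might look as follows; this is a standard textbook sketch with hypothetical function arguments, not the paper's implementation.

```python
import numpy as np

def particle_filter_step(particles, weights, transition, likelihood, rng):
    """One bootstrap particle-filter update of the digital state d_t.

    particles : (N, dim) samples approximating the belief over X_{t-1}
    weights   : (N,) normalized importance weights
    transition: callable(particles, rng) -> particles propagated through
                p(X_t | X_{t-1}, u_{t-1})
    likelihood: callable(particles) -> p(z_t | X_t) evaluated at the observed data
    """
    # Propagate each particle through the (stochastic) state transition model.
    particles = transition(particles, rng)
    # Re-weight by the likelihood of the newly observed behavior/property data.
    weights = weights * likelihood(particles)
    weights /= weights.sum()
    # Resample when the effective sample size collapses.
    ess = 1.0 / np.sum(weights ** 2)
    if ess < 0.5 * len(weights):
        idx = rng.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights
```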
The practical effectiveness of the PDT framework is demonstrated through its application to a highway foundation construction project on clayey soil. The objective is to determine the most cost-efficient surcharge and prefabricated vertical drain (PVD) design that ensures consolidation within a specified timeframe. A probabilistic model is used to predict settlement and overconsolidation ratio over time, incorporating uncertainties in soil properties. Behavioral data, collected as weekly settlement measurements, is used to update the model via the Bayesian approach. A heuristic strategy optimization, based on cross-entropy optimization, is employed to identify the optimal surcharge adjustments over time.
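The cross-entropy optimization used for the heuristic strategy can likewise be sketched generically; the parameterization of the surcharge heuristic is left abstract here and the interface is hypothetical.

```python
import numpy as np

def cross_entropy_optimize(cost_fn, dim, n_iter=30, pop=200, elite_frac=0.1, seed=0):
    """Generic cross-entropy method: repeatedly sample heuristic parameters
    (e.g., surcharge-adjustment thresholds), keep the elite fraction with the
    lowest expected cost, and refit the Gaussian sampling distribution."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(elite_frac * pop))
    for _ in range(n_iter):
        samples = rng.normal(mu, sigma, size=(pop, dim))
        costs = np.array([cost_fn(s) for s in samples])
        elite = samples[np.argsort(costs)[:n_elite]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu
```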
The results show that the PDT-based heuristic significantly outperforms existing heuristics, even under substantial measurement uncertainties (standard deviation σ<sub>ε</sub> = 0.05m-0.15m). Compared to previous work, the PDT achieves a 6-13% reduction in expected total costs and up to a 20% reduction compared to the state-of-the-art, even in the worst-case uncertainty scenario (σ<sub>ε</sub> = 0.15m). Furthermore, the PDT reduces the standard deviation of the cost by up to 40%, indicating a significant reduction in the variability of the final expected cost. This case study highlights the potential of the PDT framework to improve decision-making and project outcomes in geotechnical engineering by explicitly accounting for and managing uncertainties.
This newsletter highlights a convergence of themes around robust inference, uncertainty quantification, and the practical application of advanced statistical methods in diverse fields. The emphasis on usability in differentially private systems, as explored by Panavas et al., reflects a growing recognition of the need to make powerful privacy-preserving tools accessible to a broader audience. The work by Jin et al. on the predictive role of covariate shift offers a novel perspective on generalization, moving beyond traditional assumptions and providing more accurate and efficient uncertainty quantification. The development of NICE by Thorat et al. extends the scope of causal inference to image treatments, opening up exciting possibilities for personalized interventions in various domains. Finally, the probabilistic digital twin framework proposed by Cotoarbă et al. demonstrates the tangible benefits of incorporating uncertainty into real-world engineering applications, leading to more robust and cost-effective designs. Together, these contributions represent significant advancements in statistical methodology and their practical application, paving the way for more informed decision-making in the face of complex and uncertain data.