This newsletter explores a collection of papers showcasing methodological advancements and applications in statistical modeling and data analysis. Several papers focus on causal inference and prediction. Jiang et al. (2024) introduce a framework for longitudinal causal inference with selective eligibility, addressing dropout that arises when a unit's eligibility for subsequent treatment depends on its prior treatment history. Their proposed time-specific eligible treatment effect (ETE) and expected number of outcome events (EOE) offer a nuanced approach to analyzing treatment effects in dynamic settings. Yu and Qian (2024) develop a semiparametric causal excursion effect model for analyzing time-varying effects of mobile health interventions using longitudinal functional data, incorporating double time indices and allowing for effect moderation by contextual variables. For prediction, Schroeder et al. (2024) explore Prediction Rule Ensembles (PREs) with multiple imputation for handling missing data, examining the trade-offs between predictive performance and model complexity.
The development of novel statistical models for specific applications is another key theme. Rios and Xu (2024) propose a Bayesian D-optimal experimental design for optimizing last-mile delivery, tackling the challenge of random travel costs. Bouhou et al. (2024) use a Partially Observable Markov Decision Process (POMDP) for target detection and tracking in cognitive massive MIMO radar. Herrera-Martin et al. (2024) introduce weighted logistic regression for classifying rare events, applied to repeating fast radio bursts. Funk et al. (2024) present a multivariate bias correction method based on zero-inflated vine copulas for climate model outputs. Dzadz and Romaniuk (2024) explore resampling and GAN methods for improving catastrophic insurance data.
Spatial and network data analysis also feature prominently. MacDonald et al. (2024) introduce mesoscale two-sample testing for network data. Eckardt et al. (2024) develop second-order characteristics for spatial point processes with graph-valued marks. Blasi and Furrer (2024) propose modular covariate-based covariance functions for nonstationary spatial modeling. Arbia and Nardelli (2024) investigate the Local Influence Function (LIF) for detecting spatial outliers. Tiwari et al. (2024) study the influence of social networks on opioid overdose deaths.
Finally, privacy and data quality are addressed. Cho and Awan (2024) introduce Semi-DP, extending differential privacy. Skøien et al. (2024) develop a Quadtree-based approach for statistical disclosure control in geospatial data. Sen and Lahiri (2024) propose a composite estimator integrating probability and nonprobability surveys. Numerous other papers present specialized applications and methodological extensions, highlighting the breadth and depth of current research in statistical modeling and data analysis.
Longitudinal Causal Inference with Selective Eligibility by Zhichao Jiang, Eli Ben-Michael, D. James Greiner, Ryan Halen, Kosuke Imai https://arxiv.org/abs/2410.17864
Caption: Estimated Average Treatment Effects of PSA Intervention
Dropout in longitudinal studies poses a significant threat to the validity of causal inferences. While previous research has largely focused on dropout that renders outcomes missing, this paper addresses a critical yet often overlooked mechanism: selective eligibility, which occurs when a unit's eligibility for subsequent treatments depends on its prior treatment history. This is distinct from "truncation by death," since under selective eligibility dropout occurs after the outcome is observed but before the next treatment, rendering standard approaches to dropout inapplicable.
This paper proposes a comprehensive methodological framework for longitudinal causal inference in the presence of selective eligibility. The authors introduce two novel causal estimands: the time-specific eligible treatment effect (ETE) and the expected number of outcome events (EOE). The ETE, defined as τ<sub>t</sub>(z<sub>t-1</sub>) := E{Y<sub>it</sub>(z<sub>t-1</sub>, 1) - Y<sub>it</sub>(z<sub>t-1</sub>, 0) | S<sub>it</sub>(z<sub>t-1</sub>) = 1}, captures the average treatment effect at time t among units still eligible for treatment, given a specific prior treatment history z<sub>t-1</sub>. The EOE, θ(z<sub>T</sub>) := Σ<sup>T</sup><sub>t=1</sub> E{Y<sub>it</sub>(z<sub>t</sub>)S<sub>it</sub>(z<sub>t-1</sub>)}, represents the expected total number of outcome events under a given treatment sequence z<sub>T</sub>. The framework accommodates both deterministic and stochastic interventions, broadening its applicability to a wide range of longitudinal settings.
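To make these estimands concrete, here is a minimal oracle simulation (entirely hypothetical data of our own construction, with eligibility at each step requiring an outcome event at the previous step, loosely mirroring the paper's application); it evaluates the ETE and EOE directly from simulated potential outcomes rather than estimating them:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 100_000, 3

# Oracle potential outcomes under the all-treated sequence z = (1, 1, 1):
# S[:, t] is eligibility at time t; Y1/Y0 are outcomes under treatment/control.
S = np.ones((n, T), dtype=bool)   # everyone is eligible at the first time point
Y1 = np.zeros((n, T))
Y0 = np.zeros((n, T))
for t in range(T):
    Y1[:, t] = rng.binomial(1, 0.30, n) * S[:, t]
    Y0[:, t] = rng.binomial(1, 0.40, n) * S[:, t]
    if t + 1 < T:
        # Eligibility at t+1 requires an outcome event (e.g., a new arrest) at t.
        S[:, t + 1] = S[:, t] & (Y1[:, t] == 1)

# Time-specific eligible treatment effect: contrast among the units eligible at t.
ete = [(Y1[S[:, t], t] - Y0[S[:, t], t]).mean() for t in range(T)]

# Expected number of outcome events under the all-treated sequence.
eoe = (Y1 * S).sum(axis=1).mean()
print(ete, eoe)
```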
Under a generalized version of sequential ignorability, the authors derive two nonparametric identification formulas, one based on outcome regression and the other on inverse probability weighting. To improve estimation efficiency, they derive the efficient influence function (EIF) for each estimand, leading to doubly robust estimators: these remain consistent as long as either the propensity score model or the outcome regression and eligibility models are correctly specified, extending the classic doubly robust estimator to longitudinal studies with selective eligibility.
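For intuition about the double robustness, the sketch below implements the classic single-period AIPW estimator that the paper's longitudinal estimators generalize; the scikit-learn model choices are arbitrary stand-ins, not the paper's implementation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_ate(X, A, Y):
    """Classic cross-sectional AIPW estimator of E[Y(1) - Y(0)].

    Consistent if either the propensity model or the outcome models are
    correctly specified; the paper extends this double robustness to
    longitudinal settings with selective eligibility via the EIF.
    """
    ps = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]     # propensity model
    mu1 = LinearRegression().fit(X[A == 1], Y[A == 1]).predict(X)  # outcome model, treated
    mu0 = LinearRegression().fit(X[A == 0], Y[A == 0]).predict(X)  # outcome model, control
    phi = mu1 - mu0 + A * (Y - mu1) / ps - (1 - A) * (Y - mu0) / (1 - ps)
    return phi.mean()
```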
The practical utility of this framework is demonstrated through an application to a randomized controlled trial evaluating a pre-trial risk assessment instrument (PSA) in the criminal justice system. In this context, selective eligibility arises due to recidivism, as arrestees are only eligible for the PSA intervention upon rearrest. The analysis examines the PSA's influence on judicial decisions and subsequent negative outcomes (failure to appear, new criminal activity, and new violent criminal activity) for up to three arrests.
The findings reveal that providing the PSA generally increases agreement between judicial decisions and PSA recommendations, with statistically significant effects observed for the first two arrests. For the third arrest, the effect is significant only when the PSA was provided for the previous two arrests. Importantly, the analysis indicates minimal impact of the PSA on subsequent negative outcomes, aligning with prior analyses focused solely on first arrests.
Formal Privacy Guarantees with Invariant Statistics by Young Hyun Cho, Jordan Awan https://arxiv.org/abs/2410.17468
Caption: Comparison of L2 costs between Semi-DP and naive mechanisms for two probability models.
While differential privacy (DP) offers robust privacy protection for released query outputs, it faces challenges when certain statistics, known as invariants, are also publicly available. These invariants can leak information about the underlying data, potentially compromising individual privacy. Motivated by the 2020 US Census, which released both DP outputs and true statistics, this paper introduces Semi-Differential Privacy (Semi-DP), a novel framework that addresses the limitations of traditional DP in the presence of invariants.
Semi-DP refines the notion of adjacency in DP by restricting the scope to invariant-conforming databases, i.e., databases sharing the same invariant value as the confidential data. Within this restricted space, Semi-DP defines adjacency using a semi-adjacent parameter, a(t), which quantifies the worst-case impact of replacing an individual's data while maintaining the invariant. Formally, a(t) = sup<sub>i∈[n]</sub> sup<sub>x,y∈D<sub>i</sub></sub> inf{d(X,Y) : X,Y ∈ D<sub>t</sub>, X<sub>i</sub> = x, Y<sub>i</sub> = y}, where D<sub>t</sub> is the set of invariant-conforming databases, D<sub>i</sub> is the domain of individual i's record, and d(X,Y) is an adjacency metric. Calibrating noise to a(t) ensures that even the worst-case data substitutions remain indistinguishable to an adversary.
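As a toy illustration of the definition (our own construction, not an example from the paper), consider records taking values in {0, 1, 2}, a fixed-sum invariant, and the Hamming distance as the adjacency metric; a(t) can then be computed by brute force:

```python
from itertools import product

def semi_adjacent(n=4, domain=(0, 1, 2), total=4):
    """Brute-force a(t) for a toy sum invariant: databases X in domain^n
    with sum(X) == total, using Hamming distance as the adjacency metric."""
    conforming = [X for X in product(domain, repeat=n) if sum(X) == total]
    hamming = lambda X, Y: sum(a != b for a, b in zip(X, Y))
    a_t = 0
    for i in range(n):
        for x in domain:
            for y in domain:
                Xs = [X for X in conforming if X[i] == x]
                Ys = [Y for Y in conforming if Y[i] == y]
                if Xs and Ys:  # both substitutions are feasible under the invariant
                    a_t = max(a_t, min(hamming(X, Y) for X in Xs for Y in Ys))
    return a_t

print(semi_adjacent())  # 2: changing one record forces a compensating change elsewhere
```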
The authors develop specialized mechanisms satisfying Semi-DP, including adaptations of the Gaussian mechanism and the optimal K-norm mechanism for rank-deficient sensitivity spaces. The optimal K-norm mechanism leverages the convex hull of the sensitivity space within the subspace spanned by the sensitivity space, minimizing noise addition while preserving privacy. The application of Semi-DP to contingency table analysis, relevant to the US Census, demonstrates how to release private outputs while preserving true marginal counts.
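The following sketch illustrates the underlying geometric idea under simplifying assumptions of our own (it is not the paper's mechanism, and calibrating the noise scale to the semi-adjacent sensitivity is omitted): Gaussian noise is projected onto the null space of the marginal constraints, so the released contingency table retains the true margins.

```python
import numpy as np

rng = np.random.default_rng(1)

def marginal_preserving_gaussian(table, sigma):
    """Add Gaussian noise only within the null space of the marginal
    constraints, so the released table keeps the true row/column sums.
    Schematic only: choosing sigma to match the semi-adjacent sensitivity
    is the paper's contribution and is omitted here."""
    r, c = table.shape
    # Each row of A reads off one row sum or one column sum of the flattened table.
    A = np.zeros((r + c, r * c))
    for i in range(r):
        A[i, i * c:(i + 1) * c] = 1.0
    for j in range(c):
        A[r + j, j::c] = 1.0
    P = np.eye(r * c) - np.linalg.pinv(A) @ A  # projector onto null(A)
    noise = P @ rng.normal(0.0, sigma, size=r * c)
    return table + noise.reshape(r, c)

table = np.array([[10.0, 20.0], [30.0, 40.0]])
noisy = marginal_preserving_gaussian(table, sigma=1.0)
print(noisy.sum(axis=0), noisy.sum(axis=1))  # matches the true marginals
```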
Numerical experiments showcase the superior performance of the Semi-DP Gaussian mechanism over naive approaches, consistently achieving lower L2 costs across various contingency table sizes and probability models. Similarly, the optimal K-norm mechanism under Semi-DP significantly reduces L2 costs compared to naive l<sub>1</sub>, l<sub>2</sub>, and l<sub>∞</sub>-norm mechanisms.
A crucial contribution of this work is the privacy analysis of the 2020 US Decennial Census using the Semi-DP framework. This analysis reveals that the effective privacy guarantees are weaker than advertised because the reported privacy parameters do not account for the released invariants. For instance, the Census mechanism satisfies (D<sub>t</sub>, A<sub>2</sub>, 10.24)-zCDP for state population totals, contrasting with the reported (D, A<sub>1</sub>, 2.56)-zCDP. Converting to (ε, δ)-DP with δ = 10<sup>-10</sup> yields an actual guarantee of ε = 40.95057, significantly higher than the advertised ε = 17.91528.
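For reference, the reported figures match the standard conversion from ρ-zCDP to (ε, δ)-DP, ε = ρ + 2√(ρ log(1/δ)):

```python
import math

def zcdp_to_dp_epsilon(rho, delta):
    """rho-zCDP implies (rho + 2*sqrt(rho*log(1/delta)), delta)-DP."""
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

delta = 1e-10
print(zcdp_to_dp_epsilon(10.24, delta))  # ~40.95, the effective guarantee
print(zcdp_to_dp_epsilon(2.56, delta))   # ~17.92, the advertised guarantee
```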
The paper also acknowledges limitations of Semi-DP, particularly regarding individual-level privacy: the restricted adjacency notion can weaken protection relative to traditional DP, especially when an adversary holds side information. Future research should focus on mitigating these vulnerabilities, potentially by refining adjacency definitions or by developing a deeper understanding of invariant structures.
Saddlepoint Monte Carlo and its Application to Exact Ecological Inference by Théo Voldoire, Nicolas Chopin, Guillaume Rateau, Robin J. Ryder https://arxiv.org/abs/2410.18243
Caption: Comparison of Relative Standard Error in SPMC with and without Tilting
Ecological inference (EI) often involves analyzing aggregate data while individual-level data remains hidden. Traditional EI methods often rely on approximations due to the computational complexity of exact inference, especially with large datasets and complex models. This paper introduces saddlepoint Monte Carlo (SPMC), a novel method for obtaining unbiased, low-variance estimates of marginal likelihoods in such scenarios. The method utilizes importance sampling of the characteristic function, drawing insights from the saddlepoint approximation with exponential tilting, and is particularly well-suited for models belonging to the exponential family.
The core of SPMC lies in the inversion formula f<sub>AX</sub>(y) = (2π)<sup>-d<sub>y</sub></sup>∫<sub>[-π,π]<sup>d<sub>y</sub></sup></sub> exp(-iz<sup>T</sup>y)φ<sub>X</sub>(A<sup>T</sup>z)dz, where f<sub>AX</sub>(y) is the density of the observed aggregate data Y = AX, φ<sub>X</sub> is the characteristic function of the unobserved individual data X, and A is the aggregation matrix. SPMC approximates this integral by importance sampling, employing either a uniform or a Gaussian proposal based on the moments of X. To further reduce variance, exponential tilting is incorporated, selecting a tilting parameter v such that A∇κ<sub>X</sub>(A<sup>T</sup>v) = y, where κ<sub>X</sub> is the cumulant generating function of X. The tilting shifts the distribution of X so that y no longer lies in the tail of the distribution of AX.
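A one-dimensional toy version (our own example: X a vector of independent Bernoullis, A a row of ones, a uniform proposal, and no tilting) shows how the inversion integral reduces to a plain Monte Carlo average; tilting becomes essential when y lies in the tail of AX:

```python
import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.1, 0.3, 0.5, 0.7, 0.2])  # X_j ~ Bernoulli(p_j), A = (1, ..., 1)
y = 2                                     # observed aggregate: sum(X) = y

# With z ~ Uniform[-pi, pi], the inversion integral is a plain expectation:
# f(y) = E_z[ exp(-i z y) * prod_j (1 - p_j + p_j * exp(i z)) ].
z = rng.uniform(-np.pi, np.pi, size=200_000)
phi = np.prod(1 - p[:, None] + p[:, None] * np.exp(1j * z), axis=0)
estimate = np.mean(np.exp(-1j * z * y) * phi).real

# Exact Poisson-binomial pmf by convolution, for comparison.
pmf = np.array([1.0])
for pj in p:
    pmf = np.convolve(pmf, [1 - pj, pj])
print(estimate, pmf[y])
```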
The power of SPMC is demonstrated through its application to French election data, a classic EI problem. Analyzing the 2007 presidential election, the authors show that approximating the multinomial distribution of votes with a Gaussian, a common practice in some EI studies, leads to substantial bias. SPMC, in contrast, allows for exact inference, revealing, for instance, significant discrepancies between estimated abstention rates and those reported in exit polls (80% vs. 64%). Further analysis of the 2022 presidential election, incorporating population density as a covariate, demonstrates SPMC's ability to handle complex models. The results indicate a positive correlation between population density and the probability of voting for Macron (center) after initially supporting Mélenchon (left/far-left). Finally, the analysis of the 2024 legislative election data showcases SPMC's scalability, efficiently handling constituencies with varying numbers of candidates.
SPMC represents a significant advance in EI, enabling exact inference where previous methods relied on approximations. Its low variance and scalability make it suitable for large datasets and complex models, opening new research avenues in electoral sociology and other fields dealing with aggregate data. Beyond EI, potential applications include data privacy and inverse problems, highlighting SPMC's broad utility in computational statistics.
Towards more realistic climate model outputs: A multivariate bias correction based on zero-inflated vine copulas by Henri Funk, Ralf Ludwig, Helmut Kuechenhoff, Thomas Nagler https://arxiv.org/abs/2410.15931
Caption: The image illustrates the three-step process of Vine Copula Bias Correction (VBC) for climate model data. First, vine copula modeling estimates dependencies, accounting for zero-inflated margins. Then, a Rosenblatt transformation corrects the model data to a uniform domain, considering discrete-continuous mixtures. Finally, delta mapping projects the corrected data to the reference distribution using multiplicative and additive projections.
Climate models, essential for understanding climate variability and projecting future scenarios, are prone to biases arising from incomplete representations of physical processes. These biases can significantly distort the multivariate distributional shape of climate variables, impacting downstream applications such as hydrological modeling and extreme event analysis. Existing methods, such as univariate bias correction via quantile mapping (UBC) and multivariate bias correction (MBCn), struggle to address the zero-inflation often present in high-resolution climate data, especially for variables like precipitation and radiation. This motivates a more sophisticated approach that accurately captures the complexities of zero-inflated climate data.
This paper introduces Vine Copula Bias Correction for partially zero-inflated margins (VBC), a novel multivariate bias correction methodology. This technique leverages the flexibility of vine copulas, renowned for their ability to model high-dimensional, nonlinear dependencies, and extends their application to accommodate zero-inflated variables. A key theoretical contribution of VBC is a generalized vine density decomposition formula, extending the work of Bedford and Cooke (2001, 2002), which allows for the inclusion of variables with mixed discrete and continuous components. VBC corrects model data by transforming it to a uniform domain using a modified Rosenblatt transformation that accounts for zero-inflation, and then projects it to the reference distribution using a modified delta mapping procedure.
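The sketch below illustrates only the zero-inflation handling on a single margin (hypothetical data; the actual VBC operates multivariately through the vine copula and its Rosenblatt transform): zeros are spread uniformly over the zero mass [0, F(0)], a randomized distributional transform, before mapping into the reference margin.

```python
import numpy as np

rng = np.random.default_rng(3)

def zi_pit(x, ref):
    """Probability integral transform for a zero-inflated margin: zeros map
    to a uniform draw on [0, F(0)]; positive values go through the empirical
    cdf of the positive part of the reference sample."""
    p0 = np.mean(ref == 0)
    u = np.empty_like(x, dtype=float)
    zero = x == 0
    u[zero] = rng.uniform(0.0, p0, size=zero.sum())
    pos_ref = np.sort(ref[ref > 0])
    ranks = np.searchsorted(pos_ref, x[~zero], side="right") / (len(pos_ref) + 1)
    u[~zero] = p0 + (1 - p0) * ranks
    return u

def zi_quantile(u, ref):
    """Inverse transform into a (possibly zero-inflated) reference margin."""
    p0 = np.mean(ref == 0)
    x = np.zeros_like(u)          # u <= p0 lands on the zero mass
    pos = u > p0
    pos_ref = np.sort(ref[ref > 0])
    x[pos] = np.quantile(pos_ref, (u[pos] - p0) / (1 - p0))
    return x

# Toy single-margin correction: model precipitation -> reference distribution.
model = np.where(rng.random(5000) < 0.6, 0.0, rng.gamma(2.0, 2.0, 5000))
refer = np.where(rng.random(5000) < 0.4, 0.0, rng.gamma(2.0, 3.0, 5000))
corrected = zi_quantile(zi_pit(model, model), refer)
print(np.mean(corrected == 0))  # ~0.4, matching the reference zero mass
```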
The performance of VBC was evaluated against UBC and MBCn in a real-world application covering five climate variables from the CRCM5-LE dataset over three Bavarian catchments. Results demonstrate VBC's consistent superiority in terms of distributional similarity to the reference data, measured by the second-order Wasserstein distance (W<sub>2</sub>). VBC improved similarity in at least 93% of the corrected datasets across all catchments and times of day, outperforming UBC and MBCn, especially during nighttime when zero-inflation is most prominent. Importantly, VBC maintained consistent correction quality across different altitudes, demonstrating its robustness to varying climatic conditions.
The authors also introduce the Model Correction Inconsistency (MCI) metric, which assesses how well the correction preserves the weather patterns present in the model data. The MCI measures the absolute difference in non-exceedance probabilities between the model data and its bias-corrected counterpart. Results show that while UBC is, by design, the least invasive method, VBC outperforms MBCn in preserving the temporal course of the weather, exhibiting lower average MCI values and less seasonal variation in inconsistency. This suggests that VBC corrects biases while retaining the model's inherent temporal dynamics.
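One plausible reading of the MCI (our interpretation for illustration, not necessarily the paper's exact formula) compares each value's non-exceedance probability before and after correction; a margin-wise monotone method such as UBC preserves ranks and therefore scores near zero:

```python
import numpy as np

def mci(model, corrected):
    """Mean absolute difference between each time point's non-exceedance
    probability under the model's empirical cdf and under the corrected
    series' empirical cdf; small values mean the course is preserved."""
    F = lambda s, x: np.searchsorted(np.sort(s), x, side="right") / len(s)
    return float(np.mean(np.abs(F(model, model) - F(corrected, corrected))))
```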
This newsletter has highlighted significant advances in statistical methodology and their application across diverse domains. From tackling selective eligibility in longitudinal causal inference to developing novel privacy-preserving techniques and enhancing the realism of climate model outputs, the papers discussed showcase the power of innovative statistical thinking. The introduction of the ETE and EOE estimands by Jiang et al. offers a robust framework for analyzing complex longitudinal studies, while the Semi-DP framework by Cho and Awan addresses critical limitations of traditional differential privacy in the presence of invariant statistics. The development of SPMC by Voldoire et al. enables exact inference in challenging ecological inference problems, and the VBC method by Funk et al. provides a powerful tool for correcting biases in high-resolution climate data. These contributions collectively push the boundaries of statistical modeling and data analysis, offering valuable insights and tools for researchers across various fields.