Subject: Statistical Modeling and Machine Learning Advancements
This collection of preprints explores diverse applications of statistical modeling and machine learning across a broad spectrum of fields, from public health and urban planning to financial risk assessment and fundamental scientific research. A key focus area is the development of innovative methods for handling complex data structures and addressing inherent biases. For example, Dixit, Holan, and Wikle (2024) investigate asymmetric loss functions within conditionally autoregressive (CAR) models for spatial real estate prediction, emphasizing the significant impact of loss function choice on area-level predictions. Similarly, Nascimento et al. (2024) tackle the challenges of non-additive, non-normally distributed noise in speckled data (common in synthetic aperture radar (SAR) images) by proposing a novel regression model. Meanwhile, O'Neil and Tymochko (2024) bring a fresh perspective to urban planning and resource allocation during extreme heat events by leveraging persistent homology, a tool from topological data analysis, to evaluate cooling center coverage.
The application of advanced statistical methods to real-world problems forms another prominent theme. Lenti et al. (2024) employ Bayesian Networks and Stochastic Variational Inference to model the causal pathways of climate activism on Reddit, providing valuable insights into the factors influencing online participation. In the realm of harbor operations, Yu et al. (2024) develop a geometric model with stochastic error for detecting abnormal motion in portal crane bucket grabs, demonstrating the potential of computer vision combined with statistical modeling for improving efficiency and safety. Brigatto et al. (2024) analyze natural inflow forecasts in Brazil, uncovering an optimistic bias with significant implications for water resource management and energy planning. Crucially, Su, Su, and Wang (2024) assess the privacy guarantees of the 2020 US Census, revealing unused privacy budget and suggesting potential improvements in data accuracy without compromising privacy.
The development of novel machine learning algorithms and frameworks also features prominently. Yuan et al. (2024) introduce GENEREL, a framework unifying genomic and biomedical concepts through multi-task, multi-source contrastive learning, enabling more effective integration of diverse data sources. Yang et al. (2024) present fastHDMI, a Python package designed for efficient variable screening in high-dimensional neuroimaging data, leveraging mutual information estimation. Okhrati (2024) introduces a new class of adaptive non-linear autoregressive (Nlar) models for dynamically estimating learning rates and momentum in optimization algorithms. Zhu and Li (2024) propose hierarchical latent class models for mortality surveillance using partially verified verbal autopsies, addressing the challenges in understanding the impact of emerging diseases.
Further applications of statistical learning span diverse domains. Groll et al. (2024) demonstrate the potential of ensemble methods in sports analytics by combining multiple statistical learning approaches to predict the UEFA EURO 2024 outcome. Ma, Zhao, and Kang (2024) develop a framework combining Q-learning with microsimulation for optimizing COVID-19 booster vaccine policies. Cao, Knight, and Nason (2024) introduce a multiscale method for analyzing data collected from network edges, with applications to hydrology and river network analysis. McClean et al. (2024) propose a fairness criterion for comparing causal parameters with many treatments and positivity violations, addressing key challenges in causal inference. Finally, a series of papers explore applications in specific domains such as latent image resolution prediction (Kansabanik and Barbu 2024), analysis of scientific contributions (Chen et al. 2024), and many more, showcasing the breadth and depth of current research in statistical modeling and machine learning.
The 2020 United States Decennial Census Is More Private Than You (Might) Think by Buxin Su, Weijie J. Su, Chendi Wang https://arxiv.org/abs/2410.09296
Caption: This bar chart compares the Bureau's published privacy loss parameter (ε) for the 2020 Census with a tighter, f-DP-derived ε across eight geographical levels. The analysis reveals substantial unused privacy budget: the injected noise already satisfies a smaller ε at the same δ, or equivalently, the noise could be reduced while preserving the published privacy guarantee, leading to improved data accuracy.
The 2020 US Census, a cornerstone of policy-making, employed differential privacy (DP) for the first time to protect respondent confidentiality. This groundbreaking move introduced the TopDown algorithm and discrete Gaussian noise injection into census tabulations. However, the Bureau acknowledged the possibility of overestimation in their privacy loss calculations, raising the question of whether tighter privacy guarantees, and thus improved accuracy, could be achieved. This paper addresses this open question using the f-DP framework.
The Bureau's reliance on zCDP and continuous distribution approximations potentially overstated the privacy loss. This paper argues for a more precise accounting using f-DP, directly addressing the discrete nature of the injected noise. By meticulously tracking privacy losses across the eight geographical levels of the Census (each involving ten queries), the authors reveal a significant, untapped privacy budget—between 8.50% and 13.76% for each level. This implies the possibility of reducing the privacy parameter ε while maintaining δ, leading to stronger privacy guarantees.
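To make the accounting concrete, here is a minimal sketch of f-DP composition in its Gaussian-DP (GDP) special case, using the continuous-Gaussian approximation rather than the paper's exact analysis of discrete Gaussian noise; the per-query sensitivities and noise scales below are illustrative placeholders, not the Bureau's actual parameters.

```python
# Sketch: privacy accounting via Gaussian DP (GDP), a tractable special case of f-DP.
# Assumptions: continuous-Gaussian approximation of the discrete Gaussian noise, and
# illustrative noise scales -- not the exact discrete-Gaussian f-DP analysis of the paper.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def gdp_mu(sensitivities, sigmas):
    """Compose Gaussian mechanisms: each query contributes mu_i = Delta_i / sigma_i."""
    mus = np.asarray(sensitivities) / np.asarray(sigmas)
    return np.sqrt(np.sum(mus ** 2))  # GDP composition rule

def delta_from_eps(eps, mu):
    """(eps, delta) trade-off implied by mu-GDP (Dong, Roth & Su, 2021)."""
    return norm.cdf(-eps / mu + mu / 2) - np.exp(eps) * norm.cdf(-eps / mu - mu / 2)

def tight_eps(mu, delta):
    """Smallest eps such that a mu-GDP mechanism is (eps, delta)-DP."""
    return brentq(lambda e: delta_from_eps(e, mu) - delta, 1e-8, 100.0)

# Hypothetical example: one geographic level with ten unit-sensitivity queries.
sigmas = np.full(10, 35.0)           # illustrative noise scales, not the Bureau's
mu = gdp_mu(np.ones(10), sigmas)
print(tight_eps(mu, delta=1e-10))    # typically tighter than a zCDP-based conversion
```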
This discovery has substantial implications for data accuracy. The authors demonstrate that noise variances can be reduced by 15.08% to 24.82% without compromising privacy. This translates to improved accuracy in downstream applications. Simulations using 2010 Census data from Pennsylvania show a roughly 15% reduction in mean squared error (MSE) after post-processing. An empirical study on the relationship between earnings and education, using ACS 5-year data, further validates the impact of reduced noise, mitigating distortions caused by privacy constraints (e.g., 60.57% reduction in mean absolute error (MAE) for the slope coefficient at the state level).
The f-DP method offers a more accurate and efficient way to account for privacy loss in complex, compositional scenarios like the US Census. While this method shines in homogeneous noise settings, challenges remain in handling heterogeneous noise due to floating-point arithmetic limitations. Future research addressing these limitations could further unlock the potential of f-DP for even more granular privacy budget optimization.
Unified Representation of Genomic and Biomedical Concepts through Multi-Task, Multi-Source Contrastive Learning by Hongyi Yuan, Suqi Liu, Kelly Cho, Katherine Liao, Alexandre Pereira, Tianxi Cai https://arxiv.org/abs/2410.10144
Caption: GENEREL, a novel framework, effectively clusters related diseases (e.g., type 1 diabetes, acquired hypothyroidism) and associated genes/SNPs (e.g., ptpn22, rs4988235_G) within a unified representation space. This visualization demonstrates GENEREL's superior performance compared to PubmedBERT in capturing nuanced biological relationships and differentiating between semantically similar diseases with distinct underlying mechanisms. The tighter clustering in GENEREL reflects its ability to leverage diverse data sources and integrate genomic information with biomedical knowledge.
Integrating diverse biomedical data sources, such as biobanks, genetic databases, and scientific literature, is a significant challenge due to inconsistent phenotypic trait encoding. This hinders data integration and limits advancements in personalized medicine and drug discovery. Existing graph-based methods struggle with limited pairwise relationships and inaccurate code mappings. Biomedical language models, while powerful, often lack integration with underlying biological mechanisms and genetic information like SNPs. GENEREL (GENomic Encoding REpresentation with Language Model) addresses these limitations by bridging the gap between genomic and biomedical knowledge.
GENEREL utilizes language models to generate embeddings for biomedical concepts based on their descriptions. These embeddings are fine-tuned using diverse summary-level data (PrimeKG, UMLS, UK Biobank, GWAS Catalog, eQTL). This approach bypasses traditional code mapping limitations and integrates knowledge from various sources. Learning end-to-end from concept descriptions eliminates the need for anchor concepts, avoiding potential mapping errors. Genomic information is incorporated by embedding SNPs using one-hot encoding and a trainable embedding matrix, creating a unified representation space.
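A minimal sketch of how such a unified representation space might be wired up, assuming a HuggingFace-style pretrained text encoder for concept descriptions and a trainable embedding table for SNPs; all module and variable names are illustrative assumptions, not GENEREL's actual implementation.

```python
# Sketch of the two embedding pathways described above (names are assumptions):
# concepts are embedded from their text descriptions, SNPs from a trainable
# embedding table (equivalent to one-hot encoding times a trainable matrix),
# and both live in one shared representation space.
import torch
import torch.nn as nn

class ConceptSNPEncoder(nn.Module):
    def __init__(self, text_encoder, n_snps, dim=768):
        super().__init__()
        self.text_encoder = text_encoder                 # e.g., a pretrained biomedical LM
        self.snp_embedding = nn.Embedding(n_snps, dim)   # SNP index -> dense vector

    def embed_concept(self, input_ids, attention_mask):
        # Assumes a HuggingFace-style encoder exposing last_hidden_state.
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]               # pooled description embedding

    def embed_snp(self, snp_ids):
        return self.snp_embedding(snp_ids)
```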
The framework employs multi-task learning with three key tasks: (1) learning relatedness from biomedical knowledge graphs, (2) aligning biomedical concepts and SNPs using GWAS, UK Biobank, and eQTL data, and (3) identifying synonyms from UMLS. Each task uses contrastive learning with the InfoNCE loss: L<sub>S</sub> = Σ<sub>(h,t)∈S</sub> w<sub>h,t</sub>L<sub>InfoNCE</sub>(h, t) = −Σ<sub>(h,t)∈S</sub> w<sub>h,t</sub> log (exp(sim(h, t)/τ) / Σ<sub>ħ∈C</sub> exp(sim(ħ, t)/τ)), where S represents the set of positive concept pairs, w<sub>h,t</sub> is the pair's weight reflecting relatedness, sim(h, t) is the similarity function (inner product of embeddings), C is the set of negative samples ħ contrasted against the tail t, and τ is the temperature parameter. GENEREL adjusts contrastive losses based on the relative importance of concepts and SNPs, guided by odds ratios or correlation scores.
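The weighted contrastive objective above can be sketched as follows; the batching and negative-sampling scheme are simplified assumptions (the positive pair is included in the denominator, a common InfoNCE convention), and the weights stand in for the rescaled odds ratios or correlation scores described in the text.

```python
# Minimal sketch of the weighted InfoNCE objective; shapes and names are illustrative.
import torch
import torch.nn.functional as F

def weighted_infonce(head_emb, tail_emb, neg_emb, weights, tau=0.07):
    """
    head_emb, tail_emb: (B, d) embeddings of positive pairs (h, t).
    neg_emb:            (K, d) embeddings of negative heads (the set C).
    weights:            (B,) pair weights w_{h,t} reflecting relatedness.
    """
    pos = (head_emb * tail_emb).sum(dim=1) / tau   # sim(h, t) / tau, inner product
    neg = neg_emb @ tail_emb.T / tau               # sim(h_neg, t) / tau for every negative
    logits = torch.cat([pos.unsqueeze(0), neg], dim=0)      # (K + 1, B)
    loss_per_pair = -F.log_softmax(logits, dim=0)[0]        # -log softmax of the positive
    return (weights * loss_per_pair).sum()
```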
Evaluation on various benchmarks demonstrates GENEREL's superior performance. It achieves the highest AUCs in detecting related biomedical concept pairs, outperforming existing models and baselines. For example, in detecting disease-gene associations from DisGeNET, GENEREL achieves an AUC of 0.760 ± 0.023 compared to 0.640 ± 0.023 for the BGE baseline. GENEREL also excels in associating biomedical concepts with SNPs, surpassing graph learning techniques and achieving competitive performance against matrix factorization methods on the MVP dataset. It effectively encodes the relative relatedness between traits and SNPs and shows robustness to synonyms.
Ablation studies confirm the importance of each training task in GENEREL, highlighting the synergistic effect of multi-task, multi-source learning. The case study illustrates GENEREL's ability to capture nuanced biological relationships, effectively clustering related diseases, genes, and SNPs while differentiating semantically similar diseases with distinct mechanisms. GENEREL marks a significant advance in integrating genomic and biomedical knowledge, paving the way for enhanced data integration and discovery.
Fair comparisons of causal parameters with many treatments and positivity violations by Alec McClean, Yiting Li, Sunjae Bae, Mara A. McAdams-DeMarco, Iván Díaz, Wenbo Wu https://arxiv.org/abs/2410.13522
Caption: Comparison of Average Causal Effects of Dialysis Providers on Readmission Rates using Different Methods under Positivity Violations
Comparing treatment outcomes is essential in causal inference, particularly in medicine and policy. Traditional methods rely on treatment-specific means (TSMs), offering fair comparisons by considering outcomes under each treatment for the same population. However, TSMs struggle when the positivity assumption is violated, that is, when some individuals have zero probability of receiving certain treatments. This is especially problematic with many treatment options, such as in provider profiling.
This paper introduces a framework for fair comparisons under positivity violations. The V-fairness criterion, based on counterfactual quantities, dictates that if one treatment's conditional TSM (given covariates V) is almost surely larger than another's, the corresponding causal parameter should also be larger. This avoids Simpson's paradox and ensures fair comparisons, independent of outcomes under unrelated treatments. The criterion's desirability is linked to V's granularity: coarser V leads to a more desirable criterion but requires stronger positivity. The authors formalize this trade-off, proving that identifying V-fair parameters necessitates a specific positivity assumption—milder than strong positivity but stronger than simply requiring non-zero probabilities.
The paper introduces interventions—trimmed TSMs, exponential tilts, and multiplicative shifts—satisfying V-fairness and identifiable under the milder positivity assumption. These interventions focus on the "trimmed set" of individuals with non-zero probability of receiving all treatments, mirroring the intuition behind matching and balancing weights. However, these parameters are non-smooth, hindering standard nonparametric efficiency theory. The authors address this with smooth approximations, offering a new approach to smoothed trimming.
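As a rough illustration of the smoothed-trimming idea (not the paper's estimator), one can replace the hard indicator that every treatment's propensity exceeds a threshold with a logistic approximation and use it to weight a simple inverse-probability estimate of a treatment-specific mean; the threshold, kernel, and IPW form below are simplified assumptions.

```python
# Sketch of smoothed trimming for a treatment-specific mean (TSM), assuming estimated
# propensities pi_hat[i, a] = P(A = a | X_i). The paper's doubly robust estimators are
# more involved; this only illustrates the smooth approximation of the trimmed set.
import numpy as np

def smooth_trim_weights(pi_hat, threshold=0.05, scale=0.01):
    """Smooth stand-in for 1{min_a pi(a|x) > threshold} via a logistic kernel."""
    margin = pi_hat.min(axis=1) - threshold
    return 1.0 / (1.0 + np.exp(-margin / scale))

def trimmed_tsm_ipw(y, a, pi_hat, treatment, threshold=0.05, scale=0.01):
    """IPW-style estimate of the TSM for `treatment` on the (smoothly) trimmed set."""
    w_trim = smooth_trim_weights(pi_hat, threshold, scale)
    ipw = (a == treatment) / pi_hat[:, treatment]
    return np.sum(w_trim * ipw * y) / np.sum(w_trim * ipw)
```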
Doubly robust-style estimators are developed for these smooth parameters, achieving parametric convergence rates and normal limiting distributions under nonparametric conditions. This allows efficient estimation of fair comparisons even with complex data. The analysis of dialysis provider performance in New York State demonstrates the methods' practicality. Despite positivity violations, minimal variation in provider performance was found after accounting for statistical uncertainty. However, the worst-performing provider had statistically significantly higher readmission rates than two others.
A Two-Stage Federated Learning Approach for Industrial Prognostics Using Large-Scale High-Dimensional Signals by Yuqi Su, Xiaolei Fang https://arxiv.org/abs/2410.11101
Industrial prognostics, predicting equipment failure using sensor data, is hampered by limited data. Individual organizations rarely have enough failure data for robust models, and data sharing is often precluded by privacy concerns. This paper proposes a two-stage federated learning approach for collaborative prognostic model building without sharing raw data. This addresses limitations of existing federated prognostic models, which often rely on deep learning (less effective with smaller datasets) and provide only point estimates of failure time.
The first stage addresses high-dimensional degradation signals using multivariate functional principal component analysis (MFPCA). MFPCA captures correlations within and between sensor readings, reducing dimensionality while preserving key information. To avoid centralized data, the paper introduces a federated randomized singular value decomposition (FRSVD) algorithm, enabling distributed MFPCA computation and reducing computational and communication costs compared to existing federated SVD methods. Users share lower-dimensional matrices, protecting data privacy.
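A minimal sketch of a federated randomized SVD, under the assumption that each user holds the signal rows of its own units and that only low-dimensional sketches travel between users and the server; the communication pattern and names below are illustrative, not the paper's exact FRSVD protocol.

```python
# Federated randomized SVD sketch: users share only (m_i x k) and (k x n) matrices,
# never their raw (m_i x n) degradation signals. Layout and step ordering are assumptions.
import numpy as np

def federated_rsvd(local_blocks, rank, oversample=10, seed=0):
    """local_blocks: list of (m_i, n) arrays, one per user. Returns approximate U, s, Vt."""
    n = local_blocks[0].shape[1]
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((n, rank + oversample))    # server broadcasts a test matrix

    # Step 1 (local): each user sketches its own block.
    sketches = [A_i @ omega for A_i in local_blocks]       # (m_i, k) per user

    # Step 2 (server): orthonormalize the stacked sketch and send each user its rows.
    Q, _ = np.linalg.qr(np.vstack(sketches))
    Q_blocks = np.split(Q, np.cumsum([A.shape[0] for A in local_blocks])[:-1])

    # Step 3 (local + server): users project their blocks onto the shared basis; server sums.
    B = sum(Q_i.T @ A_i for Q_i, A_i in zip(Q_blocks, local_blocks))   # (k, n)

    # Step 4 (server): a small SVD recovers the leading components.
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_b[:, :rank], s[:rank], Vt[:rank]
```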
The second stage predicts failure times using the extracted features (MFPC-scores). (Log)-location-scale (LLS) regression, well-suited for various failure time distributions, is employed. A federated parameter estimation algorithm based on gradient descent is proposed. Users compute and share local gradients with a central server, which aggregates them to update the model without accessing raw data.
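For intuition, here is one federated update step for the uncensored lognormal special case; the paper's algorithm also handles censoring and other (log)-location-scale families, so the gradients and aggregation below should be read as a simplified sketch with illustrative names.

```python
# Sketch of a federated gradient step for lognormal regression: log(T) ~ Normal(X beta, sigma^2).
# Each user computes a local gradient of the negative log-likelihood; the server aggregates.
import numpy as np

def local_gradient(X, t, beta, log_sigma):
    """Local NLL gradient; returned to the server instead of the raw data (X, t)."""
    sigma = np.exp(log_sigma)
    r = (np.log(t) - X @ beta) / sigma          # standardized residuals
    g_beta = -(X.T @ r) / sigma
    g_log_sigma = np.sum(1.0 - r ** 2)          # derivative w.r.t. log(sigma)
    return g_beta, g_log_sigma, len(t)

def server_step(local_grads, beta, log_sigma, lr=1e-3):
    """Aggregate user gradients (normalized by total sample size) and update the global model."""
    n_total = sum(n for _, _, n in local_grads)
    g_beta = sum(g for g, _, _ in local_grads) / n_total
    g_sig = sum(g for _, g, _ in local_grads) / n_total
    return beta - lr * g_beta, log_sigma - lr * g_sig
```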
Testing on simulated data and a NASA C-MAPSS dataset validates the model. In simulations, the federated approach achieves accuracy comparable to centralized methods using RSVD and SVD, significantly outperforming individual models trained on isolated datasets. The NASA case study confirms this, showing similar performance to centralized methods and substantial improvement over individual models, especially those trained on smaller datasets. This demonstrates the power of federated learning for collaborative, privacy-preserving prognostic model building.
This newsletter highlights a range of advancements in statistical modeling and machine learning, emphasizing both methodological innovations and impactful applications. The studies on the 2020 US Census and GENEREL demonstrate the power of sophisticated statistical techniques (f-DP and multi-task contrastive learning, respectively) to address critical challenges in data privacy and biomedical knowledge integration. The work on fair causal comparisons provides a robust framework for evaluating treatment effects even under challenging conditions of positivity violations and many treatment options. Finally, the development of a two-stage federated learning approach for industrial prognostics exemplifies the growing importance of privacy-preserving collaborative learning in practical settings. These diverse contributions collectively underscore the transformative potential of statistical learning across a wide spectrum of scientific and societal challenges.