Subject: Cutting-Edge Research in Statistical Modeling, Causal Inference, and Machine Learning
Hi Elman,
This newsletter highlights recent advancements in statistical modeling, causal inference, and machine learning, showcasing both methodological improvements and diverse real-world applications.
This collection of papers explores diverse methodological advancements and applications, with a notable emphasis on statistical modeling, causal inference, and machine learning. Several studies focus on improving existing techniques, such as Klingwort and Toepoel (2024), who employ meta-regression to analyze the impact of survey design features on response rates in crime surveys, and Manela, Yang, and Evans (2024), who propose a framework for evaluating model generalizability in causal inference under covariate shifts using frugal parameterization and synthetic benchmarks. Similarly, Deriba and Yang (2024) introduce a novel approach to performance-based risk assessment for large-scale transportation networks using the Transitional Markov Chain Monte Carlo method, addressing challenges posed by "gray swan" events. Furthermore, Abbahaddou et al. (2024) enhance the robustness of Graph Neural Networks against adversarial attacks using Conditional Random Fields in a post-hoc manner.
Another prominent theme is the application of advanced statistical methods to real-world problems. Liu, Gu, and Han (2024) utilize 2-dimensional truncated distributions to estimate journey time in vehicle re-identification surveys, addressing survivorship bias. Raabe et al. (2024) investigate the effect of funding on student achievement, while Galeotti et al. (2024) explore robust market interventions in differentiated oligopolies using noisy estimates of demand. Oh, Song, and Kim (2024) propose a deep learning-based loading protocol for parameter estimation of Bouc-Wen class models in structural engineering. Powers et al. (2024) demonstrate the use of linked micromaps for evidence-based policy, offering a novel visualization technique for geographically indexed statistics.
Several papers explore the intersection of human behavior and data analysis. Qi, Schölkopf, and Jin (2024) present a causal framework for responsibility attribution in human-AI collaboration, while Dan et al. (2024) analyze reporting fatigue in longitudinal social contact surveys, crucial for pandemic preparedness. Kim and Lee (2024) examine the impact of FOMC sentiment on market expectations through a bounded rationality lens. Fisher, Shahar, and Resheff (2024) introduce neural fingerprints for adversarial attack detection, enhancing the security of deep learning models. Li et al. (2024) investigate the effects of online moderation on player behavior in competitive action games.
Several contributions focus on specific applications of data analysis. From automated image color mapping for historical photographs (Arnold & Tilton, 2024) to analyzing factors influencing Amazon product sales rank (Chen et al., 2024), and modeling the effective capacity of battery energy storage systems for wind farms (Vaishampayan et al., 2024), the applications are diverse. Other notable applications include drought analysis in California (Ujjwal et al., 2024), a software package for teaching statistics through discrimination analysis (Abdullah et al., 2024), and personalized fragrance recommendation using hierarchical Relevance Vector Machines (Gonzales Martinez, 2024). Advancements in Graph Neural Networks are also presented, including the Conditional Random Field-based defense against adversarial attacks noted above (Abbahaddou et al., 2024) and a proposal for Centrality Graph Shift Operators (Abbahaddou et al., 2024).
Finally, several papers address specialized methodological challenges, spanning diverse areas such as multilingual hierarchical classification for job advertisements (Beręsewicz et al., 2024), porosity equivalence analysis in additive manufacturing (Miner & Narra, 2024), modeling dengue pandemics using the epidemiological Renormalization Group framework (D'Alise et al., 2024), and modeling zero-coupon Treasury yield curves with VIX as stochastic volatility (Park & Sarantsev, 2024). Other highlighted methodologies include Hamiltonian Monte Carlo methods for spectroscopy data analysis (McBride & Sgouralis, 2024), human-in-the-loop feature selection using Kolmogorov-Arnold Networks and Double Deep Q-Networks (Jahin et al., 2024), and a surrogate model for quay crane scheduling (Park & Bae, 2024). Several papers also address specific data analysis applications, such as bipartite network analysis in anime series (Sosa et al., 2024), change point detection in hydroclimatological data using a bootstrap Pettitt test (Conte et al., 2024), electricity consumption scenario generation using predictive clustering trees (Soenen et al., 2024), Inverse Reinforcement Learning for identifying suboptimal medical decision-making (Bovenzi et al., 2024), and a database of cast vote records from the 2020 US election (Kuriwaki et al., 2024). Further research explores modeling shooting performance in biathlon (Leonelli, 2024), detecting LUAD-associated genes using Wasserstein distance (Zhao et al., 2024), evaluating traumatic brain injury outcomes (Oishi et al., 2024), modeling wild animal home range and spatial interaction (Bayisa et al., 2024), performing relative survival analysis (Basak et al., 2024), and planning progressive type-I interval censoring schemes (Das et al., 2024). Additional studies explore transfer learning between US presidential elections (Miao et al., 2024), compare the cost efficiency of fMRI studies (Zhang et al., 2024), compare link prediction approaches in collaboration networks (Sosa et al., 2024), investigate the relationship between smartphone usage and sleep quality (Chaudhry et al., 2024), discuss independence in integrated population models (Barraquand, 2024), and introduce local indicators of mark association for spatial marked point processes (Eckardt & Moradi, 2024). Rounding out the collection, sentiment analysis of Amazon reviews using RoBERTa is also presented (Guo, 2024).
Increasing power and robustness in screening trials by testing stored specimens in the control arm by Hormuzd A. Katki, Li C. Cheung https://arxiv.org/abs/2411.05580
Cancer screening trials are notoriously resource-intensive, requiring large sample sizes and long follow-up periods to demonstrate a statistically significant reduction in mortality. This presents a major hurdle, especially with the rise of novel multicancer early detection (MCD) tests, where timely evidence is crucial. This paper proposes an "Intended Effect" (IE) design that focuses the analysis on individuals who ever test positive (ever-positives), since they are the only group that could benefit from the screening intervention. This targeted approach substantially increases statistical power compared to traditional analyses, potentially halving the required sample size and thus accelerating the evaluation of promising new screening technologies.
The IE design hinges on a key principle: analyzing the risk difference among ever-positives (RD<sub>pos</sub>) is more efficient than analyzing the overall trial risk difference (RD). This is because the RD among never-positives (RD<sub>neg</sub>) is assumed to be zero; those who never test positive cannot be impacted by the screening intervention. The power gain from the IE design is quantified by the ratio of non-centrality parameters (Z-ratio), comparing the IE analysis to the standard analysis: Z<sub>ratio</sub> = {1 - (RD<sub>neg</sub>/RD) * P(M-)} / {P(M+) / P(M+ | D+) * P(M+ | D-)}. Here, RD is the overall trial risk-difference, RD<sub>neg</sub> is the risk difference among never-positives, P(M+) is the probability of being ever-positive, P(M-) is the probability of being never-positive, and D+ denotes experiencing the trial outcome. This formula highlights the impact of the prevalence of ever-positives and the differential risk among them on the power gain achieved by the IE design.
However, the IE design relies on certain assumptions that require careful scrutiny and refinement. This study tackles three key challenges. First, to address the cost of testing all control-arm specimens, the researchers propose testing only a stratified sample, incorporating inverse-probability sampling weights to account for the sampling scheme. Simulations demonstrated that testing all primary-outcome control-arm specimens and 50% of the rest maintains nearly all the power of the full IE analysis, significantly reducing costs while minimally affecting statistical power. Second, the study addresses the potential for loss-of-signal in stored control-arm specimens due to degradation. The authors show that the IE design is inherently robust to loss-of-signal that is non-differential by outcome. For differential loss-of-signal, they introduce a correction using “retest-positive” fractions from the screen-arm, ensuring unbiased estimates. Third, the study tackles non-compliance with control-arm specimen collections. They show that the IE design is robust to non-compliance that is non-differential or differential by arm only. For non-compliance differential by both arm and outcome, corrections are introduced to remove bias and restore statistical power.
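To make the stratified testing scheme concrete, the sketch below simulates a control arm in which all primary-outcome specimens and a 50% random sample of the remainder are tested, then estimates the ever-positive fraction with inverse-probability weights. The data frame, column names, and the simple weighted estimator are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of stratified control-arm testing with inverse-probability
# weights; all names and numbers are illustrative, not from the paper.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
control = pd.DataFrame({
    "primary_outcome": rng.random(n) < 0.02,   # D+: experienced the trial outcome
    "ever_positive": rng.random(n) < 0.10,     # M+: would test positive if tested
})

# Sampling probabilities by stratum: test all primary-outcome specimens, 50% of the rest
control["p_sample"] = np.where(control["primary_outcome"], 1.0, 0.5)
control["tested"] = rng.random(n) < control["p_sample"]

tested = control[control["tested"]]
weights = 1.0 / tested["p_sample"]             # inverse-probability sampling weights

# Weighted estimate of P(M+) in the control arm
p_ever_positive = np.average(tested["ever_positive"], weights=weights)
print(f"Estimated P(M+) in the control arm: {p_ever_positive:.3f}")
```

The same weights would carry through to any downstream IE quantity estimated from the tested subsample, which is how the design preserves nearly all of its power while testing only part of the stored control-arm specimens.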
Finally, the study explores the impact of “unintended effects” of screening, such as false reassurance among screen-negatives leading to reduced adherence to standard-of-care screening. Simulations showed that even with unintended effects, the IE analysis still offers significantly greater power than the standard analysis. This robustness to unintended effects further strengthens the case for the IE design.
Effective Capacity of a Battery Energy Storage System Captive to a Wind Farm by Vinay A. Vaishampayan, Thilaharani Antony, Amirthagunaraj Yogarathnam https://arxiv.org/abs/2411.04274
Caption: Visualization of the Power Alignment Function, g(B, P)
With the increasing reliance on renewable energy sources like wind power, effective energy storage solutions are crucial for grid stability. This paper introduces a novel time-series methodology for determining the capacity credit of a Battery Energy Storage System (BESS) dedicated to a wind farm, moving beyond traditional reliability metrics. The core concept is the power alignment function, g(B, P), which quantifies the minimum peaker plant power needed to meet demand given a BESS with energy rating B (MWh) and power rating P (MW). This function becomes the foundation for calculating both absolute capacity, κ(B, P) = g(0, 0) - g(B, P), and incremental capacity, representing the added benefit of increasing BESS storage.
The study formulates the computation of the power alignment function as a linear program that minimizes peaker plant power subject to BESS constraints. This approach facilitates the analysis of both average and peak peaker power, offering a comprehensive view of BESS effectiveness. Using real-world wind speed and load demand data, the analysis demonstrates how BESS sizing requirements vary with different wind and demand profiles. The results reveal that while a BESS can recover almost all lost wind power, the necessary energy capacity varies significantly: recovering 50% of lost wind power requires vastly different BESS energy ratings across different days, highlighting the importance of considering specific wind and demand patterns.
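As a rough illustration of such a linear program, the sketch below minimizes the peak peaker output over a short horizon, assuming hourly time steps, a lossless battery that starts empty, and toy wind and demand profiles. These modeling simplifications and the function name are assumptions for illustration, not the paper's exact formulation (an average-power variant would minimize the mean peaker output instead of its peak).

```python
# A sketch of the power alignment LP: minimum peaker capacity needed to meet
# demand given a BESS with energy rating B (MWh) and power rating P (MW).
import numpy as np
import cvxpy as cp

def power_alignment(wind, demand, B, P):
    T = len(demand)
    charge = cp.Variable(T, nonneg=True)     # MW into the battery
    discharge = cp.Variable(T, nonneg=True)  # MW out of the battery
    soc = cp.Variable(T + 1, nonneg=True)    # state of charge, MWh
    peaker = cp.Variable(T, nonneg=True)     # MW from the peaker plant
    peak = cp.Variable(nonneg=True)          # objective: peak peaker output

    constraints = [
        soc[0] == 0,                                   # battery assumed empty at start
        soc[1:] == soc[:-1] + charge - discharge,      # 1-hour energy balance
        soc <= B,
        charge <= P,
        discharge <= P,
        wind + discharge - charge + peaker >= demand,  # demand is always met
        peaker <= peak,
    ]
    cp.Problem(cp.Minimize(peak), constraints).solve()
    return float(peak.value)

# Toy 6-hour profiles (MW); capacity credit kappa(B, P) = g(0, 0) - g(B, P)
wind = np.array([60.0, 40.0, 30.0, 10.0, 5.0, 20.0])
demand = np.array([40.0, 45.0, 50.0, 55.0, 60.0, 50.0])
g0 = power_alignment(wind, demand, B=0.0, P=0.0)
gBP = power_alignment(wind, demand, B=40.0, P=20.0)
print(f"g(0,0) = {g0:.1f} MW, g(B,P) = {gBP:.1f} MW, kappa = {g0 - gBP:.1f} MW")
```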
The authors also provide theoretical insights into the power alignment function, deriving a general lower bound for g<sub>av</sub>(B, P) and refining it based on the characteristics of wind and demand sequences. This leads to a graph interpretation of the lower bound as a flow problem, where the maximum cost path determines the incremental capacity. The theory reveals the crucial role of the run structure of the excess demand time series in influencing the behavior of g<sub>av</sub>(B, P) and the incremental capacity. As the BESS energy rating increases, edges disappear from the maximum cost path, resulting in diminishing incremental capacity. This understanding of the relationship between BESS size and incremental capacity is crucial for optimizing BESS investments.
Cities Reconceptualized: Unveiling Hidden Uniform Urban Shape through Commute Flow Modeling in Major US Cities by Margarita Mishina, Mingyi He, Venu Garikapati, Stanislav Sobolevsky https://arxiv.org/abs/2411.05455
Caption: Histograms of standardized population densities for 12 major U.S. cities after applying an Origin-Constrained Gravity Model to commute data. The observed distributions (blue) closely align with the theoretical normal distribution (black), confirming that when geographical constraints are mitigated, urban population densities tend towards a log-normal pattern. This reveals a hidden urban form governed by mobility patterns.
Urban planning often contends with the complex interplay of historical, geographical, and economic factors that shape city development. This paper presents a compelling argument for a universal principle of urban organization, hidden beneath these geographical constraints. By modeling commute flows, the researchers uncover a hidden urban form where population distribution adheres to a log-normal pattern across diverse U.S. cities. This offers a new lens through which to understand urban development and design more efficient and resilient cities.
The study analyzes commute data from 12 major U.S. cities, constructing a commute network for each city with census tracts as nodes and weighted edges representing commuter flow. Initial population density calculations, based on population divided by area, show significant deviations from a log-normal distribution in most cities due to the influence of geographical and historical factors. Chicago, however, exhibits a log-normal distribution, possibly due to its planned development and uniform infrastructure.
To mitigate these geographical constraints, the researchers employ an Origin-Constrained Gravity Model: T<sub>ij</sub> = O<sub>i</sub> D<sub>j</sub> exp(-αd<sub>ij</sub><sup>β</sup>) / Σ<sub>k</sub> D<sub>k</sub> exp(-αd<sub>ik</sub><sup>β</sup>). This model predicts commute flows (T<sub>ij</sub>) between census tracts i and j based on origin population (O<sub>i</sub>), destination jobs (D<sub>j</sub>), distance (d<sub>ij</sub>), and parameters α and β controlling distance decay. By fitting this model and allowing location coordinates to adjust based on commuting patterns, a new embedding space emerges. In this reshaped space, population density, recalculated using Voronoi tessellations, closely follows a log-normal distribution, f(x; μ, σ) = (1 / (xσ√(2π))) exp(-(ln x - μ)<sup>2</sup> / (2σ<sup>2</sup>)), across all cities. A random-walk preferential attachment simulation further supports the emergence of this distribution in idealized, unconstrained urban growth, suggesting that the log-normal distribution reflects a natural organizational pattern for urban populations when free from geographical limitations.
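To see the model mechanics in code, the sketch below computes origin-constrained gravity flows for a handful of toy tracts; the inputs and parameter values are made up for illustration, and the final check verifies the origin constraint Σ<sub>j</sub> T<sub>ij</sub> = O<sub>i</sub>.

```python
# A minimal sketch of the origin-constrained gravity model described above.
import numpy as np

def gravity_flows(origins, jobs, dist, alpha, beta):
    """T_ij = O_i * D_j * exp(-alpha * d_ij**beta) / sum_k D_k * exp(-alpha * d_ik**beta)."""
    decay = jobs[None, :] * np.exp(-alpha * dist ** beta)   # D_j exp(-alpha d_ij^beta)
    return origins[:, None] * decay / decay.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n = 5
origins = rng.integers(1_000, 5_000, n).astype(float)    # residents per tract
jobs = rng.integers(500, 4_000, n).astype(float)          # jobs per tract
dist = rng.uniform(1, 20, (n, n))                         # distances between tracts, km

T = gravity_flows(origins, jobs, dist, alpha=0.1, beta=1.0)
assert np.allclose(T.sum(axis=1), origins)  # origin-constrained: each row sums to O_i
```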
Mitigating Consequences of Prestige in Citations of Publications by Michael Balzer, Adhen Benlahlou https://arxiv.org/abs/2411.05584
Caption: Distribution of Weighted Citations (arsinh Transformed) for Biomedical Publications
This study addresses the pervasive issue of prestige bias in scientific citations, proposing a method to predict a paper's impact based solely on pre-publication characteristics. This approach aims to mitigate the Matthew Effect, where established authors and prestigious journals receive disproportionately more citations regardless of scientific merit. This has significant implications for research funding, enabling agencies to prioritize truly high-quality science over established prestige.
Focusing on biomedical publications and leveraging the prevalence of double-blind peer review, the researchers construct datasets from PubMed and the NIH Open Citation Collection. These datasets include pre-publication variables like number of references, mean age of references, paper length, and MeSH term proportions. Using both linear models (LM) and generalized linear models (GLM), specifically negative binomial regression, they predict both weighted citations (using SCImago Journal Rank impact factor) and raw citation counts. For weighted citations, the inverse hyperbolic sine transformation, arsinh(⋅), ensures approximate normality. The model formulation for the LM is E(SJR<sub>i</sub>) = μ<sub>i</sub> = x<sub>i</sub>β and for the GLM is E(Citations<sub>i</sub>) = ν<sub>i</sub> = h(x<sub>i</sub>α) = exp(x<sub>i</sub>α), where x<sub>i</sub> represents the independent variables and β and α are the corresponding coefficient vectors.
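The two model families can be sketched in a few lines with statsmodels; the toy features and simulated outcomes below are purely illustrative stand-ins for the paper's PubMed-derived variables.

```python
# A minimal sketch of the LM on arsinh-transformed weighted citations and the
# negative binomial GLM on raw counts; data and column names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "n_references": rng.poisson(40, n),
    "mean_ref_age": rng.uniform(1, 15, n),   # years
    "paper_length": rng.uniform(4, 20, n),   # pages
})
citations = rng.negative_binomial(5, 0.3, n)       # raw citation counts
weighted = citations * rng.uniform(0.5, 3.0, n)    # SJR-weighted citations

Xc = sm.add_constant(X)

# Linear model on arsinh-transformed weighted citations: E(SJR_i) = x_i beta
lm = sm.OLS(np.arcsinh(weighted), Xc).fit()

# Negative binomial GLM on raw counts: E(Citations_i) = exp(x_i alpha)
glm = sm.GLM(citations, Xc, family=sm.families.NegativeBinomial()).fit()

print(f"LM R^2: {lm.rsquared:.3f}")
print(glm.params)
```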
The results reveal a surprisingly high predictive accuracy using only pre-publication variables. For example, in the linear model for weighted citations, the number of references shows a significant positive effect, while the mean age of references has a negative impact, suggesting that papers citing newer literature tend to receive more citations. The models achieve R² values up to 0.427, indicating that pre-publication factors can explain a substantial portion of citation variance. Robustness checks using different train-test splits and model-based gradient boosting confirm these findings.
This work offers a more objective assessment of scientific output, decoupling citation prediction from author and journal prestige. This could lead to a fairer funding system based on intrinsic research merit. However, the authors acknowledge limitations, including the field-specific nature of the analysis and the potential for undiscovered variables. Future research could explore incorporating additional pre-publication information, such as stylistic analysis or text-mined features, to further refine prediction models and extend the approach to other scientific disciplines.
This newsletter highlights a diverse range of advancements in statistical modeling and data analysis. From improving the power of cancer screening trials through innovative study designs to understanding the hidden order governing urban populations through commute flow modeling, the research presented pushes the boundaries of methodological innovation and application. The time-series approach to quantifying the effective capacity of battery storage captive to a wind farm is particularly timely, addressing the practical challenges of integrating renewable generation into the grid. Finally, the effort to mitigate prestige bias in citation analysis offers a path towards a more equitable and objective evaluation of scientific contributions, potentially reshaping research funding and scientific discourse. Taken together, these advancements demonstrate the power of data-driven approaches to address complex challenges across diverse domains.