This collection of papers presents advancements across diverse areas of statistical modeling and data analysis, with notable contributions to network inference, causal inference, and time series analysis. Aghdam and Solis-Lemus (2024) introduce CMiNet, an R package for constructing consensus microbiome networks by integrating results from nine established network construction methods and one novel method. This approach addresses the challenge of variability in microbiome network inference and offers a more robust representation of microbial interactions. Ferreira et al. (2024) propose a consistent model selection procedure for identifying functional interactions among stochastic neurons with variable-length memory, demonstrating its effectiveness on both simulated and real spike train data. This work contributes to the understanding of complex neuronal networks and provides a rigorous framework for analyzing neuronal spiking activity. The two-part series by Robertson et al. (2024a, 2024b) provides a methodological review and practical guidance on confidence intervals for adaptive trial designs, addressing the challenges of undercoverage and inconsistency with hypothesis testing decisions. Their work offers valuable insights for researchers designing and interpreting adaptive clinical trials.
Several papers focus on applying statistical and machine learning techniques to real-world problems. Bi and Neethirajan (2024) investigate the correlation between dairy farm practices and methane emissions using satellite data and machine learning, highlighting the potential of genetic selection for emissions reduction. Abbahaddou et al. (2024) introduce GMM-GDA, a graph data augmentation algorithm based on Gaussian Mixture Models, to enhance the generalization capabilities of Graph Neural Networks (GNNs). Matloff and Mittal (2024) propose towerDebias, a novel debiasing method based on the Tower Property to mitigate the influence of sensitive variables in black-box models, contributing to the growing field of fair machine learning. Chinthala et al. (2024) analyze the impact of COVID-19 on the taxi industry and travel behavior in Chicago using spatial analysis and visualization, revealing significant shifts in travel patterns. Zhou and Hooker (2024) employ Targeted Maximum Likelihood Estimation (TMLE) for Integral Projection Models (IPMs) in population ecology, offering a robust approach for estimating population dynamics and demographic structure.
A further group of papers combines methodological development with domain-specific applications. Pfadt-Trilling and Fortier (2024) critique current assumptions about carbon trading and advocate for separate emission reduction targets by greenhouse gas species. Mboko et al. (2024) explore machine learning methods for traffic forecasting in multimodal transport systems using population mobility data. Abakasanga et al. (2024) apply machine learning to predict hospital length of stay for patients with learning disabilities and multiple long-term conditions, focusing on equitable prediction across ethnic groups. Tran et al. (2024) propose a CUSUM procedure based on excess hazard models for monitoring changes in survival time distribution in registry data with missing or uncertain cause of death information. Brown et al. (2024) investigate the predictive capabilities of different mathematical models for English football league outcomes using crowd-sourced player valuations. Englert et al. (2024) utilize Bayesian Additive Regression Trees (BART) to model joint health effects of environmental exposure mixtures, demonstrating the approach on asthma-related emergency department visits.
Several papers contribute to specific methodological areas. Wan (2024) addresses the "PSM paradox" in Propensity Score Matching, arguing that it is not a legitimate concern. Zhang et al. (2024) analyze the performance of ultra-reliable low-latency communication (uRLLC) in a scalable cell-free RAN system. Kousovista et al. (2024) utilize unsupervised clustering to identify temporal patterns of multiple long-term conditions in individuals with intellectual disabilities. Gjoka et al. (2024) develop a diagnostic test for detecting filamentarity in spatial point processes, applying it to climate and galactic data. Saha et al. (2024) use scientific deep learning to project methane emissions from oil sands tailings, revealing potential underestimations. Trencséni (2024) explores the use of Monte Carlo simulations in A/B testing. Nguyen et al. (2024) propose a Variational Bayes approach for portfolio construction. Finally, several papers present specialized applications and methodological refinements, spanning diverse areas such as soil-carbon sequestration prediction (Pagendam et al., 2024), joint spatiotemporal modeling of zooplankton and whale abundance (Kang et al., 2024), and multifractal complexity in cryptocurrency trading (Wątorek et al., 2024). These papers collectively demonstrate the continued development and application of sophisticated statistical and machine learning methods across a broad range of scientific domains.
Language Models as Causal Effect Generators by Lucius E.J. Bynum, Kyunghyun Cho https://arxiv.org/abs/2411.08019
Caption: This causal graph depicts the core mechanism of a sequence-driven structural causal model (SD-SCM). The LLM, represented by F, generates variable y based on its parent x, which in turn is influenced by u. The dashed lines indicate influences learned by the LLM, while solid lines represent the explicit causal structure defined by the user. The variable t represents a downstream effect influenced by both x and the LLM's internal representations, capturing the causal relationships encoded within the LLM.
This paper introduces a novel framework, sequence-driven structural causal models (SD-SCMs), for leveraging Large Language Models (LLMs) to generate data with a controllable causal structure. An SD-SCM combines an LLM with a user-defined Directed Acyclic Graph (DAG). The DAG dictates the causal relationships between variables, while the LLM implicitly learns the functional relationships, or structural equations, from its training data. This approach enables researchers to sample from not only observational and interventional distributions but also counterfactual distributions, providing a powerful tool for evaluating causal inference methods. A key advantage of SD-SCMs is that they don't require manual specification of the functional relationships between variables, offering greater flexibility and realism compared to traditional methods.
The SD-SCM framework relies on two key abstractions: domain-restricted sampling and parent-only concatenation. Domain-restricted sampling ensures that the generated samples fall within the desired domain by conditioning the LLM's output on previous inputs and normalizing probabilities within the specified sample space. Parent-only concatenation ensures that the input to the LLM for each variable consists only of its causal parents as defined in the DAG. This enforces the specified causal structure in the generated data. The authors demonstrate the power of this approach by creating a benchmark for causal effect estimation, generating 1000 datasets based on a breast cancer scenario using GPT-2 and Llama-3-8b. The benchmark explores the causal effect of a tumor's PD-L1 expression on treatment plans, including various outcome types and settings, notably the presence of hidden confounders.
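To make the two abstractions concrete, here is a minimal sketch of how an SD-SCM sample might be drawn: variables are visited in topological order, each prompt is built only from the sampled values of that variable's parents, and the LLM's scores are renormalized over a user-specified domain. The `candidate_logprob` scorer, `prompt_template`, and `topological_order` helper are hypothetical stand-ins for an LLM interface, not the paper's implementation.

```python
import math
import random
from typing import Callable, Dict, List, Optional

def topological_order(dag: Dict[str, List[str]]) -> List[str]:
    """Order variables so that every parent precedes its children."""
    order, seen = [], set()
    def visit(v: str) -> None:
        if v in seen:
            return
        seen.add(v)
        for p in dag[v]:
            visit(p)
        order.append(v)
    for v in dag:
        visit(v)
    return order

def sample_sd_scm(
    dag: Dict[str, List[str]],                        # variable -> list of causal parents
    domains: Dict[str, List[str]],                    # variable -> allowed values
    prompt_template: Callable[[str, Dict[str, str]], str],
    candidate_logprob: Callable[[str, str], float],   # hypothetical LLM scoring interface
    interventions: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Draw one sample from an SD-SCM via parent-only prompts and domain-restricted sampling."""
    interventions = interventions or {}
    sample: Dict[str, str] = {}
    for var in topological_order(dag):
        if var in interventions:                      # do(X = x): bypass the LLM entirely
            sample[var] = interventions[var]
            continue
        parents = {p: sample[p] for p in dag[var]}    # parent-only concatenation
        prompt = prompt_template(var, parents)
        # Domain-restricted sampling: score only the allowed values, then renormalize.
        logps = [candidate_logprob(prompt, value) for value in domains[var]]
        m = max(logps)
        weights = [math.exp(lp - m) for lp in logps]
        sample[var] = random.choices(domains[var], weights=weights, k=1)[0]
    return sample
```

Counterfactual sampling follows the same pattern, with the intervened variables overridden while the remaining structural mechanisms (the LLM's conditional distributions) are held fixed.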
The benchmark evaluation included several popular causal inference methods, such as Causal Forest, double machine learning (DML), doubly robust meta-learning (DR), Bayesian Additive Regression Trees (BART), and conformalized counterfactual quantile regression (CQR). These methods were tested on average treatment effect (ATE), conditional average treatment effect (CATE), and individual treatment effect (ITE) estimation. The results revealed interesting performance patterns. Simpler methods like BART performed remarkably well for ATE estimation when all covariates were observed, achieving R² values close to 1. However, performance degraded significantly in the presence of hidden confounding, highlighting the importance of addressing unobserved confounders in causal inference. For ITE estimation, CQR maintained nominal coverage but resulted in wider intervals, while BART for CATE exhibited overconfidence with tighter but less reliable intervals under hidden confounding. Furthermore, the authors demonstrated the utility of SD-SCMs in auditing LLMs for encoded causal effects, uncovering differences between GPT-2 and Llama-3-8b in the breast cancer scenario. This suggests the potential of SD-SCMs for identifying biases and misinformation encoded within LLMs.
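For readers less familiar with these estimands, a minimal doubly robust (AIPW) estimator of the average treatment effect, in the spirit of the DR methods evaluated here, is sketched below. The scikit-learn learners are illustrative choices, and the cross-fitting that DML would add is omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def aipw_ate(X: np.ndarray, t: np.ndarray, y: np.ndarray) -> float:
    """Doubly robust (AIPW) estimate of the average treatment effect E[Y(1) - Y(0)]."""
    # Propensity model: P(T = 1 | X), clipped to avoid extreme inverse weights.
    ps = GradientBoostingClassifier().fit(X, t).predict_proba(X)[:, 1]
    ps = np.clip(ps, 0.01, 0.99)
    # Outcome models fit separately on treated and control units.
    mu1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1]).predict(X)
    mu0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0]).predict(X)
    # AIPW score: outcome-model difference plus inverse-propensity corrections.
    psi = mu1 - mu0 + t * (y - mu1) / ps - (1 - t) * (y - mu0) / (1 - ps)
    return float(psi.mean())
```

The estimator is consistent if either the propensity model or the outcome models are well specified, which is exactly the property that breaks down under the hidden-confounding settings of the benchmark.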
Compactly-supported nonstationary kernels for computing exact Gaussian processes on big data by Mark D. Risser, Marcus M. Noack, Hengrui Luo, Ronald Pandolfi https://arxiv.org/abs/2411.05869
Caption: The image visualizes the sparsity-discovering nonstationary kernel C<sub>y</sub> used in a novel Gaussian Process method. Different curves represent the kernel with varying smoothness parameter 'b', demonstrating how the kernel adapts to different covariance structures. The parameters 'r', 'a', and 'h' control the kernel's range, amplitude, and smoothness, enabling it to capture both sparsity and nonstationarity in the data.
Gaussian Processes (GPs) are valuable for probabilistic machine learning, but their application to large datasets has been limited by computational scaling and the restrictions of stationary kernels. This paper presents a novel kernel that overcomes both limitations, enabling exact GP inference on large datasets with nonstationary covariance structures. Traditional GP implementations rely on stationary kernels, which lack flexibility in modeling real-world data, and exact inference scales poorly with data size (O(N³) computations and O(N²) memory). While approximate methods exist for scalability, they introduce subjectivity and potential inaccuracies. This new approach addresses these challenges by designing a kernel that can learn both sparsity and nonstationarity directly from the data.
The core innovation is the design of the sparsity-discovering nonstationary kernel, denoted as C<sub>y</sub>. It combines a compactly supported kernel, C<sub>sparse</sub>, with a more general "core" kernel, C<sub>core</sub>:
C<sub>y</sub>(x, x'; θ<sub>y</sub>) = C<sub>core</sub>(x, x'; θ<sub>core</sub>) × C<sub>sparse</sub>(x, x'; θ<sub>sparse</sub>)
C<sub>sparse</sub> is constructed using sums and products of bump functions, allowing it to model both zero and nonzero covariances in a data-driven way. This allows the GP to learn sparsity directly from the data, unlike approximate methods where sparsity is imposed subjectively. C<sub>core</sub> can be any positive semi-definite function, providing flexibility in modeling various covariance structures. The authors demonstrate the use of both stationary (e.g., Matérn) and nonstationary core kernels. The resulting kernel, C<sub>y</sub>, is embedded within a fully Bayesian GP model, enabling full uncertainty quantification.
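As a rough illustration of the product construction above, the sketch below multiplies a Matérn-3/2 core kernel by a compactly supported Wendland factor. The Wendland function is a stand-in for the paper's bump-function construction of C<sub>sparse</sub>, chosen here because it is a known positive-definite, compactly supported kernel; the product of two positive-definite kernels is again positive definite.

```python
import numpy as np

def matern32(x1: np.ndarray, x2: np.ndarray, lengthscale: float = 1.0, variance: float = 1.0) -> np.ndarray:
    """Stationary Matérn-3/2 kernel, one possible choice for the core kernel C_core."""
    d = np.abs(x1[:, None] - x2[None, :])
    s = np.sqrt(3.0) * d / lengthscale
    return variance * (1.0 + s) * np.exp(-s)

def wendland(x1: np.ndarray, x2: np.ndarray, radius: float = 2.0) -> np.ndarray:
    """Compactly supported Wendland kernel: exactly zero whenever |x - x'| >= radius."""
    r = np.abs(x1[:, None] - x2[None, :]) / radius
    return np.clip(1.0 - r, 0.0, None) ** 4 * (4.0 * r + 1.0)

def c_y(x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    """Product kernel C_y = C_core x C_sparse: flexible, yet with many exact zeros."""
    return matern32(x1, x2) * wendland(x1, x2)

# The resulting covariance matrix has many entries that are exactly zero, which is
# what allows exact GP linear algebra to exploit sparse solvers at large N.
x = np.linspace(0.0, 10.0, 500)
K = c_y(x, x)
print("fraction of exact zeros:", float(np.mean(K == 0.0)))
```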
The researchers validated their method on synthetic and real-world datasets. Synthetic experiments showed that the kernel effectively identified sparsity and nonstationarity, outperforming existing exact and approximate GP methods, especially with nonstationary and sparse underlying data. A real-world application involved predicting daily maximum temperature using over one million measurements across the contiguous United States. This new method significantly outperformed the state-of-the-art inverse-distance weighting method, achieving a 15.3% reduction in root mean square error and providing crucial uncertainty quantification. This work represents a significant advancement in GP methodology, providing exact inference on massive datasets with flexible, nonstationary covariance structures.
SureMap: Simultaneous Mean Estimation for Single-Task and Multi-Task Disaggregated Evaluation by Mikhail Khodak, Lester Mackey, Alexandra Chouldechova, Miroslav Dudík https://arxiv.org/abs/2411.09730
Caption: The image visualizes SureMap's additive prior covariance matrix $\Lambda(\tau)$, decomposing it into components representing overall variance, sex-based variance, age-based variance, and the interaction between sex and age. The colored matrices illustrate how each component contributes to the overall structure, with the checkerboard pattern representing the overall variance, block-diagonal structures representing sex and age variances, and the diagonal structure capturing the sex-age interaction. This structured prior allows SureMap to effectively leverage information across subpopulations and improve disaggregated evaluation, especially in data-scarce settings.
Disaggregated evaluation, which assesses AI model performance across various subpopulations, is crucial for understanding fairness and performance disparities. Data scarcity, especially for intersectional groups, is a significant challenge. This problem is exacerbated when multiple clients procure the same AI model, each conducting their own disaggregated evaluation, leading to the multi-task disaggregated evaluation problem. This paper introduces SureMap to address both single-task and multi-task disaggregated evaluation. SureMap transforms the problem into a structured simultaneous Gaussian mean estimation problem. It leverages an additive prior that captures relationships between subpopulations with a linear (in the number of subpopulations) number of parameters, balancing efficiency and expressivity. Importantly, SureMap incorporates external data, such as from the model developer or other clients, to inform the prior. The prior's parameters are tuned using Stein's Unbiased Risk Estimate (SURE), eliminating the need for data-splitting and improving efficiency in data-scarce scenarios.
The core of SureMap involves estimating the mean $\mu$ of a multivariate Gaussian with known diagonal covariance $\Sigma$ given a sample $y \sim N(\mu, \Sigma)$, using the MAP estimator:
$\hat{\mu}_{MAP}(\tau) = (\Lambda^{-1}(\tau) + \Sigma^{-1})^{-1}(\Lambda^{-1}(\tau)\theta + \Sigma^{-1}y)$
where $\Lambda(\tau)$ is the prior covariance parameterized by $\tau$, and $\theta$ is the prior mean. The authors evaluate SureMap on various datasets, including existing tabular datasets and new ones for both single-task and multi-task settings. SureMap performed competitively with or better than existing baselines in single-task scenarios, showing significant improvements with intersectional attributes. In multi-task evaluations, incorporating data from multiple clients led to substantial accuracy gains. For instance, on the State-Level ACS dataset, multi-task SureMap achieved a 2x improvement over the naive estimator. Even with limited external data, multi-task SureMap outperformed single-task methods and other multi-task baselines, demonstrating the power of leveraging cross-client information. SureMap offers a promising new approach to disaggregated evaluation, effectively addressing data scarcity and leveraging the benefits of multi-task learning.
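In code, the MAP estimator above is a standard Gaussian shrinkage computation. The sketch below uses numpy with an illustrative two-component prior covariance (a shared component plus group-specific variance, loosely mirroring the additive structure in the caption) and omits the SURE-based tuning of τ.

```python
import numpy as np

def map_estimate(y: np.ndarray, Sigma: np.ndarray, Lambda: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """MAP / posterior-mean estimate of mu for y ~ N(mu, Sigma) with prior mu ~ N(theta, Lambda).

    Implements mu_hat = (Lambda^-1 + Sigma^-1)^-1 (Lambda^-1 theta + Sigma^-1 y).
    """
    Lam_inv = np.linalg.inv(Lambda)
    Sig_inv = np.linalg.inv(Sigma)
    A = Lam_inv + Sig_inv
    b = Lam_inv @ theta + Sig_inv @ y
    return np.linalg.solve(A, b)

# Example with illustrative numbers: per-group error estimates y with group-specific
# sampling variances, shrunk toward a common prior mean theta.
y = np.array([0.31, 0.25, 0.40, 0.18])
Sigma = np.diag([0.01, 0.04, 0.09, 0.02])            # per-group sampling variance
Lambda = 0.02 * np.ones((4, 4)) + 0.03 * np.eye(4)   # shared + group-specific prior variance
theta = np.full(4, y.mean())
print(map_estimate(y, Sigma, Lambda, theta))
```

Groups with noisier estimates (larger diagonal entries of Σ) are pulled more strongly toward θ, which is the mechanism behind SureMap's gains in data-scarce intersectional cells.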
We must re-evaluate assumptions about carbon trading for effective climate change mitigation by Alyssa R. Pfadt-Trilling, Marie-Odile P. Fortier https://arxiv.org/abs/2411.08053
This paper challenges the fundamental assumptions underlying current carbon trading policies, arguing that the assumed fungibility of greenhouse gases (GHGs) and the use of single-point values like the Global Warming Potential (GWP) are flawed and lead to ineffective climate action. The authors critique the GWP, which calculates the relative contribution of a GHG to climate change by comparing its radiative forcing to that of CO<sub>2</sub> over a specific time horizon. The formula for GWP is:
GWP = ∫<sub>0</sub><sup>n</sup> a<sub>i</sub>c<sub>i</sub>dt / ∫<sub>0</sub><sup>n</sup> a<sub>CO2</sub>c<sub>CO2</sub>dt
where a<sub>i</sub> and c<sub>i</sub> represent the instantaneous radiative forcing and concentration of GHG species i, respectively, and n is the chosen time horizon. They highlight its limitations, including its dynamic nature, dependence on atmospheric composition, and the significant uncertainty associated with CO<sub>2</sub> lifetime. The GWP fails to adequately differentiate between the climatic impacts of short-lived climate forcers (SLCFs) like methane and long-lived GHGs like CO<sub>2</sub>, leading to potential inaccuracies in temperature change estimations.
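The horizon dependence the authors criticize is easy to see in a toy version of the ratio above, which integrates forcing from a pulse of a short-lived gas against a pulse of CO<sub>2</sub>, using a single exponential decay for the short-lived gas and a simplified CO<sub>2</sub> impulse response. All numerical constants are illustrative placeholders, not IPCC values.

```python
import numpy as np

def toy_gwp(a_i, tau_i, a_co2, irf_co2, horizon, n=10_000):
    """Toy GWP: time-integrated forcing of gas i over a horizon, relative to CO2.

    a_i, a_co2 -- radiative efficiencies (placeholder units)
    tau_i      -- atmospheric lifetime of gas i, assuming simple exponential decay
    irf_co2    -- fraction of a CO2 pulse remaining in the atmosphere at time t
    """
    t = np.linspace(0.0, horizon, n)
    dt = t[1] - t[0]
    agwp_i = np.sum(a_i * np.exp(-t / tau_i)) * dt
    agwp_co2 = np.sum(a_co2 * irf_co2(t)) * dt
    return agwp_i / agwp_co2

# Illustrative placeholders only: a short-lived gas 100x more radiatively efficient
# than CO2 with a 12-year lifetime, versus a CO2 pulse with a persistent fraction.
irf_co2 = lambda t: 0.2 + 0.8 * np.exp(-t / 100.0)
print("toy GWP-20: ", round(toy_gwp(100.0, 12.0, 1.0, irf_co2, 20.0), 1))
print("toy GWP-100:", round(toy_gwp(100.0, 12.0, 1.0, irf_co2, 100.0), 1))
```

Even with these made-up parameters, the 20-year value is several times the 100-year value, illustrating why a single-point GWP cannot capture the distinct dynamics of short-lived and long-lived gases.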
The assumption of directional and temporal fungibility of carbon across sources and sinks is also scrutinized. While emitting one ton of CO<sub>2</sub> may have a similar impact regardless of location, carbon sequestration effectiveness varies greatly depending on factors like ecosystem type and management practices. This variability undermines the reliability of carbon offset projects, especially forestry initiatives, which often overestimate carbon sequestration potential. The concept of 'permanence' in carbon offsets is also misleading, as carbon residence times in biological sinks do not match the atmospheric lifetime of CO<sub>2</sub>. The paper further differentiates between decreasing GHG emissions and GHG sequestration, emphasizing that while both contribute to carbon offsets, they have distinct impacts on climate change mitigation. Reducing emissions only slows the growth of atmospheric GHG concentrations, whereas sequestration actively removes GHGs. The authors also raise ethical concerns about carbon commodification and the potential for exploitative practices in carbon offset markets.
Instead of relying on carbon credits, the authors advocate for a 'multi-basket' approach with individual emission reduction targets for different GHG species, reflecting their unique contributions to climate change. This, combined with a focus on technological and systemic changes, is considered more effective for achieving temperature stabilization. They also emphasize the need for nuanced decision-making regarding temporary and permanent carbon sequestration when considering offsetting emissions.
Methane projections from Canada's oil sands tailings using scientific deep learning reveal significant underestimation by Esha Saha, Oscar Wang, Amit K. Chakraborty, Pablo Venegas Garcia, Russell Milne, Hao Wang https://arxiv.org/abs/2411.06741
This study utilizes scientific deep learning to reveal a substantial underestimation of methane emissions from Canada's oil sands tailings ponds. These ponds, a byproduct of bitumen extraction, are known methane sources. The researchers developed a physics-constrained machine learning model incorporating real-time weather data, mechanistic models from laboratory experiments, and industrial reports, aiming to connect emission levels with atmospheric methane concentrations. The model was trained and validated using data from weather stations near active tailings ponds, focusing on those capturing methane emissions without interference from other sources.
The model successfully predicted methane emissions and concentrations, outperforming alternative models in long-term projections. It identified active ponds and estimated their emission levels, finding each could emit 950 to 1500 tonnes of methane annually, equivalent to the CO<sub>2</sub> emissions from at least 6000 gasoline-powered vehicles. Surprisingly, the model also suggested that abandoned ponds, often assumed to have negligible emissions, could become active and emit up to 1000 tonnes of methane per year. By connecting emissions to concentrations, the researchers estimated that emissions around major oil sands regions would need to be reduced by approximately 12% to restore average methane concentrations to 2005 levels.
The study employed a modified physics-constrained optimization problem:
min ∑<sub>i∈Iobs</sub> (u(x<sub>i</sub>) – u<sub>i</sub>)² subject to F(Φ(x, u), u, q) = 0
where u(x) is the learned concentration function, u<sub>i</sub> are observed concentrations, q describes emission dynamics from mechanistic models, and F represents the physical constraint derived from the Gaussian Plume Model (GPM). The model learned the unknown dispersion/advection terms in the GPM, represented by:
q(x, t) = Grad<sub>t</sub>(u<sub>θ</sub>) + Φ<sub>θ</sub>(Φ<sub>θ</sub>(Φ<sub>θ</sub>[u<sub>θ</sub>, x<sub>atm</sub>]))
where u<sub>θ</sub> is the predicted methane concentration and x<sub>atm</sub> represents atmospheric parameters. A reverse formulation estimated emissions from different directions around weather stations, providing insights into emitting source locations. The study's findings suggest a significant underestimation of methane emissions from oil sands tailings ponds, highlighting the importance of considering both active and abandoned ponds in emission inventories. The model's ability to link emissions with concentrations offers a valuable tool for assessing their impact on air quality and developing mitigation strategies.
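A stripped-down, penalty-form version of the constrained fit defined above is sketched below: the data misfit over observed concentrations is minimized while the residual of the physical constraint is penalized at collocation points. The `u_model` and `physics_residual` stand-ins are placeholders for the paper's neural concentration model and Gaussian Plume constraint, not its actual architecture.

```python
import numpy as np
from scipy.optimize import minimize

def fit_physics_constrained(x_obs, u_obs, u_model, physics_residual, x_col, theta0, penalty=10.0):
    """Fit model parameters by minimizing data misfit plus a penalty on the residual
    of the physical constraint F(...) = 0, evaluated at collocation points x_col."""
    def objective(theta):
        data_loss = np.mean((u_model(x_obs, theta) - u_obs) ** 2)
        physics_loss = np.mean(np.square(physics_residual(x_col, theta)))
        return data_loss + penalty * physics_loss
    return minimize(objective, theta0, method="Nelder-Mead")

# Toy stand-ins: concentration decays exponentially with distance from the source,
# and the "constraint" ties the decay rate to an assumed dispersion coefficient.
u_model = lambda x, th: th[0] * np.exp(-th[1] * x)
physics_residual = lambda x, th: th[1] - 0.5          # placeholder for the plume constraint
rng = np.random.default_rng(0)
x_obs = np.linspace(0.0, 5.0, 40)
u_obs = 2.0 * np.exp(-0.5 * x_obs) + 0.05 * rng.normal(size=x_obs.size)
fit = fit_physics_constrained(x_obs, u_obs, u_model, physics_residual, x_obs, theta0=[1.0, 1.0])
print("estimated (amplitude, decay rate):", fit.x)
```

The penalty weight trades off fidelity to the observed concentrations against consistency with the physics, mirroring how the paper's model keeps its learned dispersion/advection terms anchored to the Gaussian Plume structure.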
This newsletter showcases a diverse range of advancements in statistical modeling and data analysis. From leveraging the power of LLMs for causal inference to developing novel kernels for Gaussian Processes on massive datasets, the research presented here pushes the boundaries of methodological innovation. The highlighted papers emphasize the importance of addressing real-world challenges, such as mitigating bias in AI models, accurately quantifying methane emissions, and re-evaluating the assumptions underlying carbon trading policies. The development of SureMap provides a valuable tool for disaggregated evaluation, crucial for ensuring fairness and understanding performance disparities in AI systems. The novel nonstationary kernel for Gaussian Processes opens up new possibilities for analyzing large, complex datasets in various scientific domains. Finally, the critical examination of carbon trading assumptions and the revealing study on methane emissions from oil sands tailings highlight the urgent need for more accurate and nuanced approaches to climate change mitigation. These papers collectively underscore the vital role of statistical and machine learning methods in tackling pressing scientific and societal challenges.