This newsletter explores a collection of preprints showcasing diverse applications of statistical modeling and machine learning. The research spans from theoretical advancements in time series analysis to practical applications in healthcare, environmental science, and beyond.
Xie (2025) introduces the Grey System Model on Time Scales (GST-T), a novel approach integrating Grey System Theory with the generalized time scales framework. This model is designed to analyze hybrid systems with events occurring on varying time domains, offering a robust solution for systems with incomplete information across both discrete and continuous time.
In a similar vein of enhancing analytical capabilities, Murakami et al. (2025) leverage Bayesian variational inference and GPU acceleration for rapid identification of crystalline phases from X-ray diffraction data. Their method achieves practical calculation times for complex phase combinations, addressing limitations of traditional methods by considering the entire profile and all possible phase combinations.
Several papers focus on developing novel statistical models. Zhou et al. (2025) introduce inferential methods for prediction based on functional random effects in generalized functional mixed effects models, providing a new approach to prediction inference with functional outcomes. Bertolacci et al. (2025) propose GeoWarp, a hierarchical spatial statistical model for inferring 3D geotechnical properties of subsea sediments. This model addresses the challenges of nonstationarity and anisotropy in spatially sparse data. Onorati & Canale (2025) develop a semi-parametric Bayesian spatial model for rainfall events in geographically complex domains, incorporating spatial correlation and dependencies on geographical characteristics using latent Gaussian processes.
Applications in healthcare feature prominently. Wang et al. (2025) develop a Bayesian machine learning model for absolute risk prediction of cannabis use disorder in adolescents and young adults. Fowler et al. (2025) propose an N-of-1 approach using an adapted autoregressive hidden Markov model to estimate individual causal effects of interventions for bipolar disorder. Virkud et al. (2025) utilize precision medicine algorithms to identify optimal treatment rules for heart failure patients. Sunog et al. (2025) investigate the impact of incorporating primary care indications in EHR-based target trial emulation for dementia research.
Beyond healthcare, the collection explores diverse areas. Charles (2025) discusses data mining the functional architecture of the brain's circuitry. Baouan et al. (2025) present an optimal transport-based embedding to quantify the distance between playing styles in collective sports. Ohlendorff et al. (2025) introduce a subsampling bootstrap method for constructing confidence intervals in semiparametric causal inference. Yun & Panaretos (2025) propose the Tensorized-and-Restricted Krylov (TReK) method for estimating covariance tensors. Ang & Soh (2025) investigate the impact of network dynamics on policy belief evolution using the Wasserstein distance.
Finally, several papers focus on specific methodological advancements, including enhanced sparse Bayesian learning, adaptive sequential Monte Carlo, sigmoid growth curve analysis for system dynamics, subtype-aware timeline registration for longitudinal EHRs, quality control of lifetime drift in semiconductor devices, extending the STERGM framework, data-driven discovery of SLE etiological heterogeneity, optimization-based variable selection, an R package for OpenAlex data, a copula-enhanced vision transformer (CeViT), and machine learning for predicting house rental prices.
Data mining the functional architecture of the brain's circuitry by Adam S. Charles https://arxiv.org/abs/2501.09684
Caption: This image depicts a decomposed linear dynamical system (dLDS) model, illustrating how it captures latent brain states (xₜ) underlying observed neural activity (yₜ) and behavior (bₜ). The dLDS model represents neural dynamics as paths on a manifold, allowing for the identification of modular and distributed brain computations, unlike traditional models that treat each recorded neuron as an independent dimension. The manifold visualization showcases how different dimensions of latent neural activity contribute to observed behaviors.
The sheer volume of data now available in systems neuroscience, thanks to advancements like light beads microscopy and fully integrated silicon probes, presents an unprecedented opportunity. We're on the cusp of transitioning from localized studies of individual brain regions to mapping the entire brain's functional architecture—the "software" that drives the "hardware." This shift is crucial because emerging evidence suggests that cognitive processes are distributed and flexible, adapting dynamically across neural circuits.
Historically, technological constraints limited neuroscience to studying specific brain areas, simple tasks, and isolated biological variables. This localized approach, while informative, missed the interconnectedness and dynamic nature of brain function. Now, with brain-wide recordings, we can directly observe these interactions, mapping information flow across the entire brain. The challenge lies in developing analytical tools to synthesize this massive, multi-faceted data into coherent, interpretable models.
Traditional models of neural dynamics often treat each recorded neuron as an independent entity in a shared state space. These models, like linear dynamical systems and recurrent neural networks, are limited in their ability to capture the modularity and distributed nature of brain computations. New approaches, such as decomposed linear dynamical systems (dLDS), explicitly incorporate modularity, treating dynamics as paths on a manifold and identifying distinct sets of interactions based on how the neural state traverses this manifold.
Synthesizing datasets across different tasks, animals, and even research labs is crucial for revealing the functional architecture. This requires aligning data at the systems level, as neuron-to-neuron correspondence is impossible. Graph-based methods for multi-matrix decompositions hold promise, allowing for the identification of shared and private sources of variance across datasets. Furthermore, integrating multiple modalities, such as fMRI for hemodynamics and optical imaging for neurotransmitters, is essential for a complete understanding. This necessitates developing models that capture the interplay of these diverse signals while maintaining interpretability. While artificial neural networks (ANNs) are powerful data mining tools, their "black box" nature limits their scientific value. The focus should be on developing interpretable models that reveal the underlying principles of brain function, enabling extrapolation beyond the training data.
Absolute Risk Prediction for Cannabis Use Disorder Using Bayesian Machine Learning by Tingfang Wang, Joseph M. Boden, Swati Biswas, Pankaj K. Choudhary https://arxiv.org/abs/2501.09156
Caption: This figure displays the performance of the novel Bayesian CUD risk prediction model, evaluated using 5-fold cross-validation and validated on two independent datasets: Add Health and CHDS. It shows the Area Under the Curve (AUC) and the Expected/Observed ratio (E/O) for varying prediction timeframes, demonstrating good discrimination and calibration across different datasets.
This research introduces a Bayesian machine learning model that predicts the absolute risk of developing cannabis use disorder (CUD) in adolescents and young adults. It is presented as the first absolute risk prediction model for any substance use disorder (SUD), offering personalized risk assessments based on individual factors. Trained on data from the National Longitudinal Study of Adolescent to Adult Health (Add Health), the model accounts for competing risks such as death from unrelated causes, which matter for accurate long-term risk prediction. Five risk factors are incorporated: biological sex, delinquency (a longitudinal predictor, included as the average score across multiple waves to simplify implementation), and scores on the personality traits of conscientiousness, neuroticism, and openness.
The model estimates the CUD hazard rate (λ₁) using a Cox proportional hazards model: λ₁(t|X) = λ₀(t) exp(Xᵀβ), where λ₀(t) is the baseline hazard function modeled using M-splines, X is the vector of risk factors, and β is the vector of their effects. The mortality hazard rate from non-CUD causes, λ₂(t), is estimated using all-cause mortality data. The absolute risk of developing CUD between ages a and b is calculated as: r(a,b) = ∫ₐᵇ λ₁(t|X) exp(-∫₀ᵗ [λ₁(u|X) + λ₂(u)]du) dt. Bayesian estimation with lasso and Dirichlet priors is used for regularization and parameter estimation, respectively, while accounting for the complex survey design of Add Health.
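As a rough illustration of the absolute-risk integral above, the following Python sketch evaluates r(a, b) numerically with the trapezoid rule. Everything numeric here is a made-up placeholder rather than an estimate from the paper: the constant baseline hazard, the constant mortality hazard, the covariate values, and the coefficients. The function names `cox_hazard` and `absolute_risk` are hypothetical, and the paper's M-spline baseline and Bayesian fitting are not reproduced.

```python
import math

def cox_hazard(baseline, x, beta):
    """Return lambda1(t | x) = baseline(t) * exp(x . beta) as a function of t."""
    scale = math.exp(sum(xi * bi for xi, bi in zip(x, beta)))
    return lambda t: baseline(t) * scale

def absolute_risk(a, b, lam1, lam2, n=2000):
    """Trapezoid-rule approximation of
    r(a, b) = int_a^b lam1(t) * exp(-int_0^t [lam1(u) + lam2(u)] du) dt."""
    total = lambda u: lam1(u) + lam2(u)

    # Stage 1: cumulative hazard from 0 up to age a.
    cum = 0.0
    if a > 0:
        h = a / n
        for i in range(n):
            cum += 0.5 * h * (total(i * h) + total((i + 1) * h))

    # Stage 2: integrate the event density from a to b,
    # extending the cumulative hazard as we go.
    h = (b - a) / n
    risk = 0.0
    prev = lam1(a) * math.exp(-cum)
    for i in range(n):
        left = a + i * h
        right = a + (i + 1) * h
        cum += 0.5 * h * (total(left) + total(right))
        cur = lam1(right) * math.exp(-cum)
        risk += 0.5 * h * (prev + cur)
        prev = cur
    return risk

# Placeholder inputs (assumptions for illustration only).
lam1 = cox_hazard(lambda t: 0.02, x=[1.0, 0.5], beta=[0.3, -0.2])
lam2 = lambda t: 0.001  # hypothetical non-CUD mortality hazard

print(f"r(15, 20) = {absolute_risk(15, 20, lam1, lam2):.4f}")
```

With constant hazards the integral has a closed form, which makes a convenient sanity check for the numerical routine before swapping in a time-varying baseline.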
Performance evaluation was conducted using 5-fold cross-validation. For predicting CUD risk within five years of first cannabis use, the area under the curve (AUC) was 0.68, and the ratio of expected to observed cases (E/O) was 0.95. External validation using an Add Health test dataset and data from the Christchurch Health and Development Study (CHDS) showed strong performance. In the Add Health test data, the AUC ranged from 0.64 to 0.75, and E/O was close to 1, indicating good discrimination and calibration. Similarly, in the CHDS data, after recalibration for higher CUD prevalence, the AUC ranged from 0.65 to 0.75, with E/O near 1.
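For readers less familiar with the two metrics reported above, here is a minimal Python sketch of how AUC (discrimination) and the E/O ratio (calibration) can be computed from predicted risks and observed outcomes. The data are toy values invented for illustration, and the sketch ignores censoring and survey weights, both of which a real evaluation on cohort data like this would need to handle.

```python
def auc(risks, outcomes):
    """Probability that a randomly chosen case has a higher predicted risk
    than a randomly chosen non-case (ties count as half)."""
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    controls = [r for r, y in zip(risks, outcomes) if y == 0]
    wins = sum(
        1.0 if c > d else 0.5 if c == d else 0.0
        for c in cases
        for d in controls
    )
    return wins / (len(cases) * len(controls))

def expected_over_observed(risks, outcomes):
    """Sum of predicted risks divided by the observed number of cases.
    Values near 1 indicate good calibration overall."""
    return sum(risks) / sum(outcomes)

# Toy predictions and outcomes (assumptions, not data from the study).
risks = [0.10, 0.40, 0.35, 0.80, 0.20, 0.60]
outcomes = [0, 0, 1, 1, 0, 1]

print(f"AUC = {auc(risks, outcomes):.3f}")
print(f"E/O = {expected_over_observed(risks, outcomes):.3f}")
```

An E/O below 1 means the model predicts fewer cases than were observed, and a value above 1 means it predicts more.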
This model addresses a critical gap in SUD risk assessment by providing personalized absolute risk estimates, which are more clinically relevant than existing relative risk or pure risk prediction models. Its parsimonious nature, with only five predictors, enhances practical utility in clinical settings. While acknowledging limitations related to the age of the Add Health data and the use of the Cox PH model, the study highlights the potential of this approach for improved early intervention and reducing the burden of CUD, especially among adolescents and young adults. Future research could explore alternative models and incorporate more recent data to refine predictions and adapt to the evolving landscape of cannabis use.
Predicting System Dynamics of Universal Growth Patterns in Complex Systems by Leila Hedayatifar, Alfredo J. Morales, Dominic E. Saadi, Rachel A. Rigg, Olha Buchel, Amir Akhavan, Egemen Sert, Aabir Abubaker Kar, Mehrzad Sasanpour, Irving R. Epstein, Yaneer Bar-Yam https://arxiv.org/abs/2501.07349
Caption: This figure illustrates the application of sigmoid growth curve fitting to model and predict the dynamics of complex systems. Panel A shows an example of a sigmoid fit to customer order data, while panel C shows examples of sigmoid fits to legislative data. Panels D, E, and F visualize the distribution of sigmoid parameters (slope, amplitude, and unsaturation) over time, revealing patterns in the evolution of these systems.
Predicting the behavior of complex systems, such as markets or legislative bodies, is inherently challenging. This research introduces an innovative analytical approach using the sigmoid growth curve to model and predict the dynamics of individual entities within these systems. The sigmoid function, y(t) = A / (1 + exp(-m(t - t₀))), where A is the amplitude (the plateau level), m is the slope, and t₀ is the midpoint, captures the characteristic acceleration and deceleration phases observed in many real-world phenomena, where activity initially grows rapidly, then slows, and eventually plateaus. This approach allows for predicting not just when an entity's activity will peak and decline but also the total amount of activity it will ultimately exhibit.
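To make the idea concrete, the sketch below fits the sigmoid parameters to a partially observed, synthetic growth curve and reads off the projected plateau A. It is an illustration only, not the authors' fitting procedure. It assumes y(t) is cumulative activity, searches a grid of candidate slopes m and midpoints t₀, and solves for the best A in closed form for each pair by least squares. The function names, the grids, and the data are all hypothetical.

```python
import math

def sigmoid(t, A, m, t0):
    """Logistic growth curve y(t) = A / (1 + exp(-m * (t - t0)))."""
    return A / (1.0 + math.exp(-m * (t - t0)))

def fit_sigmoid(ts, ys, m_grid, t0_grid):
    """Grid search over (m, t0). For each pair, the least-squares optimal
    amplitude is A = sum(y * s) / sum(s * s), where s is the unit sigmoid."""
    best = None
    for m in m_grid:
        for t0 in t0_grid:
            s = [sigmoid(t, 1.0, m, t0) for t in ts]
            A = sum(y * si for y, si in zip(ys, s)) / sum(si * si for si in s)
            sse = sum((y - A * si) ** 2 for y, si in zip(ys, s))
            if best is None or sse < best[0]:
                best = (sse, A, m, t0)
    _, A, m, t0 = best
    return A, m, t0

# Synthetic, noise-free observations up to t = 12 only, i.e. before the
# plateau is reached (true parameters are assumptions for the demo).
ts = list(range(13))
ys = [sigmoid(t, 120.0, 0.5, 10.0) for t in ts]

m_grid = [k / 10 for k in range(1, 21)]   # candidate slopes 0.1 .. 2.0
t0_grid = [k / 2 for k in range(0, 41)]   # candidate midpoints 0.0 .. 20.0

A_hat, m_hat, t0_hat = fit_sigmoid(ts, ys, m_grid, t0_grid)
print(f"projected plateau A = {A_hat:.1f}, slope m = {m_hat}, midpoint t0 = {t0_hat}")
```

On real, noisy data a nonlinear least-squares routine would usually replace the grid search, and early-stage fits of A tend to be unstable until the curve has passed its midpoint.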
The researchers applied their method to two distinct datasets: industrial customer orders and US legislation adoption. In the customer order dataset, the frequency of orders from individual customers followed a sigmoid pattern, reflecting the lifecycle of customer relationships. By tracking the evolution of the sigmoid parameters (m, t₀, and A) over time, they could predict when a customer was likely to stop ordering, achieving 82% accuracy within a one-year timeframe for customers who left in 2009. Further analysis revealed a power law distribution of customer orders, suggesting underlying regularities in the system's dynamics.
The second dataset examined the introduction of bills related to per- and polyfluoroalkyl substances (PFAS). Named entities were extracted from the text of proposed legislation, and their usage was tracked over time. These terms also followed sigmoid patterns, reflecting the rise and fall of attention to specific issues in the legislative process. Visualizing the trajectories of these terms in parameter space allowed researchers to identify emerging trends and predict which terms would gain or lose prominence.
This study demonstrates the sigmoid curve's versatility for understanding and predicting the dynamics of diverse complex systems. The ability to forecast months or even years in advance, as shown in both case studies, offers valuable insights for decision-makers. Characterizing individual component dynamics provides a framework for understanding the aggregate behavior of the entire system, revealing emergent patterns not apparent from examining individual entities alone. This research opens new avenues for predicting system behavior and informing strategic decision-making across various applications.
This newsletter highlights the growing sophistication and breadth of statistical modeling and machine learning. From unraveling the complexities of the human brain to predicting customer behavior and legislative trends, these papers showcase the power of advanced analytical techniques to address critical challenges across diverse fields. The emphasis on interpretability, personalized predictions, and the integration of multiple data sources underscores the ongoing evolution of these fields towards more robust and impactful applications. The development of novel models like GST-T, GeoWarp, and the innovative application of sigmoid curves for system dynamics prediction exemplify the continuous push towards more nuanced and accurate representations of complex real-world phenomena. The focus on absolute risk prediction in healthcare, as demonstrated by the CUD model, signifies a shift towards more clinically relevant and actionable insights. These advancements collectively contribute to a richer understanding of complex systems and pave the way for more effective interventions and decision-making across various domains.