Subject: Cutting-Edge Advances in Statistical Modeling and Machine Learning
Hi Elman,
This newsletter covers a collection of preprints showcasing diverse methodological advancements and applications across various domains, including human mobility, sports performance, biomedicine, and more.
Foster, Meyer, and Shakeel (2025) introduce a pseudo Markov-chain model to analyze time-elapsed flows from aggregated trip data, developing mobility measures analogous to the radius of gyration used in individual mobility studies. Their application to the NetMob 2024 Data Challenge dataset reveals insights into commuting patterns and urban mobility, with potential implications for sustainable development. In a separate application of statistical modeling, Einmahl and He (2025) leverage heterogeneous extreme value theory to estimate ultimate world records in the 100-meter dash. Their analysis of top athlete data from 1991-2023 yields point estimates and lower confidence bounds for both men and women, pushing the boundaries of performance prediction in athletics.
Several papers focus on innovative statistical methodologies for specific applications. Goffard, Piette, and Peters (2025) present a market-based insurance ratemaking method, utilizing isotonic regression to link pure premiums to observed commercial premiums, particularly relevant for emerging markets like pet insurance. Koldasbayeva and Zaytsev (2025) investigate cross-validation strategies for spatio-temporal species distribution models, emphasizing the importance of accounting for spatial autocorrelation. Their comparison of different training schemes and CV methods highlights the strengths of spatial blocking and environmental clustering for robust model evaluation. Cui (2025) provides a tutorial on Markov renewal and semi-Markov proportional hazards models for multi-state modeling, demonstrating their application to clinical research using the EBMT dataset. The comparison of Aalen-Johansen and Dabrowska-Sun-Horowitz estimators underscores the value of incorporating sojourn times for a more nuanced understanding of patient trajectories.
The application of machine learning and Bayesian methods is a recurring theme. Cohen et al. (2025) introduce ELIR, an efficient latent image restoration method operating in latent space, achieving competitive results with reduced computational demands compared to diffusion and flow-based approaches. Antonczak et al. (2025) employ random forest regression to estimate traffic volumes of medium- and heavy-duty vehicles, addressing data sparsity in the Highway Performance Monitoring System and enabling high-resolution estimation of traffic-related air pollution exposure. Barnett et al. (2025) propose joint TITE-CRM designs for dual-agent dose-finding studies with late-onset outcomes, demonstrating superior performance compared to model-assisted designs in identifying optimal biological doses.
Further methodological contributions include Maclaren et al.'s (2025) Invariant Image Reparameterisation (IIR) approach for addressing parameter non-identifiability in mathematical models and Jadoul et al.'s (2025) critical analysis of integer ratio analyses in bioacoustics and music, offering a comprehensive methodology for statistically testing integer ratios. Blake et al. (2025) develop a Bayesian nonparametric survival analysis approach to estimate the duration of RT-PCR positivity for SARS-CoV-2 from doubly interval censored data, addressing challenges of undetected infections and false negatives. Zhang, Bhaganagar, and Wikle (2025) introduce the extreme variational autoencoder (xVAE) for capturing extreme events in turbulence, demonstrating its effectiveness in large-eddy simulation data of wildland fire plumes.
Finally, several papers address specific applications within healthcare and other domains. Gong and Zuo (2025) propose an integrative model of spontaneous slow oscillations (SSOs) in the brain, leveraging multi-band frequency analysis of fMRI data. Bussemaker et al. (2025) investigate system architecture optimization strategies, developing a hierarchical Bayesian optimization algorithm and applying it to jet engine design. Torres-Torres F. et al. (2025) introduce a Gaussian Process-driven Hidden Markov Model for early diagnosis of infant gait anomalies. Banin, Barigozzi, and Trapin (2025) propose tensor factor models for predicting energy demand, demonstrating superior forecasting accuracy compared to traditional vector factor models and functional time series methods. These diverse contributions highlight the growing importance of advanced statistical and machine learning methods for addressing complex challenges across a wide range of disciplines.
Uncertainty-Aware Adaptation of Large Language Models for Protein-Protein Interaction Analysis by Sanket Jantre, Tianle Wang, Gilchan Park, Kriti Chopra, Nicholas Jeon, Xiaoning Qian, Nathan M. Urban, Byung-Jun Yoon https://arxiv.org/abs/2502.06173
Caption: This diagram illustrates the uncertainty-aware LoRA adaptation for LLM-based protein-protein interaction (PPI) prediction. Given two proteins (A and B) and a query about their interaction in a specific disease context, the LLM backbone, augmented with uncertainty-aware LoRA (either ensemble or Bayesian), processes the information. The output provides a Yes/No prediction regarding the interaction, incorporating uncertainty quantification for improved reliability.
Protein-protein interactions (PPIs) are crucial for understanding cellular processes and their dysregulation in diseases. Large language models (LLMs) offer a promising avenue for PPI prediction by leveraging the vast biomedical literature. However, the inherent uncertainty in LLMs poses a challenge for reliable application in biomedicine. This paper introduces an uncertainty-aware adaptation of LLMs for PPI analysis, specifically using fine-tuned LLaMA-3 and BioMedGPT models. The researchers address the uncertainty issue by integrating Low-Rank Adaptation (LoRA) ensembles and Bayesian LoRA models for uncertainty quantification (UQ).
LoRA ensembles train multiple low-rank adapters independently on the same pre-trained LLM backbone, averaging their outputs for the final prediction. This approach increases accuracy and robustness while maintaining computational efficiency. Bayesian LoRA, on the other hand, employs the Laplace approximation to estimate the posterior distribution over the LoRA parameters. This provides a more tractable way to capture uncertainty than full Bayesian inference, approximating the posterior with a Gaussian distribution centered at the maximum a posteriori (MAP) estimate, with covariance given by the inverse Hessian of the negative log-posterior (in practice, the Fisher information matrix): p(θ | D) ≈ N(θ | θ<sub>MAP</sub>, H⁻¹).
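To make the ensemble idea concrete, here is a minimal sketch, assuming a single frozen linear layer standing in for the LLM backbone, of how a LoRA ensemble averages predictions over independently trained low-rank adapters. The class names, dimensions, rank, and toy Yes/No classifier head are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): a LoRA ensemble on one frozen linear
# layer. Each member owns its own low-rank adapter (A_i, B_i); predictions are
# made by averaging member probabilities.
import torch
import torch.nn as nn

class LoRAMember(nn.Module):
    def __init__(self, d_in, d_out, rank=8, alpha=16.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_in, rank) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(rank, d_out))        # trainable
        self.scale = alpha / rank

    def forward(self, x, frozen_weight):
        # Frozen base projection plus the scaled low-rank update: xW + s * xAB
        return x @ frozen_weight + self.scale * (x @ self.A) @ self.B

class LoRAEnsembleClassifier(nn.Module):
    def __init__(self, d_in=64, d_out=2, n_members=4, rank=8):
        super().__init__()
        self.frozen = nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
        self.members = nn.ModuleList(
            [LoRAMember(d_in, d_out, rank) for _ in range(n_members)]
        )

    def forward(self, x):
        # Average per-member class probabilities (Yes/No interaction).
        probs = torch.stack([m(x, self.frozen).softmax(-1) for m in self.members])
        return probs.mean(dim=0)

x = torch.randn(5, 64)            # 5 toy protein-pair embeddings
model = LoRAEnsembleClassifier()
print(model(x))                   # averaged Yes/No probabilities
```

In the full setup each adapter would be fine-tuned separately on the PPI data before averaging; Bayesian LoRA would instead place the Gaussian Laplace approximation over a single adapter's parameters and average predictions over samples drawn from it.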
The models were evaluated on three disease-specific PPI datasets: Neurodegenerative diseases PPI (ND-PPI), Metabolic disorders PPI (M-PPI), and Cancer PPI (C-PPI). Performance was assessed using metrics such as accuracy, negative log-likelihood (NLL), expected calibration error (ECE), Matthews Correlation Coefficient (MCC), specificity, precision, F1-score, and AUROC. Across all datasets, the uncertainty-aware LoRA adaptations consistently outperformed the standard LoRA baseline. LoRA ensembles generally achieved the highest accuracy and lowest NLL, indicating strong predictive performance and reliable confidence estimates. Bayesian LoRA excelled in calibration, demonstrating the lowest ECE in several cases, which is crucial for trustworthy interpretation of predictions. For example, in the ND-PPI task with LLaMA-3, the LoRA ensemble achieved 88.7% accuracy compared to 86.5% for single LoRA, while Bayesian LoRA had the lowest ECE of 0.052. Similar trends were observed in the M-PPI and C-PPI tasks, with LoRA ensembles consistently exhibiting superior performance in accuracy and NLL, and Bayesian LoRA showing better calibration. This study highlights the importance of UQ in LLM-based PPI prediction, paving the way for more robust computational tools in precision medicine.
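For reference, expected calibration error compares average confidence with empirical accuracy inside confidence bins; a small sketch is given below (the binning scheme and toy numbers are arbitrary choices, not taken from the paper).

```python
# Minimal sketch of expected calibration error (ECE).
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: predicted probability of the predicted class;
    correct: 1 if the prediction was right, else 0."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight by the fraction of samples in the bin
    return ece

# Toy example: over-confident predictions yield a larger ECE.
print(expected_calibration_error([0.9, 0.8, 0.95, 0.7], [1, 0, 1, 1]))
```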
Efficient Image Restoration via Latent Consistency Flow Matching by Elad Cohen, Idan Achituve, Idit Diamant, Arnon Netzer, Hai Victor Habi https://arxiv.org/abs/2502.03500
Caption: ELIR's two-stage pipeline, depicted for training (left) and inference (right), operates within the latent space for efficiency. During training, a Latent MMSE estimator minimizes distortion, while Latent Consistency Flow Matching (LCFM) refines the output for perceptual quality. During inference, the fixed encoder extracts latent features from the low-quality input which are then processed by the trained modules to produce a restored image.
While generative image restoration (IR) models have achieved impressive results, their high computational and memory requirements limit their deployment on edge devices. This paper introduces ELIR (Efficient Latent Image Restoration), a novel method that addresses this challenge by operating primarily in the latent space. ELIR employs a two-stage pipeline. The first stage uses a Latent MMSE estimator to predict the latent representation of the minimum mean square error estimate given the latent representation of the degraded image. This effectively reduces distortion within the latent space. The second stage introduces Latent Consistency Flow Matching (LCFM), a novel integration of latent flow matching and consistency flow matching. LCFM refines the latent MMSE output by sampling from the conditional posterior distribution of visually appealing images, balancing distortion and perceptual quality.
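The sketch below illustrates the shape of such a two-stage latent pipeline at inference time. It is a schematic with toy placeholder modules: the layer sizes, channel counts, and the Euler-style few-step flow update are assumptions for illustration, not the released ELIR architecture.

```python
# Schematic two-stage latent restoration (not the released ELIR model):
# encode the degraded image, apply a latent MMSE module, refine with a few
# flow steps, then decode back to pixels. All module sizes are toy placeholders.
import torch
import torch.nn as nn

class ConvBlock(nn.Sequential):
    def __init__(self, c_in, c_out):
        super().__init__(nn.Conv2d(c_in, c_out, 3, padding=1), nn.SiLU())

encoder = nn.Sequential(ConvBlock(3, 16), nn.Conv2d(16, 4, 3, stride=2, padding=1))
latent_mmse = nn.Sequential(ConvBlock(4, 16), nn.Conv2d(16, 4, 3, padding=1))
flow_net = nn.Sequential(ConvBlock(5, 16), nn.Conv2d(16, 4, 3, padding=1))  # +1 channel for time
decoder = nn.Sequential(nn.Upsample(scale_factor=2), ConvBlock(4, 16), nn.Conv2d(16, 3, 3, padding=1))

def restore(y_lq, n_steps=4):
    with torch.no_grad():
        z = encoder(y_lq)              # latent of the degraded image
        z = latent_mmse(z)             # stage 1: distortion-oriented estimate
        dt = 1.0 / n_steps
        for i in range(n_steps):       # stage 2: few-step flow refinement
            t = torch.full_like(z[:, :1], i * dt)
            z = z + dt * flow_net(torch.cat([z, t], dim=1))
        return decoder(z)

print(restore(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 3, 64, 64])
```

The point of the design is that the restoration modules act on the compact latent tensor; only the fixed encoder and the decoder touch pixel space.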
ELIR's key innovation lies in its complete execution within the latent space. This significantly reduces the computational burden associated with processing high-resolution images. Furthermore, ELIR replaces computationally expensive transformer-based architectures, commonly used in other state-of-the-art methods, with a more efficient convolution-based architecture, making it suitable for hardware acceleration on edge devices. This design choice contributes to ELIR's smaller model size and lower latency.
ELIR was evaluated on various image restoration tasks, including blind face restoration, super-resolution, denoising, inpainting, and colorization, using the FFHQ and CelebA datasets. The results demonstrate remarkable efficiency improvements, achieving a 4x to 45x reduction in model size and a 4x to 270x increase in frames per second (FPS) compared to diffusion and flow-based methods, without compromising performance. ELIR maintains competitive distortion and perceptual quality metrics (PSNR, SSIM, LPIPS, FID, NIQE, and MUSIQ) compared to state-of-the-art approaches. Ablation studies further validate the effectiveness of ELIR's components. The Latent MMSE estimator effectively reduces distortion in the latent space, while LCFM demonstrates superior efficiency compared to traditional flow matching, achieving comparable FID scores with significantly fewer neural function evaluations (NFEs). The importance of fine-tuning the encoder for specific degradation types is also highlighted. These findings underscore ELIR's potential for real-world applications on resource-constrained devices.
Bayesian Spatiotemporal Nonstationary Model Quantifies Robust Increases in Daily Extreme Rainfall Across the Western Gulf Coast by Yuchen Lu, Ben Seiyon Lee, James Doss-Gollin https://arxiv.org/abs/2502.02000
Caption: This figure displays the percentage change in the 10-year and 100-year return levels of daily extreme precipitation across the Western Gulf Coast, comparing different modeling approaches: pooled stationary, nonpooled nonstationary, and spatially varying covariates. The spatially varying covariates model, which incorporates CO<sub>2</sub> as a covariate, shows robust increases in extreme rainfall, particularly along the coast of Southeast Texas and Louisiana. These changes highlight the impact of rising CO<sub>2</sub> on extreme precipitation and the need for nonstationary models in infrastructure planning.
Accurate estimates of rainfall exceedance probabilities are crucial for effective risk management. Traditional approaches like NOAA Atlas 14 assume stationarity, failing to account for the impact of climate change. This paper introduces a novel Bayesian spatiotemporal nonstationary model, the Spatially Varying Covariates Model, to quantify changes in daily extreme rainfall across the Western Gulf Coast. This model integrates nonstationarity and regionalization for robust frequency analysis, overcoming limitations of previous approaches that suffer from high uncertainty and implausible spatial variability.
The model uses the Generalized Extreme Value (GEV) distribution to model daily extreme precipitation, where the location and scale parameters are conditioned on global mean CO<sub>2</sub> concentration as a time-varying climate covariate: y(s,t) ~ GEV(µ(s,t), σ(s,t), ξ), where µ(s,t) = α<sub>µ</sub>(s) + β<sub>µ</sub>(s)x(t) and σ(s,t) = exp{log α<sub>σ</sub>(s) + β<sub>σ</sub>(s)x(t)}. Here, y(s,t) is the annual maximum precipitation at station s in year t, x(t) is the natural logarithm of the global mean CO<sub>2</sub> concentration, and α and β are the spatially varying intercepts and coefficients, respectively, modeled using Gaussian processes. This allows for smooth spatial variation in the response to CO<sub>2</sub>, borrowing strength across nearby stations.
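As a concrete illustration of how return levels respond to the covariate under this parameterization, here is a small sketch for a single hypothetical station; the coefficient values are made-up placeholders, not the paper's posterior estimates.

```python
# Illustrative sketch: 100-year return level as a function of ln(CO2) under the
# nonstationary GEV parameterization mu = a_mu + b_mu*x, sigma = exp(log a_sig + b_sig*x).
# All coefficient values are hypothetical placeholders.
import numpy as np
from scipy.stats import genextreme

a_mu, b_mu = 80.0, 15.0                  # hypothetical location intercept/slope (mm)
log_a_sig, b_sig = np.log(25.0), 0.10    # hypothetical scale intercept/slope
xi = 0.1                                 # shape; note scipy's shape c equals -xi

def return_level(x_lnco2, period=100):
    mu = a_mu + b_mu * x_lnco2
    sigma = np.exp(log_a_sig + b_sig * x_lnco2)
    return genextreme.isf(1.0 / period, c=-xi, loc=mu, scale=sigma)

for co2 in (340.0, 420.0, 500.0):        # ppm: roughly 1980 / today / mid-century
    print(co2, round(return_level(np.log(co2)), 1))
```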
Applying this framework to daily rainfall data from 181 stations in the Western Gulf Coast, the study identified robustly increasing trends in extreme precipitation intensity and variability. The results reveal positive correlations between ln(CO<sub>2</sub>) and both GEV location and scale parameters across most of the study area, indicating that increasing CO<sub>2</sub> leads to higher intensity and variability of extreme precipitation. The strongest increasing trends were found around Houston and New Orleans. Return level estimates, specifically the 100-year return level, show increases between 10% and 35% over the past 80 years, with the largest increases in coastal Southeast Texas and coastal southeastern Louisiana. Projections to 2050 suggest current guidelines, such as NOAA Atlas 14, will underestimate future return levels, particularly in major cities like Houston. This highlights the importance of incorporating nonstationarity and regionalization in extreme precipitation analysis, offering a robust and flexible approach for reliable return level estimates and informing climate adaptation and infrastructure planning.
Predicting Energy Demand with Tensor Factor Models by Mattia Banin, Matteo Barigozzi, Luca Trapin https://arxiv.org/abs/2502.06213
Caption: This image showcases the average hourly electricity demand, normalized, across different seasons. The two-factor Tensor Factor Model (TFM) described in the accompanying text excels at capturing these seasonal variations, outperforming benchmark models in forecasting accuracy across various time horizons. The distinct patterns for winter, summer, spring, and autumn underscore the importance of a model that can accommodate multi-seasonal dynamics.
Modern electricity datasets exhibit complex, interacting seasonalities (intra-day, intra-week, and annual) and strong cross-sectional correlations, which traditional forecasting methods often struggle to capture. This paper introduces a novel approach using tensor factor models (TFMs) to forecast high-dimensional U.S. electricity demand, offering both improved accuracy and interpretable insights.
The methodology restructures hourly data into a sequence of weekly tensors, each a three-mode array representing hours of the day, days of the week, and electricity providers. This structure allows for a factor decomposition that isolates distinct seasonal patterns along each mode. The model, represented as X<sub>t</sub> = F<sub>t</sub> ×<sub>1</sub> A ×<sub>2</sub> B<sup>(1)</sup> ⋯ ×<sub>M+1</sub> B<sup>(M)</sup> + E<sub>t</sub>, distinguishes between factor loadings for hourly (intra-day), daily (intra-week), and provider-specific patterns. This multi-level approach accommodates the common cyclical behavior of electricity usage while also capturing provider-specific variations.
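To unpack the mode-product notation, the following sketch reconstructs one weekly tensor from a small core factor array and three loading matrices. It is a generic Tucker-style construction with illustrative sizes and ranks; the paper's estimator and identification constraints are not reproduced here.

```python
# Minimal sketch of the mode-product structure behind a tensor factor model.
# A weekly observation has modes (hour, day, provider); sizes and ranks are illustrative.
import numpy as np

hours, days, providers = 24, 7, 30
r1, r2, r3 = 2, 2, 3                     # factor ranks per mode

F_t = np.random.randn(r1, r2, r3)        # latent core factors for week t
A_hour = np.random.randn(hours, r1)      # intra-day loadings
A_day = np.random.randn(days, r2)        # intra-week loadings
A_prov = np.random.randn(providers, r3)  # provider loadings

# X_t = F_t x_1 A_hour x_2 A_day x_3 A_prov (+ idiosyncratic noise E_t)
X_t = np.einsum("ijk,ai,bj,ck->abc", F_t, A_hour, A_day, A_prov)
X_t += 0.1 * np.random.randn(hours, days, providers)   # E_t
print(X_t.shape)                          # (24, 7, 30)
```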
Applying this TFM approach to a dataset from PJM Interconnection LLC, the researchers explored one-factor and two-factor models. While the eigenvalue-ratio criterion favored the one-factor model, the two-factor model consistently achieved a better fit (lower MSE), successfully replicating the observed weekly and daily seasonal patterns. Forecast evaluation, using a rolling window approach with various horizons, showed the TFM achieved the lowest relative MSE across most providers and forecasting horizons, outperforming benchmark models including matrix factor models, vector factor models, and functional time series models. This demonstrates the TFM’s ability to capture the complex multi-seasonal dynamics and long-term dependencies in electricity demand data, offering increased accuracy and interpretable factors for improved decision-making in energy management.
Hidden assumptions of integer ratio analyses in bioacoustics and music by Yannick Jadoul, Tommaso Tufarelli, Chloé Coissac, Marco Gamba, Andrea Ravignani https://arxiv.org/abs/2502.04464
Caption: Hidden Assumptions in Rhythm Ratio Analyses: Visualizing the Impact of Null Hypotheses and Normalization Methods
Rhythm analysis, crucial in music cognition and animal behavior studies, often involves identifying small-integer ratios between temporal intervals. A recent popular method calculates the rhythm ratio r<sub>k</sub> = i<sub>k</sub> / (i<sub>k</sub> + i<sub>k+1</sub>) between adjacent intervals i<sub>k</sub> and i<sub>k+1</sub>. This method is scale-invariant and symmetric around inverse ratios. However, this paper reveals hidden assumptions in its typical application, particularly regarding the null hypothesis used in statistical tests.
The authors demonstrate that using r<sub>k</sub> and normalizing bin counts by bin width implicitly assumes a homogeneous Poisson point process as the null hypothesis. This means observed rhythmic patterns are compared to completely random temporal sequences with exponentially distributed intervals. While this provides a maximally random baseline, it may be too weak for many biological contexts. For example, if intervals are uniformly distributed (a less random scenario), the resulting r<sub>k</sub> distribution will peak around 1:1, leading to frequent rejections of the Poisson null hypothesis even without a true rhythmic category.
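A quick simulation makes the point: under i.i.d. exponential intervals (the implicit Poisson null), r<sub>k</sub> is uniform on (0, 1), whereas uniformly distributed intervals concentrate r<sub>k</sub> around 0.5. The sketch below uses arbitrary sample sizes and bin counts.

```python
# Simulation sketch: exponential intervals (Poisson-process null) give a flat
# r_k distribution; uniform intervals pile up around the 1:1 ratio (r = 0.5).
import numpy as np

rng = np.random.default_rng(0)

def rhythm_ratios(intervals):
    return intervals[:-1] / (intervals[:-1] + intervals[1:])

r_exp = rhythm_ratios(rng.exponential(1.0, 100_000))   # Poisson-process null
r_uni = rhythm_ratios(rng.uniform(0.0, 1.0, 100_000))  # less-random alternative null

bins = np.linspace(0, 1, 11)
print(np.histogram(r_exp, bins)[0])  # roughly flat counts
print(np.histogram(r_uni, bins)[0])  # peaked in the middle bins (r near 0.5)
```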
The paper offers two solutions: deriving alternative rhythm ratio formulas, s = f(q) (where q = i<sub>2</sub> / i<sub>1</sub>), that transform a given null interval distribution into a uniform distribution of ratios, or adjusting the normalization constant during analysis, normalizing counts by the total probability mass within a bin under the desired null hypothesis instead of dividing by bin width. This allows for testing against different null hypotheses while maintaining the convenience of a uniform baseline or retaining the r<sub>k</sub> formula. The key takeaway is the importance of carefully considering the underlying assumptions of statistical tests and choosing an appropriate null hypothesis. The paper provides tools to either rescale the rhythm ratio or adjust the normalization constant for more nuanced and accurate analyses of rhythmic patterns.
This newsletter highlighted several impactful preprints spanning diverse fields. From leveraging LLMs for uncertainty-aware protein interaction analysis and developing efficient image restoration techniques to quantifying the impact of climate change on extreme rainfall and improving energy demand forecasting with tensor factor models, these papers showcase the power of advanced statistical and machine learning methods. Furthermore, the critical analysis of integer ratio analyses in bioacoustics and music underscores the importance of rigorous methodological considerations in scientific research. These advancements offer valuable insights and tools for addressing complex challenges across various disciplines.