This collection of preprints explores diverse methodological advancements and applications across domains including psychology, network science, public health, and climate modeling. Franchino et al. (2025) investigated maths anxiety in Italian psychology students, developing and validating the MAS-IT scale. Their findings, based on Confirmatory Factor Analysis (CFA) and Exploratory Graph Analysis (EGA), revealed a four-factor structure differing from the original English scale, suggesting cultural nuances in how maths anxiety manifests. Shifting to network science, Liu et al. (2025) introduced a hybrid recommendation framework for academic literature. This framework leverages a large citation network and OpenAI's text-embedding-3-small model for semantic similarity analysis, combining network-based citation patterns with content-based recommendations to offer personalized literature suggestions. Horng et al. (2025) tackled data integration in public health, employing probabilistic record linkage to merge gun violence datasets from the GVA and NVDRS, achieving high accuracy and demonstrating the feasibility of enhancing data utility through linkage.
Several papers focus on methodological innovations in statistical modeling and analysis. Yang et al. (2025a) evaluated adaptive sampling methods for virtual safety impact assessment, showing the benefits of integrating domain knowledge and stratification for improved sampling efficiency. Winter et al. (2025) proposed a Design of Experiments approach for efficient long-term structural reliability estimation with non-Gaussian stochastic models. Gasparin & Ramdas (2025) focused on improving the statistical efficiency of cross-conformal prediction, introducing variants that yield smaller prediction sets without sacrificing theoretical guarantees. Yang et al. (2025b) introduced predictive Bayesian optional stopping (pBOS), combining Bayesian optional stopping with rehearsal simulations for optimized resource allocation and cost savings in experiments.
The application of statistical methods to complex datasets is a recurring theme. Kushwaha et al. (2025) investigated armed conflict prediction, finding that specifying conflict type can negatively impact predictability due to weak statistical dependence. Jiao et al. (2025) introduced a method for heteroscedastic growth curve modeling using shape-restricted splines, addressing non-constant variance in growth trajectory analysis. Patel et al. (2025) applied neural posterior estimation (NPE) to astronomical image cataloging. Cui et al. (2025) presented a sequential robust optimal design framework for toxicology experiments. Other contributions include work on differentially private synthesis of spatial point processes, upper tail dependence estimators for multilayer networks, dynamic Dirichlet process mixture models for political analysis, and the paragraph-citation topic model (PCTM) for analyzing citation networks and document texts. Finally, a range of papers address diverse applications and methodological refinements in areas such as climate modeling, fMRI data simulation, and time series analysis.
Enhancing External Validity of Experiments with Ongoing Sampling by Chen Wang, Shichao Han, Shan Huang https://arxiv.org/abs/2502.18253
Caption: Estimated Average Treatment Effect Over Time with Stage Identification
Online experiments, especially A/B tests, face significant challenges in maintaining external validity due to the continuous enrollment of participants. This ongoing influx can lead to sample unrepresentativeness caused by shifts in user demographics over time. This paper tackles this problem by introducing a novel framework that dynamically evaluates sample representativeness and employs stage-specific estimators for Population Average Treatment Effects (PATE). This ensures the generalizability of experimental results across different durations, a crucial factor for sound product decisions.
The proposed framework divides the ongoing sampling process into three distinct stages: unstable, overlapping, and representative. These stages are determined by two key criteria: sufficient probability of participation for users with diverse covariates, and minimal difference in covariate distributions between the sample and the target population. A heuristic function based on survival analysis, specifically the probability of participation (π(t|Xᵢ)), identifies these stages in real time without prior knowledge of population characteristics. The time of overlap (Tₒ) is defined as the point where Pr[Sᵢₜ=1|Xᵢ=x] > ηₒ for all t > Tₒ and all x, while the time of representativeness (Tᵣ) is when |τₜ - τ| < ρ for all t > Tᵣ, where τₜ is the sample average treatment effect at time t and τ is the PATE.
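The overlap criterion above amounts to a simple threshold check on estimated participation probabilities. The sketch below illustrates this for a grid of covariate strata; the function name, the input format, and the toy numbers are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def time_of_overlap(pi_hat, eta_o=0.05):
    """Given pi_hat[t, x] = estimated participation probability for
    covariate stratum x by time t, return the first time index T_o
    such that every stratum exceeds eta_o at all later times."""
    T, _ = pi_hat.shape
    for t in range(T):
        if (pi_hat[t:] > eta_o).all():
            return t
    return None  # overlap never reached

# Toy example: 3 covariate strata whose participation grows over time.
pi_hat = np.array([
    [0.50, 0.02, 0.01],
    [0.70, 0.10, 0.04],
    [0.80, 0.30, 0.20],
    [0.90, 0.50, 0.40],
])
print(time_of_overlap(pi_hat, eta_o=0.05))  # stratum 3 first clears 0.05 at t=2
```

In practice π(t|Xᵢ) would come from a survival model rather than a fixed grid, but the stage boundary is still located by this kind of uniform threshold test.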
Distinct estimation strategies are employed for each stage. In the unstable stage, reliable causal inference is deemed impossible due to inadequate representation of users with diverse covariates. During the overlapping stage (Tₒ ≤ t ≤ Tᵣ), an Inverse Probability Weighting (IPW) estimator corrects for sample selection bias:
τᵢₚw(t) = (1/N) * Σᵢ(1{Sᵢₜ=1, Wᵢ=1} * Yᵢ(1) / ŵ(t|Xᵢ)) - (1/N) * Σᵢ(1{Sᵢₜ=1, Wᵢ=0} * Yᵢ(0) / ŵ(t|Xᵢ))
where ŵ(t|Xᵢ) estimates π(t|Xᵢ). In the representative stage (t > Tᵣ), the standard difference-in-means estimator suffices as the sample adequately reflects the population.
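The overlapping-stage estimator is a direct inverse-weighting of sampled outcomes. The following is a minimal transcription of the formula above (variable names and the toy data are assumptions for illustration):

```python
import numpy as np

def tau_ipw(S, W, Y, w_hat):
    """IPW estimator for the overlapping stage: each sampled unit's
    outcome is inverse-weighted by its estimated participation
    probability w_hat, which stands in for pi(t | X_i)."""
    N = len(Y)
    treated = (S == 1) & (W == 1)
    control = (S == 1) & (W == 0)
    return (np.sum(Y[treated] / w_hat[treated])
            - np.sum(Y[control] / w_hat[control])) / N

# Toy data: 6 users, all sampled, half treated; uniform weights
# reduce the estimator to a scaled difference in arm totals.
S = np.array([1, 1, 1, 1, 1, 1])
W = np.array([1, 1, 1, 0, 0, 0])
Y = np.array([3.0, 4.0, 5.0, 1.0, 2.0, 3.0])
w_hat = np.full(6, 0.5)
print(tau_ipw(S, W, Y, w_hat))  # (24 - 12) / 6 = 2.0
```

With non-uniform weights, users who were unlikely to have enrolled by time t are up-weighted, which is what corrects the sample selection bias during the overlapping stage.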
Validation using a real-world A/B test on WeChat, synthetic experiments, and a platform-wide application with 600 A/B tests demonstrated the framework's effectiveness. In the real-world test, the framework identified the stabilized treatment effect two days earlier than standard analysis. The platform-wide application significantly improved the identification of effective treatments, increasing the true positive rate by 37-56% while reducing the false positive rate by 17-29%. This underscores the framework's potential to enhance the reliability and generalizability of online experiment results.
StatLLM: A Dataset for Evaluating the Performance of Large Language Models in Statistical Analysis by Xinyi Song, Lina Lee, Kexin Xie, Xueying Liu, Xinwei Deng, Yili Hong https://arxiv.org/abs/2502.17657
Caption: This diagram illustrates the structure of the StatLLM dataset, a benchmark for evaluating large language models in statistical analysis. The dataset includes statistical analysis tasks, SAS code generated by three different LLMs (GPT 3.5, GPT 4.0, and Llama), and human evaluation scores for code quality, executability, and output correctness. This resource enables researchers to assess and improve LLM performance in statistical programming and develop more accurate automated evaluation metrics.
The increasing use of Large Language Models (LLMs) for code generation presents an opportunity to automate statistical analysis. However, evaluating the accuracy and reliability of this LLM-generated code remains a challenge, particularly due to the lack of standardized benchmarks for statistical languages like SAS and R. The StatLLM dataset directly addresses this gap, providing a comprehensive resource for assessing LLM performance in statistical analysis. It includes three key components: statistical analysis tasks, LLM-generated SAS code (from ChatGPT 3.5, ChatGPT 4.0, and Llama 3.1 70B), and human evaluation scores. The tasks cover a diverse range of statistical methods, from basic descriptive statistics to more complex procedures like survival analysis, and are accompanied by detailed dataset descriptions and human-verified SAS code.
Human evaluation of the LLM-generated code was rigorous, employing ten criteria grouped into three components: Code Correctness and Readability, Executability, and Output Correctness and Quality. A five-point scale (1-5) was used for each criterion, with nine raters divided into three groups corresponding to the evaluation components. To minimize bias, raters underwent extensive training, and LLM identities were concealed during evaluation. The total score for each model (M) was calculated as X^M = Σ_{g=1}^{3} x^M_g, where x^M_g represents the average score for model M within group g.
StatLLM has several important applications. It can be used to evaluate and refine existing NLP metrics for assessing statistical code. The study found a moderate correlation between current NLP metrics and human evaluations, highlighting the need for more specialized metrics. Using human scores as ground truth, StatLLM can train machine learning models to predict human ratings based on NLP metric scores, demonstrating improved correlations compared to existing metrics. For instance, XGBoost achieved a correlation of 0.434, an 18% improvement over Rouge-2 (0.367). StatLLM can also assess and enhance LLM performance in statistical programming, identifying specific weaknesses and guiding future development.
Finally, StatLLM can be instrumental in developing and testing next-generation statistical software that leverages natural language interaction. The paper showcases an R Shiny app that uses LLMs for automated statistical analysis, demonstrating the potential for streamlined workflows. StatLLM is designed for extensibility, allowing for the inclusion of more complex statistical tasks, additional programming languages, and alternative evaluation metrics.
Academic Literature Recommendation in Large-scale Citation Networks Enhanced by Large Language Models by Kun Liu, Yan Zhang, Rui Pan, Tianchen Gao, Hansheng Wang https://arxiv.org/abs/2503.01189
Caption: This diagram illustrates the hybrid academic literature recommendation system's workflow. The upper path depicts the process for established articles, leveraging citation counts, abstract, title, and node similarity with weighted importance (w1-w5) to generate top recommendations. The lower path shows how the system processes new input, matching it to existing articles and using their citation networks (reference and citation lists) to enhance recommendations, again employing weighted similarity measures (w6-w10) for abstract, title, and node comparisons.
Navigating the ever-growing body of academic literature is a significant challenge for researchers. This paper introduces a hybrid recommendation framework that combines citation network analysis and large language models to provide more relevant literature suggestions. The researchers built a substantial citation network of 190,381 articles from 70 journals across statistics, econometrics, and computer science, covering publications from 1981 to 2022.
This hybrid approach integrates network-based citation patterns with content-based semantic similarities. OpenAI's text-embedding-3-small model generates embedding vectors for each article's abstract, offering computational efficiency and embedding stability essential for dynamic databases. Title similarity is incorporated using a bag-of-words model. Similarity between articles is calculated as a weighted combination: weighted-sim = w₁ × abstract-sim + w₂ × title-sim + w₃ × node-sim. This allows for personalized recommendations by adjusting weights. A "fundamental score," combining weighted similarity and normalized citation count, prioritizes influential articles.
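The weighted-similarity combination and the fundamental score can be sketched in a few lines. Note that the blend weight `alpha` and the min-max normalization below are illustrative assumptions; the paper's exact combination rule for the fundamental score may differ:

```python
import numpy as np

def weighted_sim(abstract_sim, title_sim, node_sim, w=(0.5, 0.3, 0.2)):
    """weighted-sim = w1*abstract-sim + w2*title-sim + w3*node-sim.
    Adjusting w personalizes the recommendations."""
    w1, w2, w3 = w
    return w1 * abstract_sim + w2 * title_sim + w3 * node_sim

def fundamental_score(sim, citations, alpha=0.5):
    """Blend weighted similarity with a min-max normalized citation
    count so that influential articles are ranked higher (alpha is a
    made-up blend weight for illustration)."""
    c = np.asarray(citations, dtype=float)
    c_norm = (c - c.min()) / (c.max() - c.min() + 1e-12)
    return alpha * np.asarray(sim) + (1 - alpha) * c_norm

# Two candidate articles: the first is more similar, the second far
# more cited; the citation term can flip the ranking.
sims = weighted_sim(np.array([0.9, 0.4]), np.array([0.8, 0.2]),
                    np.array([0.7, 0.1]))
print(sims)                                      # [0.83 0.28]
print(fundamental_score(sims, citations=[10, 500]))
```

The design choice here is that similarity handles relevance while the citation term injects a notion of influence, and the two are tunable independently of each other.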
Two experiments validated the system's effectiveness. The first focused on reconstructing reference lists of 10 review articles, achieving a 0.7 average hit rate, comparable to existing methods. The second evaluated recommendations for 1,500 non-review articles, yielding a hit@1 rate of 0.85, hit@5 of 0.44, hit@10 of 0.28, and recall@20 of 0.76. These results demonstrate the system's accuracy, especially for top recommendations. A decline in performance for recent publications highlights the ongoing challenge of navigating the expanding literature landscape.
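For readers unfamiliar with the evaluation metrics cited above, hit@k and recall@k are straightforward to compute; this is a generic sketch, not the authors' evaluation code:

```python
def hit_at_k(recommended, relevant, k):
    """1 if any of the top-k recommendations is in the relevant set."""
    return int(any(r in relevant for r in recommended[:k]))

def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant set recovered within the top-k."""
    return len(set(recommended[:k]) & set(relevant)) / len(relevant)

# Toy ranked list against a 4-item relevant set (e.g. a reference list).
recs = ["a", "b", "c", "d"]
rel = {"b", "d", "e", "f"}
print(hit_at_k(recs, rel, 1))      # top-1 is "a", not relevant: 0
print(hit_at_k(recs, rel, 5))      # "b" appears within top-5: 1
print(recall_at_k(recs, rel, 4))   # {"b","d"} of 4 relevant: 0.5
```

Averaged over many query articles, these per-query values yield the aggregate hit@k and recall@k figures reported in the paper.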
AI-driven 3D Spatial Transcriptomics by Cristina Almagro-Pérez, Andrew H. Song, Luca Weishaupt, et al. https://arxiv.org/abs/2502.17761
Caption: The figure illustrates the VORTEX framework for predicting 3D spatial transcriptomics from 3D tissue images. It shows the workflow from tissue processing and imaging to cohort-level pretraining, volume-of-interest fine-tuning, and downstream morphomolecular analysis, including spatial domain identification and cross-modal retrieval.
VORTEX (VOlumetrically Resolved Transcriptomics EXpression) is a new AI framework that predicts 3D spatial transcriptomics (ST) from 3D tissue images and minimal 2D ST data. This offers a faster, more scalable, and less destructive alternative to existing 3D ST methods. VORTEX leverages the strong correlation between gene expression and tissue morphology.
VORTEX pretrains on a dataset of paired 3D morphology and 2D ST data from various tissue samples of the same cancer type, learning generic and sample-specific morphological correlates of gene expression. It is then fine-tuned on minimal 2D ST data from the specific volume of interest. This two-stage approach captures both general and volume-specific morphomolecular links. The model processes each 2D section and its neighbors, generating a 3D ST prediction.
Tested on prostate cancer tissue using microCT, VORTEX achieved the highest average Pearson Correlation Coefficient (PCC) in the 3D+VOI setting (pretraining + fine-tuning): 0.46 for all genes, 0.57 for the top 50 predictive genes, and 0.42 for marker genes, outperforming 2D and 3D settings. VORTEX also accurately captured variance and spatial autocorrelation of gene expression. It demonstrated scalability to large tissue volumes and generalizability to other imaging modalities. VORTEX enables unsupervised spatial domain identification by clustering 3D patches based on predicted gene expression. The total loss minimized during pretraining is:
L₁ = λ_cont,I · L_cont,I + λ_rec,I · L_rec,I + λ_da · L_da
where L_cont,I is the contrastive loss, L_rec,I is the reconstruction loss, L_da is the domain adaptation loss, and the λ terms are weighting factors.
Common indicators hurt armed conflict prediction by Niraj Kushwaha, Woi Sok Oh, Shlok Shah, Edward D. Lee https://arxiv.org/abs/2503.00265
Caption: The image visualizes three distinct conflict types identified by an unsupervised learning model: "sporadic/spillover events," "local conflicts," and "major unrest," represented by different colored nodes. Below, bar graphs illustrate the relative influence of various factors (climate, economy, geography, composite demography, infrastructure, and raw demography) on each conflict type, revealing that population, infrastructure, economics, and geography are the most discriminative, respectively. The study found that knowing the conflict type did not improve intensity predictions, highlighting the limitations of using common indicators for forecasting conflict properties.
This study challenges conventional thinking by suggesting that common indicators may hinder armed conflict prediction. Researchers used unsupervised learning on detailed conflict data from Africa, combined with various background indicators, to identify three distinct conflict types: "major unrest," "local conflict," and "sporadic and spillover events." Each type was associated with specific geographic, demographic, and socio-economic characteristics.
"Major unrest" occurred primarily in densely populated areas with developed infrastructure and riparian geography. "Local conflicts" were found in regions with medium population density and diverse socio-economic conditions, often confined within country borders. "Sporadic and spillover events" were smaller, occurring in sparsely populated areas with limited infrastructure and poor economic conditions. The most discriminative factors were population, infrastructure, economics, and geography, respectively.
Counterintuitively, specifying conflict type did not improve predictions of conflict intensity. This was attributed to weak statistical dependence between conflict type and intensity. Comparing a conflict type-based model (M4, a multi-multinomial mixture model: P(xᵢ) = Σₖ πₖ M_θₖ(xᵢ)) with an intensity-based model revealed a competitive effect: knowing the conflict type reduced information about intensity, especially regarding population.
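A mixture of multinomials evaluates each observation's count vector under every component and takes the π-weighted sum. The sketch below illustrates this with made-up parameters (the component count, probabilities, and data are hypothetical, not the paper's fitted model M4):

```python
import numpy as np
from math import comb

def multinomial_pmf(x, theta):
    """P(x | theta) for a count vector x with category probabilities theta."""
    n = sum(x)
    coef, rem = 1.0, n
    for xi in x:            # multinomial coefficient n! / (x1!...xK!)
        coef *= comb(rem, xi)
        rem -= xi
    return coef * np.prod(np.asarray(theta, float) ** np.asarray(x))

def mixture_prob(x, pis, thetas):
    """P(x) = sum_k pi_k * M_{theta_k}(x), the mixture likelihood."""
    return sum(p * multinomial_pmf(x, th) for p, th in zip(pis, thetas))

# Two hypothetical conflict-type components over 3 event categories.
pis = [0.6, 0.4]
thetas = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
x = [2, 1, 0]  # observed event counts
print(mixture_prob(x, pis, thetas))  # 0.6*0.294 + 0.4*0.009 = 0.18
```

Fitting such a model (e.g. by EM) recovers both the mixing weights πₖ and the component parameters θₖ, which is how unsupervised learning can assign each observation a conflict type.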
This newsletter highlights a fascinating convergence of methodological advancements and their diverse applications. The development of new tools like VORTEX for 3D spatial transcriptomics and the StatLLM dataset for evaluating LLMs in statistical analysis promises to accelerate research in biomedicine and data science. The innovative hybrid approach to academic literature recommendation addresses the growing challenge of information overload for researchers, while the study on armed conflict prediction underscores the importance of critically examining commonly used indicators and exploring alternative modeling strategies. A recurring theme is the power of combining different data sources and methodologies, whether it's integrating citation networks with semantic analysis, or leveraging 3D tissue morphology with 2D spatial transcriptomics. These approaches demonstrate the potential for synergistic gains in understanding complex phenomena and improving predictive accuracy across diverse fields.