Subject: Cutting-Edge Statistical and Machine Learning Research
Hi Elman,
This newsletter covers recent preprints exploring diverse applications of statistical and machine learning methodologies.
The papers span domains ranging from experimental physiology and medical imaging to climate change and AI safety. Several studies focus on innovative study designs and statistical models for improved inference. Königorski & Schmid (2024) (https://arxiv.org/abs/2412.15076) introduce N-of-1 trials as a powerful tool for personalized and population-level inference in experimental physiology, highlighting their potential for increased efficiency compared to traditional randomized controlled trials. Béclin et al. (2024) (https://arxiv.org/abs/2412.15049) develop a novel parametric linear regression model for quantile function data, applied to paired pulmonary 3D CT scans, offering an objective measure for assessing treatment response in asthma patients. Goh et al. (2024) (https://arxiv.org/abs/2412.14946) propose joint models using Bayesian Additive Regression Trees (BART) to handle non-ignorable missing data, addressing a critical challenge in predictive analysis, particularly in the context of leaf photosynthetic traits.
Another prominent theme is the application of Bayesian methods and AI in diverse fields. Gonzales Martinez & Haisma (2024) (https://arxiv.org/abs/2412.14720) introduce MICG-AI, a Bayesian AI algorithm leveraging digital phenotyping to create a multidimensional index of child growth, offering a more holistic approach compared to traditional growth models. Wadsworth & Niemi (2024) (https://arxiv.org/abs/2412.14339) employ a Bayesian hierarchical nonlinear model with discrepancy for forecasting influenza hospitalizations, utilizing both hospitalization and influenza-like illness (ILI) data. Micoli et al. (2024) (https://arxiv.org/abs/2412.15899) also utilize a Bayesian approach to calculate the probability of success for clinical trials with competing event data, offering a valuable tool for interim monitoring. Furthermore, Sun et al. (2024) (https://arxiv.org/abs/2412.14222) survey the emerging field of LLM-based agents for statistics and data science, highlighting their potential to democratize data analysis and automate complex tasks.
Several papers focus on specific applications and methodological advancements. These include a revisited skull analysis using the RMaCzek software (Bartoszek, 2024, https://arxiv.org/abs/2412.14343), evaluation of the linear mixing model in fluorescence spectroscopy (Hoff & Osburn, 2024, https://arxiv.org/abs/2412.14263), discussion of the relationship between the ICH E9 (R1) estimands framework and causal inference (Drury et al., 2024, https://arxiv.org/abs/2412.12380), a method for small-area uncertainty estimation for spatial averages of aboveground biomass (Johnson et al., 2024, https://arxiv.org/abs/2412.16403), and a proof of the minimax optimality of the Neyman allocation (Kato, 2024, https://arxiv.org/abs/2412.17753).
Addressing practical challenges in data analysis is another recurring theme. This includes using Bayesian multilevel bivariate spatial modelling to analyze Italian school data (Cefalo et al., 2024, https://arxiv.org/abs/2412.17710), a two-stage method for biomarker combination based on the Youden index (Sun & Zhou, 2024, https://arxiv.org/abs/2412.17471), a triple-variability-source model for analyzing motor-evoked potentials (Ma et al., 2024, https://arxiv.org/abs/2412.16997), the grill plot for visualizing linear predictions (Rousseeuw, 2024, https://arxiv.org/abs/2412.16980), and a disease progression model accounting for health disparities (Chiang et al., 2024, https://arxiv.org/abs/2412.16406).
Finally, numerous preprints explore specialized applications, including comparing logistic regression and XGBoost in fintech (Yarmohammadtoosky & Attota, 2024, https://arxiv.org/abs/2412.16333), generalizing causal effect estimates (Garraza et al., 2024, https://arxiv.org/abs/2412.16320), profile least squares estimation in networks (Chandna et al., 2024, https://arxiv.org/abs/2412.16298), mapping agricultural workers (Ormaza Zulueta et al., 2024, https://arxiv.org/abs/2412.15841), and various other applications spanning proteomics, earthquake modelling, climate science, dengue prediction, AI safety, and programming education. These diverse applications underscore the growing importance of statistical and machine learning methods in tackling complex real-world challenges.
A Survey on Large Language Model-based Agents for Statistics and Data Science by Maojun Sun, Ruijian Han, Binyan Jiang, Houduo Qi, Defeng Sun, Yancheng Yuan, Jian Huang https://arxiv.org/abs/2412.14222
Caption: This diagram illustrates the architecture of an LLM-based data agent, highlighting the central role of the LLM in processing user instructions and generating code for each task. This code is executed within a sandbox environment, leveraging tools like Python, SQL, Jupyter, and R, with results returned to the user through an interface that manages planning, reasoning, reflection, and error handling.
Large Language Models (LLMs) are rapidly changing the field of data science, potentially democratizing data analysis by making it accessible to those without specialized expertise. This survey delves into the growing area of LLM-based data agents – intelligent systems designed to automate complex data tasks through natural language interaction. These "data agents" hold the promise of simplifying data analysis workflows, making them easier for domain experts who may lack programming or statistical skills to use effectively. The survey examines the evolution of these agents, focusing on key features such as planning, reasoning, reflection, multi-agent collaboration, and knowledge integration, which allow them to tackle data-centric problems with minimal human intervention.
At the heart of a typical LLM-based data agent is the LLM, acting as the central processing unit. The LLM interprets user instructions, generates the necessary code, and retrieves the results. A secure sandbox environment allows for code execution, while a user-friendly interface facilitates seamless interaction. The survey categorizes data agents based on their user interface (IDE-based, independent systems, command-line, or OS-based) and their approach to problem-solving (conversational or end-to-end). Conversational agents engage in an interactive dialogue with users, while end-to-end agents execute tasks autonomously based on a single prompt. Planning strategies, vital for complex tasks, are further categorized as linear (step-by-step) or hierarchical (multi-path). The survey also underscores the importance of reflection, which allows agents to learn from past actions and self-correct errors, and multi-agent collaboration, where specialized agents work together to optimize performance.
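To make that loop concrete, here is a minimal Python sketch of a conversational data agent with linear planning and reflection. It is purely illustrative and not code from the survey: the helpers `call_llm` and `run_in_sandbox` are hypothetical placeholders for an LLM API call and a sandboxed code executor.

```python
# Minimal sketch of a conversational data agent loop. call_llm() and
# run_in_sandbox() are hypothetical placeholders, not real library APIs.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call that returns generated text or code."""
    raise NotImplementedError

def run_in_sandbox(code: str) -> dict:
    """Placeholder for executing generated code in an isolated environment."""
    raise NotImplementedError

def data_agent(user_request: str, max_retries: int = 3) -> dict:
    # Linear planning: ask the LLM to break the request into ordered steps.
    plan = call_llm(f"Break this data task into numbered steps:\n{user_request}")
    results = {}
    for step in plan.splitlines():
        if not step.strip():
            continue
        code = call_llm(f"Write Python code for this step:\n{step}")
        for _ in range(max_retries):
            outcome = run_in_sandbox(code)
            if outcome.get("error") is None:
                break
            # Reflection: feed the error back to the LLM and ask for a fix.
            code = call_llm(
                f"The code below failed with:\n{outcome['error']}\nRevise it:\n{code}"
            )
        results[step] = outcome
    return results
```

A hierarchical planner would replace the single flat step list with sub-plans per step; an end-to-end agent would run the same loop without pausing for user dialogue between steps.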
Several case studies demonstrate the practical applications of data agents. One study showcases data visualization and machine learning using ChatGPT for exploratory data analysis and LAMBDA for automated report generation. Another highlights the Data Interpreter agent, which handles both visualization and machine learning tasks, emphasizing its planning and reflection capabilities. A third study explores the expandability of data agents, demonstrating the integration of tools and domain-specific knowledge. For example, the Data Interpreter integrated a web-scraping tool, while LAMBDA incorporated knowledge of Fixed Point Non-Negative Neural Networks (FPNNNs).
Despite the potential, data agents face challenges. Current LLMs, while proficient in basic statistics, struggle with more advanced analyses. Improving domain-specific knowledge, multi-modal handling (charts, tables, code), and reasoning capabilities are crucial. For data agents to become truly intelligent statistical analysis software, seamless package management, community building, and integration with other large models are essential. Addressing infrastructure challenges, especially for web-based applications handling high concurrent requests, is also critical. The survey concludes that while significant progress has been made, continuous research is needed to overcome these challenges and unlock the full potential of data agents.
Learning Disease Progression Models That Capture Health Disparities by Erica Chiang, Divya Shanmugam, Ashley N. Beecy, Gabriel Sayer, Nir Uriel, Deborah Estrin, Nikhil Garg, Emma Pierson https://arxiv.org/abs/2412.16406
Caption: Group-specific parameter estimates demonstrating disparities in heart failure progression.
Traditional disease progression models, while helpful for predicting patient trajectories and guiding treatment, often overlook a crucial factor: health disparities. These disparities, rooted in systemic inequities, can bias observed data, leading to inaccurate severity estimations and potentially worsening existing inequalities. This research introduces a new Bayesian disease progression model specifically designed to address these biases. The model concentrates on three key disparities: disparities in initial severity (when patients start receiving care), disparities in disease progression rate (how fast the disease advances), and disparities in visit frequency (how often patients seek follow-up care).
The model centers around a patient's latent disease severity, Z<sub>t</sub>, which changes over time t. It's defined by the initial severity, Z<sub>0</sub>, and a rate of progression, R: Z<sub>t</sub> = Z<sub>0</sub> + Rt. Observed symptoms, X<sub>t</sub>, are modeled as a noisy function of the latent severity: X<sub>t</sub> = f(Z<sub>t</sub>) + ε<sub>t</sub>, where ε<sub>t</sub> represents noise. Critically, the model allows parameters controlling Z<sub>0</sub>, R, and visit frequency (modeled as a time-varying rate parameter λ<sub>t</sub> in an inhomogeneous Poisson process: log(λ<sub>t</sub>) = β<sub>0</sub> + β<sub>z</sub>Z<sub>t</sub> + β<sub>A</sub>) to vary according to patient demographics, A. This allows the model to capture how disparities might affect these crucial aspects of disease progression. A key theoretical contribution is proving that this model, while complex enough to capture multiple disparities, remains identifiable – a crucial property for accurate parameter estimation. The authors also theoretically demonstrate that ignoring any of these disparities inevitably leads to biased severity estimates.
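As a concrete illustration of this generative model, the short Python sketch below simulates one patient's visit times and symptom observations, drawing from the inhomogeneous Poisson visit process by thinning. The parameter values and the identity symptom function are assumptions chosen for illustration, not the authors' code or estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_patient(z0, rate, beta0, beta_z, beta_a, f=lambda z: z,
                     sigma=0.5, horizon=10.0):
    """Simulate one patient from the model described above:
    latent severity Z_t = Z_0 + R*t, observed symptoms X_t = f(Z_t) + noise,
    visit times from an inhomogeneous Poisson process with
    log(lambda_t) = beta0 + beta_z * Z_t + beta_a."""
    lam = lambda t: np.exp(beta0 + beta_z * (z0 + rate * t) + beta_a)
    # Thinning: bound the rate over [0, horizon] (monotone in t here).
    lam_max = max(lam(0.0), lam(horizon))
    visits, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lam_max)
        if t > horizon:
            break
        if rng.uniform() < lam(t) / lam_max:  # accept with prob lam(t)/lam_max
            z_t = z0 + rate * t
            x_t = f(z_t) + rng.normal(0.0, sigma)  # noisy symptom observation
            visits.append((t, z_t, x_t))
    return visits

# Hypothetical group-specific parameters: one group starts care at a more
# severe stage (larger z0) and visits less often (smaller beta_a).
group_a = simulate_patient(z0=0.5, rate=0.3, beta0=0.0, beta_z=0.2, beta_a=0.0)
group_b = simulate_patient(z0=1.0, rate=0.3, beta0=0.0, beta_z=0.2, beta_a=-0.1)
```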
Researchers validated their model using synthetic and real-world data. Synthetic experiments confirmed the model's ability to accurately recover true data-generating parameters and produce well-calibrated severity estimates. The real-world application focused on heart failure patients from the New York-Presbyterian hospital system. The model revealed that Black patients tended to present with more severe heart failure compared to White patients. It also uncovered specific disparities: Black and Asian patients began care at later, more severe stages of heart failure than White patients, and Black patients had 10% lower visit frequency than White patients at the same disease severity. Importantly, accounting for these disparities significantly changed severity estimates, increasing the proportion of non-white patients identified as high-risk. This research underscores the critical need to incorporate health disparities into disease progression models. Failing to do so leads to biased severity estimations and can misclassify patients as high-risk, potentially hindering equitable access to care.
How to Choose a Threshold for an Evaluation Metric for Large Language Models by Bhaskarjit Sarmah, Mingshu Li, Jingrao Lyu, Sebastian Frank, Nathalia Castellanos, Stefano Pasquali, Dhagash Mehta https://arxiv.org/abs/2412.12148
Caption: Histograms of faithfulness scores for three different open-source implementations (RAGAS, UpTrain, and DeepEval) using the gpt-4o-mini model on the HaluBench dataset. The distributions illustrate the challenge of selecting an appropriate threshold, with bimodal distributions observed for certain implementations, highlighting the limitations of simple methods like Z-score.

Large language models (LLMs) are transforming industries, but their reliability depends on robust evaluation. While various metrics exist to assess LLM performance, choosing the right threshold for these metrics is crucial yet under-researched. This paper presents a new methodology, drawing from model risk management (MRM) in finance, to systematically determine thresholds for LLM evaluation metrics, ensuring responsible and reliable deployment. This approach emphasizes a holistic perspective, starting with identifying the risks associated with the specific LLM application and the risk tolerance of stakeholders. This involves quantifying potential legal, financial, reputational, and societal risks, and translating stakeholder risk preferences into statistical confidence levels. This ensures alignment between model performance and practical considerations.
Once risk tolerance is quantified, the next step is preparing ground truth data, which includes diverse questions, context, answers, and human-curated labels corresponding to the chosen evaluation metric. The data is then split into training and testing sets. The paper suggests several statistically rigorous methods for threshold determination, including Z-score, kernel density estimation (KDE), empirical recall, AUC-ROC, and conformal prediction. The Z-score approach, though simple, relies on normal distribution assumptions, which might not hold for all LLM metrics. KDE estimates the density of the evaluation metric, identifying the midpoint between peaks in the distribution. Empirical recall examines the relationship between recall and the evaluation metric score. AUC-ROC utilizes the area under the receiver operating characteristic curve to determine thresholds based on acceptable false positive and false negative rates. Conformal prediction provides a model-agnostic framework for generating prediction intervals at specific confidence levels.
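The sketch below illustrates two of these strategies on a held-out set of metric scores: picking the KDE density valley between the two modes of a bimodal distribution, and taking a finite-sample-corrected empirical quantile in the spirit of conformal prediction. The simulated scores, labels, and cutoffs are placeholders, not the paper's data or implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Simulated faithfulness scores with binary labels
# (1 = faithful, 0 = hallucinated); placeholders for a real ground-truth set.
rng = np.random.default_rng(1)
scores_pos = rng.beta(8, 2, size=500)   # scores for faithful answers
scores_neg = rng.beta(2, 8, size=500)   # scores for hallucinated answers

# 1) KDE approach: fit a density to all scores and pick the lowest-density
#    point (the "valley") between the two modes of a bimodal distribution.
all_scores = np.concatenate([scores_pos, scores_neg])
grid = np.linspace(0, 1, 512)
density = gaussian_kde(all_scores)(grid)
interior = (grid > 0.1) & (grid < 0.9)  # ignore boundary artefacts
kde_threshold = grid[interior][np.argmin(density[interior])]

# 2) Conformal-style approach: set the threshold at an empirical quantile of
#    the hallucinated-class scores so that, at confidence level 1 - alpha,
#    at most a fraction alpha of hallucinated answers score above it.
alpha = 0.05
n = len(scores_neg)
k = int(np.ceil((n + 1) * (1 - alpha)))  # finite-sample correction
conformal_threshold = np.sort(scores_neg)[min(k, n) - 1]

print(f"KDE valley threshold: {kde_threshold:.3f}")
print(f"Conformal threshold (alpha={alpha}): {conformal_threshold:.3f}")
```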
This methodology is demonstrated using the Faithfulness metric, which measures how well an LLM's responses reflect the retrieved context, on the HaluBench dataset. Three open-source implementations of Faithfulness (RAGAS, UpTrain, and DeepEval) are evaluated with the gpt-4o-mini model. Results show that all methods achieve valid coverage aligned with pre-specified confidence levels, but KDE performs better at lower confidence levels, while conformal prediction shows better discriminative power at higher confidence levels. The simple Z-score method was ineffective due to the bimodal nature of the Faithfulness scores. The study highlights the limitations of some methods, such as KDE's sensitivity to parameter choices and empirical recall's conservative behavior at high confidence levels. This research provides a practical guide for choosing appropriate thresholds for LLM evaluation metrics, balancing precision and recall while aligning with application requirements and risk appetite.
On the Role of Surrogates in Conformal Inference of Individual Causal Effects by Chenyin Gao, Peter B. Gilbert, Larry Han https://arxiv.org/abs/2412.12365
Caption: Comparison of Prediction Interval Widths for ITE Estimation Across Different Methods and Surrogate Marker Quintiles
Personalized medicine relies on accurate individual treatment effect (ITE) estimates, but traditional causal inference often falls short by focusing on aggregated effects. While conformal prediction quantifies ITE uncertainty, the resulting prediction intervals are often too wide for practical use. This paper introduces SCIENCE (Surrogate-assisted Conformal Inference for Efficient INdividual Causal Effects), a framework for more efficient ITE prediction intervals by incorporating surrogate outcomes. This addresses a gap in existing methods, which haven't leveraged surrogates for enhancing individual-level causal inference.
SCIENCE works in various data scenarios, including semi-supervised and surrogate-assisted semi-supervised learning, handling situations where primary outcomes are observed in some data (source) but missing in others (target). Critically, it handles covariate shifts between source and target data. The framework uses semi-parametric efficiency theory to derive efficient influence functions (EIFs) for estimating the quantile r<sub>α</sub> of the non-conformity score, crucial for conformal prediction intervals. The EIFs under different data settings reveal that incorporating surrogates boosts efficiency when present in both source and target data, allowing better prediction of missing primary outcomes in the target data. The efficiency gain is quantified by comparing semi-parametric lower bounds, showing it depends on how well surrogates predict the primary outcome, measured by var{m<sub>1</sub>(r<sub>α,1</sub>, X, S) | X}, where m<sub>1</sub> is the conditional CDF of the non-conformity score.
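For intuition about the role of the quantile r<sub>α</sub>, here is a bare-bones split-conformal sketch in Python. It uses a plain empirical quantile of calibration non-conformity scores and deliberately omits the surrogate-assisted, EIF-based estimation and covariate-shift weighting that constitute SCIENCE's actual contribution; the simulated residuals and predictions are placeholders.

```python
import numpy as np

def split_conformal_interval(residual_cal, y_hat_test, alpha=0.1):
    """Basic split-conformal prediction interval.

    residual_cal : |y - y_hat| non-conformity scores on a calibration set
    y_hat_test   : point predictions (e.g., estimated ITEs) on test units
    Returns lower/upper bounds with marginal coverage >= 1 - alpha.
    """
    n = len(residual_cal)
    # r_alpha: the (1 - alpha) empirical quantile of the non-conformity
    # scores, with the usual (n + 1) finite-sample correction.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    r_alpha = np.sort(residual_cal)[min(k, n) - 1]
    return y_hat_test - r_alpha, y_hat_test + r_alpha

# Toy usage with simulated calibration residuals and test predictions.
rng = np.random.default_rng(2)
residual_cal = np.abs(rng.normal(0, 1, size=200))
y_hat_test = rng.normal(0, 1, size=5)
lower, upper = split_conformal_interval(residual_cal, y_hat_test, alpha=0.1)
```

SCIENCE's efficiency gains come from estimating this quantile more precisely: when surrogates are informative, the variance of the quantile estimator shrinks, and the resulting intervals tighten without sacrificing coverage.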
Simulation studies validated SCIENCE, showing substantial efficiency gains (shorter intervals) while maintaining valid coverage. For continuous outcomes with a surrogate predictiveness parameter σ<sub>s</sub> of 10, SCIENCE produced intervals 48% and 22% shorter than weighted CQR and efficient estimators without surrogates, respectively. Simulations confirmed SCIENCE's group-conditional coverage guarantees, ensuring reliability across subpopulations. Further simulations with categorical outcomes highlighted the importance of accounting for surrogates in the conditional probability model for valid empirical coverage.
In a real-world application, SCIENCE was applied to the Moderna COVE COVID-19 vaccine trial. By including surrogate markers like antibody levels, the framework generated more efficient prediction intervals for individual vaccine efficacy, showcasing SCIENCE's practical utility in personalized efficacy assessments. The paper concludes by discussing potential extensions, including developing conformal predictive distributions, exploring class-conditional coverage, and investigating adaptive strategies for selecting the miscoverage rate α.
This newsletter highlights the diverse and impactful applications of statistical and machine learning methods. From novel study designs like N-of-1 trials to sophisticated Bayesian models incorporating health disparities, the research showcased here pushes the boundaries of statistical inference. The emergence of LLM-based data agents promises to democratize data analysis, while rigorous approaches to threshold selection for LLM evaluation metrics ensure responsible AI deployment. Finally, the innovative use of surrogate outcomes in conformal inference sharpens individual-level causal effect estimates, paving the way for more personalized and effective interventions in healthcare and beyond. These advancements collectively demonstrate the increasing power and relevance of statistical and machine learning in addressing complex real-world problems.