Georgia Statistics Day 2024

Keynote Talk

Data Integration for Heterogeneous Data

Speaker: Annie Qu

Abstract:

In this presentation, I will showcase advanced statistical machine learning techniques and tools designed for the seamless integration of information from multi-source datasets. These datasets may originate from various sources, encompass distinct studies with different variables, and exhibit unique dependence structures. One of the greatest challenges in investigating research findings is the systematic heterogeneity across individuals, which could significantly undermine the power of existing machine learning methods to identify the underlying true signals. This talk will investigate the advantages and drawbacks of current methods such as multi-task learning, optimal transport, missing data imputation, matrix completion, and transfer learning. Additionally, we will introduce a new latent representation method that maps heterogeneous observed data to a latent space, facilitating the extraction of shared information and knowledge and the disentanglement of source-specific information and knowledge. The key idea of the proposal is to project heterogeneous raw observations onto a representation retriever library, and the novelty of our method is that we can retrieve partial representations from the library for a target study. The main advantage of the proposed method is that it can increase statistical power by borrowing common representation retrievers from multiple sources of data. This approach ultimately allows one to extract information from heterogeneous data sources, transfer generalizable knowledge beyond the observed data, and enhance the accuracy of prediction and statistical inference.

Biography:

Annie Qu is Chancellor’s Professor in the Department of Statistics at the University of California, Irvine. She received her Ph.D. in Statistics from the Pennsylvania State University in 1998. Qu’s research focuses on solving fundamental issues regarding structured and unstructured large-scale data and developing cutting-edge statistical methods and theory in machine learning and algorithms for personalized medicine, text mining, recommender systems, medical imaging data, and network data analyses for complex heterogeneous data. The newly developed methods can extract essential and relevant information from large volumes of intensively collected data, such as mobile health data. Her research impacts many fields, including biomedical studies, genomic research, public health research, and the social and political sciences. Before joining UC Irvine, Dr. Qu was a Data Science Founder Professor of Statistics and the Director of the Illinois Statistics Office at the University of Illinois at Urbana-Champaign. She was named a Brad and Karen Smith Professorial Scholar by the College of LAS at UIUC and held an NSF CAREER Award from 2004 to 2009. She is a Fellow of the Institute of Mathematical Statistics (IMS), the American Statistical Association, and the American Association for the Advancement of Science. She is also a recipient of the IMS Medallion Award and Lecture in 2024. She serves as Co-Editor of the Journal of the American Statistical Association, Theory and Methods, from 2023 to 2025 and as IMS Program Secretary from 2021 to 2027.
Qu Lab website: https://faculty.sites.uci.edu/qulab/

Invited Speakers

Boosted generalized normal distributions: Forecasting patient wait and service times in emergency departments

Speaker: Donald Lee

Abstract:

Applications of ML techniques sometimes ignore important knowledge from the domain they are applied to. For example, it is known that the distribution of patient wait times in an emergency department (ED) is approximately exponential, but current wait time forecasts ignore this information. To incorporate distributional knowledge into ML forecasts, we introduce a rigorous tree boosting procedure for estimating generalized normal distributions (bGND). We show that bGND performs 6% better than the distribution-agnostic ML benchmark in the distributional forecasting of patient wait times, which translates into a 9% increase in patient satisfaction and an increase in hospital earnings of $120,000 for every 10,000 visits. Similar improvements are also shown for patient service time forecasts using bGND.

Biography:

Prof. Lee's research develops rigorous data science techniques for improving the delivery of health care. On the applied front, he has extensive experience designing data-driven tools for problems ranging from healthcare financial planning to real-time warning systems for adverse medical events. On the methodological front, his research has resolved foundational questions in causal inference and in survival machine learning. His work has appeared in leading journals in management, statistical machine learning, and healthcare, and is recognized by R01 funding from the NIH. Prior to joining Emory, he served as an associate professor at Yale and held appointments in the School of Management and in the Department of Statistics & Data Science.

Order Selection for Clustering Multivariate Extremes

Speaker: Ray (Shuyang) Bai

Abstract:

In extreme value theory, the so-called spectral measure summarizes the directional dependence pattern of extreme values across different variables. Several recent works have related spherical clustering techniques to the estimation of models with a discrete spectral measure. Yet, the problem of determining the order, i.e., the number of distinct atoms of the spectral measure, remains unexplored. In this work, we develop an order selection method that, on the theoretical side, consistently recovers the true order, and on the practical side, enjoys an intuitive and simple implementation. Our method is based on a variant of the well-known silhouette method. In particular, we add a penalty term to the so-called simplified average silhouette width, which discourages small cluster sizes and small dissimilarities between cluster centers. The optimal order is chosen by visualizing the bending of the penalized average silhouette width curve (as a function of the order selected) as the tuning parameter of the penalty term increases. As a consequence, this method consistently estimates the order of a max-linear factor model, for which a usual information-criterion-based method is not applicable. Simulation studies demonstrate the bias-correcting effect of the penalty introduced. The method is also illustrated on a river discharge data set for stations located throughout the US. The order selected by our method matches the geographical context of these stations. This is joint work with Shiyuan Deng and He Tang.
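
As a rough illustration of the quantity the abstract builds on, the sketch below computes an unpenalized simplified average silhouette width, in which each point is compared with cluster centers rather than with every other point. The penalty term is not specified in the abstract and is not reproduced here. The function name, the toy data, and the use of Euclidean distance in place of a spherical dissimilarity are all assumptions made for illustration.

```python
import math

def simplified_silhouette(points, centers, labels):
    """Average simplified silhouette width (no penalty term).

    For each point, a is the distance to its own cluster center and b is the
    distance to the nearest other center; its width is (b - a) / max(a, b).
    Euclidean distance is an assumption; spherical clustering would normally
    use an angular dissimilarity instead.
    """
    if len(centers) < 2:
        raise ValueError("need at least two centers")
    total = 0.0
    for point, k in zip(points, labels):
        a = math.dist(point, centers[k])
        b = min(math.dist(point, c) for j, c in enumerate(centers) if j != k)
        denom = max(a, b)
        total += 0.0 if denom == 0 else (b - a) / denom
    return total / len(points)

# Hypothetical toy data: two well-separated directions in the plane.
points = [(0.0, 1.0), (0.1, 0.9), (1.0, 0.0), (0.9, 0.1)]
centers = [(0.05, 0.95), (0.95, 0.05)]
labels = [0, 0, 1, 1]
width = simplified_silhouette(points, centers, labels)
print(f"simplified average silhouette width: {width:.3f}")
```

Values near 1 indicate points that sit close to their own center and far from the others; repeating the calculation for several candidate orders gives the kind of curve that the penalized criterion in the talk is designed to correct.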

Biography:

Ray (Shuyang) Bai is an associate professor in the Department of Statistics at the University of Georgia. He obtained his PhD in Mathematics from Boston University in 2016. His research spans probability and statistics. He is particularly interested in probabilistic and statistical questions related to data exhibiting non-standard scaling features such as long-range dependence and heavy tails. His recent focus includes extreme value theory under dependence, as well as the analysis of multivariate extremes through the lens of machine learning techniques.

Functional Differential Equation Model for Dynamic System

Speaker: Ruiyan Luo

Abstract:

One major limitation of the ordinary differential equation (ODE) model is that it assumes the derivatives of the system depend only on the concurrent values. This concurrent assumption can oversimplify the mechanism of dynamic systems and limit the applicability of differential equations. To address this limitation, we propose a general functional differential equation (FDE) model which allows the derivative to explicitly depend on both the current value and a historical segment of the system through an operator whose form is unknown. The operator maps functions defined in infinite-dimensional spaces to scalars. To construct the operator and build the FDE from noisy observations, we propose a new family of estimators, called functional neural networks (FNN) with a smooth hidden layer, and establish the universal approximation property, which states that any operator under mild regularity conditions can be well estimated by members of this family. With this theorem, we propose a penalized moving window integrated least squares method to construct an estimate of the FDE and make forecasts. The FDE method displays a clear advantage in forecasting, both in simulations and in an application to sunspot data.

Biography:

Dr. Ruiyan Luo is Professor of Biostatistics in the Department of Population Health Sciences, School of Public Health, Georgia State University. Her research interests include functional data analysis, Bayesian statistics, machine learning, dynamic system modeling, and applications in infectious diseases.

Bayesian Jackknife Empirical Likelihood-based Inference for Missing Data and Causal Inference Problems

Speaker: Yichuan Zhao

Abstract:

Missing data reduce the representativeness of the sample and can lead to inference problems. This study applied the Bayesian jackknife empirical likelihood method to inference with data missing at random and to causal inference. The semiparametric fractional imputation estimator, propensity score weighted estimator, and doubly robust estimator were used to construct the jackknife pseudo values needed for Bayesian jackknife empirical likelihood-based inference with missing data. Existing methods, such as normal approximation and jackknife empirical likelihood, were compared with the Bayesian jackknife empirical likelihood approach in a simulation study. The proposed approach performed better in many scenarios in terms of the behavior of credible intervals. Furthermore, we demonstrated the application of the proposed approach to causal inference problems in a study of risk factors for impaired kidney function.

Biography:

Dr. Yichuan Zhao is a Professor of Statistics at Georgia State University in Atlanta. His current research interests focus on survival analysis, empirical likelihood methods, nonparametric statistics, analysis of ROC curves, bioinformatics, Monte Carlo methods, and statistical modelling of fuzzy systems. He has published more than 100 research articles in statistics and biostatistics, has co-edited six books on statistics, biostatistics and data science, and has been invited to deliver more than 200 research talks nationally and internationally. Dr. Zhao has organized the Workshop Series on Biostatistics and Bioinformatics since its initiation in 2012. He also organized the 25th ICSA Applied Statistics Symposium in Atlanta as the chair of the organizing committee to great success. In addition, the 6th ICSA China Conference that he organized as the chair of both the organizing committee and program committee was a huge success. Dr. Zhao is a Fellow of the American Statistical Association and an elected member of the International Statistical Institute.

Enhancing fraud detection with graph neural networks

Speaker: Shijie Cui

Abstract:

Detecting fraudulent activities is crucial for loss prevention across industries. This task is complex and constantly evolving due to the dynamic and interconnected nature of activities. Traditional fraud detection techniques often focus on transactions in isolation, missing the intricate relationships between entities such as users and service providers. This talk presents the application of Graph Neural Networks (GNNs) to fraud detection in the real world. Graph-based approaches allow us to uncover hidden patterns and relationships that traditional methods miss. They have the potential to offer a scalable, robust solution for enhancing fraud prevention efforts.

Biography:

Shijie Cui currently works on the Advanced Technology for Modeling (AToM) team at Wells Fargo. He obtained his PhD in Statistics from Pennsylvania State University. His primary focus is on developing efficient machine learning models as well as model risk management tools in banking.

Sample Complexity of Risk-Neutral Optimal Control with Application to Vaccination Scheduling for Epidemic Control

Speaker: Johannes Milz

Abstract:

The SEIR model is a widely used framework for simulating the spread of infectious diseases, and optimal control problems can be formulated to design effective intervention strategies, such as vaccination schedules. However, the SEIR model's parameters are often uncertain. To account for this, we model the parameters as random variables and formulate a risk-neutral optimal control problem by averaging over their possible values. Building on this idea, we consider a broad class of risk-neutral optimal control problems involving nonlinear ordinary differential equations with uncertain inputs. By sampling these uncertain inputs, we approximate the original problem using empirical risk minimization. Leveraging metric entropy techniques, we derive non-asymptotic sample complexity bounds for the sample-based optimal values and critical points. Numerical simulations for the vaccination scheduling problem validate our theoretical findings. This is joint work with Olena Melnikov.
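
To make the sample-based approximation concrete, here is a minimal sketch under strong simplifying assumptions: the control is a single constant vaccination rate rather than a time-dependent schedule, the SEIR equations are integrated with forward Euler, and the parameter ranges, cost weight, and function names are all hypothetical. It is not the formulation or the algorithm from the talk.

```python
import random

def seir_cumulative_exposed(beta, sigma, gamma, vax_rate, days=120, dt=0.1):
    """Forward-Euler SEIR run on population fractions.

    Susceptibles are vaccinated at the constant rate vax_rate and moved
    straight to the removed compartment. Returns the cumulative fraction
    of the population that became exposed.
    """
    s, e, i, r = 0.99, 0.0, 0.01, 0.0
    cumulative = 0.0
    for _ in range(int(days / dt)):
        new_exposed = beta * s * i
        vaccinated = vax_rate * s
        ds = -new_exposed - vaccinated
        de = new_exposed - sigma * e
        di = sigma * e - gamma * i
        dr = gamma * i + vaccinated
        cumulative += dt * new_exposed
        s, e, i, r = s + dt * ds, e + dt * de, i + dt * di, r + dt * dr
    return cumulative

def sample_average_cost(vax_rate, samples, weight=50.0):
    """Empirical risk: mean infections over sampled parameters plus a control cost."""
    mean_infections = sum(
        seir_cumulative_exposed(b, s, g, vax_rate) for b, s, g in samples
    ) / len(samples)
    return mean_infections + weight * vax_rate ** 2

# Draw a fixed sample of uncertain parameters (hypothetical ranges).
rng = random.Random(0)
samples = [
    (rng.uniform(0.2, 0.5), rng.uniform(0.1, 0.3), rng.uniform(0.05, 0.15))
    for _ in range(20)
]

# Minimize the sample-average objective over a coarse grid of constant rates.
grid = [0.0, 0.01, 0.02, 0.03, 0.04, 0.05]
best_rate = min(grid, key=lambda v: sample_average_cost(v, samples))
print("sample-based best vaccination rate:", best_rate)
```

Redrawing the parameter sample changes the minimizer and the optimal value; the sample complexity bounds described in the abstract concern how quickly such sample-based quantities stabilize as the number of samples grows.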

Biography:

Johannes Milz is an Assistant Professor in the H. Milton Stewart School of Industrial and Systems Engineering at Georgia Tech. His research interests include optimization under uncertainty and optimal control of uncertain systems. His work has focused on the algorithmic development and the analysis of stochastic optimization problems governed by partial differential equations. Prior to joining ISyE, he was a postdoctoral researcher at the Technical University of Munich. He received his Ph.D. degree in Applied Mathematics from the Technical University of Munich in 2021.

Adaptive Sampling Approaches for Real-time Data-Driven Decision-Making

Speaker: Xiaochen Xian

Abstract:

Advances in sensor technology have given rise to a highly data-intensive environment, enabling real-time decision-making by continuously collecting, processing, and analyzing vast amounts of information. In this talk, two adaptive sampling works will be presented in the context of real-time data-driven decision-making.

The first work proposes adaptive strategies to dynamically allocate limited testing resources among different communities during infectious disease outbreaks, built on a physics-informed model that accounts for transmission dynamics and health disparity to enable effective health risk assessment despite limited data. By integrating nonstationary Multi-Armed Bandit (MAB) techniques, which strike a superior balance between exploration of communities with highly uncertain health risks and exploitation of those with high risk levels, the proposed methodology allocates testing resources among communities so as to collect high-quality testing data for quick detection of disease outbreaks. The second work proposes pathwise sampling strategies that use moving sensors to quickly identify abrupt changes in an area of interest in real time, subject to the sensors' pathwise movement constraints. To tackle challenges due to the variability and partial observability of online observations, we integrate tools from statistical process control and mathematical optimization to monitor the global status of the area of interest and adaptively adjust the paths of MVSs to sample from suspicious locations based on real-time data. Theoretical investigations and simulation studies will be presented to validate the superior performance of the proposed methods.
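
The exploration-exploitation trade-off in the first work can be illustrated with a generic discounted UCB-style rule, one common way of handling nonstationary rewards. This is a sketch only: the class name, the discount and exploration constants, and the simulated positivity rates are assumptions, and the physics-informed risk model from the talk is not represented.

```python
import math
import random

class DiscountedUCB:
    """Generic discounted UCB allocator (illustrative, not the speaker's method)."""

    def __init__(self, n_arms, discount=0.95, explore=1.0):
        self.discount = discount
        self.explore = explore
        self.counts = [0.0] * n_arms  # discounted number of pulls per arm
        self.sums = [0.0] * n_arms    # discounted sum of rewards per arm

    def select(self):
        # Try every arm once before using the index rule.
        for arm, count in enumerate(self.counts):
            if count == 0.0:
                return arm
        total = max(sum(self.counts), 1.0)

        def score(arm):
            mean = self.sums[arm] / self.counts[arm]
            bonus = self.explore * math.sqrt(math.log(total) / self.counts[arm])
            return mean + bonus

        return max(range(len(self.counts)), key=score)

    def update(self, arm, reward):
        # Discount old information so that recent observations dominate.
        self.counts = [self.discount * c for c in self.counts]
        self.sums = [self.discount * s for s in self.sums]
        self.counts[arm] += 1.0
        self.sums[arm] += reward

# Hypothetical simulation: three communities whose positivity rates shift midway.
rng = random.Random(1)
allocator = DiscountedUCB(n_arms=3)
pulls = [0, 0, 0]
for t in range(300):
    rates = [0.05, 0.20, 0.05] if t < 150 else [0.05, 0.05, 0.30]
    arm = allocator.select()
    reward = 1.0 if rng.random() < rates[arm] else 0.0  # 1.0 means a positive test
    allocator.update(arm, reward)
    pulls[arm] += 1
print("tests allocated per community:", pulls)
```

Communities that have not been tested recently see their discounted counts shrink, which inflates their bonus term and draws testing back to them; this is what lets the rule react when risk levels change.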

Biography:

Dr. Xiaochen Xian is currently an assistant professor in the H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology. Prior to joining Georgia Tech, she was an assistant professor in the Department of Industrial and Systems Engineering at the University of Florida. She received her B.S. degree in Mathematics and Applied Mathematics from Zhejiang University, China in 2014, and her M.S. degree in Statistics and Ph.D. degree in Industrial and Systems Engineering from the University of Wisconsin-Madison in 2017 and 2019, respectively. Dr. Xian’s research focuses on computationally aware systems with a special interest in novel methodologies in data-driven decision-making and machine learning under constraints to enable theoretically sound and viable analytical tools. Her research has been supported by federal and local agencies including NSF, NIH, the Florida Center for Cybersecurity, and the Florida Space Grant Consortium. She is the recipient of multiple awards, including the NIH NIBIB Trailblazer Award, Cottmeyer Family Faculty Fellowships, finalist of INFORMS QSR Best Refereed Paper, INFORMS DMDA Workshop Best Paper, and IISE QCRE Best Track Paper, second runner-up of the Best Paper Award in IEEE TASE, and feature articles in IISE magazine, AIE, and YoungStats. Dr. Xian is an associate editor of IEEE Transactions on Automation Science and Engineering and IEEE International Conference on Automation Science and Engineering.

Statistical approach to evaluate the impact of a pediatric Typhoid Conjugate Vaccine campaign in Navi Mumbai, India

Speaker: Qian An

Abstract:

Typbar-TCV, a typhoid conjugate vaccine (TCV), was prequalified by the World Health Organization (WHO) in December 2017 for use in children aged 6 months and older. In 2018, the Navi Mumbai Municipal Corporation (NMMC) in India implemented a public sector TCV campaign targeting all children aged 9 months to 14 years within NMMC boundaries over 2 vaccination phases. We intended to evaluate the impact of the TCV campaign via a single-step wedge design by estimating the incidence of blood culture-confirmed typhoid among children eligible to receive the vaccine in the initial vaccine campaign communities and comparing it with that among children in the delayed campaign communities. However, the COVID-19 pandemic radically altered healthcare-seeking behavior in Navi Mumbai and required us to halt a critical part of field data collection. We revised the original design to a test-negative case-control design to evaluate the impact of the mass vaccination campaign despite the reduced healthcare seeking. We matched test-positive, culture-confirmed typhoid cases with up to 3 test-negative, culture-negative controls by age and date of blood culture and assessed community vaccine campaign phase as an exposure using conditional logistic regression. We found that children with typhoid were 56% less likely to reside in the initial vaccine campaign communities than in the delayed vaccine campaign communities. The findings support the use of TCV mass vaccination campaigns as effective population-based tools to combat typhoid fever.

Biography:

Dr. Qian An is an alum of the Department of Biostatistics and Bioinformatics at Emory University. Dr. An received her PhD in 2014, working with Prof. Jian Kang and Prof. Michael Haber. Currently, she is a mathematical statistician at the Global Immunization Division (GID), Global Health Center, Centers for Disease Control and Prevention. Before GID, she worked as a statistician at the Division of HIV Prevention. Her work has focused on study design and statistical analysis for a variety of studies, including household surveys, evaluation studies and clinical trials.

Flexible Bayesian Product Mixture Models for Vector Autoregressions

Speaker: Joshua Lukemire

Abstract:

Bayesian non-parametric methods based on Dirichlet process mixtures have seen tremendous success in various domains and are appealing because they borrow information by clustering samples that share identical parameters. However, these methods can face hurdles in heterogeneous settings where samples are expected to cluster only along a subset of axes or where clusters of samples share only a subset of identical parameters. We overcome such limitations by developing a novel class of product of Dirichlet process location-scale mixtures that enables independent clustering at multiple scales. First, we develop the approach for independent multivariate data. Subsequently, we generalize it to multivariate time-series data under the framework of multi-subject vector autoregressive (VAR) models, which is our primary focus and which goes beyond parametric single-subject VAR models. We establish posterior consistency and develop efficient posterior computation for implementation. Extensive numerical studies involving VAR models show distinct advantages over competing methods in terms of estimation, clustering, and feature selection accuracy. Our resting state fMRI analysis from the Human Connectome Project reveals connectivity differences between distinct fluid intelligence groups.

Biography:

Dr. Joshua Lukemire is a research assistant professor in the Department of Biostatistics and Bioinformatics at Emory University. His main research focus involves developing and applying statistical techniques for analyzing high-dimensional imaging data, with a specific interest in developing models for identifying brain network differences between clinical groups. He is also interested in other types of imaging, including pediatric applications of near-infrared spectroscopy to monitor red blood cell transfusion efficacy.

Cell-type-specific mapping of enhancers and target genes from single-cell multimodal data

Speaker: Chang Su

Abstract:

Mapping enhancers and target genes in disease-related cell types has provided critical insights into the functional mechanisms of genetic variants identified by genome-wide association studies (GWAS). However, most existing analyses rely on bulk data or cultured cell lines, which may fail to identify cell-type-specific enhancers and target genes. Recently, single-cell multimodal data measuring both gene expression and chromatin accessibility within the same cells have enabled the inference of enhancer-gene pairs in a cell-type-specific and context-specific manner. However, this task is challenged by the data’s high sparsity, sequencing depth variation, and the computational burden of analyzing a large number of enhancer-gene pairs. To address these challenges, we propose scMultiMap, a statistical method that infers enhancer-gene association from sparse multimodal counts using a joint latent-variable model. It adjusts for technical confounding, permits fast moment-based estimation, and provides analytically derived p-values. In systematic analyses of blood and brain data, scMultiMap shows appropriate type I error control and high statistical power, with greater reproducibility across independent datasets and stronger consistency with orthogonal data modalities. Meanwhile, its computational cost is less than 1% of that of existing methods. When applied to single-cell multimodal data from postmortem brain samples from Alzheimer’s disease (AD) patients and controls, scMultiMap gave the highest heritability enrichment in microglia and revealed new insights into the regulatory mechanisms of AD GWAS variants in microglia.

Biography:

Chang Su is an assistant professor in the Department of Biostatistics and Bioinformatics at Emory University. Her research aims to develop statistical methodologies to address important biological questions with single-cell genomics and genetics data.

Bayesian Hierarchical Model for Patient-Specific Abnormal Region Detection

Speaker: Rongjie Liu

Abstract:

Early detection of Alzheimer’s disease (AD) can help in managing the disease better and in delaying its progression. In this study, we propose a Bayesian approach, PARD (patient-specific abnormal region detection), to detect patient-specific diseased regions in AD studies. We formulate the Bayesian hierarchical model used for detecting diseased regions, specify all the prior distributions for the model parameters and hyperparameters, and give the joint posterior distribution. We then present the algorithm for sampling the parameters and hyperparameters from the joint posterior distribution, together with the derivations of the full conditional distributions and the joint probability calculations required to execute it. Finally, we compare the effectiveness of our proposed algorithm with some other popular methods on simulated data and demonstrate its performance on real MRI data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).

Biography:

Dr. Rongjie Liu is currently an assistant professor of Statistics at the University of Georgia. She received her Ph.D. degree in statistics from Rice University, Houston, TX, USA, in 2020. Her current research interests include Bayesian statistics, learning-based methods, and image data analysis.

Re-analysis of Published Clinical Study Data Using Machine Learning Methods

Speaker: Jacqueline Johnson

Abstract:

Data from thousands of clinical studies are available through data repositories associated with the NIH. The vast majority of these studies have been analyzed using traditional methods of statistical inference and do not state prediction as a goal, though interest in using machine learning and predictive modeling methods in clinical studies is growing. We show examples of re-analyzing existing food allergy studies with the goal of predicting subject outcomes.
Analyses focus on supervised machine learning algorithms such as decision trees and ensembles of trees. Unlike traditional methods for statistical inference, such models do not make distributional assumptions and can easily find nonlinear relationships as well as complex interactions between predictors. We show examples where machine learning models identify different important predictors as well as previously unknown nonlinear outcome/predictor relationships.
Two significant barriers to the use of machine learning methods in clinical studies are the requirement of large sample sizes and the need for interpretability of analyses. We explore the success of using k-fold cross-validation to optimize the complexity of models fit to data with the small sample sizes common in clinical studies. We additionally show the benefits of using model interpretability plots, such as partial dependence and variable importance plots, to provide understanding of input-output relationships beyond the associations investigated in the original stated study hypothesis.

Biography:

Jacqueline Johnson holds a DrPH in Biostatistics from UNC Chapel Hill. Her career has focused on statistical analyses of clinical trials data and includes working as a senior biostatistician in the respiratory disease group at Novartis Pharmaceuticals in New Jersey, as an assistant professor in the Biostatistics core of the Psychiatry department at the UNC Chapel Hill School of Medicine, and as a senior research scientist working on food and inhalant allergen projects at Rho, Inc., also in Chapel Hill. She has been teaching for SAS as a contract instructor since 2009 and joined the SAS Global Academic Programs team full-time in 2019. 

Round Table Discussions

Job Opportunities and Learning Resources for SAS Skill

Lead: Jacqueline Johnson

Abstract:

Join us to look at trends in SAS jobs in Atlanta as well as the greater Southeastern United States! We will explore the companies, job titles, and occupations hiring for SAS skills. We will review the free learning portal SAS Skill Builder for Students (https://www.sas.com/skillbuilder), which offers 20+ free e-learning courses for learning SAS programming and SAS Viya software tools.