Name: Sheng-Tao Yang
Thesis Title: Analysis of High-Dimensional Data with Variable Clustering and Selection
Thesis Committee:
Dr. Jye-Chyi Lu (advisor), Industrial and Systems Engineering, Georgia Institute of Technology
Dr. Yajun Mei, Industrial and Systems Engineering, Georgia Institute of Technology
Dr. Roshan Joseph, Industrial and Systems Engineering, Georgia Institute of Technology
Dr. Xiao Liu, Industrial and Systems Engineering, Georgia Institute of Technology
Dr. Gian-Gabriel Garcia, Industrial and Systems Engineering, Georgia Institute of Technology
Date: October 19th (Thursday), 2023
Time: 1:00 PM - 2:30 PM EST
Location: Groseclose #226
Meeting Link (Zoom): https://gatech.zoom.us/j/96429629844
Abstract
Chapter 1 presents a decision-making procedure, Human-in-the-Loop Clustering and Representative Selection (HITL-CARS), that incorporates users’ domain knowledge into the analysis of high-dimensional datasets. The proposed method simultaneously clusters strongly linearly correlated variables and estimates a linear regression model using only a few selected cluster representatives and independent variables. After users review the CARS results and offer advice based on their domain knowledge, HITL-CARS refines the analysis to account for their inputs. To optimize the CARS and HITL-CARS procedures, an algorithm is provided for solving the underlying penalized-likelihood mixed-integer programming problem. Simulation studies compare the performance of CARS with that of two-stage variable clustering and selection methods. A real-life brain-mapping example shows that HITL-CARS can aid in discovering important brain regions associated with depression symptoms and provide predictive analytics based on cluster representatives.
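For readers unfamiliar with the two-stage "cluster, then select" baselines that CARS is compared against, the Python sketch below illustrates one generic variant: variables are first grouped by hierarchical clustering of their correlation matrix, one representative is taken from each cluster, and a lasso regression is then fit on the representatives. The correlation threshold, the medoid-style choice of representative, and the function name are illustrative assumptions; this is not the CARS procedure itself.

# Minimal sketch of a generic two-stage "cluster, then select" baseline
# (an illustrative assumption, not the CARS procedure itself).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import LassoCV

def two_stage_baseline(X, y, corr_threshold=0.95):
    # Stage 1: cluster strongly correlated variables.
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)                       # distance = 1 - |correlation|
    Z = linkage(dist[np.triu_indices_from(dist, k=1)], method="average")
    labels = fcluster(Z, t=1.0 - corr_threshold, criterion="distance")

    # Pick one representative per cluster: the variable most correlated
    # with the other members of its cluster (a medoid-style choice).
    reps = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        scores = np.abs(corr[np.ix_(members, members)]).sum(axis=1)
        reps.append(members[np.argmax(scores)])
    reps = np.array(reps)

    # Stage 2: variable selection on the representatives via lasso.
    model = LassoCV(cv=5).fit(X[:, reps], y)
    return reps, model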
Chapter 2 studies large-sample properties of an adaptive Clustering and Representative Selection (aCARS) procedure in ultrahigh-dimensional scenarios, where the number of variables grows exponentially with the sample size. This chapter investigates the conditions under which aCARS consistently selects important representatives and variables, consistently recovers the true clusters, and achieves oracle properties for regression parameter estimation. Moreover, because aCARS uses cluster information to reduce the dimensionality of the variable space, the chapter explores how large the dimensionality can be while the large-sample properties of aCARS are preserved. Lastly, since aCARS does not select multiple variables from any single cluster, the chapter investigates how aCARS relaxes the usual conditions on multicollinearity.
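As a point of reference, one common way to formalize "the dimensionality grows exponentially with the sample size" and the oracle property in the sparse-regression literature is sketched below; the rates and notation are stated only for illustration and are not the exact conditions of Chapter 2.

\[
\log p_n = O(n^{\delta}) \quad \text{for some } 0 < \delta < 1,
\]
\[
P\!\left(\{\, j : \hat{\beta}_j \neq 0 \,\} = \mathcal{A}\right) \to 1,
\qquad
\sqrt{n}\,\bigl(\hat{\beta}_{\mathcal{A}} - \beta^{*}_{\mathcal{A}}\bigr) \xrightarrow{d} N\!\bigl(0, \Sigma_{\mathcal{A}}\bigr),
\]

where \(\mathcal{A} = \{ j : \beta^{*}_j \neq 0 \}\) is the true support; that is, the estimator behaves as if the true sparse model were known in advance.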
In summary, Chapter 1 develops HITL-CARS, which incorporates users’ domain knowledge to refine CARS for analyzing high-dimensional data with small sample sizes, and Chapter 2 investigates the asymptotic behavior of aCARS when the dimensionality grows exponentially with the sample size. However, the finite-sample properties of CARS and aCARS have not yet been examined, so practitioners do not know the practical advantages of using CARS to analyze high-dimensional data.
Chapter 3 systematically investigates the finite-sample performance of CARS and aCARS, focusing on their ability to handle ultrahigh dimensionality and strong multicollinearity and on the importance of hyperparameter tuning. We conduct a series of simulation studies and real-world data analyses that compare CARS and aCARS with popular variable selection methods such as the lasso, the adaptive lasso, SCAD, and MCP. In particular, the simulation settings focus on ultrahigh-dimensional and strongly multicollinear data, where the number of variables grows exponentially with the sample size and some variables exhibit Pearson correlations exceeding 0.95. Moreover, we provide practical guidance on using the High-dimensional Bayesian Information Criterion (HBIC) to tune hyperparameters efficiently in the CARS and aCARS procedures. To assess performance, we report evaluation metrics for (i) clustering, (ii) representative and variable selection, and (iii) prediction, so that the simulation results demonstrate the strengths of CARS and aCARS systematically. In summary, Chapter 3 demonstrates the applicability of CARS and aCARS to real-world (finite-sample) data characterized by ultrahigh dimensionality, multicollinearity, and the need for hyperparameter tuning, thereby offering practical insights for statisticians and data analysts. Our studies reveal that competing methods struggle with ultrahigh dimensionality and strong multicollinearity, whereas aCARS consistently clusters strongly correlated variables, selects important variables, and excludes unimportant ones, resulting in the lowest prediction error.
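To make the simulation setting concrete, the Python sketch below generates strongly multicollinear data of the kind described above (groups of variables sharing a latent factor, with within-group correlations above 0.95) and tunes a lasso penalty with an HBIC-style criterion of the form log(RSS/n) + |S| * log(log n) * log(p) / n, following the commonly used high-dimensional BIC. The group structure, penalty grid, and lasso stand-in are illustrative assumptions and do not reproduce the exact Chapter 3 design or the CARS/aCARS procedures.

# Illustrative sketch (assumed setup, not the exact Chapter 3 design).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, n_groups, group_size = 100, 50, 10
p = n_groups * group_size                       # p >> n

# Each group shares a latent factor, producing within-group correlations above 0.95.
Z = rng.normal(size=(n, n_groups))
X = np.repeat(Z, group_size, axis=1) + 0.15 * rng.normal(size=(n, p))
beta = np.zeros(p)
beta[[0, group_size, 2 * group_size]] = [2.0, -1.5, 1.0]   # one active variable in three groups
y = X @ beta + rng.normal(size=n)

def hbic(model, X, y, p):
    # HBIC-style criterion: log(RSS/n) + |S| * log(log n) * log(p) / n.
    resid = y - model.predict(X)
    s = np.count_nonzero(model.coef_)
    return np.log(resid @ resid / len(y)) + s * np.log(np.log(len(y))) * np.log(p) / len(y)

# Grid search over the lasso penalty using the HBIC-style criterion.
lams = np.logspace(-3, 1, 30)
fits = [Lasso(alpha=lam, max_iter=10000).fit(X, y) for lam in lams]
best = min(fits, key=lambda m: hbic(m, X, y, p))
print("selected penalty:", best.alpha, "| nonzero coefficients:", np.count_nonzero(best.coef_))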