PhD Defense by Renzhi Wu

Wednesday

March 27, 2024

12:00PM - 2:00PM

Location

TEAMS

Title: User-Centered Programmatic Data Labeling

Date: Wednesday, March 27, 2024

Time: 12:00 – 14:00 ET

Location: Teams Link

Renzhi Wu

Ph.D. Candidate in Computer Science

School of Computer Science

College of Computing

Georgia Institute of Technology

Committee:

Dr. Xu Chu (advisor) – School of Computer Science, Georgia Institute of Technology

Dr. Kexin Rong (co-advisor) – School of Computer Science, Georgia Institute of Technology

Dr. Joy Arulraj – School of Computer Science, Georgia Institute of Technology

Dr. Shamkant Navathe – School of Computer Science, Georgia Institute of Technology

Dr. Yeye He – Data Management, Exploration and Mining Group, Microsoft Research

Abstract:

The lack of labeled training data is a major challenge impeding the practical application of machine learning (ML) techniques. Therefore, ML practitioners have increasingly turned to programmatic supervision methods, in which a larger volume of programmatically generated, but often noisier, labeled examples is used in lieu of hand-labeled examples. In this paradigm, supervision sources are expressed as labeling functions (LFs), and a label model aggregates the output of multiple LFs to produce training labels. However, the current process of developing LFs relies on the expertise of the user and can be inaccessible to non-experts, particularly when dealing with video data. In addition, existing label models require hyperparameters and dataset-specific training and can yield non-deterministic results, further complicating the process for non-expert users.
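To make the paradigm above concrete, the following is a minimal Python sketch of labeling functions and a label model. It is an illustration only and is not taken from the dissertation: the three labeling functions, the toy spam-detection task, and the example texts are all hypothetical, and the label model shown is plain majority vote, a simple baseline rather than the hyper label model the dissertation proposes.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Label convention assumed for this sketch: -1 means the LF abstains.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

LabelingFunction = Callable[[str], int]


# Hypothetical labeling functions for a toy spam-detection task.
def lf_contains_free(text: str) -> int:
    return POSITIVE if "free" in text.lower() else ABSTAIN


def lf_many_exclamations(text: str) -> int:
    return POSITIVE if text.count("!") >= 3 else ABSTAIN


def lf_short_greeting(text: str) -> int:
    is_greeting = text.lower().startswith(("hi", "hello"))
    return NEGATIVE if is_greeting and len(text) < 40 else ABSTAIN


def apply_lfs(lfs: Sequence[LabelingFunction], examples: Sequence[str]) -> List[List[int]]:
    """Build the label matrix: one row per example, one column per LF."""
    return [[lf(x) for lf in lfs] for x in examples]


def majority_vote(votes: Sequence[int]) -> int:
    """Baseline label model: most common non-abstain vote; ties abstain."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between classes: stay deterministic
    return ranked[0][0]


LFS = [lf_contains_free, lf_many_exclamations, lf_short_greeting]
EXAMPLES = [
    "Win a FREE prize now!!!",
    "Hi, are we still meeting at noon?",
    "Quarterly report attached.",
]
LABEL_MATRIX = apply_lfs(LFS, EXAMPLES)
TRAINING_LABELS = [majority_vote(row) for row in LABEL_MATRIX]
```

In this sketch the label model has no hyperparameters and needs no training, but it also weights every LF equally regardless of accuracy, which is the kind of limitation that learned label models aim to address.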

This dissertation aims to improve the usability of programmatic data labeling through a three-part research approach. First, I explore how to improve usability by specializing programmatic data labeling to the task at hand. Using entity matching as a case study, I develop a specialized integrated development environment (IDE) that facilitates the development, debugging, aggregation, and management of LFs. Second, I extend the labeling function interface with a visual interface that allows users to create LFs for video data intuitively, without any coding. Specifically, I propose a visual query language for retrieving video clips across datasets, enabling non-expert users to develop LFs through mouse drag-and-drop interactions. Third, to obviate user involvement in the label model, I present a hyper label model that requires neither hyperparameters nor dataset-specific training, while producing deterministic results with superior accuracy and efficiency. The proposed method also offers the first analytical optimal solution to the problem.
