Title: User-Centered Programmatic Data Labeling
Date: Wednesday, March 27, 2024
Time: 12:00 – 14:00 EDT
Location: Teams Link
Renzhi Wu
Ph.D. Candidate in Computer Science
School of Computer Science
College of Computing
Georgia Institute of Technology
Committee:
Dr. Xu Chu (advisor) – School of Computer Science, Georgia Institute of Technology
Dr. Kexin Rong (co-advisor) – School of Computer Science, Georgia Institute of Technology
Dr. Joy Arulraj – School of Computer Science, Georgia Institute of Technology
Dr. Shamkant Navathe – School of Computer Science, Georgia Institute of Technology
Dr. Yeye He – Data Management, Exploration and Mining Group, Microsoft Research
Abstract:
The lack of labeled training data is a major challenge impeding the practical application of machine learning (ML) techniques. ML practitioners have therefore increasingly turned to programmatic supervision methods, in which a larger volume of programmatically generated, but often noisier, labeled examples is used in lieu of hand-labeled examples. In this paradigm, supervision sources are expressed as labeling functions (LFs), and a label model aggregates the outputs of multiple LFs to produce training labels. However, the current process of developing LFs relies on the user's expertise and can be inaccessible to non-experts, particularly when dealing with video data. In addition, existing label models require hyperparameters and dataset-specific training and can yield non-deterministic results, further complicating the process for non-expert users.
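To make the paradigm concrete, the following is a minimal sketch of labeling functions and label aggregation, not the method proposed in this dissertation. The spam-detection task, the three LFs, and every name in the code are hypothetical, and plain majority voting stands in for a label model as the simplest possible baseline.

```python
from collections import Counter

# Hypothetical label values for an illustrative spam-detection task.
ABSTAIN = -1  # returned by an LF that has no opinion on an example
HAM, SPAM = 0, 1


def lf_contains_link(text):
    """Vote SPAM if the text contains a URL-like token, else abstain."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    """Vote SPAM if the text mentions a prize, else abstain."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text):
    """Vote HAM for very short messages (three words or fewer), else abstain."""
    return HAM if len(text.split()) <= 3 else ABSTAIN


LFS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def apply_lfs(texts, lfs=LFS):
    """Build the label matrix: one row per example, one column per LF."""
    return [[lf(text) for lf in lfs] for text in texts]


def majority_vote(votes):
    """Baseline aggregation: most common non-abstain vote, abstaining on ties."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN  # every LF abstained
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the top two labels are tied
    return ranked[0][0]


texts = [
    "Claim your prize at http://example.com",
    "hi there",
    "Meeting moved to 3pm tomorrow afternoon",
]
label_matrix = apply_lfs(texts)
labels = [majority_vote(row) for row in label_matrix]
```

Majority voting has no hyperparameters and is deterministic, but it weights every LF equally regardless of its accuracy, which is the kind of limitation learned label models aim to address.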
This dissertation aims to improve the usability of programmatic data labeling through a three-part research approach. First, I explore how to improve usability by specializing programmatic data labeling to the task at hand. Using entity matching as a case study, I develop a specialized integrated development environment (IDE) that facilitates the development, debugging, aggregation, and management of LFs. Second, I extend the labeling function interface with a visual interface that allows users to create LFs for video data intuitively, without any coding. Specifically, I propose a visual query language for retrieving video clips across datasets, enabling non-expert users to develop LFs easily through drag-and-drop mouse interactions. Third, to obviate user involvement in the label model, I present a hyper label model that requires neither hyperparameters nor dataset-specific training, while producing deterministic results with superior accuracy and efficiency. The proposed method also offers the first analytical optimal solution to the problem.