Safe from the Start: Developing Pro-Social AI Training Datasets Through Data Workers' Critical Perspectives
Annabel Rothschild
Ph.D. Student in Human-centered Computing
School of Interactive Computing
Georgia Institute of Technology
Date: 01 April 2024
Time: 09-11am EST
Location: TSRB #223, virtual link
Committee:
Dr. Betsy DiSalvo (advisor), School of Interactive Computing, Georgia Institute of Technology
Dr. Carl DiSalvo (advisor), School of Interactive Computing, Georgia Institute of Technology
Dr. Shaowen Bardzell, School of Interactive Computing, Georgia Institute of Technology
Dr. Ellen Zegura, School of Computer Science, Georgia Institute of Technology
Dr. Richmond Wong, School of Literature, Media, and Communication, Georgia Institute of Technology
Dr. Lauren Klein, Department of English, Emory University
Abstract:
AI and ML systems are increasingly ubiquitous, with recent advances in LLMs and image generators, such as OpenAI’s ChatGPT and DALL·E, creating new urgency in future-of-work conversations. My work explores how the massive datasets used to train these systems, collected and curated by a global workforce of data workers, come into being. Specifically, I examine what a data worker's perspective and lived experience contribute to the data labors they perform.
In my research, I build ways to integrate data workers' perspectives into dataset development at two levels. At the micro level, I trace the impact of CDL on worker perspective and dataset development, and propose a tool to help solidify that process in spreadsheet-based data work. At the macro level, I advocate for more pro-social treatment of data workers on digital task platforms, such as Amazon MTurk, emphasizing that the benefits of CDL cannot be realized unless data workers are made full partners in the AI and ML system development process. My past work includes defining the terms of pro-social task building for data work requesters on platforms like Amazon MTurk, and understanding how requesters conceptualize workers on these platforms. My proposed work for this macro thread is an investigation of the current communication infrastructure of these platforms, and how it can be leveraged to support the inclusion of data work observation and reflection on tasks completed.