Title: Robust and Flexible Reward Modeling for LLM Alignment
Date: April 21st, 2025
Time: 11:00 am – 1:00 pm (EST)
Location: ISyE Main 224
Zoom link: https://gatech.zoom.us/j/91835542508
Alexander Bukharin
Machine Learning PhD Candidate
H. Milton Stewart School of Industrial and Systems Engineering
Georgia Institute of Technology
Committee
1. Dr. Tuo Zhao (ISyE, Georgia Tech) (Advisor)
2. Dr. Chao Zhang (CSE, Georgia Tech)
3. Dr. Bo Dai (CSE, Georgia Tech)
4. Dr. Sen Na (ISyE, Georgia Tech)
5. Dr. Olivier Delalleau (NVIDIA)
Abstract
As large language models grow increasingly capable, ensuring their alignment with human values is of utmost importance. One of the most promising ways to align language models is to design a reward function that measures alignment with human values and train the language model to maximize this reward. In this thesis, we focus on two approaches to reward design: reward design from external feedback signals and reward learning from human-annotated datasets. In the first chapter, we develop a reward design framework, HERON, that eases reward function design by exploiting hierarchical relationships between feedback signals. In the second chapter, we propose an algorithm to learn reward functions from datasets with corrupted human annotations. In the last chapter, we develop an adversarial attack approach that automatically discovers flaws in state-of-the-art reward functions, and then use these attacks to train more robust reward models. Altogether, these contributions advance the scalability and robustness of reward modeling.