Thesis Title: Sample Complexity of Reinforcement Learning Algorithms with a Focus on Policy Space Methods
Thesis Committee:
Dr. Siva Theja Maguluri (Advisor), Industrial and Systems Engineering, Georgia Institute of Technology
Dr. Guanghui (George) Lan, Industrial and Systems Engineering, Georgia Institute of Technology
Dr. Ashwin Pananjady, Industrial and Systems Engineering, Georgia Institute of Technology
Dr. Justin Romberg, Electrical and Computer Engineering, Georgia Institute of Technology
Dr. Daniel Russo, Columbia Business School, Columbia University
Date and Time: Monday, Nov 20th, 2 pm (EST)
In-Person Location: ISyE Executive Boardroom (Groseclose 402)
Meeting Link: https://gatech.zoom.us/j/93338819620
Abstract:
In this thesis, we develop fast reinforcement learning algorithms with finite-sample complexity guarantees. The work is divided into two main parts. In the first part, we investigate stochastic approximation and establish finite-sample complexity bounds in two settings: federated stochastic approximation and two-time-scale linear stochastic approximation with Markovian noise. In the former, we develop the FedSAM algorithm, in which multiple agents collaborate to solve a fixed-point equation via stochastic approximation with Markovian noise, and we show that FedSAM achieves a linear speedup with respect to the number of agents while incurring only a constant communication cost. In the latter, we establish tight finite-time bounds for two-time-scale linear stochastic approximation with Markovian noise.
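For context, a single-time-scale stochastic approximation iteration driven by Markovian noise $\{Y_k\}$ seeks a fixed point $x^*$ of an averaged operator and can be written as (the notation below is illustrative, not necessarily that of the thesis)
$$x_{k+1} = x_k + \alpha_k \big( F(x_k, Y_k) - x_k \big),$$
and, roughly speaking, the federated setting has each agent run such an iteration on its own Markovian data, with only occasional synchronization of the local iterates. One common formulation of two-time-scale linear stochastic approximation couples a fast iterate $x_k$ and a slow iterate $y_k$,
$$x_{k+1} = x_k + \alpha_k \big( b_1(Y_k) - A_{11}(Y_k)\, x_k - A_{12}(Y_k)\, y_k \big), \qquad y_{k+1} = y_k + \beta_k \big( b_2(Y_k) - A_{21}(Y_k)\, x_k - A_{22}(Y_k)\, y_k \big),$$
with step sizes satisfying $\beta_k / \alpha_k \to 0$.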
The second part focuses on finite-time bounds for reinforcement learning algorithms, with an emphasis on policy space methods. First, we consider the two-time-scale natural actor-critic algorithm with on-policy data, for which we establish an $\mathcal{O}(\epsilon^{-6})$ sample complexity for convergence to the global optimum. Next, we study a two-loop natural actor-critic algorithm with off-policy sampling, for which we establish an $\mathcal{O}(\epsilon^{-3})$ sample complexity, improving upon the two-time-scale counterpart.
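As a prototypical example (illustrative rather than the exact scheme analyzed here), a natural actor-critic step under a softmax policy parameterization updates the policy multiplicatively using a critic estimate $\hat{Q}^{\pi_t}$ of the action-value function,
$$\pi_{t+1}(a \mid s) \;\propto\; \pi_t(a \mid s)\, \exp\!\big( \eta\, \hat{Q}^{\pi_t}(s,a) \big),$$
where $\eta > 0$ is the actor step size. Roughly, the two-time-scale variant updates the critic on the same on-policy stream with a faster step size, whereas the two-loop variant runs an inner critic loop (here with off-policy sampling) before each actor update.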
To further improve the sample complexity of natural actor-critic methods, we analyze the Actor and Critic components separately. For the Critic, we study federated TD-learning and TD-learning with Polyak averaging: for the former we show a linear speedup, and for the latter we establish a tight finite-time bound. Furthermore, we establish a tight finite-time convergence bound for the TDC algorithm. For the Actor, we demonstrate linear and superlinear convergence rates for the natural policy gradient.
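As an illustration of the Critic component (the feature map $\phi$, step size $\alpha_k$, and discount factor $\gamma$ below are generic placeholders), TD-learning with linear function approximation updates a parameter $\theta_k$ along the temporal-difference error, and the Polyak-averaged variant reports the running average of the iterates:
$$\theta_{k+1} = \theta_k + \alpha_k \big( r_k + \gamma\, \phi(s_{k+1})^\top \theta_k - \phi(s_k)^\top \theta_k \big)\, \phi(s_k), \qquad \bar{\theta}_K = \frac{1}{K} \sum_{k=1}^{K} \theta_k.$$
The federated variant runs such updates in parallel across agents with periodic averaging, while TDC maintains an auxiliary iterate that corrects the update direction, giving it a two-time-scale structure of the kind studied in the first part.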