Title: From Multimodal LLMs to Generalist Embodied Agents

 

Date: Wednesday, May 7th, 2025

Time: 12:30pm - 2:00pm EST

Location: https://gatech.zoom.us/j/99896191332

 

Andrew Szot

Machine Learning PhD Student

School of Interactive Computing

Georgia Institute of Technology

 

Committee

1. Dr. Zsolt Kira (Advisor) - School of Interactive Computing, Georgia Institute of Technology

2. Dr. Dhruv Batra (Advisor) - School of Interactive Computing, Georgia Institute of Technology

3. Dr. Sehoon Ha - School of Interactive Computing, Georgia Institute of Technology

4. Dr. Larry Heck - School of Interactive Computing, Georgia Institute of Technology

5. Dr. Alexander Toshev - Apple

 

Abstract

This thesis investigates how finetuning multimodal large language models (MLLMs) with large-scale embodied experience enables them to act as generalist embodied agents. While MLLMs provide reasoning abilities and broad world knowledge, finetuning aligns these capabilities with the demands of embodied agents, allowing them to act autonomously across diverse tasks and domains. The first part of the thesis introduces Habitat 2.0, a platform for generating large-scale embodied experience in interactive, simulated 3D environments. The second part presents a method for adapting MLLMs into generalizable embodied policies, along with a study of how best to ground these MLLM policies in embodied action spaces. Finally, the thesis proposes the Generalist Embodied Agent (GEA), a single model that generalizes to unseen tasks across diverse domains, and highlights the importance of online reinforcement learning (RL) for developing a capable generalist agent. Overall, this research aims to leverage the strengths of MLLMs to build embodied agents that generalize well.