Date of Award
Summer 9-13-2023
Degree Name
Doctor of Philosophy (PhD)
Degree Type
Dissertation
Abstract
In both planning and reinforcement learning, an interactive environment is a dynamic system with which an agent interacts to achieve its goals. In planning, the agent uses its knowledge of the environment to generate a plan offline and then executes it in the interactive environment, adapting if necessary. In reinforcement learning, the agent learns and refines its policy while actively engaging with the environment, receiving feedback in the form of rewards or penalties and balancing exploration and exploitation. However, when the environment becomes large and complex, the number of possible states and actions can become overwhelming. Such environments are referred to as large interactive domains, and they pose a significant challenge for learning algorithms. Their complexity can cause the agent to take a long time to learn, if it learns at all, because it must explore a vast number of possible actions and states before it can converge to an optimal policy. In addition, the computational cost of learning a description of these large domains can be high, making it difficult to train the agent in a reasonable amount of time.

This dissertation explores learning in large interactive domains in two different settings and learning types: offline planning domains, where the domains' models are unknown, and online reinforcement learning domains, where the domains are known but the optimal policies and reward distributions are not. In the first part of the dissertation, we introduce the model-free planning problem and propose the first safe model-free learning algorithm for learning the lifted domain's action model. We then prove the correctness of our approach and provide an experimental evaluation demonstrating its efficiency in learning action models for several large domains. In the second part, we extend the safe model-free planning problem to a setting with partial observability and propose two algorithms for learning action models that are guaranteed to be safe even with partially observable states. We theoretically analyze the safety of each algorithm and the sample complexity required to guarantee probabilistic completeness, and we show through experimental results that the proposed algorithms generally outperform a state-of-the-art action model learning algorithm. In the final part of the dissertation, we turn to online learning in large interactive domains, specifically the combinatorial multi-armed bandit (CMAB) problem, and propose a Sum-of-Squares (SoS) approach to solve an instance of this problem, the route planning problem. We prove the identifiability of the SoS program and design an experiment to evaluate the proposed algorithm and compare it with existing CMAB approaches to route planning.
Language
English (en)
Chair
Brendan Juba