Abstract
As artificial intelligence systems are increasingly deployed in safety-critical domains such as robotics, autonomous driving, and decision-making under uncertainty, ensuring their trustworthiness has become a central challenge. This dissertation addresses two major facets of trustworthy AI: reliable control with formal guarantees and value alignment in the presence of adversarial manipulation.

The first part of the dissertation focuses on provably stable, robust, and safe control of nonlinear systems using learning-based methods. We introduce the first general framework for synthesizing neural Lyapunov controllers for discrete-time systems. The method combines a sound verifier based on mixed-integer linear programming with gradient-based counterexample generation to efficiently learn control policies that satisfy formal stability conditions. We extend this framework to adversarial settings in which state observations are perturbed, developing verification and training techniques that produce controllers robust to both persistent and intermittent attacks. We then propose a method for verified safe reinforcement learning in neural dynamical systems that combines finite-horizon reachability analysis with curriculum learning; the approach achieves strong safety guarantees across multiple benchmarks while preserving task performance. To improve robustness in environments with high-dimensional perceptual inputs, we develop a novel curriculum-based adversarial training framework that significantly enhances the robustness of deep reinforcement learning against large input perturbations. Finally, we introduce a partially supervised reinforcement learning framework that enables safety certification in partially observable environments by leveraging access to interpretable low-dimensional states during training.

The second part of the dissertation investigates how AI systems that learn from human preferences can be manipulated. We model election control through manipulation of voter perceptions using spatial voting theory and characterize the computational complexity of such attacks under various assumptions. Building on this insight, we study preference poisoning attacks on reward models, a core component of value-aligned AI systems, including those trained with reinforcement learning from human feedback. We develop and evaluate both gradient-based and heuristic attacks and show that they achieve high success rates even with minimal data poisoning, across domains ranging from autonomous control to large language model alignment.

Together, these contributions offer principled methods for building AI systems that are stable, robust, safe, and aligned with human values, laying a foundation for future progress in trustworthy autonomy.
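To make the counterexample-guided synthesis idea concrete, the sketch below shows a minimal learner-falsifier loop for a neural Lyapunov candidate on a discrete-time system. It is illustrative only and not the dissertation's implementation: the simple linear dynamics, the fixed gain `K`, and all function names are hypothetical, and a gradient-based falsifier stands in for the sound mixed-integer-linear-programming verifier described in the abstract.

```python
# Illustrative sketch (assumptions noted above), not the dissertation's method.
# Learner-falsifier loop for a neural Lyapunov candidate V(x) on x_{k+1} = f(x_k, u_k).
import torch
import torch.nn as nn

torch.manual_seed(0)

A = torch.tensor([[1.0, 0.1], [0.0, 1.0]])   # hypothetical double-integrator-like dynamics
B = torch.tensor([[0.005], [0.1]])
K = torch.tensor([[1.0, 1.8]])               # hypothetical stabilizing feedback gain

def step(x):
    """One step of the closed-loop discrete-time system under u = -Kx."""
    u = -(x @ K.T)
    return x @ A.T + u @ B.T

class LyapunovNet(nn.Module):
    """Candidate V(x) >= 0 with V(0) = 0 by construction."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
    def forward(self, x):
        return (self.net(x) - self.net(torch.zeros(1, 2))).pow(2).sum(-1)

def violation(V, x, margin=1e-3):
    """Positive where the decrease condition V(f(x)) - V(x) <= -margin*|x|^2 fails."""
    return V(step(x)) - V(x) + margin * x.pow(2).sum(-1)

V = LyapunovNet()
opt = torch.optim.Adam(V.parameters(), lr=1e-3)
buffer = 4.0 * (torch.rand(512, 2) - 0.5)    # training states in [-2, 2]^2

for it in range(2000):
    # Falsifier: gradient ascent on the violation to search for counterexamples
    # (the dissertation instead uses a sound MILP verifier for this step).
    cex = (4.0 * (torch.rand(64, 2) - 0.5)).requires_grad_(True)
    for _ in range(20):
        loss_cex = -violation(V, cex).sum()
        (grad,) = torch.autograd.grad(loss_cex, cex)
        cex = (cex - 0.1 * grad).clamp(-2.0, 2.0).detach().requires_grad_(True)
    bad = cex.detach()[violation(V, cex.detach()) > 0]
    if len(bad):
        buffer = torch.cat([buffer, bad])[-4096:]

    # Learner: push the decrease condition to hold on sampled and counterexample states.
    opt.zero_grad()
    loss = torch.relu(violation(V, buffer)).mean()
    loss.backward()
    opt.step()
    if it % 500 == 0:
        print(f"iter {it}: mean training violation {loss.item():.4f}")
```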
Degree
Doctor of Philosophy (PhD)
Author's Department
Computer Science & Engineering
Document Type
Dissertation
Date of Award
5-9-2025
Language
English (en)
DOI
https://doi.org/10.7936/hx05-s057
Recommended Citation
Wu, Junlin, "Trustworthy Autonomy Through Robust Control and Alignment" (2025). McKelvey School of Engineering Theses & Dissertations. 1257.
The definitive version is available at https://doi.org/10.7936/hx05-s057