Abstract
As artificial intelligence systems are increasingly deployed in safety-critical domains such as robotics, autonomous driving, and decision-making under uncertainty, ensuring their trustworthiness has become a central challenge. This dissertation addresses two major facets of trustworthy AI: reliable control with formal guarantees and value alignment in the presence of adversarial manipulation.

The first part of the dissertation focuses on provably stable, robust, and safe control of nonlinear systems using learning-based methods. We introduce the first general framework for synthesizing neural Lyapunov controllers for discrete-time systems. The method combines a sound verifier based on mixed-integer linear programming with gradient-based counterexample generation to efficiently learn control policies that satisfy formal stability conditions. We extend this framework to adversarial settings in which state observations are perturbed, developing verification and training techniques that produce controllers robust to both persistent and intermittent attacks. We then propose a method for verified safe reinforcement learning in neural dynamical systems that combines finite-horizon reachability analysis with curriculum learning; the approach achieves strong safety guarantees across multiple benchmarks while preserving task performance. To improve robustness in environments with high-dimensional perceptual inputs, we develop a novel curriculum-based adversarial training framework that significantly enhances the robustness of deep reinforcement learning against large input perturbations. Finally, we introduce a partially supervised reinforcement learning framework that enables safety certification in partially observable environments by leveraging access to interpretable low-dimensional states during training.

The second part of the dissertation investigates how AI systems that learn from human preferences can be manipulated. We model election control through manipulation of voter perceptions using spatial voting theory and characterize the computational complexity of such attacks under various assumptions. Building on this insight, we study preference poisoning attacks on reward models, a core component of value-aligned AI systems, including those trained with reinforcement learning from human feedback. We develop and evaluate both gradient-based and heuristic attacks and show that they achieve high success rates even with minimal data poisoning, across domains ranging from autonomous control to large language model alignment.

Together, these contributions offer principled methods for building AI systems that are stable, robust, safe, and aligned with human values, laying a foundation for future progress in trustworthy autonomy.
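To make the counterexample-guided synthesis idea concrete, the sketch below shows a minimal learner-falsifier loop for a neural Lyapunov candidate on a discrete-time system. It is illustrative only and not the dissertation's implementation: the simple linear dynamics, the fixed gain `K`, and all function names are hypothetical, and a gradient-based falsifier stands in for the sound mixed-integer-linear-programming verifier described in the abstract.

```python
# Illustrative sketch (assumptions noted above), not the dissertation's method.
# Learner-falsifier loop for a neural Lyapunov candidate V(x) on x_{k+1} = f(x_k, u_k).
import torch
import torch.nn as nn

torch.manual_seed(0)

A = torch.tensor([[1.0, 0.1], [0.0, 1.0]])   # hypothetical double-integrator-like dynamics
B = torch.tensor([[0.005], [0.1]])
K = torch.tensor([[1.0, 1.8]])               # hypothetical stabilizing feedback gain

def step(x):
    """One step of the closed-loop discrete-time system under u = -Kx."""
    u = -(x @ K.T)
    return x @ A.T + u @ B.T

class LyapunovNet(nn.Module):
    """Candidate V(x) >= 0 with V(0) = 0 by construction."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
    def forward(self, x):
        return (self.net(x) - self.net(torch.zeros(1, 2))).pow(2).sum(-1)

def violation(V, x, margin=1e-3):
    """Positive where the decrease condition V(f(x)) - V(x) <= -margin*|x|^2 fails."""
    return V(step(x)) - V(x) + margin * x.pow(2).sum(-1)

V = LyapunovNet()
opt = torch.optim.Adam(V.parameters(), lr=1e-3)
buffer = 4.0 * (torch.rand(512, 2) - 0.5)    # training states in [-2, 2]^2

for it in range(2000):
    # Falsifier: gradient ascent on the violation to search for counterexamples
    # (the dissertation instead uses a sound MILP verifier for this step).
    cex = (4.0 * (torch.rand(64, 2) - 0.5)).requires_grad_(True)
    for _ in range(20):
        loss_cex = -violation(V, cex).sum()
        (grad,) = torch.autograd.grad(loss_cex, cex)
        cex = (cex - 0.1 * grad).clamp(-2.0, 2.0).detach().requires_grad_(True)
    bad = cex.detach()[violation(V, cex.detach()) > 0]
    if len(bad):
        buffer = torch.cat([buffer, bad])[-4096:]

    # Learner: push the decrease condition to hold on sampled and counterexample states.
    opt.zero_grad()
    loss = torch.relu(violation(V, buffer)).mean()
    loss.backward()
    opt.step()
    if it % 500 == 0:
        print(f"iter {it}: mean training violation {loss.item():.4f}")
```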
Degree
Doctor of Philosophy (PhD)
Author's Department
Computer Science & Engineering
Document Type
Dissertation
Date of Award
5-9-2025
Language
English (en)
DOI
https://doi.org/10.7936/hx05-s057
Recommended Citation
Wu, Junlin, "Trustworthy Autonomy Through Robust Control and Alignment" (2025). McKelvey School of Engineering Theses & Dissertations. 1257.
The definitive version is available at https://doi.org/10.7936/hx05-s057