Date of Award
Doctor of Philosophy (PhD)
A machine learning workflow is the sequence of tasks necessary to implement a machine learning application, including data collection, preprocessing, feature engineering, exploratory analysis, and model training/selection. In this dissertation we propose the Machine Learning Morphism (MLM) as a mathematical framework to describe the tasks in a workflow. The MLM is a tuple consisting of an Input Space, an Output Space, a Learning Morphism, a Parameter Prior, and an Empirical Risk Function. Together, these contain the information necessary to learn the parameters of the learning morphism, which represents a workflow task. In Chapter 1, we give a short review of typical tasks present in a workflow, as well as the motivation for and innovations in the MLM framework.
In Chapter 2, we first define data as realizations of an unknown probability space. Then, after a brief introduction to statistical learning, the MLM is formally defined. Examples of MLMs are presented, including linear regression, standardization, and the Naive Bayes Classifier. Asymptotic equality is defined between MLMs by analyzing the parameters in the limit of infinite training data. Two definitions of composition are proposed: output and structural. Output composition is a sequential optimization of MLMs, for example standardization followed by regression. Structural composition is a joint optimization inspired by backpropagation in neural networks. While structural compositions yield better overall performance, output compositions are easier to compute and interpret.
In Chapter 3, we define the property of separability, where an MLM can be optimized by solving lower-dimensional subproblems. A separable MLM represents a divide-and-conquer strategy for learning without sacrificing optimality. We show three cases of separable MLMs for mean-squared error with increasing complexity. First, if the input space consists of centered, independent random variables, OLS Linear Regression is separable. This is extended to linear combinations of uncorrelated ensembles, and to ensembles of non-linear, uncorrelated learning morphisms. The example of principal component regression is explored thoroughly as a separable workflow, and the choice between equivalent linear regressions is discussed. These separability results apply to a wide variety of problems via asymptotic equality. Functions which can be represented as power series can be learned via polynomial regression. Further, independent and centered power series can be generated using an orthogonal extension of principal component analysis (PCA).
In Chapter 4, we explore the connection between generalization error and lower bounds used in estimation. We start by defining the "Bayes MLM", the best possible MLM for a given problem. When the loss function is mean-squared error, Cramér-Rao lower bounds exist for an MLM which depend on the bias of the MLM and the underlying probability distribution. These bounds can be used as a design tool when selecting candidate MLMs, or as a tool for sensitivity analysis to examine the error of an MLM across a variety of parameterizations. A lower bound on the composition of MLMs is constructed by applying a nonlinear filtering framework to the composition. Examples are presented for centering, PCA, ordinary least-squares linear regression, and the composition of these MLMs.
In Chapter 5, we apply the MLM framework to design a workflow that predicts 30-day hospital readmissions. A hospital readmission occurs when a patient is admitted less than 30 days after a previous hospital stay. We examine readmissions for a group of Medicare/Medicaid patients with the four most common diagnoses at Barnes-Jewish Hospital. Using MLMs, we incorporate the Mapper algorithm from topological data analysis into the predictive workflow in a novel ensemble. This ensemble first performs fuzzy clustering on the training set, and then trains models independently on each cluster. We compare an assortment of workflows predicting readmissions, and workflows featuring Mapper outperform other standard models and the current tools used for risk prediction at Barnes-Jewish. Finally, we examine the separability of this workflow. Mapper workflows incorporating AdaBoost and logistic regression create node models with low correlation. When PCA is applied to each node, Random Forest node models also become decorrelated. Support Vector Machine node models are highly correlated, and do not converge when PCA is applied. This is consistent with their worse performance.
In Chapter 6, we provide final comments and discuss future work.
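The separability claim summarized for Chapter 3 can be illustrated numerically. The following is a minimal sketch, not the dissertation's implementation: it assumes synthetic data, uses NumPy only, and all variable names are hypothetical. It builds a principal component regression as a sequential (output) composition of centering, PCA, and OLS, then checks that the joint regression on the PCA scores gives the same coefficients as one univariate regression per score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration, not the dissertation's data).
n, d = 200, 4
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated inputs
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=n)

# Step 1: centering, fitted on its own before anything downstream.
x_mean = X.mean(axis=0)
y_mean = y.mean()
Xc = X - x_mean
yc = y - y_mean

# Step 2: PCA on the centered inputs. The score columns are centered and
# mutually orthogonal in-sample, i.e. empirically uncorrelated.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T

# Step 3a: joint OLS on all scores at once (one d-dimensional problem).
beta_joint, *_ = np.linalg.lstsq(Z, yc, rcond=None)

# Step 3b: d separate one-dimensional OLS problems, one per score column.
beta_separate = np.array(
    [(Z[:, j] @ yc) / (Z[:, j] @ Z[:, j]) for j in range(d)]
)

# The two solutions should agree up to floating-point error.
max_gap = np.max(np.abs(beta_joint - beta_separate))
print(f"largest coefficient gap: {max_gap:.2e}")
```

The agreement here follows from the Gram matrix of the scores being diagonal, so the normal equations decouple into one equation per component. With the original correlated inputs `Xc` in place of `Z`, the per-column regressions would generally not match the joint fit.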
Patricio S. La Rosa, ShiNung Ching, Ulugbek Kamilov, Neal Patwari,