Author's School

Graduate School of Arts & Sciences

Author's Department/Program



English (en)

Date of Award

January 2009

Degree Type


Degree Name

Doctor of Philosophy (PhD)

Chair and Committee

Nan Lin


Due to their size and complexity, massive data sets pose many computational challenges for statistical analysis, such as overcoming memory limitations and improving the computational efficiency of traditional statistical methods. In this dissertation, I propose the statistical aggregation strategy to meet these challenges. Statistical aggregation partitions the entire data set into smaller subsets, compresses each subset into low-dimensional summary statistics, and aggregates the summary statistics to approximate the desired computation on the entire data set. Results from statistical aggregation are required to be asymptotically equivalent to those computed from the entire data set. Because statistical aggregation processes the data set part by part, it overcomes the memory limitation. It can also improve the computational efficiency of statistical algorithms whose computational complexity is of order O(N^m) (m > 1) or higher, where N is the size of the data. Statistical aggregation is particularly useful for online analytical processing (OLAP) in data cubes and stream data, where fast response to queries is the top priority. The "partition-compression-aggregation" strategy has been considered previously for OLAP computing in data cubes. However, existing research in this area tends to overlook the statistical properties of the analysis and aims to obtain identical results from aggregation, which has limited the application of this strategy to very simple analyses. Statistical aggregation can instead support OLAP in more sophisticated statistical analyses. In this dissertation, I apply statistical aggregation to two large families of statistical methods, estimating equation (EE) estimation and U-statistics, develop proper compression-aggregation schemes, and show that statistical aggregation greatly reduces their computational burden while maintaining their efficiency.
I further apply statistical aggregation to U-statistic-based estimating equations and propose new estimating equations that need much less computational time but give asymptotically equivalent estimators.
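
The partition-compression-aggregation idea described above can be sketched on a deliberately simple case. The example below is an illustration only and is not taken from the dissertation: it fits a one-predictor least-squares line, where each block compresses to five sums and the aggregated fit happens to match the full-data fit exactly. The function names (`partition`, `compress`, `aggregate`, `fit_line`) and the toy data are hypothetical. For the estimating equation and U-statistic settings the abstract targets, the aggregated result would generally be only asymptotically equivalent, and the compression scheme would differ.

```python
from typing import Iterator, List, Sequence, Tuple

Point = Tuple[float, float]
# Summary of one block: (n, sum x, sum y, sum x*x, sum x*y).
Summary = Tuple[int, float, float, float, float]


def partition(data: Sequence[Point], block_size: int) -> Iterator[Sequence[Point]]:
    """Yield consecutive blocks so only one block is held at a time."""
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]


def compress(block: Sequence[Point]) -> Summary:
    """Reduce a block to five numbers, regardless of the block's size."""
    n = len(block)
    sx = sum(x for x, _ in block)
    sy = sum(y for _, y in block)
    sxx = sum(x * x for x, _ in block)
    sxy = sum(x * y for x, y in block)
    return (n, sx, sy, sxx, sxy)


def aggregate(summaries: List[Summary]) -> Summary:
    """Combine block summaries by adding them component-wise."""
    n = sum(s[0] for s in summaries)
    sx, sy, sxx, sxy = (sum(s[i] for s in summaries) for i in range(1, 5))
    return (n, sx, sy, sxx, sxy)


def fit_line(summary: Summary) -> Tuple[float, float]:
    """Least-squares (intercept, slope) computed from the summary alone."""
    n, sx, sy, sxx, sxy = summary
    denom = n * sxx - sx * sx
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Toy data on the line y = 2x + 1 (hypothetical, for illustration).
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]

block_summaries = [compress(block) for block in partition(data, block_size=3)]
aggregated_fit = fit_line(aggregate(block_summaries))
full_data_fit = fit_line(compress(data))
```

Because the block summaries are stored rather than the raw blocks, a query over any union of blocks (as in a data cube) only needs to re-run `aggregate` and `fit_line`, not revisit the data.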


Permanent URL: