Abstract

With the rapid advance of the Data Age, synthetic data has emerged as a promising avenue for sharing scientific information while protecting the original data. Existing research has focused primarily on generating synthetic data that accurately replicates the characteristics of observed data; we instead explore its potential to adjust for unrepresentative sampling. This study examines how sampling bias, specifically unbalanced subsets, affects statistical inference, and whether synthetic data can help correct such bias. Using a Data Augmentation–Multiple Imputation (DA–MI) framework, we generate synthetic datasets from biased samples and evaluate parameter recovery under different correction strategies. Simulations under a Missing at Random (MAR) mechanism show that synthetic data can achieve bias reduction and variance stability comparable to inverse probability weighting (IPW). Applying this method to the American Trends Panel survey conducted by Pew Research Center in 2022, we observe that unrepresentative subsamples may lead to attenuated estimates of key effects, such as those of gender and income. Bias-corrected synthetic data restores these effects and aligns more closely with full-sample benchmarks. Moreover, by masking and perturbing individual-level data during synthetic data generation, this approach also facilitates privacy preservation, enabling the development of bias-aware and shareable data products.
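
For readers unfamiliar with the IPW benchmark mentioned above, the following is a minimal sketch of inverse probability weighting under a MAR-type selection mechanism. It is not the thesis's DA–MI procedure, which is not specified in this abstract. The population model, the selection probabilities (0.8 and 0.2), and all function names are hypothetical choices made only for illustration, and the selection probabilities are assumed known rather than estimated.

```python
import random

def simulate(n=20000, seed=0):
    """Draw a population and keep units with a probability that depends only on x (MAR)."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0          # hypothetical binary covariate
        y = 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)     # outcome; population mean of y is 2.0
        p = 0.8 if x == 1 else 0.2                  # assumed, known selection probability
        if rng.random() < p:                        # unbalanced (biased) subsample
            sample.append((y, p))
    return sample

def naive_mean(sample):
    """Unweighted mean of the biased subsample."""
    return sum(y for y, _ in sample) / len(sample)

def ipw_mean(sample):
    """Normalized (Hajek-type) IPW mean: each unit is weighted by 1 / p."""
    num = sum(y / p for y, p in sample)
    den = sum(1.0 / p for _, p in sample)
    return num / den

sample = simulate()
print("naive:", round(naive_mean(sample), 3))
print("IPW:  ", round(ipw_mean(sample), 3))
```

In this toy setup the unweighted mean is pulled toward the over-sampled group, while the weighted mean should land near the population value of 2.0. In practice the selection probabilities would usually have to be estimated from covariates, which this sketch does not attempt.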

Committee Chair

Jimin Ding

Committee Members

Nan Lin, Robert Lunde

Degree

Master of Arts (AM/MA)

Author's Department

Statistics

Author's School

Graduate School of Arts and Sciences

Document Type

Thesis

Date of Award

5-2025

Language

English (en)
