Abstract
In the Data Age, synthetic data has emerged as a promising avenue for sharing scientific information while protecting the original data. Existing research has primarily focused on generating synthetic data that accurately replicates the characteristics of observed data; we instead explore the potential of synthetic data to adjust for unrepresentative sampling. This study examines how sampling bias, specifically unbalanced subsets, affects statistical inference, and whether synthetic data can help correct such bias. Using a Data Augmentation–Multiple Imputation (DA–MI) framework, we generate synthetic datasets from biased samples and evaluate parameter recovery under different correction strategies. Simulations under a Missing at Random (MAR) mechanism show that synthetic data can achieve bias reduction and variance stability comparable to inverse probability weighting (IPW). Applying this method to the 2022 American Trends Panel survey conducted by the Pew Research Center, we observe that unrepresentative subsamples may lead to attenuated estimates of key effects, such as those of gender and income. Bias-corrected synthetic data restores these effects and aligns more closely with full-sample benchmarks. Moreover, because individual-level data are masked and perturbed during the synthetic data generation process, this approach also facilitates privacy preservation, enabling the development of bias-aware and shareable data products.
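As a rough illustration of the IPW benchmark named in the abstract (not the thesis's DA–MI procedure), the sketch below simulates an unbalanced subsample in which selection depends only on an observed covariate, which is the MAR setting. It then compares the naive sample mean with an inverse-probability-weighted mean. The data-generating model, the selection rates, and all variable names are assumptions made for this example and are not taken from the thesis.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: binary covariate x and outcome y = 1 + 2x + noise.
n = 50_000
x = [1 if random.random() < 0.5 else 0 for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

# Assumed MAR selection: inclusion depends only on the observed covariate x,
# so the x = 1 group is heavily over-represented in the subsample.
p_select = {0: 0.2, 1: 0.9}
selected = [random.random() < p_select[xi] for xi in x]

# Estimate each group's selection probability from the observed inclusion rate.
p_hat = {}
for g in (0, 1):
    flags = [s for xi, s in zip(x, selected) if xi == g]
    p_hat[g] = sum(flags) / len(flags)

# Keep only the selected units and weight each by 1 / estimated probability.
sample = [(xi, yi) for xi, yi, s in zip(x, y, selected) if s]
weights = [1.0 / p_hat[xi] for xi, _ in sample]

true_mean = sum(y) / n
naive_mean = sum(yi for _, yi in sample) / len(sample)
ipw_mean = sum(w * yi for w, (_, yi) in zip(weights, sample)) / sum(weights)

print(f"population mean: {true_mean:.3f}")
print(f"naive subsample: {naive_mean:.3f}")
print(f"IPW-corrected:   {ipw_mean:.3f}")
```

In this toy setting the naive mean is pulled toward the over-sampled group, while the weighted mean lands close to the population value. A synthetic-data correction would be judged against this kind of baseline.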
Committee Chair
Jimin Ding
Committee Members
Nan Lin, Robert Lunde
Degree
Master of Arts (AM/MA)
Author's Department
Statistics
Document Type
Thesis
Date of Award
5-2025
Language
English (en)
Recommended Citation
Jiang, Hannah, "Correcting Sampling Bias with Privacy-Preserving Synthetic Data: Inference Stability under the DA-MI Framework" (2025). Arts & Sciences Theses and Dissertations. 3640.
https://openscholarship.wustl.edu/art_sci_etds/3640