Date of Award
Doctor of Philosophy (PhD)
Over the last decade, there has been a sharp surge in the collection, sharing and use of genomic data in a broad array of settings, including but not limited to direct-to-consumer (D2C) products such as ancestry and genetic testing kits sold by 23andMe and AncestryDNA, clinical applications such as personalized medicine, and broader research use in repositories like the 1000 Genomes Project. Further, several open-access repositories to which consumers may upload their sequenced genomes, such as the Personal Genome Project and OpenSNP, are becoming increasingly popular. Users of these repositories are motivated either by altruism or by a sense of community built around ancestry or rare genetic traits that they may have in common. As a natural consequence, the sharing of such data has been accompanied by acute privacy concerns among both researchers and the general public. Several of these concerns revolve around membership-inference attacks, whereby an adversarial actor attempts to infer a target individual's presence in a given dataset. Membership-inference attacks pose a significant privacy risk because genomic datasets often contain potentially sensitive metadata, such as underlying medical conditions, which the attacker may then infer about individuals predicted to be present in the dataset.
In this thesis, I formalize membership-inference attacks on genomic data and present optimization-based techniques to defend against them, with the aim of minimizing the impact on the utility of the data while preserving individual privacy. I consider two membership-inference attack paradigms. In the first model, an attacker attempts to establish an individual's presence in an open-access but anonymized dataset such as OpenSNP. In this setting, I assume that the attacker uses information about phenotypes (specifically, facial features) of known target individuals (such as relatives) to predict whether a genome in the dataset is a potential match. In particular, I consider the case where the attacker has access to a face photograph of each target individual and attempts to create a probabilistic matching between each photograph and the genomes in the dataset. I make two main contributions: a) I evaluate the risk of such an attack on photographs obtained in the wild, i.e. user-uploaded images, in contrast to the photographs produced in controlled lab settings or the 3-D face scans used in prior literature, and b) I propose and evaluate the use of gradient-based techniques to add adversarial noise to face images as an effective defense against such attacks.
In the second model, I consider membership-inference attacks on genomic summary statistics, which may take the form of binary allele presence/absence query responses for each position, or the corresponding alternate allele frequencies. In this model, the attacker has access to a set of target genomes and attempts to infer whether each target genome was likely part of the dataset over which the summary was computed. In this setting, I make four contributions: a) I propose a novel attack model that leverages the distributional separation between the likelihood ratio test scores of individuals in a given dataset and those of a reference population of individuals not in the dataset, b) I present the first formalization of membership-inference attacks in an online query setting, under both authenticated and unauthenticated forms of access control, c) I present highly scalable greedy approaches to optimizing the privacy-utility tradeoff for genomic summary statistics by adding binary or real-valued noise to the data release and selectively suppressing a subset of the release, and d) I use generative deep neural networks to construct a more powerful membership-inference model that is fairly robust to data perturbation.
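As a rough illustration of the likelihood ratio test scores referred to in the second model, the sketch below computes a log-likelihood ratio commonly used in the membership-inference literature for released alternate allele frequencies. It is not the attack model proposed in this thesis. It assumes haploid, binary genomes and independent positions, and the names `lrt_score`, `simulate`, `pool_freqs`, `ref_freqs` and the clipping constant `eps` are hypothetical choices made only for this example.

```python
import math
import random


def lrt_score(genome, pool_freqs, ref_freqs, eps=1e-6):
    """Log-likelihood ratio of a binary genome under the released (pool)
    frequencies versus the reference-population frequencies.

    Higher scores suggest the genome is more consistent with the pool.
    Frequencies are clipped to [eps, 1 - eps] to avoid log(0); the
    clipping is an assumption of this sketch.
    """
    score = 0.0
    for x, p_pool, p_ref in zip(genome, pool_freqs, ref_freqs):
        p_pool = min(max(p_pool, eps), 1 - eps)
        p_ref = min(max(p_ref, eps), 1 - eps)
        if x == 1:
            score += math.log(p_pool / p_ref)
        else:
            score += math.log((1 - p_pool) / (1 - p_ref))
    return score


def simulate(num_snps=2000, pool_size=50, seed=0):
    """Simulate a pool and an equally sized group of outsiders drawn from
    the same synthetic reference frequencies, and score both groups."""
    rng = random.Random(seed)
    ref_freqs = [rng.uniform(0.05, 0.5) for _ in range(num_snps)]

    def draw_genome():
        return [1 if rng.random() < p else 0 for p in ref_freqs]

    pool = [draw_genome() for _ in range(pool_size)]
    outsiders = [draw_genome() for _ in range(pool_size)]

    # Summary statistic that would be released: per-position allele frequency.
    pool_freqs = [sum(g[j] for g in pool) / pool_size for j in range(num_snps)]

    member_scores = [lrt_score(g, pool_freqs, ref_freqs) for g in pool]
    outsider_scores = [lrt_score(g, pool_freqs, ref_freqs) for g in outsiders]
    return member_scores, outsider_scores


if __name__ == "__main__":
    members, outsiders = simulate()
    print("mean member score:  ", sum(members) / len(members))
    print("mean outsider score:", sum(outsiders) / len(outsiders))
```

On synthetic data of this kind, the scores of pool members tend to sit above those of outsiders, which is the sort of distributional separation an attacker can exploit by thresholding the score.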
Jeremy Buhler, Bradley A. Malin, Brendan Juba, Netanel Raviv,