Thesis Title: Privacy-preserving Statistical Tools: Differential Privacy and Beyond
Dr. Rachel Cummings, Industrial Engineering and Operations Research, Columbia University
Dr. Yajun Mei, School of Industrial and Systems Engineering, Georgia Tech
Dr. Mark Davenport, School of Electrical and Computer Engineering, Georgia Tech
Dr. Sara Krehbiel, Department of Mathematics and Computer Science, Santa Clara University
Dr. Jeff Wu, School of Industrial and Systems Engineering, Georgia Tech
Date and Time: 1:00 - 3:00 pm ET, Wednesday April 21, 2021
Meeting URL: https://bluejeans.com/960559571
Meeting ID: 960 559 571(BlueJeans)
Differential privacy has emerged as the de facto gold standard in protecting the privacy of individuals when processing sensitive data, because of its powerful formal guarantees. Several companies, including Google, Apple, Microsoft, have deployed differentially private tools, but barriers remain between such systems and full-featured privacy-preserving data analytics. This thesis focuses on two main challenges: private online decision-making and privacy of dataset-level properties.
Most of the existing differentially private tools are for static databases with non-adaptive analysis. However, modern data analysts interact with online dataset adaptively. The first part of this thesis studies private algorithms for two classical statistical online decision-making problems.
In Chapter 2, we study the statistical problem of change-point detection through the lens of differential privacy. This problem appears in many important practical settings involving personal data, such as identifying disease outbreaks based on hospital records, or IoT devices detecting activity within a home. We give the first private algorithms for both online and offline change-point detection. We prove a differential privacy guarantee and accuracy guarantees for our algorithms. We also give the first finite-sample accuracy guarantees for the standard non-private MLE. Additionally, we provide empirical validation, which provides evidence that our algorithms perform well for practical use.
In Chapter 3, we study False Discovery Rate (FDR) control in multiple hypothesis testing under the constraint of differential privacy. Unlike previous work in this direction, we focus on the online setting, meaning that a decision about each hypothesis must be made immediately after the test is performed, rather than waiting for the output of all tests as in the offline setting. We provide new private algorithms based on state-of-the-art results in non-private online FDR control. Our algorithms have strong provable guarantees for privacy and statistical performance as measured by FDR and power. We also provide experimental results to demonstrate the efficacy of our algorithms in a variety of data environments.
Second, classical differential privacy does not protect sensitive global properties of a dataset, such as the distribution of race and gender among users in the training set or proprietary information. We demonstrate a new dataset-level privacy vulnerability and introduce new privacy notions beyond individual privacy in the second part of this thesis.
In Chapter 4, we study the leakage of dataset properties at the population-level. Our primary focus in this work is on attacks to infer dataset properties in the centralized multi-party machine learning setting, where the model is securely trained on several parties' data, and parties only have black-box access to the final model. We propose an effective attack strategy that requires only a few hundred queries to the model and relies on a simple attack architecture that even a computationally bound attacker can use. Our attack successes on different types of datasets including tabular, text, and graph data, and leakage occurs even if the sensitive attribute is not included in the training data and has a low correlation with other attributes and the target variable.
In Chapter 5, we depart from individual privacy to initiate the study of attribute privacy, where a data owner is concerned about revealing sensitive properties of a whole dataset during analysis. We propose definitions to capture attribute privacy in two relevant cases where global attributes may need to be protected:(1) properties of a specific dataset and (2) parameters of the underlying distribution from which dataset is sampled. We also provide two efficient mechanisms and one inefficient mechanism that satisfy attribute privacy for these settings.