The ability to amass, store, manipulate and analyze information from millions of people at once has opened a vast frontier of new research methods. But, whether these methods are used in the service of new business models or new scientific findings, they also raise questions for the individuals whose information comprises these “big data” sets. Can anyone really guarantee that these individuals’ information will remain private?
Along with colleagues from Harvard University, Microsoft Research, Pennsylvania State University and IBM Research, the University of Pennsylvania’s Aaron Roth presented some of his ongoing research on “differentially private” algorithms at the American Association for the Advancement of Science Annual Meeting in San Jose, Calif.
The differential privacy approach ensures that people querying big databases see only representative trends and can’t game their searches to reveal information about specific individuals in the set.
This technique has implications for companies like Google, Facebook and Amazon, businesses that depend on gleaning insights from trends in users’ behavior while maintaining users’ trust that their personal data is not being exploited or compromised. But it could extend beyond the tech giants and into the world of scientific publishing by addressing the so-called “false discovery” problem, where some scientific findings seem valid but cannot be reproduced. Such false discoveries stem from “overfitting” hypotheses to outlying individuals in a dataset, rather than to generalizable trends that apply to the population at large.
“There’s this idea where the more privacy you have, the less useful the data is,” Roth says. “There’s some truth to that, but it’s not quite so simple. Privacy can also increase the usefulness of data by preventing this kind of overfitting.”
The “differential” in the technique’s name refers to the guarantee a differentially private algorithm makes: its analyses should remain functionally identical when applied to two datasets that differ only in the data of a single individual, one with and one without that person’s records.
“The math behind differentially private algorithms gives you a provable guarantee that, even if you know everything else about the dataset, you can’t learn anything new about my data in particular,” Roth says. “The algorithm essentially can’t tell whether my data is in the set in the first place.”
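In practice, that guarantee is usually achieved by adding carefully calibrated random noise to the answers a database gives out. The sketch below is purely illustrative and is not drawn from Roth’s own code; it shows the standard Laplace mechanism answering a simple counting query, with the function name, toy records and privacy parameter all invented for the example.

```python
# Illustrative sketch of the Laplace mechanism, a standard way to make a
# counting query differentially private. Not code from Roth's research.
import numpy as np

rng = np.random.default_rng()

def private_count(records, predicate, epsilon=0.5):
    """Answer 'how many records satisfy predicate?' with added Laplace noise.

    Adding or removing one person's record changes a count by at most 1,
    so noise drawn from a Laplace distribution with scale 1/epsilon gives
    the differential privacy guarantee: for any two datasets D and D' that
    differ in a single individual, the distribution of answers on D is
    within a factor of exp(epsilon) of the distribution on D'.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Two "neighboring" datasets: identical except that Alice is in only one.
with_alice = ["alice", "bob", "carol", "dave"]
without_alice = ["bob", "carol", "dave"]

print(private_count(with_alice, lambda r: True))     # noisy answer near 4
print(private_count(without_alice, lambda r: True))  # noisy answer near 3
```

Because the noise is comparable to the difference any single person can make to the answer, the two noisy counts are statistically hard to tell apart, which is the property Roth describes.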
The very nature of large data sets makes privacy a dicey proposition. Stripping those records of common identifiers, such as names or email addresses, still leaves a trove of information that could, with some effort, be used to pinpoint data from a specific individual. Such “ad hoc” methods of protecting privacy in a data set are ultimately doomed in that the set’s owner can never predict what outside information would-be attackers might use to reverse-engineer the hidden data.
“For example, at one point, it was shown that an attack on Amazon’s recommendation algorithm was possible,” Roth says. “If I knew five or six things you bought on Amazon, I could buy those same things, and all of a sudden, we’re now the two most similar customers in Amazon’s recommendation algorithm. I could then start seeing what else you were buying, as whatever you bought would then be recommended to me.”
A differentially private recommendation algorithm would defend against such an attack because it would discount idiosyncratic trends as being just that: only representing an individual’s data rather than something that is statistically valid for the entire set.
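A toy illustration, not a description of Amazon’s actual system or of any production defense, makes the point concrete: if customer similarity is just the number of items two people have both bought, copying a victim’s known purchases makes the attacker their nearest neighbor with certainty, while a noise-blurred count takes away that certainty.

```python
# Toy illustration of the attack and the noisy defense described above.
# This is not Amazon's algorithm; the names and numbers are invented.
import numpy as np

rng = np.random.default_rng()

def similarity(purchases_a, purchases_b):
    # Exact overlap: an attacker who copies a victim's known purchases
    # becomes that victim's most similar customer with certainty.
    return len(purchases_a & purchases_b)

def noisy_similarity(purchases_a, purchases_b, epsilon=0.5):
    # Blurring the overlap with Laplace noise means small, idiosyncratic
    # differences no longer reliably determine who gets recommended what.
    return similarity(purchases_a, purchases_b) + rng.laplace(scale=1.0 / epsilon)

victim = {"tent", "stove", "lantern", "sleeping bag", "boots"}
attacker = set(victim)  # the attacker buys the same five items

print(similarity(victim, attacker))        # exactly 5
print(noisy_similarity(victim, attacker))  # a blurred value, e.g. 3.7 or 6.2
```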
Beyond protecting customers’ private information, such an algorithm would also be better at its job.
“You don’t actually want something that is good at predicting what people have bought historically; that may just be an example of you overfitting the data,” Roth says. “You want something that predicts what they are going to buy tomorrow: things that are not in the set yet. The same applies to scientific findings.”
Generating and collecting data is often the most expensive and time-consuming part of a scientific study, so datasets are often shared among scientists. This altruism has a hidden downside, however, as it disrupts the scientific publishing system’s standard method of ascribing significance to a particular finding.
“The way you normally determine if a finding is significant is by computing its ‘p-value,’” Roth says. “This tells you the probability that a correlation as strong as the one you observe would appear by random chance alone. The standard threshold for significance is .05, but that also means that if you test 100 hypotheses, even if they’re all wrong, you’d expect about five of them to appear significant.
“There are ways to correct for the fact that you test many hypotheses, but the existing methods only work if you came up with all of your hypotheses before anyone ever looked at the data. If scientists re-use the same datasets, these guarantees disappear.”
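Roth’s arithmetic is easy to check with a short simulation. The example below, whose sample sizes and use of a t-test are arbitrary choices made only for illustration, tests 100 hypotheses that are all false and counts how many clear the .05 bar; it also applies a Bonferroni correction, one of the standard fixes that assumes the hypotheses were fixed before anyone looked at the data.

```python
# Simulate testing 100 false hypotheses at the p < .05 level.
# Illustrative only; the sample sizes and t-test are arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng()
n_hypotheses, n_samples = 100, 50

false_discoveries = 0
bonferroni_discoveries = 0
for _ in range(n_hypotheses):
    # Both groups come from the same distribution, so any "effect" is chance.
    a = rng.normal(size=n_samples)
    b = rng.normal(size=n_samples)
    p = stats.ttest_ind(a, b).pvalue
    false_discoveries += p < 0.05
    bonferroni_discoveries += p < 0.05 / n_hypotheses  # correction for 100 tests

print(false_discoveries)       # typically around 5, as Roth notes
print(bonferroni_discoveries)  # typically 0
```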
If, rather than sharing raw data, scientists allow a dataset to be queried only through a differentially private algorithm, they recover the ability to protect against “false discoveries” that come from overfitting the data. Roth and his colleagues Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi and Omer Reingold have developed theory proving the effectiveness of this approach, which will be presented this summer at the ACM Symposium on the Theory of Computing.
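The access model can be pictured with a short sketch. The class below is an invented toy, not the mechanism analyzed in the paper: researchers never see raw records and instead ask statistical queries, such as the fraction of records satisfying some condition, which are answered with calibrated noise.

```python
# Toy sketch of sharing a dataset only through noisy statistical queries.
# An invented illustration of the access model, not the paper's mechanism.
import numpy as np

rng = np.random.default_rng()

class PrivateDataset:
    """Answers 'what fraction of records satisfy this condition?' with noise."""

    def __init__(self, records, epsilon_per_query=0.1):
        self._records = list(records)
        self._epsilon = epsilon_per_query

    def mean_query(self, predicate):
        n = len(self._records)
        true_fraction = sum(1 for r in self._records if predicate(r)) / n
        # Changing one record moves the fraction by at most 1/n, so Laplace
        # noise of scale 1 / (n * epsilon) protects any single individual
        # for this query.
        return true_fraction + rng.laplace(scale=1.0 / (n * self._epsilon))

# Researchers form follow-up hypotheses against noisy answers only, which
# limits how much any hypothesis can latch onto individual records.
data = rng.normal(size=1000)
shared = PrivateDataset(data)
print(shared.mean_query(lambda x: x > 1.0))  # noisy estimate of the true fraction
```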
“You always want the conclusions you draw to be based on true statistical patterns in the data, rather than the idiosyncrasies of a single individual in the set,” Roth says. “This is the same thing you want in private data analysis, and this is why differential privacy can also prevent false discoveries.”