Big data sets are important tools for modern science. Mining for correlations between millions of pieces of information can reveal vital relationships or predict future outcomes, such as risk factors for a disease or structures of new chemical compounds.
These mining operations are not without risk, however. Researchers can have a tough time telling when they have unearthed a nugget of truth, or what amounts to fool’s gold: A correlation that seems to have predictive value, but actually does not because it results from random chance.
Aaron Roth, an assistant professor in the Department of Computer and Information Science in the University of Pennsylvania’s School of Engineering and Applied Science, is developing new data mining tools that can help tell these nuggets apart.
Increasingly, scientists are relying on a data mining technique known as “adaptive analysis,” which involves running multiple tests on the same data, with each test informed by the results of the ones before it. The technique increases the predictive power of those tests, but it also has the ability to deceive.
As part of a research team that bridges academia and industry, Roth has published a study in Science that outlines a new method for using adaptive analysis without compromising statistical assurances that the tests are drawing valid conclusions.
Existing checks on adaptive analysis can only be applied to very large datasets. Acquiring enough data to run such checks can be logistically challenging or cost-prohibitive. The method Roth and his collaborators outlined could increase the power of analysis done on smaller datasets by flagging ways researchers can arrive at a “false discovery,” where a finding appears to be statistically significant but can’t be reproduced in new data.
For each hypothesis that needs testing, the method could act as a check against “overfitting,” where predictive trends only apply to a given dataset and can’t be generalized.
Roth’s co-authors include Cynthia Dwork, distinguished scientist at Microsoft Research; Vitaly Feldman, research scientist at IBM’s Almaden Research Center; Moritz Hardt, research scientist at Google; Toniann Pitassi, professor in the Department of Computer Science at the University of Toronto; and Omer Reingold, principal researcher at Samsung Research America.
Spotting false discoveries
While adaptive analysis provides a powerful tool for analysis, it also has the potential to create the appearance of patterns where none exist.
Imagine a person who receives an anonymous tip via email one morning saying the price of a certain stock will rise by the end of the day. At the closing bell, the tipster’s prediction is borne out, and another prediction is made. After a week of unbroken success, the tipster begins charging for his proven prognostication skills.
Many would be inclined to take up the tipster’s offer and fall for this scam. Unbeknownst to his victims, the tipster started by sending random predictions to thousands of people, and only repeated the process with the ones that ended up being correct by chance. While only a handful of people might be left by the end of the week, each sees what appears to be a powerfully predictive correlation that is actually nothing more than a series of lucky coin-flips.
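The arithmetic behind the scam can be sketched in a few lines of Python. This is only an illustration: the function name, the starting list of 4,096 recipients and the even split between “up” and “down” predictions are assumptions, not details from the study.

```python
import random

def run_tipster_scam(num_recipients=4096, num_days=5, seed=0):
    """Return how many recipients have seen only correct tips after each day."""
    rng = random.Random(seed)
    remaining = num_recipients
    history = []
    for _ in range(num_days):
        # The tipster has no insight: the market's direction is a coin flip.
        market_up = rng.random() < 0.5
        # Half of the remaining recipients were told "up", the rest "down".
        told_up = remaining // 2
        told_down = remaining - told_up
        # Only those who happened to receive the correct tip hear from him again.
        remaining = told_up if market_up else told_down
        history.append(remaining)
    return history

print(run_tipster_scam())  # [2048, 1024, 512, 256, 128]
```

After five trading days, 128 people have seen a perfect record, even though no prediction carried any information.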
In the same way, “adaptively” testing many hypotheses on the same data–each new hypothesis influenced by the last–can make random noise seem like a signal. That’s known as a false discovery. Because the correlations of these false discoveries are unique to the dataset in which they were generated, they can’t be reproduced when other researchers try to replicate them with new data.
One traditional way to check that a purported signal is not just coincidental noise is to use a “holdout.” This is a dataset kept separate while the bulk of the data is analyzed. Hypotheses generated about correlations between items in the bulk data can then be tested on the holdout. Real relationships would exist in both sets, while false ones would fail to be replicated.
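As a rough illustration of the holdout idea, the Python sketch below sets aside part of a toy dataset and accepts a relationship only if it shows up in both parts. The function names, the 30 percent holdout share and the 0.6 agreement cutoff are assumptions chosen for the example.

```python
import random

def split_holdout(records, holdout_fraction=0.3, seed=0):
    """Shuffle the records and set aside a fraction of them as the holdout."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    holdout_size = round(len(shuffled) * holdout_fraction)
    return shuffled[holdout_size:], shuffled[:holdout_size]

def agreement_rate(records):
    """Fraction of (variable, outcome) pairs in which the two values match."""
    return sum(1 for variable, outcome in records if variable == outcome) / len(records)

def replicates_on_holdout(bulk, holdout, min_agreement=0.6):
    """A relationship counts only if it clears the bar in both sets."""
    return (agreement_rate(bulk) >= min_agreement
            and agreement_rate(holdout) >= min_agreement)

# Toy data in which the variable always matches the outcome: a real relationship.
real_signal = [(i % 2, i % 2) for i in range(100)]
bulk, holdout = split_holdout(real_signal)
print(replicates_on_holdout(bulk, holdout))  # True
```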
The problem with using holdouts in that way is that, by nature, they can only be reused if each hypothesis tested is independent of the previous ones.
“One adaptive analysis we might like to carry out involves feature selection: Choosing a small number of predictive features, and then learning a predictive model on top of those features,” Roth said. “We might take the features from the training set and only keep them if they are predictive in the holdout set. But this only works if we use a new holdout set for each round of adaptivity.
“One thing you could do is get a totally new set of data for every time you test a hypothesis that is based on something you’ve tested in the past, but that means the amount of data you need to conduct your analysis is going to grow proportionally to the number of hypotheses you are testing.”
To address this issue, the researchers developed a tool known as a “reusable holdout.” Instead of testing hypotheses on the holdout set directly, scientists would query it through a “differentially private” algorithm.
The “differentially” in the method’s name is a reference to the guarantee that a private algorithm makes. Its analyses should remain functionally identical when applied to two different datasets: one with and one without the data from any single individual. This means that any findings that would rely on idiosyncratic outliers of a given set would disappear when looking at data through a differentially private lens.
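One way to picture such a query is the simplified Python sketch below, written in the spirit of a thresholded, noise-adding mechanism. It is not the researchers’ implementation: the function names, the threshold of 0.04 and the noise scale of 0.01 are illustrative assumptions, and the accounting that limits how many times the holdout may be consulted is left out.

```python
import random

def laplace_noise(rng, scale):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def reusable_holdout_query(train_values, holdout_values, rng,
                           threshold=0.04, noise_scale=0.01):
    """Answer one query (the average of a statistic) without exposing the raw holdout.

    train_values and holdout_values hold the statistic for each record in
    the training set and the holdout set. The default threshold and
    noise_scale are assumptions, not tuned values.
    """
    train_mean = sum(train_values) / len(train_values)
    holdout_mean = sum(holdout_values) / len(holdout_values)
    noisy_threshold = threshold + laplace_noise(rng, noise_scale)
    if abs(train_mean - holdout_mean) <= noisy_threshold:
        # The two sets agree, so the training answer is returned and
        # nothing new about the holdout is revealed.
        return train_mean
    # The sets disagree, a sign of overfitting: return the holdout
    # estimate, blurred with noise so no single record can be pinned down.
    return holdout_mean + laplace_noise(rng, noise_scale)

rng = random.Random(0)
answer = reusable_holdout_query([1, 0, 1, 1], [1, 0, 0, 1], rng)
print(answer)
```

Because the analyst only ever sees the training answer or a blurred holdout answer, findings that hinge on a few idiosyncratic holdout records have little chance of leaking into the next round of hypotheses.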
To test their algorithm, the researchers performed adaptive data analysis on a dataset rigged so that it contained nothing but random noise. The dataset was abstract, but could be thought of as one that tested 20,000 patients on 10,000 variables, such as variants in their genomes, for ones that were predictive of lung cancer.
Though, by design, none of the variables in the set were predictive of cancer, reusing a holdout set in the standard way indicated that 500 of them had significant predictive power. Performing the same analysis with the researchers’ reusable holdout tool, however, correctly showed the lack of meaningful correlations.
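A scaled-down version of the pure-noise experiment can be sketched in Python to show how standard holdout reuse manufactures apparent accuracy. The dataset sizes (200 records, 400 variables), the selection cutoff and all function names are assumptions made for this illustration; the study’s own setup was far larger.

```python
import random

def make_noise_data(rng, num_records, num_variables):
    """Rows and labels of independent random +1/-1 values: no real signal."""
    rows = [[rng.choice((-1, 1)) for _ in range(num_variables)]
            for _ in range(num_records)]
    labels = [rng.choice((-1, 1)) for _ in range(num_records)]
    return rows, labels

def correlations(rows, labels):
    """Average of variable * label for each variable."""
    n = len(rows)
    return [sum(row[j] * y for row, y in zip(rows, labels)) / n
            for j in range(len(rows[0]))]

def accuracy(rows, labels, signed_variables):
    """Accuracy of a majority vote over the selected variables."""
    correct = 0
    for row, y in zip(rows, labels):
        vote = sum(sign * row[j] for j, sign in signed_variables)
        prediction = 1 if vote >= 0 else -1  # ties are broken toward +1
        if prediction == y:
            correct += 1
    return correct / len(rows)

def naive_holdout_experiment(num_records=200, num_variables=400,
                             num_fresh=1000, seed=0):
    rng = random.Random(seed)
    train_rows, train_labels = make_noise_data(rng, num_records, num_variables)
    hold_rows, hold_labels = make_noise_data(rng, num_records, num_variables)
    fresh_rows, fresh_labels = make_noise_data(rng, num_fresh, num_variables)

    cutoff = 1 / num_records ** 0.5
    train_corr = correlations(train_rows, train_labels)
    hold_corr = correlations(hold_rows, hold_labels)
    # Adaptive step: keep variables that look predictive in training AND
    # are "confirmed", with the same sign, by looking directly at the holdout.
    selected = [(j, 1 if t > 0 else -1)
                for j, (t, h) in enumerate(zip(train_corr, hold_corr))
                if abs(t) > cutoff and abs(h) > cutoff and t * h > 0]
    return {
        "num_selected": len(selected),
        "holdout_accuracy": accuracy(hold_rows, hold_labels, selected),
        "fresh_accuracy": accuracy(fresh_rows, fresh_labels, selected),
    }

print(naive_holdout_experiment())
```

Since the holdout was consulted while choosing the variables, its accuracy estimate should come out well above the roughly 50 percent seen on fresh data, even though every value is random.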
An experiment with a second rigged dataset depicted a more realistic scenario. There, some of the variables did have predictive power, but traditional holdout use created a combination of variables that wildly overestimated this power. The reusable holdout tool correctly identified the 20 that had true statistical significance.
Beyond pointing out the dangers of accidental overfitting, the reusable holdout algorithm can warn users when they are exhausting the statistical power of a dataset. This is a red flag for what is known as “p-hacking,” or intentionally gaming the data to get a publishable level of significance.
Implementing the reusable holdout algorithm will allow scientists to generate stronger, more generalizable findings from smaller amounts of data.
“Our algorithm gives a statistically valid way of doing adaptive data analysis using less data than before,” said Roth.
In addition to a Sloan Fellowship, Roth’s research is supported by a National Science Foundation (NSF) CAREER Award.
“In today’s world, key scientific findings and business decisions depend on conclusions drawn from the analysis of data,” said Nina Amla, Acting Deputy Division Director for Computer and Network Systems at NSF. “This innovative research will inspire more confidence in the outcomes based on these data analyses.”