Researchers in Carnegie Mellon University's Event and Pattern Detection (EPD) Laboratory have won the grand prize in Round 7 of the Yelp Dataset Challenge for a new text-analysis approach that detects disease outbreaks and emerging consumer preferences. Their work could someday help public health officials respond more efficiently and quickly to health-related trends, outbreaks and emergencies.
Led by Daniel Neill, associate professor and director of the EPD Lab in the H.J. Heinz III College, the team received the grand prize for their paper “Semantic Scan: Detecting Subtle, Spatially Localized Events in Text Streams.”
“Current state-of-the-art methods of text analysis based on topic modeling are very good at tracking the gradual evolution of topics over time but are not so good at detecting newly emerging topics, as would be necessary in a disease outbreak,” Neill said.
The team developed and integrated new approaches to contrastive topic modeling and spatial event detection. They applied the analysis to two datasets, one provided by the Yelp contest and the other from data collected by Allegheny County emergency departments.
“Approaches to disease surveillance might be effective in picking up types of disease outbreak that correspond well to existing syndrome definitions, such as ‘influenza-like illness,’ or ‘gastrointestinal illness,’ but they would fail to effectively detect novel outbreaks with rare or previously unseen patterns of symptoms,” said Neill, who has worked for over a decade in disease detection in collaboration with public health departments.
Neill and colleagues from CMU's Heinz College and School of Computer Science, and from the University of Notre Dame, have been developing the semantic scan approach and applying it to public health disease surveillance data since 2011. With EPD Lab doctoral student Mallory Nobles, he is collaborating with the North Carolina and New York City health departments to detect novel disease outbreaks with previously unseen patterns of symptoms.
As its application to the Yelp dataset suggests, Semantic Scan could be used on a broad range of electronic records, from social media to public complaints, anywhere a need exists to detect emerging patterns. Possible uses include predicting health needs, tracking the supply and demand of specialized services in a geographic area, and identifying trending topics in publications and patents.
The Yelp Challenge Dataset contained 2.7 million reviews and 649,000 tips from 687,000 users, covering 86,000 businesses. The data represented the cities of Charlotte, N.C.; Urbana-Champaign, Ill.; Phoenix, Ariz.; Las Vegas; Madison, Wis.; Montreal and Waterloo-Kitchener, Canada; Karlsruhe, Germany; and Edinburgh, U.K.
“We knew we had the kind of data academics love to explore, and the associated contest was a fun way to spread the word and grow our network of academics engaging with Yelp,” said Krista Lane, Yelp engineering recruiter.
Yelp, a crowd-sourced review website for local businesses, began the Dataset Challenge in 2013 to contribute to the academic community, much as it participates in the open-source community for developers.