IBM and UC San Diego Explore Challenges in Constructing Knowledge Bases for Human Microbiome-Disease Associations


Researchers across nearly every discipline struggle to stay up to date with the latest discoveries published in their fields. In human microbiome research, where thousands of articles are published each year, this task is especially daunting. To help address this challenge, researchers have attempted to create organized, searchable resources known as knowledge bases that capture key information from research articles, using computational techniques such as natural language processing (NLP) and text mining. These techniques are integral to the performance of voice-activated assistants such as Siri and Alexa, and enable human language to be translated into a form computers can digest to provide useful information back to users.

In a new paper in the journal Microbiome, researchers at the University of California San Diego (UC San Diego) and IBM, working together through the Artificial Intelligence for Healthy Living program supported by the IBM Research AI Horizons Network, outline the barriers and opportunities for the creation of a comprehensive, accurate, and automated knowledge base for the field of human microbiome research. Dr. Chun-Nan Hsu, Associate Professor of Medicine at UC San Diego, and his team focus on the first step in this process: identifying which microbes are associated with human diseases. Solving this challenge would be a great asset to the field and go a long way towards reducing barriers to insight in microbiome research.

One of the biggest issues the team identified is that even state-of-the-art NLP and text mining tools are ineffective at handling the variety of names used for the same diseases and microbes. The authors note that these naming discrepancies arise from competing standards, a lack of enforced standards, and incomplete knowledge of updated standards, as well as the natural evolution of disease and microbial names as classifications improve over time. Compounding these issues is a shifting landscape of terminology and natural human variation in writing, organization, and specificity, as well as human error. Such variety has kept the task of classification in the realm of human annotators rather than computers, meaning that only a small number of knowledge bases have been created to date, with a relatively small number of microbes and diseases covered.
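To make the naming problem concrete, here is a minimal sketch of dictionary-based name normalization, one simple way a tool might map different mentions to a single canonical name. It is not the method used in the paper. The synonym table, the `normalize` function, and the sample mentions are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical synonym table for illustration only; not taken from the paper.
# Keys are lowercase mention variants, values are the chosen canonical names.
SYNONYMS = {
    "clostridioides difficile": "Clostridioides difficile",
    "clostridium difficile": "Clostridioides difficile",  # older genus name
    "c. difficile": "Clostridioides difficile",           # abbreviated form
    "type 2 diabetes": "type 2 diabetes mellitus",
    "type 2 diabetes mellitus": "type 2 diabetes mellitus",
    "t2d": "type 2 diabetes mellitus",
}


def normalize(mention):
    """Return the canonical name for a mention, or None if it is not listed."""
    # Lowercase and collapse runs of whitespace before the lookup.
    key = re.sub(r"\s+", " ", mention.strip().lower())
    return SYNONYMS.get(key)


mentions = ["Clostridium difficile", "C. difficile", "T2D", "type II diabetes"]
for mention in mentions:
    print(mention, "->", normalize(mention))
```

In this sketch the first three mentions resolve to a canonical name, while "type II diabetes" is left unresolved because that spelling is missing from the table. A fixed list only covers the variants someone thought to include, which is one reason the variety of names in the literature is hard to handle automatically.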

Nevertheless, the team was able to demonstrate that having a knowledge base of microbe-disease associations can quickly enable novel insights. Dr. Hsu explains that “even existing knowledge bases with limited coverage already make it easy to quickly find some new bacteria targets to study a disease and find contradictory conclusions that may be due to experimental flaws. With the use of type 1 and type 2 diabetes as an example, we found that bacteria reported in contradictory results are known to be the target of some contaminated sequencing tools available to identify the presence of those bacteria in biological samples.” He added that without any knowledge base, a researcher would have to query and browse through up to tens of thousands of papers and still might not arrive at these findings.

Dr. Hsu and his collaborators at IBM are actively developing novel NLP methods that can more effectively read and understand the diseases and microbes mentioned in published articles. These methods can help enable the automated knowledge base creation that has thus far been out of reach, with the goal of saving valuable time and helping to accelerate the pace of human microbiome research in the future.