Genome in a Bottle: Spelling Out DNA’s ‘Dark’ Sequences

These genetic sequencing machines produce a tremendous amount of data; my job is to analyze it and develop reference materials to help other researchers look for genetic variants, like those responsible for disease, with confidence that their tests are working properly. Credit: M. Esser/NIST

When the Human Genome Project began in 1990, its consortium of researchers from 20 institutes spent more than $1 billion over 10 years to sequence that first genome’s billions of bases. Recently, researchers from the Harvard Personal Genome Project sent me my whole genome sequence at a small fraction of the time and cost. They compared my genome to the reference genome generated by the Human Genome Project and found over 3 million small differences, or variants. I was excited to explore my genome and perhaps find some clues related to the source of my type 1 diabetes and lymphedema, which are sometimes genetic but don’t run in my family. Yet, based on our collaborative work in the NIST-hosted “Genome in a Bottle” (GIAB) Consortium, I know there are still many challenges, both in characterizing the sequence of all of the billions of bases in the human genome, which is the focus of our work, and in understanding what the sequence means.

Out of the more than three million variants found in my genome, fewer than 100 currently have a clear meaning. Some were unsurprising, like a variant that gives me seven times higher risk of male baldness, as evidenced by the many bald men in my family, including me. A few also tell me that I carry variants related to rare diseases in my DNA, so if my wife happened to be a carrier our children might be affected. Some variants also tell me I should be careful when taking certain kinds of medicines. Finally, a few variants suggest I may have a slightly higher risk for certain diseases, including one variant that gives me about five times higher risk of developing type 1 diabetes; still, half of the population has the variant and most of them do not get the disorder. None of my variants are currently known to be linked to lymphedema. This type of uncertainty is common even when sequencing someone with a severe disease thought to be genetic, but why?

Many explanations exist for why we often can’t find a clear genetic cause for a disease. For example, many diseases have both environmental and genetic causes. Also, the genetic causes of many diseases are complex, often weakly related to many different variants across the genome, so researchers will need to analyze millions of individuals from different ancestries. In addition, our current methods are not powerful enough to characterize the many types of variants and regions of the genome, which motivates my current work. Even though my NIST research for the past seven years has revolved around analyzing genome sequencing data, the amount of data and its complexity are still overwhelming.

The first printout of the human genome to be presented as a series of books, displayed in the ‘Medicine Now’ room at the Wellcome Collection, London. The 3.4 billion units of DNA code are transcribed into more than a hundred volumes, each a thousand pages long, in type so small as to be barely legible. Credit: Russ London/Creative Commons Attribution-Share Alike 3.0 Unported

DNA sequencing technologies have greatly improved during and since the Human Genome Project, so that today sequencing a typical genome only costs $1,000 to $10,000, depending on the technology. However, the new methods come with trade-offs and require complex computer analysis to piece together millions of sequence fragments from the human genome. This analysis is highly accurate today for about 80-90 percent of the genome’s small variants, but larger changes in the genome and changes in the repetitive, poorly understood “dark matter” of the genome are much more challenging. Fortunately, sequencing technologies and analysis methods are continually improving to characterize increasingly challenging regions of the genome at lower cost.

To gain trust in the results of new sequencing methods and improve them, experts from industry, academic labs and government formed the Genome in a Bottle Consortium (GIAB). This NIST-led collaboration brings together companies and researchers developing new sequencing technologies and analysis methods to help clinical and research labs answer the question “So you’ve sequenced a genome, how well did you do?” We focus on characterizing a small number of genomes and literally put them in “bottles” that anyone can purchase from NIST as reference materials. Reference materials are essentially materials for which we have extensively characterized at least one property. For example, we have reference materials with a known amount of cholesterol that clinical laboratories can use when testing your blood to ensure they get the correct answer whether you get tested in the U.S., Asia, or anywhere else in the world.

For our DNA reference materials, our genomes in a bottle, we chose two mother-father-son trios from the Personal Genome Project and characterized their DNA sequences exceptionally well with many sequencing methods. While different sequencing methods compete in the marketplace, everyone works together openly in GIAB to take advantage of the strengths of each method and approximate, given some uncertainty, the true sequence of these genomes. Much of our NIST work for Genome in a Bottle has been developing methods to integrate results from all sequencing methods and develop our best estimate of the true sequence. Then, similar to how clinical laboratories test cholesterol reference materials periodically to make sure they get a similar value to NIST’s, they sequence our publicly available DNA reference materials and compare their DNA sequence results to our approximation of the true sequence. Methods developers use these genomes to optimize their methods, and some have even used them to train artificial intelligence models to characterize genomes more accurately. Standards like our reference materials are part of delivering the promise of “precision medicine,” which will enable doctors to tailor treatments and give the right drug to the right person at the right time. Just as the NIST-developed atomic clocks have enabled unexpected technologies like GPS, our precisely characterized genomes have enabled new technologies we hadn’t imagined at the start.

To help others use our precisely characterized genomes, we led the Global Alliance for Genomics and Health Benchmarking Team. This team developed standardized methods for any lab to compare the variants they find in the DNA of NIST’s reference materials to NIST’s answer. Comparing variants is complex because they can be represented in many different ways, so the biggest part of the team’s work was a “meta-comparison” of the tools used to compare variants. After a few years of regular conference calls with scientists from New Zealand to the UK to the USA, we just published best practices for benchmarking so that performance of any sequencing method can be compared to any other method.
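To see why variants “can be represented in many different ways,” here is a toy Python sketch. It is only an illustration, not one of the comparison tools the Benchmarking Team evaluated. The function names, the made-up reference strings, and the 0-based (position, reference allele, alternate allele) tuples are all assumptions chosen for brevity; real variant files use 1-based positions and carry much more information. The idea is that two variant lists describe the same change if applying each of them to the reference yields the same sequence.

```python
def apply_variants(reference, variants):
    """Return the sequence produced by applying variants to a reference.

    Each variant is a hypothetical (position, ref_allele, alt_allele) tuple
    with a 0-based position; real VCF files use 1-based positions.
    """
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("variants overlap")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"ref allele {ref!r} does not match the reference at {pos}")
        pieces.append(reference[cursor:pos])  # unchanged bases before the variant
        pieces.append(alt)                    # bases that replace the ref allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])         # unchanged bases after the last variant
    return "".join(pieces)


def same_sequence(reference, calls_a, calls_b):
    """True if both variant lists turn the reference into the same sequence."""
    return apply_variants(reference, calls_a) == apply_variants(reference, calls_b)


# Example 1: deleting one "CA" unit from a short repeat can be written at the
# left end or the right end of the repeat.
repeat_reference = "GCACACAT"
deletion_left = [(0, "GCA", "G")]
deletion_right = [(4, "ACA", "A")]

# Example 2: two adjacent substitutions can be written as one block or as
# two single-base changes.
substitution_reference = "ACGT"
one_block = [(1, "CG", "TA")]
two_snvs = [(1, "C", "T"), (2, "G", "A")]

print(deletion_left == deletion_right)                                   # False
print(same_sequence(repeat_reference, deletion_left, deletion_right))    # True
print(same_sequence(substitution_reference, one_block, two_snvs))        # True
```

A line-by-line comparison of the two variant lists would report a disagreement in both examples, even though the underlying DNA sequence is identical. That is why a fair benchmark has to compare what the variants mean rather than how they happen to be written.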

While the variants found by most prominent DNA tests are likely accurate in the easier regions of the genome, interpretation can be challenging. DNA frequently has a certain mystique, such that it’s easy to believe anything “genetic” determines our destiny. In fact, most of our characteristics are a complex interplay between the sequence in many different parts of our genome and our environment. In addition, clinical genome sequencing tests often give “variants of uncertain significance,” for which the evidence is not yet clear whether they might cause any disease. It is even more challenging to interpret popular DNA-based ancestry tests ordered by millions of people around the world. These tests can give a view into the paternal or maternal line. The New York Times published a series of articles about the misuse of ancestry tests in support of white supremacist ideology, and genetics experts in the American Society of Human Genetics strongly condemned this misinterpretation. NPR’s Code Switch hosted a podcast with an interesting discussion of the important differences between race and ancestry. That said, it can be fun to see, for example, how one ancestry test clearly predicted that many of my ancestors migrated from Switzerland to eastern and central Pennsylvania, as I’d expect based on my Mennonite heritage.

Working with genomes from the Personal Genome Project (PGP) has been important for the Genome in a Bottle project. As a participant in the PGP (though I’m not one of the GIAB genomes), I went through their rigorous consent process to understand the potential risks of making my genome data and samples publicly available. These risks range from learning about diseases I could do nothing about, to life insurance companies raising my premiums due to disease predicted by my genome, to someone synthesizing my DNA and planting it at a crime scene. We are very grateful to the PGP and its participants because they have consented to allow anyone to make secondary reference samples. For example, several companies have already made about 100 different secondary reference samples from the GIAB/PGP genomes to meet specific needs for clinical laboratory testing, such as samples that try to mimic small DNA fragments from tumors found in blood. One current challenge is the lack of families of non-Caucasian ancestry, so PGP wrote a blog post describing the need for diverse volunteers for the project. We are working on finding reference samples from individuals of ancestries other than the Caucasian and Asian families we’ve characterized so far because this will help ensure the accuracy of sequencing tests across all ancestries.

Perhaps the cause of my diseases will be found in the “dark matter” of the genome. More importantly, a deeper understanding of the genome may help us treat conditions like cancer, muscular dystrophy, Huntington’s disease and schizophrenia. I am honored to be among the wide variety of experts in the Genome in a Bottle Consortium working to enable the new technologies that will measure these challenging parts of the genome and shed light on a path to cures for previously untreatable diseases.