Lupski, J.R., et al. (2010). Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. New England Journal of Medicine, advance online publication. doi:10.1056/NEJMoa0908094
Roach, J.C., et al. (2010). Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science, advance online publication. doi:10.1126/science.1186802
Two new papers out today - the first ever studies to employ whole-genome sequencing for disease gene discovery - neatly illustrate both the promise and the challenges lying ahead for clinical and personal genomics.
The first paper presents the final - and successful - outcome of geneticist James Lupski's attempt to track down the genetic basis of his own disease. Lupski suffers from Charcot-Marie-Tooth (CMT) disease, a neurological condition that results in muscle weakness and wasting. The paper describes the process of sifting through thousands of potentially functional variants to eventually pin down the mutations responsible, which turn out to be in a gene that had previously been associated with CMT.
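To make that sifting process concrete, here's a minimal sketch of the kind of filtering funnel involved: keep only rare, protein-altering variants, and look first at hits in genes already linked to the disease. The Variant fields, the frequency threshold, and the known-gene list are hypothetical illustrations, not the actual pipeline used in the Lupski paper.

    from dataclasses import dataclass

    @dataclass
    class Variant:
        gene: str         # gene symbol the variant falls in
        effect: str       # e.g. "missense", "stop_gain", "synonymous"
        frequency: float  # frequency in reference populations

    # Variant classes likely to disrupt protein function
    DISRUPTIVE = {"missense", "stop_gain", "frameshift", "splice_site"}

    def prioritise(variants, known_disease_genes, max_freq=0.01):
        """Keep rare, protein-altering variants, with hits in known
        disease genes sorted to the top of the shortlist."""
        shortlist = [v for v in variants
                     if v.effect in DISRUPTIVE and v.frequency < max_freq]
        # False sorts before True, so known-gene hits come first
        shortlist.sort(key=lambda v: v.gene not in known_disease_genes)
        return shortlist

Even after a funnel like this, a genome typically yields hundreds of candidates - which is why a hit in a known disease gene matters so much.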
This study is a clear illustration of the power of whole-genome sequencing to cast light on a long-standing personal mystery (Lupski has been searching for his disease mutation for decades). However, Lupski was fortunate that his mutation fell within a gene already known to be linked to CMT; as the second study shows, researchers hunting for entirely novel disease genes face a more serious challenge.
Here the outcome is less unambiguously cheerful: this paper illustrates that even with complete genomes it can still be hard to pick apart the genetic origins of disease.
Despite having entire genome sequences from four individuals - a family quartet in which both children suffer from the rare Miller syndrome - the researchers could only narrow the list of candidate disease-causing genes down to a shortlist of four, and it was only with the addition of large-scale sequence data from two additional unrelated patients that the most likely gene could be identified (a result published in a separate paper in November last year).
The basic problem here is that we're still extremely bad at differentiating between mutations causing serious disease and perfectly benign polymorphisms - each of us has a genome littered with genetic variants that look like nasty mutations but have little or no effect on health. In fact, Lupski's genome illustrates this nicely: one of the mutations causing his disease is a premature stop codon that disrupts the function of a gene - but his genome also contains an additional 120 stop codons disrupting other genes, presumably without severe health effects.
So all of us are walking around with hundreds of gene-disrupting variants, and finding the single causative gene amongst all that noise is seriously challenging. In the case of the Miller syndrome study, adding more genomes from other family members helped a lot, but it wasn't quite enough to nail down the gene responsible.
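A toy example shows why every additional affected individual helps: require that all affected genomes carry rare disruptive variants in the same gene, and the candidate list shrinks fast. This is not the actual method of either paper; DHODH is the gene implicated in the separate exome study mentioned above, and the other gene names are placeholders.

    def shared_candidates(gene_sets):
        """Genes hit by rare disruptive variants in *every* affected
        individual - the surviving candidate list."""
        return set.intersection(*gene_sets)

    # Hypothetical candidate genes per affected individual
    sib1     = {"DHODH", "GENE_A", "GENE_B", "GENE_C"}
    sib2     = {"DHODH", "GENE_A", "GENE_B", "GENE_C"}
    patient3 = {"DHODH", "GENE_D"}
    patient4 = {"DHODH", "GENE_E"}

    print(shared_candidates([sib1, sib2]))
    # -> still four candidates from the family alone
    print(shared_candidates([sib1, sib2, patient3, patient4]))
    # -> the unrelated patients narrow it to DHODH

Siblings share half their genomes, so they share many candidate genes by descent; unrelated patients share essentially nothing except the disease gene itself, which is why they are so much more informative per genome.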
There are some ominous implications here for personal genomics as we move into the whole-genome sequencing era. If it's hard to find a severe disease mutation using four complete genomes, how much more difficult will it be to interpret variants with much more subtle effects on health using only one genome (i.e. your own)? What will we do with rare, potentially serious-looking variants found in an individual's genome but nowhere else?
Predicting the functional effects of such variants - particularly if they happen to fall in one of the thousands of human genes without any confident functional annotation - is notoriously difficult. Yet each of them represents a potentially actionable piece of data, a variant that may portend some serious but preventable condition looming in our future or the future of our children - if only we had the knowledge we needed to interpret it.
In fact, the Lupski paper reminds us that even the functional annotation that does exist is far from universally reliable: the team found that Lupski was homozygous for five other mutations marked in the HGMD database as causing severe diseases that he does not actually have. It's likely that these represent errors either in the database or in the primary literature (many alleged Mendelian mutations in the literature are in fact benign variants spuriously recorded as disease-causing).
The key message here is that sequencing technology is still moving far faster than our ability to interpret the resulting data. Squeezing more value out of personal genome sequencing will require improved databases of variants (cue the 1000 Genomes Project) and vastly improved tools for inferring the functional effects of novel variants - a task that will require combining evolutionary data with large-scale functional experiments.
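To give a flavour of what combining evolutionary data might look like in practice, here's a deliberately naive scoring sketch: weight the variant class by how damaging it tends to be, then scale by how conserved the site is across species. Real tools such as SIFT and PolyPhen are far more sophisticated; the weights below are invented purely for illustration.

    # Crude weights for how damaging each variant class tends to be
    EFFECT_WEIGHT = {"stop_gain": 1.0, "frameshift": 1.0,
                     "missense": 0.5, "synonymous": 0.05}

    def naive_score(effect, conservation):
        """conservation: 0 (fast-evolving site) to 1 (invariant
        across species). Higher scores = more suspicious variants."""
        return EFFECT_WEIGHT.get(effect, 0.1) * conservation

    print(naive_score("missense", 0.95))  # conserved site: worth a look
    print(naive_score("missense", 0.10))  # variable site: probably benign

The intuition is simple: a change at a position that evolution has refused to touch for millions of years is far more likely to matter than the same change at a position that varies freely between species.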
These two papers represent the first foray into the brave new world of clinical genetics: gene discovery and diagnosis using complete genome sequencing. The projections I've seen suggest that hundreds of patients with severe diseases will have complete genomes sequenced this year, and thousands more will have all of their protein-coding genes (i.e. their exomes) sequenced. We'll be learning a lot about the complexities of genetic variation in the process; hopefully, the end result will be vastly improved tools with utility for personal genomics in general.
Comments
Daniel,
fantastic post! This is precisely the problem with the rush to market with these tools and why Pollack will write his article. The future of this stuff for predicting common disease risk will require 100,000 genomes in most cases. We have a long way to go. And you highlight some of the problems very nicely.
Thanks for the no hype post!
-Steve
Posted by: Steven Murphy MD | March 11, 2010 6:13 AM
I think disease identification by genome sequencing reflects an inflection point in the maturation of the applications of DNA sequencing. Perhaps, historically, 2010 will be noted for these publications in the context of the other genomes coming online monthly.
Posted by: David Bachinsky | March 11, 2010 8:42 AM
from Nick Wade's NYTimes article on these papers:
"About 2,000 sites on the human genome have been statistically linked with various diseases, but in many cases the sites are not inside working genes, suggesting there may be some conceptual flaw in the statistics."
wtf?
Posted by: p-ter | March 11, 2010 12:13 PM
p-ter,
Heh - I'm right in the middle of blogging that exact same paragraph.
Seems like someone's been listening a little too hard to David Goldstein?
Posted by: Daniel MacArthur | March 11, 2010 12:23 PM
Also, Wade's comment, "less than a dozen genomes had been decoded, all of healthy people," has folks in St. Louis apoplectic.
Posted by: Michael T. | March 11, 2010 2:24 PM
Brilliant and informative post, especially important in view of the NY Times article on the subject. I found the reader comments on the article almost as interesting as the article itself. People are very worried about their privacy and are also becoming savvy enough to realize we need a lot more data.
Posted by: Ron Ranauro | March 12, 2010 6:50 AM
I think whole genome sequence based GWAS will truly shine when it is cheap enough to do it with at least 1,000 cases and 1,000 controls.
I suspect this will take another five years to get there. Two to three years to get the genome sequenced at less than $1,000. Another two to three years to develop the computational algorithms and infrastructure and to actually conduct the case control study.
Posted by: Geneticist from the East | March 12, 2010 11:30 AM
Hi Geneticist,
I think you're being a little pessimistic - I'm expecting to see the $1,000 (reagent cost) genome in 2011, and the infrastructure required to map and analyse 2,000 genomes is already available; it's just expensive to put together.
Posted by: Daniel MacArthur | March 12, 2010 11:45 AM
One thing missing from your analysis (which is well above average) is some discussion about any hope of eventual "cost effectiveness".
Personally, I see very little hope there. The NYT states that an entire sequenced genome can now be yours for "only" $50,000. What they neglect to mention is the cost of then pulling any useful information out of the incredibly huge number of data points.
And the subsequent cost of putting any actual information to use - creating some kind of functional therapy. Designer RNAs, possibly, or specific protein blockers?
None of that is cheap, or ever will be.
So what they're working on, really, is NOT a technology which will bring broad benefits to the health of mankind - but a new way to extend the lifespans of the extremely wealthy.
Personally, I'm not in the mood at the moment, to help them out.
Posted by: Greenpa | March 12, 2010 11:50 AM