Monday, April 9, 2018

Those hoofbeats just might come from zebras

Image by Eric Dietrich via Wikimedia Commons
A physician in the 1940s wanted to convey to his students that the most obvious diagnosis is most likely to be the correct one, so he coined a saying that has become famous: “When you hear hoofbeats, think of horses not zebras.” Applying this concept to complex disease genetics, if a risk-associated variant causes a non-synonymous mutation in a coding sequence, the first hypothesis to consider is that it affects disease risk by altering the protein. But although this is often the case, one of the lessons we can learn from a large new study, published today and now available for browsing and searching in the T2D Knowledge Portal, is that we should not forget about zebras.

The new study, from a global coalition of scientists (Mahajan et al., Nature Genetics 2018), is an exome-wide association study that surveyed the T2D associations of variants within the protein-coding regions of the genome. Including more than 81,000 T2D cases, over 370,000 controls, and multiple ancestries, this study has a three-fold larger effective sample size than any previous study. Using p-value < 2.2 × 10⁻⁷ as a threshold for significance across the exome, the authors found 69 significantly associated coding variants representing 40 distinct association signals in 38 loci, 16 of which had not been previously associated with T2D risk.
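For readers who have not met the term, "effective sample size" in a case-control study is conventionally computed as 4 / (1/N_cases + 1/N_controls), which rewards balanced designs. The short Python sketch below applies that conventional formula to the rounded counts quoted above. This is an illustration only: the counts are the lower bounds given in this post, the function name is ours, and the paper sums effective sizes across its contributing studies, so its reported figure need not match this pooled approximation.

```python
def effective_sample_size(n_cases: float, n_controls: float) -> float:
    """Conventional effective sample size for a case-control design.

    Equals the total size of a perfectly balanced study (half cases,
    half controls) that would give comparable statistical power.
    """
    if n_cases <= 0 or n_controls <= 0:
        raise ValueError("case and control counts must be positive")
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


# Rounded lower bounds taken from the post, not exact study counts.
approx_cases = 81_000
approx_controls = 370_000

n_eff = effective_sample_size(approx_cases, approx_controls)
print(f"Approximate pooled effective sample size: {n_eff:,.0f}")
```

Because controls outnumber cases by more than four to one, the effective size is well below the raw total of about 451,000 individuals.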

To get a better idea of which variants in these loci were causal for T2D risk, the researchers performed fine mapping for 37 of the 40 significant signals. They meta-analyzed T2D associations for over 500,000 individuals of European descent, performed imputation, and then generated 99% credible sets for each signal—that is, sets of variants that are 99% likely to include the causal variant. To calculate the credible sets, they used an “annotation-informed prior” model of causality that took into account the distribution of associations for different variant impact classes and also the overlap of variants with putative enhancer elements.
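The general idea of a credible set can be sketched in a few lines of Python. The sketch assumes each variant in a signal has a Bayes factor summarizing its association evidence and a prior weight reflecting its annotation. Posterior probabilities are proportional to prior times Bayes factor, and variants are added in decreasing order of posterior until the cumulative probability reaches 99%. The variant names, Bayes factors, and prior weights below are invented for illustration and do not come from the paper. The authors' actual annotation-informed model is more elaborate than this.

```python
from typing import Dict, List, Tuple


def posterior_probabilities(
    bayes_factors: Dict[str, float], priors: Dict[str, float]
) -> Dict[str, float]:
    """Posterior probability of causality per variant, assuming exactly
    one causal variant in the signal: proportional to prior * Bayes factor."""
    weights = {v: bayes_factors[v] * priors[v] for v in bayes_factors}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


def credible_set(
    posteriors: Dict[str, float], coverage: float = 0.99
) -> List[Tuple[str, float]]:
    """Smallest set of top-ranked variants whose posteriors sum to >= coverage."""
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    chosen: List[Tuple[str, float]] = []
    cumulative = 0.0
    for variant, prob in ranked:
        chosen.append((variant, prob))
        cumulative += prob
        if cumulative >= coverage:
            break
    return chosen


# Hypothetical signal: one coding variant and three non-coding neighbours.
example_bfs = {"coding_1": 60.0, "noncoding_1": 30.0,
               "noncoding_2": 9.5, "noncoding_3": 0.5}
# Hypothetical annotation-informed prior that up-weights the coding variant.
example_priors = {"coding_1": 4.0, "noncoding_1": 1.0,
                  "noncoding_2": 1.0, "noncoding_3": 1.0}

example_set = credible_set(posterior_probabilities(example_bfs, example_priors))
for name, prob in example_set:
    print(f"{name}: {prob:.3f}")
```

A small credible set dominated by one variant points to a likely causal candidate. A large set with the probability spread across many variants means the data cannot yet tell them apart.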

The 37 association signals for which the authors generated credible sets were all due to coding variants that would cause changes in the sequence of the encoded protein. But surprisingly, the fine mapping analysis found that coding variants were likely to be causal for T2D risk at fewer than half of these loci.

One of these surprising results involves a gene that is well-known to be relevant to T2D: PPARG. Involvement of the PPARG protein in T2D is beyond doubt, since this ligand-inducible transcription factor is the target of thiazolidinedione drugs that are used to treat T2D. A common variant in PPARG, rs1801282, that causes a p.Pro12Ala change in the protein has been assumed to account for the T2D association, but there is little experimental evidence that this change affects PPARG function.

In the credible set generated in this study, the probability that rs1801282 is causal was not found to be particularly high. Included in this credible set along with rs1801282 are 19 non-coding variants. One of these was previously shown to affect a binding site for the transcription factor PRRX1 and to affect expression of PPARG2, a PPARG isoform. This suggests the intriguing possibility that the T2D risk in this locus is caused, partly or wholly, by variants affecting regulation rather than protein sequence.

A similar pattern, with partial causality due to non-coding variants, was seen at an additional 7 loci. And in 13 other loci, even though these loci were discovered via coding variant signals, non-coding variants had the highest probability of causing risk.

According to Professor Mark McCarthy of the University of Oxford, one of the principal investigators of the study, “Our study shows that we should not jump to conclusions when we see that one of our association signals includes a variant around which we can base an attractive mechanistic narrative. The ‘average’ coding variant is more likely to be causal than the ‘average’ noncoding variant, but even at the set of loci where we detect a significant coding variant association, it is as likely as not that the signal is driven instead by one of the non-coding variants nearby. By bringing together genetic and genomic data, we can improve our prospects for finding the causal variants at GWAS loci, but these should be the starting points for empirical studies not a destination in themselves.” Dr. McCarthy has written a commentary on this study; read it here.

So, in investigating complex disease genetics, it is still a good bet that a coding variant affects disease risk via altered protein sequence: at least in some parts of the world, hoofbeats are very often due to horses. But this study reminds us that it is always a good idea to look beyond the obvious hypothesis, and remember the zebras.

This paper includes many other discoveries, and we recommend that you read the paper to get the full story. We are pleased to announce that in addition to publishing the paper, the authors have made their results available to the T2D research community immediately upon publication, in the T2D Knowledge Portal.

The dataset in the T2DKP is named ExTexT2D (ExTended exome array genotyping for T2D) and includes associations for T2D, both unadjusted and adjusted for BMI. A description of the dataset along with a table listing the cohorts of the study subjects can be found on the Data page, and you can browse and search the ExTexT2D exome chip analysis dataset at these locations in the T2DKP:

On Gene pages (see an example) on the Common variants and High-impact variants tabs
On Variant pages (see an example) in the Associations at a glance section and the Association statistics across traits table
Via the Variant Finder search
View a Manhattan plot of associations across the genome by selecting “type 2 diabetes” or “type 2 diabetes adj BMI” in the View full genetic association results for a phenotype menu on the home page.

This dataset offers by far the largest sample size for exploring associations of low-frequency and common coding variants with T2D. The size of the study enabled evaluation of which coding variants mediate GWAS signals and which are simply "proxies" to the true causal variant, as revealed in the credible set analysis. With the addition of this dataset, the T2DKP offers in-depth information on two aspects of exome associations: common and low-frequency variant associations in ExTexT2D, and comprehensive coding variant associations in the 19K exome sequence analysis dataset (soon to include 50,000 exomes).

We are pleased to provide access to these important new results. Please contact us with any questions or comments about these new data or the T2DKP in general!
