Friday, December 21, 2018

Last 2018 T2DKP release, dedicated to Todd Green

Todd Green
Today's release in the Type 2 Diabetes Knowledge Portal of multiple new datasets, including associations for 17 new phenotypes, is dedicated to the memory of Todd Green, a Portal team member who passed away unexpectedly on November 19. Todd was an integral part of the Portal project since its inception several years ago, and had worked at the Broad Institute for many years previously. His expertise in applying statistical methods to GWAS and sequencing studies earned him co-authorship on more than 50 papers on the genetics of complex traits, focusing on inflammatory bowel disease and type 2 diabetes. The Portal team will miss him greatly, and we will carry on his spirit in the work that we do.

This release adds to the T2DKP nine new datasets:
  • ADIPOGen GWAS, a study by Dastani et al. and the ADIPOGen Consortium, is a multi-ethnic meta-analysis of adiponectin associations in nearly 46,000 subjects. Levels of the hormone adiponectin are inversely associated with type 2 diabetes and other metabolic traits. 
  • Leptin GWAS, published by Kilpeläinen et al., is a meta-analysis of 23 association studies, with over 32,000 subjects, for unadjusted and BMI-adjusted circulating levels of leptin. Levels of leptin, a hormone secreted by adipocytes, correlate with adiposity measures such as body fat mass and body fat index. 
  • Four MAGIC HbA1c GWAS ancestry-specific datasets. This study from the Meta-Analyses of Glucose and Insulin-related traits Consortium (MAGIC), published by Wheeler et al., is a meta-analysis of HbA1c levels. The results in the T2DKP comprise separate datasets for four ancestries: African American (~7,600 subjects), East Asian (~21,000 subjects), European (~124,000 subjects), and South Asian (~9,000 subjects). 
  • CHARGE Fatty Acid GWAS, published in two papers (Guan et al. and Wu et al.) from the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium, includes associations with plasma levels of nine different fatty acids and phospholipids in nearly 9,000 individuals of European ancestry. 
  • Global Urate Genetics Consortium GWAS, published by Köttgen et al., is a meta-analysis of variant associations with levels of serum urate across more than 100,000 European-ancestry individuals. Although levels of serum urate are not on their own considered to be associated with T2D risk, a recent clustering analysis (Udler et al.) found that loci associated with decreased serum urate concentrations cluster with loci associated with liver function and lipid metabolism, potentially identifying biochemical pathways involved in a sub-type of type 2 diabetes.
  • JDRF Diabetic Nephropathy Collaborative Research Initiative GWAS (Salem et al.) represents pre-publication sharing of results from a genome-wide association study across nearly 19,500 individuals with type 1 diabetes who were assessed for a wide range of phenotypes related to kidney function and kidney disease. Results for associations with ten different phenotypes (both unadjusted, and adjusted for HbA1c and BMI) are integrated into the T2DKP.

Summary statistics for all of the datasets listed above except for the unpublished JDRF DNCRI dataset are available for public download. In addition, researchers in the Slim Initiative in Genomic Medicine for the Americas (SIGMA) Consortium have made three sets of summary statistics available for download: GWAS SIGMA, SIGMA exome chip analysis, and the exome sequences from the SIGMA cohorts that are included in the 19k exome chip analysis dataset. Find details and the download links here.

This release also includes a new feature: a column showing the gene in which a variant resides is now included by default in the table of High-impact variants on the Gene page.

The High-impact variants table on the CDC123 gene page

This table is meant to highlight variants that impact the coding sequence of a gene and thus may exert their effects on disease risk through that gene product. However, since the Gene page integrates information about variants across the region of a gene, including 100kb of up- and downstream sequences, the variants shown in this table may be located in the coding sequences of nearby genes rather than in the gene that is the focus of the Gene page. In the example shown above, from the CDC123 gene page, only one of the variants in the table affects the CDC123 coding sequence; the inclusion of the Gene column makes this immediately clear.

As 2018 draws to a close, we on the T2DKP team extend our best wishes to T2DKP visitors for health, happiness, and good research results. We look forward to hearing from you in the New Year!

Wednesday, November 7, 2018

Meet the Knowledge Portal team at AHA

This weekend, cardiovascular researchers from around the globe will be meeting in Chicago for the 2018 Scientific Sessions of the American Heart Association. Members of the Knowledge Portal Network team will be there to meet and talk with geneticists and biologists who use the Portals and get your input on how we can improve them.

Please come visit us at booth #2249 in the Exhibit Hall! We'll be there on Saturday, Nov. 10 from 11am-5pm; on Sunday, Nov. 11 from 10am-4:30pm; and on Monday, Nov. 12 from 10am-3pm.

Tuesday, October 23, 2018

New features and a new Portal released at ASHG

The Knowledge Portal team is back at work after a fantastic week at the American Society of Human Genetics meeting. We had many great conversations with researchers at our exhibit booth and at the Broad Institute exhibit booth, where we had a couple of guest spots. This year, we also held a workshop session on the Knowledge Portal Network and the Diabetes Epigenome Atlas (DGA), and about 80 people came to learn the basics of navigating the Knowledge Portals and the DGA. We were asked to provide the slides from that session, and they can be viewed here, but please note that they may not be easy to interpret without the accompanying oral presentation. We are working on creating both instructional webinars and short videos explaining different aspects of the Portals; stay tuned! And in the meantime, please contact us with any questions--we're here to help.

Part of the Knowledge Portal Network team at our ASHG booth

As usual, we released a number of new features on the Type 2 Diabetes Knowledge Portal in time for the ASHG meeting:

Calculated credible sets

Credible sets are useful because they assign to individual variants in a locus a probability of being causal for a phenotype. On Gene Pages (see an example), when viewing the type 2 diabetes phenotype, the Credible sets tab displays credible sets generated and published by Mahajan et al. (2018). However, credible sets have not been generated by researchers for phenotypes in the T2DKP other than T2D.

Now, the T2DKP provides calculated credible sets for all phenotypes. When viewing a phenotype other than T2D on the Gene page, the Credible sets tab is replaced by a Calculated credible set tab. This LocusZoom module, developed by our AMP T2D partners at the University of Michigan, automatically calculates posterior probabilities from p-values. Calculated credible sets include up to 10 variants; the credible interval covered by the set may vary, depending on the strength of associations across the region.
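To make the idea of turning p-values into posterior probabilities concrete, here is a minimal Python sketch of one common approach. It is not the LocusZoom module's code, and the portal's calculation may differ in detail. The sketch assumes a single causal variant in the region, equal prior probability for every variant, and an approximate Bayes factor proportional to exp(z²/2). The function name, the 95% coverage target, and the variant IDs in the usage note are illustrative assumptions; the cap of 10 variants comes from the description above.

```python
import math
from statistics import NormalDist


def credible_set(pvalues, coverage=0.95, max_variants=10):
    """Return (variant, posterior) pairs, most probable first.

    pvalues: dict mapping variant ID -> two-sided p-value, 0 < p <= 1.
    Assumes one causal variant in the region and equal priors
    (illustrative assumptions, not the portal's documented method).
    """
    std_normal = NormalDist()
    # Two-sided p-value -> z-score -> log approximate Bayes factor (z^2 / 2).
    log_bf = {}
    for variant, p in pvalues.items():
        z = std_normal.inv_cdf(p / 2.0)
        log_bf[variant] = z * z / 2.0
    # Normalise in log space so very small p-values do not overflow exp().
    biggest = max(log_bf.values())
    weights = {v: math.exp(lb - biggest) for v, lb in log_bf.items()}
    total = sum(weights.values())
    posteriors = {v: w / total for v, w in weights.items()}
    # Add variants in decreasing posterior until coverage or the size cap.
    chosen, cumulative = [], 0.0
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    for variant, prob in ranked:
        if cumulative >= coverage or len(chosen) >= max_variants:
            break
        chosen.append((variant, prob))
        cumulative += prob
    return chosen
```

Calling `credible_set({"rsA": 1e-12, "rsB": 1e-4, "rsC": 0.5})` with these made-up values returns a set containing only `rsA`, because its posterior alone exceeds the coverage target. When associations across a region are weak and similar, the size cap is reached first, which is why the credible interval covered by a capped set can vary.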

UK Biobank PheWAS

Recently, we added to the T2DKP another LocusZoom module for displaying phenome-wide associations. The PheWAS display, showing associations for a variant across all of the phenotypes included in the T2DKP, is the default visualization in the "Associations at a glance" section of Variant pages (see an example). Now, by checking the "Use UKBB data" box, you can view associations for a variant across about 1,400 UK Biobank phenotypes from an analysis performed by our AMP T2D partners at the University of Michigan.

New LocusZoom visualization shows variant associations across UK Biobank phenotypes

Forest plot visualization of variant associations

We also provide yet another LocusZoom visualization on a separate tab of the "Associations at a glance" section of the Variant page. The Forest plot is an alternative way to visualize phenotypic associations for a variant. In addition to displaying the significance of associations, the Forest plot also shows the direction of effect and the confidence interval for variant associations.
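As a rough illustration of what each row of a forest plot encodes, the Python sketch below derives the direction of effect and a 95% confidence interval from an effect estimate and its standard error. The function name and returned fields are hypothetical, and the normal approximation (estimate ± 1.96 standard errors) is an assumption rather than a description of how the portal computes its intervals.

```python
import math


def forest_row(beta, se, z_crit=1.96):
    """Summarise one association as a forest-plot row.

    beta: effect estimate (a log odds ratio for a case-control trait).
    se:   standard error of beta.
    """
    lower = beta - z_crit * se
    upper = beta + z_crit * se
    return {
        "direction": "+" if beta > 0 else "-" if beta < 0 else "0",
        "ci": (lower, upper),
        # Exponentiating is only meaningful when beta is a log odds ratio.
        "odds_ratio": math.exp(beta),
        "or_ci": (math.exp(lower), math.exp(upper)),
        # An interval that excludes zero corresponds to p < 0.05 here.
        "excludes_null": lower > 0 or upper < 0,
    }
```

A row whose interval crosses zero, such as `forest_row(-0.05, 0.04)`, is the kind of association that looks unconvincing on the plot even though its point estimate has a direction.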

Forest plot on the Variant page

Genetic Risk Score module

The T2DKP now includes an initial version of the Genetic Risk Score module.  This is an instantiation of the same custom burden test that is found on Gene pages, but instead of using as input a set of variants across a gene, the module uses a set of 243 variants identified by Mahajan et al. (2018) that are significantly associated with T2D risk. The module draws on 9 different datasets, including 3 housed at the Broad Data Coordinating Center and 6 housed at the T2DKP Federated node at EBI. Just like the burden test, it allows you to choose a phenotype for analysis, adjust the set of variants if desired, filter the sample set by many criteria, and set custom covariates before running the analysis. The results obtained from this module can potentially reveal genetic relationships between phenotypes. The module is still under development, and we would appreciate your feedback on it!
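For readers new to the concept, a genetic risk score is often built as a weighted sum of risk-allele counts across a set of variants. The Python sketch below shows that general idea only. It is not the module's implementation, which is described above as a form of the custom burden test, and the function name, the weights, and the variant IDs in the usage note are hypothetical.

```python
def genetic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages for one individual.

    dosages: variant ID -> risk-allele count or imputed dosage (0 to 2).
    weights: variant ID -> per-allele weight, e.g. a log odds ratio.
    Variants missing from `dosages` are skipped here; a real analysis
    would need an explicit missing-data policy.
    """
    return sum(weights[v] * dosages[v] for v in weights if v in dosages)
```

For example, with weights `{"rs1": 0.1, "rs2": 0.2}` and dosages `{"rs1": 2, "rs2": 1}`, the score is 0.1 × 2 + 0.2 × 1 = 0.4. Testing such a score for association with a second phenotype is one way that genetic relationships between phenotypes can be explored.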

New Knowledge Portal added to the network

At the ASHG meeting we unveiled the newest member of the Knowledge Portal Network: the Sleep Disorder Knowledge Portal (SDKP),  for the genetics of sleep and circadian traits. There is currently one dataset for sleep genetic associations in the SDKP, "UK Biobank Sleep Traits GWAS," which includes chronotype, sleep duration, insomnia, daytime sleepiness, and nap traits. Additional association datasets are available for type 2 diabetes and glycemic traits, anthropometric traits, measures of kidney function, and psychiatric traits, and more sleep data will be added soon.

Monday, October 15, 2018

Connect with the Knowledge Portal Network team at ASHG!

This week, the human genetics research community will come together in San Diego for one of the most important conferences of the year: the annual American Society of Human Genetics meeting. The Knowledge Portal Network team will be there, and in addition to presenting all the new data and features in the Type 2 Diabetes, Cerebrovascular Disease, and Cardiovascular Disease Knowledge Portals (KPs), we're launching an entirely new Portal: the Sleep Disorder Knowledge Portal, for the genetics of sleep and circadian traits.

We'll also present an interactive workshop on Friday that will go over the basics of navigating the Knowledge Portal Network. Download the flyer here, and find more details below.

Here's the schedule of events for the week:

Tuesday, October 16
2:05-2:30 pm: Jason Flannick will present a talk, "Infrastructure for analyzing and disseminating large-scale genetic data for type 2 diabetes and other complex diseases," in the ASHG/IGES/ISCB Joint Symposium.
Room 6C - Upper Level/San Diego Convention Center

Wednesday, October 17
The Knowledge Portal team will be at our booth, #219, in the exhibit hall from 10am-4:30pm.
We'll also be at the Broad Institute Genomic Services booth, #1634, from 10:30-11:30am.
At 2:30pm, Richa Saxena, the P.I. for the Sleep Disorder Knowledge Portal, will be at our booth to talk about the SDKP.

Thursday, October 18
The team will again be at our booth, #219, in the exhibit hall from 10am-4:30pm.

Friday, October 19
We'll again be at our booth, #219, in the exhibit hall from 10am-4:30pm, but today the booth will be closed around lunchtime so that we can present a special tutorial session on the Knowledge Portals. See details and sign up below. After the session, we'll be back at our booth until 4:30pm and will also be at the Broad Institute Genomic Services booth, #1634, from 2:30-3:30pm.

At lunchtime on Friday, grab your laptop and come to a workshop on the Knowledge Portals:

Navigating complex disease genetics: using the Knowledge Portal Network to move from SNPs to functional insights
Room 28C, Upper Level, San Diego Convention Center

We'll go over some basics, illustrate workflows, and answer questions about how you can use KPs to investigate SNPs, genes, or regions of interest and turn genetic data into insights about complex diseases.

Please sign up so we can plan for refreshments. We'll send you a reminder a few days beforehand. We look forward to seeing you there! Please contact us with any questions or suggestions for topics you'd like to discuss.

Monday, October 8, 2018

DIAMANTE GWAS dataset adds close to a million samples along with fine-mapping to the T2DKP

In a groundbreaking paper published today, Anubha Mahajan and colleagues (Mahajan et al., Nature Genetics 2018) report on a meta-analysis of unprecedented size for genetic associations with type 2 diabetes (T2D) along with fine-mapping analyses to identify causal variants that can suggest new therapeutic targets. We are pleased to provide access to the summary results as well as the results of the fine-mapping today in the T2D Knowledge Portal (T2DKP).

Working as part of the DIAGRAM (DIAbetes Genetics Replication And Meta-analysis) and DIAMANTE (DIAbetes Meta-ANalysis of Trans-Ethnic association studies) consortia, the researchers aggregated and meta-analyzed genome-wide association studies for about 900,000 individuals of European ancestry (about 74,000 T2D cases and 824,000 controls). The studies were imputed using the most comprehensive reference panels possible, and in all, the analysis considered about 27 million genotyped or imputed variants.

After T2D association analysis (both unadjusted and adjusted for body mass index), 243 loci were found to be associated with T2D at genome-wide significance (p-value for association ≤ 5 x 10⁻⁸). Of these, 135 were novel--not detected in any previous T2D association analysis.

Within these loci, each of which included multiple significantly associated variants, the researchers performed approximate conditional analysis to determine whether the associations were independent of each other. They found surprising complexity within some loci; for example, the well-known TCF7L2 locus appears to include as many as 8 distinct association signals!

All of the T2D associations from this study may be viewed in the T2DKP. They are represented in two datasets, named "DIAMANTE (European) T2D GWAS" and "UK Biobank T2D GWAS (DIAMANTE-Europeans Sept 2018)."  Manhattan plots showing the distribution of the associations across the genome may be seen by selecting either the "Type 2 diabetes" or "Type 2 diabetes adj BMI" phenotypes from the phenotype selection menu on the T2DKP home page. On Gene pages of the T2DKP, the results may be viewed in tables of variant associations and in the interactive LocusZoom visualization (see below). Results from this study are also displayed on Variant pages of the T2DKP.

LocusZoom plot on the PPARG Gene page

The credible set analysis performed in this study is also incorporated into the T2DKP. On the "Credible sets" tab of Gene pages, you may choose to visualize any of the credible sets available for the region. Epigenomic annotations that overlap the positions of the variants in the credible set are presented in an interactive display that allows you to select particular chromatin states or tissues to view. In the example shown below, one of the credible sets in the TCF7L2 region includes just two variants, and the one with the highest posterior probability overlaps active enhancer regions in adipose and liver tissue--both of which are important for T2D.

Detail of the Credible sets tab of the TCF7L2 Gene page

The multiple causal variants identified in this study support previous investigations on the biological mechanisms behind T2D and suggest new hypotheses that will likely lead to therapeutic insights. After reading the paper and a blog post from the authors, we invite you to explore the results in the T2DKP and to contact us with any suggestions or questions!

Wednesday, September 26, 2018

New datasets and many new phenotypes in the T2DKP

Today we release several new datasets, including associations for many new phenotypes and individual-level data for secure interactive analysis, to the Type 2 Diabetes Knowledge Portal.

The AAGILE GWAS dataset, from the African American Glucose and Insulin Genetic Epidemiology (AAGILE) Consortium, brings more diversity of ancestry to the T2DKP, with meta-analysis of fasting glucose and BMI-adjusted fasting insulin associations from over 20,000 African American individuals. These results were combined with associations for over 57,000 individuals of European ancestry from the Meta-Analyses of Glucose and Insulin-related traits Consortium (MAGIC) in a trans-ethnic meta-analysis.

This release also adds two new diabetic kidney disease datasets from the SUMMIT (SUrrogate markers for Micro- and Macro-vascular hard endpoints for Innovative diabetes Tools) consortium. All of the more than 40,000 subjects in the "Diabetic Kidney Disease GWAS: subjects with T1D or T2D" dataset had either type 1 or type 2 diabetes. The study measured seven different renal phenotypes in these subjects, including four that are new to the T2DKP. Summary association results are available for the entire group and for sub-cohorts that separate T1D from T2D and European from Asian ancestry. A separate dataset from SUMMIT, "Diabetic Kidney Disease GWAS: subjects with T1D or T2D, ESRD vs. controls," comprises more than 5,600 diabetics, nearly 1,200 of whom had end-stage renal disease. These two datasets greatly expand the range of diabetic complications for which genetic association data are available in the T2DKP.

The T2DKP is federated, meaning that in addition to the Data Coordinating Center at the Broad Institute, some results are drawn from a sister site at the European Bioinformatics Institute (EMBL-EBI). This system allows data that may not leave Europe to be represented in the T2DKP. Six of the new datasets in this release are housed at the T2DKP Federated Node at EMBL-EBI.

The Hoorn Diabetes Care System (DCS) dataset includes associations for 12 different anthropometric, blood lipid, blood pressure, and liver and kidney function measures for a cohort of over 3,400 type 2 diabetics in the Netherlands.

The GoDarts project (Genetics of Diabetes Audit and Research in Tayside Scotland) recruits type 2 diabetics and matching controls in the Tayside region of Scotland. This release includes five new datasets from GoDarts, representing experiments performed using different arrays. Each experiment determined genetic associations for a wide variety of phenotypes, including two that are new to the T2DKP: levels of adiponectin and leptin, hormones that are associated with risk of T2D and obesity.

Results from all of these datasets may be searched using the Variant Finder tool and may be browsed:

• On Gene Pages in the Common variants and High-impact variants tables and in LocusZoom plots;

• On Variant Pages in the Associations at a glance section, the Associations across all datasets section, and in LocusZoom plots;

• From the View full genetic association results for a phenotype search on the home page: first select a phenotype, then select a dataset on the resulting page.

Individual-level data from the Hoorn DCS and GoDarts datasets also power secure interactive analyses using the Genetic Association Interactive Tool (GAIT) on Variant Pages. With the new additional data, nearly 61,000 individual-level samples are now available for custom association analysis.

Please take a look at the new results and contact us any time with questions or suggestions!

Tuesday, August 14, 2018

Sign up for a hands-on tutorial session on the Knowledge Portals

Are you attending the American Society of Human Genetics meeting in October? If so, save your Friday lunch break for a tutorial session on the Knowledge Portals!

Navigating complex disease genetics: using the Knowledge Portal Network to move from SNPs to functional insights
12:30pm - 1:45pm
Friday, October 19
San Diego Convention Center
Room 28C, Upper Level

Bring your laptop and your questions about the Type 2 Diabetes, Cerebrovascular Disease, or Cardiovascular Disease Knowledge Portals (KPs). We'll go over some basics, illustrate workflows, and answer questions about how you can use KPs to investigate SNPs, genes, or regions of interest and turn genetic data into insights about complex diseases.

Please sign up so we can plan for refreshments. We'll send you a reminder a few days beforehand. We look forward to seeing you there! Please contact us with any questions or suggestions for topics you'd like to discuss.

Friday, June 22, 2018

New data release brings new phenotypes and huge sample sizes to the T2DKP

Progressing towards the goal of the Accelerating Medicines Partnership in Type 2 Diabetes (AMP T2D) to aggregate, analyze, and present comprehensive genetic data relevant to T2D in order to speed up the validation of new drug targets, today we release 10 new datasets to the Type 2 Diabetes Knowledge Portal. These datasets contain variant associations for 17 phenotypes, including 7 that are new to the T2DKP, from over 1.4 million samples.

Four of the new datasets were generated by collaborators in AMP T2D, the parent organization of the T2DKP. AMP T2D is a pre-competitive partnership among the National Institutes of Health, industry, and not-for-profit organizations, managed by the Foundation for the National Institutes of Health, that supports the generation of genetic association data and many other kinds of genomic data as well as providing access to these data in the T2DKP, to facilitate the translation of these data into biological knowledge about T2D.

For all four of these datasets, quality control and association analysis were performed by the Analysis Team of the AMP Data Coordinating Center (AMP DCC) at the Broad Institute, using standard, state-of-the-art methods. These processes are completely transparent and fully documented: the experimental design and analysis are summarized on our Data page, and detailed reports are available for download. In this first phase of analysis, associations were determined for type 2 diabetes, fasting glucose levels, and fasting insulin levels--both unadjusted, and adjusted for body mass index. Future analyses will add more phenotypes.

One of these datasets, Diabetic Cohort - Singapore Prospective Study GWAS, was contributed by collaborators at the National University of Singapore. Consisting of 3,864 samples, it is a T2D case-control study to identify genetic and environmental risk factors for diabetes in Singapore Chinese. The other three new sets that were analyzed at the AMP DCC, contributed by collaborators at the University of Michigan, are from the Finland-United States Investigation of NIDDM Genetics (FUSION) Study, which seeks to map and identify genetic variants that predispose to type 2 diabetes or affect variability in diabetes-related traits. The three FUSION datasets are FUSION GWAS, with 1,681 samples; FUSION Metabochip, with 2,163 samples; and FUSION exome chip analysis, with 3,485 samples.

All four of these datasets now have “Early Access Phase 1” status, which is assigned to new data. This status denotes that although analysis and quality control checks have been performed, the data are not yet considered to be in their final state. During the early access period, users may analyze the data but may not submit the results of these analyses for publication. Find the full details about the different phases of data release on our Policies page.

In addition to the datasets from AMP T2D partners, we have also added or updated 6 new sets of publicly-available association summary statistics for phenotypes relevant to T2D:

  • The previous CKDGen GWAS dataset for chronic kidney disease has been replaced with a newer study from the CKDGen consortium, imputed to the 1000 Genomes reference set (Gorski et al., 2017), with 110,517 samples;
  • Early Growth Genetics Consortium GWAS associations for childhood obesity (Bradfield et al., 2012), with 13,848 samples;
  • Body fat distribution associations (Shungin et al., 2015), with 245,749 samples, have been added to the existing GIANT GWAS dataset;

Results from all the new datasets may be viewed at these locations in the T2D Knowledge Portal:

• On Gene Pages (e.g., GCKR) in the Common variants and High-impact variants tables and in LocusZoom plots;

• On Variant Pages (e.g., rs1260326) in the Associations at a glance section, the Association statistics across traits table, and in LocusZoom static plots;

• From the View full genetic association results for a phenotype search on the home page: select a phenotype and view the top variants in a Manhattan plot and table;

• Using the Variant Finder tool: specify multiple criteria and retrieve the set of variants meeting those criteria from any of these datasets.

Additionally, individual-level data from the Diabetic Cohort - Singapore Prospective Study GWAS and FUSION GWAS datasets are available for secure custom interactive analyses using these tools in the T2DKP:

• Using the Genetic Association Interactive Tool (GAIT) on Variant Pages, you may choose a phenotype for association analysis, choose custom covariates, filter the sample pool by specifying a range of values for one or more phenotypes, then run on-the-fly analysis.

• Dynamic LocusZoom plots on Gene and Variant pages allow you to run association analysis using one or more variants of your choice as covariates, in order to test whether associations are independent.

With today's release, the T2DKP includes genetic associations for 68 phenotypes from a total of 35 datasets. We welcome submissions of new datasets for incorporation into the T2DKP. Find information about collaboration here, and please contact us with questions.

Monday, June 18, 2018

See you at ADA!

The 78th Scientific Sessions of the American Diabetes Association are coming up in just a few days, and the T2D Knowledge Portal team will be there!

As usual, we'll have a booth in the exhibit hall. We'll be at booth #1075 from 10am to 4pm on Saturday and Sunday 6/23-24, and from 10am to 2pm on Monday 6/25. Come say hello, get a demonstration of the T2D, Cardiovascular Disease, or Cerebrovascular Disease Knowledge Portals, and pick up some of the T2DKP sticky notes that we'll be giving away!

Here's who you might find at the booth when you stop by:

There will also be presentations from several members of our group on Saturday, June 23:
  • Jason Flannick, PhD will give a talk on "The Type 2 Diabetes Knowledge Portal" at 11:30am.
Session: Quantifying Diabetes: Genomics, Electronic Health Records, and Automated Control
Location: W312
  • Jose C. Florez, MD, PhD, will moderate an interactive poster session, "Delving into Type 2 Diabetes Genetics", at 12:30 pm.
Location: Poster hall
  • Miriam Udler, MD, PhD will present "Genetic testing for Monogenic Diabetes--Whom to Test, What and How to Order?" at 2:15pm.
Session: Monogenic Diabetes Testing is Ready for Prime Time--Integrating Genetics into Your Practice
Location: W304E-H

We hope to meet you in Orlando!

Friday, June 1, 2018

New T2DKP features help distill knowledge from data

We are pleased to announce four new features in the Type 2 Diabetes Knowledge Portal that simplify the interpretation of genetic association data, making it easier to pinpoint variants and datasets that are informative for a disease or phenotype of interest.

"Clumping" variants by linkage disequilibrium

The first step in getting an overview of the results of a particular experiment is typically to plot variant associations vs. chromosomal location, in a so-called "Manhattan plot." These plots are available from the T2DKP home page after choosing a phenotype from the list:

After selecting a phenotype, you may select a dataset, and the Manhattan plot is displayed above a table of the top variants:

Now, in addition to selecting a dataset to view associations, you may select a threshold for linkage disequilibrium (LD) in order to reduce the number of linked variants that represent a single association signal. For example, without "clumping" variants by LD (r² = 1), when viewing the DIAGRAM 1000G GWAS dataset there are 70 significantly associated variants in the IGFBP2 gene; but setting the most stringent LD threshold (r² = 0.1) reduces that number to just 8 variants by displaying only the most significant associations after clumping variants by LD. Intermediate LD thresholds of r² = 0.2, 0.4, 0.6, or 0.8 may also be set, allowing more versatility in this analysis.
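The general logic of LD clumping can be sketched in a few lines of Python. This is a generic greedy procedure, not the portal's code: the function name, the `r2` lookup callback, and the choice of a strict "greater than" comparison are assumptions made for illustration. The strict comparison is chosen so that a threshold of 1 leaves every variant unclumped, in line with the description above.

```python
def clump(pvalues, r2, threshold):
    """Greedy LD clumping.

    pvalues:   variant ID -> association p-value.
    r2:        function (variant_a, variant_b) -> pairwise LD r^2.
    threshold: variants whose r^2 with an index variant exceeds this
               value are folded into that variant's clump.
    Returns the index variants, most significant first.
    """
    # Visit variants from the smallest p-value to the largest.
    remaining = sorted(pvalues, key=pvalues.get)
    index_variants = []
    while remaining:
        lead = remaining.pop(0)
        index_variants.append(lead)
        # Keep only variants that are not in LD with the new lead variant.
        remaining = [v for v in remaining if r2(lead, v) <= threshold]
    return index_variants
```

Lowering the threshold folds more variants into each clump, so fewer index variants remain. That is the behaviour described above, where the most stringent setting leaves only the strongest association per signal.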

New Region page

The Gene page of the T2DKP (see an example) integrates and summarizes information about the associations of variants across the region of a gene. Now, you can see this integration and summation for any region of the genome, not just the areas surrounding protein-coding genes. Simply enter a chromosome and coordinates in the home page search box:

The resulting page resembles a Gene page. The traffic light integrates all associations across the region to give you an immediate indication of whether there are significant associations found in any of the datasets in the T2DKP. Further down the page, tools and displays let you drill down to the specifics for a phenotype or variant of interest. This new Region page provides a way to explore any part of the genome in great detail.

PheWAS graphic on the Variant page

Previously, the Variant page of the T2DKP displayed significant associations for each variant in a graphic that showed a color-coded box for each phenotype-dataset combination. But the rapidly increasing number of phenotypes becoming available from biobank studies has made this view unsustainably large. In its place, we have incorporated a phenome-wide association study (PheWAS) visualization developed at the University of Michigan. The graphic shows at a glance which phenotype associations are most significant for a particular variant. Mouse over a point to see more details.

All Associations graphic on the Variant page

The PheWAS graphic distills variant associations in order to highlight the most significant ones. But suppose you want to drill down to the details and explore associations in every dataset, viewing parameters like sample size, odds ratio, and more? There's a graphic for that too: our new All Associations interactive graphic, located in the "Associations across all datasets" section of the variant page. Start by using keywords to filter phenotypes. Filtering allows you to view one specific phenotype, several related phenotypes, or phenotypes in a broad category, such as glycemic phenotypes; both the graphic and the table below it change in response to phenotype filtering.  There are also options to filter by setting ranges of p-values and/or sample sizes.

The graph plots p-value (vertical axis) vs. dataset sample size (horizontal axis) for each association. Points in the graph are triangular; whether the triangle points up or down indicates a positive or negative direction of effect, respectively. Mousing over a point shows you more details about the association and the dataset. This graphic can help you evaluate whether an association is likely to be real. As shown in the illustration below, a genuine signal should increase in significance (i.e., decrease in p-value) with increasing sample size.
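The reason a genuine signal strengthens with sample size can be shown with a small calculation. The Python sketch below assumes a standardized quantitative trait and an additive model, under which the expected z-score is approximately beta × sqrt(2 × MAF × (1 − MAF) × N). The function name and the example values are hypothetical, and the formula is a textbook approximation rather than anything computed by the portal.

```python
import math
from statistics import NormalDist


def expected_neg_log10_p(n, beta, maf):
    """Approximate -log10(p) expected for a true effect at sample size n.

    n:    number of samples.
    beta: per-allele effect in trait standard deviations.
    maf:  minor allele frequency.
    """
    # Expected z-score grows with the square root of the sample size.
    z = beta * math.sqrt(2.0 * maf * (1.0 - maf) * n)
    p = 2.0 * NormalDist().cdf(-abs(z))
    # For extremely large z the p-value underflows to zero.
    if p == 0.0:
        return math.inf
    return -math.log10(p)
```

Evaluating the function at n = 1,000, 10,000, and 100,000 with the same `beta` and `maf` gives steadily larger values, whereas a chance finding (a true `beta` of zero) has no such trend.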

Stay in touch!

Like the rest of the T2DKP, these features are under continuous development. Please give them a try and let us know what you think.

Friday, May 11, 2018

T2DKP Spring Newsletter

The latest issue of our quarterly newsletter is now available. Download it here and get the latest!

Tuesday, May 8, 2018

NIDDK Workshop: Towards a Functional Understanding of the Diabetic Genome 2018

Recently, members of the T2D Knowledge Portal team were fortunate to participate in a fascinating workshop hosted by the NIDDK, Towards a Functional Understanding of the Diabetic Genome. Speakers highlighted the diversity of ongoing research projects that aim to translate disease-associated variants into functional insights in type 2 diabetes.

The workshop featured presentations on multiple data types that can provide clues about the mechanisms by which sequence variants affect T2D risk. Many of these offer insights into transcriptional regulation: epigenomic chromatin modifications; tissue-specific RNA levels; eQTLs; transcription factor binding sites; long-range interactions between chromosomes that bring promoters and enhancers into proximity; and regulatory pathways. Others focus on downstream processes such as protein-protein interactions, biochemical pathways, and metabolomics.

It will be crucial to integrate all of these data types with genetic association data in order to get a complete picture of how particular genomic regions influence T2D biology, and at the T2DKP we are working towards incorporating as many of these data types as possible.

Although the presentations in this workshop were diverse, some common themes were evident. One was that although the insulin-secreting beta cells in pancreatic islets are hugely significant to T2D, and most T2D risk variants influence insulin secretion, current research projects are confirming and underscoring the importance of other tissues. Fat, liver, skeletal muscle (which comprises 40% of human body weight), and brain are all intimately involved in the development of T2D.

Another common theme for ongoing T2D research is that things may often be much more complicated than they first appear. A single genomic region associated with T2D risk may harbor multiple independent causal variants, each potentially having different regulatory effects, possibly affecting different tissues, and causing varied phenotypic consequences. Even if these variants alter a protein-coding sequence, they may not act through their effects on that sequence. These genetically complicated regions, such as those elucidated in FTO or TCF7L2, may be more common than we previously thought.

A third overall conclusion from the workshop is that model organism research can accelerate the investigation of candidate genes. The short life cycles of Drosophila and zebrafish, and the versatile genetic tools available for these systems, allow for rapid and systematic interrogation of gene function. Zebrafish glucose and lipid metabolism have much in common with those processes in human cells, and with their transparent bodies, zebrafish literally give us a window into pancreatic development. In addition to being a well-developed model system, the mouse offers much greater genetic diversity than humans, with about 40 million SNPs in the mouse genome as compared to about 10 million in the human genome.

At the T2DKP, efforts to integrate many of these data types are in progress, and integration of others is being planned. We continue to work towards making the T2DKP a comprehensive resource for the T2D research community, to help accelerate the translation of variant associations into knowledge about disease mechanisms and identification of potential drug targets.

Many of the presentations at the workshop featured web resources of potential interest to T2D researchers, listed below. The T2DKP is connected with the first, the Diabetes Epigenome Atlas. We are interested in providing better connections between the T2DKP and other relevant resources. If you would be particularly interested in seeing links from the T2DKP to one of the resources below, or if you know of a resource that would be informative, we would love to hear your suggestions!

  • HaploReg: explore annotations of the noncoding genome at variants on haplotype blocks
  • ExPecto: tissue-specific gene expression effect predictions for human mutations
  • DeepSea: predict the cell type-specific epigenetic state of a sequence and the chromatin effects of sequence variants
  • GeNets: unified web platform for network-based analyses of genetic data
  • DCell: a deep neural network simulating cell structure and function

Wednesday, May 2, 2018

Join the Knowledge Portal Network team!

At the Knowledge Portal Network (currently consisting of the Type 2 Diabetes, Cerebrovascular Disease, and Cardiovascular Disease Knowledge Portals), we are looking for energetic, talented people to help us produce web portals that aggregate and serve genetic association results to the world in order to spark insights into complex diseases. There are positions open for a software engineer to help in developing and producing these web portals, and for a technical release manager to manage and coordinate tasks during production and maintenance of the portals.

The positions are located at the Broad Institute in Cambridge, MA, a dynamic and exciting work environment where cutting-edge science is applied to critical biomedical problems.

Find more details and apply for the software engineer or technical release manager positions at the Broad Careers site.

Friday, April 27, 2018

New T2DKP release adds individual-level data for interactive analysis

With the April release of the Type 2 Diabetes Knowledge Portal, we are increasing the number of datasets and samples available for interactive analysis via the LocusZoom and GAIT tools. These tools now access individual-level data from three additional datasets, all of which were quality controlled and analyzed at the Accelerating Medicines Partnership in Type 2 Diabetes (AMP T2D) Data Coordinating Center (DCC):
  • CAMP GWAS: 3,628 multi-ancestry samples from the MGH Cardiology and Metabolic Patient cohort, generated by a public-private partnership between Pfizer Inc. and Massachusetts General Hospital;
  • METSIM GWAS: 8,791 European ancestry samples from the Metabolic Syndrome in Men study.
These individual-level data are available as "dynamic" datasets, powered by Hail software, in LocusZoom on Gene pages and Variant pages of the T2DKP, for the following phenotypes: 
  • BioMe AMP T2D GWAS: type 2 diabetes, BMI, diastolic blood pressure, fasting glucose, HbA1c, HDL cholesterol, LDL cholesterol, systolic blood pressure
  • CAMP GWAS: type 2 diabetes, BMI, fasting glucose, fasting insulin
  • METSIM GWAS: type 2 diabetes, BMI, diastolic blood pressure, fasting glucose, fasting insulin, HbA1c, HDL cholesterol, LDL cholesterol, systolic blood pressure
To perform interactive analyses on these data in LocusZoom, select one of the available phenotypes in step 1 and then choose a "dynamic" dataset in step 2.

When you click on a variant in the resulting LocusZoom plot, the option to condition on that variant appears in the tooltip:

Clicking on that link starts on-the-fly association analysis for the region while conditioning on that variant, which can reveal whether association signals are independent of each other. You can choose to condition on multiple variants. The variants of your choice are listed in the upper left-hand corner of the plot, and the list may be edited:

Individual-level data from these three datasets are also available for interactive analysis via the Genetic Association Interactive Tool (GAIT) on Variant Pages. After selecting one of the datasets, you will be able to choose a phenotype for association analysis, filter the sample pool by specifying a range of values for one or more phenotypes, choose custom covariates, and then run on-the-fly association analysis for your chosen subset of samples. Find all of the details about how to use this tool in our GAIT guide.
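Conditioning on a variant, as described above for LocusZoom, amounts to adding that variant's genotype as a covariate in the association model. The sketch below illustrates the principle on simulated data; it is not the Hail-based implementation behind the Portal. It assumes a quantitative phenotype, a simple linear model, and a single conditioning variant, and all names and simulation parameters are hypothetical.

```python
import math
import random

def residualize(values, covariate):
    """Residuals of a simple linear regression of values on covariate."""
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(covariate) / n
    sxx = sum((c - mean_c) ** 2 for c in covariate)
    sxy = sum((c - mean_c) * (v - mean_v) for c, v in zip(covariate, values))
    slope = sxy / sxx
    return [(v - mean_v) - slope * (c - mean_c) for c, v in zip(covariate, values)]

def association_t(phenotype, genotype, condition_on=None):
    """t-statistic for genotype, optionally conditioning on another variant."""
    n = len(phenotype)
    if condition_on is None:
        mean_p = sum(phenotype) / n
        mean_g = sum(genotype) / n
        y = [p - mean_p for p in phenotype]
        g = [x - mean_g for x in genotype]
        df = n - 2  # intercept + tested variant
    else:
        # Regressing both sides on the conditioning variant first is
        # equivalent to including it as a covariate in the full model.
        y = residualize(phenotype, condition_on)
        g = residualize(genotype, condition_on)
        df = n - 3  # intercept + conditioning variant + tested variant
    sgg = sum(x * x for x in g)
    beta = sum(x * r for x, r in zip(g, y)) / sgg
    rss = sum((r - beta * x) ** 2 for x, r in zip(g, y))
    return beta / math.sqrt(rss / df / sgg)

# Simulated data: "lead" is causal, "proxy" is merely correlated with it.
rng = random.Random(0)

def draw_genotype():
    return int(rng.random() < 0.3) + int(rng.random() < 0.3)

lead = [draw_genotype() for _ in range(2000)]
proxy = [g if rng.random() < 0.9 else draw_genotype() for g in lead]
pheno = [0.5 * g + rng.gauss(0, 1) for g in lead]

t_marginal = association_t(pheno, proxy)
t_conditional = association_t(pheno, proxy, condition_on=lead)
```

In this simulation the proxy looks strongly associated on its own, but its signal largely disappears once the lead variant is conditioned on. That is the pattern expected when two signals are not independent.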

We hope that the increased ability to interact with individual-level data in the T2DKP will be helpful to your research. As always, we are happy to answer any questions about these or other data and tools; please contact us for help.

Tuesday, April 17, 2018

Developing a model for collaborative science: a mid-term perspective on the AMP T2D Partnership

In 2011, Dr. Francis Collins, Director of the National Institutes of Health (NIH), met with leaders in biomedical research to discuss a frustrating problem. Continual improvements in molecular biological and genomic techniques were generating an avalanche of data relevant to complex diseases, yet the translation of these data into insights about disease mechanisms and drug targets was unacceptably slow. It was clear that an entirely new paradigm for collaborative research would be needed to speed up the extraction of knowledge from data.

The result of these discussions was the creation of the Accelerating Medicines Partnership (AMP), one branch of which focuses on type 2 diabetes (T2D)—a life-threatening disease that affects hundreds of millions of people worldwide, whose incidence is growing, and whose progression cannot yet be effectively stopped or reversed. AMP T2D, a five-year project, includes the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK); the pharmaceutical companies Janssen Pharmaceuticals, Eli Lilly and Company, Merck, Pfizer, and Sanofi; the University of Michigan; the University of Oxford; the Broad Institute; and other researchers around the globe. The Foundation for the National Institutes of Health (FNIH) also provides funding and coordination for the project.

Drawing on the strengths of both academia and industry, this public-private partnership brings together all stakeholders in a pre-competitive space to share data and combine resources, with the goal of validating new drug targets faster. Now in Spring 2018, roughly mid-way through the funding period, it is evident that this collaboration has resulted in remarkable progress on both scientific and collaborative fronts.

Genetic association data: the foundation of AMP T2D

Genetic association studies interrogate the genomes of individuals at millions of specific genomic positions to discover sequence variants that are correlated with the incidence of disease. From the outset, AMP T2D aimed to support the generation of unprecedented amounts of new genome-wide association study (GWAS), exome sequencing, and whole-genome sequencing data within the project as well as their aggregation with all relevant publicly available data. Originally, 5 sites were funded by the NIDDK to generate new data and deposit them into the AMP T2D Data Coordinating Center (DCC) at the Broad Institute. As the project evolved, another site was funded by the NIDDK and 8 more sites were funded by the FNIH. Additionally, an Opportunity Pool of funds from the NIDDK was created, allowing the AMP T2D Steering Committee to award smaller grants for complementary research projects in a flexible, science-driven manner.  Currently 10 Opportunity Pool projects are in progress, and more awards will be given in the future.

Not only has the number of genetic association studies increased since the inception of AMP T2D, but also the number of samples surveyed in each has grown dramatically, from typically under 100,000 to approaching 1 million today. The increased statistical power conferred by these large sample sizes has led to a huge increase in the number of loci found to be significantly associated with T2D, from about 70 at the start of the project to nearly 430.

Improvements in genomic technologies in the past few years have allowed AMP T2D collaborators to generate increasing amounts of sequencing data, which make it possible to comprehensively interrogate all alleles and to uncover rare variation. At the project’s start, T2D associations with exome sequences (covering the protein-coding regions of the genome) were available for about 13,000 samples, and no whole-genome sequencing studies had been published. Now, more than 2,600 whole genomes are available, and analysis of a set of 50,000 exomes—the largest disease-specific aggregation of exome sequencing data to date—is nearly complete. Importantly, many of the associations that have been newly discovered in sequencing studies involve relatively rare variants that affect protein-coding regions. It is often more straightforward to develop hypotheses about the impact of such variants than it is for variants outside of coding regions.

As the AMP T2D partnership has grown in prominence in the diabetes field, the DCC has been approached by investigators outside the project who want to contribute their data in order to aggregate and display them in the context of AMP T2D data. In early 2017, researchers in the 70kforT2D project, which found novel T2D associations by re-analyzing existing GWAS data, offered their results for integration into the DCC and display in the Type 2 Diabetes Knowledge Portal (T2DKP; see below) before publication.

70kforT2D GWAS was the first pre-publication dataset to be added to the T2DKP from outside the AMP T2D partnership, and it was particularly appropriate that these scientists, whose results illustrate the value of data sharing, themselves chose to freely share their results. Incorporation of datasets into the AMP T2D DCC and T2DKP offers investigators the chance to take advantage of the expertise of the AMP DCC analysis team, apply cutting-edge analysis tools to their data, and display their results broadly to the T2D research community in the context of multiple datasets. The AMP T2D DCC is open to incorporating T2D-relevant datasets from all investigators (find details on contributing data here).

In addition to the datasets generated by AMP T2D partners and other T2D researchers, which focus on associations with T2D, glycemic measures, and T2D complications, the AMP T2D DCC also collects publicly available genetic association datasets for traits relevant to T2D, such as anthropometric measures, blood pressure and lipid levels, and heart and kidney disease.

Orthogonal data types to help identify and prioritize causal variants and genes

Finding genetic variants that are associated with T2D risk is critically important to understanding the genetics of T2D, but it is only a first step. The most significantly associated variant in a genomic region may not be the causal variant that is responsible for altered T2D risk. Researchers perform fine mapping to analyze genetic associations in specific regions of the genome and generate credible sets—that is, sets of variants that are predicted to include the causal variant. Mid-way through the AMP T2D funding period, emphasis among the data-generating partners is beginning to shift from simply generating association data to performing fine mapping and credible set analysis.

But even after predicting which sequence variations are responsible for altered risk, finding clues about how they affect risk requires integration with additional data types. Information about the functional importance of the genomic region where a variant is located—its relevance to gene expression, protein function, networks and pathways, metabolite levels, and more, all determined on a tissue-specific basis—can help prioritize genes and pathways for in-depth experimental investigation. These kinds of research were built into AMP T2D from the beginning, and as the importance of these data types became even clearer, several Opportunity Pool awards were given to projects focusing on complementary data types that shed light on the significance of genetic associations.

Several of these projects focus on generating tissue-specific epigenomic data: histone modifications, DNA methylation, chromatin conformation, transcription factor binding, 3-dimensional chromosome structure, and other data types. Epigenomic data can provide important clues about the mechanisms by which sequence variation affects T2D risk, particularly for variants that lie outside of protein-coding regions. For example, if a risk-associated variant is seen to disrupt a transcription factor binding site, this would support the hypothesis that the transcription factor and its target genes are relevant to T2D.

To make these data accessible to researchers, one Opportunity Pool award supports the creation of the Diabetes Epigenome Atlas, which collects and displays epigenomic datasets relevant to T2D. In the near future, these data will be fully integrated with genetic association data in the Type 2 Diabetes Knowledge Portal (see below).

Other Opportunity Pool projects are concerned with processes downstream of gene expression. Discovering interactions between proteins implicated in T2D risk, for example, could help to uncover all of the players in pathways important for the development of T2D, increasing the number of potential drug targets. Determining the effects of variants on the levels of key metabolites can illuminate the metabolic pathways that change during the development of T2D. 

In addition to generating all of these orthogonal data types, AMP T2D partners are developing algorithms and using machine learning to classify and prioritize variants on the basis of the functional annotations that accompany them. Finally, other Opportunity Pool projects will use model organisms to test and validate drug targets that are suggested by these analyses.

Tools and methods to speed analysis and interpretation

At the inception of AMP T2D it was also clear that the development of new methods and tools would need to accompany the generation of data, and support for these activities was built into the program. One major technical effort has addressed an obstacle to global data aggregation: because of institutional and national privacy regulations, some datasets may not leave their site of origin to be aggregated with other datasets at the AMP T2D DCC. A group at the European Bioinformatics Institute has built a technical replicate of the DCC and knowledgebase, such that data stored there are just as accessible for browsing, searching, and interactive analysis as the data stored at the AMP T2D DCC at the Broad Institute. This federation mechanism allows global data accessibility even when data aggregation is not permitted.

Other efforts supported by AMP T2D are aimed at improving the speed and efficiency with which data can be taken in and analyzed. In one project, a data intake system is being developed that will streamline the process for both data submitters and the DCC team, and will be applicable to data submission both at the Broad DCC and at other federated sites. Another project has created a software pipeline, LoamStream, that will largely automate quality control and association analysis of incoming data. Currently, LoamStream is in use for quality control of genotype data, and this has already greatly reduced the time required to process new datasets. Future work will extend the pipeline to association analysis and will also allow it to take in sequence data as well as genotype data.

A genetic association of a variant with T2D gains credibility if multiple independent studies replicate the association. Thus, it is important for researchers to be able to evaluate the weight of available evidence. But currently this is difficult to assess from the association datasets in the AMP T2D DCC, because many are based on overlapping sets of subjects. AMP T2D partners at the University of Michigan and University of Oxford are working on a method to take these overlaps into account and synthesize associations from multiple datasets into a “bottom-line” significance for association of a variant with T2D, which will aid in prioritizing variants for future work.

Multiple AMP T2D projects for analysis, interpretation, and custom interactive analysis of variant-phenotype associations are ongoing at the Universities of Michigan, Chicago, and Oxford, Vanderbilt University, and the Broad Institute. These projects are aimed at facilitating, in various ways, the path from variant associations to functional knowledge, and all have been or will be integrated into the T2D Knowledge Portal (see below).

Hail software offers a pipeline that speeds up the analysis of huge genomic datasets, while the gnomAD resource aggregates and harmonizes exome and genome sequences to provide a catalog of genetic diversity, in more than 100,000 humans, that aids in interpretation of variant associations with disease. A tool under development in the gnomAD project will display the effects of variants on protein structures as another way to deduce their potential impact.

Other analysis modules include gene-based association methods for using expression data to predict genes that may impact a phenotype (PrediXcan and MetaXcan), and a phenome-wide association study (PheWAS) method for visualization of the associations of a variant across multiple phenotypes, which is a crucial consideration during drug development. 

The interactive visualization tool LocusZoom will integrate many of these methods to display variant associations and credible sets, epigenomic and functional annotations, and phenotype associations across a genomic region as well as offering custom association analysis.

An example LocusZoom plot

AMP T2D Knowledge Portal: democratizing T2D genetic results for researchers world-wide

AMP T2D was founded on the idea that in order to truly accelerate progress, genomic information must be freely accessible to all scientists and presented in a way that is understandable by a broad range of researchers working on T2D biology, not only by human geneticists and bioinformaticians with special computational skills. So the roadmap for the project included not only data generation and analysis, but also the production of a publicly available web resource that would integrate data types, interpret the evidence, and present all of these results.

While it is under continuous development, mid-way through the initial funding period the T2D Knowledge Portal (T2DKP) is already a well-established resource. Other web resources collect genetic association data, but the T2DKP is unusual in providing harmonized datasets to which a consistent analysis pipeline has been applied. Rather than simply cataloging datasets, it offers distilled and synthesized results along with their interpretation, to guide more detailed exploration of the evidence. And, unlike any other extant resource, it offers researchers the ability to perform interactive queries on protected individual-level data. 

T2DKP home page

The Gene page of the T2DKP (see an example) illustrates the presentation of immediately understandable summary information along with the opportunity to drill down to the details. An algorithm considers the associations of all variants across a gene, for all phenotypes and in all datasets aggregated at the DCC, and calculates from them a “traffic light” signal for the gene: green to indicate that there is a significant association for at least one phenotype; yellow to indicate suggestive, if not highly significant, associations; and red to indicate that there is no evidence for association for any of the phenotypes considered in the T2DKP. Below this, tables and graphics invite users to explore all variants across the gene, their impacts on the encoded protein, and their associations, as well as their positions relative to epigenomic marks across the region in multiple tissues.
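The logic of such a traffic light can be sketched in a few lines. This is only an illustration of the idea, not the Portal's algorithm. The cutoffs used here (5e-8 for "significant" and 5e-4 for "suggestive") are assumptions chosen for the example, as are the function and parameter names.

```python
def gene_signal(p_values, significant=5e-8, suggestive=5e-4):
    """Summarize a gene's associations as "green", "yellow", or "red".

    p_values: p-values for all variants across the gene, pooled over
    phenotypes and datasets. The two thresholds are illustrative
    assumptions, not the values used by the T2DKP.
    """
    best = min(p_values, default=1.0)  # no associations at all -> red
    if best <= significant:
        return "green"   # at least one significant association
    if best <= suggestive:
        return "yellow"  # suggestive, but not highly significant
    return "red"         # no evidence for association
```

For example, gene_signal([0.2, 1e-9]) gives "green", while gene_signal([0.2, 0.5]) gives "red".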

The T2DKP currently offers the ability to run custom, interactive association analyses using two different tools. In the LocusZoom visualization, users may choose one or more variants as covariates before performing association analysis. The Genetic Association Interactive Tool (GAIT) for single variant associations, which also powers the custom burden test for gene-level associations, is even more versatile, presenting the distributions of different characteristics of the sample set (age, sex, BMI, glycemic measures, blood lipid levels, and many more) and allowing users to filter the set by multiple criteria and to choose custom covariates before performing association analysis. Both of these tools allow analytical access to the individual-level data, whether housed at the Broad DCC or at the EBI federated node, in a secure environment so that data privacy is always protected.

Evolution of a collaborative environment

AMP T2D organization

The AMP T2D partnership is a multifaceted project (illustrated above) that embraces several aspects of basic research and combines them with building a product, the T2DKP. In connecting scientists both within and outside of consortia, in academia and in industry, working on genetic associations or functional studies, it is becoming the nexus of the T2D genetics community. Researchers are finding the T2DKP helpful for accessing even their own results and for viewing them in the context of multiple phenotypic associations and other complementary data types. Pharmaceutical partners are finding help via the Target Prioritization project, in which the tools and methods developed within AMP T2D are being used to prioritize a list of genes of mutual interest for further investigation.

Perhaps most importantly, AMP T2D has made researchers—both within and outside of the project—aware of the value of sharing data for representation in the context of all other relevant data. Only by compiling and interpreting all available information will we be able to make the best hypotheses about genes and pathways that are possible drug targets and prioritize them for in-depth functional investigation.

AMP T2D and beyond

In the remainder of the initial AMP T2D funding period, we expect continued progress in each of the areas discussed above. The data intake and analysis pipelines will be improved, and new data will be incorporated at an increasing pace—including data from the UK Biobank, which has generated association results for 500,000 genotyped subjects and more than 2,500 traits. Associations will be added for many more phenotypes related to T2D, including diabetic complications and longitudinal phenotype data that connect the development of various traits to the timeline of incident T2D.  Much more T2D-relevant epigenomic data will be available for query as well as for browsing, via dynamic connection with the Diabetes Epigenome Atlas. And entirely new data types (for example, metabolomic and proteomic data) arising from Opportunity Pool projects will be added to the T2DKP.

Ongoing work on tools and methods will result in the addition of many more interactive modules to the T2DKP. Researchers will be able to view PheWAS data; prune lists of variants by their linkage disequilibrium relationships; calculate credible sets and genetic risk scores with custom parameters; perform more versatile interactive burden tests; prioritize genes by pre-calculated association scores; overlay the positions of coding variants on protein structures to help assess their impact; and perform enrichment analysis on sets of loci to suggest pathways implicated in disease processes.

The Knowledge Portal platform developed for AMP T2D has already proved extensible to other complex diseases: in 2017, both the Cerebrovascular Disease and Cardiovascular Disease Knowledge Portals were launched. In the future, connections within the ecosystem formed by the T2D, Cerebrovascular, and Cardiovascular Portals will be improved, so that researchers can easily assess the impact of a variant or involvement of a gene for all of these related diseases. If funding and collaboration considerations allow, perhaps one day these Portals will merge into a single cardiometabolic disease genetics Knowledge Portal to accelerate the development of new therapeutics in this broader area.

Finally, the ultimate goal of this funding period is that by its end, the data generation, analysis, and interpretation will have facilitated the validation of multiple promising drug targets for further investigation. Given the rate of progress on multiple fronts, this seems a realistic goal. We hope that this unique collaborative environment will continue to accelerate T2D genetic research and will become a paradigm for other research communities.

Monday, April 9, 2018

Those hoofbeats just might come from zebras

Image by Eric Dietrich via Wikimedia Commons
A physician in the 1940s wanted to convey to his students that the most obvious diagnosis is most likely to be the correct one, so he coined a saying that has become famous: “When you hear hoofbeats, think of horses not zebras.” Applying this concept to complex disease genetics, if a risk-associated variant causes a non-synonymous mutation in a coding sequence, the first hypothesis to consider is that it affects disease risk by altering the protein. But although this is often the case, one of the lessons we can learn from a large new study, published today and now available for browsing and searching in the T2D Knowledge Portal, is that we should not forget about zebras.

The new study, from a global coalition of scientists (Mahajan et al., Nature Genetics 2018), is an exome-wide association study that surveyed the T2D associations of variants within the protein-coding regions of the genome. Including more than 81,000 T2D cases, over 370,000 controls, and multiple ancestries, this study has a three-fold larger effective sample size than any previous study. Using p-value < 2.2 x 10^-7 as a threshold for significance across the exome, the authors found 69 significantly associated coding variants representing 40 distinct association signals in 38 loci, 16 of which had not been previously associated with T2D risk.

To get a better idea of which variants in these loci were causal for T2D risk, the researchers performed fine mapping for 37 of the 40 significant signals. They meta-analyzed T2D associations for over 500,000 individuals of European descent, performed imputation, and then generated 99% credible sets for each signal—that is, sets of variants that are 99% likely to include the causal variant. To calculate the credible sets, they used an “annotation-informed prior” model of causality that took into account the distribution of associations for different variant impact classes and also the overlap of variants with putative enhancer elements.
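As a rough illustration of how a credible set is assembled, the sketch below uses a generic Bayesian fine-mapping recipe, not the authors' exact model. Each variant's per-variant evidence (a Bayes factor) is multiplied by a prior weight, normalized into a posterior probability, and variants are then added in descending order until the cumulative probability reaches 99%. It assumes a single causal variant per signal. The variant names, Bayes factors, and prior weights are hypothetical.

```python
def posterior_probabilities(bayes_factors, priors=None):
    """Normalize prior x Bayes factor into per-variant posterior probabilities.

    priors: optional relative prior weights per variant, e.g. higher for
    certain annotation classes (an assumption standing in for an
    annotation-informed prior). Defaults to a flat prior.
    """
    if priors is None:
        priors = {variant: 1.0 for variant in bayes_factors}
    weights = {v: bayes_factors[v] * priors[v] for v in bayes_factors}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}

def credible_set(posteriors, coverage=0.99):
    """Smallest set of top-ranked variants whose posteriors sum to coverage."""
    chosen, cumulative = [], 0.0
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    for variant, probability in ranked:
        chosen.append(variant)
        cumulative += probability
        if cumulative >= coverage:
            break
    return chosen

# Hypothetical evidence for four variants at one signal.
bayes_factors = {"v1": 60.0, "v2": 30.0, "v3": 9.5, "v4": 0.5}
flat_set = credible_set(posterior_probabilities(bayes_factors))

# Up-weighting v4 a priori changes both the ranking and the set.
informed = posterior_probabilities(
    bayes_factors, priors={"v1": 1.0, "v2": 1.0, "v3": 1.0, "v4": 100.0}
)
informed_set = credible_set(informed)
```

With a flat prior the set stops after three variants. With the informed prior, v4 moves up to second place and all four variants are needed to reach 99%. This shows how the choice of prior can change which variants look most likely to be causal.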

The 37 association signals for which the authors generated credible sets were all due to coding variants that would cause changes in the sequence of the encoded protein. But surprisingly, the fine mapping analysis found that coding variants were likely to be causal for T2D risk at fewer than half of these loci.

One of these surprising results involves a gene that is well-known to be relevant to T2D: PPARG. Involvement of the PPARG protein in T2D is beyond doubt, since this ligand-inducible transcription factor is the target of thiazolidinedione drugs that are used to treat T2D. A common variant in PPARG, rs1801282, that causes a p.Pro12Ala change in the protein has been assumed to account for the T2D association, but there is little experimental evidence that this change affects PPARG function.

In the credible set generated in this study, the probability that rs1801282 is causal was not found to be particularly high. Included in this credible set along with rs1801282 are 19 non-coding variants. One of these was previously shown to affect a binding site for the transcription factor PRRX1 and to affect expression of PPARG2, a PPARG isoform. This suggests the intriguing possibility that the T2D risk in this locus is caused, partly or wholly, by variants affecting regulation rather than protein sequence.

A similar pattern, with partial causality due to non-coding variants, was seen at an additional 7 loci. And in 13 other loci, even though these loci were discovered via coding variant signals, non-coding variants had the highest probability of causing risk.

According to Professor Mark McCarthy of the University of Oxford, one of the principal investigators of the study, “Our study shows that we should not jump to conclusions when we see that one of our association signals includes a variant around which we can base an attractive mechanistic narrative. The “average” coding variant is more likely to be causal than the “average” noncoding variant, but even at the set of loci where we detect a significant coding variant association, it is as likely as not that the signal is driven instead by one of the non-coding variants nearby. By bringing together genetic and genomic data, we can improve our prospects for finding the causal variants at GWAS loci, but these should be the starting points for empirical studies not a destination in themselves.” Dr. McCarthy has written a commentary on this study; read it here.

So, in investigating complex disease genetics, it is still a good bet that a coding variant affects disease risk via altered protein sequence: at least in some parts of the world, hoofbeats are very often due to horses. But this study reminds us that it is always a good idea to look beyond the obvious hypothesis, and remember the zebras.

This paper includes many other discoveries, and we recommend that you read the paper to get the full story. We are pleased to announce that in addition to publishing the paper, the authors have made their results available to the T2D research community immediately upon publication, in the T2D Knowledge Portal.

The dataset in the T2DKP is named ExTexT2D (ExTended exome array genotyping for T2D) and includes associations for T2D, both unadjusted and adjusted for BMI. A description of the dataset along with a table listing the cohorts of the study subjects can be found on the Data page, and you can browse and search the ExTexT2D exome chip analysis dataset at these locations in the T2DKP:

  • On Gene pages (see an example), on the Common variants and High-impact variants tabs
  • On Variant pages (see an example), in the Associations at a glance section and the Association statistics across traits table
  • Via the Variant Finder search
  • In a Manhattan plot of associations across the genome, which you can view by selecting “type 2 diabetes” or “type 2 diabetes adj BMI” in the View full genetic association results for a phenotype menu on the home page

This dataset offers by far the largest sample size for exploring associations of low-frequency and common coding variants with T2D. The size of the study made it possible to evaluate, through the credible set analysis, which coding variants mediate GWAS signals and which are simply “proxies” for the true causal variants. With the addition of this dataset, the T2DKP offers in-depth information on two aspects of exome associations: common and low-frequency variant associations in ExTexT2D, and comprehensive coding variant associations in the 19K exome sequence analysis dataset (soon to include 50,000 exomes).

We are pleased to provide access to these important new results. Please contact us with any questions or comments about these new data or the T2DKP in general!