Tuesday, April 17, 2018

Developing a model for collaborative science: a mid-term perspective on the AMP T2D Partnership

In 2011, Dr. Francis Collins, Director of the National Institutes of Health (NIH), met with leaders in biomedical research to discuss a frustrating problem. Continual improvements in molecular biological and genomic techniques were generating an avalanche of data relevant to complex diseases, yet the translation of these data into insights about disease mechanisms and drug targets was unacceptably slow. It was clear that an entirely new paradigm for collaborative research would be needed to speed up the extraction of knowledge from data.

The result of these discussions was the creation of the Accelerating Medicines Partnership (AMP), one branch of which focuses on type 2 diabetes (T2D)—a life-threatening disease that affects hundreds of millions of people worldwide, whose incidence is growing, and whose progression cannot yet be effectively stopped or reversed. AMP T2D, a five-year project, includes the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK); the pharmaceutical companies Janssen Pharmaceuticals, Eli Lilly and Company, Merck, Pfizer, and Sanofi; the University of Michigan; the University of Oxford; the Broad Institute; and other researchers around the globe. The Foundation for the National Institutes of Health (FNIH) also provides funding and coordination for the project.

Drawing on the strengths of both academia and industry, this public-private partnership brings together all stakeholders in a pre-competitive space to share data and combine resources, with the goal of validating new drug targets faster. Now in Spring 2018, roughly mid-way through the funding period, it is evident that this collaboration has resulted in remarkable progress on both scientific and collaborative fronts.

Genetic association data: the foundation of AMP T2D

Genetic association studies interrogate the genomes of individuals at millions of specific genomic positions to discover sequence variants that are correlated with the incidence of disease. From the outset, AMP T2D aimed to support the generation of unprecedented amounts of new genome-wide association study (GWAS), exome sequencing, and whole-genome sequencing data within the project as well as their aggregation with all relevant publicly available data. Originally, 5 sites were funded by the NIDDK to generate new data and deposit them into the AMP T2D Data Coordinating Center (DCC) at the Broad Institute. As the project evolved, another site was funded by the NIDDK and 8 more sites were funded by the FNIH. Additionally, an Opportunity Pool of funds from the NIDDK was created, allowing the AMP T2D Steering Committee to award smaller grants for complementary research projects in a flexible, science-driven manner.  Currently 10 Opportunity Pool projects are in progress, and more awards will be given in the future.

Not only has the number of genetic association studies increased since the inception of AMP T2D, but also the number of samples surveyed in each has grown dramatically, from typically under 100,000 to approaching 1 million today. The increased statistical power conferred by these large sample sizes has led to a huge increase in the number of loci found to be significantly associated with T2D, from about 70 at the start of the project to nearly 430.

Improvements in genomic technologies in the past few years have allowed AMP T2D collaborators to generate increasing amounts of sequencing data, which make it possible to comprehensively interrogate all alleles and to uncover rare variation. At the project’s start, T2D associations with exome sequences (covering the protein-coding regions of the genome) were available for about 13,000 samples, and no whole-genome sequencing studies had been published. Now, more than 2,600 whole genomes are available, and analysis of a set of 50,000 exomes—the largest disease-specific aggregation of exome sequencing data to date—is nearly complete. Importantly, many of the associations that have been newly discovered in sequencing studies involve relatively rare variants that affect protein-coding regions. It is often more straightforward to develop hypotheses about the impact of such variants than it is for variants outside of coding regions.

As the AMP T2D partnership has grown in prominence in the diabetes field, the DCC has been approached by investigators outside the project who want to contribute their data in order to aggregate and display them in the context of AMP T2D data. In early 2017, researchers in the 70kforT2D project, which found novel T2D associations by re-analyzing existing GWAS data, offered their results for integration into the DCC and display in the Type 2 Diabetes Knowledge Portal (T2DKP; see below) before publication.

The 70kforT2D GWAS was the first pre-publication dataset to be added to the T2DKP from outside the AMP T2D partnership, and it was particularly appropriate that these scientists, whose results illustrate the value of data sharing, themselves chose to freely share their results. Incorporation of datasets into the AMP T2D DCC and T2DKP offers investigators the chance to take advantage of the expertise of the AMP DCC analysis team, apply cutting-edge analysis tools to their data, and display their results broadly to the T2D research community in the context of multiple datasets. The AMP T2D DCC is open to incorporating T2D-relevant datasets from all investigators (find details on contributing data here).

In addition to the datasets generated by AMP T2D partners and other T2D researchers, which focus on associations with T2D, glycemic measures, and T2D complications, the AMP T2D DCC also collects publicly available genetic association datasets for traits relevant to T2D, such as anthropometric measures, blood pressure and lipid levels, and heart and kidney disease.

Orthogonal data types to help identify and prioritize causal variants and genes

Finding genetic variants that are associated with T2D risk is critically important to understanding the genetics of T2D, but it is only a first step. The most significantly associated variant in a genomic region may not be the causal variant that is responsible for altered T2D risk. Researchers perform fine mapping to analyze genetic associations in specific regions of the genome and generate credible sets—that is, sets of variants that are predicted to include the causal variant. Mid-way through the AMP T2D funding period, emphasis among the data-generating partners is beginning to shift from simply generating association data to performing fine mapping and credible set analysis.
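
As a minimal sketch of how a credible set can be assembled, assume that each variant in a region already has a posterior probability of being the causal variant. The variant names and probabilities below are invented for illustration. Variants are ranked by probability and added until their cumulative probability reaches the chosen coverage.

```python
def credible_set(posteriors, coverage=0.99):
    """Return the top-ranked variants whose summed posterior probability reaches `coverage`."""
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for variant, prob in ranked:
        chosen.append(variant)
        total += prob
        if total >= coverage:
            break
    return chosen


# Hypothetical posterior probabilities for five variants in one region.
example = {"var_A": 0.62, "var_B": 0.25, "var_C": 0.09, "var_D": 0.035, "var_E": 0.005}
print(credible_set(example, coverage=0.95))
print(credible_set(example))
```

A smaller credible set means the data narrow the search to fewer candidate variants for follow-up experiments.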

But even after predicting which sequence variations are responsible for altered risk, finding clues about how they affect risk requires integration with additional data types. Information about the functional importance of the genomic region where a variant is located—its relevance to gene expression, protein function, networks and pathways, metabolite levels, and more, all determined on a tissue-specific basis—can help prioritize genes and pathways for in-depth experimental investigation. These kinds of research were built into AMP T2D from the beginning, and as the importance of these data types became even clearer, several Opportunity Pool awards were given to projects focusing on complementary data types that shed light on the significance of genetic associations.

Several of these projects focus on generating tissue-specific epigenomic data: histone modifications, DNA methylation, chromatin conformation, transcription factor binding, 3-dimensional chromosome structure, and other data types. Epigenomic data can provide important clues about the mechanisms by which sequence variation affects T2D risk, particularly for variants that lie outside of protein-coding regions. For example, if a risk-associated variant is seen to disrupt a transcription factor binding site, this would support the hypothesis that the transcription factor and its target genes are relevant to T2D.

To make these data accessible to researchers, one Opportunity Pool award supports the creation of the Diabetes Epigenome Atlas, which collects and displays epigenomic datasets relevant to T2D. In the near future, these data will be fully integrated with genetic association data in the Type 2 Diabetes Knowledge Portal (see below).

Other Opportunity Pool projects are concerned with processes downstream of gene expression. Discovering interactions between proteins implicated in T2D risk, for example, could help to uncover all of the players in pathways important for the development of T2D, increasing the number of potential drug targets. Determining the effects of variants on the levels of key metabolites can illuminate the metabolic pathways that change during the development of T2D. 

In addition to generating all of these orthogonal data types, AMP T2D partners are developing algorithms and using machine learning to classify and prioritize variants on the basis of the functional annotations that accompany them. Finally, other Opportunity Pool projects will use model organisms to test and validate drug targets that are suggested by these analyses.

Tools and methods to speed analysis and interpretation

At the inception of AMP T2D it was also clear that the development of new methods and tools would need to accompany the generation of data, and support for these activities was built into the program. One major technical effort has addressed an obstacle to global data aggregation: because of institutional and national privacy regulations, some datasets may not leave their site of origin to be aggregated with other datasets at the AMP T2D DCC. A group at the European Bioinformatics Institute has built a technical replicate of the DCC and knowledgebase, such that data stored there are just as accessible for browsing, searching, and interactive analysis as the data stored at the AMP T2D DCC at the Broad Institute. This federation mechanism allows global data accessibility even when data aggregation is not permitted.

Other efforts supported by AMP T2D are aimed at improving the speed and efficiency with which data can be taken in and analyzed. In one project, a data intake system is being developed that will streamline the process for both data submitters and the DCC team, and will be applicable to data submission both at the Broad DCC and at other federated sites. Another project has created a software pipeline, LoamStream, that will largely automate quality control and association analysis of incoming data. Currently, LoamStream is in use for quality control of genotype data, and this has already greatly reduced the time required to process new datasets. Future work will extend the pipeline to association analysis and will also allow it to take in sequence data as well as genotype data.

A genetic association of a variant with T2D gains credibility if multiple independent studies replicate the association. Thus, it is important for researchers to be able to evaluate the weight of available evidence. But currently this is difficult to assess from the association datasets in the AMP T2D DCC, because many are based on overlapping sets of subjects. AMP T2D partners at the University of Michigan and University of Oxford are working on a method to take these overlaps into account and synthesize associations from multiple datasets into a “bottom-line” significance for association of a variant with T2D, which will aid in prioritizing variants for future work.
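
To see why overlapping subjects matter, it helps to look at the standard way of combining studies. The sketch below is a textbook inverse-variance (fixed-effect) meta-analysis, not the overlap-aware method the partners are developing. It assumes that every study contributes independent subjects, which is exactly the assumption that shared samples violate. The effect sizes and standard errors are invented.

```python
import math


def fixed_effect_meta(estimates):
    """Combine (beta, standard_error) pairs using inverse-variance weights.

    Assumes the studies share no subjects; overlapping samples would make
    the combined standard error too small and the p-value too optimistic.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under a normal approximation
    return beta, se, p


# Hypothetical per-study effect estimates for one variant.
studies = [(0.10, 0.04), (0.08, 0.05), (0.12, 0.03)]
beta, se, p = fixed_effect_meta(studies)
print(beta, se, p)
```

If two of these studies had been drawn from the same subjects, the same evidence would be counted twice, which is the problem a "bottom-line" method has to correct for.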

Multiple AMP T2D projects for analysis, interpretation, and custom interactive analysis of variant-phenotype associations are ongoing at the Universities of Michigan, Chicago, and Oxford; Vanderbilt University; and the Broad Institute. These projects are aimed at facilitating, in various ways, the path from variant associations to functional knowledge, and all have been or will be integrated into the T2D Knowledge Portal (see below).

Hail software offers a pipeline that speeds up the analysis of huge genomic datasets, while the gnomAD resource aggregates and harmonizes exome and genome sequences to provide a catalog of genetic diversity, in more than 100,000 humans, that aids in interpretation of variant associations with disease. A tool under development in the gnomAD project will display the effects of variants on protein structures as another way to deduce their potential impact.

Other analysis modules include gene-based association methods for using expression data to predict genes that may impact a phenotype (PrediXcan and MetaXcan), and a phenome-wide association study (PheWAS) method for visualization of the associations of a variant across multiple phenotypes, which is a crucial consideration during drug development. 

The interactive visualization tool LocusZoom will integrate many of these methods to display variant associations and credible sets, epigenomic and functional annotations, and phenotype associations across a genomic region, and will also offer custom association analysis.

An example LocusZoom plot

AMP T2D Knowledge Portal: democratizing T2D genetic results for researchers world-wide

AMP T2D was founded on the idea that in order to truly accelerate progress, genomic information must be freely accessible to all scientists and presented in a way that is understandable by a broad range of researchers working on T2D biology, not only by human geneticists and bioinformaticians with special computational skills. So the roadmap for the project included not only data generation and analysis, but also the production of a publicly available web resource that would integrate data types, interpret the evidence, and present all of these results.

While it is under continuous development, mid-way through the initial funding period the T2D Knowledge Portal (T2DKP) is already a well-established resource. Other web resources collect genetic association data, but the T2DKP is unusual in providing harmonized datasets to which a consistent analysis pipeline has been applied. Rather than simply cataloging datasets, it offers distilled and synthesized results along with their interpretation, to guide more detailed exploration of the evidence. And, unlike any other extant resource, it offers researchers the ability to perform interactive queries on protected individual-level data. 

T2DKP home page

The Gene page of the T2DKP (see an example) illustrates the presentation of immediately understandable summary information along with the opportunity to drill down to the details. An algorithm considers the associations of all variants across a gene, for all phenotypes and in all datasets aggregated at the DCC, and calculates from them a “traffic light” signal for the gene: green to indicate that there is a significant association for at least one phenotype; yellow to indicate suggestive, if not highly significant, associations; and red to indicate that there is no evidence for association for any of the phenotypes considered in the T2DKP. Below this, tables and graphics invite users to explore all variants across the gene, their impacts on the encoded protein, and their associations, as well as their positions relative to epigenomic marks across the region in multiple tissues.
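
The idea behind the traffic light can be sketched in a few lines. This is an illustration only: it assumes the signal is driven by the single best p-value across a gene's variants, phenotypes, and datasets, and the two cutoffs are hypothetical placeholders, since this post does not state the thresholds or the exact rule the portal uses.

```python
# Hypothetical cutoffs; the portal's actual thresholds are not given in this post.
SIGNIFICANT_P = 5e-8
SUGGESTIVE_P = 5e-4


def traffic_light(p_values):
    """Map the smallest p-value seen for a gene to a traffic-light colour."""
    if not p_values:
        return "red"  # no associations at all counts as no evidence
    best = min(p_values)
    if best <= SIGNIFICANT_P:
        return "green"
    if best <= SUGGESTIVE_P:
        return "yellow"
    return "red"


# Made-up p-values pooled across variants, phenotypes, and datasets for one gene.
print(traffic_light([0.03, 2e-5, 0.4]))
```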

The T2DKP currently offers the ability to run custom, interactive association analyses using two different tools. In the LocusZoom visualization, users may choose one or more variants as covariates before performing association analysis. The Genetic Association Interactive Tool (GAIT) for single variant associations, which also powers the custom burden test for gene-level associations, is even more versatile, presenting the distributions of different characteristics of the sample set (age, sex, BMI, glycemic measures, blood lipid levels, and many more) and allowing users to filter the set by multiple criteria and to choose custom covariates before performing association analysis. Both of these tools allow analytical access to the individual-level data, whether housed at the Broad DCC or at the EBI federated node, in a secure environment so that data privacy is always protected.
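
The sample-filtering step that precedes a custom analysis in GAIT can be pictured as follows. This is a sketch under stated assumptions: the `Sample` fields, the `filter_samples` helper, and the values are hypothetical, and the real tool works on protected individual-level data that users never see directly.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    age: float
    bmi: float
    has_t2d: bool


def filter_samples(samples, min_age=None, max_bmi=None):
    """Keep the samples that satisfy every criterion that was supplied."""
    kept = []
    for s in samples:
        if min_age is not None and s.age < min_age:
            continue
        if max_bmi is not None and s.bmi > max_bmi:
            continue
        kept.append(s)
    return kept


# Made-up samples; a real analysis would run on thousands of subjects.
samples = [
    Sample("s1", 45, 24.0, False),
    Sample("s2", 62, 31.5, True),
    Sample("s3", 58, 27.0, True),
]
subset = filter_samples(samples, min_age=50, max_bmi=30)
print([s.sample_id for s in subset])
```

The association test would then be run on `subset` only, with the user's chosen covariates included in the model.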

Evolution of a collaborative environment

AMP T2D organization

The AMP T2D partnership is a multifaceted project (illustrated above) that embraces several aspects of basic research and combines them with building a product, the T2DKP. In connecting scientists both within and outside of consortia, in academia and in industry, working on genetic associations or functional studies, it is becoming the nexus of the T2D genetics community. Researchers are finding the T2DKP helpful for accessing even their own results and for viewing them in the context of multiple phenotypic associations and other complementary data types. Pharmaceutical partners are finding help via the Target Prioritization project, in which the tools and methods developed within AMP T2D are being used to prioritize a list of genes of mutual interest for further investigation.

Perhaps most importantly, AMP T2D has made researchers—both within and outside of the project—aware of the value of sharing data for representation in the context of all other relevant data. Only by compiling and interpreting all available information will we be able to make the best hypotheses about genes and pathways that are possible drug targets and prioritize them for in-depth functional investigation.

AMP T2D and beyond

In the remainder of the initial AMP T2D funding period, we expect continued progress in each of the areas discussed above. The data intake and analysis pipelines will be improved, and new data will be incorporated at an increasing pace—including data from the UK Biobank, which has generated association results for 500,000 genotyped subjects and more than 2,500 traits. Associations will be added for many more phenotypes related to T2D, including diabetic complications and longitudinal phenotype data that connect the development of various traits to the timeline of incident T2D.  Much more T2D-relevant epigenomic data will be available for query as well as for browsing, via dynamic connection with the Diabetes Epigenome Atlas. And entirely new data types (for example, metabolomic and proteomic data) arising from Opportunity Pool projects will be added to the T2DKP.

Ongoing work on tools and methods will result in the addition of many more interactive modules to the T2DKP. Researchers will be able to view PheWAS data; prune lists of variants by their linkage disequilibrium relationships; calculate credible sets and genetic risk scores with custom parameters; perform more versatile interactive burden tests; prioritize genes by pre-calculated association scores; overlay the positions of coding variants on protein structures to help assess their impact; and perform enrichment analysis on sets of loci to suggest pathways implicated in disease processes.

The Knowledge Portal platform developed for AMP T2D has already proved extensible to other complex diseases: in 2017, both the Cerebrovascular Disease and Cardiovascular Disease Knowledge Portals were launched. In the future, connections within the ecosystem formed by the T2D, Cerebrovascular, and Cardiovascular Portals will be improved, so that researchers can easily assess the impact of a variant or involvement of a gene for all of these related diseases. If funding and collaboration considerations allow, perhaps one day these Portals will merge into a single cardiometabolic disease genetics Knowledge Portal to accelerate the development of new therapeutics in this broader area.

Finally, the ultimate goal of this funding period is that by its end, the data generation, analysis, and interpretation will have facilitated the validation of multiple promising drug targets for further investigation. Given the rate of progress on multiple fronts, this seems a realistic goal. We hope that this unique collaborative environment will continue to accelerate T2D genetic research and will become a paradigm for other research communities.

Monday, April 9, 2018

Those hoofbeats just might come from zebras

Image by Eric Dietrich via Wikimedia Commons
A physician in the 1940s wanted to convey to his students that the most obvious diagnosis is most likely to be the correct one, so he coined a saying that has become famous: “When you hear hoofbeats, think of horses not zebras.” Applying this concept to complex disease genetics, if a risk-associated variant causes a non-synonymous mutation in a coding sequence, the first hypothesis to consider is that it affects disease risk by altering the protein. But although this is often the case, one of the lessons we can learn from a large new study, published today and now available for browsing and searching in the T2D Knowledge Portal, is that we should not forget about zebras.

The new study, from a global coalition of scientists (Mahajan et al., Nature Genetics 2018), is an exome-wide association study that surveyed the T2D associations of variants within the protein-coding regions of the genome. Including more than 81,000 T2D cases, over 370,000 controls, and multiple ancestries, this study has a three-fold larger effective sample size than any previous study. Using p-value < 2.2 x 10^-7 as a threshold for significance across the exome, the authors found 69 significantly associated coding variants representing 40 distinct association signals in 38 loci, 16 of which had not been previously associated with T2D risk.
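
For a rough sense of what "effective sample size" means, a common approximation for a case-control design is 4 / (1/N_cases + 1/N_controls). The sketch below applies it to the rounded totals quoted above. This is an illustration only; the study may have calculated its own figure differently, for example cohort by cohort.

```python
def effective_sample_size(n_cases, n_controls):
    """Approximate effective sample size of a case-control study."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


# Rounded totals from the text; the exact counts in the paper differ slightly.
n_eff = effective_sample_size(81_000, 370_000)
print(round(n_eff))
```

Because controls greatly outnumber cases, the effective size is well below the total number of subjects: statistical power is limited mainly by the number of cases.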

To get a better idea of which variants in these loci were causal for T2D risk, the researchers performed fine mapping for 37 of the 40 significant signals. They meta-analyzed T2D associations for over 500,000 individuals of European descent, performed imputation, and then generated 99% credible sets for each signal—that is, sets of variants that are 99% likely to include the causal variant. To calculate the credible sets, they used an “annotation-informed prior” model of causality that took into account the distribution of associations for different variant impact classes and also the overlap of variants with putative enhancer elements.
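
The effect of an annotation-informed prior can be illustrated with a toy calculation. Assuming a single causal variant per signal, each variant's posterior probability is proportional to its prior weight multiplied by a Bayes factor that summarizes the association evidence. The Bayes factors and prior weights below are invented, and the study's actual model is more elaborate than this sketch.

```python
def posterior_probabilities(bayes_factors, prior_weights):
    """Posterior for each variant is proportional to prior weight x Bayes factor."""
    unnormalised = {v: bayes_factors[v] * prior_weights[v] for v in bayes_factors}
    total = sum(unnormalised.values())
    return {v: value / total for v, value in unnormalised.items()}


# Hypothetical evidence: the two variants have similar association support.
bayes_factors = {"coding_variant": 100.0, "noncoding_variant": 120.0}
flat_prior = {"coding_variant": 1.0, "noncoding_variant": 1.0}
# Hypothetical annotation-informed prior that up-weights coding variants.
informed_prior = {"coding_variant": 5.0, "noncoding_variant": 1.0}

flat = posterior_probabilities(bayes_factors, flat_prior)
informed = posterior_probabilities(bayes_factors, informed_prior)
print(flat)
print(informed)
```

With a flat prior the non-coding variant ranks first; with the informed prior the coding variant does. A prior of this kind favors coding variants, which makes it notable when non-coding variants still come out ahead.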

The 37 association signals for which the authors generated credible sets were all due to coding variants that would cause changes in the sequence of the encoded protein. But surprisingly, the fine mapping analysis found that coding variants were likely to be causal for T2D risk at fewer than half of these loci.

One of these surprising results involves a gene that is well-known to be relevant to T2D: PPARG. Involvement of the PPARG protein in T2D is beyond doubt, since this ligand-inducible transcription factor is the target of thiazolidinedione drugs that are used to treat T2D. A common variant in PPARG, rs1801282, that causes a p.Pro12Ala change in the protein has been assumed to account for the T2D association, but there is little experimental evidence that this change affects PPARG function.

In the credible set generated in this study, the probability that rs1801282 is causal was not found to be particularly high. Included in this credible set along with rs1801282 are 19 non-coding variants. One of these was previously shown to affect a binding site for the transcription factor PRRX1 and to affect expression of PPARG2, a PPARG isoform. This suggests the intriguing possibility that the T2D risk in this locus is caused, partly or wholly, by variants affecting regulation rather than protein sequence.

A similar pattern, with partial causality due to non-coding variants, was seen at an additional 7 loci. And in 13 other loci, even though these loci were discovered via coding variant signals, non-coding variants had the highest probability of causing risk.

According to Professor Mark McCarthy of the University of Oxford, one of the principal investigators of the study, “Our study shows that we should not jump to conclusions when we see that one of our association signals includes a variant around which we can base an attractive mechanistic narrative. The “average” coding variant is more likely to be causal than the “average” noncoding variant, but even at the set of loci where we detect a significant coding variant association, it is as likely as not that the signal is driven instead by one of the non-coding variants nearby. By bringing together genetic and genomic data, we can improve our prospects for finding the causal variants at GWAS loci, but these should be the starting points for empirical studies not a destination in themselves.” Dr. McCarthy has written a commentary on this study; read it here.

So, in investigating complex disease genetics, it is still a good bet that a coding variant affects disease risk via altered protein sequence: at least in some parts of the world, hoofbeats are very often due to horses. But this study reminds us that it is always a good idea to look beyond the obvious hypothesis, and remember the zebras.

This paper includes many other discoveries, and we recommend that you read the paper to get the full story. We are pleased to announce that in addition to publishing the paper, the authors have made their results available to the T2D research community immediately upon publication, in the T2D Knowledge Portal.

The dataset in the T2DKP is named ExTexT2D (ExTended exome array genotyping for T2D) and includes associations for T2D, both unadjusted and adjusted for BMI. A description of the dataset along with a table listing the cohorts of the study subjects can be found on the Data page, and you can browse and search the ExTexT2D exome chip analysis dataset at these locations in the T2DKP:

On Gene pages (see an example) on the Common variants and High-impact variants tabs
On Variant pages (see an example) in the Associations at a glance section and the Association statistics across traits table
Via the Variant Finder search
On the home page: select “type 2 diabetes” or “type 2 diabetes adj BMI” in the View full genetic association results for a phenotype menu to view a Manhattan plot of associations across the genome.

This dataset offers by far the largest sample size for exploring associations of low-frequency and common coding variants with T2D. The size of the study enabled evaluation of which coding variants mediate GWAS signals and which are simply "proxies" to the true causal variant, as revealed in the credible set analysis. With the addition of this dataset, the T2DKP offers in-depth information on two aspects of exome associations: common and low-frequency variant associations in ExTexT2D, and comprehensive coding variant associations in the 19K exome sequence analysis dataset (soon to include 50,000 exomes).

We are pleased to provide access to these important new results. Please contact us with any questions or comments about these new data or the T2DKP in general!

Tuesday, March 6, 2018

T2DKP Winter Newsletter

The latest issue of our quarterly newsletter is now available. Download it here to find out what we've been up to!

Thursday, March 1, 2018

New release today, as the KPN moves to a regular release schedule

At the Knowledge Portal Network (consisting of the Type 2 Diabetes, Cardiovascular Disease, and Cerebrovascular Disease Knowledge Portals), we are establishing a regular bimonthly release schedule. Every other month, new data and features will be incorporated into the Portals. Today, we are pleased to announce the first of these releases.

New data in the Type 2 Diabetes Knowledge Portal

This release adds two new datasets to the T2DKP. The Diabetic Cohort - Singapore Prospective Study Program is a T2D case-control study to identify genetic and environmental risk factors for diabetes in Singapore Chinese. The DC-SP2 GWAS set, a meta-analysis of summary-level T2D associations from 3,951 individuals, was contributed by Drs. Rob Martinus Van Dam, E Shyong Tai, and Xueling Sim from the National University of Singapore. They have also submitted individual-level data from this study to the Accelerating Medicines Partnership Data Coordinating Center (AMP DCC), and these data will be incorporated into the T2DKP after quality control and analysis are complete.

In addition to this set, we have incorporated the publicly available summary statistics from the DIAGRAM 1000G GWAS. This dataset, from the DIAGRAM (DIAbetes Genetics Replication And Meta-analysis) consortium, is a meta-analysis of 26,676 T2D cases and 132,532 control participants from 18 GWAS (Scott RA, et al. An Expanded Genome-Wide Association Study of Type 2 Diabetes in Europeans. (2017) Diabetes 66:2888). Samples were imputed using the all ancestries 1000 Genomes Project reference panel.

More details about both of these datasets are available on our Data page.

New features specific to the Type 2 Diabetes Knowledge Portal

We have expanded the range of data available for interactive analysis by adding individual-level data from the CAMP GWAS, BioMe AMP T2D GWAS, and METSIM GWAS datasets to the dynamic analysis modules LocusZoom and GAIT (Genetic Association Interactive Tool). LocusZoom, powered by the Hail software developed at the Broad Institute as part of the AMP T2D project, allows you to perform custom association analysis while conditioning on specific variants or sets of variants.

GAIT offers alternative options for custom association analysis, such as filtering samples by their phenotypic characteristics (e.g., age, BMI, cholesterol levels) and choosing specific covariates. To date, seven different datasets comprising over 67,000 samples are available for dynamic analysis in GAIT. These include datasets housed both at the AMP DCC (19k exome sequence analysis; CAMP GWAS; BioMe AMP T2D GWAS; METSIM GWAS) and at the EBI Federated node (EXTEND GWAS; Oxford Biobank exome chip analysis; GoDARTS Affymetrix GWAS).

We have also taken an initial step towards integration of the T2DKP with a new federated node, the T2DREAM database of epigenomic data relevant to T2D. In the near future, epigenomic data displayed in the T2DKP will be drawn dynamically from T2DREAM. In the meantime, we have added gene- and variant-specific links to T2DREAM from the re-styled External Resources section at the bottom of Gene and Variant pages.

New features for all Knowledge Portals

Some of the improvements in this release are visible in all the Portals of the Knowledge Portal Network. One of the most significant affects LocusZoom, the dynamic plot that displays variant associations along with their genomic coordinates, linkage disequilibrium, and other information. Previously, the only way to select a phenotype was to scroll through a long list. Now, a new phenotype filter lets you enter one or more search criteria and filter the list by those criteria. Once you have selected a phenotype, the datasets that include associations for that phenotype are presented for selection. Previously, only one dataset (the one with the largest sample size) was available for each phenotype; now, associations from all relevant datasets may be viewed in LocusZoom.
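
The behavior of a phenotype filter like this one can be sketched as a simple text match. The sketch assumes that a phenotype must contain every search term and that matching ignores case; the post does not say how the portal combines multiple criteria, and the function name is hypothetical.

```python
def filter_phenotypes(phenotypes, search_terms):
    """Return the phenotypes whose names contain every search term, ignoring case."""
    terms = [t.strip().lower() for t in search_terms if t.strip()]
    return [p for p in phenotypes if all(t in p.lower() for t in terms)]


phenotypes = [
    "type 2 diabetes",
    "type 2 diabetes adj BMI",
    "fasting glucose",
    "LDL cholesterol",
]
print(filter_phenotypes(phenotypes, ["diabetes", "bmi"]))
```

An empty list of terms leaves the full list unchanged, which matches the previous scroll-through behavior.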

Portion of the updated LocusZoom interface, showing phenotype filtering capability.

The sample filtering panel of the user interface for the custom burden test and GAIT (Genetic Association Interactive Tool) has also been improved to make it more intuitive to use. The External Resources sections of Gene and Variant pages have been re-styled, and gene- and variant-specific links to PheWeb have been added. PheWeb displays phenotypes most significantly associated with the gene or variant, based on a GWAS for over 2,400 phenotypes in UK Biobank data that was performed by Ben Neale's group. Finally, the home pages of all the Portals have been redesigned to make the appearance of the disease-specific portals more distinct.

Please browse these new data and features, and let us know what you think!

Tuesday, February 6, 2018

Federation brings three new datasets to the T2DKP

Our mission at the Type 2 Diabetes Knowledge Portal (T2DKP) is to aggregate and analyze genetic association data relevant to T2D, and to make the knowledge that can be gleaned from these data available to researchers around the world. But it isn't possible to aggregate all of the relevant data in one place: privacy regulations at the institutional, regional, and national levels determine how these data are handled, and whether or where they can be transferred.

The T2DKP is supported by the Accelerating Medicines Partnership in Type 2 Diabetes (AMP T2D), a pre-competitive partnership among the National Institutes of Health, industry, and not-for-profit organizations, managed by the Foundation for the National Institutes of Health. Because AMP T2D seeks to facilitate discovery of new targets for T2D treatment by making as much data as possible available via the T2DKP, it funded the development of a mechanism for establishing interconnected federated nodes of the T2DKP that would enable researchers to interact with all of the data regardless of where they are located.

This goal was realized with the creation, by a team led by Thomas Keane and Dylan Spalding, of a federated node of the T2DKP at the European Bioinformatics Institute (EBI). Data housed at the EBI node are stored in such a way that their specific privacy requirements are met, but they are made available for remote queries via T2DKP tools and interfaces. Results from such queries are served up alongside results from all of the datasets housed in the AMP T2D Data Coordinating Center (DCC) at the Broad Institute. Researchers may browse and query data from any location without even needing to know where they reside. This federation mechanism represents both an important technical advance in handling and protecting data, and a significant step forward in democratizing and improving access to genetic association results.
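As a rough illustration of the federation idea described above, the Python sketch below sends one query to several nodes and merges the summary-level rows that come back. It is only a toy model: the names make_node and federated_query, the variant ID, the dataset labels, and the p-values are all hypothetical, and nothing here reflects the actual T2DKP or EBI interfaces.

```python
from typing import Callable, Dict, List

# In this sketch a "node" is just a function that takes a variant ID and
# returns summary-level association rows. Individual-level data stay inside
# the node and are never returned.
Node = Callable[[str], List[dict]]


def make_node(name: str, summary_stats: Dict[str, List[dict]]) -> Node:
    """Build a hypothetical node that answers queries from its local store."""
    def query(variant_id: str) -> List[dict]:
        # Tag each row with the node name so the origin can be inspected;
        # a user-facing interface could simply hide this field.
        return [dict(row, node=name) for row in summary_stats.get(variant_id, [])]
    return query


def federated_query(variant_id: str, nodes: List[Node]) -> List[dict]:
    """Send the same query to every node and merge rows, sorted by p-value."""
    rows = [row for node in nodes for row in node(variant_id)]
    return sorted(rows, key=lambda r: r["p_value"])


# Made-up records, for illustration only.
dcc = make_node("DCC", {"rs0000001": [{"dataset": "dataset_A", "p_value": 1e-4}]})
ebi = make_node("EBI", {"rs0000001": [{"dataset": "dataset_B", "p_value": 3e-6}]})
results = federated_query("rs0000001", [dcc, ebi])
```

The caller works with a single merged list and does not need to know which node supplied which row, which is the property the post highlights.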

The first dataset to be incorporated into the Portal via the EBI federated node was the Oxford BioBank exome chip analysis dataset, which contains association data for glycemic, lipid, and blood pressure traits from over 7,100 subjects in Oxfordshire, U.K. The EBI Federated Node has now added three more datasets:

  • The EXTEND GWAS dataset, generated by Drs. Timothy Frayling and Andrew Wood and their colleagues, comprises 7,159 samples (1,395 T2D cases and 5,764 controls) from the Exeter EXTEND Biobank. It includes associations for a wealth of glycemic, anthropometric, cardiovascular, renal, and hepatic phenotypes, including many that are new to the T2DKP.
  • The GoDARTS Affymetrix GWAS dataset, from Dr. Colin Palmer and colleagues, includes summary-level statistics for associations with BMI and blood lipid levels from 3,307 diabetic participants in the Genetics of Diabetes Audit and Research Study in Tayside, Scotland. In addition, individual-level data from over 17,000 subjects (including the set from which summary statistics were calculated) are available via the GAIT tool (see below).
  • The Oxford BioBank Axiom GWAS dataset, from Dr. Fredrik Karpe and colleagues, includes associations for BMI and blood lipid levels from 7,193 participants, all healthy men and women between 30 and 50 years of age. It represents an additional analysis of the same samples contained in the Oxford BioBank exome chip analysis dataset.

These datasets are described in detail on our Data page. Summary results from all three sets are integrated into Gene and Variant pages in the T2DKP, and may also be viewed in the Manhattan plots accessible by searching for a phenotype from the T2DKP home page. The Variant Finder also queries these datasets.

The individual-level data behind all three of these datasets are accessible for custom association analysis in our Genetic Association Interactive Tool (GAIT) on Variant pages. Using this tool, researchers can filter samples to create a custom subset with defined characteristics such as age, gender, BMI, and other measures, and then run on-the-fly association analysis within that sample subset. Now, GAIT queries datasets both at the DCC and at the Federated node, using the same methodology for each, in a way that is transparent to users of the tool. The new Federated datasets bring the total number of individual-level samples available for custom analysis in the T2DKP to 67,768.
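To make the "filter, then analyze" workflow concrete, here is a minimal Python sketch of the general idea: restrict the sample pool to chosen age and BMI ranges, then compute a simple allelic odds ratio in the remaining samples. This is not GAIT's actual method, which the post does not describe. The Sample fields, the function names, the assumption of an autosomal diploid variant, and the 0.5 continuity correction are all illustrative choices.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    age: float
    bmi: float
    is_case: bool      # True for a T2D case, False for a control
    alt_alleles: int   # copies of the alternate allele: 0, 1, or 2 (autosomal)


def filter_samples(samples: List[Sample],
                   age_range: Tuple[float, float],
                   bmi_range: Tuple[float, float]) -> List[Sample]:
    """Keep samples whose age and BMI fall inside the inclusive ranges."""
    (age_lo, age_hi), (bmi_lo, bmi_hi) = age_range, bmi_range
    return [s for s in samples
            if age_lo <= s.age <= age_hi and bmi_lo <= s.bmi <= bmi_hi]


def allelic_odds_ratio(samples: List[Sample]) -> float:
    """Odds ratio for the alternate allele in cases versus controls.

    Adds 0.5 to each cell of the 2x2 allele-count table so the ratio stays
    defined when a cell is zero.
    """
    case_alt = sum(s.alt_alleles for s in samples if s.is_case)
    case_ref = sum(2 - s.alt_alleles for s in samples if s.is_case)
    ctrl_alt = sum(s.alt_alleles for s in samples if not s.is_case)
    ctrl_ref = sum(2 - s.alt_alleles for s in samples if not s.is_case)
    return ((case_alt + 0.5) * (ctrl_ref + 0.5)) / ((case_ref + 0.5) * (ctrl_alt + 0.5))


# Made-up samples, for illustration only.
pool = [
    Sample(age=60, bmi=31.0, is_case=True, alt_alleles=2),
    Sample(age=58, bmi=29.5, is_case=True, alt_alleles=1),
    Sample(age=62, bmi=27.0, is_case=False, alt_alleles=0),
    Sample(age=57, bmi=26.0, is_case=False, alt_alleles=1),
    Sample(age=35, bmi=24.0, is_case=False, alt_alleles=2),  # outside the age range
]
subset = filter_samples(pool, age_range=(55, 70), bmi_range=(20, 40))
odds_ratio = allelic_odds_ratio(subset)
```

A real analysis would typically use a regression model with covariates rather than a raw odds ratio; the sketch only shows how the choice of sample subset feeds into the statistic.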

Monday, January 22, 2018

GWAS data re-analysis yields novel results about T2D risk

"Waste not, want not." The old proverb is about frugality, but a study published today gives it a whole new dimension. Lead author Sílvia Bonàs, directed by Josep Mercader and David Torrents and collaborating with many colleagues at the Barcelona Supercomputing Center, the Broad Institute, and other institutions (Bonàs-Guarch et al. (2018), Nature Communications 9), decided to investigate variants associated with type 2 diabetes (T2D) by re-analyzing existing GWAS data rather than initiating a new study.

This was a frugal strategy, conserving both time and resources. But the benefits of this approach went way beyond frugality. By aggregating multiple datasets and using unified, current methods for quality control, imputation, and association analysis, the researchers discovered nuggets of significant information that were not apparent in the original analyses of the individual sets. And all of these nuggets are freely available for browsing and searching in the T2D Knowledge Portal (T2DKP).

To amass these data, the researchers combined all of the individual-level T2D case-control GWAS data that were available from the European Genome-Phenome Archive (EGA) and the database of Genotypes and Phenotypes (dbGaP). After harmonization and quality control, data from 70,127 subjects (12,931 cases and 57,196 controls) remained, inspiring them to name the project "70KforT2D".

In the time since the original studies had been performed, better and more comprehensive reference panels for imputation had been generated by the 1000 Genomes and UK10K projects. By using both of these panels for imputation, the researchers were able to substantially increase the number of variants that could be imputed. They ended up with more than 15 million variants, including more than 5 million rare variants and over 1.3 million indels, which have previously been difficult to impute.

In performing association analysis, the authors took advantage of existing large datasets of T2D association summary statistics for meta-analysis, being careful to only combine non-overlapping samples. They also took advantage of the T2D Knowledge Portal to verify some associations for low-frequency variants that were located in coding regions and had suggestive, but not unambiguously significant, p-values. The significance of the T2D associations of these variants was confirmed by meta-analysis along with the associations seen in two large studies in the T2DKP (GoT2D exome chip analysis, with nearly 80,000 samples, and the 17K exome sequence analysis dataset with 17,000 samples).
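For readers unfamiliar with how meta-analysis can sharpen a suggestive signal, the sketch below shows a fixed-effect, inverse-variance-weighted combination of per-study effect estimates. The post does not say which meta-analysis method the authors used, so this is an assumed, textbook approach. The function name and the input numbers are hypothetical, and the calculation is only valid when the studies share no samples.

```python
import math
from typing import List, Tuple


def fixed_effect_meta(estimates: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Combine (beta, standard_error) pairs from non-overlapping studies.

    Returns the pooled beta, its standard error, and a two-sided p-value
    from a normal approximation.
    """
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided p-value for a standard normal z statistic.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p_value


# Two made-up studies with the same effect size and precision.
single_study = fixed_effect_meta([(0.2, 0.1)])
combined = fixed_effect_meta([(0.2, 0.1), (0.2, 0.1)])
```

With these made-up inputs the pooled effect size is unchanged, but the pooled standard error shrinks and the p-value drops, which is how consistent evidence across independent datasets can move an association from suggestive to significant.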

The association analysis identified 57 loci associated with T2D risk at the genome-wide significance level or better (p-value ≤ 5 × 10⁻⁸), seven of which had not previously been associated with T2D. The high quality of the data made it possible to fine-map the variants at each of these loci and construct credible sets. Many of the putative causal variants, including those in previously identified loci, were indels rather than single-nucleotide polymorphisms, underscoring the importance of an imputation procedure that discovers indels.

The T2D-associated loci discovered in this study give some tantalizing hints about genes potentially involved in T2D, and suggest new avenues for detailed wet-lab investigation. We can’t review all of them in this space, but one association is particularly interesting for the generalizable lessons it teaches us about case-control GWAS for T2D.

This association, which the authors validated and replicated using additional datasets, involves the X chromosome variant rs146662075. In males, the risk allele confers a 2-fold elevated risk of developing T2D. The variant appears to affect an enhancer that could regulate expression of AGTR2, a gene known to be involved in modulating insulin sensitivity, making it a very interesting subject for investigation with regard to T2D. More work is needed to figure out whether this is really a male-specific effect, or whether it was only detectable in males because imputation for the X chromosome is more accurate in males, who have only one copy of the chromosome.

The first lesson learned from this association is that the X chromosome harbors important loci, and deserves attention in association studies. While this seems obvious, since the X chromosome comprises 5% of the genome, it has been neglected in most studies to date.

The second lesson is that for an adult-onset disease like T2D, it’s very important to pay attention to the details of case-control classification. If there are young people in the control group, they may actually be future T2D cases, destined to develop the disease later in life. When the authors tried to replicate the initial discovery for this variant in different datasets, the associations were not as significant as expected. But after digging deeper into the experimental cohorts, they found that most of the replication datasets had many subjects younger than 55, which was the average age for T2D onset for these cohorts. Re-running the analysis after excluding controls younger than 55 and also excluding those who appeared to be pre-diabetic, based on an oral glucose tolerance test, brought the replication results into concordance with the discovery results and confirmed the significance of the rs146662075 association.
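The stricter control definition described above can be sketched as a simple filter. The age cutoff of 55 comes from the post. Everything else is an assumption made for illustration: the Control fields, the function name, the use of a 2-hour OGTT glucose value of 7.8 mmol/L as the pre-diabetes cutoff (a commonly used threshold for impaired glucose tolerance, not necessarily the study's criterion), and the decision to keep controls who have no OGTT result.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Control:
    sample_id: str
    age: float
    ogtt_2h_glucose_mmol_l: Optional[float]  # None if no OGTT was recorded


MIN_CONTROL_AGE = 55.0          # from the post: exclude controls younger than 55
PREDIABETES_OGTT_MMOL_L = 7.8   # assumed cutoff; the study's own criteria may differ


def strict_controls(controls: List[Control]) -> List[Control]:
    """Drop controls who may be future or unrecognized T2D cases."""
    kept = []
    for c in controls:
        if c.age < MIN_CONTROL_AGE:
            continue  # too young to be confident they will not develop T2D
        glucose = c.ogtt_2h_glucose_mmol_l
        if glucose is not None and glucose >= PREDIABETES_OGTT_MMOL_L:
            continue  # OGTT result suggests pre-diabetes
        kept.append(c)  # assumption: controls without an OGTT are kept
    return kept


# Made-up controls, for illustration only.
candidates = [
    Control("c1", age=48, ogtt_2h_glucose_mmol_l=5.9),   # excluded: under 55
    Control("c2", age=61, ogtt_2h_glucose_mmol_l=8.4),   # excluded: OGTT at or above cutoff
    Control("c3", age=66, ogtt_2h_glucose_mmol_l=6.1),   # kept
    Control("c4", age=59, ogtt_2h_glucose_mmol_l=None),  # kept: no OGTT recorded
]
kept_ids = [c.sample_id for c in strict_controls(candidates)]
```

The trade-off is a smaller control group in exchange for cleaner case-control separation, which the post reports was enough to bring the replication results into line with the discovery results.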

In keeping with the spirit of open access, the authors provided the summary statistics from this work to the T2DKP even before publication. These results are incorporated into the T2DKP and are visible on Gene and Variant pages as well as searchable via the Variant Finder. The authors have also made the full summary statistics available for public download.

The novel and important findings from this study strongly reaffirm the value of data sharing. Not only are data sharing and re-analysis the right things to do for reasons of fairness, equity, and frugality; they can also spark new insights and move science forward in unexpected ways.

Friday, January 19, 2018

New METSIM dataset adds individual-level GWAS data to the T2DKP

The Finnish population is a valuable genetic resource. Having undergone multiple population bottlenecks, this relatively homogeneous population is enriched in low-frequency and loss-of-function variants. Even better, Finns are generally willing to participate in research studies, and many measures of their health are detailed in comprehensive electronic health records.

To take advantage of these characteristics, the METSIM (Metabolic Syndrome in Men) study (Laakso et al. 2017, J. Lipid Res. 58, 481-493) was initiated in 2005. Over 10,000 Finnish men were examined between 2005 and 2010. All of the subjects were phenotyped extensively, with an emphasis on traits associated with type 2 diabetes (T2D), cardiovascular disease, and insulin resistance, and their genotypes and exome sequences were determined. Subsets of the group have been characterized in more detail, with whole-genome sequencing and detailed analyses of transcripts and gene expression, DNA methylation, gut microbiome composition, and other phenotypes.

Now, you can easily access results from the METSIM cohort in the T2D Knowledge Portal. Variant associations with T2D, fasting glucose levels, and fasting insulin levels are available, both unadjusted and adjusted for body mass index. The individual-level data are also available for interactive analyses using our Genetic Association Interactive Tool (GAIT; see below), which allows you to design and run custom association analyses using custom subsets of the samples, while always protecting patient privacy. The addition of METSIM data brings to nearly 68,000 the number of samples available for analysis in GAIT.

The Foundation for the NIH and the Accelerating Medicines Partnership in Type 2 Diabetes were instrumental in bringing these data, generated by researchers in Finland and the U.S., to the T2DKP. Individual-level genotype data from 1,185 T2D cases and 7,357 controls were deposited into the Data Coordinating Center (AMP T2D DCC), and analysis and quality control were performed by the DCC analysis team. The experiment design and analysis are summarized on our Data page, and detailed reports that fully document the analysis are available for download.

The METSIM GWAS dataset currently has "Early Access Phase 1" status in the T2DKP, which is assigned to new data. This status denotes that although analysis and quality control checks have been performed, the data are not yet considered to be in their final state. During the early access period, users may analyze the data but may not submit the results of these analyses for publication. Find full details about the different phases of data release on our Policies page.

Results from METSIM GWAS may be viewed at these locations in the T2D Knowledge Portal:

• On Gene Pages (e.g., MTNR1B) in the Common variants and High-impact variants tables and in LocusZoom static plots, for the phenotypes T2D, T2D adjusted for BMI, fasting glucose, fasting glucose adjusted for BMI, fasting insulin, and fasting insulin adjusted for BMI;

• On Variant Pages (e.g., rs579060) in the Associations at a glance section, the Association statistics across traits table, and in LocusZoom static plots;

• From the View full genetic association results for a phenotype search on the home page: first select one of the phenotypes listed above, and then on the resulting page, select the METSIM GWAS dataset.

Individual-level METSIM GWAS data may be used for custom interactive analyses using these tools in the T2DKP:

• Using the Variant Finder tool, you may specify multiple criteria and retrieve the set of variants meeting those criteria;

• Using the Genetic Association Interactive Tool (GAIT) on Variant Pages, you may select the METSIM GWAS dataset, choose one of 5 phenotypes for association analysis, choose custom covariates, and filter the sample pool by specifying a range of values for one or more of 8 different phenotypes, then run on-the-fly analysis.

Phenotypes available for association analysis of METSIM GWAS data in GAIT

Covariates available for selection when analyzing METSIM GWAS data in GAIT

Samples may be filtered by setting ranges for one or more of 8 phenotypes for the METSIM GWAS dataset