Thursday, June 6, 2019

New T2DKP release features potential T2D effector genes

Today, in a new release of the Type 2 Diabetes Knowledge Portal, we present a distillation of many years of work from the global T2D research community: a list of the genes most likely to represent effectors for the development of T2D, based on a heuristic developed by Anubha Mahajan and Mark McCarthy that takes into account a variety of genetic and genomic evidence.

Identifying such candidate effectors is the goal of the Accelerating Medicines Partnership in Type 2 Diabetes (AMP T2D), established in 2014. AMP T2D brought together stakeholders from government, academia, and industry to speed up the translation of genetic data into insights about disease mechanisms and drug targets. The generation, aggregation, and analysis of unprecedented amounts of data in this collaboration have spurred the development of methods for systematic data integration (see for example Fernandez-Tajes et al., 2019).

Now, by prioritizing and integrating multiple sources of evidence, Mahajan and McCarthy have classified genes according to the likelihood that they are involved in the development of T2D. The sources of evidence that they consider include genetic association data; functional genomic data such as eQTLs and chromatin conformation; mutant phenotype evidence from model organisms and knockdown screens in human cells; and other evidence gathered from the literature. The heuristic is described in detail in downloadable documentation.

Today's release of the T2DKP includes an interactive table that displays these classifications and allows you to view and explore all of the evidence underlying them.

Section of the Predicted T2D effector gene table. Columns are sortable, and columns containing combined evidence expand to show the individual evidence types comprising that classification.

When viewing this list, several caveats should be remembered. These are predictions only, and the strength of the predictions varies considerably among genes in the list. Also, any heuristic has limits, especially one developed in the absence of a clear "gold-standard" set, as this one was. Still, we hope that this list will be a valuable resource that can help suggest or support experimental directions for T2D researchers. We welcome feedback on the heuristic and the interface. Over the next year we plan to develop software to facilitate the generation and updating of these results.

Today's release of the T2DKP also includes 8 new datasets:

  • BioBank Japan GWAS datasets (an overall set plus sex-stratified sets) bring to the T2DKP genetic associations for a wide range of phenotypes from over 190,000 individuals of East Asian ancestry. Phenotypes in these sets include many clinical measures as well as disease status for T2D, atrial fibrillation, and open-angle glaucoma.
  • Singapore Chinese Eye Study (SCES) GWAS, Singapore Malay Eye Study (SiMES) GWAS, and Singapore Indian Eye Study (SINDI) GWAS provide T2D associations for individuals of East Asian and South Asian ancestry.
  • Singapore Living Biobank GWAS datasets include associations with anthropometric and lipid traits for Chinese and Malay populations. 
All of these datasets are described fully on the T2DKP Data page.

Another new feature of today's release is that a link to standalone versions of our custom association analysis tools, the Genetic Association Interactive Tool (GAIT) and the Custom burden test, is now available on the Analysis Modules page. Both of these tools securely access individual-level data to compute on-the-fly genetic associations using custom parameters. GAIT, for single-variant association analysis, was previously accessible only on Variant pages; the Custom burden test, for computing the disease burden across a gene, was previously accessible only on the High-impact Variants tab of Gene pages.

Finally, today's release includes a new instructional video that leads you through the features of the T2DKP Variant page. The video is listed on, and linked from, the T2DKP Resources page.

Check out our latest newsletter for more details about these and other recent additions to the T2DKP.

Monday, June 3, 2019

See you at ADA next weekend!

The Type 2 Diabetes Knowledge Portal team will once again be presenting an exhibit booth at the 79th Scientific Sessions of the American Diabetes Association in San Francisco next weekend. This year, we're excited to be presenting along with our collaborators from the Diabetes Epigenome Atlas (DGA).

Stop by the booth (#2306) to get a personal, hands-on demonstration of the new tools and features, or just to say hello and let us know what new data and features you’d like to see in the T2DKP or DGA.

We’ll be there during all the exhibit hall hours:

Saturday, June 8: 10am-4pm
Sunday, June 9: 10am-4pm
Monday, June 10: 10am-2pm

Please email us if you would like to schedule a 1:1 tutorial session at a particular time, or just stop by our booth. We hope to see you there!

Wednesday, May 22, 2019

T2DKP now offers a T2D-specific exome sequence collection of unprecedented size

The largest known exome sequence analysis specific to a complex disease was published today in Nature, and all of the results are now freely available in the Type 2 Diabetes Knowledge Portal (T2DKP) to support researchers worldwide as they make decisions about how to prioritize potential T2D drug targets for investigation. The paper, “Exome sequencing of 20,791 cases of type 2 diabetes and 24,440 controls” (Flannick et al.), describes a multi-ancestry analysis of both variant-level and gene-level genetic associations for type 2 diabetes.

The paper is the culmination of years of work from a global collaboration to generate exome sequences across five ancestry groups. The project began as an effort by the Type 2 Diabetes Genetic Exploration by Next-generation sequencing in multi-Ethnic Samples (T2D-GENES) consortium to perform exome sequencing and T2D association analysis for about 13,000 samples, and evolved into a consortium of consortia—about 30 international sites in all, including the GoT2D, ESP, SIGMA, LuCAMP, and ProDIGY consortia—that partnered to design a study including as many exomes as possible. The Accelerating Medicines Partnership in Type 2 Diabetes grew out of this effort, and today supports a wide range of genetic association and other studies aimed at elucidating the mechanisms behind T2D, as well as supporting the T2DKP to serve these results to the world.

The study included participants of African American, East Asian, European, Hispanic/Latino, and South Asian ancestry. The researchers sequenced exomes (the protein-coding regions of the genome) from these participants and performed gene-level association analysis in order to detect rare variants and uncover allelic series within genes. They also performed single-variant association analysis for a subset of the samples using genome-wide arrays and imputation. A comparison of the two methods confirmed that the strength of exome sequencing is its ability to identify informative, often rare, alleles that may yield clues to disease mechanisms, while array-based GWAS provides a more comprehensive picture of strongly associated loci.

The researchers found exome-wide significant gene-level T2D associations for three genes (MC4R, PAM, and SLC30A8). Replication of the gene-level associations in a meta-analysis of three independent exome sequencing datasets confirmed the significance of these associations and found exome-wide significance for a fourth gene, UBE2NL. The variant alleles uncovered in these genes are effectively “experiments of nature” that may subtly alter the structure, function, or stability of the gene products and could be very helpful in suggesting further research directions to discover the roles of these proteins in T2D risk.

But what of the other genes whose gene-level associations didn’t meet exome-wide significance? Suspecting that these associations could still provide valuable information, the authors decided to test whether these association scores were meaningful. They created sets of genes that were known or likely to have a role in T2D risk: for example, genes known to be T2D drug targets, genes in which mutations cause maturity onset diabetes of the young (MODY), or genes whose mouse homologs confer glycemic phenotypes when knocked out. In each case, the genes in the set had more significant gene-level T2D associations than would be expected by chance, suggesting that their scores were meaningful despite relatively low statistical significance. Analysis of additional sets of genes, for example those located in strongly T2D-associated GWAS loci, supported this conclusion.
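The logic of such a gene-set check can be sketched with a simple permutation test. This is a generic illustration only, not the procedure used in the paper: the function name, the choice of mean -log10(p) as the test statistic, and the toy scores are all assumptions made for the example.

```python
import random

def gene_set_enrichment_p(scores, gene_set, n_perm=10000, seed=0):
    """One-sided permutation p-value for a gene set (illustrative only).

    scores   -- dict mapping gene name to -log10(gene-level p-value)
    gene_set -- iterable of gene names to test
    Returns the fraction of random same-size gene sets whose mean score
    is at least as high as the observed mean, with a +1 correction so
    that the result is never exactly zero.
    """
    rng = random.Random(seed)
    genes = sorted(scores)          # fixed order keeps results reproducible
    wanted = set(gene_set)
    members = [g for g in genes if g in wanted]
    if not members:
        raise ValueError("no gene in the set has a score")
    k = len(members)
    observed = sum(scores[g] for g in members) / k
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, k)   # random gene set of the same size
        if sum(scores[g] for g in sample) / k >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data (entirely made up): 200 background genes with low scores,
# plus a hypothetical set of 5 genes with modestly elevated scores.
toy_rng = random.Random(1)
toy_scores = {f"BACKGROUND_{i}": toy_rng.uniform(0.0, 1.0) for i in range(200)}
toy_set = [f"SET_GENE_{i}" for i in range(5)]
toy_scores.update({g: toy_rng.uniform(1.0, 2.0) for g in toy_set})
print(gene_set_enrichment_p(toy_scores, toy_set))
```

A small p-value here means that the set as a whole scores higher than random sets of the same size, even if no single gene in it would pass an exome-wide threshold on its own.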

Thus, although future studies with larger sample sizes will be needed to uncover strongly significant gene-level associations, the associations generated from this study can still provide evidence to support prioritization of research effort and resources. For example, the gene-level scores could help suggest which gene in a T2D-associated locus is most likely to be relevant to T2D. The series of variant alleles in individual genes that were identified in this study could help indicate whether it is gain or loss of protein function that affects T2D risk, an important piece of information for drug development.

So that researchers worldwide may benefit from these results, they were made available in the T2DKP, with agreement from all of the authors, when the preprint of the paper was posted to bioRxiv. “A main message of the paper is that rare variants potentially provide a much more valuable resource for drug development than previously thought,” said Jason Flannick, first author on the paper. “We can actually detect evidence of their disease association in many genes that could be targeted by new medications or studied to understand the fundamental processes underlying disease. But because there is so much more information than just the variants in the genes cited in the paper, making all of the results available to everybody is critical for them to have the largest impact.”


In the T2DKP, this dataset is termed the AMP T2D-GENES exome sequence analysis set and is described on the Data page. The single-variant T2D associations may be browsed and searched throughout the T2DKP: on Gene and Variant pages, in Interactive Manhattan plots, and via the Variant Finder tool. The Genetic Association Interactive Tool (GAIT) for single variants and the custom burden test for genes provide secure interaction with the individual-level data from this set, allowing the user to filter samples and set custom parameters before performing on-the-fly association analysis.

The gene-level association scores are displayed in the T2DKP via two avenues. A new page lists genes with their association scores and other information such as the number of variants used to calculate the score. The variants comprising the scores may be filtered by any of 7 different categories, and the results of two different aggregation test methods are also available. Gene-level scores are also shown in the Gene Prioritization Toolkit on Gene pages. See our recent blog post for a description of this interface.

In addition to the sheer volume of these exome sequencing results, their open availability in the T2DKP is a remarkable milestone for the diabetes genetics research community. "I believe the T2D genetics community is setting examples both for human genetics, in data aggregation and joint analysis, and in its commitment to sharing of these results on an open platform enabling non-experts to make direct use of the results," says Noël Burtt, Director of Operations and Development for Knowledge Portals and Diabetes at the Broad Institute. The T2DKP team is proud to be a part of this collaborative effort.

Read the press release

Tuesday, April 23, 2019

New Gene Prioritization Toolkit adds value to GWAS results

Genetic association data from genome-wide association studies (GWAS) are foundational for our understanding of type 2 diabetes and other complex diseases. But in order to apply these results to diagnosis, drug development, and treatment, we need to identify the effector genes that explain those genetic associations. This is rarely straightforward: most SNPs associated with disease are located outside of coding regions of the genome, so that their impact on genes is not obvious; and even a variant located in a protein-coding gene may actually affect a different gene. And to complicate things further, a variant that is strongly associated with disease may not have a direct impact on a gene, but may rather be "along for the ride" with a tightly linked causal variant.

Today we have released a prototype, experimental version of an interactive tool in the Type 2 Diabetes Knowledge Portal that can help bridge the gap between genetic association results and the effector genes that are directly involved in disease. We are aggregating additional data types—for example, transcriptional regulation, tissue specificity, curated biological annotations, and more—and integrating them using cutting-edge computational methods in order to mine insights from GWAS data. The new Gene Prioritization Toolkit presents these data types and results to help researchers evaluate candidate causal genes around a genetic association signal.

As a first step in developing this tool, we needed to find a way to store many different connections between variants, genes, tissues, phenotypes, and biological annotations. We decided to use a Neo4j graph database, which holds data nodes and their relationships with each other and can support complex, scientifically meaningful queries.


Neo4j graph showing variants on chromosome 8 that are associated with glycemic phenotypes. Orange circles represent variants; pink, p-values; blue, phenotypes; red, phenotype group; green and brown, variant annotations.
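To give a feel for why a graph model suits this kind of data, here is a minimal in-memory sketch in Python. It does not use the graph database itself and does not reflect the Portal's actual schema: the class, the "Variant" and "Phenotype" labels, the ASSOCIATED_WITH relationship, and all identifiers and p-values are hypothetical. As a simplification, p-values are stored as properties on the relationship rather than as separate nodes as in the figure.

```python
class TinyGraph:
    """A toy property graph: labeled nodes plus typed, directed edges."""

    def __init__(self):
        self.nodes = {}   # node id -> {"label": ..., other properties}
        self.edges = {}   # source id -> [(relationship, target id, properties)]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, source, relationship, target, **props):
        self.edges.setdefault(source, []).append((relationship, target, props))

    def neighbors(self, source, relationship):
        """Targets (with edge properties) reachable via one relationship type."""
        return [(t, p) for r, t, p in self.edges.get(source, []) if r == relationship]

def variants_for_group(graph, group, max_p):
    """Variants associated with any phenotype in `group` at p <= max_p."""
    hits = []
    for node_id, node in graph.nodes.items():
        if node["label"] != "Variant":
            continue
        for phenotype, props in graph.neighbors(node_id, "ASSOCIATED_WITH"):
            in_group = graph.nodes[phenotype].get("group") == group
            if in_group and props["p_value"] <= max_p:
                hits.append((node_id, phenotype, props["p_value"]))
    return sorted(hits, key=lambda hit: hit[2])   # most significant first

# Made-up example data.
graph = TinyGraph()
graph.add_node("variant_1", "Variant", chromosome="8")
graph.add_node("variant_2", "Variant", chromosome="8")
graph.add_node("fasting glucose", "Phenotype", group="glycemic")
graph.add_node("HbA1c", "Phenotype", group="glycemic")
graph.add_node("BMI", "Phenotype", group="anthropometric")
graph.add_edge("variant_1", "ASSOCIATED_WITH", "fasting glucose", p_value=2e-9)
graph.add_edge("variant_1", "ASSOCIATED_WITH", "BMI", p_value=1e-12)
graph.add_edge("variant_2", "ASSOCIATED_WITH", "HbA1c", p_value=0.04)

print(variants_for_group(graph, "glycemic", max_p=5e-8))
```

The point of the sketch is that a question such as "which variants are linked to any glycemic phenotype?" becomes a short walk along relationships, which is the kind of query a graph database is built to answer at scale.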

We have also created pipelines to apply computational methods to the genetic association data in the T2DKP. In brief, we are currently running:
  • MetaXcan, which integrates tissue-specific expression data from GTEx and genetic association data to predict the potential that a gene is causal for a phenotype in a given tissue;
  • DEPICT, which integrates multiple data sources including transcriptional co-regulation, Gene Ontology annotations,  model organism phenotypes, and more to predict membership of a gene in a pathway and the probability of its association with a given phenotype;
  • eCAVIAR and COLOC, two methods that quantify the probability that a variant is causal in both genetic association and eQTL studies.
We present the results of these methods in an interactive table on a new tab of the Gene page, "Genes in region" (see an example).



In addition to the results of the methods listed above, the table includes gene-level T2D associations generated by two types of burden test (Firth and SKAT) from an analysis of nearly 50,000 exome sequences by Jason Flannick and colleagues, as well as the phenotypes of knockout mice that are mutant for homologs of the human genes in the region, from the Mouse Genome Database. All of these methods and data types are described in more detail in our downloadable help documentation for the new interface.

The table shows all of these data types for each gene across the region. It has two alternative views: the Significance view, in which table cells are color-coded by significance, and the Records view, in which shading indicates the number of records in each cell. This visual summary allows you to compare genes quickly across methods. Clicking on a cell opens a window listing full details of the results.

The table also supports versatile sorting. Columns may be dragged and dropped in order to group comparable genes, as shown below:


Default view of the Gene Prioritization table. Columns represent genes and rows represent methods or data types. Cell color denotes significance, with darker shades indicating higher significance.

The same table after custom re-ordering of columns to group three genes that all have significant eCAVIAR and COLOC scores.

In addition, the table may be transposed so that the columns represent methods and the rows represent genes. This allows sorting by significance within a method, so that the gene with the most significant result for each method is easily identified.
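The transposition and per-method sorting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Portal's implementation: the gene names and values are made up, and it assumes that every method's result has been reduced to a p-value-like number where smaller means more significant, which is a simplification.

```python
# Rows are methods, columns are genes; values are made-up p-values.
# A missing cell simply means that the method has no record for that gene.
table = {
    "MetaXcan": {"GENE_A": 0.2, "GENE_B": 1e-6, "GENE_C": 0.03},
    "DEPICT": {"GENE_A": 0.004, "GENE_B": 0.5},
}

def transpose(table):
    """Swap rows and columns of a dict-of-dicts table."""
    transposed = {}
    for row_key, row in table.items():
        for col_key, value in row.items():
            transposed.setdefault(col_key, {})[row_key] = value
    return transposed

def rank_genes(by_gene, method):
    """Genes sorted from most to least significant for one method.

    Genes with no result for the method are left out.
    """
    scored = [(gene, row[method]) for gene, row in by_gene.items() if method in row]
    return sorted(scored, key=lambda pair: pair[1])

by_gene = transpose(table)            # rows are now genes
print(rank_genes(by_gene, "MetaXcan"))
print(rank_genes(by_gene, "DEPICT"))
```

Once genes are the rows, picking out the top gene for a method is just a matter of sorting on that method's column.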

This entire system, from data storage through the computational pipelines through the user interface, has been designed to be flexible and modular so that in the future we will be able to add new methods and data types easily and rapidly. As we actively develop the system, we are very interested in feedback from researchers about how to improve it. Please try it out and let us know what you think!

Thursday, April 18, 2019

GPS information for BMI and obesity now available in the CVDKP

Genome-wide polygenic scores (GPS) have great potential for helping to advance research on complex diseases and traits. Not only can they help predict individual genetic risk, but they can also help us understand the physiology of disease, by identifying groups at the extremes of risk whose clinical profiles can be studied or who may be enrolled in clinical trials.

Following up on their previous work that generated GPSs for five complex diseases, co-lead authors Amit Khera and Mark Chaffin, along with senior author Sekar Kathiresan and colleagues, have now developed a GPS for body mass index (BMI) and obesity, published today in Cell. To help promote obesity research, the authors have provided an open-access file listing the variants and weights that comprise the GPS. That file is now available for download from the Data page of our sister Knowledge Portal, the Cardiovascular Disease Knowledge Portal.

To generate this GPS, Khera and colleagues started with a large published genome-wide association study (GWAS) for BMI in more than 300,000 individuals (Locke et al., 2015) and applied an algorithm that assigned a weight to each of 2.1 million variants, also taking into account factors such as the proportion of variants with non-zero effect size and the degree of correlation between a variant and its neighbors. They validated the GPS by applying it to nearly 120,000 UK Biobank participants, finding that the score was strongly correlated with measured BMI, and then applied it to four independent testing datasets.
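Once the weights exist, applying a score of this kind to one person is conceptually a weighted sum: each variant's weight is multiplied by the number of copies of the effect allele that the person carries, and the products are added up. The sketch below illustrates only that final step, not the algorithm that derives the weights. The variant identifiers, weights, and the dictionary layout are hypothetical and do not reflect the format of the downloadable file.

```python
def polygenic_score(weights, dosages):
    """Weighted sum of effect-allele dosages for one individual.

    weights -- dict mapping variant ID to its weight in the score
    dosages -- dict mapping variant ID to effect-allele dosage (0 to 2)
    Variants that were not genotyped in this individual are skipped.
    """
    return sum(w * dosages[v] for v, w in weights.items() if v in dosages)

# Made-up weights and genotypes for illustration.
weights = {"variant_1": 0.5, "variant_2": -0.25, "variant_3": 0.125}
person = {"variant_1": 2, "variant_2": 1, "variant_4": 2}

print(polygenic_score(weights, person))   # 0.5*2 + (-0.25)*1 = 0.75
```

In practice the raw sums are compared across a population, so that individuals can be placed at a percentile of the score distribution rather than interpreted in isolation.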

We don't have space here to cover the many interesting details uncovered by the researchers, but overall, this work shows that a high GPS strongly predicts increased risk of severe obesity, cardiometabolic disease, and all-cause mortality. Those with the very highest GPS had a level of risk for obesity similar to that conferred by a rare monogenic mutation in the MC4R gene.

The GPS has the potential to be a powerful tool for people struggling with overweight and obesity. "Importantly, we are in the early days of identifying how we can best inform and empower patients to overcome health risks in their genetic background," said Khera in a press release from the Broad Institute. "We are incredibly excited about the potential to improve health outcomes."

We invite you to read the paper, take a look at the file of variants and weights freely available from the CVDKP Data page, and contact us with any questions!


Wednesday, April 3, 2019

New Hoorn DCS dataset available in the T2DKP via federation

A new dataset, "Hoorn DCS 2019," is now available in the Type 2 Diabetes Knowledge Portal via the T2DKP Federated node at the European Bioinformatics Institute (EBI). The Hoorn Diabetes Care System (DCS) cohort is a prospective cohort of type 2 diabetics in the West Friesland region of the Netherlands, for whom clinical measurements are collected annually. Association analysis was performed at EBI across 1,997 samples for 16 phenotypes, including glycemic, anthropometric, cardiovascular, and renal traits. The Hoorn DCS 2019 dataset is described in detail on the T2DKP Data page.

This new dataset is housed at the EBI Federated node of the T2DKP, which enables researchers to interact with results that may not be transferred to the AMP T2D Data Coordinating Center (DCC) at the Broad Institute because of institutional, regional, or national regulations. Data at the EBI node are stored in such a way that their specific privacy requirements are met, but they are available for secure remote queries via T2DKP tools and interfaces. Results from such queries are served up alongside results from all of the datasets housed at the AMP T2D DCC, such that researchers may browse and query data from any location without even needing to know where the data reside. This federation mechanism represents both an important technical advance in handling and protecting data, and a significant step forward in democratizing and improving access to genetic association results. Results at the EBI Federated node now comprise 9 datasets, nearly 40,000 samples, and associations for a wide variety of phenotypes.

Summary results from all of these datasets are integrated into Gene and Variant pages in the T2DKP, and may also be viewed in interactive Manhattan plots or queried using the Variant Finder tool. The individual-level data behind the datasets are accessible for custom association analysis in our Genetic Association Interactive Tool (GAIT) on Variant pages. Using this tool, researchers can filter samples to create a custom subset with defined characteristics such as age, gender, BMI, and other measures, and then run on-the-fly association analysis within that sample subset.
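The filter-then-analyze workflow can be illustrated with a small Python sketch. This is not how GAIT is implemented and it is much simpler than a real association model: the field names, the sample records, and the use of a plain carrier-versus-non-carrier odds ratio (with a 0.5 continuity correction) are all assumptions chosen to keep the example short.

```python
def filter_samples(samples, **ranges):
    """Keep samples whose fields fall inside every (low, high) range given."""
    return [
        s for s in samples
        if all(low <= s[field] <= high for field, (low, high) in ranges.items())
    ]

def carrier_odds_ratio(samples):
    """Odds ratio of disease for variant carriers versus non-carriers.

    Adds 0.5 to each cell of the 2x2 table (Haldane-Anscombe correction)
    so that an empty cell does not cause a division by zero.
    """
    a = sum(1 for s in samples if s["carrier"] and s["case"]) + 0.5
    b = sum(1 for s in samples if s["carrier"] and not s["case"]) + 0.5
    c = sum(1 for s in samples if not s["carrier"] and s["case"]) + 0.5
    d = sum(1 for s in samples if not s["carrier"] and not s["case"]) + 0.5
    return (a * d) / (b * c)

# Made-up individual-level records.
samples = [
    {"age": 55, "bmi": 31.0, "carrier": True, "case": True},
    {"age": 62, "bmi": 29.5, "carrier": True, "case": True},
    {"age": 48, "bmi": 27.0, "carrier": False, "case": False},
    {"age": 51, "bmi": 33.2, "carrier": False, "case": True},
    {"age": 35, "bmi": 22.0, "carrier": False, "case": False},
    {"age": 67, "bmi": 30.1, "carrier": True, "case": False},
]

subset = filter_samples(samples, age=(45, 70), bmi=(25.0, 40.0))
print(len(subset), carrier_odds_ratio(subset))
```

The key design point is the separation of the two steps: the subset is defined first from sample characteristics, and the association is then computed only within that subset.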

Please take a look at the new dataset and contact us with any questions or comments!

Wednesday, March 27, 2019

Get more help using the T2DKP: videos and webinars

Although we try hard to make all of the contents and interfaces of the Type 2 Diabetes Knowledge Portal clear and user-friendly, they can still be difficult to understand—especially for scientists who are not experts in human genetics. It can be hard to know where to get started among the dozens of complex datasets, several data types, multiple phenotypes, and custom analysis tools included in the Portal. So we're starting to produce two different kinds of video content that will complement our written documentation and satisfy all you auditory learners out there.

First, we're creating short videos (5 minutes or less) that focus on particular aspects of the data and features. Our first one (view on YouTube) is an overall introduction to the T2DKP and the Knowledge Portal architecture. Stay tuned for more in the coming months! And if you would like to see a video on a particular subject, please let us know.

Second, we're planning regular webinars that you can join online. The first of these hour-long sessions gave an overview of the project that created the T2DKP, the data it contains, the major entry points to the results, and a preview of future directions. You can view a recording of the webinar here and download the slides here. If you would like to be notified in advance of these sessions, be sure to sign up for our email list.

Please check out these videos and let us know what you think!