Wednesday, May 18, 2016

Expanding the landscape of human genetic variation data in the Type 2 Diabetes Knowledge Portal

With the addition of four new sequence data sets to our database, the number of variants and associations accessible via the Portal pages and tools has increased by millions.

Two of the new data sets are from projects that have obtained sequence data from a wide range of individuals. The ExAC data set, comprising exome sequences collected and harmonized by the Exome Aggregation Consortium, includes sequence data from 60,706 unrelated people of multiple ancestries. The 1000 Genomes data set, from the International Genome Sample Resource project (IGSR), is composed of whole-genome sequences from 2,504 individuals in four different ethnic groups. 

The allele frequencies of variants in the different ethnic groups surveyed in the 1000 Genomes data set can be seen in the “How common is…?” section on the Variant pages (view an example). Both the ExAC and 1000 Genomes data sets can also be queried using the Variant Finder tool. You can select them on a new tab of the interface, “Additional search options”, where you can also add more criteria to your search.

The Data set pull-down menu on the "Additional Search Options" tab of the Variant Finder lets you specify 1000 Genomes or ExAC data.

Available selections in the Data set pull-down menu.

The other two new data sets in the Portal were both generated by the GoT2D consortium. A whole-genome sequence data set (GoT2D WGS) adds data from 2,657 individuals, including the associations of noncoding variants that were not present in the previous whole-exome sequence data set from the GoT2D project. This new data set brings T2D association data across 30 million variants to the Portal. The GoT2D WGS + replication data set adds imputation to that set, bringing the sample size to over 47,000 and including most low-frequency and common variants.  

The new GoT2D data can be seen in multiple sections of the Portal’s Gene and Variant pages, and may also be accessed by selecting these data sets in the Variant Finder.

Beyond these major additions, today’s data release also includes some bug fixes and data harmonization.

Get out there and explore the new data landscape in the Portal, and let us know what you think!

Monday, May 9, 2016

Better summaries of variant information convey the most important information at a glance

We’ve made significant improvements to the information we display on the Variant pages of the T2D Knowledge Portal. The summary at the top of each Variant page (view an example) now shows the reference nucleotide and the variant nucleotide at that position. Transcripts covering the variant are listed, along with several important details for each transcript: the change caused by the variant in the encoded protein sequence (if applicable); the Sequence Ontology term describing the consequence of the variation (for example, “missense variant”); and the expected effect of the variant on protein function, as predicted by the PolyPhen and SIFT algorithms.

Summary section of the Variant page

Just below the summary on the Variant page, we’ve also improved the graphic showing the association of the variant with T2D and related traits. We’ve renamed this section “associations at a glance” because it immediately shows the most important information about these associations.

At-a-glance section of the Variant page. Click the image to view a larger version.

The boxes in this graphic represent the associations of this variant with T2D (at the top) and with other traits (below, in an expandable section). Under the hood, the software is now pulling up information more quickly so that the display is more responsive. We’ve also made it more pleasant to look at, tidying up the shape of the boxes and the alignment of the information they contain.

But beyond the style improvements, we’ve added a lot of substance. Where available, each association now includes the odds ratio (for dichotomous traits) or the effect size (for continuous traits) and the direction of effect. Positive effects are shown in blue, and negative effects in purple. 
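As a rough illustration of what an odds ratio conveys for a dichotomous trait such as T2D, the sketch below computes one from carrier counts in cases and controls. The function names and the counts are invented for illustration, and treating an odds ratio above 1 as a "positive" effect is our assumption here. The values shown in the Portal come from the statistical models used in the underlying studies, not from this simple calculation.

```python
def odds_ratio(case_carriers, case_noncarriers,
               control_carriers, control_noncarriers):
    """Odds ratio from a 2x2 table of carrier status by disease status.

    All four counts are hypothetical inputs and must be non-zero.
    """
    odds_in_cases = case_carriers / case_noncarriers
    odds_in_controls = control_carriers / control_noncarriers
    return odds_in_cases / odds_in_controls


def direction_of_effect(ratio):
    """Assumed convention: OR > 1 is a positive effect, OR < 1 negative."""
    if ratio > 1:
        return "positive"
    if ratio < 1:
        return "negative"
    return "none"


# Hypothetical counts: 30 of 100 cases and 20 of 100 controls carry the variant.
example = odds_ratio(30, 70, 20, 80)
print(round(example, 2), direction_of_effect(example))
```

An odds ratio near 1 means the variant is about equally common in cases and controls. The further it moves from 1, the stronger the apparent effect.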

We’ve also added the sample size, in black text in the bottom left corner of the box, for each data set. This indicates the total number of individuals involved in the study. And if available, the frequency and count of the variant in the data set are shown in red and blue text at the bottom middle and bottom right corner of the box, respectively. The count indicates the number of haplotypes in the set that contain the variant, while the frequency indicates the occurrence of the variant allele in the sampled population.
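To make the relationship between these three numbers concrete, here is a minimal sketch. It assumes an autosomal variant in diploid individuals, all of whom were successfully genotyped at the site, so that each person contributes two haplotypes. The function name and the example numbers are hypothetical, and the frequencies displayed in the Portal are taken from the data sets themselves.

```python
def allele_frequency(variant_count, sample_size, ploidy=2):
    """Estimate the variant allele frequency from a haplotype count.

    variant_count: number of haplotypes carrying the variant
    sample_size:   number of individuals in the data set
    ploidy:        haplotypes per individual (2 assumed, i.e. autosomal)
    """
    total_haplotypes = ploidy * sample_size
    if total_haplotypes <= 0:
        raise ValueError("sample_size and ploidy must be positive")
    if not 0 <= variant_count <= total_haplotypes:
        raise ValueError("variant_count must lie between 0 and total haplotypes")
    return variant_count / total_haplotypes


# Hypothetical example: 50 variant haplotypes among 1,000 individuals
# (2,000 haplotypes in total).
print(allele_frequency(50, 1000))
```

Under these assumptions, a count of 50 in a sample of 1,000 people corresponds to a frequency of 2.5%, because the denominator is the number of haplotypes rather than the number of people.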

This additional information can help you evaluate the significance of associations. The sample size and variant count determine the power of the data set to detect an association. The higher the power, the more precise the estimate of the variant’s effect tends to be.

Finally, when a variant is associated with other traits in addition to T2D, those traits in the same category are labeled with the same color. For example, in the display above, proinsulin levels, fasting glucose, HOMA-B, and two-hour glucose—all glycemic phenotypes—are labeled in orange, while triglycerides, LDL cholesterol, and cholesterol—lipid phenotypes—are labeled in red. This lets you see easily when a variant is linked to multiple traits that could reflect a common process or pathway, possibly offering a clue to the mechanism by which it affects physiology.

So this improved graphic now gives you an idea, literally at a single glance, of how strongly a variant is associated with T2D, how significant that association is, and whether it is also associated with other traits. 

We made these improvements in response to suggestions from scientists who use the T2D Knowledge Portal. We hope to hear your feedback too!

Friday, May 6, 2016

T2D Knowledge Portal in the news

The poster that we presented at the Biocuration 2016 conference was selected by F1000Research as the featured poster or slide of the month! As an organization promoting open access to publications and data, they were particularly interested in the challenge we face at the Portal in designing tools that allow researchers to gain valuable insights from the data while still protecting confidential patient information. Read their take on it in their blog post.