Written by Nascent Transcript writer Sarah A. Gagliano, PhD
February 2018
A publicly available resource of half a million individuals with genetic data, and thousands of diverse phenotypes, is a worthy cause for celebration in the genetic research community. The UK Biobank is currently the largest genotyped prospective cohort available to researchers worldwide. For those in genetics, this resource provides immense opportunities for discovering novel (or investigating known) trait-variant associations, and for developing computationally efficient statistical methods for “big” data. This article provides an overview of the types of data available through the UK Biobank, and a glimpse of the diverse research and discoveries that these data are empowering.
The UK Biobank consists of adults, aged 40-69, who were recruited across the United Kingdom [1]. Researchers have access to phenotypic, genotypic, and magnetic resonance imaging (MRI) data for many of these individuals.
A Wealth of Data
In addition to basic demographics, the UK Biobank has phenotypic data on thousands of traits derived from various sources. Baseline and ongoing follow-up data are available from verbal interviews and questionnaires (on topics ranging from sun exposure and sleeping habits to social support), from measurements of physical and cognitive factors, and from medical records, including International Classification of Diseases (ICD) billing codes [1]. The “Data Showcase” is an excellent starting place for browsing through the structure of the phenotypic data.
As for genetic data, individuals in the UK Biobank have been genotyped on genome-wide arrays, and genotypes were imputed at non-genotyped sites across the genome. The interim data release of the UK Biobank, which came out in mid-2015, offered genetic data for approximately 150,000 individuals, [2] and the full set containing genetic data for half a million individuals became available in the second half of 2017. The genetic data were imputed using the Haplotype Reference Consortium (HRC) panel, [3] resulting in about 40 million variant sites.
The data also underwent a second round of imputation using a joint panel combining the UK10K and 1000 Genomes reference datasets. However, an error in genomic positions was identified in this second round of imputation, and researchers were notified that the data would be re-imputed and re-released. The HRC-imputed sites were not affected by the error. As of the end of January 2018, a revised version of the imputed data had not yet been released.
Additionally, MRI scans of vital body parts, such as the brain and heart, have been conducted on upwards of 19,000 individuals, although sample sizes can vary by scan. Imaging assessments are ongoing, as the UK Biobank is aiming for scans on a total of 100,000 individuals.
Early Applications and Findings
Researchers are already making use of this wealth of data for various aims, including illuminating novel gene-trait associations and introducing new statistical methods that are computationally efficient at this scale. A quick search for the term “UK Biobank” on the preprint server BioRxiv attests to the number of studies already conducted, which will only continue to grow as different aspects of the data are probed.
The large sample size of the UK Biobank allows for the discovery of novel risk variants associated with particular traits, which previous studies were underpowered to detect because of their small sample sizes. To give a sense of the diversity of phenotypes collected by the UK Biobank, some examples of traits for which novel loci have now been identified by means of this dataset include neuroticism, [4] aortic valve stenosis, [5] and lifetime cannabis use [6].
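To make the link between sample size and power concrete, here is a minimal sketch in Python. It is not taken from any of the cited studies. It assumes a quantitative trait, a single variant with an additive effect, a normal approximation to the association test, and the conventional genome-wide significance threshold of 5e-8. The function name `gwas_power` and the variance-explained value of 0.01% are hypothetical choices for illustration; the sample sizes of 150,000 and 500,000 mirror the interim and full releases described above, and 10,000 stands in for a smaller earlier study.

```python
from statistics import NormalDist


def gwas_power(n, var_explained, alpha=5e-8):
    """Approximate power of a two-sided z-test for one variant.

    n             -- number of individuals (assumed unrelated)
    var_explained -- fraction of trait variance explained by the variant
                     (hypothetical input, chosen by the user)
    alpha         -- significance threshold (5e-8 is the usual
                     genome-wide convention)
    """
    std_normal = NormalDist()
    # Critical value of the two-sided test at level alpha.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected magnitude of the test statistic under the alternative.
    expected_z = (n * var_explained / (1 - var_explained)) ** 0.5
    # Probability that the statistic falls beyond either critical value.
    return std_normal.cdf(expected_z - z_crit) + std_normal.cdf(-expected_z - z_crit)


if __name__ == "__main__":
    # Illustrative effect: a variant explaining 0.01% of trait variance.
    for n in (10_000, 150_000, 500_000):
        print(f"n = {n:>7,}: power = {gwas_power(n, 0.0001):.4f}")
```

Under these assumptions, a variant of this size is almost never detected with 10,000 individuals, rarely with 150,000, and in most cases with 500,000. Real analyses differ in important ways (binary traits, relatedness, imputation quality), so the numbers are indicative only.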
Novel associations have also been discovered through investigations based exclusively on the phenotypic data. For instance, among current smokers, worsening sleep duration was found to be associated with increased cigarettes per day, [7] and higher levels of educational attainment may contribute to myopia [8].
Furthermore, with the release of the UK Biobank data, the development of computational and statistical tools has gained momentum to address the challenge of analyzing such “big” data. A few examples include an R package established to manage and query the UK Biobank data, [9] a method developed for the joint analysis of summary statistics from GWASs of different traits, [10] and a method created to control for case-control imbalance and sample relatedness in association studies [11].
The release of the UK Biobank provides researchers with a unique resource to aid in enhancing our understanding of the relationship between genetics and human characteristics and diseases, and in developing computationally efficient tools to handle the demands of such large amounts of data. To register and apply for access to these data, visit the UK Biobank Access Management System (AMS).
References
1. Bycroft et al. 2017 BioRxiv https://www.biorxiv.org/content/early/2017/07/20/166298
2. UK Biobank 2015 Genotype imputation and genetic association studies of UK Biobank Interim Data Release, May 2015.
3. Haplotype Reference Consortium 2016 Nat Genet 48:1279-1283.
4. Luciano et al. 2018 Nat Genet 50(1):6-11.
5. Helgadottir et al. 2017 BioRxiv https://www.biorxiv.org/content/early/2017/11/03/213595
6. Pasman et al. 2018 BioRxiv https://www.biorxiv.org/content/early/2018/01/08/234294
7. Patterson et al. 2018 Addictive Behaviors 77:47-50.
8. Mountjoy et al. 2017 BioRxiv https://www.biorxiv.org/content/early/2017/08/04/172247
9. Hanscombe et al. 2017 BioRxiv https://www.biorxiv.org/content/early/2017/06/30/158113
10. Turley et al. 2017 BioRxiv https://www.biorxiv.org/content/early/2017/07/03/118810
11. Zhou et al. 2017 BioRxiv https://www.biorxiv.org/content/early/2017/11/24/212357