Combining genome-wide association studies, polygenic risk scores and SNP-SNP interactions to investigate the genomic architecture of human complex diseases: more than the sum of its parts

Meijsen, Joeri Jeroen

Files

Meijsen2018.pdf (2.87 MB)

Date

2018-11-30

Abstract

Major Depressive Disorder is a devastating psychiatric illness that affects 10% of the UK population and has complex genetic and environmental components. Previous studies have shown that individuals with depression perform more poorly on measures of cognitive domains such as memory, attention, language and executive functioning. A major risk factor for depression is a higher level of neuroticism, which has been shown to be associated with depression throughout life. Understanding cognitive performance in depression and neuroticism could lead to a better understanding of the aetiology of depression.

The first aim of this thesis was to assess phenotypic and genetic differences in cognitive performance between healthy controls and depressed individuals, and between individuals with single episode and recurrent depression. The second aim was to determine the capability of two decision-tree-based methods to detect simulated gene-gene interactions. The third aim was to develop a novel statistical methodology for simultaneously analysing single SNP, additive and interacting genetic components associated with neuroticism using machine learning.

To assess the phenotypic and genetic differences in depression, 7,012 unrelated Generation Scotland participants (of whom 1,042 were clinically diagnosed with depression) were analysed. Significant differences in cognitive performance were observed in two domains: processing speed and vocabulary. Individuals with recurrent depression showed lower processing speed scores compared to both controls and individuals with single episode depression. Higher vocabulary scores were observed in depressed individuals compared to controls and in individuals with recurrent depression compared to controls. These significant differences could not be tied to significant single locus associations. Polygenic scores derived from the large CHARGE processing speed GWAS explained up to 1% of variation in processing speed performance among individuals with single episode and recurrent depression.

Two greedy, non-parametric, decision-tree-based methods, C5.0 and logic regression, were applied to simulated gene-gene interaction data from Generation Scotland. Several gene-gene interactions were simulated under multiple scenarios (e.g. size, strength of association and the presence of a polygenic component) to assess power and type I error. On the simulated data, C5.0 had greater power than logic regression, with a conservative type I error rate. C5.0 was then applied to years of education, as a proxy of educational attainment, in 6,765 Generation Scotland participants. Multiple interacting loci associated with years of education were detected, most notably in genes known to be associated with reading and spelling (RCAN3) and neurodevelopmental traits (NPAS3).

C5.0 was incorporated in a novel methodology called Machine-learning for Additive and Interaction Combined Analysis (MAICA). MAICA allows single locus, polygenic and gene-gene interaction risk factors to be analysed simultaneously by means of a machine learning implementation. MAICA was applied to neuroticism scores in both Generation Scotland and UK Biobank. The MAICA model in Generation Scotland included 151 single loci and 11 gene-gene interaction sets, and explained approximately 6.5% of variation in neuroticism scores. Applying the same model to UK Biobank did not lead to a statistically significant prediction of neuroticism scores.

The results presented in this thesis showed that individuals with depression scored significantly lower on processing speed tests but higher on vocabulary tests, and that 1% of variation in processing speed can be explained by polygenic scores derived from a large processing speed GWAS. Evidence was provided that C5.0 had increased power and acceptable type I error rates versus logic regression when epistatic models exist, even with a strong underlying polygenic component, and that MAICA is an efficient tool to assess single locus, polygenic and epistatic components simultaneously. MAICA is open-source and will provide a useful tool for other researchers of complex human traits who are interested in exploring the relative contributions of these different genomic architectures.

URI

http://hdl.handle.net/1842/33094

This item appears in the following Collection(s)

Edinburgh Medical School thesis and dissertation collection