Edinburgh Research Archive

Applying machine learning methods to identify genetically distinct patient subgroups in COVID-19 critical illness

Abstract

There is evidence that critically ill COVID-19 patients exhibit a range of clinical phenotypes, which can influence their response to treatment. Understanding the underlying biological reasons for this variation may contribute to the development of more effective therapeutic interventions. The overarching aim of this thesis was to identify distinct subgroups of COVID-19 critical illness and then explore whether genetic loci were differentially associated with each cluster. The functional implications of such genetic associations were further explored in an attempt to identify genetic targets with potential translational implications. Using electronic healthcare records and genotyped data from individuals who participated in both the Genetics of Mortality in Critical Care (GenOMICC) and the ISARIC Coronavirus Clinical Characterisation Consortium (ISARIC4C) studies, I aimed to identify distinct subgroups of critical illness in COVID-19. I analyzed 7,609 patients, clustering them into subgroups based on the presence of 25 symptoms reported upon hospitalization. To identify these subgroups, I implemented three different clustering methods: two hard clustering methods, Clustering for Large Applications (CLARA) and Latent Class Analysis (LCA), and one multi-assignment method, the Multi-Assignment Clustering for Boolean Data (BMAC). These subgroups were then used to investigate whether differential genetic effects exist at both known and novel susceptibility loci. To achieve this, I conducted a Genome-Wide Association Study (GWAS) using a one-vs-rest study design, where each cluster was alternately treated as the case group, while the remaining clusters served as controls. To identify the optimal number of subgroups, clustering solutions were explored across a range from k = 4 to k = 7. CLARA was implemented for k = 3 to k = 5, as higher numbers of clusters could not be reliably executed due to data structure limitations. 
The LCA algorithm successfully clustered the dataset into solutions ranging from k = 2 to k = 20, with identified clusters being stable and consistently present across different values of k. The results of both methods were validated using a larger dataset of 51,068 ISARIC4C critical care individuals. Additionally, the BMAC algorithm was implemented for k = 4 to k = 7 to evaluate how well this soft clustering method validated the two hard clustering approaches. The analysis results indicate that across all clustering solutions, most clusters included three core symptoms: fever, cough, and shortness of breath. Cluster differentiation was driven by the presence of less common symptoms, such as myalgia, loss of taste and smell, and gastrointestinal symptoms. A distinct subgroup consistently emerged, consisting of individuals with mostly unrecorded symptoms. When the degree of similarity across clustering methods was evaluated and each model was validated, LCA emerged as the most consistent method. LCA clusters were successfully validated using the larger ISARIC4C dataset, and their qualitative characteristics agreed with those from the other two methods. Notably, LCA effectively handled individuals with missing data and identified all major combinations of distinct symptom profiles, even at lower cluster resolutions (k = 5). To determine the optimal clustering solution, clusters for k = 4 to k = 6 were evaluated for differential effects of demographic and clinical variables. The "no symptoms" group consistently displayed reduced comorbidity associations but elevated rates of severe interventions, such as ECMO, suggesting severe illness despite the limited recorded symptoms. The "lack of taste-smell" group maintained a stable profile with a significant association with hypertension. The "G.I." cluster also had a stable association with hypertension, but its clinical intervention associations indicated less severe disease.
Lastly, the "myalgia and fatigue" cluster exhibited variable associations with comorbidities and treatment across models. Significant associations, in at least two models, were found for obesity, chronic cardiac disease, and asthma, while the cluster also received less invasive treatments. The 6-class model introduced a unique "fever-confusion-G.I." cluster, characterized by elevated associations with mild liver disease and distinct intervention requirements. The identified clusters were stable across models ranging from 2 to 9 clusters. As the model progressed to higher-resolution solutions, new clusters were formed by partitioning existing clusters into smaller ones, with minimal movement of individuals among the others. To select the best LCA models, I took into account the power for detecting differential genetic effects among the clusters. To achieve this, I employed a one-vs-rest GWAS approach, using individuals belonging to each cluster as cases, with the remaining individuals from the 7,609-patient cohort as controls. For each cluster for k = 2 to k = 9, I calculated the statistical power for the one-vs-rest GWAS method and the minimum identifiable odds ratio (OR) per cluster, which ranged between 1.27 and 2.05. However, at resolutions of k = 6, smaller clusters exhibited unrealistically high minimum detectable ORs. Thus, accounting for both statistical power and the clinical variability of the clusters, I selected the 5-cluster solution for the subsequent genetic analysis. Due to mapping issues and data availability, genotypes were retrieved for 4,540 out of the 7,609 individuals, split into 3,949 Whole Genome Sequencing (WGS) and 591 array genotypes. To identify genetic associations within distinct COVID-19 clinical clusters, two separate GWAS analyses were conducted using WGS data via the SAIGE method and genotype array data via REGENIE.
The results from both datasets were meta-analyzed using the METAL software, applying fixed-effect inverse-variance weighting and focusing on common variants. Across all 5 GWAS, no variant reached the genome-wide significance threshold of p < 10⁻⁸. However, 12 loci were identified at the less stringent and suggestive threshold of p < 10⁻⁶. For the gastrointestinal symptoms cluster, suggestive loci were found near RPL31P9 on chromosome 18 (rs79623266, p-value = 1.49 × 10⁻⁷) and PDE7B, LOC101928373 on chromosome 6 (rs116700640, p-value = 4.96 × 10⁻⁷). The lack of taste and smell cluster had associations near AC097462.1 on chromosome 4 (rs7673024, p-value = 3.40 × 10⁻⁷) and CECR3 on chromosome 22 (rs73876679, p-value = 4.10 × 10⁻⁷). For the no recorded symptoms cluster, suggestive associations were observed near LINC01855 on chromosome 19 (rs8110844, p-value = 5.27 × 10⁻⁷) and GLIS3 on chromosome 9 (chr9:3809726:C:T, p-value = 8.73 × 10⁻⁷). In the core symptoms cluster, four suggestive loci were identified: one near RNY4P34 on chromosome 2 (rs1975583, p-value = 6.46 × 10⁻⁷) and three unannotated loci located on chromosome 7 (chr7:13243280:G:A, p-value = 3.85 × 10⁻⁷), chromosome 10 (rs142230359, p-value = 4.24 × 10⁻⁷), and chromosome 5 (chr5:53288003:G/GT, p-value = 9.58 × 10⁻⁷). Finally, the myalgia cluster exhibited one peak, located on chromosome X (chrX:81344729:A/C, p-value = 7.701 × 10⁻⁷), which could not be annotated to any known gene. To further explore gene-level associations, Transcriptome-Wide Association Studies (TWAS) were conducted using the MetaXcan framework, integrating eQTL and sQTL data from GTEx v8 tissues. For each cluster, individual TWAS were performed in whole blood and lung tissues, with a meta-TWAS across all tissues identifying significant gene and intron-level associations. 
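The fixed-effect inverse-variance weighting mentioned above for combining the WGS and array GWAS results can be illustrated with a minimal sketch. This is not the METAL implementation; it shows only the textbook formula for a single variant. The function name `fixed_effect_meta` and the effect sizes and standard errors below are hypothetical values chosen for illustration, not results from this thesis.

```python
import math

def fixed_effect_meta(betas, ses):
    """Combine per-study effect estimates for one variant.

    betas: per-study log odds ratios (hypothetical inputs)
    ses:   matching standard errors
    Returns the pooled beta, pooled SE, Z-score and two-sided p-value.
    """
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided p-value under a standard normal null.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p

# Hypothetical example: a larger study (small SE) and a smaller one (large SE).
beta, se, z, p = fixed_effect_meta(betas=[0.30, 0.20], ses=[0.10, 0.20])
print(f"beta={beta:.3f} se={se:.4f} z={z:.2f} p={p:.2e}")
```

Because the weights are inverse variances, the larger dataset dominates the pooled estimate, which is the expected behaviour when one input (here, the WGS set) is much larger than the other.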
At the Bonferroni-corrected threshold, CATIP in the "lack of taste-smell" cluster showed a significant association in lung tissue (p = 3.0 × 10⁻⁶), while RBP1 exhibited a near-significant association in blood tissue within the "core symptoms" cluster (p = 4.21 × 10⁻⁶). At the less stringent threshold of p < 10⁻⁵, there were an additional 14 associations in the eQTL analysis and only two in the sQTL analysis. Notably, the sQTL results revealed two suggestive associations: one in ADGRE5 in the "no symptoms" meta-cluster and one in an antisense gene to MAPK6 in the "core symptoms" meta-cluster, at p-values of 2.27 × 10⁻⁶ and 9.42 × 10⁻⁶, respectively. This study highlights the utility of clustering to identify meaningful subgroups of COVID-19 patients with critical illness, a focus often overlooked by studies that concentrate on general hospitalization. However, while clustering identified distinct clinical subgroups, these did not appear to be linked to significant genetic differences. Genetic variation between the subgroups may exist, but it appears to be subtle and is likely to require larger sample sizes to detect. Nonetheless, the one-vs-rest GWAS approach and cluster-derived case definitions offer a promising framework for future studies, especially for conditions like type 2 diabetes or acute respiratory distress syndrome, where larger datasets and distinct subtypes may yield stronger genetic signals.
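For readers unfamiliar with the one-vs-rest design, the cluster-derived case definitions can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the thesis: the function name `one_vs_rest_phenotypes`, the cluster label strings, and the toy assignments are all hypothetical.

```python
def one_vs_rest_phenotypes(cluster_labels):
    """Build one binary phenotype vector per cluster.

    For each cluster, individuals assigned to it are coded 1 (cases)
    and individuals in every other cluster are coded 0 (controls).
    """
    clusters = sorted(set(cluster_labels))
    return {
        c: [1 if label == c else 0 for label in cluster_labels]
        for c in clusters
    }

# Hypothetical hard-cluster assignments for eight individuals.
labels = ["core", "gi", "core", "no_symptoms",
          "taste_smell", "myalgia", "core", "gi"]

phenotypes = one_vs_rest_phenotypes(labels)
for name, y in phenotypes.items():
    n_cases = sum(y)
    print(f"{name}: {n_cases} cases, {len(y) - n_cases} controls")
```

Each phenotype vector would then be the outcome of a separate GWAS. With hard clustering, every individual is a case in exactly one analysis and a control in all the others, so small clusters yield few cases and correspondingly larger minimum detectable odds ratios.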