Interpreting the effects of missense variants through the lens of protein structure and molecular mechanisms
Date
28/08/2023
Item status
Restricted Access
Embargo end date
28/08/2024
Author
Gerasimavičius, Lukas
Abstract
The last decades have seen massive breakthroughs in next-generation sequencing technologies, allowing us to explore genetic disorders through extensive mutation databases. A large portion of these variants are missense mutations, which change the identity of an amino acid residue at the protein level. However, the abundance of data has highlighted how poorly equipped clinical settings are to evaluate the phenotypic consequences of missense mutations. These mutations are seldom functionally characterized and the impacts of most remain uncertain. Projects like gnomAD have also shown that truly deleterious missense variants are rare and that the majority of the observed variation has no known phenotypic effect. Thus, identifying which variants are associated with disease is challenging, both because of the considerable noise in the form of benign background variation and because efficient large-scale experimental approaches are lacking.
Considerable effort has been put into developing generalizable in silico models that could predict the effects of protein variants at scale. A large number of computational methods, termed variant effect predictors (VEPs), make use of evolutionary sequence conservation, phylogenetics and physicochemical properties to evaluate the probability of mutations being damaging. However, more interpretable avenues are presented by leveraging high-quality protein structural information, which is now abundant in the Protein Data Bank and available for the entire proteome through AlphaFold2. Structure-based stability predictors estimate the change in Gibbs free energy (ΔΔG) of folding or binding for a protein structure upon mutation. Curiously, they are routinely used in the field of clinical genetics, despite not being trained for disease identification, as they provide cues that can help interpret the underlying molecular disease mechanisms and not just the likelihood of damage.
I start by reviewing clinical variant classification practices and the issues that led to a considerable number of variants being classified as being of uncertain significance. I explore diverse sequence- and structure-based computational approaches for missense variant evaluation, as well as previous attempts at benchmarking them and the methodology used. Multiple studies have been published with considerably varying conclusions, which are affected by biases such as self-testing and data circularity. As an alternative to using pathogenic and benign variant datasets for benchmarking classification performance, I explore large-scale functional multiplexed assays of variant effect (MAVEs), which circumvent some of the known biases. Finally, I specifically expand on the methodological approaches and shortcomings of structure-based stability predictors and discuss the outstanding overall challenges of missense disease variant interpretation.
To address the lack of testing, I explore the variant identification performance of 13 methodologically diverse structure-based stability predictors in distinguishing between pathogenic ClinVar and putatively benign gnomAD variants. I identify FoldX as the best predictor at this task, but more importantly, show that using absolute values of ΔΔG increases the performance of most predictors. This result suggests separating variants by magnitude and not effect direction may be more effective at distinguishing between pathogenic and benign variation. However, I also demonstrate high per-gene performance heterogeneity for all predictors, illustrating that there are other underlying molecular aspects that affect our ability to computationally estimate variant importance.
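The effect of taking absolute ΔΔG values can be illustrated with a minimal sketch. It assumes that classification performance is summarised as the area under the ROC curve and that positive ΔΔG means destabilisation (the FoldX convention). The ΔΔG values and labels are hypothetical toy numbers, not data from the thesis, and `roc_auc` is an illustrative helper rather than the benchmarking code actually used.

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that a random pathogenic variant
    scores higher than a random benign one (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical ΔΔG values in kcal/mol (assumption: positive = destabilising).
# Pathogenic variants (label 1) are strongly destabilising OR stabilising;
# benign variants (label 0) sit close to zero.
ddg = [3.1, -2.4, 2.2, -1.8, 0.3, -0.2, 0.5, -0.4]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

auc_signed = roc_auc(ddg, labels)                  # direction-aware score
auc_abs = roc_auc([abs(x) for x in ddg], labels)   # magnitude-only score

print(f"signed ΔΔG AUC:   {auc_signed:.2f}")
print(f"absolute ΔΔG AUC: {auc_abs:.2f}")
```

In this toy set the signed score cannot separate the classes because half of the pathogenic variants are predicted to be stabilising, while the magnitude separates them cleanly. Real data would show a far smaller difference.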
While much effort has been dedicated to exploring how variants disrupt protein structure and cause a loss of function (LOF), alternative molecular mechanisms, specifically gain-of-function and dominant-negative effects, are often overlooked. I demonstrate that variants from non-LOF disease mechanism genes are significantly less likely to perturb structural stability, as predicted through FoldX on both subunit and complex structures, and are thus more poorly distinguishable from benign variation. Using complex structures, which involve protein-protein, protein-ligand and other biomolecular interactions, considerably increases the perceived energetic effects, showcasing the importance of evaluating variants in the full structural context. Interestingly, I also show that mechanism-specific performance issues affect most VEPs, which would suggest non-LOF variants are indeed less conserved and milder at the physicochemical level, and that current methodologies have a bias for LOF mechanism variants. Most importantly, I demonstrate that non-LOF mutations could potentially be identified through their tendency to cluster in three-dimensional space, as their effects appear localised to particular regions.
A more quantitative benchmarking opportunity is presented by exploring how well stability predictor scores correlate with experimentally derived functional impacts from MAVEs. I demonstrate that the best performing predictors, FoldX and Rosetta, are not only able to separate variants, but also maintain the relative effect ranking and distinction between full loss-of-function and hypomorphic variants. Further, I demonstrate that certain MAVE phenotypes are consistently more correlated with stability effects, with survival and protein abundance assays signifying a more LOF-oriented phenotype. Notably, using complex structures drastically improves the ability to rank variants in relation to their functional effects in datasets with targets that take part in biomolecular interactions with proteins or nucleic acids as part of their function.
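Whether a predictor preserves the relative ranking of variants can be measured with a rank correlation. The sketch below assumes Spearman's correlation as the metric, which the abstract does not name, and uses hypothetical ΔΔG and MAVE fitness values. `average_ranks`, `pearson` and `spearman` are illustrative helpers written for this example.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: predicted ΔΔG (kcal/mol) and a MAVE fitness score
# where 1.0 is wild-type-like and 0.0 is a full loss of function.
ddg = [0.1, 0.8, 1.5, 2.9, 4.2]
fitness = [1.0, 0.9, 0.6, 0.3, 0.05]

rho = spearman(ddg, fitness)
print(f"Spearman rho: {rho:.2f}")
```

A strongly negative value here means that more destabilising predictions go with lower measured fitness, so hypomorphic variants fall between wild-type-like and full loss-of-function variants in both rankings.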
Finally, I explore the utilization of structural information for predicting pathogenicity by leveraging knowledge of existing mutations and their spatial distribution throughout structures. I demonstrate how distance to previously identified pathogenic variants can be predictive of deleteriousness for novel variants, with performance on the same scale as established VEPs like SIFT. Moreover, the simple distance metric displays top performance, outcompeting other predictors, for a large number of genes characterized by non-LOF disease mechanisms, improving upon both stability predictors and VEPs in terms of consistency. I end by showcasing how layering such structural features on top of an accurate underlying evolutionary conservation-based predictor can result in a predictor that closes the variant identification performance gap between LOF and non-LOF disease genes. This result showcases the advantages of new heuristic structural approaches that do not suffer from the systemic biases of current methodologies.
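One way such a distance metric could be computed is sketched below. It assumes each residue is represented by a single coordinate (for example its Cα atom) and that the score is the Euclidean distance to the nearest residue carrying a known pathogenic variant. The abstract does not specify the exact definition, so the function name, coordinates and residue numbers are all hypothetical.

```python
import math


def nearest_pathogenic_distance(query_pos, coords, pathogenic_positions):
    """Distance (in Å) from the query residue to the closest residue that
    carries a known pathogenic variant. The query residue itself is
    excluded, so a variant is never scored against its own label."""
    others = [p for p in pathogenic_positions if p != query_pos]
    if not others:
        return math.inf
    return min(math.dist(coords[query_pos], coords[p]) for p in others)


# Hypothetical per-residue coordinates in Å, keyed by residue number.
coords = {
    10: (0.0, 0.0, 0.0),
    11: (3.8, 0.0, 0.0),
    50: (30.0, 0.0, 0.0),
    51: (30.0, 4.0, 0.0),
}
known_pathogenic = {10}

d_near = nearest_pathogenic_distance(11, coords, known_pathogenic)
d_far = nearest_pathogenic_distance(50, coords, known_pathogenic)
print(f"residue 11: {d_near:.1f} Å, residue 50: {d_far:.1f} Å")
```

Under this sketch a smaller distance marks a novel variant as more likely to be deleterious, so residue 11 would rank above residue 50. Excluding the query position is a design choice made here to avoid circularity when scoring variants that are already in the reference set.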