Edinburgh Research Archive

Statistical and machine learning approaches to genomic medicine

Authors

Bradley, Jacob R.

Abstract

In this thesis, we develop new statistical and machine learning methods for genomic medicine and apply them to problems in diagnostics and precision oncology. Our overall aim is to introduce techniques that inform practical decision making in the design and use of clinical tests. The work combines domain-specific context with modern advances in Bayesian hierarchical modelling, high-dimensional statistics, and causal inference. We begin in Chapter 1 with an introduction to the concepts and methodologies that are common throughout the thesis. This includes the necessary context from molecular biology, an overview of genomics in medicine with a particular focus on cancer (the subject of Chapters 3 and 4), and a description of data-generating technologies such as DNA sequencing and gene expression profiling. We also provide an in-depth introduction to the relevant statistical learning methods and techniques. This sets the scene for the three projects presented in subsequent chapters.
In Chapter 2 we analyse the resolution of the loop-mediated isothermal amplification (LAMP) assay. LAMP is a technology that can be used in medical tests that require quantifying the presence of RNA for each of a set of gene targets. Motivated by the unmet need for statistically principled methods for guided LAMP optimisation, we show how to use data from clinical and synthetic samples to improve the resolution of a LAMP-based diagnostic test for sepsis patients. In this context, optimisation of the assay refers both to the selection of gene targets and to the tuning of reaction conditions and selection of optimal primers to produce robust, high-resolution measurements of gene expression. Our analysis identifies novel quantities associated with primer design that may drive assay performance.
Chapter 3 focuses on designing gene panels to estimate tumour mutation burden (TMB) and other exome-wide biomarkers, which are used to determine which cancer patients will benefit from immunotherapy. The cost of whole-exome sequencing presently limits the widespread use of such biomarkers. In this chapter, we introduce a data-driven framework for the design of targeted gene panels for estimating a broad class of biomarkers including tumour mutation burden and tumour indel burden. The first goal is to develop a generative model for the profile of mutation across the exome, which allows for gene- and variant-type-dependent mutation rates. Based on this model, we then propose a procedure for constructing biomarker estimators. Our approach allows the practitioner to select a targeted gene panel of prespecified size and construct an estimator that depends only on the selected genes. Alternatively, our method may be applied to make predictions based on an existing gene panel, or to augment a gene panel to a given size. We demonstrate the excellent performance of our proposal using data from three non-small cell lung cancer studies, as well as data from six other cancer types.
In Chapter 4, we consider causal questions in survival analysis, and investigate the extent to which the treatment effect of immunotherapy varies according to patients’ clinical and genomic features. Methods for identifying heterogeneous treatment effects from survival data are still in their infancy, and so in this chapter we benchmark some recently proposed strategies. In particular, we show that high-throughput targeted sequencing data may offer better insight into which patients are likely to benefit from immunotherapy, using state-of-the-art statistical learning methods based on causal survival forests and regularisation.
