Genomic epidemiology of SARS-CoV-2: from outbreak investigations, to national and international surveillance efforts
O'Toole, Áine Niamh
The response of the global genomics community to the SARS-CoV-2 pandemic has been unprecedented. At the time of writing, more than 3.7 million SARS-CoV-2 genome sequences have been shared publicly on GISAID (www.gisaid.org). Data on this scale presents novel opportunities and challenges for the field of genomic epidemiology. This thesis describes the development, validation and implementation of novel tools to facilitate different aspects of genomic epidemiology, from outbreak investigations to surveillance efforts. The Pango nomenclature lineage system is a set of rules that defines epidemiological lineages of SARS-CoV-2. Pango defines lineages from whole genome sequences, which 195 nations around the world have been producing for SARS-CoV-2. In chapter 1, I discuss the development and validation of pangolin, a software tool developed to assign the most likely Pango lineage to novel SARS-CoV-2 genomes. Initially, pangolin used a classic phylogenetic approach to assign lineages, although further methods were trialled and implemented as the pandemic progressed to cope with the scale of SARS-CoV-2 data and the analytical challenges associated with it. Since pangolin was first implemented, users across the world have assigned lineages to millions of SARS-CoV-2 genomes with it. For a number of reasons, labs may not be in a position to produce full genome sequences. Chapter 2 investigates how the lineage system can be used if only spike nucleotide sequences are available and defines ‘lineage sets’ that summarise what lineage information exists within a given spike haplotype. We find that for many lineages, including the main lineages corresponding to the WHO-defined variants of concern (VOCs), the spike nucleotide sequence is sufficient to distinguish Pango lineages. I also describe the development of hedgehog, a software tool that wraps pangolin and both defines and assigns these spike-based lineage sets.
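The idea of a lineage set can be illustrated with a minimal sketch. This is not hedgehog's implementation. It assumes only that each reference genome contributes a pair of (spike nucleotide sequence, Pango lineage). The function names and the four-letter toy sequences are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def build_lineage_sets(records):
    """Map each spike haplotype to the sorted tuple of lineages observed with it.

    `records` is an iterable of (spike_sequence, lineage) pairs.
    """
    observed = defaultdict(set)
    for spike_seq, lineage in records:
        observed[spike_seq].add(lineage)
    return {hap: tuple(sorted(lineages)) for hap, lineages in observed.items()}


def assign_lineage_set(spike_seq, lineage_sets):
    """Return the lineage set for an exact spike haplotype match, or None."""
    return lineage_sets.get(spike_seq)


# Toy data: short strings stand in for full spike nucleotide sequences.
records = [
    ("ACGT", "B.1.1.7"),
    ("ACGT", "B.1.1.7"),
    ("ACGA", "B.1"),
    ("ACGA", "B.1.1"),
]
lineage_sets = build_lineage_sets(records)
```

In this toy example the haplotype "ACGT" maps to a single lineage, so the spike sequence alone identifies it, whereas "ACGA" is shared by two lineages and can only be narrowed down to the set containing both.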
Pango lineage assignments with pangolin have been used almost ubiquitously across the globe and provide a simple, quick piece of information to classify SARS-CoV-2 genomes. However, both outbreak investigations and routine surveillance need a more in-depth analysis that gives more than this one piece of information. In chapter 3, I present civet, a software tool that addresses the challenge of a global SARS-CoV-2 dataset on the order of 3.7 million sequences. It performs robust phylogenetic analyses on query sequences of interest whilst contextualising them in the background data. Using civet, the user can produce an interactive report that summarises genomic, phylogenetic and epidemiological information, enabling routine analyses and investigations to be carried out in a single command. The suite of tools in this thesis has been developed to enable researchers to rapidly obtain robust and actionable information from SARS-CoV-2 genomes for genomic epidemiology efforts worldwide.