Bacterial host attribution and bioinformatic characterisation of enteric bacteria Salmonella enterica and Escherichia coli from different hosts and environments
With the advent of relatively low-cost whole genome sequencing (WGS), it is now possible to obtain sequences from large numbers of bacterial strains and interrogate their core and accessory genomes in relation to associated metadata. While some bacterial species have preferred hosts, especially in terms of disease, there has been no systematic genomic investigation of host and niche specificity in ‘generalist’ bacteria, i.e., those that can be isolated from multiple hosts and environments. The main aim of this research was to determine whether host- and/or niche-specific proteins can be identified for ‘multi-host adapted’ bacteria such as E. coli and Salmonella Typhimurium (STm), in order to predict the ‘origin’ of a strain and its zoonotic potential from its sequence. Two datasets of ‘multi-host’ bacteria were analysed: 1,203 STm isolates from 4 hosts (avian, bovine, human and swine) and E. coli from 6 hosts (avian, bovine, canine, environmental, human and swine). Classical core genome analyses, such as core phylogeny, multilocus sequence typing and phylo-grouping, identified no strong correlations with host. The accessory genome was also investigated for host-based associations, and accessory host associated proteins (HAP) were identified for each of the bacteria/host groups. These proteins were used to build a machine learning (ML) classifier, a support vector machine (SVM), to predict the isolation host of the bacterial isolates. The majority of the isolates from both species were predicted correctly, with prediction accuracy ranging from 67% to 90%. For both bacterial species, the bovine and swine host groups were the most challenging to predict, as these two had many features in common. The approach allowed not only prediction of host based on WGS but also an assessment of how closely the genome of a particular isolate resembled the genomes of the same species isolated from other hosts.
This allowed ‘generalist’ and ‘specialist’ strains from each host group to be estimated, as well as the sequences that indicate successful transmission potential between hosts. This work also showed that diverse collections of E. coli or STm can be used as a baseline for prediction and quantification of zoonotic potential, as was demonstrated with E. coli O157 and Salmonella serovar Typhi. Overall, this part of the research indicated marked host restriction for both STm and E. coli, with only limited isolate subsets exhibiting host promiscuity based on predicted protein content. ML can be successfully applied to interrogate source attribution of bacterial isolates and has the capacity to predict zoonotic potential. The same ML approach was then used to ask how similar the known zoonotic pathogens are. When analysed separately, E. coli O157 can be classified further into human and bovine isolates, with only a small proportion of bovine isolates predicted as ‘human’, pointing to the specific cattle strains that are potentially a more serious threat to human health. This approach was tested with 2 independent sets of O157 human outbreak strains with traced-back isolates from animals and food. The outbreak strains were scored as ‘human’ regardless of their origin. This finding has profound implications for public health management of disease because interventions in cattle, such as vaccination, could be targeted at herds carrying strains of high zoonotic potential. The final section of the thesis research was based on the STm dataset and compared different ML approaches to test which algorithm performed best for host prediction. Dimensionality reduction techniques as well as unsupervised and supervised ML were applied to HAP.
Dimensionality reduction techniques and unsupervised ML were not able to split the dataset by host, and they produced differing results that could be challenging to interpret correctly in terms of the biological significance of the factors that influenced clustering. In contrast, all three supervised classifiers achieved comparably high levels of prediction (over 95%). Thus, the choice of supervised classifier for host prediction should be based on the knowledge of the end-user as well as on the requirements of any further analysis. To conclude, accessory genomes were successfully used for extraction of host associated proteins as well as for prediction of source host and quantification of zoonotic potential for bacterial species that can be isolated from multiple hosts. The methods described here can be applied to other bacteria and have implications for monitoring, identification and targeted interventions associated with potentially zoonotic infections. The results depend entirely on the quality of the dataset, which should be as large and diverse as possible. The research highlights the predictive potential of such algorithms but also the need for bacterial sequences to be gathered with as much useful metadata as possible, including isolation host.
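
As an illustration of the idea of extracting host associated proteins from the accessory genome, the following is a minimal sketch only. It assumes a presence/absence table of accessory proteins per isolate and selects proteins whose prevalence in one host group exceeds their prevalence in all other isolates by a chosen margin. The selection rule, the function name `host_associated_proteins`, the `min_diff` threshold and the toy isolate and protein identifiers are all hypothetical and are not taken from the thesis, which may have used a different criterion.

```python
from collections import defaultdict


def host_associated_proteins(presence, hosts, min_diff=0.5):
    """Select candidate host associated proteins (hypothetical rule).

    presence: dict mapping isolate id -> set of accessory protein ids
    hosts:    dict mapping isolate id -> isolation host label
    min_diff: assumed minimum prevalence difference (host vs. all others)

    Returns a dict mapping host -> sorted list of selected protein ids.
    """
    # Group isolates by their isolation host.
    by_host = defaultdict(list)
    for isolate, host in hosts.items():
        by_host[host].append(isolate)

    # Every accessory protein seen in at least one isolate.
    proteins = set().union(*presence.values())

    selected = {}
    for host, members in by_host.items():
        others = [i for i in hosts if hosts[i] != host]
        chosen = []
        for protein in sorted(proteins):
            # Fraction of isolates carrying the protein, inside and outside the host group.
            in_host = sum(protein in presence[i] for i in members) / len(members)
            in_others = (
                sum(protein in presence[i] for i in others) / len(others)
                if others
                else 0.0
            )
            if in_host - in_others >= min_diff:
                chosen.append(protein)
        selected[host] = chosen
    return selected


# Toy data (invented for illustration only).
toy_presence = {
    "iso1": {"pA", "pB"},
    "iso2": {"pA"},
    "iso3": {"pC"},
    "iso4": {"pC", "pB"},
}
toy_hosts = {"iso1": "bovine", "iso2": "bovine", "iso3": "human", "iso4": "human"}

toy_hap = host_associated_proteins(toy_presence, toy_hosts)
print(toy_hap)
```

In this toy example, a protein shared equally between host groups (`pB`) is not selected for either host. In a setting like the one described above, the selected proteins would then serve as the features passed to a supervised classifier such as an SVM.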