In search of a host: using machine learning for host attribution of Salmonella Typhimurium and host specificity of Escherichia coli bacteriophages
Authors
Chalka, Antonia
Abstract
Salmonella enterica is a pathogen of global importance and one of the leading bacterial causes of diarrhoeal and invasive disease worldwide. It is also a diverse pathogen, with over 2600 serovars associated with a wide variety of hosts, including humans as well as wild and farm animals. Generalist serovars, such as Salmonella Typhimurium, the major serovar responsible for food-borne salmonellosis, can infect a wide range of hosts and pose a particular challenge, as accurate host attribution is important for effective outbreak management and risk mitigation. Although there has been significant research into the genetic mechanisms associated with Salmonella host specificity, developing a robust and accurate model of host attribution for Salmonella has been challenging.
Machine learning has become an increasingly popular tool for building models from complex datasets, and has seen increased use across biology, from genome annotation to medical imaging. Some studies have built machine learning models for host attribution of Salmonella enterica, with a particular focus on Salmonella Typhimurium. However, there remains considerable room for improvement and investigation in this area, in particular because of the ever-increasing number of Salmonella Typhimurium sequences available in public databases and improvements in bioinformatic tools, which make it possible to extract and analyse previously overlooked genomic features such as intergenic regions. In addition, there have been limited efforts to investigate which genomic features such models deem important for host attribution.
An initial dataset of 3371 Salmonella Typhimurium sequences with associated metadata, including host of isolation (human, poultry, bovine or swine), was collected from Enterobase and used to create a robust, reproducible pipeline for training host attribution models. The pipeline focused on ensuring that its inputs were high-quality and diverse, extracting SNPs, AMR profiles, protein variants and intergenic regions, and building an array of random forest models from these genomic features.
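As a rough illustration of the feature-extraction step, the sketch below turns per-genome sets of features into a presence/absence matrix of the kind a random forest could be trained on. It is not the thesis pipeline: the function name, the feature IDs, the `min_genomes` filter and the choice to drop features present in every genome are all assumptions made for this example.

```python
from collections import Counter


def build_presence_absence_matrix(features_by_genome, min_genomes=2):
    """Turn per-genome feature sets into a 0/1 matrix.

    features_by_genome: dict mapping a genome ID to the set of feature IDs
    (for example intergenic-region or protein-variant clusters) found in it.
    min_genomes: hypothetical filter; features seen in fewer genomes are dropped.
    Features present in every genome are also dropped, since they cannot
    separate hosts.
    """
    genome_ids = sorted(features_by_genome)
    n_genomes = len(genome_ids)
    # Count, for each feature, how many genomes carry it.
    counts = Counter(
        feature
        for features in features_by_genome.values()
        for feature in set(features)
    )
    feature_names = sorted(
        f for f, c in counts.items() if min_genomes <= c < n_genomes
    )
    # One row per genome, one column per retained feature.
    matrix = [
        [1 if f in features_by_genome[g] else 0 for f in feature_names]
        for g in genome_ids
    ]
    return genome_ids, feature_names, matrix


# Toy input with made-up genome and feature IDs.
toy = {
    "g1": {"igr_001", "igr_002", "pv_010"},
    "g2": {"igr_001", "pv_010"},
    "g3": {"igr_001", "igr_002"},
    "g4": {"igr_001", "igr_003"},
}
genome_ids, feature_names, matrix = build_presence_absence_matrix(toy)
for genome_id, row in zip(genome_ids, matrix):
    print(genome_id, dict(zip(feature_names, row)))
```

Each row of `matrix` could then be paired with the host label of the corresponding genome and passed to a classifier of choice.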
Two types of model were created: ‘all’, which predicted all hosts, and ‘bps’, which predicted only farm animal hosts (bovine, poultry, swine). An additional class of model called ‘human’, a binomial human/livestock classifier, was added in the finalised pipeline but not created during the initial round of model training. The preliminary models showed that intergenic regions and protein variants were the best-performing features for host attribution (80-95%), followed by SNPs (76-82%) and AMR profiles (67-82%). Poultry was the most accurately predicted host (>90%), followed by swine, then human and bovine.
A subsequent phylogenetic analysis based on core SNPs revealed a distinct poultry cluster, with other clusters being more mixed in terms of host of origin. In an attempt to test the generalizability of the models, and to identify phylogenetically independent host-specific genomic features, models were trained using a Leave-Group-Out-Cross-Validation method based on phylogenetic clusters. However, the resulting models had varying but generally worse performance, which highlighted the importance of accounting for phylogeny when creating the models, as well as the need to benchmark them against phylogeny-based predictions. Although some measures were taken to account for population structure during modelling, specifically during feature selection, these approaches can be refined further, for example by stratifying the training and validation datasets by phylogenetic cluster, or by borrowing techniques from GWAS, such as principal component analysis for further dimensionality reduction.
With a finalised pipeline, an additional ~1500 human-derived Salmonella Typhimurium sequences were added to the dataset, resulting in an initial dataset of >5000 sequences with 3313 high-quality assemblies used for model training. The resulting models showed similar trends to their predecessors in terms of feature performance, solidifying the usefulness of intergenic regions for host attribution, but with a general increase in accuracy (92-94% kappa for protein variants models, 98-99% kappa for intergenic region models). Models created from protein variants and intergenic regions had a higher accuracy than a phylogenetic model based on assigning host from the nearest neighbour (74% kappa value). The intergenic region and protein variant ‘bps’ models predicted that ~45% of the human-derived Salmonella Typhimurium sequences originated from bovine, ~40% from poultry, and ~14.5% from swine.
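The evaluation ideas above (holding out whole phylogenetic clusters, a nearest-neighbour baseline, and Cohen's kappa as the score) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the method used in the thesis: the toy SNP profiles, the cluster labels and the use of Hamming distance as a stand-in for phylogenetic distance are all invented for the example.

```python
from collections import Counter


def leave_group_out_splits(groups):
    """Yield (held_out_group, train_idx, test_idx), one cluster held out at a time."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


def hamming(a, b):
    """Number of positions at which two equal-length SNP profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbour_host(query, ref_profiles, ref_hosts):
    """Assign the host of the closest reference profile (ties go to the first)."""
    best = min(range(len(ref_profiles)), key=lambda i: hamming(query, ref_profiles[i]))
    return ref_hosts[best]


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; this sketch returns 0.0 by convention
    return (observed - expected) / (1.0 - expected)


# Toy data: made-up SNP profiles, hosts and phylogenetic cluster labels.
profiles = ["AAAA", "AAAT", "GGCC", "GGCA", "TTTT", "TTTA"]
hosts = ["poultry", "poultry", "bovine", "swine", "human", "human"]
clusters = [0, 0, 1, 1, 2, 2]

y_true, y_pred = [], []
for held_out, train, test in leave_group_out_splits(clusters):
    ref_profiles = [profiles[i] for i in train]
    ref_hosts = [hosts[i] for i in train]
    for i in test:
        y_true.append(hosts[i])
        y_pred.append(nearest_neighbour_host(profiles[i], ref_profiles, ref_hosts))

print("kappa under leave-cluster-out:", cohen_kappa(y_true, y_pred))
```

Because every test isolate is scored only against isolates from other clusters, a predictor that relies on phylogenetic proximity should do poorly here, which is the reason for using such a split to probe generalizability.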
Investigating the important features of the resulting models proved difficult. An investigation of intergenic regions revealed a co-association of important protein variants with important intergenic regions. Many genes deemed important were hypothetical genes, lacking annotations and not present in any gene databases. Of the genes with existing annotations, some, such as hdeB, had already been associated with host selection, but many had not. Additionally, many of the important genes had small differentials across hosts, making straightforward conclusions about host association difficult. One of the most promising candidates was a protein variant labelled ‘ybal/fsr/kefB’, which was present in almost all the livestock-derived but only approximately half the human-derived Salmonella Typhimurium sequences. Attempts to deconvolve the cluster revealed a complicated structure, where a few long ‘scaffold-like’ genes were grouped with smaller genes of high genetic identity. Additionally, although the pangenome graph indicated a complete absence (deletion) of the ‘ybal/fsr/kefB’ region rather than the presence of a different cluster, confirming this was challenging because of the wide range of sequences that would have to be checked to draw definitive conclusions, as well as the lack of long-read sequencing.
The methodologies initially applied in the development of Salmonella host attribution models were repurposed to predict bacteriophage-E. coli interactions in the context of phage therapy, and to showcase the adaptability of the model creation and testing pipelines. Our approach involved adapting the genomic feature extraction process to construct models predicting the infection score of a particular bacteriophage infecting a specific E. coli isolate. Each bacteriophage was associated with an individual model, and the efficacy of these models relied heavily on the diversity of the training dataset corresponding to each phage. Notably, the variance in infectivity across different bacteriophages significantly impacted model performance and reliability; more diverse datasets correlated with more dependable predictive models.
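One simple way to screen for the training-set diversity issue described above is to compute, for each phage, the variance of its infection scores across the E. coli isolates before a model is trained. The sketch below is an assumption-laden illustration rather than the procedure used in the thesis: the score scale, the phage and isolate names, and the `min_variance` cut-off are all hypothetical.

```python
from statistics import pvariance


def score_variance_by_phage(scores_by_phage):
    """Population variance of infection scores for each phage.

    scores_by_phage: dict mapping a phage name to a dict of
    {isolate ID: infection score}.
    """
    return {
        phage: pvariance(list(scores.values()))
        for phage, scores in scores_by_phage.items()
    }


def low_diversity_phages(scores_by_phage, min_variance=0.5):
    """Phages whose scores vary too little to train a dependable model.

    min_variance is a hypothetical cut-off chosen only for this example.
    """
    variances = score_variance_by_phage(scores_by_phage)
    return sorted(p for p, v in variances.items() if v < min_variance)


# Toy infection scores on a made-up 0-4 scale.
toy_scores = {
    "phage_A": {"ec1": 0.0, "ec2": 4.0, "ec3": 0.0, "ec4": 4.0},
    "phage_B": {"ec1": 3.0, "ec2": 3.0, "ec3": 3.0, "ec4": 3.0},
}
variances = score_variance_by_phage(toy_scores)
flagged = low_diversity_phages(toy_scores)
print(variances)
print("low-diversity phages:", flagged)
```

A phage that infects every isolate equally well (or equally poorly) gives a model nothing to learn from, so flagging it early avoids over-interpreting its predictions.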
Unlike the trends observed in our Salmonella models, there were minimal performance differences between models trained on gene families versus intergenic regions. To validate the predictive capacity of these models, an additional set of ten E. coli isolates was employed, revealing minimal discrepancies between the infection scores predicted by the models and the actual experimental outcomes, particularly for the robust phage models.
Overall, this study resulted in a machine learning host-attribution pipeline and an array of highly accurate machine learning models trained on a robust dataset, and highlighted the importance of intergenic regions for host attribution. Although there is a lot of room for improvement, a solid foundation has been built on which machine learning can be leveraged to gain a deeper understanding of Salmonella host specificity.