Edinburgh Research Archive

Next-generation computational optimization of protein expression systems

Authors

Nikolados, Evangelos-Marios

Abstract

A key area in biotechnology is the production of recombinant proteins for applications in the energy, food, and pharmaceutical sectors. In a typical microbial engineering pipeline, cellular hosts are transformed with heterologous genes that code for target protein products, and a central requirement is the maximization of titers, productivity, and yield. Such optimization requires the design of genetic elements that ensure high transcriptional and translational efficiency, such as promoter or ribosomal binding sequences. However, prediction of protein production is notoriously challenging. As a result, strain development suffers from costly rounds of prototyping and characterization, and typically relies on heuristic rules to navigate the sequence space towards increased production. This thesis has two parts and aims to develop computational frameworks for the optimization of protein expression systems.
The first part presents whole-cell mechanistic models for performing in silico studies of design spaces to find synthetic constructs that maximize productivity while minimizing their impact on endogenous cellular processes. Heterologous gene expression draws resources from host cells, including components vital for sustaining growth and replication, and the resulting cellular burden is a widely recognized bottleneck for the design of robust expression systems. Through various cases, I illustrate the power of host-circuit models to predict the impact of design parameters on both burden and gene circuit functionality. Furthermore, I revisit the definition of translation and introduce a novel extension of the mechanistic growth models that captures how varying levels of translation initiation efficiency affect cellular fitness. This model captures recently reported relations between growth rate and protein expression, in which low protein producers are unexpectedly associated with impaired growth.
Although whole-cell models provide mechanistic descriptions of microbial physiology, they cannot account for DNA sequence information. To address this, the second part of my thesis focuses on data-driven models that directly map DNA sequences to protein expression levels. Using a large genotype-phenotype screen in Escherichia coli, I emulated scenarios with varying numbers of DNA sequences for training, and assessed a large panel of machine learning models of increasing complexity, from penalized linear regressors to deep convolutional neural networks. My results suggest that classic, non-deep models can achieve good prediction accuracy with much smaller datasets than previously thought, and provide robust evidence that convolutional neural networks further improve performance with the same amount of data. Using tools from Explainable AI, I show that convolutional neural networks discriminate between input sequences better than their non-deep counterparts, and that the convolutional layers provide a mechanism to extract sequence features that are highly predictive of protein expression. Finally, I demonstrate that in limited-data scenarios, controlled sequence diversity can improve data efficiency and predictive performance across larger regions of the sequence space. I further validate this conclusion in a separate dataset of promoter sequences in Saccharomyces cerevisiae.
Mechanistic models hold promise as a powerful quantitative tool for teasing out the interplay between burden and heterologous protein expression, ultimately allowing for in silico studies of design spaces to find constructs that maximize productivity while minimizing their impact on endogenous cellular processes. At the same time, by systematically mapping the relation between data size, diversity, and the choice of machine learning models, I demonstrate the viability of more data-efficient deep learning models, helping to promote their adoption as a platform technology in microbial engineering. Overall, this thesis offers a comprehensive framework for tackling the intricate connectivity and complexity of biology, and facilitates the design of biosystems with desired features.
