Next-generation computational optimization of protein expression systems
Authors
Nikolados, Evangelos-Marios
Abstract
A key area in biotechnology is the production of recombinant proteins for applications
in the energy, food, and pharmaceutical sectors. In a typical microbial engineering
pipeline, cellular hosts are transformed with heterologous genes that code for target protein products, and a central requirement is the maximization of titers, productivity,
and yield. Such optimization requires the design of genetic elements that ensure
high transcriptional and translational efficiency, such as promoter or ribosomal binding sequences. However, prediction of protein production is notoriously challenging
and, as a result, strain development suffers from costly rounds of prototyping and
characterization, typically relying on heuristic rules to navigate the sequence space
towards increased production. This thesis has two parts and aims to develop
computational frameworks for the optimization of protein expression systems.
The first part presents whole-cell mechanistic models for performing in silico studies
of design spaces to find synthetic constructs that maximize productivity while minimizing their impact on endogenous cellular processes. Heterologous gene expression
draws resources from host cells. These resources include vital components to sustain
growth and replication, and the resulting cellular burden is a widely recognized bottleneck for the design of robust expression systems. Through several case studies, I illustrate
the power of host-circuit models to predict the impact of design parameters on both
burden and gene circuit functionality. Furthermore, I revisit the definition of translation
and introduce a novel extension of the mechanistic growth models that captures the
effects of varied levels of translation initiation efficiency and its impact on cellular
fitness. This model captures recently shown relations between growth rate and protein
expression, where low protein producers are unexpectedly associated with impaired
growth.
Although whole-cell models provide mechanistic descriptions of microbial physiology, they cannot account for DNA sequence information. To address this gap,
the second part of my thesis focuses on data-driven models that directly map DNA
sequences to protein expression levels. Using a large genotype-phenotype screen
in Escherichia coli, I emulated scenarios with varying numbers of DNA sequences for
training, and assessed a large panel of machine learning models of increasing complexity, from penalized linear regressors to deep convolutional neural networks. My
results suggest that classic, non-deep models can achieve good prediction accuracy
with much smaller datasets than previously thought, and provide robust evidence that
convolutional neural networks further improve performance with the same amount
of data. Moreover, using tools from Explainable AI, I show that convolutional neural
networks discriminate between input sequences better than their non-deep counterparts, and that the convolutional layers provide a mechanism to
extract sequence features that are highly predictive of protein expression. Finally, I
demonstrate that in limited data scenarios, controlled sequence diversity can improve
data efficiency and predictive performance across larger regions of the sequence
space. I further validate this conclusion in a separate dataset of promoter sequences
in Saccharomyces cerevisiae.
Mechanistic models hold promise as a powerful quantitative tool for teasing out the
interplay between burden and heterologous protein expression, ultimately allowing
for in silico studies of design spaces to find constructs that maximize productivity while minimizing their impact on endogenous cellular processes. At the same
time, by systematically mapping the relation between data size, diversity and the
choice of machine learning models, I demonstrate the viability of more data-efficient
deep learning models, helping to promote their adoption as a platform technology
in microbial engineering. Overall, this thesis offers a comprehensive framework for
tackling the intricate connectivity and complexity of biology and facilitates the design
of biosystems with desired features.