Human genome interaction: models for designing DNA sequences
Scher, Emily Alice
Since the turn of the century, the scope and scale of Synthetic Biology projects have grown dramatically. Instead of limiting themselves to simple genetic circuits, researchers now aim for genome-scale organism redesigns, revolutionary gene therapies, and high-throughput, industrial-scale natural product syntheses. However, the engineering principles adopted by the founders of the field have been applied to Biology in a way that does not fit many modern experiments, which has limited the usefulness of common sequence design paradigms. As experiments have become more complex, the sequence design process has taken up more and more intellectual bandwidth, partly because software tools for DNA design have remained largely unchanged. This thesis will explore software engineering, social science, and machine learning projects that aim to improve the ways in which researchers design novel DNA sequences for Synthetic Biology experiments. Popular DNA design tools will be reviewed, alongside an analysis of the key conceptual metaphors that underlie their workflows. Flaws in the ubiquitous parts-based design model will be demonstrated, and several alternatives will be explored. A tool called Part Crafter (partcrafter.com) will be presented, which aggregates sequence and annotation data from a variety of sources to allow rational search over genomic features, as well as the automated production of biological parts for Synthetic Biology experiments. Part Crafter's mode of part creation is, however, more flexible than traditional implementations of parts-based design in the field: parts are abstracted away from specific manufacturing standards, and as much contextual information as possible is presented alongside parts of interest. Additionally, several types of machine learning model will be presented that predict histone modification occupancy in novel sequences. Current Synthetic Biology design paradigms largely ignore the epigenetic context of designed sequences.
A gradient of increasingly complex models will be analysed in order to characterise how complex the combinatorial sequence patterns of these epigenetic proteins are. This work was exploratory, serving as a proof of concept for using a variety of increasingly complex models to represent genomic elements, and demonstrating that the parts-based design model is not the only option available to us. The aims of the field of Synthetic Biology become more ambitious every year. To accomplish these aims, we must better understand the sequences we are designing. The projects presented in this thesis were all completed with the aim of helping Synthetic Biologists to design sequences deliberately. By taking into account as much contextual information as possible, including epigenetic factors, researchers will be able to design sequences more quickly and reliably, increasing their chances of achieving the moonshot goals of the field.