Edinburgh Research Archive

Towards efficient and accessible protein design with machine learning

Authors

Castorina, Leonardo V.

Abstract

Proteins are the architects of life on Earth, driving virtually all biochemical processes, from cell signalling and metabolism to immune defence and industrial catalysis. This remarkable functional diversity arises from just twenty building blocks called amino acids, arranged in sequences of varying order and length. The combinatorial space of possible protein sequences vastly exceeds the number of atoms in the universe, yet nature has explored only a fraction of it. Protein design aims to navigate this immense landscape to create proteins with new functions and structures. The emergence of deep learning has significantly improved the speed and accuracy of protein design tools for de novo protein binders, enzymes, and neutralising antibodies. However, the rapid proliferation of new models, each trained on different datasets, makes it difficult to evaluate performance and select the most suitable tool for a given design task. To address this, we developed PDBench, a fold-balanced benchmark and software toolkit that systematically evaluates model performance across fold types and identifies prediction biases. This benchmark guided the development of TIMED (Three-dimensional Inference Method for Efficient Design), a convolutional neural network model for protein sequence design using voxel-based representations. Specialised variants of TIMED account for biochemical constraints such as charge and polarity, allowing for more precise and tunable designs. To improve accessibility, we released TIMED-Design, a user-friendly web interface and command-line tool, democratising access to state-of-the-art protein sequence design models. Beyond traditional sequence-based and atomic representations, we introduced a fragment-based representation that abstracts proteins into evolutionarily conserved functional fragments. This approach significantly reduces computational costs while preserving structural and functional signatures. We demonstrated that fragment-based representations improve clustering performance, accelerate functional searches, and serve as effective blueprints for guiding generative protein design. By making these tools freely available, we aim to empower anyone, not only researchers, to design the proteins of the future.
