Towards efficient and accessible protein design with machine learning
Authors
Castorina, Leonardo V.
Abstract
Proteins are the architects of life on Earth, driving virtually all biochemical processes,
from cell signalling and metabolism to immune defence and industrial catalysis. This
remarkable functional diversity arises from just twenty building blocks called amino
acids arranged in various sequences and lengths. The combinatorial space of possible
protein sequences vastly exceeds the number of atoms in the universe, yet nature has
explored only a vanishingly small fraction of it.
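The scale of that claim can be checked with a few lines of arithmetic. The sketch below assumes the commonly quoted order-of-magnitude estimate of about 10^80 atoms in the observable universe; that figure and the function names are illustrative assumptions, not taken from the thesis.

```python
import math

NUM_AMINO_ACIDS = 20
# Assumption: ~10^80 atoms in the observable universe (order-of-magnitude estimate).
LOG10_ATOMS_IN_UNIVERSE = 80


def log10_sequence_count(length: int) -> float:
    """Return log10 of the number of distinct sequences of the given length."""
    return length * math.log10(NUM_AMINO_ACIDS)


def min_length_exceeding_atoms() -> int:
    """Return the shortest chain length whose sequence count exceeds the atom estimate."""
    return math.floor(LOG10_ATOMS_IN_UNIVERSE / math.log10(NUM_AMINO_ACIDS)) + 1


if __name__ == "__main__":
    print(f"Sequences of length 100: about 10^{log10_sequence_count(100):.1f}")
    print(f"Shortest length exceeding 10^80: {min_length_exceeding_atoms()} residues")
```

Under this assumption, chains of only a few dozen residues already outnumber the atom estimate, and many natural proteins are several times longer.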
Protein design aims to navigate this immense design landscape to create proteins
with new functions and structures. The emergence of deep learning has significantly
improved the speed and accuracy of protein design tools for de novo protein binders,
enzymes, and neutralising antibodies.
However, the rapid proliferation of new models, each trained on different datasets,
presents challenges in evaluating performance and selecting the most suitable tool for a
given design task. To address this, we developed PDBench, a fold-balanced benchmark
and software toolkit that systematically evaluates model performance across fold types
and identifies prediction biases.
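One way to picture fold-balanced evaluation is to score sequence recovery separately for each fold class and then average the per-class scores, so that an over-represented fold cannot dominate the result. The following is a minimal sketch under that assumption; it is not PDBench's actual API, and the function names, fold labels, and toy sequences are hypothetical.

```python
from collections import defaultdict


def per_fold_recovery(records):
    """Compute sequence recovery per fold class.

    records: iterable of (fold_class, true_sequence, predicted_sequence),
    where the two sequences are aligned and of equal length.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for fold, true_seq, pred_seq in records:
        if len(true_seq) != len(pred_seq):
            raise ValueError("sequences must be aligned and of equal length")
        total[fold] += len(true_seq)
        correct[fold] += sum(t == p for t, p in zip(true_seq, pred_seq))
    return {fold: correct[fold] / total[fold] for fold in total}


def macro_recovery(records):
    """Average the per-fold recoveries so each fold class counts equally."""
    per_fold = per_fold_recovery(records)
    return sum(per_fold.values()) / len(per_fold)


# Hypothetical toy data: two alpha-class chains and one beta-class chain.
example = [
    ("mainly_alpha", "AAAA", "AAAA"),
    ("mainly_alpha", "LLLL", "LLLL"),
    ("mainly_beta", "VVVV", "AAVV"),
]

if __name__ == "__main__":
    print(per_fold_recovery(example))
    print(macro_recovery(example))
```

In this toy data the pooled recovery over all residues would be 10/12, while the per-fold average is lower because the weaker beta-class result carries equal weight. A gap of this kind is one way a fold-related bias can show up.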
This benchmark guided the development of TIMED (Three-dimensional Inference
Method for Efficient Design), a Convolutional Neural Network model for protein
sequence design using voxel-based representations. Special flavours of TIMED account
for biochemical constraints such as charge and polarity, allowing for more precise and
tunable designs. To improve accessibility, we released TIMED-Design, a user-friendly
web interface and command-line tool, democratising access to state-of-the-art protein
sequence design models.
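To illustrate what a voxel-based representation can look like, the sketch below bins atoms around a residue into a cubic grid with one channel per element. It is not TIMED's implementation: the frame size, voxel edge length, channel set, and function name are all hypothetical choices made for illustration.

```python
def voxelise(atoms, centre, edge_voxels=21, voxel_size=1.0,
             channels=("C", "N", "O")):
    """Map atoms to a sparse occupancy grid centred on one residue.

    atoms: iterable of (element, x, y, z) with coordinates in angstroms.
    centre: (x, y, z) of the frame centre.
    Returns {(channel_index, i, j, k): 1} for every occupied voxel.
    Assumption: frame size, voxel size and channels are illustrative defaults.
    """
    half = edge_voxels * voxel_size / 2.0
    grid = {}
    for element, x, y, z in atoms:
        if element not in channels:
            continue  # elements without a channel are ignored in this sketch
        # Shift into frame coordinates and convert to integer voxel indices.
        idx = [int((coord - c + half) // voxel_size)
               for coord, c in zip((x, y, z), centre)]
        if all(0 <= i < edge_voxels for i in idx):
            grid[(channels.index(element), *idx)] = 1
    return grid


# Hypothetical atoms around a frame centred at the origin.
example_atoms = [
    ("C", 0.0, 0.0, 0.0),    # lands in the central voxel
    ("N", 1.2, 0.0, -0.6),
    ("O", 50.0, 0.0, 0.0),   # outside the frame, dropped
    ("S", 0.0, 0.0, 0.0),    # no channel for this element, dropped
]

if __name__ == "__main__":
    print(voxelise(example_atoms, centre=(0.0, 0.0, 0.0)))
```

A dense array built from such a grid is the kind of input a three-dimensional convolutional network can consume. The sparse dictionary is used here only to keep the example free of dependencies.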
Beyond traditional sequence-based and atomic representations, we introduced a
fragment-based representation that abstracts proteins into evolutionarily conserved
functional fragments. This approach significantly reduces computational costs while
preserving structural and functional signatures. We demonstrated that fragment-based
representations improve clustering performance, accelerate functional searches, and
serve as effective blueprints for guiding generative protein design.
By making these tools freely available, we aim to empower anyone, not only researchers,
to design the proteins of the future.