Edinburgh Research Archive

Towards efficient and accessible protein design with machine learning

dc.contributor.advisor
Subr, Kartic
dc.contributor.advisor
Wood, Christopher
dc.contributor.author
Castorina, Leonardo V.
dc.date.accessioned
2025-07-14T12:14:42Z
dc.date.available
2025-07-14T12:14:42Z
dc.date.issued
2025-07-14
dc.description.abstract
Proteins are the architects of life on Earth, driving virtually all biochemical processes, from cell signalling and metabolism to immune defence and industrial catalysis. This remarkable functional diversity arises from just twenty building blocks called amino acids arranged in various sequences and lengths. The combinatorial space of possible protein sequences vastly exceeds the number of atoms in the universe, yet nature has explored only a fraction of it. Protein design aims to navigate this immense design landscape to create proteins with new functions and structures. The emergence of deep learning has significantly improved the speed and accuracy of protein design tools for de novo protein binders, enzymes, and neutralising antibodies. However, the rapid proliferation of new models, each trained on different datasets, presents challenges in evaluating performance and selecting the most suitable tool for a given design task. To address this, we developed PDBench, a fold-balanced benchmark and software toolkit that systematically evaluates model performance across fold types and identifies prediction biases. This benchmark guided the development of TIMED (Three-dimensional Inference Method for Efficient Design), a Convolutional Neural Network model for protein sequence design using voxel-based representations. Special flavours of TIMED account for biochemical constraints such as charge and polarity, allowing for more precise and tunable designs. To improve accessibility, we released TIMED-Design, a user-friendly web interface and command-line tool, democratising access to state-of-the-art protein sequence design models. Beyond traditional sequence-based and atomic representations, we introduced a fragment-based representation that abstracts proteins into evolutionarily conserved functional fragments. This approach significantly reduces computational costs while preserving structural and functional signatures.
We demonstrated that fragment-based representations improve clustering performance, accelerate functional searches, and serve as effective blueprints for guiding generative protein design. By making these tools freely available, we aim to empower anyone, not only researchers, to design the proteins of the future.
en
dc.identifier.uri
https://hdl.handle.net/1842/43673
dc.identifier.uri
http://dx.doi.org/10.7488/era/6205
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
Castorina, L. V., Petrenas, R., Subr, K., and Wood, C. W. (2023). PDBench: Evaluating Computational Methods for Protein-Sequence Design. In Bioinformatics.
en
dc.relation.hasversion
Castorina, L. V., Ünal, S. M., Subr, K., and Wood, C. W. (2024). TIMED-Design: Flexible and Accessible Protein Sequence Design with Convolutional Neural Networks. In Protein Engineering, Design and Selection.
en
dc.relation.hasversion
Castorina, L. V., Wood, C. W., and Subr, K. (2025). From Atoms to Fragments: A Coarse Representation for Efficient and Functional Protein Design. In bioRxiv.
en
dc.relation.hasversion
Li, B. M., Castorina, L. V., Valdés Hernández, M. del C., Clancy, U., Wiseman, S. J., Sakka, E., Storkey, A. J., Jaime Garcia, D., Cheng, Y., Doubal, F., Thrippleton, M. T., Stringer, M., Wardlaw, J. M. (2022). Deep Attention Super-Resolution of Brain Magnetic Resonance Images Acquired under Clinical Protocols. In Frontiers in Computational Neuroscience.
en
dc.relation.hasversion
Grazioli, F., Machart, P., Mösch, A., Li, K., Castorina, L. V., Pfeifer, N., Min, M. R. (2022). Attentive Variational Information Bottleneck for TCR–peptide Interaction Prediction. In Bioinformatics.
en
dc.relation.hasversion
Castorina, L. V., Grazioli, F., Machart, P., Mösch, A., Errica, F. (2025). Assessing the Generalization Capabilities of TCR Binding Predictors via Peptide Distance Analysis. In PLOS One.
en
dc.relation.hasversion
Cotet, T.-S., Krawczuk, I., Stocco, F., Ferruz, N., Gitter, A., Kurumida, Y., de Almeida Machado, L., Paesani, F., Calia, C. N., Challacombe, C. A., Haas, N., Qamar, A., Correia, B. E., Pacesa, M., Nickel, L., Subr, K., Castorina, L. V., Campbell, M. J., Ferragu, C., Kidger, P., Hallee, L., Wood, C. W., Stam, M. J., Kluonis, T., Unal, S. M., Belot, E., Naka, A., Bustos, A., Torrubia, A., Chu, H., Adaptyv Competition Organizers (2025). Crowdsourced Protein Design: Lessons From the Adaptyv EGFR Binder Competition. In bioRxiv.
en
dc.subject
protein design
en
dc.subject
AI
en
dc.subject
AI protein design models
en
dc.subject
PDBench
en
dc.subject
performance evaluation
en
dc.subject
3D representations of proteins
en
dc.subject
prediction biases
en
dc.title
Towards efficient and accessible protein design with machine learning
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
Castorina2025.pdf
Size:
26.14 MB
Format:
Adobe Portable Document Format