Edinburgh Research Archive

Machine learning in drug discovery: advancing protein-ligand binding affinity predictions

dc.contributor.advisor
Mey, Antonia
dc.contributor.advisor
Weisse, Andrea
dc.contributor.author
Gorantla, Rohan
dc.date.accessioned
2025-07-14T14:16:13Z
dc.date.available
2025-07-14T14:16:13Z
dc.date.issued
2025-07-14
dc.description.abstract
Binding affinity quantifies the strength of the interaction between a protein and a small drug-like molecule. Accurately determining binding affinity helps identify promising drug candidates in the early stages of drug discovery, particularly in the hit discovery and lead optimization phases, where millions to billions of compounds may need to be screened. Hit discovery involves identifying compounds (known as ‘hits’) that show initial activity against the chosen disease-causing protein target. Lead optimization focuses on refining these hits to improve their binding affinity and other drug-like properties. Experimental assays are the gold standard for determining binding affinity, but they are not practical for rapidly screening millions of drug-like compounds against potential targets. Accurate in silico prediction of protein-ligand binding affinity can significantly expedite drug discovery by streamlining the identification and optimization of viable drug candidates, reducing experimental cost and time. Over the last fifty years, a wide range of in silico binding affinity prediction strategies have been developed, comprising both structure-based and ligand-based approaches. However, these methodologies often fall short in large-scale screening. So-called docking methods, while capable of high-throughput screening, often lack the accuracy desired for binding affinity prediction. In contrast, alchemical free energy (AFE) calculations, a simulation-based technique, offer improved accuracy but are computationally demanding. The rapid progress of machine learning, coupled with increased access to binding affinity data, opens avenues for deep learning-based methods that improve the accuracy and speed of binding affinity predictions. This thesis focuses on exploring and developing machine learning methods for predicting protein-ligand binding affinity.
The first part of the thesis investigates how current deep learning models learn from input protein and ligand data to predict binding affinity. Systematic experiments on publicly available kinase datasets assess the impact of protein and ligand encodings derived from convolutional and/or graph neural networks by varying the protein and ligand inputs. The results indicate that protein encodings have minimal impact on binding affinity predictions, while ligand-based features play a more substantial role in model performance.
The second part of the thesis addresses key challenges at the model, data, and evaluation levels of the deep learning framework for predicting binding affinity. At the model level, this work introduces BALM, a deep learning framework that uses pretrained protein and ligand language models and predicts binding affinity by optimizing the distance between protein and ligand encodings under a cosine similarity metric. At the data and evaluation levels, the research demonstrates novel strategies for training and testing these models to ensure they provide meaningful and reliable predictions compared with traditional methods and experimental measurements. While zero-shot prediction on unseen targets may not always be reliable, few-shot finetuning of the BALM model is shown to be reliable for screening new targets, outperforming docking.
The final part of this thesis focuses on integrating machine learning with physics-based simulation methods, such as alchemical free energy calculations, with the aim of reducing computational cost and time during lead optimization. Specifically, Active Learning (AL) is used to intelligently select compounds for AFE calculations, making the identification of top binders more efficient. AL is an iterative process that learns binding affinities from an unlabelled dataset and helps prioritize compounds for detailed evaluation, minimizing the need to compute AFE for all compounds in a large pool. In this approach, machine learning models such as Gaussian process regression and pretrained graph neural network-based models act as surrogate models during each AL iteration. They predict binding affinities to inform the next set of AFE calculations, allowing for efficient compound selection. Recommendations are provided on model choice, batch size, and strategies for exploring or exploiting chemical space, depending on the ligand pool used. Both models show similar recall in identifying top binders on large datasets, but the Gaussian process model performs better than the pretrained graph neural network model when training data is limited. Using a larger initial batch size, especially with diverse datasets, improved recall for both models and enhanced overall correlation metrics. However, smaller batch sizes were more effective for later iterations.
en
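The active learning loop described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation. Everything named here is an assumption made for illustration: each compound is reduced to a single hypothetical descriptor value, `oracle` is a toy function standing in for an expensive AFE calculation (lower score meaning a stronger binder), a k-nearest-neighbour average is swapped in for the Gaussian process or graph neural network surrogate, and selection is purely greedy (exploitation only). The larger initial batch followed by smaller batches mirrors the batch-size recommendation in the abstract.

```python
import random


def oracle(x):
    """Hypothetical stand-in for an AFE calculation (lower = stronger binder)."""
    return (x - 0.7) ** 2


def knn_predict(x, labelled, k=3):
    """Toy surrogate: mean score of the k nearest labelled compounds.

    Swapped in for the Gaussian process / GNN surrogates named in the abstract.
    """
    nearest = sorted(labelled, key=lambda item: abs(item[0] - x))[:k]
    return sum(score for _, score in nearest) / len(nearest)


def active_learning(pool, initial_batch=10, batch=5, iterations=4, seed=0):
    """Run a greedy active learning loop over a pool of compound descriptors."""
    rng = random.Random(seed)
    pool = list(pool)
    rng.shuffle(pool)

    # Initial round: a larger, randomly chosen batch is sent to the oracle.
    labelled = [(x, oracle(x)) for x in pool[:initial_batch]]
    unlabelled = pool[initial_batch:]

    for _ in range(iterations):
        # Rank remaining compounds by predicted score (greedy exploitation).
        unlabelled.sort(key=lambda x: knn_predict(x, labelled))
        chosen, unlabelled = unlabelled[:batch], unlabelled[batch:]
        # Only the selected compounds receive the expensive evaluation.
        labelled.extend((x, oracle(x)) for x in chosen)

    return labelled


def recall_top(pool, labelled, top_n=10):
    """Fraction of the true top_n binders that the loop has evaluated."""
    true_top = set(sorted(pool, key=oracle)[:top_n])
    found = {x for x, _ in labelled}
    return len(true_top & found) / top_n


pool = [i / 199 for i in range(200)]
labelled = active_learning(pool)
print(f"evaluated {len(labelled)} of {len(pool)} compounds")
print(f"recall of top 10 binders: {recall_top(pool, labelled):.2f}")
```

In this sketch only 30 of the 200 pool compounds are ever passed to the oracle. An exploratory acquisition strategy would additionally require a surrogate that reports predictive uncertainty, which the k-nearest-neighbour stand-in does not provide.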
dc.identifier.uri
https://hdl.handle.net/1842/43674
dc.identifier.uri
http://dx.doi.org/10.7488/era/6206
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
From Proteins to Ligands: Decoding Deep Learning Methods for Binding Affinity Prediction. Gorantla, R., Kubincova, A., Weiße, A. Y., & Mey, A. S. J. S. J. Chem. Inf. Model. 2024, 64, 7, 2496–2507
en
dc.relation.hasversion
Benchmarking active learning protocols for ligand binding affinity prediction. Gorantla, R., Kubincova, A., Suutari, B., Cossins, B. P., & Mey, A. S. J. S. J. Chem. Inf. Model. 2024, 64, 6, 1955–1965
en
dc.relation.hasversion
R. Gorantla, R. K. Singh, R. Pandey and M. Jain, 2019 IEEE BIBE, 2019, pp. 397–404.
en
dc.relation.hasversion
R. K. Singh and R. Gorantla, PLoS One, 2020, 15, e0220677
en
dc.relation.hasversion
R. K. Singh, R. Gorantla, S. G. R. Allada and P. Narra, PLoS One, 2022, 17, e0276836.
en
dc.relation.hasversion
R. Gorantla, A. P. Gema, I. X. Yang, Á. Serrano-Morrás, B. Suutari, J. J. Jiménez and A. S. J. S. Mey, bioRxiv, 2024
en
dc.subject
protein binding
en
dc.subject
binding affinity
en
dc.subject
machine learning
en
dc.subject
large language models
en
dc.subject
Active Learning
en
dc.subject
reliability
en
dc.subject
cost-effective
en
dc.title
Machine learning in drug discovery: advancing protein-ligand binding affinity predictions
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en

Files

Original bundle

Name:
Gorantla2025.pdf
Size:
141.09 MB
Format:
Adobe Portable Document Format