Machine learning in drug discovery: advancing protein-ligand binding affinity predictions

Gorantla, Rohan

dc.contributor.advisor

Mey, Antonia

dc.contributor.advisor

Weisse, Andrea

dc.contributor.author

Gorantla, Rohan

dc.date.accessioned

2025-07-14T14:16:13Z

dc.date.available

2025-07-14T14:16:13Z

dc.date.issued

2025-07-14

dc.description.abstract

Binding affinity quantifies the strength of the interaction between a protein and a small drug-like molecule. Accurately determining binding affinity helps identify promising drug candidates in the early stages of drug discovery, particularly in the hit discovery and lead optimization phases, where millions to billions of compounds may need to be screened. Hit discovery involves identifying compounds (known as ‘hits’) that show initial activity against the chosen disease-causing protein target. Lead optimization focuses on refining these hits to improve their binding affinity and other drug-like properties. Experimental assays are the gold standard for determining binding affinity, but they are not practical for rapidly screening millions of drug-like compounds against potential targets. Accurate in silico prediction of protein-ligand binding affinity can significantly expedite drug discovery by streamlining the identification and optimization of viable drug candidates, reducing experimental cost and time. Over the last fifty years, a wide range of in silico binding affinity prediction strategies have been developed, comprising both structure-based and ligand-based approaches. However, these methodologies often fall short in large-scale screening. So-called docking methods, while capable of high-throughput screening, often lack the accuracy desired for binding affinity prediction. In contrast, alchemical free energy (AFE) calculations, a simulation-based technique, offer improved accuracy but are computationally demanding. The rapid progress of machine learning, coupled with increased accessibility of binding affinity data, opens avenues for deep learning-based methods that improve the accuracy and speed of binding affinity predictions. This thesis focuses on exploring and developing machine learning methods for predicting protein-ligand binding affinity.
The first part of the thesis investigates how current deep learning models learn from input protein and ligand data to predict binding affinity. Systematic experiments on publicly available kinase datasets assess the impact of protein and ligand encodings derived from convolutional and/or graph neural networks by varying the protein and ligand data supplied as input. The results indicate that protein encodings have minimal impact on binding predictions, while ligand-based features play a more substantial role in model performance.
The second part of the thesis addresses key challenges at the model, data, and evaluation levels of the deep learning framework for predicting binding affinity. At the model level, this work introduces BALM, a deep learning framework that predicts binding affinity using pretrained protein and ligand language models by optimizing the distance between protein and ligand encodings with a cosine similarity metric. At the data and evaluation levels, the research demonstrates novel strategies for training and testing these models to ensure they provide meaningful and reliable predictions compared to traditional methods and experimental measurements. While zero-shot prediction on unseen targets may not always be reliable, few-shot fine-tuning of the BALM model is shown to be reliable for screening new targets, demonstrating better performance than docking.
The final part of this thesis integrates machine learning with physics-based simulation methods, such as alchemical free energy calculations, with the aim of reducing computational cost and time during lead optimization. Specifically, Active Learning (AL) is used to select compounds for AFE calculations, making the identification of top binders more efficient. AL is an iterative process that learns binding affinities starting from an unlabelled dataset and helps prioritize compounds for detailed evaluation, minimizing the need to compute AFE for all compounds in a large pool. In this approach, machine learning models such as Gaussian process regression and pretrained graph neural network-based models act as surrogate models during each AL iteration. They provide predictions of binding affinity to inform the next set of AFE calculations, allowing for efficient compound selection. Recommendations are provided on model choice, batch size, and strategies for exploring or exploiting chemical space, depending on the ligand pool used. Both models show similar recall in identifying top binders on large datasets, but the Gaussian process model performs better than the pretrained graph network model when the training data is limited. Using a larger initial batch size, especially with diverse datasets, improved recall for both models and enhanced overall correlation metrics. However, smaller batch sizes were more effective for later iterations.
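
The abstract describes BALM as optimizing the distance between protein and ligand encodings with a cosine similarity metric. The following is a minimal illustrative sketch of that idea only, not the thesis implementation: the function names, the linear rescaling of affinity onto the cosine range [-1, 1], the squared-error loss, and the affinity bounds `lo` and `hi` are all assumptions made for illustration, and the short vectors stand in for encodings that would come from pretrained language models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def scale_affinity(affinity, lo, hi):
    """Linearly map an affinity in [lo, hi] onto [-1, 1] (assumed scaling)."""
    return 2.0 * (affinity - lo) / (hi - lo) - 1.0


def unscale_affinity(similarity, lo, hi):
    """Inverse of scale_affinity: map a cosine value back to affinity units."""
    return (similarity + 1.0) / 2.0 * (hi - lo) + lo


def similarity_loss(protein_vec, ligand_vec, affinity, lo, hi):
    """Squared error between cosine similarity and the rescaled affinity.

    Hypothetical training objective: minimizing it pulls the two encodings
    together for strong binders and pushes them apart for weak ones.
    """
    target = scale_affinity(affinity, lo, hi)
    return (cosine_similarity(protein_vec, ligand_vec) - target) ** 2
```

In this sketch a prediction for a new pair would be `unscale_affinity(cosine_similarity(p, l), lo, hi)`, so the similarity itself serves as the model output once mapped back to affinity units.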
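
The active learning loop described in the abstract (label an initial batch, fit a surrogate, pick the next batch for AFE calculation, repeat) can be sketched as below. This is a toy sketch under stated assumptions, not the protocol benchmarked in the thesis: `oracle` is a hypothetical stand-in for an expensive AFE calculation, each compound is reduced to a single made-up descriptor in [0, 1], and a kernel-weighted average is swapped in for Gaussian process regression to keep the example dependency-free. Selection is purely greedy (exploitation only).

```python
import math
import random


def oracle(x):
    """Hypothetical stand-in for an AFE calculation (lower = tighter binding)."""
    return (x - 0.3) ** 2


def surrogate_predict(x, labelled, length_scale=0.1):
    """Kernel-weighted average of labelled values.

    A simple stand-in for the surrogate model; NOT Gaussian process regression.
    """
    weights = [math.exp(-((x - xi) ** 2) / (2 * length_scale ** 2))
               for xi, _ in labelled]
    total = sum(weights)
    if total == 0.0:  # far from all labelled points: fall back to the mean
        return sum(y for _, y in labelled) / len(labelled)
    return sum(w * y for w, (_, y) in zip(weights, labelled)) / total


def active_learning(pool, n_initial, batch_size, n_iterations, seed=0):
    """Greedy AL loop: label a random initial batch, then repeatedly label
    the compounds the surrogate predicts to bind most tightly."""
    rng = random.Random(seed)
    unlabelled = list(pool)
    rng.shuffle(unlabelled)
    labelled = [(x, oracle(x)) for x in unlabelled[:n_initial]]
    unlabelled = unlabelled[n_initial:]
    for _ in range(n_iterations):
        if not unlabelled:
            break
        # Rank remaining compounds by predicted affinity (best first).
        unlabelled.sort(key=lambda x: surrogate_predict(x, labelled))
        batch, unlabelled = unlabelled[:batch_size], unlabelled[batch_size:]
        labelled.extend((x, oracle(x)) for x in batch)
    return labelled


def recall_of_top_binders(pool, labelled, top_k):
    """Fraction of the true top-k binders that were selected for labelling."""
    true_top = set(sorted(pool, key=oracle)[:top_k])
    found = {x for x, _ in labelled}
    return len(true_top & found) / top_k
```

For example, `active_learning([i / 199 for i in range(200)], n_initial=20, batch_size=10, n_iterations=5)` labels 70 of the 200 toy compounds, and `recall_of_top_binders` then reports how many of the true top binders fall within that labelled set. Varying `n_initial` and `batch_size` in this toy setting mirrors the kind of batch-size comparison the abstract mentions, though it says nothing about the actual results.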

dc.identifier.uri

https://hdl.handle.net/1842/43674

dc.identifier.uri

http://dx.doi.org/10.7488/era/6206

dc.language.iso

en

dc.publisher

The University of Edinburgh

dc.relation.hasversion

From Proteins to Ligands: Decoding Deep Learning Methods for Binding Affinity Prediction Gorantla, R., Kubincova, A., Weiße, A. Y., & Mey, A. S. J. Chem. Inf. Model. 2024, 64, 7, 2496–2507

dc.relation.hasversion

Benchmarking active learning protocols for ligand binding affinity prediction Gorantla, R., Kubincova, A., Suutari, B., Cossins, B. P., & Mey, A. S. J. Chem. Inf. Model. 2024, 64, 6, 1955–1965

dc.relation.hasversion

R. Gorantla, R. K. Singh, R. Pandey and M. Jain, 2019 IEEE BIBE, 2019, pp. 397–404.

dc.relation.hasversion

R. K. Singh and R. Gorantla, PLoS One, 2020, 15, e0220677

dc.relation.hasversion

R. K. Singh, R. Gorantla, S. G. R. Allada and P. Narra, PLoS One, 2022, 17, e0276836.

dc.relation.hasversion

R. Gorantla, A. P. Gema, I. X. Yang, Á. Serrano-Morrás, B. Suutari, J. J. Jiménez and A. S. J. S. Mey, bioRxiv, 2024

dc.subject

protein binding

dc.subject

binding affinity

dc.subject

machine learning

dc.subject

large language models

dc.subject

Active Learning

dc.subject

reliability

dc.subject

cost-effective

dc.title

Machine learning in drug discovery: advancing protein-ligand binding affinity predictions

dc.type

Thesis or Dissertation

dc.type.qualificationlevel

Doctoral

dc.type.qualificationname

PhD Doctor of Philosophy

Files

Original bundle

Name: Gorantla2025.pdf
Size: 141.09 MB
Format: Adobe Portable Document Format

This item appears in the following Collection(s)

Informatics thesis and dissertation collection