Machine learning in drug discovery: advancing protein-ligand binding affinity predictions
dc.contributor.advisor
Mey, Antonia
dc.contributor.advisor
Weisse, Andrea
dc.contributor.author
Gorantla, Rohan
dc.date.accessioned
2025-07-14T14:16:13Z
dc.date.available
2025-07-14T14:16:13Z
dc.date.issued
2025-07-14
dc.description.abstract
Binding affinity quantifies the strength of the interaction between a protein and
a small drug-like molecule. Accurately determining binding affinity helps identify
promising drug candidates in the early stages of drug discovery, particularly in the hit discovery
and lead optimization phases, where screening millions or even billions
of compounds is required. Hit discovery involves identifying potential compounds
(known as ‘hits’) that show initial activity against the chosen disease-causing protein
target. Lead optimization focuses on refining these hits to improve their binding
affinity and other drug-like properties. Experimental assays are the gold standard
for determining binding affinity, but they are not practical for rapidly screening
millions of drug-like compounds against potential targets. Accurate in silico prediction
of protein-ligand binding affinity can significantly expedite drug discovery by
streamlining the identification and optimization of viable drug candidates, reducing
experimental cost and time.
Over the last fifty years, a wide range of in silico binding affinity prediction
strategies have been developed, including both structure-based and ligand-based
approaches. However, these methodologies often fall short in large-scale
screenings. So-called docking methods, while capable of high-throughput screening,
often lack the accuracy desired for binding affinity prediction. In contrast,
alchemical free energy (AFE) techniques, which are simulation-based, offer
improved accuracy but are computationally demanding. The rapid progression of
machine learning, coupled with increased accessibility to binding affinity data, opens
avenues to deep learning-based methods for improving the accuracy and speed of
binding affinity predictions.
This thesis focuses on exploring and developing machine learning methods for
predicting protein-ligand binding affinity. The first part of the thesis investigates
how current deep learning models learn from input protein and ligand data to predict
binding affinity. Systematic experiments using publicly available kinase datasets are
conducted to assess the impact of protein and ligand encodings derived
from convolutional and/or graph neural networks by varying the protein
and ligand inputs. The results indicate that protein encodings have minimal impact
on binding affinity predictions, while ligand-based features play a more substantial role in
model performance.
The second part of the thesis focuses on addressing key challenges at the model,
data, and evaluation levels of the deep learning framework for predicting binding
affinity. To overcome challenges at the model level, this work introduces BALM, a deep
learning framework that predicts binding affinity using pretrained protein and ligand
language models. BALM predicts binding affinity by optimizing the cosine similarity
between protein and ligand encodings. At the
data and evaluation levels, the research demonstrates novel strategies for training
and testing these models to ensure they provide meaningful and reliable predictions
compared to traditional methods and experimental measurements. While zero-shot
prediction on unseen targets may not always be reliable, few-shot fine-tuning of
the BALM model is shown to be reliable for screening new targets, with
better performance than docking.
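As a rough illustration of the cosine-similarity idea described above, the following is a minimal sketch and not the BALM implementation. The random arrays stand in for pretrained language-model embeddings, the linear projection matrices are hypothetical (in practice they would be trained), and the affinity range used for rescaling is an assumed example value.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (n, d) arrays."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_unit * b_unit, axis=1)


def to_affinity(score: np.ndarray, low: float, high: float) -> np.ndarray:
    """Linearly map a cosine score in [-1, 1] to an affinity range [low, high]."""
    return low + (score + 1.0) / 2.0 * (high - low)


rng = np.random.default_rng(0)
n_pairs, d_protein, d_ligand, d_shared = 4, 32, 16, 8

# Hypothetical stand-ins for pretrained protein and ligand embeddings.
protein_emb = rng.normal(size=(n_pairs, d_protein))
ligand_emb = rng.normal(size=(n_pairs, d_ligand))

# Hypothetical projection heads into a shared space (untrained here).
w_protein = rng.normal(size=(d_protein, d_shared))
w_ligand = rng.normal(size=(d_ligand, d_shared))

# One similarity score per protein-ligand pair.
scores = cosine_similarity(protein_emb @ w_protein, ligand_emb @ w_ligand)

# Assumed affinity range (for example, a pKd-like scale from 4 to 10).
predicted = to_affinity(scores, low=4.0, high=10.0)
```

In a trained model, the projection weights would be fitted so that the rescaled similarity matches measured affinities; here they are random, so the numbers carry no meaning beyond showing the shapes and value ranges involved.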
The final part of this thesis focuses on integrating machine learning with physics-based
simulation methods, such as alchemical free energy calculations. This integration
aims to reduce computational costs and time during lead optimization.
Specifically, Active Learning (AL) is used to intelligently select compounds for AFE
calculations, making the identification of top binders more efficient. AL is an
iterative process that learns binding affinities from an unlabelled dataset and helps
prioritize compounds for detailed evaluation, minimizing the need to compute AFE
for all compounds in a large pool. In this approach, machine learning models such
as Gaussian process regression and pretrained graph neural network-based models
act as surrogate models during each AL iteration. They provide predictions of
binding affinity to inform the next set of AFE calculations, allowing for efficient
compound selection. Recommendations are provided on model choice, batch size, and
strategies for exploring or exploiting chemical space, depending on the ligand pool
used. Both models show similar recall in identifying top binders on
large datasets, but the Gaussian process model performs better than the pretrained
graph network model when the training data is limited. Using a larger initial batch
size, especially with diverse datasets, improves recall for both models and enhances
overall correlation metrics. However, smaller batch sizes are more effective in later
iterations.
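The active learning loop described above can be sketched as follows. This is an illustrative example under stated assumptions, not the protocol benchmarked in the thesis: the Gaussian process is a bare-bones NumPy version with a fixed RBF kernel, the ligand features and "true" binding free energies are synthetic, the `oracle` callable is a hypothetical stand-in for an AFE calculation, and the acquisition rule is purely greedy (exploitation only).

```python
import numpy as np


def rbf_kernel(x1, x2, length_scale=2.0):
    """Squared-exponential kernel between the rows of x1 and x2."""
    sq = (np.sum(x1**2, axis=1)[:, None]
          + np.sum(x2**2, axis=1)[None, :]
          - 2.0 * x1 @ x2.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gp_predict(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean and standard deviation of a simple GP regressor."""
    y_mean = y_train.mean()
    k = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf_kernel(x_test, x_train)
    mean = y_mean + k_star @ np.linalg.solve(k, y_train - y_mean)
    var = 1.0 - np.sum(k_star * np.linalg.solve(k, k_star.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))


def active_learning(features, oracle, n_init, batch_size, n_iter, rng):
    """Greedy AL loop: label the compounds predicted to bind most strongly."""
    n = len(features)
    labelled = [int(i) for i in rng.choice(n, size=n_init, replace=False)]
    labels = {i: oracle(i) for i in labelled}
    for _ in range(n_iter):
        pool = np.array([i for i in range(n) if i not in labels])
        if len(pool) == 0:
            break
        y_train = np.array([labels[i] for i in labelled])
        mean, _ = gp_predict(features[labelled], y_train, features[pool])
        # Lower predicted free energy means a stronger predicted binder.
        for i in pool[np.argsort(mean)[:batch_size]]:
            labels[int(i)] = oracle(int(i))  # stand-in for one AFE calculation
            labelled.append(int(i))
    return labelled


rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))            # synthetic ligand descriptors
true_dg = -8.0 + features @ rng.normal(size=5)  # synthetic free energies

selected = active_learning(features, lambda i: true_dg[i],
                           n_init=20, batch_size=10, n_iter=5, rng=rng)

# Recall of the 10 strongest binders among the compounds that were labelled.
top_k = set(np.argsort(true_dg)[:10].tolist())
recall = len(top_k & set(selected)) / len(top_k)
```

An exploratory variant could instead rank pool compounds by the returned standard deviation, and the batch size could be varied between iterations; both are left out to keep the sketch short.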
en
dc.identifier.uri
https://hdl.handle.net/1842/43674
dc.identifier.uri
http://dx.doi.org/10.7488/era/6206
dc.language.iso
en
en
dc.publisher
The University of Edinburgh
en
dc.relation.hasversion
From Proteins to Ligands: Decoding Deep Learning Methods for Binding Affinity Prediction Gorantla, R., Kubincova, A., Weiße, A. Y., & Mey, A. S. J. Chem. Inf. Model. 2024, 64, 7, 2496–2507
en
dc.relation.hasversion
Benchmarking active learning protocols for ligand binding affinity prediction Gorantla, R., Kubincova, A., Suutari, B., Cossins, B. P., & Mey, A. S. J. Chem. Inf. Model. 2024, 64, 6, 1955–1965
en
dc.relation.hasversion
R. Gorantla, R. K. Singh, R. Pandey and M. Jain, 2019 IEEE BIBE, 2019, pp. 397–404.
en
dc.relation.hasversion
R. K. Singh and R. Gorantla, PLoS One, 2020, 15, e0220677
en
dc.relation.hasversion
R. K. Singh, R. Gorantla, S. G. R. Allada and P. Narra, PLoS One, 2022, 17, e0276836.
en
dc.relation.hasversion
R. Gorantla, A. P. Gema, I. X. Yang, Á. Serrano-Morrás, B. Suutari, J. J. Jiménez and A. S. J. S. Mey, bioRxiv, 2024
en
dc.subject
protein binding
en
dc.subject
binding affinity
en
dc.subject
machine learning
en
dc.subject
large language models
en
dc.subject
Active Learning
en
dc.subject
reliability
en
dc.subject
cost-effective
en
dc.title
Machine learning in drug discovery: advancing protein-ligand binding affinity predictions
en
dc.type
Thesis or Dissertation
en
dc.type.qualificationlevel
Doctoral
en
dc.type.qualificationname
PhD Doctor of Philosophy
en
Files
Original bundle
- Name:
- Gorantla2025.pdf
- Size:
- 141.09 MB
- Format:
- Adobe Portable Document Format

