Use of AI for the development of two new early drug discovery techniques: deep and transfer learning for LogP prediction and dimensionality reduction for sequence-based virtual screening
Prediction of the physicochemical properties of small molecules and of their biological targets is extremely valuable in the effort to reduce costs and attrition rates in drug discovery. In silico techniques are now routinely used to guide medicinal chemistry efforts, to prioritise compounds for synthesis and to reject undesirable compounds in the early stages of drug discovery. The ‘big data’ revolution is impacting all fields of science, with modern artificial intelligence and machine learning techniques becoming easily accessible thanks to advances in computing power and toolkits. The application of these techniques is poised to fundamentally change the way biology, medicine and drug discovery are performed. In this thesis, machine learning and cheminformatics methods are applied to the prediction of small molecule physicochemical properties and of compound activity across a range of kinase targets. A neural network-based logP predictor, MRlogP, has been created and shown to outperform other freely available logP predictors for druglike small molecules. The predictor was created using a novel approach in which the network was first trained on a large amount of predicted data and then further improved with a small dataset of highly accurate experimental measurements. This work has not only produced a freely available, performant tool for logP prediction on druglike molecules, but also demonstrated techniques for tackling the lack of high-quality data commonly encountered in drug discovery. This thesis has also contributed to the establishment of a new branch of virtual screening through the creation of a sequence-based virtual screening platform, which applies and improves upon the Drug Discovery Maps (DDM) technique to suggest active compounds for unexplored and orphan kinase targets without solved structures or known ligands.
Potentially active compounds are prioritised using only the primary sequence of the target. The identification of potentially privileged scaffolds for human and Plasmodium falciparum (P.f.) targets demonstrates that this sequence-based virtual screening approach provides a fast and efficient route for applying machine learning methods to the identification of active compounds. The improvements implemented over the original DDM technique are shown to produce significantly better results than literature methods, offering a route to improved inhibitor predictions for orphan and unexplored targets in the future. DDM has been validated as a virtual screening platform for novel hit discovery by activity in a primary imaging assay.