Adapting learning and search algorithms to handle protein structural data with the goal of aiding drug discovery
Abstract
Experimental methods for protein structure determination (e.g., X-ray crystallography, NMR, cryo-EM) require access to expensive equipment and scale poorly. Computational methods enable protein structure prediction and analysis on a far larger scale. Recent deep learning advances, most notably the 2021 release of DeepMind’s AlphaFold2.0, have provided a wealth of structural data for further analysis and opened new opportunities for algorithmic development. In my work, I address three tasks that make use of the available protein structure data: (1) system-specific binding-affinity prediction (in the context of the immune-related peptide-HLA system); (2) generation of representative ensembles from generic protein structure datasets; and (3) protein-ligand ensemble docking. To this end, I examine and adapt a range of algorithms, including random forest regression models, unsupervised learning methods, and stochastic global optimization techniques. I validate the resulting pipelines on available experimental data and apply them to several macromolecular contexts: the immune-related formation of the peptide-HLA complex, the flexibility of the signal-transducing lipid kinase PI3K, the protein kinase CDK2, and estrogen receptor α. The resulting pipelines are open source and freely available, and can help guide the search for novel therapeutics.