Zahra Shamsi, Matthew Chan and Diwakar Shukla
In review, 2020.

A recurring challenge in bioinformatics is predicting the phenotypic consequences of amino acid variation in proteins. Recent advances in sequencing techniques have made sufficient genomic data available to train models that predict the evolutionary statistical energy of each sequence, but experimental data remain inadequate to directly predict functional effects. One approach to overcoming this data scarcity is transfer learning, which trains additional models from the datasets that are available. In this study, we propose a set of transfer learning algorithms, termed TLmutation, comprising a supervised transfer learning algorithm that transfers knowledge from survival data to a protein function of interest in the same protein, followed by an unsupervised transfer learning algorithm that extends this knowledge to a homologous protein. We explore three applications of these algorithms. First, we test the supervised transfer on dozens of previously published mutagenesis datasets to complete and refine missing data points. We further examine these datasets to identify which variants yield better predictors of variant function. Second, we apply the algorithm to predict higher-order mutations solely from single-point mutagenesis data. Finally, we perform unsupervised transfer learning to predict the mutational effects of homologous proteins from experimental datasets. We show the benefit of these transfer learning algorithms for exploiting informative deep mutational data and providing new insights into protein variant function. Because the algorithms are generalized to transfer knowledge between Markov random field models, we expect them to be applicable in other disciplines.
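The supervised transfer idea described above can be illustrated with a minimal sketch. This is not the paper's actual TLmutation implementation; it is a toy example, assuming the evolutionary model supplies one statistical-energy score per mutant, in which a simple linear recalibration is fitted on the sparse set of experimentally measured variants and then transferred to the unmeasured ones. All variable names and the linear form of the mapping are illustrative assumptions.

```python
import numpy as np

# Toy illustration of supervised transfer (not the authors' exact method):
# recalibrate MRF-derived statistical energies against sparse experimental
# fitness measurements, then predict fitness for unmeasured mutants.
rng = np.random.default_rng(0)

# Assumed statistical energies for 100 single mutants from an MRF-type model.
stat_energy = rng.normal(size=100)

# Experimental fitness is only available for 30 mutants (data scarcity).
measured = rng.choice(100, size=30, replace=False)
true_fitness = 2.0 * stat_energy + 0.5          # toy ground-truth relationship
fitness_obs = true_fitness[measured] + rng.normal(scale=0.1, size=30)

# Fit a linear map (slope, intercept) on the measured subset only.
A = np.column_stack([stat_energy[measured], np.ones(len(measured))])
coef, *_ = np.linalg.lstsq(A, fitness_obs, rcond=None)

# Transfer: predict fitness for all mutants, including unmeasured ones.
predicted = coef[0] * stat_energy + coef[1]
print(coef)
```

In this sketch the recovered slope and intercept approximate the toy ground truth, so the fitted mapping extends the scarce experimental signal to every scored variant; the actual method generalizes this idea to the parameters of Markov random field models rather than a single scalar score.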