Protein Function Prediction with Pretrained Transformers: Performance, Pitfalls, and Practical Guidance
By: Kushal Raj Roy
DOI: https://doi.org/10.2478/ebtj-2026-0005 | Journal eISSN: 2564-615X
Language: English
Page range: 35–45
Published on: Apr 30, 2026
Published by: European Biotechnology Thematic Network Association
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year
© 2026 Kushal Raj Roy, published by European Biotechnology Thematic Network Association
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.