Protein Function Prediction with Pretrained Transformers: Performance, Pitfalls, and Practical Guidance
Abstract
Transformer-based protein language models (PLMs) learn meaningful representations from millions of unlabeled sequences, capturing evolutionary patterns and functional relationships. Recent advances include ESM-2’s systematic scaling to 15 billion parameters, structure-aware vocabularies (SaProt), and multimodal foundation models (ESM-3, 98B parameters). PLMs achieve state-of-the-art performance on core tasks: Gene Ontology term prediction (F-max 0.64–0.68), enzyme classification (81% accuracy), and variant effect prediction (Spearman ρ 0.52–0.55). Attention in deep layers aligns with 3D residue contacts (44–63% agreement), despite the models receiving no structural supervision during training. This review synthesizes recent PLM developments, benchmarks, and practical applications, and offers experimental biologists guidance on model selection and validation strategies.
© 2026 Kushal Raj Roy, published by European Biotechnology Thematic Network Association
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.