When Assessment Theory Meets Generative AI: Reimagining SBA Design in Medical Education

Nora Al-Shawee; Gerry McElvaney; Judith Strawbridge; Muirne Spooner

doi:10.5334/pme.2033

Abstract

Current evaluations of generative artificial intelligence (GenAI) in item-writing within medical education often concentrate on isolated dimensions such as validity or reliability, overlooking the broader theoretical foundations that underpin trustworthy assessment design. This narrow emphasis risks oversimplifying GenAI’s role and obscuring how its adoption reshapes the relationship between quality, efficiency, and educational value. To address this complexity, this paper presents the Co-Created SBA Design (CCSD) framework, which reconceptualises assessment theory for the GenAI era through the lens of Van der Vleuten’s Utility Index. The framework offers a coherent structure for integrating GenAI into Single Best Answer (SBA) development, maintaining equilibrium across the Utility Index dimensions while redefining collaboration among educators, higher education institutions, and GenAI, which acts as a technological partner that enriches the item-writing process. Within this triadic model, each contributor plays a distinct yet complementary role in sustaining assessment quality. Collectively, their interaction ensures that validity, reliability, educational impact, acceptability, and cost-efficiency remain balanced, supporting both educational integrity and sustainable innovation in medical education.

