
When Assessment Theory Meets Generative AI: Reimagining SBA Design in Medical Education

Open Access | Mar 2026

Figures & Tables

Table 1

Van der Vleuten’s Utility Index and its application to GenAI in SBA item-writing.

Validity
- Definition: Validity refers to the extent to which an assessment accurately measures what it is intended to measure. It is not a singular property but a collection of interrelated evidence supporting score interpretation.
- Potential challenges with GenAI-created items: GenAI-created items may lack alignment with curricular objectives or default to recall-level tasks instead of reasoning [5].
- Potential benefit of GenAI co-created model: Structured prompts designed by educators ensure alignment with curricular blueprints and cognitive levels, while institutional oversight provides quality assurance.

Reliability
- Definition: Reliability refers to the consistency and stability of assessment outcomes. It ensures that results are reproducible and not influenced by item flaws.
- Potential challenges with GenAI-created items: Item difficulty and distractor quality can be inconsistent when prompting lacks standardisation, so GenAI-created items can fluctuate in quality [6].
- Potential benefit of GenAI co-created model: Co-created workflows standardise prompt structures and embed iterative educator feedback, producing consistent, reproducible items that strengthen internal reliability across item sets.

Educational impact
- Definition: Educational impact refers to the effect an assessment has on teaching, learning, and professional development.
- Potential challenges with GenAI-created items: GenAI-generated items may reinforce superficial learning and fail to model clinical reasoning or judgment [6].
- Potential benefit of GenAI co-created model: Chain-of-thought reasoning integrated into co-created items supports learning, enabling SBAs to function as both assessment and learning tools that model clinical reasoning.

Acceptability
- Definition: Acceptability reflects the willingness of stakeholders to adopt and trust the tool; that is, it refers to the trustworthiness and legitimacy of an assessment.
- Potential challenges with GenAI-created items: Trust may be undermined by concerns about bias, transparency, and academic integrity, and some faculty remain uncertain about quality and security [14, 15]. Educators also face a learning curve in prompt-engineering skills, which limits integration.
- Potential benefit of GenAI co-created model: In the CCSD model, institutional governance and transparent policies ensure ethical compliance, while faculty training builds confidence and shared trust in the use of GenAI for assessment design.

Cost-efficiency
- Definition: Cost-efficiency is the balance between the resources required to develop assessment content and the educational value produced, including time, training, oversight, and quality assurance.
- Potential challenges with GenAI-created items: High subscription fees or technical requirements for GenAI platforms may widen equity gaps, making it harder for under-resourced institutions to adopt AI tools at the same pace as better-funded counterparts. Quality training in AI literacy demands time, expertise, and coordination, adding to upfront costs [16].
- Potential benefit of GenAI co-created model: Co-created systems improve efficiency over time: prompt exemplars and faculty development reduce editing workload, while institutional licensing supports more consistent access within an institution. However, equitable access, including for under-resourced institutions, depends on broader structural mechanisms such as sector-wide negotiations, national or regional consortia, or open-access educational Large Language Models [16].
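The educator-designed structured prompts described under Validity and Reliability can be sketched as a simple template. This is a minimal illustration only; the field names ("topic", "cognitive_level", "setting") and the requirements wording are assumptions for this example, not the prompt schema defined by the CCSD framework.

```python
# Illustrative sketch of a structured prompt for GenAI SBA item drafting.
# All field names and requirement phrasings are hypothetical examples,
# not the actual CCSD prompt schema.

PROMPT_TEMPLATE = """You are drafting a single-best-answer (SBA) item for medical students.
Topic (from curriculum blueprint): {topic}
Intended cognitive level: {cognitive_level}
Clinical setting: {setting}

Requirements:
- Write a clinical vignette that requires {cognitive_level}-level reasoning, not recall.
- Provide one correct answer and four plausible distractors.
- Avoid ambiguous abbreviations and region-specific drug names.
- State the lead-in question precisely (e.g. "most appropriate first-line treatment").
"""

def build_prompt(topic: str, cognitive_level: str, setting: str) -> str:
    """Render the educator-designed template with blueprint fields."""
    return PROMPT_TEMPLATE.format(
        topic=topic, cognitive_level=cognitive_level, setting=setting
    )

prompt = build_prompt(
    topic="Community-acquired pneumonia: initial management",
    cognitive_level="application",
    setting="emergency department",
)
print(prompt)
```

Fixing the template once and varying only the blueprint fields is what standardises item difficulty and distractor quality across drafts, per the Reliability row above.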
Figure 1

The Co-Created SBA Design (CCSD) Framework.

Table 2

Common Types of Edits Required for Reviewing GenAI-Generated Assessment Items.

Clinical inaccuracy: Incorrect or outdated medical content. For example, suggesting oral amoxicillin as first-line treatment for hospital-acquired pneumonia.

Lack of depth or cognitive challenge: Questions that only assess factual recall. For example, “What is the normal range for potassium?” instead of applying this in a clinical context.

Implausible distractors: Options that are obviously incorrect or unrelated to the scenario. For example, a clinical vignette on acute myocardial infarction with options: A. Acute myocardial infarction; B. Acute pancreatitis; C. Costochondritis; D. Tension pneumothorax. Another example is offering “pregnancy test” as an answer option for a male patient.

Use of abbreviations: Abbreviations that may be ambiguous, outdated, or unfamiliar. For example, “NAD” (which may mean “no abnormality detected” or “nicotinamide adenine dinucleotide”).

Jurisdiction-specific terminology: Terms that vary by region or practice. For example, “emergency room” (US) vs. “Accident & Emergency” (UK).

Region-specific medication names: Differences in drug names across regions that may confuse test-takers. For example, paracetamol vs. acetaminophen.

Ambiguous or vague wording: Lack of specificity or clarity in the lead-in question. For example, “Select the correct treatment” without clarifying whether it refers to first-line, symptomatic, or emergency treatment.
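Some of the edit types in the table above (ambiguous abbreviations, region-specific terminology and drug names) are mechanical enough to flag automatically before items reach human reviewers. The sketch below assumes small illustrative term lists; a real institution would maintain its own, and such a check supplements rather than replaces clinical review.

```python
# Illustrative pre-review lint for GenAI-drafted SBA items.
# The term lists are tiny examples only, not an institutional policy.

AMBIGUOUS_ABBREVIATIONS = {"NAD", "MS", "PID"}
REGION_SPECIFIC_TERMS = {
    "acetaminophen": "paracetamol (INN) vs acetaminophen (US)",
    "emergency room": "'emergency room' (US) vs 'Accident & Emergency' (UK)",
}

def lint_item(stem: str) -> list[str]:
    """Flag mechanical issues in an item stem for human reviewers."""
    flags = []
    # Crude tokenisation: strip common punctuation, then split on whitespace.
    words = stem.replace(",", " ").replace(".", " ").split()
    for word in words:
        if word in AMBIGUOUS_ABBREVIATIONS:
            flags.append(f"Ambiguous abbreviation: {word}")
    lowered = stem.lower()
    for term, note in REGION_SPECIFIC_TERMS.items():
        if term in lowered:
            flags.append(f"Region-specific term: {note}")
    return flags

flags = lint_item(
    "Examination: NAD. The patient took acetaminophen for pain before arrival."
)
for f in flags:
    print(f)
```

Routing drafts through such a filter lets educator time concentrate on the edits that genuinely require judgment: clinical accuracy, distractor plausibility, and cognitive depth.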
DOI: https://doi.org/10.5334/pme.2033 | Journal eISSN: 2212-277X
Language: English
Submitted on: Aug 1, 2025 | Accepted on: Dec 22, 2025 | Published on: Mar 12, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Nora Al-Shawee, Gerry McElvaney, Judith Strawbridge, Muirne Spooner, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.