Table 1
Van der Vleuten’s Utility Index and its application to GenAI in SBA item-writing.
| UTILITY DIMENSION | DEFINITION | POTENTIAL CHALLENGES WITH GENAI CREATED ITEMS | POTENTIAL BENEFIT OF GENAI CO-CREATED MODEL |
|---|---|---|---|
| Validity | Validity refers to the extent to which an assessment accurately measures what it is intended to measure. It is not a singular property, but a collection of interrelated sources of evidence supporting score interpretation. | GenAI-created items may lack alignment with curricular objectives or default to recall-level tasks instead of reasoning [5]. | Structured prompts designed by educators ensure alignment with curricular blueprints and cognitive levels, while institutional oversight provides quality assurance. |
| Reliability | Reliability refers to the consistency and stability of assessment outcomes. It ensures that results are reproducible and not influenced by item flaws. | Item difficulty and distractor quality can be inconsistent when prompting lacks standardisation, so GenAI-created items can fluctuate in quality [6]. | Co-created workflows standardise prompt structures and embed iterative educator feedback, producing consistent, reproducible items that strengthen internal reliability across item sets. |
| Educational Impact | Educational impact refers to the effect an assessment has on teaching, learning, and professional development. | GenAI-generated items may reinforce superficial learning and fail to model clinical reasoning or judgment [6]. | Chain-of-thought reasoning integrated into co-created items supports learning, enabling SBAs to function as both assessment and learning tools that model clinical reasoning. |
| Acceptability | Acceptability reflects the willingness of stakeholders to adopt and trust the tool; that is, it refers to the trustworthiness and legitimacy of an assessment. | Trust may be undermined by concerns about bias, transparency, and academic integrity. Some faculty remain uncertain about quality and security [14, 15]. Educators also face a learning curve in prompt engineering, which limits integration. | In the CCSD model, institutional governance and transparent policies ensure ethical compliance, while faculty training builds confidence and shared trust in the use of GenAI for assessment design. |
| Cost-Efficiency | Cost-efficiency is the balance between the resources required to develop assessment content and the educational value produced. This includes time, training, oversight, and quality assurance. | High subscription fees or technical requirements for GenAI platforms may widen equity gaps, making it harder for under-resourced institutions to adopt AI tools at the same pace as better-funded counterparts. Quality training in AI literacy demands time, expertise, and coordination, adding to the upfront costs [16]. | Co-created systems improve efficiency over time: prompt exemplars and faculty development reduce editing workload, while institutional licensing supports more consistent access within an institution. However, equitable access, including for under-resourced institutions, depends on broader structural mechanisms such as sector-wide negotiations, national or regional consortia, or open-access educational Large Language Models [16]. |

Figure 1
The Co-Created SBA Design (CCSD) Framework.
Table 2
Common Types of Edits Required for Reviewing GenAI-Generated Assessment Items.
| TYPES OF EDITS | DESCRIPTION |
|---|---|
| Clinical inaccuracy | Incorrect or outdated medical content. For example: Suggesting oral amoxicillin as first-line treatment for hospital-acquired pneumonia. |
| Lack of depth or cognitive challenge | Questions that only assess factual recall. For example: “What is the normal range for potassium?” instead of applying this in a clinical context. |
| Implausible distractors | Options that are obviously incorrect or unrelated to the scenario. For example: a clinical vignette on acute myocardial infarction with the options A. Acute myocardial infarction, B. Acute pancreatitis, C. Costochondritis, D. Tension pneumothorax. Another example is offering “pregnancy test” as an answer option for a male patient. |
| Use of abbreviations | Use of abbreviations that may be ambiguous, outdated, or unfamiliar. For example: “NAD” (which may mean “no abnormality detected” or “nicotinamide adenine dinucleotide”). |
| Jurisdiction-specific terminology | Use of terms that vary by region or practice. For example: “emergency room” (US) vs. “Accident & Emergency” (UK). |
| Region-specific medication names | Differences in drug names across regions that may confuse test-takers. For example: paracetamol (UK) vs. acetaminophen (US). |
| Ambiguous or vague wording | Lack of specificity or clarity in the lead-in question. For example: “Select the correct treatment” without clarifying whether it refers to first-line, symptomatic, or emergency treatment. |
