Table 1
Van der Vleuten’s Utility Index and its application to GenAI in SBA item-writing.
| UTILITY DIMENSION | DEFINITION | POTENTIAL CHALLENGES WITH GENAI CREATED ITEMS | POTENTIAL BENEFIT OF GENAI CO-CREATED MODEL |
|---|---|---|---|
| Validity | Validity refers to the extent to which an assessment accurately measures what it is intended to measure. It is not a singular property, but a collection of interrelated sources of evidence supporting score interpretation. | GenAI-created items may lack alignment with curricular objectives or default to recall-level tasks instead of reasoning [5]. | Structured prompts designed by educators ensure alignment with curricular blueprints and cognitive levels, while institutional oversight provides quality assurance. |
| Reliability | Reliability refers to the consistency and stability of assessment outcomes. It ensures that results are reproducible and not influenced by item flaws. | Item difficulty and distractor quality can be inconsistent when prompting lacks standardisation, so GenAI-created items can fluctuate in quality [6]. | Co-created workflows standardise prompt structures and embed iterative educator feedback, producing consistent, reproducible items that strengthen internal reliability across item sets. |
| Educational Impact | Educational impact refers to the effect an assessment has on teaching, learning, and professional development. | GenAI-generated items may reinforce superficial learning and fail to model clinical reasoning or judgment [6]. | Chain-of-thought reasoning integrated into co-created items supports learning, enabling SBAs to function as both assessment and learning tools that model clinical reasoning. |
| Acceptability | Acceptability reflects the willingness of stakeholders to adopt and trust the tool; that is, it refers to the trustworthiness and legitimacy of an assessment. | Trust may be undermined by concerns about bias, transparency, and academic integrity. Some faculty remain uncertain about quality and security [14, 15]. Educators also face a learning curve in prompt engineering, which limits integration. | In the CCSD model, institutional governance and transparent policies ensure ethical compliance, while faculty training builds confidence and shared trust in the use of GenAI for assessment design. |
| Cost-Efficiency | Cost-efficiency is the balance between the resources required to develop assessment content and the educational value produced. This includes time, training, oversight, and quality assurance. | High subscription fees or technical requirements for GenAI platforms may widen equity gaps, making it harder for under-resourced institutions to adopt AI tools at the same pace as better-funded counterparts. Quality training in AI literacy demands time, expertise, and coordination, adding to the upfront costs [16]. | Co-created systems improve efficiency over time: prompt exemplars and faculty development reduce editing workload, while institutional licensing supports more consistent access within an institution. However, equitable access, including for under-resourced institutions, depends on broader structural mechanisms such as sector-wide negotiations, national or regional consortia, or open-access educational Large Language Models [16]. |

Figure 1
The Co-Created SBA Design (CCSD) Framework.
Table 2
Common Types of Edits Required for Reviewing GenAI-Generated Assessment Items.
| TYPES OF EDITS | DESCRIPTION |
|---|---|
| Clinical inaccuracy | Incorrect or outdated medical content. For example: Suggesting oral amoxicillin as first-line treatment for hospital-acquired pneumonia. |
| Lack of depth or cognitive challenge | Questions that only assess factual recall. For example: “What is the normal range for potassium?” instead of applying this in a clinical context. |
| Implausible distractors | Options that are obviously incorrect or unrelated to the scenario. For example: a clinical vignette on acute myocardial infarction with the options A. Acute myocardial infarction, B. Acute pancreatitis, C. Costochondritis, D. Tension pneumothorax. Another example is offering “pregnancy test” as an answer option for a male patient. |
| Use of abbreviations | Use of abbreviations that may be ambiguous, outdated, or unfamiliar. For example: “NAD” (which may mean “no abnormality detected” or “nicotinamide adenine dinucleotide”). |
| Jurisdiction-specific terminology | Use of terms that vary by region or practice. For example: “emergency room” (US) vs. “Accident & Emergency” (UK). |
| Region-specific medication names | Differences in drug names across regions that may confuse test-takers. For example: paracetamol (UK) vs. acetaminophen (US). |
| Ambiguous or vague wording | Lack of specificity or clarity in the lead-in question. For example: “Select the correct treatment” without clarifying whether it refers to first-line, symptomatic, or emergency treatment. |
