Abstract
The rapid integration of AI into language assessment has provoked heated discussion concerning the validity, reliability, and academic integrity of remote testing in ESL/EFL contexts. While AI-based assessment tools improve accessibility, scalability, and the efficiency of resources and processes, the challenges of ensuring equitable testing conditions, preventing academic misconduct, and measuring language proficiency in high-stakes settings raise crucial questions. This comparative study examines online, technology-driven, AI-based assessment against teacher-mediated assessment in a British international school in Abu Dhabi. About 60 ESL students participated in a 3-month study in which AI-administered tests were used alongside traditional, in-person teacher assessments. A mixed-methods analysis, drawing on quantitative scoring data, teacher observations, and student feedback, found that AI-based assessments (i.e., CAT4, GL) often yielded inflated scores and were inconsistent in assessing productive and contextual language skills. Despite recent policy developments, such as Cambridge's forthcoming hybrid assessment model, which reflects a growing reliance on the digitalization of testing, the findings indicate that a human-in-the-loop remains necessary to support the validity, fairness, and integrity of second-language assessment.