Abstract
Facial expression synthesis in virtual environments is critical for educational applications, yet existing systems struggle to balance realism, cultural inclusion, and accessibility. This paper presents a multidimensional framework derived from a systematic review of 127 studies (2014–2024). The framework addresses three key tensions: (1) the realism–accessibility trade-off in generative models, (2) the imperative for cultural inclusion in expression datasets, and (3) the need for pedagogical grounding of expressive agents. Its four dimensions (technical, pedagogical, sociocultural, and operational) offer a replicable blueprint for equitable educational tools. By shifting the evaluation focus from raw technical performance metrics (e.g., F1-score) to contextualized pedagogical utility, and by establishing a clear theoretical distinction from existing affective computing models, this work provides a nuanced and actionable guide for developers and educators.