Abstract
—Human beings rely primarily on vision to perceive and interact with the external world, with approximately 80% of sensory information received through the visual system. This visual dominance makes the question of "where an individual is looking" not only a key to understanding attention distribution and information-processing mechanisms, but also a critical factor in optimizing decision-making efficiency and learning outcomes. However, traditional methods for analyzing gaze-related behaviors, such as manual behavioral observation and self-reported evaluation, suffer from inherent limitations: behavioral observation relies on the subjective judgment of observers, often missing subtle gaze shifts and failing to achieve real-time tracking, while self-evaluation is prone to memory biases and social-desirability effects, leading to deviations between reported and actual gaze patterns. These drawbacks highlight the need for a more objective and precise alternative. Gaze estimation, which infers an individual's visual attention and behavioral intentions by recording and analyzing the spatial position, movement trajectory, and dynamic changes of the eyeball, emerges as an ideal solution. The technology is broadly categorized into model-based approaches (relying on geometric eye models) and appearance-based approaches (using facial/ocular image features), with appearance-based methods gaining traction due to their non-intrusiveness. Nevertheless, current appearance-based gaze estimation still faces two major challenges: (1) individual differences, such as variations in eye shape, pupil size, eyelid structure, and the presence of glasses, which disrupt consistent feature extraction; and (2) environmental interference, including variable lighting, partial facial occlusion, and dynamic head poses, which reduces estimation accuracy. To address these issues, this paper proposes RTACM-Net, a novel gaze estimation network architecture that integrates the strengths of the Vision Transformer (ViT) with a multi-scale feature fusion mechanism. Specifically, RTACM-Net employs a lightweight convolutional module to extract local fine-grained features of the ocular region, while leveraging ViT's multi-head attention mechanism to capture global contextual relationships. This dual-branch design enables the network to balance local feature precision and global context awareness, thereby mitigating the impact of individual differences and environmental noise. Extensive experiments were conducted on two benchmark datasets: MPIIFaceGaze (a large-scale dataset collected under controlled indoor conditions from 15 subjects) and Gaze360 (a challenging dataset covering diverse indoor and outdoor scenes, variable lighting, and large head-pose variations across 238 subjects). RTACM-Net achieves a mean angular error (MAE) of 3.72° on MPIIFaceGaze and 10.46° on Gaze360, outperforming Gaze360-Net (11.40°) by 0.94°. These results demonstrate the robustness of RTACM-Net in handling variable individual characteristics and complex environmental conditions. Its practical potential extends to multiple application fields: in augmented reality (AR), it can enable adaptive interface rendering; in autonomous driving, it supports dual-task monitoring; and in human-robot interaction, it facilitates intuitive service triggering.
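
The dual-branch idea described above can be illustrated with a minimal PyTorch sketch. The class name, layer sizes, fusion-by-concatenation scheme, and the 2-D (pitch, yaw) regression head below are illustrative assumptions, not the authors' implementation of RTACM-Net; the sketch only shows how a lightweight convolutional branch and a ViT-style attention branch can be combined for gaze regression.

# Minimal sketch of a dual-branch (local conv + global ViT) gaze regressor.
# All hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class DualBranchGazeSketch(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, heads=3, depth=4):
        super().__init__()
        # Local branch: lightweight convolutions for fine-grained ocular features.
        self.local = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> (B, 64)
        )
        # Global branch: ViT-style patch embedding plus multi-head self-attention
        # to capture global contextual relationships across the face image.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Fusion and regression head: concatenate the two feature streams and
        # regress a 2-D gaze direction (pitch, yaw).
        self.head = nn.Sequential(nn.Linear(64 + dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, x):
        local_feat = self.local(x)                               # (B, 64)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        global_feat = self.encoder(tokens + self.pos).mean(dim=1)  # (B, dim)
        return self.head(torch.cat([local_feat, global_feat], dim=1))


if __name__ == "__main__":
    model = DualBranchGazeSketch()
    gaze = model(torch.randn(2, 3, 224, 224))  # dummy face crops
    print(gaze.shape)                          # torch.Size([2, 2])

At evaluation time, predicted and ground-truth gaze directions are typically converted to unit vectors and compared via the angle between them; averaging this angle over the test set yields the mean angular error reported above.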