Contextualized Vision Transformers (CVT): Adaptive Spectral Embedding and Feature Gating for Precise Text-Graphics Classification
Abstract
Accurate classification of images into graphics-only, text-only, and mixed-content categories is a critical prerequisite for building efficient, content-aware processing pipelines. This initial triage avoids applying computationally expensive operations, such as Optical Character Recognition (OCR), to images that contain no text. To this end, we introduce the Contextualized Vision Transformer (CVT), a novel architecture designed specifically for this classification task. The CVT overcomes limitations of standard Vision Transformers (ViTs) through three complementary components. First, a Learnable Patch Decomposition (LPD) strategy efficiently extracts patch embeddings. Second, to model complex spatial arrangements, an Adaptive Spectral Embedding (ASE) module replaces static positional encodings with a dynamic, learnable representation. Third, a Contextual Feature Gating (CFG) module enhances feature discriminability by adaptively recalibrating patch-level features, selectively amplifying salient text or graphic regions. Comprehensive k-fold cross-validation demonstrates the robustness and generalizability of the proposed model. The CVT achieves statistically significant improvements in accuracy, precision, and recall over state-of-the-art Vision Transformer baselines. The experimental results show that the architecture provides a fast and accurate solution for content triage in large-scale visual processing systems.
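The abstract does not specify the form of the CFG module. Purely as an illustration of what "adaptively recalibrating patch-level features" could look like, the following minimal sketch assumes a scalar sigmoid gate per patch, computed from the patch's own embedding plus an image-level mean embedding. The function name `contextual_gate` and the parameters `w_local`, `w_context`, and `bias` are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def contextual_gate(patches, w_local, w_context, bias=0.0):
    """Rescale each patch embedding by a gate in (0, 1).

    Assumed (hypothetical) formulation, not the authors' method:
        gate_i = sigmoid(w_local . patch_i + w_context . mean(patches) + bias)

    patches:   list of N patch embeddings, each a list of D floats
    w_local:   D weights scoring each patch on its own features
    w_context: D weights scoring the image-level mean embedding
    """
    n, d = len(patches), len(patches[0])
    # Global context: mean embedding over all patches of the image.
    context = [sum(p[j] for p in patches) / n for j in range(d)]
    context_score = sum(w * c for w, c in zip(w_context, context))

    gated, gates = [], []
    for p in patches:
        local_score = sum(w * x for w, x in zip(w_local, p))
        g = sigmoid(local_score + context_score + bias)
        gates.append(g)
        # Salient patches (gate near 1) are kept; others are attenuated.
        gated.append([g * x for x in p])
    return gated, gates


# Toy usage: two 2-dimensional patch embeddings.
patches = [[1.0, 0.0], [0.0, 1.0]]
gated, gates = contextual_gate(patches, w_local=[2.0, -2.0], w_context=[0.5, 0.5])
```

In this toy setting the first patch receives the larger gate because `w_local` scores its features positively. In a trained model the weights would be learned jointly with the rest of the network, and the gate could equally be per-channel rather than per-patch.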
© 2026 Mridul Ghosh, Konrad Dürrbeck, Roland Fischer, Mária Ždímalová, Tonmoy Mete, published by Slovak Academy of Sciences, Mathematical Institute
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.