Contextualized Vision Transformers (CVT): Adaptive Spectral Embedding and Feature Gating for Precise Text-Graphics Classification

Mridul Ghosh; Konrad Dürrbeck; Roland Fischer; Mária Ždímalová; Tonmoy Mete

doi:10.2478/tmmp-2026-0003



Tatra Mountains Mathematical Publications

AHEAD OF PRINT


Open Access

Mar 2026

Abstract

The accurate classification of images into graphics-only, text-only, and mixed-content categories is a critical prerequisite for building efficient, content-aware processing pipelines. This initial triage prevents the unnecessary application of computationally expensive operations, such as Optical Character Recognition (OCR), to irrelevant graphical data. To address this challenge, we introduce the Contextualized Vision Transformer (CVT), a novel architecture designed specifically for this nuanced classification task. The CVT addresses the limitations of standard Vision Transformers (ViTs) through three synergistic components. It employs a Learnable Patch Decomposition (LPD) strategy that efficiently extracts patch embeddings. To model complex spatial arrangements, it introduces an Adaptive Spectral Embedding (ASE) module, which replaces static positional encodings with a dynamic, learnable representation. Crucially, a Contextual Feature Gating (CFG) module enhances feature discriminability by adaptively recalibrating patch-level features, selectively amplifying the salient text or graphic regions. Comprehensive K-fold cross-validation demonstrates the robustness and generalizability of the proposed model. The CVT achieves statistically significant improvements in accuracy, precision, and recall compared to state-of-the-art Vision Transformer baselines. Experimental results highlight the effectiveness of this architecture in providing a fast and accurate solution for the vital task of content triage in large-scale visual processing systems.
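The gating idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the CFG module is built, so everything below is an assumption made for illustration: the function name `contextual_gate`, the use of a mean-pooled global context, the single linear scoring layer, and the `weights` and `bias` parameters are hypothetical and are not taken from the paper. The sketch only shows the general pattern of recalibrating patch-level features with a learned gate in (0, 1), in the spirit of squeeze-and-excitation-style gating.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def contextual_gate(patches, weights, bias=0.0):
    """Rescale each patch embedding by a scalar gate in (0, 1).

    patches: list of N embeddings, each a list of D floats.
    weights: list of 2*D floats (hypothetical learned parameters) applied to
             the concatenation [patch, mean of all patches].
    Returns (gated_patches, gates).
    """
    n, d = len(patches), len(patches[0])
    # Global context: the mean embedding over all patches.
    context = [sum(p[j] for p in patches) / n for j in range(d)]
    gated, gates = [], []
    for p in patches:
        features = p + context  # concatenate local and global information
        score = sum(w * f for w, f in zip(weights, features)) + bias
        g = sigmoid(score)
        gates.append(g)
        gated.append([g * v for v in p])  # scale the whole patch by its gate
    return gated, gates


if __name__ == "__main__":
    random.seed(0)
    n_patches, dim = 4, 3
    patches = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_patches)]
    weights = [random.uniform(-1, 1) for _ in range(2 * dim)]  # stand-in for trained values
    _, gates = contextual_gate(patches, weights)
    for i, g in enumerate(gates):
        print(f"patch {i}: gate = {g:.3f}")
```

In this sketch, a patch whose gate is close to 1 passes through almost unchanged, while a patch whose gate is close to 0 is suppressed. In a trained model the weights would be learned jointly with the rest of the network; here they are random placeholders.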



DOI: https://doi.org/10.2478/tmmp-2026-0003 | Journal eISSN: 1338-9750 | Journal ISSN: 1210-3195


Language: English

Submitted on: Oct 8, 2025

Accepted on: Nov 27, 2025

Published on: Mar 17, 2026

Published by: Slovak Academy of Sciences, Mathematical Institute

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords: Text-graphics, vision transformer, classification, learnable patch decomposition

Related subjects: Mathematics, General mathematics

© 2026 Mridul Ghosh, Konrad Dürrbeck, Roland Fischer, Mária Ždímalová, Tonmoy Mete, published by Slovak Academy of Sciences, Mathematical Institute
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
