Contextualized Vision Transformers (CVT): Adaptive Spectral Embedding and Feature Gating for Precise Text-Graphics Classification

Open Access | Mar 2026

DOI: https://doi.org/10.2478/tmmp-2026-0003 | Journal eISSN: 1338-9750 | Journal ISSN: 1210-3195
Language: English
Submitted on: Oct 8, 2025
Accepted on: Nov 27, 2025
Published on: Mar 17, 2026
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Mridul Ghosh, Konrad Dürrbeck, Roland Fischer, Mária Ždímalová, Tonmoy Mete, published by Slovak Academy of Sciences, Mathematical Institute
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.

AHEAD OF PRINT