Abstract
Multimodal image matching remains a challenging yet essential task in computer vision. In recent years, detector-free methods have emerged as promising approaches, achieving high matching accuracy by leveraging global modeling capabilities. While transformer-based methods are effective, they often incur significant computational overhead, limiting their efficiency. To address this, we propose MambaSC, a novel framework that integrates Mamba with self-attention and cross-attention mechanisms to balance accuracy and efficiency. Specifically, MambaSC introduces the M2Backbone for efficient feature extraction and the MSC Module to enhance feature interaction and alignment. Extensive experiments on multiple multimodal image datasets demonstrate that MambaSC consistently outperforms state-of-the-art methods while maintaining computational efficiency, making it a compelling solution for complex multimodal image matching scenarios. Code is available at: https://github.com/LiaoYun0x0/MambaSC.