Abstract
Multimodal image matching remains a challenging yet essential task in computer vision. In recent years, detector-free methods have emerged as promising approaches, achieving high matching accuracy by leveraging global modeling capabilities. While transformer-based methods are effective, they often incur significant computational overhead, limiting their efficiency. To address this, we propose MambaSC, a novel framework that integrates Mamba with self-attention and cross-attention mechanisms to balance accuracy and efficiency. Specifically, MambaSC introduces the M2Backbone for efficient feature extraction and the MSC Module to enhance feature interaction and alignment. Extensive experiments on multiple multimodal image datasets demonstrate that MambaSC consistently outperforms state-of-the-art methods while maintaining computational efficiency, making it a compelling solution for complex multimodal image matching scenarios. Code is available at: https://github.com/LiaoYun0x0/MambaSC.