Single-image indoor localization using cross-domain learning from BIM models

Piotr Ryszko; Dorota Włodarczyk; Małgorzata Jarząbek-Rychard

doi:10.2478/rgg-2026-0004

Reports on Geodesy and Geoinformatics

Volume 121 (2026): Issue 1 (June 2026)

By: Piotr Ryszko, Dorota Włodarczyk and Małgorzata Jarząbek-Rychard

Open Access

May 2026

Abstract

Accurate indoor camera localization is crucial for applications in augmented reality, robotics, and autonomous navigation. While single-image deep learning models for 6-DOF pose regression have shown competitive results on established benchmarks, their development still requires extensive data annotation and hyperparameter tuning. In this work, we investigate the combination of advanced network architectures, transfer learning, and synthetic data to improve single-image indoor pose regression. Our approach employs a ResNet50 backbone pre-trained on the Places365 dataset and further trained and evaluated on established benchmarks. To enhance the training data, synthetic images are generated from 3D BIM models using Unreal Engine, with alignment procedures ensuring accurate correspondence between synthetic and real environments. Real RGB images are preprocessed to resemble synthetic data, enabling effective cross-domain evaluation. Experiments demonstrate that both architectural design and pretraining significantly influence model performance. On the UniMelb dataset (real-to-real scenario), the model achieves errors of 0.21 m in position and 0.80° in orientation, surpassing baseline accuracy. We also present cross-validation and synthetic-to-synthetic experiments, providing insights into factors affecting performance and interactions between architecture, pretraining, and dataset characteristics.
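The abstract reports accuracy as a pair of errors, one in metres and one in degrees. As a minimal sketch of how such a pair is commonly computed for 6-DOF pose regression, the snippet below measures the Euclidean distance between camera positions and the rotation angle between orientations. The quaternion representation `(w, x, y, z)` and the function names are assumptions made for illustration; the abstract does not state how the authors parameterize orientation or aggregate errors over test images.

```python
import math


def position_error(t_pred, t_true):
    """Euclidean distance between predicted and true camera positions.

    The result is in the same unit as the inputs (metres here).
    """
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(t_pred, t_true)))


def orientation_error_deg(q_pred, q_true):
    """Smallest rotation angle, in degrees, between two orientations.

    Orientations are quaternions (w, x, y, z); this representation is an
    assumption of the sketch, not something stated in the abstract.
    """
    norm_p = math.sqrt(sum(c * c for c in q_pred))
    norm_g = math.sqrt(sum(c * c for c in q_true))
    dot = sum(p * g for p, g in zip(q_pred, q_true)) / (norm_p * norm_g)
    # q and -q encode the same rotation, so take |dot|;
    # clamp to 1.0 to guard against rounding just above 1.
    dot = min(1.0, abs(dot))
    return math.degrees(2.0 * math.acos(dot))


# Hypothetical example poses: a 10 degree rotation about the z axis
# compared with the identity orientation.
half = math.radians(10.0) / 2.0
q_rotated = (math.cos(half), 0.0, 0.0, math.sin(half))
q_identity = (1.0, 0.0, 0.0, 0.0)

pos_err = position_error((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
rot_err = orientation_error_deg(q_rotated, q_identity)
```

Taking the absolute value of the quaternion dot product matters because a network may predict either sign of a quaternion for the same physical orientation; without it, a correct prediction could be scored as a rotation of nearly 360°.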

DOI: https://doi.org/10.2478/rgg-2026-0004 | Journal eISSN: 2391-8152 | Journal ISSN: 0867-3179

Language: English

Page range: 50 - 58

Submitted on: Nov 27, 2025

Accepted on: Apr 4, 2026

Published on: May 6, 2026

Published by: Warsaw University of Technology

In partnership with: Paradigm Publishing Services

Publication frequency: 2 issues per year

Keywords: indoor localization, camera pose estimation

Related subjects: Computer sciences, other; Geosciences; Geodesy; Cartography and photogrammetry; Geosciences, other

© 2026 Piotr Ryszko, Dorota Włodarczyk, Małgorzata Jarząbek-Rychard, published by Warsaw University of Technology
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
