Single-image indoor localization using cross-domain learning from BIM models

Abstract
Accurate indoor camera localization is crucial for applications in augmented reality, robotics, and autonomous navigation. Although single-image deep learning models for 6-DOF pose regression achieve competitive results on established benchmarks, developing them still requires extensive data annotation and hyperparameter tuning. In this work, we investigate how advanced network architectures, transfer learning, and synthetic data can be combined to improve single-image indoor pose regression. Our approach employs a ResNet50 backbone pre-trained on the Places365 dataset, which is then further trained and evaluated on established benchmarks. To augment the training data, synthetic images are generated from 3D BIM models using Unreal Engine, with alignment procedures ensuring accurate correspondence between the synthetic and real environments. Real RGB images are preprocessed to resemble the synthetic data, enabling effective cross-domain evaluation. Experiments demonstrate that both architectural design and pretraining significantly influence model performance. On the UniMelb dataset (real-to-real scenario), the model achieves a position error of 0.21 m and an orientation error of 0.80°, surpassing the baseline. We also present cross-validation and synthetic-to-synthetic experiments, which provide insights into the factors affecting performance and into the interactions between architecture, pretraining, and dataset characteristics.
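For readers unfamiliar with how 6-DOF pose regression results are usually reported, the following minimal sketch shows one common way to compute a position error in metres and an orientation error in degrees. It is an illustration only: the abstract does not state the exact metric definitions, so the unit-quaternion orientation parameterization (ordered w, x, y, z) and the function names `position_error` and `orientation_error_deg` are assumptions, not the authors' implementation.

```python
import math


def position_error(t_pred, t_true):
    """Euclidean distance between predicted and ground-truth camera positions.

    Both inputs are (x, y, z) sequences in metres (assumed convention).
    """
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(t_pred, t_true)))


def orientation_error_deg(q_pred, q_true):
    """Angle, in degrees, of the rotation separating two orientations.

    Both inputs are quaternions (w, x, y, z); they are normalized here,
    so a raw, unnormalized network output is acceptable.
    """
    n_pred = math.sqrt(sum(c * c for c in q_pred))
    n_true = math.sqrt(sum(c * c for c in q_true))
    dot = sum(p * t for p, t in zip(q_pred, q_true)) / (n_pred * n_true)
    # abs() handles the q / -q ambiguity (both encode the same rotation);
    # min() guards against values slightly above 1 caused by rounding.
    d = min(1.0, abs(dot))
    return math.degrees(2.0 * math.acos(d))


if __name__ == "__main__":
    # Hypothetical prediction and ground truth, for illustration only.
    t_err = position_error((1.0, 2.0, 0.5), (1.3, 2.4, 0.5))
    half = math.radians(90.0) / 2.0
    r_err = orientation_error_deg(
        (1.0, 0.0, 0.0, 0.0),                      # identity rotation
        (math.cos(half), 0.0, 0.0, math.sin(half)),  # 90 degrees about z
    )
    print(f"position error: {t_err:.3f} m, orientation error: {r_err:.2f} deg")
```

Benchmark figures of this kind are typically aggregated over all test images, for example as a median; which aggregate the reported values correspond to is not specified in the abstract.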
© 2026 Piotr Ryszko, Dorota Włodarczyk, Małgorzata Jarząbek-Rychard, published by Warsaw University of Technology
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.