<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JCC</journal-id><journal-title-group><journal-title>Journal of Computer and Communications</journal-title></journal-title-group><issn pub-type="epub">2327-5219</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jcc.2024.122004</article-id><article-id pub-id-type="publisher-id">JCC-131240</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject></subj-group></article-categories><title-group><article-title>
 
 
  A Visual Indoor Localization Method Based on Efficient Image Retrieval
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Mengyan</surname><given-names>Lyu</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Xinxin</surname><given-names>Guo</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Kunpeng</surname><given-names>Zhang</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Liye</surname><given-names>Zhang</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff1"><addr-line>School of Computer Science and Technology, Shandong University of Technology, Zibo, China</addr-line></aff><pub-date pub-type="epub"><day>05</day><month>02</month><year>2024</year></pub-date><volume>12</volume><issue>02</issue><fpage>47</fpage><lpage>66</lpage><history><date date-type="received"><day>17,</day>	<month>January</month>	<year>2024</year></date><date date-type="rev-recd"><day>18,</day>	<month>February</month>	<year>2024</year>	</date><date date-type="accepted"><day>21,</day>	<month>February</month>	<year>2024</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  The task of indoor visual localization, which uses camera visual information to compute the user's pose, is a core component of Augmented Reality (AR) and Simultaneous Localization and Mapping (SLAM). Existing indoor localization technologies generally rely on scene-specific 3D representations or are trained on specific datasets, which makes it difficult to balance accuracy and cost when they are applied to new scenes. To address this issue, this paper proposes a universal indoor visual localization method based on efficient image retrieval. First, a Multi-Layer Perceptron (MLP) aggregates features from the intermediate layers of a convolutional neural network into a global representation of the image, which enables accurate and rapid retrieval of reference images. Next, a new mechanism using Random Sample Consensus (RANSAC) resolves the relative pose ambiguity that arises when the essential matrix, estimated with the five-point method, is decomposed. Finally, the absolute pose of the user's query image is computed, which yields the indoor user pose. The proposed method is simple, flexible, and generalizes well across scenes. Experiments show a positioning error of 0.09 m and 2.14&#176; on the 7Scenes dataset, and 0.15 m and 6.37&#176; on the 12Scenes dataset. These results indicate that the proposed indoor localization method performs well on both datasets.
 
</p></abstract><kwd-group><kwd>Visual Indoor Positioning</kwd><kwd>Feature Point Matching</kwd><kwd>Image Retrieval</kwd><kwd>Position Calculation</kwd><kwd>Five-Point Method</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>With the advancement of Location-Based Service (LBS) technology, the Global Positioning System (GPS) alone can no longer meet users’ needs for indoor location information. Consequently, a growing range of indoor positioning methods has been introduced, including Wi-Fi positioning [<xref ref-type="bibr" rid="scirp.131240-ref1">1</xref>], ultrasound positioning [<xref ref-type="bibr" rid="scirp.131240-ref2">2</xref>], Ultra-Wideband (UWB) positioning [<xref ref-type="bibr" rid="scirp.131240-ref3">3</xref>], and geomagnetic positioning [<xref ref-type="bibr" rid="scirp.131240-ref4">4</xref>]. In complex indoor environments, however, these solutions may require substantial manual configuration and additional infrastructure, which can lead to high costs and poor resistance to interference. In contrast, vision-based indoor positioning uses only image information to perceive the surroundings of the user’s camera and to compute its position and orientation. This approach is easy to deploy and inexpensive, and it has attracted extensive research attention.</p><p>Existing research on indoor visual localization falls into two main classes: structure-based and regression-based localization. Structure-based localization first establishes correspondences between features in the query image and 3D structural features in the scene model, and then solves for the camera pose with Perspective-n-Point (PnP) inside a RANSAC loop that rejects outliers. 
Structure-based techniques are further divided into matching-based localization and scene coordinate regression-based localization. Most matching-based techniques reduce localization to a feature descriptor matching task, and they can be separated into direct matching and hierarchical matching according to how the descriptor matching is carried out. The direct matching approach [<xref ref-type="bibr" rid="scirp.131240-ref5">5</xref>] matches the 2D features of the query image directly against the 3D feature points of the scene. Hierarchical matching algorithms first search for database images whose features are similar to those of the query image and then establish 2D-3D correspondences indirectly. In contrast, scene coordinate regression methods, which have received much attention in recent years, use compact learned models such as random forests to regress dense scene coordinate maps, directly predicting the absolute 3D coordinates of image pixels and making explicit use of the scene structure. The scene structure is generally represented by a 3D point cloud model built with Structure-from-Motion (SFM) or SLAM. However, the scene model is difficult to construct, and the geometric alignment between the query image and the 3D model is difficult to solve.</p><p>In regression-based methods, end-to-end direct regression predicts the image pose with a Convolutional Neural Network (CNN): the network weights are optimized during training, and the regressor directly outputs the position and orientation of the image. Different input types, including single images, image sequences, and videos, can be used as network inputs. End-to-end direct regression must be trained on a specific dataset (possibly covering multiple scenes) and retrained when it is applied to a new scene, so it adapts poorly and is prone to overfitting. 
Relative pose regression methods build on image retrieval: they predict the relative pose between the query image and the most similar image in the database, and then derive the absolute pose of the query image.</p><p>Although the field has matured considerably in the past few years, it is still difficult to balance computational cost, localization accuracy, and robustness. To address this problem, this paper designs an indoor localization method based on efficient image retrieval and relative pose estimation. The proposed localization system consists of an offline phase and an online phase. The offline phase extracts and stores the global features of all offline database images. In the online phase, the system efficiently retrieves the database images most similar to the query, and the matches between each image pair are used to estimate the essential matrix, from which the pose is computed. We adopt a recent learned feature extraction and matching model and propose a robust pose calculation method based on relative pose regression. The scheme achieves high-precision localization on multiple indoor datasets without being tailored to a specific dataset, and it adapts to new scenes more readily than scene-specific localization systems. The system uses only RGB images and their pose information, does not rely on 3D models, and uses a server-hosted image database for computation. The main contributions are as follows:</p><p>1) An indoor localization approach based on efficient image retrieval is proposed. 
Query images are rapidly matched against database images to obtain a set of similar image pairs for the localization calculation.</p><p>2) The vision-based indoor localization algorithm does not require a 3D model and recovers the user’s camera pose from only a few sets of 2D-2D matches, which reduces database processing in the offline phase.</p><p>3) Using a learnable feature detection and matching method together with essential matrix decomposition, we propose a new RANSAC mechanism that resolves the relative pose ambiguity and recovers the user’s absolute pose.</p></sec><sec id="s2"><title>2. Related Work</title><sec id="s2_1"><title>2.1. Image Retrieval</title><p>Image retrieval aims to find content similar to an input image in an image database. As the basis of visual indoor localization, it can effectively improve the efficiency of visual matching. Early research focused on Text-Based Image Retrieval (TBIR), which mainly includes Page-Rank methods, probabilistic methods, classification or clustering methods, and lexical annotation methods. TBIR is fast and accurate, but it requires a great deal of manual effort and time, so it cannot keep up with ever-changing retrieval needs. Content-Based Image Retrieval (CBIR) instead extracts image features by mathematically describing the visual content of an image. Early CBIR relied on local feature aggregation methods, the most representative of which are visual word representations of images and their extensions, such as Fisher vectors [<xref ref-type="bibr" rid="scirp.131240-ref6">6</xref>] and the Vector of Locally Aggregated Descriptors (VLAD) [<xref ref-type="bibr" rid="scirp.131240-ref7">7</xref>]. After 2012, the dominant role of SIFT [<xref ref-type="bibr" rid="scirp.131240-ref8">8</xref>] was gradually taken over by data-driven Deep Neural Networks (DNNs). 
The representative NetVLAD [<xref ref-type="bibr" rid="scirp.131240-ref9">9</xref>] constructs a global image descriptor for instance-level image retrieval by applying a pooling mechanism to the activations of the last convolutional feature map of a convolutional neural network. Another widely used method, MAC [<xref ref-type="bibr" rid="scirp.131240-ref10">10</xref>], focuses on regions of interest in the feature map and keeps only the most active neuron of each feature map through max pooling. Convolutional neural networks deliver the strongest retrieval performance among deep learning methods: they combine multiple convolutional and pooling layers to obtain the visual features of an image, and can be coupled with feedback and classification techniques to further improve retrieval. In [<xref ref-type="bibr" rid="scirp.131240-ref11">11</xref>], SFM information is used to guide the fine-tuning of a pretrained classification network on database images, and a pooling layer based on the generalized mean with learnable parameters is proposed, which effectively improves retrieval performance.</p></sec><sec id="s2_2"><title>2.2. Structure-Based Approach</title><p>Structure-based visual localization uses sparse feature matching to obtain 2D-3D correspondences and robust optimization to recover the camera pose. Matching-based localization links the query image to the scene through feature descriptors, and each 3D point in a 3D scene model is typically associated with one or more local descriptors. Direct matching methods, which must search every 3D point for each query feature, are very inefficient and are not robust to repeated local features. 
Coarse-to-fine hierarchical localization builds on image retrieval: it achieves accurate localization on large-scale datasets by searching for the smallest relevant subset of the scene model and computing the correspondences between the query and that subset. The hierarchical process requires accurate extraction of the query’s local features so that they can be matched with similar scene images. Melekhov et al. [<xref ref-type="bibr" rid="scirp.131240-ref12">12</xref>] proposed DGC-Net, a CNN-based method that exploits the coarse-to-fine strategy of optical flow approaches and extends it to large transformations, achieving dense, subpixel-accurate localization in complex environments; it is trained with strong supervision from per-pixel ground-truth labels. ASLFeat [<xref ref-type="bibr" rid="scirp.131240-ref13">13</xref>] exploits the inherent hierarchy of network features and proposes a new multiscale detection mechanism that improves local shape modeling, yields stronger geometric invariance, and localizes keypoints more accurately.</p><p>The scene coordinate regression approach directly predicts the correspondence between the query image and the 3D scene space. It works well on small datasets but does not scale well to larger, more complex scenes. The authors of [<xref ref-type="bibr" rid="scirp.131240-ref14">14</xref>] designed a lightweight visual localization network that uses knowledge distillation to efficiently extract deep local features for accurate localization; however, this approach requires a large number of images and dense point cloud information from Light Detection and Ranging (LiDAR) sensors.</p></sec><sec id="s2_3"><title>2.3. Regression-Based Approach</title><p>Direct regression-based visual localization learns the complete localization pipeline, replacing explicit 2D-3D matching. 
PoseNet [<xref ref-type="bibr" rid="scirp.131240-ref15">15</xref>] was the first approach to directly regress the camera pose from a single image using a Convolutional Neural Network (CNN). It employs Structure-from-Motion (SFM) to automatically generate training labels, which alleviates the burden of manual annotation. ANNet [<xref ref-type="bibr" rid="scirp.131240-ref16">16</xref>] uses discriminator networks and adversarial learning to implicitly learn the joint distribution of images and their corresponding camera poses, refining the image-based pose estimate and improving localization accuracy. In [<xref ref-type="bibr" rid="scirp.131240-ref17">17</xref>], a camera pose autoencoder based on a multi-layer perceptron is introduced to improve camera pose estimation. FeatLoc [<xref ref-type="bibr" rid="scirp.131240-ref18">18</xref>] trains the network directly on sparse feature descriptors with data augmentation, mitigating the effects of illumination changes and environmental changes.</p><p>To improve scalability, relative pose regression is typically trained on many scenes so that it can be applied to unseen ones. After the pose relative to the reference image has been determined, the absolute pose in the world coordinate system is obtained by a coordinate transformation. Relative pose regression uses a multi-stage strategy that generalizes well to new scenes. NN-Net [<xref ref-type="bibr" rid="scirp.131240-ref19">19</xref>] pioneered the use of a Siamese CNN to predict the relative pose between two input images. The method in [<xref ref-type="bibr" rid="scirp.131240-ref20">20</xref>] decouples localization from the scene by regressing the essential matrix, without adjusting the parameters. 
The work in [<xref ref-type="bibr" rid="scirp.131240-ref21">21</xref>] proposes a graph neural network for relative pose regression in which nodes represent images and edges represent image pairs.</p><p>In this paper, we combine image retrieval and pose calculation to design a visual localization scheme based on efficient image retrieval and general camera pose solving. For a user query image, a pre-trained retrieval model first efficiently returns the relevant database images, and the absolute pose is then solved from the feature correspondences. The scheme does not use a 3D structural model of the scene and can be easily applied to new indoor scenes.</p></sec></sec><sec id="s3"><title>3. System Models and Methods</title><p>This section presents the overall framework of indoor localization based on efficient image retrieval.</p><sec id="s3_1"><title>3.1. System Models</title><p>This paper proposes a general indoor localization model based on a single RGB image, which greatly reduces image retrieval time while preserving retrieval accuracy and yields more accurate localization results. The proposed method consists of two phases: an offline phase, in which indoor image data are acquired and processed and the database is built, and an online phase, in which the user’s query image is matched against the database and its pose is computed, as shown in <xref ref-type="fig" rid="fig1">Figure 1</xref>.</p><p>● During the offline phase, all images from indoor scenes are processed using a pre-trained image retrieval model to extract global descriptors, which are then used to construct an offline feature database.</p><p>● During the online phase, the same global feature extraction process is applied to the query image. The similarity between the query image’s global feature vector and each feature vector in the offline database is computed. 
The top five reference images are then selected iteratively in order of decreasing similarity score. For each pair consisting of the query image and a retrieved nearest-neighbor image, the essential matrix E is computed using the five-point algorithm. Decomposing E and removing the ambiguity in the relative pose yields the relative pose between the two images. Finally, the absolute pose of the query image, i.e. the indoor user’s pose, is estimated from the relative poses and the known absolute poses of the retrieved database images.</p></sec><sec id="s3_2"><title>3.2. Offline Data Preparation</title><p>In the offline phase, a camera or other mobile platform captures RGB images and records their poses. Specifically, each RGB image in the offline database has a corresponding ground-truth pose label: 3D spatial coordinates (x, y, z) giving the absolute position and a quaternion (w, p, q, r) giving the absolute orientation. Quaternions are used to represent the user’s camera orientation because they need only a four-dimensional vector, which avoids the singularity problem and requires less storage than the commonly used 3 &#215; 3 rotation matrix. The dataset S contains n different indoor scenes: S = {S<sub>1</sub>, S<sub>2</sub>, ⋯, S<sub>n</sub>}. For each scene S<sub>i</sub>, a global representation of that scene is created, consisting of the image names, the pose information, and the extracted global descriptors, as in <xref ref-type="table" rid="table1">Table 1</xref>.</p></sec><sec id="s3_3"><title>3.3. Image Retrieval Model</title><p>For the image retrieval module, an image feature aggregation method based on MLP is adopted [<xref ref-type="bibr" rid="scirp.131240-ref22">22</xref>]. 
Using this method’s pre-trained network, each feature map extracted from the truncated ResNet backbone is passed through a succession of feature mixers that incorporate global spatial relationships into it. The result is then projected into a compact representation space, producing the global image descriptor used for image retrieval. The structure of this network model is shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>.</p><p>Given an input RGB image, the intermediate-layer feature map F ∈ R<sup>c &#215; h &#215; w</sup> is first extracted using a ResNet model pre-trained on ImageNet. Unlike NetVLAD, the method treats the tensor F as c 2D features X<sub>i</sub> of size h &#215; w. Each 2D feature is then flattened into a 1D representation, giving F ∈ R<sup>c &#215; n</sup> with n = h &#215; w, which is fed into a Feature-Mixer consisting of L cascaded MLP blocks of identical structure, each defined as follows:</p><p><disp-formula><label>(1)</label><tex-math><![CDATA[X_i + W_2\left(\sigma\left(W_1 X_i\right)\right) \rightarrow X_i ]]></tex-math></disp-formula></p><p>where W<sub>1</sub> and W<sub>2</sub> are the weights of the two fully connected layers in the Feature-Mixer, and σ denotes the ReLU activation. 
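</p><p>As an illustration of Equation (1), the following minimal Python sketch applies the residual update to every flattened feature map. It is not the authors’ implementation: the function names (relu, matvec, feature_mixer), the toy sizes c = 3 and n = 4, and the random weights are assumptions made for this example. A practical system would use a tensor library and separate learned weights for each block.</p><preformat>
```python
import random


def relu(v):
    # Element-wise ReLU on a list of floats.
    return [max(0.0, a) for a in v]


def matvec(W, v):
    # Multiply a matrix, stored as a list of rows, by a vector.
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def feature_mixer(X, W1, W2):
    # Apply X_i + W2 * relu(W1 * X_i) to every flattened feature map X_i (Eq. 1).
    out = []
    for x in X:
        mixed = matvec(W2, relu(matvec(W1, x)))
        out.append([a + b for a, b in zip(x, mixed)])
    return out


# Toy sizes (assumptions): c = 3 feature maps, each flattened to n = 4 values.
random.seed(0)
c, n = 3, 4
X = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(c)]
W1 = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
W2 = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]

# Blocks can be cascaded because the output keeps the c x n shape.
# The same weights are reused here only to keep the sketch short.
Z = feature_mixer(X, W1, W2)
Z = feature_mixer(Z, W1, W2)
```
</preformat><p>Because each block adds its MLP output back to its input, the shape c &#215; n is preserved, which is what allows several blocks to be stacked before the final projection.</p><p>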
The output Z ∈ R<sup>c &#215; n</sup> of one Feature-Mixer block is passed on to the next Feature-Mixer block. Finally, two fully connected layers perform a depth-wise projection and a row-wise projection, which act as weighted pooling operations that control the size of the resulting global descriptor. In brief, 512-dimensional global descriptors are obtained from the feature maps of the truncated pre-trained ResNet backbone using several MLP Feature-Mixer blocks, with the network retrained on the large visual recognition dataset GSV-Cities [<xref ref-type="bibr" rid="scirp.131240-ref23">23</xref>] using the multi-similarity loss.</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title>Offline database</title></caption><table><thead><tr><th align="center" valign="middle" >Scene labels</th><th align="center" valign="middle" >S<sub>1</sub></th><th align="center" valign="middle" >...</th><th align="center" valign="middle" >S<sub>n</sub></th></tr></thead><tbody><tr><td align="center" valign="middle" >RGB images</td><td align="center" valign="middle" >I<sub>11</sub>, I<sub>12</sub>, ⋯, I<sub>1k<sub>1</sub></sub></td><td align="center" valign="middle" >...</td><td align="center" valign="middle" >I<sub>n1</sub>, I<sub>n2</sub>, ⋯, I<sub>nk<sub>n</sub></sub></td></tr><tr><td align="center" valign="middle" >Pose information</td><td align="center" valign="middle" >P<sub>11</sub>, P<sub>12</sub>, ⋯, P<sub>1k<sub>1</sub></sub></td><td align="center" valign="middle" >...</td><td align="center" valign="middle" >P<sub>n1</sub>, P<sub>n2</sub>, ⋯, P<sub>nk<sub>n</sub></sub></td></tr><tr><td align="center" valign="middle" >Global descriptors</td><td align="center" valign="middle" >D<sub>11</sub>, D<sub>12</sub>, ⋯, D<sub>1k<sub>1</sub></sub></td><td align="center" valign="middle" >...</td><td align="center" valign="middle" >D<sub>n1</sub>, D<sub>n2</sub>, ⋯, D<sub>nk<sub>n</sub></sub></td></tr></tbody></table></table-wrap><p>Iterative Selection: The top k retrieved images often have highly similar poses, which can degrade the subsequent triangulation of the camera position. 
We therefore select five retrieved images iteratively, in order of decreasing similarity score.</p><p>Local features are extracted from the database images and aggregated into fixed-length global descriptors. The same procedure is applied to the query image, yielding a feature vector of the same length. The similarity between the query image vector V<sub>q</sub> and each database image vector V<sub>i</sub> is computed, where i ∈ [1, n] and n is the number of images in the database. The results are sorted in descending order of similarity, and a fixed number of database images is returned. The vectors and the score are defined as follows:</p><p><disp-formula><label>(2)</label><tex-math><![CDATA[V_i = \left[ V_{i0}, V_{i1}, \cdots, V_{i1024} \right] ]]></tex-math></disp-formula></p><p><disp-formula><label>(3)</label><tex-math><![CDATA[V_q = \left[ V_{q0}, V_{q1}, \cdots, V_{q1024} \right] ]]></tex-math></disp-formula></p><p><disp-formula><label>(4)</label><tex-math><![CDATA[\mathrm{score}_i = V_i V_q^{T} ]]></tex-math></disp-formula></p></sec><sec id="s3_4"><title>3.4. Feature Detection and Matching</title><p>Once image retrieval has produced a set of iteratively selected query-database image pairs, the pose is estimated in four main steps: extraction of keypoints and descriptors; feature-point matching with rejection of false matches; recovery of the relative pose between each image pair from the essential matrix; and recovery of the absolute camera pose of the query image. First, SuperPoint [<xref ref-type="bibr" rid="scirp.131240-ref24">24</xref>] + LightGlue feature point detection and matching is applied to each query-database image pair. LightGlue is a simple and effective improvement on SuperGlue [<xref ref-type="bibr" rid="scirp.131240-ref25">25</xref>]: it adapts to the difficulty of each image pair, and it is both more accurate and more efficient in memory and computation. 
Note that the query image and the database images may be captured by different cameras, and differences in the cameras’ intrinsic parameters can affect localization performance. Therefore, the cameras need to be calibrated in advance to obtain more precise localization results.</p></sec><sec id="s3_5"><title>3.5. Position Estimation</title><p>Image retrieval returns the database image with the highest similarity score to the query image. Based on the epipolar constraint, the rotation and translation between the three-dimensional camera coordinate systems of the two images can be computed from the feature point matches between them. The epipolar constraint reflects the pose relationship between the query camera and the database camera, as shown in <xref ref-type="fig" rid="fig3">Figure 3</xref>. Here, R is the relative rotation matrix between the two cameras, and t is the relative translation vector. O<sub>Q</sub>X<sub>Q</sub>Y<sub>Q</sub>Z<sub>Q</sub> denotes the camera coordinate system of the query image, while O<sub>D</sub>X<sub>D</sub>Y<sub>D</sub>Z<sub>D</sub> denotes the camera coordinate system of the database image.</p><sec id="s3_5_1"><title>3.5.1. Relative Pose Estimation</title><p>For a calibrated camera, the geometric relationship between two images I<sub>q</sub> and I<sub>i</sub> can be estimated from feature point matches based on the epipolar constraint. This relationship is described by the essential matrix E:</p><p><disp-formula><label>(5)</label><tex-math><![CDATA[E = [t]_{\times} R ]]></tex-math></disp-formula></p><p>where t and R are the relative translation vector and the relative rotation matrix between the two images, and [ ]<sub>&#215;</sub> denotes the antisymmetric matrix of a vector, computed as follows:</p><p><disp-formula><label>(6)</label><tex-math><![CDATA[\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}_{\times} = \begin{bmatrix} 0 & -x_3 & x_2 \\ x_3 & 0 & -x_1 \\ -x_2 & x_1 & 0 \end{bmatrix} ]]></tex-math></disp-formula></p><p>For a pair of matched points x<sub>1</sub> = (u<sub>1</sub>, v<sub>1</sub>, 1)<sup>T</sup> and x<sub>2</sub> = (u<sub>2</sub>, v<sub>2</sub>, 1)<sup>T</sup> on the normalized image plane, the epipolar constraint can be written as a linear equation in the entries of E:</p><p><disp-formula><label>(7)</label><tex-math><![CDATA[\left[ u_1 u_2,\; u_1 v_2,\; u_1,\; v_1 u_2,\; v_1 v_2,\; v_1,\; u_2,\; v_2,\; 1 \right] \mathbf{e} = 0 ]]></tex-math></disp-formula></p><p>where e is the vector that stacks the nine entries of E row by row, so that Equation (7) is equivalent to x<sub>1</sub><sup>T</sup>Ex<sub>2</sub> = 0. Writing the same constraint for the other point pairs, E is estimated, and the relative pose R, t between the query and database images is recovered using Singular Value Decomposition (SVD). In general, decomposing E yields four candidate poses: (R, t), (R, −t), (R', t), (R', −t). This paper proposes a novel RANSAC method to resolve this pose ambiguity, instead of the traditional feature matching-based approaches that pick the correct relative pose among the four candidates. Specifically, because the query position is triangulated from the translation directions t<sub>1</sub>, t<sub>2</sub>, ⋯, t<sub>n</sub> of several image pairs, the sign of any t<sub>i</sub> can be inverted without changing the result; hence only the rotation needs to be determined. As a result, the absolute pose of the query image can be determined from n ≥ 2 image pairs.</p><p>The ground-truth transformation matrix T<sub>12</sub> between two images is defined as follows:</p><p><disp-formula><label>(8)</label><tex-math><![CDATA[T_{12} = \begin{bmatrix} R_1^{T} R_2 & R_1^{T} (t_1 - t_2) \\ 0 & 1 \end{bmatrix} ]]></tex-math></disp-formula></p><p>where (R<sub>1</sub>, t<sub>1</sub>) and (R<sub>2</sub>, t<sub>2</sub>) are the absolute poses of images I<sub>1</sub> and I<sub>2</sub>, respectively. Note that the relative transformation R<sub>1</sub><sup>T</sup>R<sub>2</sub>, R<sub>1</sub><sup>T</sup>(t<sub>1</sub> − t<sub>2</sub>) from I<sub>1</sub> to I<sub>2</sub> is expressed in the I<sub>2</sub> camera coordinate system.</p></sec><sec id="s3_5_2"><title>3.5.2. 
Absolute Pose Estimation of Query Images</title><p>For the query image I<sub>q</sub> and its two nearest-neighbor database images I<sub>i</sub> and I<sub>j</sub>, there are four possible relative rotations: R<sub>i</sub>, R<sub>i</sub>', R<sub>j</sub>, R<sub>j</sub>', which correspond to four candidate absolute rotations of I<sub>q</sub>: R<sub>i</sub>R<sub>Ii</sub>, R'<sub>i</sub>R<sub>Ii</sub>, R<sub>j</sub>R<sub>Ij</sub>, R'<sub>j</sub>R<sub>Ij</sub>, where R<sub>Ii</sub> and R<sub>Ij</sub> are the ground-truth rotations of the database images I<sub>i</sub> and I<sub>j</sub>. In theory, two of the four candidates are identical, while the others differ significantly, so a hypothesis for the absolute pose can be determined from two nearest-neighbor images. In practice, the two candidates, one from each pair, with the smallest angular difference are taken as the true ones.</p><p>Using the two image pairs (I<sub>q</sub>, I<sub>i</sub>) and (I<sub>q</sub>, I<sub>j</sub>), the absolute rotation of the query image is computed, and its position is obtained by triangulation as the intersection of two rays. The rays l<sub>1</sub>, l<sub>2</sub> are given by:</p><p><disp-formula><label>(9)</label><tex-math><![CDATA[l_1 = c_{I_i} + \lambda_i R_{I_i}^{T} R_i t_i ]]></tex-math></disp-formula></p><p><disp-formula><label>(10)</label><tex-math><![CDATA[l_2 = c_{I_j} + \lambda_j R_{I_j}^{T} R_j t_j ]]></tex-math></disp-formula></p><p>where λ<sub>i</sub>, λ<sub>j</sub> ∈ R define the positions of points along the rays, and c<sub>I</sub> = −R<sub>I</sub><sup>T</sup>t<sub>I</sub> is the global coordinate of the camera center. The triangulation is well defined only when the centers of the three cameras are not collinear. In our experiments, we use the five retrieved nearest-neighbor database images to calculate the final pose.</p><p>In other words, given two image pairs (I<sub>q</sub>, I<sub>i</sub>) and (I<sub>q</sub>, I<sub>j</sub>), a pose hypothesis (R<sub>Iq</sub>, t<sub>Iq</sub>) is obtained. 
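</p><p>The triangulation described by Equations (9) and (10) can be sketched as follows. Because two estimated rays rarely intersect exactly, this example returns the midpoint of the shortest segment between the two lines, a common least-squares choice that the paper does not specify. The function names and the sample coordinates are hypothetical, and the ray directions d1 and d2 stand for the direction terms that multiply λ in Equations (9) and (10).</p><preformat>
```python
def dot(u, v):
    # Inner product of two 3D vectors given as lists.
    return sum(a * b for a, b in zip(u, v))


def triangulate_two_rays(c1, d1, c2, d2, eps=1e-12):
    # Midpoint of the shortest segment between the lines
    # c1 + l1 * d1 and c2 + l2 * d2 (least-squares intersection).
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if not abs(denom) > eps:
        # Parallel directions, e.g. when the three camera centers are collinear.
        raise ValueError("rays are parallel; triangulation is undefined")
    l1 = (b * e - c * d) / denom
    l2 = (a * e - b * d) / denom
    p1 = [p + l1 * q for p, q in zip(c1, d1)]
    p2 = [p + l2 * q for p, q in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]


# Hypothetical camera centers and ray directions in world coordinates.
query_position = triangulate_two_rays(
    [0.0, 0.0, 0.0], [1.0, 1.0, 0.0],
    [2.0, 0.0, 0.0], [-1.0, 1.0, 0.0],
)
print(query_position)  # [1.0, 1.0, 0.0]
```
</preformat><p>Inverting the sign of a direction only changes the sign of the corresponding λ, so the recovered point stays the same. This is consistent with the observation that the sign of each translation direction does not need to be resolved before triangulation.</p><p>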
For any query-database image pair (I<sub>q</sub>, I<sub>m</sub>), the rotation R<sub>m</sub> is chosen among the four candidate solutions such that R<sub>m</sub>R<sub>Im</sub> is closest to R<sub>Iq</sub>, and the predicted relative translation from I<sub>q</sub> to I<sub>m</sub> is defined as:</p><p><disp-formula><label>(11)</label><tex-math><![CDATA[t_{pre} = R_{I_m} \left( c_{I_q} - c_{I_m} \right) ]]></tex-math></disp-formula></p><p><disp-formula><label>(12)</label><tex-math><![CDATA[\alpha = \cos^{-1}\left( \frac{t_m^{T} t_{pre}}{\left\| t_m \right\|_2 \left\| t_{pre} \right\|_2} \right) ]]></tex-math></disp-formula></p><p>Equation (12) defines the angle α between the translation direction t<sub>m</sub> estimated from the image pair and the predicted translation direction t<sub>pre</sub>. The pair is counted as an inlier when this angle is smaller than a given threshold, as depicted in <xref ref-type="fig" rid="fig4">Figure 4</xref>. After the inliers of every pose hypothesis have been counted over all image pairs, the hypothesis with the largest number of inliers is selected as the output.</p></sec><sec id="s3_5_3"><title>3.5.3. Evaluation Metrics</title><p>In vision-based indoor localization, the proposed camera pose estimation method is evaluated by comparing the estimated poses with the ground-truth poses and measuring how close they are. Specifically, the pose accuracy is measured by the deviation between the estimated pose and the ground-truth pose, i.e. 
the absolute pose error of the query image.</p><p>The absolute pose error combines an absolute position error and an absolute orientation error. The position error is the Euclidean distance, in meters, between the estimated position of the query image and the recorded ground truth:</p><p>$t_{abs\_err} = \| t_{abs\_gt} - t_{abs\_pre} \|_2$ (13)</p><p>The absolute orientation error is expressed in degrees and represents the minimum rotation angle required to align the computed orientation with the ground-truth orientation:</p><p>$rot_{abs\_err} = 2 \arccos \left( | q_{abs\_gt} \cdot q_{abs\_pre} | \right) \frac{180^{\circ}}{\pi}$ (14)</p><p>where the quaternion q<sub>abs_gt</sub> is the recorded ground-truth orientation of the query image, the quaternion q<sub>abs_pre</sub> is the computed orientation of the query image, rot<sub>abs_err</sub> is the error between the predicted absolute orientation and the ground truth, and arccos is the inverse cosine function.</p></sec></sec></sec><sec id="s4"><title>4. Experiments</title><p>This section evaluates the indoor visual localization method based on efficient image retrieval and essential matrix decomposition. The effectiveness and versatility of the RANSAC-based indoor localization approach are demonstrated by measuring the absolute position error in meters and the absolute orientation error in degrees on two publicly available indoor datasets.</p><sec id="s4_1"><title>4.1. Datasets</title><p>7Scenes [<xref ref-type="bibr" rid="scirp.131240-ref26">26</xref>] was recorded by a handheld Kinect RGB-D camera and contains 7 scenes with a total of 43,000 images. 
All scenes were recorded in an office building, and each scene typically consists of a single room with a spatial extent of less than 4 meters. The images contain many blurred and textureless regions, which makes the dataset very challenging.</p><p>12Scenes [<xref ref-type="bibr" rid="scirp.131240-ref27">27</xref>] is a dataset of four large scenes (12 rooms) captured with a Structure.io depth sensor and an iPad color camera. It pushes the boundaries of RGB-D and RGB camera relocalization and covers significantly larger environments than the 7Scenes dataset. A total of 22,628 images remain after removing the 233 images with anomalous pose labels.</p><p>Sample images of the two datasets are shown in <xref ref-type="fig" rid="fig5">Figure 5</xref> and <xref ref-type="fig" rid="fig6">Figure 6</xref>.</p></sec><sec id="s4_2"><title>4.2. Image Retrieval Performance</title><p>Unlike classical global image aggregation methods, this paper takes a feature map of size 256 &#215; 20 &#215; 20 from a ResNet pre-trained on ImageNet and cropped at an intermediate layer (the second ResNet residual block). Global spatial feature relations are then learned on this feature map by four MLP feature mixing blocks, which are trained on a large visual recognition dataset with the multi-similarity loss.</p><p>The network was optimized using Stochastic Gradient Descent (SGD) with an initial learning rate of 0.05, a momentum of 0.9, and a weight decay of 0.001, and was trained for a total of 80 epochs. 
The image retrieval model extracts the feature map from an intermediate layer, which reduces the number of parameters by at least half, since the last layer contains the majority of the pre-trained backbone’s parameters.</p><p>The image retrieval model trained in this paper adapts well to large scene datasets with significant variations and can be used for large-scale visual scene recognition, as tested on the Pitts250k-test database [<xref ref-type="bibr" rid="scirp.131240-ref28">28</xref>] and the MSLS database [<xref ref-type="bibr" rid="scirp.131240-ref29">29</xref>], which cover a wide range of illumination and viewpoint variations. <xref ref-type="table" rid="table2">Table 2</xref> reports recall@k for the proposed MLP-based image retrieval method. On Pitts250k-test, the proposed method reaches a recall@1 of 93.2%, a significant improvement over the widely used Generalized Mean (GeM) [<xref ref-type="bibr" rid="scirp.131240-ref11">11</xref>] method. CosPlace [<xref ref-type="bibr" rid="scirp.131240-ref30">30</xref>] converts the training into a classification problem, avoiding the expensive mining required by the commonly used contrastive learning and achieving better results.</p><p>It is worth noting that, in the proposed retrieval-based visual localization process, feature extraction and matching can seriously affect the pose estimation results, so the image pairs should share as many feature points as possible. The query image and the retrieved image must therefore share common regions; that is, the purpose of retrieval is to find, quickly and reliably, the database images that are most similar at the global level, rather than to compare local information. Likewise, the query does not have to return the five most similar database images; it only needs to return five database images with sufficient visual overlap. 
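</p><p>The recall@k values in <xref ref-type="table" rid="table2">Table 2</xref> and the retrieval of five database images per query can be illustrated with a short Python sketch. It is not the authors' code: the use of cosine similarity between global descriptors and the function names cosine_similarity, top_k, and recall_at_k are assumptions made for illustration only.</p><preformat><![CDATA[
```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two global descriptors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def top_k(query, database, k=5):
    """Indices of the k database descriptors most similar to the query."""
    order = sorted(range(len(database)),
                   key=lambda i: cosine_similarity(query, database[i]),
                   reverse=True)
    return order[:k]


def recall_at_k(rankings, positives, k):
    """Fraction of queries with at least one true match in the top k.

    rankings: one ranked list of database indices per query.
    positives: one set of correct database indices per query.
    """
    hits = sum(1 for ranked, pos in zip(rankings, positives)
               if pos.intersection(ranked[:k]))
    return hits / len(rankings)


if __name__ == "__main__":
    database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(top_k([1.0, 0.1], database, k=2))
    print(recall_at_k([[0, 2, 1], [1, 2, 0]], [{2}, {0}], k=2))
```
]]></preformat><p>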
The MLP-based image retrieval model greatly shortens both the offline database construction time and the online retrieval time, which satisfies the query requirements of the localization process. Therefore, the number of matching features between images is calculated to evaluate the results of image retrieval.</p><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title> Comparison of several methods on well-known benchmarks, all trained with ResNet-50 on the same dataset. The proposed MLP-based image retrieval method achieves the best result on every metric except MSLS R@1</title></caption><table><tbody><thead><tr><th align="center" valign="middle"  rowspan="2"  >Methods</th><th align="center" valign="middle"  colspan="3"  >Pitts250k-test</th><th align="center" valign="middle"  colspan="3"  >MSLS</th></tr></thead><tr><td align="center" valign="middle" >R@1</td><td align="center" valign="middle" >R@5</td><td align="center" valign="middle" >R@10</td><td align="center" valign="middle" >R@1</td><td align="center" valign="middle" >R@5</td><td align="center" valign="middle" >R@10</td></tr><tr><td align="center" valign="middle" >GeM [<xref ref-type="bibr" rid="scirp.131240-ref11">11</xref>]</td><td align="center" valign="middle" >82.9</td><td align="center" valign="middle" >92.1</td><td align="center" valign="middle" >94.3</td><td align="center" valign="middle" >76.5</td><td align="center" valign="middle" >85.7</td><td align="center" valign="middle" >88.2</td></tr><tr><td align="center" valign="middle" >CosPlace [<xref ref-type="bibr" rid="scirp.131240-ref30">30</xref>]</td><td align="center" valign="middle" >91.5</td><td align="center" valign="middle" >96.9</td><td align="center" valign="middle" >97.9</td><td align="center" valign="middle" >84.5</td><td align="center" valign="middle" >90.1</td><td align="center" valign="middle" >91.8</td></tr><tr><td align="center" valign="middle" >Ours</td><td align="center" valign="middle" >93.2</td><td align="center" valign="middle" >97.9</td><td align="center" valign="middle" >98.6</td><td align="center" valign="middle" >84.1</td><td align="center" valign="middle" >91.8</td><td align="center" valign="middle" >94.3</td></tr></tbody></table></table-wrap><p>The number of good matches with a confidence greater than 0.2, returned by SuperPoint feature extraction and LightGlue matching, is shown in <xref ref-type="table" rid="table3">Table 3</xref>.</p><p><xref ref-type="fig" rid="fig7">Figure 7</xref> shows the visualization of SuperPoint feature extraction with LightGlue matching on 7Scenes and 12Scenes.</p><p>The computation of the MLP feature mixer based image retrieval method consists mainly of the matrix multiplications of the fully connected layers, which accelerates the computation and reduces the memory usage compared with the more complex self-attention. It takes only 7 milliseconds to generate a global description of an image.</p><p>During the online phase, after the query-database image pairs are obtained by image retrieval, a suitable threshold is selected by grid search to distinguish inliers from outliers. The outliers are removed, and the remaining well-matched point pairs are used to compute the essential matrix of each image pair with the 5-point method in RANSAC. The relative pose is then obtained, the pose ambiguity is removed, and the absolute pose is recovered according to the proposed RANSAC mechanism.</p><table-wrap id="table3" ><label><xref ref-type="table" 
rid="table3">Table 3</xref></label><caption><title> Average number of good matches in two indoor datasets</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >NN-search</th><th align="center" valign="middle" >1</th><th align="center" valign="middle" >2</th><th align="center" valign="middle" >3</th><th align="center" valign="middle" >4</th><th align="center" valign="middle" >5</th><th align="center" valign="middle" >Average</th></tr></thead><tr><td align="center" valign="middle" >7Scenes</td><td align="center" valign="middle" >508.6</td><td align="center" valign="middle" >450.7</td><td align="center" valign="middle" >424.2</td><td align="center" valign="middle" >387.1</td><td align="center" valign="middle" >332.6</td><td align="center" valign="middle" >420.6</td></tr><tr><td align="center" valign="middle" >12Scenes</td><td align="center" valign="middle" >488.3</td><td align="center" valign="middle" >412.6</td><td align="center" valign="middle" >387.1</td><td align="center" valign="middle" >340.5</td><td align="center" valign="middle" >297.3</td><td align="center" valign="middle" >371.4</td></tr></tbody></table></table-wrap></sec><sec id="s4_3"><title>4.3. Localization Accuracy</title><p>We compared the proposed model with several recent camera relocalization methods. These methods fall into two major categories: 1) Absolute Pose Regression (APR) localization methods, and 2) Relative Pose Regression (RPR) localization methods. The proposed method belongs to the second category. <xref ref-type="table" rid="table4">Table 4</xref> reports the localization performance of the proposed method on the 7Scenes test dataset.</p><p>Experimental results show that our method reduces both positional and angular errors. 
The proposed localization method achieves the lowest position and orientation errors in almost every indoor scene of the 7Scenes dataset, with an average position error of 0.09 m and an average orientation error of 2.14˚ on this dataset. However, because the environments of the individual scenes differ greatly, the localization error also varies considerably across scenes. The best performance is obtained on Heads, with a position error of only 3 cm and an orientation error of 2.16˚, whereas localization performs poorly on Stairs, with errors as high as 0.21 m and 3.47˚. We believe this is most likely due to the highly repetitive structures in the staircase images, which make the extracted feature points poorly distinguishable. We will investigate this in future work.</p><p>12Scenes is another indoor scene dataset, and its recorded indoor environments are significantly larger than those of 7Scenes. Since few studies report results on the 12Scenes dataset, we reproduced the classical relative pose regression networks NN-Net [<xref ref-type="bibr" rid="scirp.131240-ref19">19</xref>] and NC-EssNet [<xref ref-type="bibr" rid="scirp.131240-ref20">20</xref>] in strict accordance with the settings of the original papers; the results are shown in <xref ref-type="table" rid="table5">Table 5</xref>.</p><p>As shown in <xref ref-type="fig" rid="fig7">Figure 7</xref>, the indoor images recorded in the 12Scenes dataset have poor lighting conditions and are more blurred, which poses a challenge to the localization task. 
The experimental results on the 12Scenes dataset in <xref ref-type="table" rid="table5">Table 5</xref> are clearly worse than those on 7Scenes, but the proposed indoor localization method still obtains a significant improvement over the other methods, with an average position error of 0.15 m and an average orientation error of 6.37˚. Compared with the other visual localization methods mentioned above, it achieves the best results in terms of position error, but its performance is slightly worse in estimating the camera orientation.</p><table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title> Median position and rotation errors for different relocalization methods on the 7Scenes dataset</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Scene</th><th align="center" valign="middle" >NN-Net [<xref ref-type="bibr" rid="scirp.131240-ref19">19</xref>]</th><th align="center" valign="middle" >NC-EssNet [<xref ref-type="bibr" rid="scirp.131240-ref20">20</xref>]</th><th align="center" valign="middle" >GRNet [<xref ref-type="bibr" rid="scirp.131240-ref21">21</xref>]</th><th align="center" valign="middle" >FeatLoc [<xref ref-type="bibr" rid="scirp.131240-ref18">18</xref>]</th><th align="center" valign="middle" >Ours</th></tr></thead><tr><td align="center" valign="middle" >Chess</td><td align="center" valign="middle" >0.13 m, 6.5˚</td><td align="center" valign="middle" >0.12 m, 5.6˚</td><td align="center" valign="middle" >0.08 m, 12.4˚</td><td align="center" valign="middle" >0.07 m, 3.66˚</td><td align="center" valign="middle" >0.05 m, 1.51˚</td></tr><tr><td align="center" valign="middle" >Fire</td><td align="center" valign="middle" >0.26 m, 12.7˚</td><td align="center" valign="middle" >0.26 m, 12.7˚</td><td align="center" valign="middle" >0.21 m, 7.5˚</td><td align="center" valign="middle" >0.17 m, 5.95˚</td><td align="center" valign="middle" >0.07 m, 1.96˚</td></tr><tr><td align="center" valign="middle" >Heads</td><td align="center" valign="middle" >0.14 m, 12.3˚</td><td align="center" valign="middle" >0.14 m, 10.7˚</td><td align="center" valign="middle" >0.13 m, 8.7˚</td><td align="center" valign="middle" >0.10 m, 7.57˚</td><td align="center" valign="middle" >0.03 m, 2.16˚</td></tr><tr><td align="center" valign="middle" >Office</td><td align="center" valign="middle" >0.21 m, 7.4˚</td><td align="center" valign="middle" >0.20 m, 6.7˚</td><td align="center" valign="middle" >0.15 m, 4.1˚</td><td align="center" valign="middle" >0.16 m, 5.20˚</td><td align="center" valign="middle" >0.07 m, 1.64˚</td></tr><tr><td align="center" valign="middle" >Pumpkin</td><td align="center" valign="middle" >0.24 m, 6.4˚</td><td align="center" valign="middle" >0.22 m, 5.7˚</td><td align="center" valign="middle" >0.15 m, 3.5˚</td><td align="center" valign="middle" >0.11 m, 3.86˚</td><td align="center" valign="middle" >0.10 m, 2.06˚</td></tr><tr><td align="center" valign="middle" >RedKitchen</td><td align="center" valign="middle" >0.24 m, 8.0˚</td><td align="center" valign="middle" >0.22 m, 6.3˚</td><td align="center" valign="middle" >0.19 m, 3.7˚</td><td align="center" valign="middle" >0.20 m, 6.43˚</td><td align="center" valign="middle" >0.08 m, 2.16˚</td></tr><tr><td align="center" valign="middle" >Stairs</td><td align="center" valign="middle" >0.27 m, 11.8˚</td><td align="center" valign="middle" >0.31 m, 7.9˚</td><td align="center" valign="middle" >0.22 m, 6.5˚</td><td align="center" valign="middle" >0.16 m, 8.57˚</td><td align="center" valign="middle" >0.21 m, 3.47˚</td></tr><tr><td align="center" valign="middle" >Average</td><td align="center" valign="middle" >0.21 m, 9.3˚</td><td align="center" valign="middle" >0.21 m, 7.5˚</td><td align="center" valign="middle" >0.16 m, 5.2˚</td><td align="center" valign="middle" >0.14 m, 5.89˚</td><td align="center" valign="middle" >0.09 m, 2.14˚</td></tr></tbody></table></table-wrap><table-wrap id="table5" ><label><xref ref-type="table" rid="table5">Table 5</xref></label><caption><title> Median position and rotation errors for different relocalization methods on the 12Scenes dataset</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >Scene</th><th align="center" valign="middle" >Volume</th><th align="center" valign="middle" >NN-Net [<xref ref-type="bibr" rid="scirp.131240-ref19">19</xref>]</th><th align="center" valign="middle" >NC-EssNet [<xref ref-type="bibr" rid="scirp.131240-ref20">20</xref>]</th><th align="center" valign="middle" >FeatLoc [<xref ref-type="bibr" rid="scirp.131240-ref18">18</xref>]</th><th align="center" valign="middle" >Ours</th></tr></thead><tr><td align="center" valign="middle" >apt1_kitchen</td><td align="center" valign="middle" >33 m<sup>3</sup></td><td align="center" valign="middle" >0.22 m, 6.76˚</td><td align="center" valign="middle" >0.15 m, 11.53˚</td><td align="center" valign="middle" >0.32 m, 5.19˚</td><td align="center" valign="middle" >0.14 m, 7.58˚</td></tr><tr><td align="center" valign="middle" >apt1_living</td><td align="center" valign="middle" >30 m<sup>3</sup></td><td align="center" valign="middle" >0.25 m, 5.45˚</td><td align="center" valign="middle" >0.19 m, 6.57˚</td><td align="center" valign="middle" >0.26 m, 0.14˚</td><td align="center" valign="middle" >0.18 m, 5.89˚</td></tr><tr><td align="center" valign="middle" >apt2_bed</td><td align="center" valign="middle" >14 m<sup>3</sup></td><td align="center" valign="middle" >0.46 m, 6.13˚</td><td align="center" valign="middle" >0.21 m, 6.70˚</td><td align="center" valign="middle" >0.37 m, 5.39˚</td><td align="center" valign="middle" >0.08 m, 4.32˚</td></tr><tr><td align="center" valign="middle" >apt2_kitchen</td><td align="center" valign="middle" >21 m<sup>3</sup></td><td align="center" valign="middle" >0.83 m, 36.03˚</td><td align="center" valign="middle" >0.18 m, 9.39˚</td><td align="center" valign="middle" >0.73 m, 6.37˚</td><td align="center" valign="middle" >0.09 m, 8.61˚</td></tr><tr><td align="center" valign="middle" >apt2_living</td><td align="center" valign="middle" >42 m<sup>3</sup></td><td align="center" valign="middle" >0.23 m, 5.24˚</td><td align="center" valign="middle" >0.17 m, 8.18˚</td><td align="center" valign="middle" >0.40 m, 5.71˚</td><td align="center" valign="middle" >0.13 m, 5.73˚</td></tr><tr><td align="center" valign="middle" >apt2_luke</td><td align="center" valign="middle" >53 m<sup>3</sup></td><td align="center" valign="middle" >0.54 m, 6.26˚</td><td align="center" valign="middle" >0.23 m, 8.03˚</td><td align="center" valign="middle" >0.33 m, 4.85˚</td><td align="center" valign="middle" >0.17 m, 7.83˚</td></tr><tr><td align="center" valign="middle" >office1_gates362</td><td align="center" valign="middle" >29 m<sup>3</sup></td><td align="center" valign="middle" >0.27 m, 5.27˚</td><td align="center" valign="middle" >0.16 m, 5.47˚</td><td align="center" valign="middle" >0.52 m, 5.22˚</td><td align="center" valign="middle" >0.14 m, 4.73˚</td></tr><tr><td align="center" valign="middle" >office1_gates381</td><td align="center" valign="middle" >44 m<sup>3</sup></td><td align="center" valign="middle" >0.44 m, 7.27˚</td><td align="center" valign="middle" >0.28 m, 12.00˚</td><td align="center" valign="middle" >0.42 m, 6.23˚</td><td align="center" valign="middle" >0.24 m, 7.93˚</td></tr><tr><td align="center" valign="middle" >office1_lounge</td><td align="center" valign="middle" >38 m<sup>3</sup></td><td align="center" valign="middle" >0.53 m, 5.72˚</td><td align="center" valign="middle" >0.31 m, 7.01˚</td><td align="center" valign="middle" >0.39 m, 4.50˚</td><td align="center" valign="middle" >0.26 m, 6.26˚</td></tr><tr><td align="center" valign="middle" >office1_manolis</td><td align="center" valign="middle" >50 m<sup>3</sup></td><td align="center" valign="middle" >0.27 m, 5.66˚</td><td align="center" valign="middle" >0.19 m, 6.81˚</td><td align="center" valign="middle" >0.30 m, 4.67˚</td><td align="center" valign="middle" >0.15 m, 6.82˚</td></tr><tr><td align="center" valign="middle" >office2_5a</td><td align="center" valign="middle" >38 m<sup>3</sup></td><td align="center" valign="middle" >0.29 m, 5.50˚</td><td align="center" valign="middle" >0.20 m, 5.09˚</td><td align="center" valign="middle" >0.31 m, 4.32˚</td><td align="center" valign="middle" >0.11 m, 5.03˚</td></tr><tr><td align="center" valign="middle" >office2_5b</td><td align="center" valign="middle" >79 m<sup>3</sup></td><td align="center" valign="middle" >0.29 m, 5.07˚</td><td align="center" valign="middle" >0.21 m, 5.78˚</td><td align="center" valign="middle" >0.23 m, 4.14˚</td><td align="center" valign="middle" >0.15 m, 5.74˚</td></tr><tr><td align="center" valign="middle" >Average</td><td align="center" valign="middle" >39 m<sup>3</sup></td><td align="center" valign="middle" >0.35 m, 8.28˚</td><td align="center" valign="middle" >0.21 m, 8.04˚</td><td align="center" valign="middle" >0.38 m, 5.04˚</td><td align="center" valign="middle" >0.15 m, 6.37˚</td></tr></tbody></table></table-wrap><p><xref ref-type="fig" rid="fig8">Figure 8</xref> and <xref ref-type="fig" rid="fig9">Figure 9</xref> show the performance of the proposed indoor localization method more visually. In the 7Scenes dataset, 82.47% of the query images have a localization error of less than 0.5 meters and 80% of the query images have an orientation error of less than 4 degrees, while more than 90% of the query images in the 12Scenes dataset have a localization accuracy of less than meters and a maximum orientation error of less than 12 degrees. 
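</p><p>Percentages of this kind follow from the error definitions in Equations (13) and (14). The Python sketch below illustrates the calculation under stated assumptions and is not the authors' evaluation code: the function names are hypothetical, quaternions are plain 4-element sequences, and they are normalized inside the function as a precaution that the equations themselves do not require.</p><preformat><![CDATA[
```python
import math


def position_error(t_gt, t_pre):
    """Equation (13): Euclidean distance between two positions, in meters."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_gt, t_pre)))


def rotation_error_deg(q_gt, q_pre):
    """Equation (14): angle in degrees between two orientation quaternions."""
    # Normalization is an added safeguard for non-unit input quaternions.
    norm = (math.sqrt(sum(a * a for a in q_gt))
            * math.sqrt(sum(b * b for b in q_pre)))
    dot = abs(sum(a * b for a, b in zip(q_gt, q_pre))) / norm
    # Clamp to 1 so that rounding noise cannot leave the domain of acos.
    return 2.0 * math.acos(min(1.0, dot)) * 180.0 / math.pi


def fraction_below(errors, threshold):
    """Share of queries whose error is strictly below the threshold."""
    return sum(1 for e in errors if e < threshold) / len(errors)


if __name__ == "__main__":
    q_identity = [1.0, 0.0, 0.0, 0.0]
    q_quarter_turn = [math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)]
    print(position_error([0, 0, 0], [3, 4, 0]))
    print(rotation_error_deg(q_identity, q_quarter_turn))
    print(fraction_below([0.1, 0.6, 0.3, 0.4], 0.5))
```
]]></preformat><p>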
Compared with the 7Scenes results, the localization results on the 12Scenes dataset are poorer, because the indoor environments recorded in 12Scenes are significantly larger than those in 7Scenes and because its images have poorer lighting conditions and more blur, which poses a challenge to the localization task.</p></sec></sec><sec id="s5"><title>5. Conclusions</title><p>In this paper, an indoor visual localization method based on efficient image retrieval and relative pose calculation is proposed. A novel image retrieval method based on CNN cropping and MLP aggregation generates compact global descriptions by iteratively learning global spatial relations on the feature maps of the pre-trained network. The retrieval method is computationally efficient because the fully connected layers mainly perform matrix multiplications, unlike the self-attention mechanism, whose complexity scales quadratically. The offline phase takes only 7 ms to generate a global description of an image.</p><p>The online localization phase performs efficient image retrieval with the pre-trained retrieval model to obtain matching images with pose labels and to construct query-database image pairs. A set of CNN features with original images and poses can represent the whole region. Then, a feature-point correspondence strategy is applied, and the relative pose ambiguity problem is solved by a novel RANSAC mechanism to estimate the exact location and orientation of the query image. Experimental results on two publicly available indoor localization datasets show that our monocular vision-based indoor pose estimation method produces highly accurate localization results. 
The proposed indoor localization method is suitable for scenes lacking depth information and has excellent cross-scene generalization capabilities, without the need for complex preprocessing in the offline phase and without relying on a 3D structural model of the scene. We also observe that image data with poor lighting conditions and blur can have a large negative impact on the localization results, which is a direction for our future work.</p></sec><sec id="s6"><title>Acknowledgements</title><p>This paper was supported by the National Natural Science Foundation of China under the Youth Foundation Program (62001272).</p></sec><sec id="s7"><title>Conflicts of Interest</title><p>The authors declare no conflicts of interest regarding the publication of this paper.</p></sec><sec id="s8"><title>Cite this paper</title><p>Lyu, M.Y., Guo, X.X., Zhang, K.P. and Zhang, L.Y. (2024) A Visual Indoor Localization Method Based on Efficient Image Retrieval. Journal of Computer and Communications, 12, 47-66. https://doi.org/10.4236/jcc.2024.122004</p></sec></body><back><ref-list><title>References</title><ref id="scirp.131240-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Lim, C.H., Wan, Y., Ng, B.P. and See, C.M.S. (2007) A Real-Time Indoor Wifi Localization System Utilizing Smart Antennas. IEEE Transactions on Consumer Electronics, 53, 618-622. https://doi.org/10.1109/TCE.2007.381737</mixed-citation></ref><ref id="scirp.131240-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Hazas, M. and Hopper, A. (2006) Broadband Ultrasonic Location Systems for Improved Indoor Positioning. IEEE Transactions on Mobile Computing, 5, 536-547. https://doi.org/10.1109/TMC.2006.57</mixed-citation></ref><ref id="scirp.131240-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">De Angelis, A., Dwivedi, S. and Händel, P. 
(2013) Characterization of a Flexible Uwb Sensor for Indoor Localization. IEEE Transactions on Instrumentation and Measurement, 62, 905-913. https://doi.org/10.1109/TIM.2013.2243501</mixed-citation></ref><ref id="scirp.131240-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Subbu, K.P., Gozick, B. and Dantu, R. (2013) LocateMe: Magnetic-Fields-Based Indoor Localization Using Smartphones. ACM Transactions on Intelligent Systems and Technology, 4, 1-27. https://doi.org/10.1145/2508037.2508054</mixed-citation></ref><ref id="scirp.131240-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Liu, L., Li, H. and Dai, Y. (2017) Efficient Global 2D-3D Matching for Camera Localization in a Large-Scale 3D Map. Proceedings of the IEEE International Conference on Computer Vision, Venice, 22-29 October 2017, 2391-2400. https://doi.org/10.1109/ICCV.2017.260</mixed-citation></ref><ref id="scirp.131240-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Perronnin, F. and Dance, C. (2007) Fisher Kernels on Visual Vocabularies for Image Categorization. 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, 17-22 June 2007, 1-8. https://doi.org/10.1109/CVPR.2007.383266</mixed-citation></ref><ref id="scirp.131240-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Jégou, H., Douze, M., Schmid, C. and Pérez, P. (2010) Aggregating Local Descriptors into a Compact Image Representation. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, 13-18 June 2010, 3304-3311. https://doi.org/10.1109/CVPR.2010.5540039</mixed-citation></ref><ref id="scirp.131240-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Lowe, D.G. (2004) Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60, 91-110. 
https://doi.org/10.1023/B:VISI.0000029664.99615.94</mixed-citation></ref><ref id="scirp.131240-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T. and Sivic, J. (2017) NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 1437-1451. https://doi.org/10.1109/TPAMI.2017.2711011</mixed-citation></ref><ref id="scirp.131240-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Babenko, A. and Lempitsky, V. (2015) Aggregating Local Deep Features for Image Retrieval. Proceedings of the IEEE International Conference on Computer Vision, Santiago, 7-13 December 2015, 1269-1277.</mixed-citation></ref><ref id="scirp.131240-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Radenović, F., Tolias, G. and Chum, O. (2018) Fine-Tuning CNN Image Retrieval with No Human Annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41, 1655-1668. https://doi.org/10.1109/TPAMI.2018.2846566</mixed-citation></ref><ref id="scirp.131240-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Melekhov, I., Tiulpin, A., Sattler, T., et al. (2019) DGC-NET: Dense Geometric Correspondence Network. 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, 7-11 January 2019, 1034-1042. https://doi.org/10.1109/WACV.2019.00115</mixed-citation></ref><ref id="scirp.131240-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Luo, Z., Zhou, L., Bai, X., et al. (2020) ASLFeat: Learning Local Features of Accurate Shape and Localization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, 13-19 June 2020, 6588-6597. 
https://doi.org/10.1109/CVPR42600.2020.00662</mixed-citation></ref><ref id="scirp.131240-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Shi, C., Li, J., Gong, J., Yang, B. and Zhang, G. (2022) An Improved Lightweight Deep Neural Network with Knowledge Distillation for Local Feature Extraction and Visual Localization Using Images and LiDAR Point Clouds. ISPRS Journal of Photogrammetry and Remote Sensing, 184, 177-188. https://doi.org/10.1016/j.isprsjprs.2021.12.011</mixed-citation></ref><ref id="scirp.131240-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Kendall, A., Grimes, M. and Cipolla, R. (2015) PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, 7-13 December 2015, 2938-2946. https://doi.org/10.1109/ICCV.2015.336</mixed-citation></ref><ref id="scirp.131240-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Bui, M., Baur, C., Navab, N., Ilic, S. and Albarqouni, S. (2019) Adversarial Networks for Camera Pose Regression and Refinement. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, 27-28 October 2019, 3778-3787. https://doi.org/10.1109/ICCVW.2019.00470</mixed-citation></ref><ref id="scirp.131240-ref17"><label>17</label><mixed-citation publication-type="book" xlink:type="simple">Shavit, Y. and Keller, Y. (2022) Camera Pose Auto-Encoders for Improving Pose Regression. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M. and Hassner, T., Eds., Computer Vision—ECCV 2022, Springer, Cham, 140-157. https://doi.org/10.1007/978-3-031-20080-9_9</mixed-citation></ref><ref id="scirp.131240-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Bach, T.B., Dinh, T.T. and Lee, J.H. (2022) FeatLoc: Absolute Pose Regressor for Indoor 2D Sparse Features with Simplistic View Synthesizing. 
ISPRS Journal of Photogrammetry and Remote Sensing, 189, 50-62. https://doi.org/10.1016/j.isprsjprs.2022.04.021</mixed-citation></ref><ref id="scirp.131240-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">Laskar, Z., Melekhov, I., Kalia, S. and Kannala, J. (2017) Camera Relocalization by Computing Pairwise Relative Poses Using Convolutional Neural Network. Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, 22-29 October 2017, 920-929. https://doi.org/10.1109/ICCVW.2017.113</mixed-citation></ref><ref id="scirp.131240-ref20"><label>20</label><mixed-citation publication-type="other" xlink:type="simple">Zhou, Q., Sattler, T., Pollefeys, M. and Leal-Taixe, L. (2020) To Learn or Not to Learn: Visual Localization from Essential Matrices. 2020 IEEE International Conference on Robotics and Automation, Paris, 31 May-31 August 2020, 3319-3326. https://doi.org/10.1109/ICRA40945.2020.9196607</mixed-citation></ref><ref id="scirp.131240-ref21"><label>21</label><mixed-citation publication-type="other" xlink:type="simple">Turkoglu, M.O., Brachmann, E., Schindler, K., Brostow, G.J. and Monszpart, A. (2021) Visual Camera Re-Localization Using Graph Neural Networks and Relative Pose Supervision. 2021 International Conference on 3D Vision (3DV), London, 1-3 December 2021, 145-155. https://doi.org/10.1109/3DV53792.2021.00025</mixed-citation></ref><ref id="scirp.131240-ref22"><label>22</label><mixed-citation publication-type="other" xlink:type="simple">Ali-Bey, A., Chaib-Draa, B. and Giguere, P. (2023) Mixvpr: Feature Mixing for Visual Place Recognition. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, 2-7 January 2023, 2997-3006. https://doi.org/10.1109/WACV56688.2023.00301</mixed-citation></ref><ref id="scirp.131240-ref23"><label>23</label><mixed-citation publication-type="other" xlink:type="simple">Alibey, A., Chaibdraa, B. and Giguère, P. 
(2022) GSV-Cities: Toward Appropriate Supervised Visual Place Recognition. Neurocomputing, 513, 194-203. https://doi.org/10.1016/j.neucom.2022.09.127</mixed-citation></ref><ref id="scirp.131240-ref24"><label>24</label><mixed-citation publication-type="other" xlink:type="simple">DeTone, D., Malisiewicz, T. and Rabinovich, A. (2018) SuperPoint: Self-Supervised Interest Point Detection and Description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, 18-22 June 2018, 224-236. https://doi.org/10.1109/CVPRW.2018.00060</mixed-citation></ref><ref id="scirp.131240-ref25"><label>25</label><mixed-citation publication-type="other" xlink:type="simple">Sarlin, P.E., DeTone, D., Malisiewicz, T. and Rabinovich, A. (2020) SuperGlue: Learning Feature Matching with Graph Neural Networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, 13-19 June 2020, 4937-4946. https://doi.org/10.1109/CVPR42600.2020.00499</mixed-citation></ref><ref id="scirp.131240-ref26"><label>26</label><mixed-citation publication-type="other" xlink:type="simple">Shotton, J., Glocker, B., Zach, C., et al. (2013) Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, 23-28 June 2013, 2930-2937. https://doi.org/10.1109/CVPR.2013.377</mixed-citation></ref><ref id="scirp.131240-ref27"><label>27</label><mixed-citation publication-type="other" xlink:type="simple">Valentin, J., Dai, A., Nießner, M., et al. (2016) Learning to Navigate the Energy Landscape. 2016 Fourth International Conference on 3D Vision (3DV), Stanford, 25-28 October 2016, 323-332. https://doi.org/10.1109/3DV.2016.41</mixed-citation></ref><ref id="scirp.131240-ref28"><label>28</label><mixed-citation publication-type="other" xlink:type="simple">Torii, A., Sivic, J., Pajdla, T. and Okutomi, M. 
(2013) Visual Place Recognition with Repetitive Structures. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, 23-28 June 2013, 883-890. https://doi.org/10.1109/CVPR.2013.119</mixed-citation></ref><ref id="scirp.131240-ref29"><label>29</label><mixed-citation publication-type="other" xlink:type="simple">Warburg, F., Hauberg, S., Lopez-Antequera, M., et al. (2020) Mapillary Street-Level Sequences: A Dataset for Lifelong Place Recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, 13-19 June 2020, 2623-2632. https://doi.org/10.1109/CVPR42600.2020.00270</mixed-citation></ref><ref id="scirp.131240-ref30"><label>30</label><mixed-citation publication-type="other" xlink:type="simple">Berton, G., Masone, C. and Caputo, B. (2022) Rethinking Visual Geo-Localization for Large-Scale Applications. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, 18-24 June 2022, 4868-4878. https://doi.org/10.1109/CVPR52688.2022.00483</mixed-citation></ref></ref-list></back></article>