
Simultaneous Tracking, Tagging and Mapping for Augmented Reality

Keywords: augmented reality; computer vision; object detection; deep learning

We present a method of simultaneous tracking, tagging and mapping (STTM) for augmented reality (AR), built on deep-SORT-based object tagging and a lightweight unsupervised deep loop closure.

Contribution

1. Traditional SLAM consists only of tracking, mapping, and loop-closure detection procedures; we add a tagging stage based on a neural network.

2. A lightweight CNN-based loop closure is more robust and better suited to wearable AR devices.

Figure: the traditional SLAM framework

The figure above outlines the pipeline of the proposed STTM. Our method starts with measurement preprocessing. All values needed to bootstrap the subsequent nonlinear optimization-based visual-inertial odometry (VIO) are obtained during initialization. The VIO module tightly couples the position and pose data from the inertial measurement unit (IMU) with the re-tracked features from loop-closure detection to complete relocalization. Finally, the pose graph module performs global pose graph optimization to eliminate cumulative error and to enable map reuse.
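To make the role of pose-graph optimization concrete, here is a minimal toy sketch (our own illustration assuming NumPy/SciPy, not the actual VIO implementation): 2D positions are chained by odometry edges, a single loop-closure edge asserts that the last pose coincides with the first, and a least-squares solve spreads the accumulated drift over the trajectory.

```python
import numpy as np
from scipy.optimize import least_squares

# Toy pose graph: 5 poses on a square loop, 2D positions only.
# Odometry edges (i -> i+1) carry drifted relative measurements;
# one loop-closure edge (4 -> 0) says "we are back at the start".
odometry = [
    (0, 1, np.array([1.0, 0.0])),
    (1, 2, np.array([0.0, 1.0])),
    (2, 3, np.array([-1.05, 0.05])),   # drift creeps in
    (3, 4, np.array([0.05, -1.05])),
]
loop_closure = (4, 0, np.array([0.0, 0.0]))  # pose 4 coincides with pose 0
edges = odometry + [loop_closure]

def residuals(flat):
    poses = flat.reshape(-1, 2)
    res = [poses[0]]                   # anchor pose 0 at the origin
    for i, j, meas in edges:
        res.append((poses[j] - poses[i]) - meas)
    return np.concatenate(res)

# Initialize by dead-reckoning the odometry, so the drift is visible.
init = np.zeros((5, 2))
for i, j, meas in odometry:
    init[j] = init[i] + meas

sol = least_squares(residuals, init.ravel())
print(sol.x.reshape(-1, 2))            # drift spread over the trajectory
```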

The pipeline is divided into three major parts: tracking, tagging, and mapping. We use the lightweight deep loop closure for tracking and loop-closure detection; being lightweight and real-time makes the algorithm a good fit for wearable devices. Object tagging is implemented with deep SORT [3]. Finally, the pose graph module performs the mapping and global pose graph optimization, which eliminates cumulative error and enables map reuse.


Background: SLAM

In computational geometry and robotics, simultaneous localization and mapping (SLAM) is the computational problem of constructing or updating a map of an unknown environment while simultaneously keeping track of an agent's location within it. 

Applications of SLAM include robots, UAVs, autonomous vehicles and augmented reality. Typical examples include:

  • Automated car piloting on off-road terrain

  • Search and rescue in high-risk or difficult-to-navigate environments

  • Augmented reality (AR) applications where virtual objects are involved in real-world scenes

  • Visual surveillance systems

  • Medicine for minimally invasive surgery (MIS)

  • Construction site build monitoring and maintenance

Lightweight Unsupervised Deep Loop Closure

We use a novel unsupervised deep neural network architecture that learns a feature embedding for visual loop closure which is both reliable and compact. It is designed to map high-dimensional raw images into a low-dimensional descriptor space. Specifically, we choose an unsupervised convolutional autoencoder network designed for loop closure.
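As a rough illustration of the idea (a sketch assuming PyTorch; the layer sizes and descriptor dimension are our own placeholders, not the exact published architecture), the encoder compresses each frame into a short descriptor and the decoder reconstructs the frame, so training requires no labels:

```python
import torch
import torch.nn as nn

class LoopClosureAutoencoder(nn.Module):
    """Maps a 1x64x64 grayscale image to a compact descriptor."""
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 32x32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, dim),       # the loop-closure descriptor
        )
        self.decoder = nn.Sequential(
            nn.Linear(dim, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

model = LoopClosureAutoencoder()
image = torch.rand(1, 1, 64, 64)
descriptor, reconstruction = model(image)
loss = nn.functional.mse_loss(reconstruction, image)  # unsupervised objective
```

At query time only the encoder is needed, and each keyframe is summarized by a single short descriptor, which is what keeps the module small enough for wearable hardware.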


Why an unsupervised deep neural network?

  • Unsupervised: no labeled data required

  • Fewer parameters and more lightweight

  • Good tolerance to scene changes

Figure: blue, the real trajectory; red, the computed trajectory; green, the trajectory corrected by loop detection

Requirement: real-time and lightweight.

Popular algorithms: bag of words (carries redundant information).
Loop Closure

Our first problem is how to keep the AR application real-time and lightweight. Loop-closure detection is an important part of the SLAM framework: it is the act of correctly asserting that a device has returned to a previously visited location.

As shown in the figure on the right, loop-closure detection eliminates the cumulative error between the real trajectory and the predicted trajectory.

We found that, in the traditional SLAM framework, the popular loop-detection algorithms cannot simultaneously achieve high speed, high accuracy, and a small memory footprint. For example, the bag-of-words algorithm introduces a great deal of redundant information, wasting space.
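With compact descriptors, by contrast, loop detection reduces to a nearest-neighbor search over past keyframes; a minimal NumPy sketch (the similarity threshold is an illustrative value):

```python
import numpy as np

def detect_loop(descriptor, database, threshold=0.9):
    """Return the index of the best-matching past keyframe, or None.

    descriptor: (d,) embedding of the current frame
    database:   (n, d) embeddings of past keyframes
    """
    if len(database) == 0:
        return None
    d = descriptor / np.linalg.norm(descriptor)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ d                      # cosine similarity to every keyframe
    best = int(np.argmax(sims))
    return best if sims[best] > threshold else None
```

Storing one short descriptor per keyframe is what keeps the memory footprint small compared with a bag-of-words vocabulary.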

Figure: redundant information introduced by the bag-of-words algorithm

Tagging

Our second innovation supplements and improves existing SLAM by adding deep-SORT-based object tagging. We want to add semantic information on top of the map, so we add object tagging to the framework. There are many object detection and tracking algorithms, such as YOLOv4 and optical flow, but they are either slow, take up a lot of space, or handle occlusion poorly.

Take YOLOv4 as an example. Although it greatly optimizes the YOLO algorithm in data processing, network training, loss function, and other aspects, we find it hard to track small objects such as birds.

Optical flow is another popular tracking approach: it requires no prior knowledge of the scene, and the flow field carries rich information such as velocity. However, it rests on two assumptions that are difficult to satisfy: constant brightness, and temporal continuity (small motion between frames).
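For reference, the brightness-constancy assumption and its linearization (the standard optical-flow constraint) can be written as:

```latex
% Brightness constancy: a pixel keeps its intensity as it moves.
I(x+u,\; y+v,\; t+1) \approx I(x,\; y,\; t)
% A first-order Taylor expansion yields the optical-flow constraint,
% where I_x, I_y, I_t are partial derivatives of the image intensity:
I_x\, u + I_y\, v + I_t = 0
```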

Deep SORT

That is why we use Deep SORT. It is based on Kalman filtering and frame-by-frame data association via the Hungarian method, and it runs smoothly at high frame rates. Meanwhile, with a convolutional neural network (CNN) providing appearance features, it improves robustness against misses and occlusions, and it can recognize pre-trained object classes, including table, chair, window, wall, ceiling, floor, etc. Compared with other object tracking algorithms, it exhibits superior accuracy and reliability: on the MOT16 benchmark it has relatively few identity switches (781) and a low mostly-lost rate (8.2%).
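The association step at the heart of this approach can be sketched as follows (a simplified illustration assuming SciPy, with IoU as the only cost; the real Deep SORT also gates matches by Mahalanobis distance from the Kalman prediction and by the CNN appearance features):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_boxes, det_boxes, max_cost=0.7):
    """Match Kalman-predicted track boxes to detections (Hungarian method).

    Boxes are [x1, y1, x2, y2]; max_cost is an illustrative gate.
    """
    def iou(a, b):
        x1, y1 = np.maximum(a[:2], b[:2])
        x2, y2 = np.minimum(a[2:], b[2:])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    # Cost = 1 - IoU for every (track, detection) pair.
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)
    # Keep only sufficiently overlapping pairs.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
```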

Figure: accuracy and reliability are high according to the MOT16 benchmark

Figure: development of object detection technology

Tracking

In the initialization stage of our experiment, a virtual box generated from the extracted feature-point information is inserted into the coordinate system, as shown in the left image above. Then an estimated trajectory with a closed loop is recorded for testing the STTM, as shown in the right image above; the camera's x/y/z coordinates, total distance of travel, and number of features can be tracked in real time.

A virtual box is inserted into the coordinate system during the initialization stage.

Indoor experimental result for the proposed STTM. The total trajectory length is 27.23 m. A closed loop is recorded for testing the STTM.

Mapping

The LiDAR scanner uses time-of-flight (ToF) techniques to measure the distance between the object and the camera, and is used to generate a 3D point cloud with depth information. The points are organized in a suitable data structure. Using the classes provided by ARKit, we can obtain each point's ID, spatial information, etc. We can also set a confidence threshold to remove some of the noisy and incomplete data.
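The actual API here is ARKit's (in Swift); purely to illustrate the confidence-filtering logic, here is a language-agnostic sketch in Python with made-up field names and an illustrative threshold:

```python
import numpy as np

# Hypothetical point records: (id, x, y, z, confidence). The layout loosely
# mirrors what a LiDAR point cloud exposes; names and values are our own.
points = np.array([
    [0, 0.10, 0.02, 1.50, 0.95],
    [1, 0.12, 0.05, 1.48, 0.30],    # low-confidence (noisy) point
    [2, 0.08, 0.01, 1.52, 0.88],
])

CONFIDENCE_THRESHOLD = 0.5                             # illustrative value
cloud = points[points[:, 4] >= CONFIDENCE_THRESHOLD]   # drop noisy points
xyz = cloud[:, 1:4]                 # spatial coordinates kept for meshing
```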

After that, we use a polygonal meshing algorithm to generate meshes from the point cloud. We choose the Delaunay triangulation algorithm, which has two useful properties (a minimal example follows the list):

1. It runs in O(n log n) time;

2. It maximizes the minimum angle of the triangles, avoiding thin, sliver-like faces.
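A minimal triangulation example with SciPy (random 2D points stand in for the projected point cloud):

```python
import numpy as np
from scipy.spatial import Delaunay

# Project the filtered point cloud onto a 2D plane and triangulate it.
points_2d = np.random.rand(30, 2)   # stand-in for projected cloud points
tri = Delaunay(points_2d)           # Qhull; ~O(n log n) expected in 2D
mesh_faces = tri.simplices          # (m, 3) vertex indices per triangle
print(mesh_faces.shape)
```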

Tagging + Mapping

Finally, we add the tagging function to our mapping process. When we tap an object on the screen, STTM can tell what it is. We use pre-trained object classes available in ARKit, such as ceiling, door, seat, and table, which frequently appear in indoor environments.

Conclusion

1. As opposed to conventional SLAM systems, the proposed STTM is capable of creating a mesh map as well as tagging the recognized objects.

2. The lightweight CNN-based loop closure is much faster and more accurate than the bag-of-words algorithm.

Outdoor experiment on loop closure

This is our outdoor experiment. The total trajectory length is 1.2 km. When the device returned to a previously visited location, loop-closure detection was performed and corrected the current trajectory.
