Semantic Object Reconstruction via Casual Handheld Scanning

ACM Transactions on Graphics (Proceedings of SIGGRAPH ASIA)

Ruizhen Hu1         Cheng Wen1         Oliver Van Kaick2         Luanmin Chen1          Di Lin1       Daniel Cohen-Or1,3         Hui Huang1∗

1Shenzhen University     2Carleton University   3Tel Aviv University

Fig. 1. A semantic reconstruction of an object obtained with our method (top), using a semantic labeling of frames (one example shown in the bottom-right) computed for RGB and depth input images (bottom-left and middle).


We introduce a learning-based method to reconstruct objects acquired in a casual handheld scanning setting with a depth camera. Our method is based on two core components. The first is a deep network that provides a semantic segmentation and labeling of the frames of an input RGBD sequence. The second is an alignment and reconstruction method that employs the semantic labeling to reconstruct the acquired object from the frames. We demonstrate that the use of a semantic labeling improves the reconstructions of the objects when compared to methods that use only the depth information of the frames.

Moreover, since training a deep network requires a large amount of labeled data, a key contribution of our work is an active self-learning framework to simplify the creation of the training data. Specifically, we iteratively predict the labeling of frames with the neural network, reconstruct the object from the labeled frames, and evaluate the confidence of the labeling, to incrementally train the neural network while requiring only a small amount of user-provided annotations. We show that this method enables the creation of data for training a neural network with high accuracy, while requiring little manual effort.
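The cycle of prediction, reconstruction, and confidence estimation described above can be illustrated with a minimal Python sketch. It is a toy illustration under stated assumptions, not the implementation from the paper: the names `active_self_learning`, `annotate`, `train`, `predict`, and `confidence` are hypothetical, the reconstruction and confidence steps are collapsed into a single `confidence` callable, and the frame selected for user annotation is simply the one with the lowest confidence.

```python
from typing import Callable, Dict, List, Tuple


def active_self_learning(
    frames: List[int],
    annotate: Callable[[int], int],
    train: Callable[[Dict[int, int]], object],
    predict: Callable[[object, int], int],
    confidence: Callable[[object, int, int], float],
    threshold: float = 0.8,
) -> Tuple[Dict[int, int], int]:
    """Label all frames using few user annotations (hypothetical sketch)."""
    labeled: Dict[int, int] = {}
    n_queries = 0
    model = None
    while len(labeled) < len(frames):
        unlabeled = [f for f in frames if f not in labeled]
        # Active step: ask the user about the least confident frame
        # (assumed selection rule; the first frame when no model exists yet).
        if model is None:
            query = unlabeled[0]
        else:
            query = min(
                unlabeled,
                key=lambda f: confidence(model, f, predict(model, f)),
            )
        labeled[query] = annotate(query)
        n_queries += 1
        model = train(labeled)
        # Self-learning step: predict, score, accept confident labels,
        # retrain, and repeat until no further frame is accepted.
        while True:
            accepted = {}
            for f in frames:
                if f in labeled:
                    continue
                label = predict(model, f)
                if confidence(model, f, label) >= threshold:
                    accepted[f] = label
            if not accepted:
                break
            labeled.update(accepted)
            model = train(labeled)
    return labeled, n_queries


# Toy stand-ins: a "frame" is an integer feature and the "network" is a
# nearest-neighbor lookup over the frames labeled so far.
def toy_train(labeled: Dict[int, int]) -> Dict[int, int]:
    return dict(labeled)


def toy_predict(model: Dict[int, int], frame: int) -> int:
    nearest = min(model, key=lambda g: abs(g - frame))
    return model[nearest]


def toy_confidence(model: Dict[int, int], frame: int, label: int) -> float:
    # Confidence decays with distance to the nearest labeled frame.
    return 1.0 / (1.0 + min(abs(g - frame) for g in model) / 10.0)


frames = [0, 1, 2, 3, 4, 50, 51, 52, 53, 54]
labels, n_queries = active_self_learning(
    frames,
    annotate=lambda f: 0 if f < 10 else 1,  # simulated user
    train=toy_train,
    predict=toy_predict,
    confidence=toy_confidence,
)
```

In this toy setting the confident labels propagate within each cluster of frames, so the simulated user is queried only when the remaining frames are too far from anything labeled so far.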

Fig. 2. Overview of our active self-learning method for object reconstruction. We learn how to segment and label a sequence of RGBD frames (a), to improve the quality of object reconstruction (e). Specifically, we employ an active self-learning approach to create the necessary data for the learning while involving minimal user effort. The active learning asks for user input on strategically-selected frames (green arrow) and then invokes a self-learning component on the annotated frames. The self-learning is an automatic learning approach consisting of cycles of prediction, reconstruction, and confidence estimation for creating additional training data from the remaining frames in the sequence (black + blue arrows). Please refer to Section 3 for details on these steps.

Fig. 7. Selected segmentations and labelings of frames obtained with our deep network. Each example shows the RGB and depth inputs, and the prediction. Note the semantic correctness and low noise level of the results.

Fig. 9. Comparison between the label prediction provided by the neural network and the labels obtained after fusion and back-projection (in the red boxes). Note the improvement in the quality of segments after fusion.
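One way to see why fusing labels across frames and projecting them back can clean up per-frame predictions is the sketch below. It is an assumption-laden illustration, not the fusion scheme used in the paper: it substitutes a simple majority vote, assumes each frame's pixels are already mapped to shared surface-point ids, and the function name `fuse_and_backproject` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def fuse_and_backproject(
    frame_labels: List[Dict[int, int]],
) -> List[Dict[int, int]]:
    """frame_labels[i] maps a surface-point id to the label predicted in frame i."""
    # Fusion: collect the votes of all frames for every surface point.
    votes: Dict[int, Counter] = {}
    for labels in frame_labels:
        for point, label in labels.items():
            votes.setdefault(point, Counter())[label] += 1
    # Majority vote per surface point (assumed fusion rule).
    fused = {point: counts.most_common(1)[0][0] for point, counts in votes.items()}
    # Back-projection: every frame receives the fused label of the points it sees.
    return [{point: fused[point] for point in labels} for labels in frame_labels]


# Three frames observe points 7 and 8; the third frame mislabels point 7.
predictions = [{7: 1, 8: 3}, {7: 1, 8: 3}, {7: 2, 8: 3}]
corrected = fuse_and_backproject(predictions)
```

Here the isolated error in the third frame is outvoted by the other two frames, so the back-projected labels agree across all frames.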

Fig. 11. Reconstruction results obtained with the method of Nießner et al. [2013], which does not consider semantic information (left of each example), compared to the results of our method that incorporates semantic information (right, in the red boxes). Note how our reconstructions are smoother, and have less missing data and fewer misalignments.

We thank the anonymous reviewers for their valuable comments. This work was supported in part by NSFC (61602311, 61522213, 61761146002, 61702338), 973 Program (2015CB352501), GD Science and Technology Program (2015A030312015), Shenzhen Innovation Program (JCYJ20170302153208613, KQJSCX20170727101233642), ISF-NSFC Joint Research (2472/17), and NSERC Canada (2015-05407).


title = {Semantic Object Reconstruction via Casual Handheld Scanning},
author = {Ruizhen Hu and Cheng Wen and Oliver Van Kaick and Luanmin Chen and Di Lin and Daniel Cohen-Or and Hui Huang},
journal = {ACM Transactions on Graphics (Proc. SIGGRAPH ASIA)},
volume = {37},
number = {6},
pages = {219:1--219:12},  
year = {2018},