Structure-aware Indoor Scene Reconstruction via Two Levels of Abstraction

ISPRS Journal of Photogrammetry and Remote Sensing 2021

Hao Fang1    Cihui Pan2    Hui Huang1*

1Shenzhen University    2Beike

Fig. 1. Goal of our approach. Our framework takes a raw mesh as input (a). The indoor scene is reconstructed as a watertight, compact structure mesh (b) and a detailed scene mesh (c), which preserve different levels of abstraction. Note that a texture map can be attached to the scene mesh for visualization using the method of Waechter et al., 2014 (d).


In this paper, we propose a novel approach that reconstructs an indoor scene in a structure-aware manner and produces two meshes with different levels of abstraction. To be precise, we start from the raw triangular mesh of an indoor scene and decompose it into two parts: structure and non-structure objects. On the one hand, structure objects are defined as the significant permanent parts of the indoor environment, such as floors, ceilings and walls. In the proposed algorithm, structure objects are abstracted by planar primitives and assembled into a polygonal structure mesh. This step produces a compact, structure-aware, watertight model that decreases the complexity of the original mesh by three orders of magnitude. On the other hand, non-structure objects are movable objects in the indoor environment, such as furniture and interior decoration. Meshes of these objects are repaired and simplified according to their relationship to the structure primitives. Finally, the union of all the non-structure meshes and the structure mesh forms the scene mesh. Note that the structure mesh and the scene mesh preserve different levels of abstraction and can be used for different applications according to user preference. Our experiments on both LIDAR and RGBD data, scanned from simple to large-scale indoor scenes, indicate that the proposed framework generates structure-aware results while being robust and scalable. It is also compared qualitatively and quantitatively against popular mesh approximation, floorplan generation and piecewise-planar surface reconstruction methods to demonstrate its performance.

Fig. 2. Overview of our approach. Our algorithm starts from a dense triangular raw mesh generated from the point cloud of an indoor scene (a). The whole scene is then abstracted by 225 planar primitives that represent all parts of the input mesh (b). Among them, the 109 planar primitives that best approximate the structure objects of the indoor scene are selected (c), and 27 isolated objects are extracted from the non-structure parts (e). After that, the 109 structure planar primitives are assembled to form a structure mesh of all structure objects (d). Finally, the scene mesh is the union of the structure mesh and all 27 non-structure objects, which are repaired and simplified (f). Note that the back faces of the meshes in (a), (d) and (f) are not shown, and ceiling planes are removed in (c) to better visualize the interior. Planar primitives in (b) and (c) are approximated by the alpha-shape of their corresponding triangular facets; each primitive is shown in a random color.

Fig. 3. Scene decomposition. First, the input mesh (a) is over-segmented into a large number of planar primitives (b). After that, pairs of adjacent quasi-coplanar primitives are iteratively merged into larger ones until a meaningful plane configuration is attained (c). Next, ceiling and floor planes (d), wall planes (e), and small structure planes such as the yellow ones in (f) are detected in a hierarchical manner and together compose the structure planes. Finally, isolated non-structure objects are extracted by detecting connected triangular facets in the original mesh.
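The merging step described in Fig. 3 can be illustrated with a minimal sketch. This is not the authors' implementation: the plane representation (unit normal `n` and offset `d` with `n . x = d`), the function names, and the tolerances `max_angle_deg` and `max_offset` are all assumptions made for illustration. The sketch also compares the original plane parameters in a single pass, whereas an iterative scheme would typically refit each merged plane before testing it again.

```python
import math

def quasi_coplanar(p, q, max_angle_deg=10.0, max_offset=0.05):
    """Return True if two oriented planes are nearly coplanar.

    Each plane is (unit_normal, offset) with unit_normal . x = offset.
    The tolerance values are illustrative assumptions, not paper values.
    """
    (n1, d1), (n2, d2) = p, q
    dot = sum(a * b for a, b in zip(n1, n2))
    # Clamp to the valid acos domain to guard against rounding errors.
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot))))
    return angle <= max_angle_deg and abs(d1 - d2) <= max_offset

def merge_primitives(planes, adjacency, **tolerances):
    """Group adjacent quasi-coplanar primitives with a union-find structure.

    `adjacency` is a list of index pairs (i, j) of neighbouring primitives.
    Returns a sorted list of index groups, one group per merged primitive.
    """
    parent = list(range(len(planes)))

    def find(i):
        # Path halving keeps the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in adjacency:
        if quasi_coplanar(planes[i], planes[j], **tolerances):
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups = {}
    for i in range(len(planes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Two nearly identical horizontal planes and one vertical plane.
planes = [((0.0, 0.0, 1.0), 0.0), ((0.0, 0.0, 1.0), 0.02), ((1.0, 0.0, 0.0), 1.0)]
adjacency = [(0, 1), (1, 2)]
merged = merge_primitives(planes, adjacency)
```

Normals are treated as oriented here, so two surfaces facing opposite directions (for example the two sides of a wall) are never merged. Because the grouping is transitive, a long chain of slightly tilted neighbours could end up in one group; refitting after each merge is one way to limit that drift.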

Fig. 9. Qualitative comparisons with shape approximation methods on RGBD (left) and LIDAR (right) scenes. With a similar number of facets (about 1200), the simplified meshes returned by QEM, VSA and Structure preserve most of the large planar structures of the indoor environment. However, these simplified models shrink at small structures because of the noise retained in the input raw meshes (see the cropped region). In contrast, our method produces more compact and structure-aware models in which most of these small but important structures are successfully reconstructed.

Fig. 10. Quantitative comparisons with shape approximation methods on complete (left) and partial (right) scenes. For the complete scene, Structure produces the model that is closest to the input raw mesh (see the colored points). In the case of large missing data, however, our method is robust enough to output a watertight mesh with the best geometric accuracy, whereas none of the three shape approximation methods is able to repair the holes. In addition, our method takes dozens of seconds to process a whole scene, which is faster than Structure by one order of magnitude.

Fig. 12. Qualitative comparisons with the floorplan generation method FloorSP on LIDAR (rows 1–2) and RGBD (rows 3–5) data. For noisy and strongly non-Manhattan scenes, FloorSP generates non-manifold (row 3) and self-intersecting (rows 4 and 5) models. In addition, some walls are missed (row 1) or incorrectly aligned (rows 2, 3 and 5). In contrast, our method is more robust and recovers most of the wall structures, even for rooms with curved walls (rows 2 and 4).

Fig. 13. Quantitative comparisons against FloorSP on RGBD (top) and LIDAR (middle and bottom) data. Our method produces 3D models that are closer to the input wall points (see the colored points) than the 2.5D models assembled from the FloorSP floorplans by giving the walls a virtual thickness (10 cm). In particular, our method exhibits a lower error because it recovers small structural details contained in the original mesh, such as two close walls.
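To make the notion of "closer to the input wall points" concrete, the following sketch computes a mean point-to-model distance. It rests on several assumptions that are not taken from the paper: the model is simplified to a set of infinite planes rather than a bounded mesh, the error is the mean unsigned distance to the nearest plane, and the helper names `point_plane_distance` and `mean_error` are hypothetical.

```python
import math

def point_plane_distance(point, plane):
    """Unsigned distance from a 3D point to a plane (unit_normal, offset)."""
    normal, offset = plane
    return abs(sum(a * b for a, b in zip(normal, point)) - offset)

def mean_error(points, planes):
    """Mean distance from each point to its nearest plane.

    Assumption: infinite planes stand in for the mesh facets, so this
    underestimates the true point-to-mesh distance near facet borders.
    """
    total = sum(min(point_plane_distance(p, pl) for pl in planes) for p in points)
    return total / len(points)

# Points sampled near two close, parallel walls at x = 0.0 and x = 0.1.
points = [(0.02, 0.0, 0.0), (0.09, 1.0, 2.0)]
two_walls = [((1.0, 0.0, 0.0), 0.0), ((1.0, 0.0, 0.0), 0.1)]
one_wall = [((1.0, 0.0, 0.0), 0.0)]

error_two_walls = mean_error(points, two_walls)
error_one_wall = mean_error(points, one_wall)
```

In this toy setup, a model that keeps both close walls yields a smaller mean error than one that collapses them into a single wall, which mirrors the effect the caption attributes to recovering small structural details.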

Fig. 16. Ablation study. When the scene decomposition step is turned off (top row), all the detected planes are considered structure planes and are used for structure-aware reconstruction. As a result, the structure mesh contains both structure and non-structure parts. When the local primitive slicing strategy is turned off (middle row), all the structure primitives are sliced everywhere inside the bounding box. This increases the computational time and the number of polyhedral cells exponentially, and leads to a non-compact model with many protrusions. In contrast, with both ingredients turned on (bottom row), a compact and structure-aware model is reconstructed within an acceptable time. In addition, our scene mesh achieves the best geometric accuracy thanks to the separation of non-structure objects from structure parts.

Fig. 17. Performance on large-scale scenes. Given the input raw mesh (top left), our pipeline generates two models with different levels of abstraction: a compact structure mesh ℳs (top right) and a detailed scene mesh ℳt (bottom left). The textured scene mesh is also shown (bottom right).


This work was supported in part by NSFC (U2001206), GD Science and Technology Program (2020A0505100064, 2015A030312015), GD Talent Program (2019JC05X328), DEGP Key Project (2018KZDXM058), Shenzhen Science and Technology Program (RCJC20200714114435012) and the Beike fund. The authors would like to thank Beike for providing various types of indoor scenes, Jiacheng Chen for their code and datasets, Liangliang Nan for the comparison tools, as well as Jing Zhao and Mofang Cheng for technical advice.


title={Structure-aware Indoor Scene Reconstruction via Two Levels of Abstraction},
author={Hao Fang and Cihui Pan and Hui Huang},
journal={ISPRS Journal of Photogrammetry and Remote Sensing},
