Cascaded Feature Network for Semantic Segmentation of RGB-D Images

 ICCV 2017

Di Lin1    Guangyong Chen2     Daniel Cohen-Or 1,3     Pheng-Ann Heng 2    Hui Huang 1*
1Shenzhen University        2The Chinese University of Hong Kong        3Tel Aviv University

Figure 1: Depth correlates with scene-resolution: the near field (highlighted by the blue rectangle) has high scene-resolution, while the far field (highlighted by the red rectangle) has low scene-resolution.


Fully convolutional networks (FCNs) have been successfully applied to semantic segmentation of scenes represented by RGB images. Images augmented with a depth channel provide more geometric information about the scene. The question is how best to exploit this additional information to improve segmentation performance. In this paper, we present a neural network with multiple branches for segmenting RGB-D images. Our approach uses the available depth to split the image into layers with common visual characteristics of objects/scenes, or common “scene-resolution”. We introduce the context-aware receptive field (CaRF), which provides better control over the relevant contextual information of the learned features. Equipped with CaRF, each branch of the network semantically segments regions of similar scene-resolution, leading to a more focused domain that is easier to learn. Furthermore, our network is cascaded, with features from one branch augmenting the features of the adjacent branch. We show that such cascading of features enriches the contextual information of each branch and enhances the overall performance. The accuracy that our network achieves outperforms the state-of-the-art methods on two public datasets.

Figure 2: Overview of our cascaded feature network (CFN). Given the color image, we use a CNN to compute the convolutional feature map. The discrete depth image is layered, where each layer represents a scene-resolution and is used to match the image regions to the corresponding network branches, which share the same convolutional feature map. Each branch has a context-aware receptive field (CaRF), which produces a contextual representation to combine with the feature from the adjacent branch. The predictions of all branches are combined to produce the final segmentation result.
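The routing described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the depth boundaries, and the rule of taking each pixel's label from the branch that owns its depth layer are all assumptions made for illustration.

```python
from bisect import bisect_right

def depth_to_layers(depth, boundaries):
    """Assign each pixel a layer index from its depth value.

    `boundaries` is a sorted list of hypothetical depth thresholds;
    K thresholds yield K + 1 layers (layer 0 is the nearest).
    """
    return [[bisect_right(boundaries, d) for d in row] for row in depth]

def combine_branch_predictions(layer_map, branch_preds):
    """Build the final label map (assumed combination rule).

    For every pixel, take the label predicted by the branch whose
    index equals the pixel's depth layer.
    """
    return [
        [branch_preds[layer][r][c] for c, layer in enumerate(row)]
        for r, row in enumerate(layer_map)
    ]

# Toy 2x2 depth image (arbitrary units) and three constant branch outputs.
depth = [[0.5, 1.5], [2.5, 3.5]]
layers = depth_to_layers(depth, boundaries=[1.0, 3.0])
preds = [[[label] * 2 for _ in range(2)] for label in (10, 20, 30)]
final = combine_branch_predictions(layers, preds)
```

Here each branch only contributes labels for the pixels in its own layer, which matches the idea that each branch handles one scene-resolution.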

Figure 3: The two-level Context-aware Receptive Field (CaRF): (a) the image is partitioned into super-pixels of different sizes; (b) at each node of the coarse grid we aggregate the features that reside in the same super-pixel; (c) the content of adjacent super-pixels is aggregated; (d) the aggregated content in a feature map represents a CaRF. The two-level CaRF is repeatedly applied to images partitioned into super-pixels of diverse sizes. Note that the feature map has a lower resolution than the image due to down-sampling in the network.
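Steps (b) to (d) of the Figure 3 caption can be sketched on a single-channel feature grid. This is a simplified illustration under stated assumptions: aggregation is a plain sum, adjacency is 4-connectivity on the grid, a super-pixel's own content is included at the second level, and `two_level_aggregate` is a hypothetical name rather than anything from the paper.

```python
from collections import defaultdict

def two_level_aggregate(features, labels):
    """Two-level aggregation over super-pixels (illustrative sketch).

    features: 2D list of floats, one value per grid node.
    labels:   2D list of super-pixel ids at the same grid resolution.
    """
    rows, cols = len(labels), len(labels[0])

    # Level 1: sum the features of all nodes in the same super-pixel.
    pooled = defaultdict(float)
    for r in range(rows):
        for c in range(cols):
            pooled[labels[r][c]] += features[r][c]

    # Super-pixels are adjacent if they touch under 4-connectivity.
    neighbors = defaultdict(set)
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and labels[rr][cc] != labels[r][c]:
                    neighbors[labels[r][c]].add(labels[rr][cc])
                    neighbors[labels[rr][cc]].add(labels[r][c])

    # Level 2: add the pooled content of adjacent super-pixels.
    context = {
        s: pooled[s] + sum(pooled[n] for n in neighbors[s])
        for s in list(pooled)
    }

    # Write the aggregated content back to every grid node.
    return [[context[labels[r][c]] for c in range(cols)] for r in range(rows)]

# One row of four nodes split into three super-pixels (ids 0, 1, 2).
carf = two_level_aggregate([[1.0, 2.0, 3.0, 4.0]], [[0, 0, 1, 2]])
```

Because the receptive field follows super-pixel boundaries instead of a fixed square window, larger super-pixels gather context from a wider area of the image.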

Figure 4: The network can have separate branches (a), combined branches (b) or cascaded branches (c). For clarity, we illustrate it with two branches only. Each network can be extended to have more branches.
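The difference between separate branches (a) and cascaded branches (c) can be sketched with toy feature vectors. This is only an illustration: fusing by element-wise addition, the direction of the cascade, and the function names are assumptions, and the combined variant (b) is left out because the caption does not say how its features are merged.

```python
def separate_branches(branch_feats):
    """Variant (a): every branch keeps only its own feature vector."""
    return [list(f) for f in branch_feats]

def cascaded_branches(branch_feats):
    """Variant (c): each branch is augmented by the adjacent branch.

    Assumption: the cascade runs from the first branch to the last,
    and fusion is element-wise addition.
    """
    out = []
    carry = [0.0] * len(branch_feats[0])
    for feat in branch_feats:
        carry = [a + b for a, b in zip(feat, carry)]
        out.append(carry)
    return out

# Three branches, each with a 2-dimensional toy feature.
feats = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
sep = separate_branches(feats)
cas = cascaded_branches(feats)
```

In the cascaded variant, later branches see context accumulated from earlier ones, while the first branch is unchanged.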

Figure 5: Sample comparison between the baseline model [24] and our CFN. The first two rows and the last row show scenes from the NYUDv2 [33] and SUN-RGBD [35] datasets, respectively.


We thank the reviewers for their constructive comments. This work was supported in part by National 973 Program (2015CB352501, 2015CB351706), NSFC (61522213, 61379090, 61232011, 61233012, U1613219), Guangdong Science and Technology Program (2014TX01X033, 2015A030312015, 2016A050503036), Shenzhen Innovation Program (JCYJ20151015151249564) and Natural Science Foundation of SZU (827-000196).


  title={Cascaded Feature Network for Semantic Segmentation of RGB-D Images},
  author={Di Lin and Guangyong Chen and Daniel Cohen-Or and Pheng-Ann Heng and Hui Huang},
