Structure-aware Data Consolidation

IEEE Transactions on Pattern Analysis and Machine Intelligence 2018

Shihao Wu1      Peter Bertholet1      Hui Huang2*      Daniel Cohen-Or3      Minglun Gong4      Matthias Zwicker5

1University of Bern    2Shenzhen University    3Tel-Aviv University    4Memorial University of Newfoundland    5University of Maryland

Fig. 1. A challenging example with two intertwined clusters, corrupted with noise. After projecting the input data (left) onto the underlying structure using our structure-aware filtering (SAF) approach (red points in the middle), spectral clustering or DBSCAN (right) detect the proper clusters.


We present a structure-aware filtering (SAF) method that consolidates noisy data by projecting it onto underlying, lower dimensional structures. To reveal structures in noisy inputs, SAF concentrates sample points toward latent and lower dimensional data manifolds, while maintaining an even distribution of samples across these manifolds. We achieve this by adding a regularization to the weighted data averaging in conventional mean shift [1]. A theoretical analysis under a Gaussian noise model is provided, which reveals the parameter settings needed to balance between data concentration on the manifolds and even distribution across them. Empirical experiments show that SAF can significantly boost the performance of state-of-the-art clustering and dimensionality reduction approaches.

In clustering applications, data may form arbitrary, lower dimensional structures embedded in a feature space. A general strategy to address this problem is to project the data into lower dimensional subspaces where the clusters are more apparent. Often numerous subspaces are required, for example if each cluster manifests itself in a different subspace. The problem is more challenging, however, when the clusters form non-linear structures as demonstrated in Figure 1. Here each cluster has a curvy non-convex structure and the two clusters are intertwined such that they are not separated in any linear subspace. It becomes even more difficult in the presence of irrelevant features or data measurement uncertainties, which appear as noise. As shown in Figure 1, standard techniques such as spectral clustering or DBSCAN may fail to cluster such data.

Our structure-aware filtering technique (SAF) excels when the data forms low-dimensional structures that are contaminated by higher-dimensional noise. It is most effective when the low-dimensional manifolds are highly non-linear, like the curvy clusters in Figure 1, which SAF recovers accurately (red points). After processing the data with SAF, standard clustering techniques are successful, as demonstrated in Figure 1. A key advantage is that SAF does not require any local parametric representations of the underlying manifolds. Testing and evaluating our method on various benchmarks shows that SAF improves the performance of many standard clustering techniques.

Clustering is also related to (nonlinear) dimensionality reduction, as many clustering techniques strive to construct a lower dimensional embedding of the data before clustering. In comparison, our approach does not project the data into lower dimensional spaces directly. Instead, it projects noisy data onto lower dimensional manifolds, while each data point retains its high dimensional features. It can therefore be viewed as a “dimension consolidation” technique. We demonstrate that our strategy can be used as a pre-process to standard dimensionality reduction, and that our consolidation leads to more robust results for a number of standard clustering approaches as well.

In summary, the main contribution of this paper is to demonstrate how to boost the performance of common clustering and dimensionality reduction techniques by applying a novel structure-aware data consolidation approach as a pre-process. Our theoretical analysis proves that, under simplifying assumptions (isotropic Gaussian noise with known variance, planar manifolds in arbitrary dimensions), the proposed structure-aware filtering converges to the underlying data manifolds. Empirical evidence shows that our analysis provides valid guidance for the selection of algorithmic parameters in practical applications.

Fig. 2. Clustering without (SAF) and with anisotropic repulsion (A-SAF).

Fig. 3. Top row: we illustrate changes of point densities during the iterative consolidation process with different h values. Bottom row: empirical estimate and theoretical prediction of the variances of the intermediate point distributions. We show results using mean and median repulsion with blue and red crosses, respectively. While theoretical analysis with median repulsion is difficult, it empirically follows our prediction.

Fig. 4. Convergence in different dimensions and with median repulsion. We add Gaussian noise (σ = 0.15) to 2D circles (first row, single circle and two concentric ones) and 4D spheres (second row, single sphere and two concentric ones). We set the kernel size h = 0.1, and show results with different μ values. Equation (3) predicts convergence for μ < 0.31. The third row shows histograms of distances to the center of the 4D spheres. The values next to the red bell curves are their empirical means and variances, demonstrating convergence to thin manifolds as predicted for μ < 0.31.

Fig. 5. Performance of dimensionality reduction without (top) and with MD [21] (second row), MFD [26] (third row), and SAF (bottom) consolidation.

Fig. 6. Performance under different noise levels and for different datasets. We compare the clustering scores of our SAF consolidation to those of direct clustering, MD [21] and MFD [26]. SAF performs the best, especially at high noise levels, where SAF better preserves the underlying structures. See also the supplemental material for visual comparisons.

Download and Reference

Note that the DATA and CODE are free for Research and Education Use ONLY. 

Please cite our paper (bibtex below) if you use any part of our ALGORITHM, CODE, DATA or RESULTS in any publication.

We have released our source code at Code Ocean: Structure-aware Filtering


We thank the anonymous reviewers for their constructive comments. This work was supported in part by Swiss National Science Foundation (169151), National Science Foundation of China (61522213, 61379090), NSFC-ISF Joint Research Program (6171101005, 2472/17), Natural Sciences and Engineering Research Council of Canada (2017-06086), Guangdong Science and Technology Program (2015A030312015), and Natural Science Foundation of Shenzhen University (827-000196). Hui Huang is the corresponding author of this paper.

@article{SAF,
    title = {Structure-aware Data Consolidation},
    author = {Shihao Wu and Peter Bertholet and Hui Huang and Daniel Cohen-Or and Minglun Gong and Matthias Zwicker},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    volume = {40},
    number = {10},
    pages = {2529--2537},
    year = {2018}
}

Downloads (faster for people in China)

Downloads (faster for people in other places)