Fig. 1. A challenging example with two intertwined clusters, corrupted with noise. After projecting the input data (left) onto the underlying structure using our structure-aware filtering (SAF) approach (red points in the middle), spectral clustering or DBSCAN (right) detects the correct clusters.
We present a structure-aware filtering (SAF) method that consolidates noisy data by projecting it onto underlying, lower-dimensional structures. To reveal structures in noisy inputs, SAF concentrates sample points toward latent, lower-dimensional data manifolds while maintaining an even distribution of samples across these manifolds. We achieve this by adding a regularization term to the weighted data averaging of conventional mean shift. A theoretical analysis under a Gaussian noise model is provided, which reveals the parameter settings needed to balance concentration onto the manifolds against even distribution across them. Empirical experiments show that SAF can significantly boost the performance of state-of-the-art clustering and dimensionality reduction approaches.
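The idea of regularized mean shift can be sketched in a few lines of NumPy. The sketch below is illustrative, not the authors' implementation: it combines a Gaussian-weighted attraction toward the original noisy data with a mean-repulsion term between the filtered points, balanced by a hypothetical weight `mu` (the kernel size `h` and the specific repulsion form are assumptions for this toy example).

```python
import numpy as np

def saf_step(x, data, h, mu):
    """One illustrative structure-aware filtering step:
    Gaussian-weighted averaging toward the input data (attraction),
    plus a repulsion term between filtered points that keeps them
    evenly distributed along the manifold; mu balances the two forces."""
    # Attraction: each filtered point moves to a kernel-weighted
    # average of the original (noisy) data points.
    d2 = ((x[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / h**2)
    attract = (w[:, :, None] * data[None, :, :]).sum(1) / w.sum(1, keepdims=True)

    # Repulsion: push each filtered point away from its filtered
    # neighbors (mean repulsion), preventing clumping on the manifold.
    diff = x[:, None, :] - x[None, :, :]
    rw = np.exp(-(diff ** 2).sum(-1) / h**2)
    np.fill_diagonal(rw, 0.0)
    repel = (rw[:, :, None] * diff).sum(1)
    repel /= np.maximum(rw.sum(1, keepdims=True), 1e-12)

    return attract + mu * repel

# Toy input: a unit circle corrupted with Gaussian noise.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0 * np.pi, 400)
circle = np.c_[np.cos(t), np.sin(t)] + 0.15 * rng.normal(size=(400, 2))

x = circle.copy()
for _ in range(30):
    x = saf_step(x, circle, h=0.3, mu=0.2)

# The filtered points should deviate less from the unit circle
# than the noisy input does.
print(np.abs(np.linalg.norm(x, axis=1) - 1).mean())
print(np.abs(np.linalg.norm(circle, axis=1) - 1).mean())
```

After such a consolidation pass, an off-the-shelf method (e.g. DBSCAN on `x`) operates on points concentrated near the latent structure rather than on the raw noisy samples.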
In clustering applications, data may form arbitrary, lower-dimensional structures embedded in a feature space. A general strategy to address this problem is to project the data into lower-dimensional subspaces where the clusters are more apparent. Often numerous subspaces are required, for example when each cluster manifests itself in a different subspace. The problem is more challenging, however, when the clusters form non-linear structures, as demonstrated in Figure 1. Here each cluster has a curved, non-convex structure, and the two clusters are intertwined such that they are not separable in any linear subspace. The task becomes even more difficult in the presence of irrelevant features or data measurement uncertainties, which appear as noise. As shown in Figure 1, standard techniques such as spectral clustering or DBSCAN may fail to cluster such data.
In summary, the main contribution of this paper is to demonstrate how to boost the performance of common clustering and dimensionality reduction techniques by applying a novel structure-aware data consolidation approach as a pre-process. Our theoretical analysis proves that, under simplifying assumptions (isotropic Gaussian noise with known variance, planar manifolds in arbitrary dimensions), the proposed structure-aware filtering converges to the underlying data manifolds. Empirical evidence shows that our analysis provides valid guidance for the selection of algorithmic parameters in practical applications.
Fig. 2. Clustering without (SAF) and with anisotropic repulsion (A-SAF).
Fig. 3. Top row: we illustrate changes of point densities during the iterative consolidation process with different h values. Bottom row: empirical estimate and theoretical prediction of the variances of the intermediate point distributions. We show results using mean and median repulsion with blue and red crosses, respectively. While a theoretical analysis with median repulsion is difficult, it empirically follows our prediction.
Fig. 4. Convergence in different dimensions and with median repulsion. We add Gaussian noise (σ = 0.15) to 2D circles (first row, a single circle and two concentric ones) and 4D spheres (second row, a single sphere and two concentric ones). We set the kernel size h = 0.1 and show results for different parameter values. Equation (3) predicts convergence for values below 0.31. The third row shows histograms of distances to the center of the 4D spheres. The values next to the red bell curves are their empirical means and variances, demonstrating convergence to thin manifolds as predicted for values below 0.31.
Fig. 5. Performance of dimensionality reduction without consolidation (top) and with MD (second row), MFD (third row), and SAF (bottom) consolidation.
Fig. 6. Performance under different noise levels and for different datasets. We compare the clustering scores of our SAF consolidation to those of direct clustering, MD, and MFD. SAF performs best, especially at high noise levels, where it preserves the structures better. See also the supplemental material for visual comparisons.
Download and Reference
We have released our source code at Code Ocean:
To reference our algorithm, code, or data in a publication, please include the BibTeX below and a link to this website.